[Boston.pm] XML::Twig does HTML RE: HTML parsing (msg#00015, lang.perl.perl-mongers.boston)
Tom,

> It seems like what is missing is a module that provides a
> regular-expression style language for matching against tags. It would
> make screen scraping tasks almost trivial. Anyone know of a module like
> this?

> What's your favorite HTML parsing module?

XML::Twig is the grep for XML (and comes bundled with xml_grep(1)). With its new parse_html() option, XML::Twig will use HTML::TreeBuilder for you to convert HTML to its internal representation of XML, protecting you from HTML::TreeBuilder's interface. You can make regex-like queries on the HTML document in its XPath-like query language and let its pattern engine walk the tree looking for twigs that match your query.

http://search.cpan.org/search?query=XML-Twig&mode=dist

which references:

<< The XML::Twig page is at http://www.xmltwig.com/xmltwig/ It includes the development version of the module, a slightly better version of the documentation, examples, a tutorial and a: Processing XML efficiently with Perl and XML::Twig: http://www.xmltwig.com/xmltwig/tutorial/index.html >>

which has a useful summary at http://www.xmltwig.com/xmltwig/quick_ref.html [but read the tutorial first].

It can work in either a stream/callback-handler mode or a parse-then-search mode, can work as an XML-aware sed (with an in-place option!), can preserve or change encoding, etc. A very Perl-friendly way to deal with XML.

CAVEAT: I haven't tried this new HTML-happy mode yet; I've wished for it in the past, when XML::Twig rejected HTML that wasn't highly XHTML well-formed. Now with this new option, it probably accepts anything H:TB (HTML::TreeBuilder) does and pretends it read a conformant XHTML document. I've got to try this too.

-- 
Bill / n1vux
Not speaking for the firm
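For anyone who wants to try it, here is a minimal sketch of the two modes mentioned in the message: a callback handler that fires during the parse, and an XPath-like search on the finished tree. The sample HTML and the variable names are made up for illustration, and it assumes XML::Twig is recent enough to have parse_html() and that HTML::TreeBuilder is installed. Treat it as a starting point, not a tested recipe for real-world pages.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;    # parse_html() additionally requires HTML::TreeBuilder

# Hypothetical, deliberately sloppy HTML: the <p> and <li> tags are never closed.
my $html = <<'HTML';
<html><body>
<p>Links:
<ul>
<li><a href="http://www.xmltwig.com/xmltwig/">XML::Twig page</a>
<li><a name="anchor-only">not a link</a>
<li><a href="http://search.cpan.org/">CPAN search</a>
</ul>
</body></html>
HTML

# Mode 1: stream/callback style. The handler runs for every <a> element
# that carries an href attribute, as soon as that element has been parsed.
my @links;
my $twig = XML::Twig->new(
    twig_handlers => {
        'a[@href]' => sub {
            my ($t, $elt) = @_;
            push @links, [ $elt->att('href'), $elt->text ];
        },
    },
);

# Convert the HTML to XML behind the scenes, then parse it as usual.
$twig->parse_html($html);

# Mode 2: parse-then-search. Nothing was purged, so the whole tree is
# still in memory and can be queried with the XPath-like syntax.
my @found = $twig->get_xpath('//a[@href]');

printf "%s => %s\n", @$_ for @links;
printf "get_xpath found %d link element(s)\n", scalar @found;
```

The handler style is the one to reach for on large documents (add a purge call inside the handler to free memory), while get_xpath is handier for one-off scraping where the page fits in memory.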