
[Boston.pm] XML::Twig does HTML RE: HTML parsing: msg#00015

lang.perl.perl-mongers.boston

Subject: [Boston.pm] XML::Twig does HTML RE: HTML parsing

Tom,

> It seems like what is missing is a module that provides a
> regular-expression style language for matching against tags. It would
> make screen scraping tasks almost trivial. Anyone know of a module
> like this?
> What's your favorite HTML parsing module?

XML::Twig is the grep for XML (and comes bundled with xml_grep(1)).

With its new parse_html() option, XML::Twig will use HTML::TreeBuilder
for you to convert HTML to its internal representation of XML,
shielding you from HTML::TreeBuilder's interface. You can then run
XPath-like queries (the regex-style matching against tags you asked
about) on the HTML document and let its pattern engine walk the tree
looking for twigs that match your query.
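
A minimal sketch of what that might look like. It assumes
HTML::TreeBuilder is installed alongside XML::Twig; the sample HTML and
the variable names are made up for illustration, and this is not a
tested recipe for the new mode:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample input: not well-formed XHTML (the <p> tags are never closed).
my $html = <<'HTML';
<html><body>
<p>See <a href="http://www.xmltwig.com/">the Twig site</a>
<p>and <a href="http://search.cpan.org/">CPAN</a>
</body></html>
HTML

my $twig = XML::Twig->new;
$twig->parse_html($html);    # converts via HTML::TreeBuilder, then parses as XML

# XPath-like query: every <a> element that carries an href attribute.
my @links = map { $_->att('href') } $twig->findnodes('//a[@href]');
print "$_\n" for @links;
```

The same query string should also work with xml_grep on real XML input.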

http://search.cpan.org/search?query=XML-Twig&mode=dist
Which references
<<The XML::Twig page is at http://www.xmltwig.com/xmltwig/ It includes
the development version of the module, a slightly better version of the
documentation, examples, a tutorial and a: Processing XML efficiently
with Perl and XML::Twig:
http://www.xmltwig.com/xmltwig/tutorial/index.html >>

Which has a useful summary at http://www.xmltwig.com/xmltwig/quick_ref.html
[but read the tutorial first].

It can work in either a stream/callback-handler mode or a
parse-then-search mode, can work as an XML-aware sed (with an in-place
option!), can preserve or change encoding, etc. A very Perl-friendly
way to deal with XML.
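
Roughly, the two modes compare like this. This is a sketch only: the
<feed>/<item>/<title> document and the variable names are invented for
the example.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document.
my $xml = '<feed><item><title>Alpha</title></item>'
        . '<item><title>Beta</title></item></feed>';

# Stream/callback mode: the handler fires as each <item> finishes parsing.
my @streamed;
my $stream = XML::Twig->new(
    twig_handlers => {
        item => sub {
            my ($t, $elt) = @_;
            push @streamed, $elt->first_child_text('title');
            $t->purge;    # release what has been parsed so far
        },
    },
);
$stream->parse($xml);

# Parse-then-search mode: build the whole tree, then query it.
my $tree = XML::Twig->new;
$tree->parse($xml);
my @searched = map { $_->text } $tree->findnodes('//item/title');

print "streamed: @streamed\n";
print "searched: @searched\n";
```

The callback style keeps memory flat on large files because of the
purge; the tree style is simpler when the document fits in memory.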

CAVEAT -- I haven't tried this new HTML-happy mode yet; I've wished for
it in the past, when XML::Twig rejected HTML that wasn't well-formed
XHTML. Now with this new option, it probably accepts anything
HTML::TreeBuilder does and pretends it read a conformant XHTML
document. I've got to try this too.

-- Bill / n1vux
Not speaking for the firm

