Subject: Re: [Boston.pm] HTML parsing
On Mon, Mar 19, 2007 at 02:36:36PM -0400, Tom Metro wrote:
> Modules like HTML::TreeBuilder don't buy you much, as you're still
> left with the task of walking the tree and implementing a state
> machine.  HTML::Element, which is used with HTML::TreeBuilder to
> operate on nodes and traverse the tree, provides methods to test
> parent-child relationships ($h->is_inside('tag'),
> $h->look_down(_tag => 'tag')) and adjacency ($h->left(), $h->right()),
> which should make the job simpler, but in the example above they
> still may be of little help if the two tags you are looking for are
> merely "distant cousins."
> 
> The closest I found to meeting the requirements of my example is
> covered in the "Complex Criteria in Tree Scanning" section of this
> article on using HTML::TreeBuilder:
> http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/Tree/Scanning.pod#Complex_Criteria_in_Tree_Scanning
> 
> where the look_down() method is used in conjunction with criteria
> specified as code references, which can then tease out complex
> relationships among tags. But this is just another way of hand-rolling
> a state machine, with slightly cleaner syntax.
> 
> 
> It seems like what is missing is a module that provides a
> regular-expression style language for matching against tags. It would
> make screen scraping tasks almost trivial. Anyone know of a module
> like this?
> 
> What's your favorite HTML parsing module?

I certainly use TreeBuilder a lot, but I'm not sure what kind of API
you're looking for.  Maybe something like XML::Twig's get_xpath?  Of
course, with the quality of HTML in the wild, it might be difficult to
get it loaded into an XML parser...
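
For illustration, here is a minimal sketch of the get_xpath style of
query, assuming the page is already well-formed XHTML (which, as noted,
is often not the case). The sample markup and the "note" class are
made up for the example; only XML::Twig's new, parse, get_xpath and
text methods are used.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical, well-formed XHTML sample. Real-world HTML frequently
# fails to parse as XML, so treat this input as an assumption.
my $xhtml = <<'END_XHTML';
<html>
  <body>
    <p class="note">First note</p>
    <p>Plain paragraph</p>
    <div><p class="note">Second note</p></div>
  </body>
</html>
END_XHTML

my $twig = XML::Twig->new;
$twig->parse($xhtml);

# Select every <p class="note"> anywhere in the document,
# regardless of how deeply it is nested.
my @notes = $twig->get_xpath('//p[@class="note"]');

# Print the text content of each matching element.
print $_->text, "\n" for @notes;
```

The query string takes the place of the hand-written tree walk: the
nesting depth of the matching elements does not have to be spelled out
in code.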

Dan

-- 
Dan Boger
dan-rlx3YLNxYWXQT0dZR+AlfA@xxxxxxxxxxxxxxxx
