
Re: Host specific parsing: msg#00281

nutch-user.lucene.apache.org

Subject: Re: Host specific parsing

It has been a long time since I wrote a plugin.
Couldn't you simply embed the different logic in the same plugin?
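
A minimal sketch of what "different logic in the same plugin" could look like, using the host1/host2 title example from the original question. Everything here is an assumption made for illustration: the class name HostTitleExtractor, the helper nthTagText, and the use of a regular expression instead of a real HTML parser. It does not use any Nutch API; an actual plugin would presumably work on whatever parsed document Nutch hands to it.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: one class that holds all per-host branches. */
class HostTitleExtractor {

    /**
     * Returns the inner text of the n-th (1-based) occurrence of the given tag,
     * or null if there are fewer than n occurrences. The regex is a
     * simplification: nested markup inside the tag is returned unchanged.
     */
    static String nthTagText(String html, String tag, int n) {
        Pattern p = Pattern.compile(
                "<" + tag + "\\b[^>]*>(.*?)</" + tag + ">",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        for (int i = 1; m.find(); i++) {
            if (i == n) {
                return m.group(1).trim();
            }
        }
        return null;
    }

    /** Picks the extraction rule by host; unknown hosts fall back to <title>. */
    static String extractTitle(String url, String html) {
        String host = URI.create(url).getHost();
        if ("host1".equals(host)) {
            return nthTagText(html, "h1", 1); // host1: first <h1>
        }
        if ("host2".equals(host)) {
            return nthTagText(html, "h2", 3); // host2: third <h2>
        }
        return nthTagText(html, "title", 1);  // assumed default
    }
}
```

This stays manageable for a handful of hosts. Once the list grows, the hard-coded branches would probably be better replaced by rules read from configuration.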


sudhi

--- On Tue, 7/28/09, Koch Martina <Koch@xxxxxxxxxxxxxx> wrote:

From: Koch Martina <Koch@xxxxxxxxxxxxxx>
Subject: Host specific parsing
To: "nutch-user@xxxxxxxxxxxxxxxxx" <nutch-user@xxxxxxxxxxxxxxxxx>
Date: Tuesday, July 28, 2009, 2:24 AM

Hi,

has anyone built a parsing plugin which decides on a per host basis how the
content of the document should be parsed?

For example, if the title of a document is in the first <h1> tag of a page for
host1, but the title for a document of host2 is in the third <h2> tag, the
plugin would extract the title differently depending on the host.

In my opinion something like a dispatcher plugin would be needed:

- Identify the host of a document

- Read and cache instructions on how to get the information for that host
  (database or config file)

- Execute the host-specific plugin
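
The three steps above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class name HostDispatcher, the loader function, and the representation of a host-specific handler as a Function from page content to extracted value are all hypothetical, and none of it is Nutch API. The loader stands in for the database or config file lookup.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical dispatcher: resolves a handler per host and caches it. */
class HostDispatcher {

    // Cache of host -> handler, so the lookup source is hit once per host.
    private final Map<String, Function<String, String>> cache =
            new ConcurrentHashMap<>();

    // Assumed lookup: given a host, build its handler (e.g. from a config file).
    private final Function<String, Function<String, String>> loader;

    HostDispatcher(Function<String, Function<String, String>> loader) {
        this.loader = loader;
    }

    String parse(String url, String content) {
        // Step 1: identify the host of the document.
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("no host in URL: " + url);
        }
        // Step 2: read the instructions for that host, cached after first use.
        Function<String, String> handler = cache.computeIfAbsent(host, loader);
        if (handler == null) {
            return null; // loader had nothing for this host
        }
        // Step 3: execute the host-specific logic.
        return handler.apply(content);
    }
}
```

With a cache like this, the cost of the database or config lookup is paid once per host rather than once per document, which is likely where most of the performance concern would otherwise lie. The sketch never evicts entries, so a crawl over a very large number of hosts would need a bounded cache instead.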

Do you have any suggestions on how to implement such a scenario efficiently?
Has anyone implemented something similar and can point out possible
performance issues or other critical issues to be considered?

Thanks in advance.

Kind regards,
Martina


