
Host specific parsing: msg#00275

nutch-user.lucene.apache.org

Subject: Host specific parsing

Hi,

has anyone built a parsing plugin which decides on a per-host basis how the
content of a document should be parsed?

For example, if the title of a document is in the first <h1> tag of a page
for host1, but in the third <h2> tag for host2, the plugin would extract the
title differently depending on the host.
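
To make the example concrete, here is a rough sketch of what such a per-host
rule might look like. The class TitleRule and its methods are hypothetical
names made up for illustration, not part of Nutch, and the regex matching is
only an assumption to keep the sketch short; a real plugin would more likely
work on the parsed DOM.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical rule: use the n-th occurrence (1-based) of a tag as the title. */
final class TitleRule {
    private final Pattern pattern;
    private final int occurrence;

    TitleRule(String tag, int occurrence) {
        // Naive, non-nested tag matching; assumed good enough for a sketch only.
        this.pattern = Pattern.compile(
            "<" + Pattern.quote(tag) + "\\b[^>]*>(.*?)</" + Pattern.quote(tag) + "\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        this.occurrence = occurrence;
    }

    /** Returns the trimmed content of the configured occurrence, if the page has one. */
    Optional<String> extract(String html) {
        Matcher m = pattern.matcher(html);
        int seen = 0;
        while (m.find()) {
            if (++seen == occurrence) {
                return Optional.of(m.group(1).trim());
            }
        }
        return Optional.empty();
    }
}
```

With this, host1 would be configured as new TitleRule("h1", 1) and host2 as
new TitleRule("h2", 3).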

In my opinion something like a dispatcher plugin would be needed:

- Identify host of a document

- Read and cache instructions on how to get the information for that
host (database or config file)

- Execute host-specific plugin
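
The three steps above could be sketched roughly as follows. This is plain
Java and deliberately does not use any Nutch API; HostDispatcher, the loader
function and the fallback are hypothetical names, and the loader stands in
for whatever database or config file lookup would be used.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical dispatcher: picks a host-specific extractor and caches it. */
final class HostDispatcher {
    // Assumed to be slow (database or config file), so its results are cached.
    private final Function<String, Function<String, String>> loader;
    // Used when a host has no specific instructions or the URL has no host.
    private final Function<String, String> fallback;
    private final Map<String, Function<String, String>> cache = new ConcurrentHashMap<>();

    HostDispatcher(Function<String, Function<String, String>> loader,
                   Function<String, String> fallback) {
        this.loader = loader;
        this.fallback = fallback;
    }

    String parse(String url, String html) {
        // Step 1: identify the host (URI.create throws on a malformed URL).
        String host = URI.create(url).getHost();
        if (host == null) {
            return fallback.apply(html);
        }
        String key = host.toLowerCase(Locale.ROOT);
        // Step 2: load the instructions once per host and cache them.
        Function<String, String> extractor = cache.computeIfAbsent(key, h -> {
            Function<String, String> loaded = loader.apply(h);
            return loaded != null ? loaded : fallback;
        });
        // Step 3: run the host-specific extraction.
        return extractor.apply(html);
    }
}
```

Caching the fallback for unknown hosts means the loader is hit at most once
per host, at the cost of an unbounded map; for a large crawl a size-limited
cache would probably be needed.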

Do you have any suggestions on how to implement such a scenario efficiently?
Has anyone implemented something similar and can point out possible
performance issues or other critical issues to be considered?

Thanks in advance.

Kind regards,
Martina