Re: Host specific parsing: msg#00281
nutch-user.lucene.apache.org
It has been a long time since I wrote a plugin. You could simply embed the different logic in the same plugin, can't you?

sudhi

--- On Tue, 7/28/09, Koch Martina <Koch@xxxxxxxxxxxxxx> wrote:

From: Koch Martina <Koch@xxxxxxxxxxxxxx>
Subject: Host specific parsing
To: "nutch-user@xxxxxxxxxxxxxxxxx" <nutch-user@xxxxxxxxxxxxxxxxx>
Date: Tuesday, July 28, 2009, 2:24 AM

Hi,

has anyone built a parsing plugin which decides on a per-host basis how the content of a document should be parsed? For example, if the title of a document is in the first <h1> tag of a page for host1, but the title of a document for host2 is in the third <h2> tag, the plugin would extract the title differently depending on the host.

In my opinion something like a dispatcher plugin would be needed:
- Identify the host of a document
- Read and cache instructions on how to get the information for that host (database or config file)
- Execute the host-specific plugin

Do you have any suggestions on how to implement such a scenario efficiently? Has anyone implemented something similar and can point out possible performance issues or other critical issues to be considered?

Thanks in advance.

Kind regards,
Martina
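The dispatcher idea from the question (identify the host, read and cache its instructions, run host-specific logic), combined with the suggestion to keep all the logic in one plugin, could look roughly like the sketch below. It is plain Java and not tied to any Nutch API: the names HostTitleDispatcher, Rule, and the loader function are hypothetical, the rule format ("n-th occurrence of a tag") is an assumption taken from the example in the question, and the regex is only a stand-in for whatever HTML parsing the real plugin would use.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical dispatcher: picks a title-extraction rule per host and caches it. */
final class HostTitleDispatcher {

    /** Hypothetical rule: the title is the n-th (1-based) occurrence of a tag. */
    static final class Rule {
        private final int occurrence;
        private final Pattern pattern;

        Rule(String tag, int occurrence) {
            this.occurrence = occurrence;
            // Naive regex matching <tag ...>text</tag>; a real plugin would use a DOM.
            String t = Pattern.quote(tag);
            this.pattern = Pattern.compile(
                "<" + t + "\\b[^>]*>(.*?)</" + t + ">",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        }

        /** Returns the text of the configured occurrence, or null if not found. */
        String extract(String html) {
            Matcher m = pattern.matcher(html);
            int seen = 0;
            while (m.find()) {
                if (++seen == occurrence) {
                    // Drop nested markup and surrounding whitespace.
                    return m.group(1).replaceAll("<[^>]+>", "").trim();
                }
            }
            return null;
        }
    }

    private final Map<String, Rule> cache = new ConcurrentHashMap<>();
    private final Function<String, Rule> loader; // e.g. backed by a database or config file
    private final Rule fallback;                 // used for hosts without a rule

    HostTitleDispatcher(Function<String, Rule> loader, Rule fallback) {
        this.loader = loader;
        this.fallback = fallback;
    }

    String extractTitle(String url, String html) {
        // Step 1: identify the host of the document.
        String host = URI.create(url).getHost();
        if (host == null) {
            return fallback.extract(html);
        }
        // Step 2: load the rule once per host, then serve it from the cache.
        Rule rule = cache.computeIfAbsent(host.toLowerCase(Locale.ROOT), h -> {
            Rule loaded = loader.apply(h);
            return loaded != null ? loaded : fallback;
        });
        // Step 3: run the host-specific logic.
        return rule.extract(html);
    }
}
```

With this shape the loader is consulted at most once per host, so the per-document cost is a map lookup plus the extraction itself. The cache here never expires, so rule changes would only be picked up after a restart unless an eviction policy is added.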