Re: how to exclude some external links: msg#00298

nutch-user.lucene.apache.org

Subject: Re: how to exclude some external links

On Thu, Jul 30, 2009 at 9:15 PM, <alxsss@xxxxxxx> wrote:

> I would like to know how I can modify the Nutch code to exclude external
> links with certain extensions. For example, if I have mydomain.com in urls
> and mydomain.com has a lot of links like mydomain.com/mylink.shtml, then I
> want Nutch not to fetch (crawl) these kinds of URLs at all.

Can't you do this with the existing RegexURLFilter plugin? Make sure
urlfilter-regex is listed in plugin.includes, and that the property
urlfilter.regex.file is set to a file (probably regex-urlfilter.txt).
Then you can list the extensions you want to skip in that file.
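
As a rough sketch, the relevant lines in regex-urlfilter.txt might look
something like the following. The .shtml extension is just taken from the
example in the question; the rest of your file will differ, so treat this
as an illustration rather than a drop-in file. Each rule is a regular
expression prefixed with - (skip) or + (accept), and as far as I know the
first rule that matches a URL decides, so the skip rule has to come before
any accept rule that would also match.

```
# skip URLs ending in .shtml (extension assumed from the question)
-\.(shtml|SHTML)$

# accept anything else
+.
```

To skip more extensions, add them to the alternation, e.g. (shtml|SHTML|php|PHP).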

--
http://www.linkedin.com/in/paultomblin
