Re: crawl-tool.xml: msg#00263

nutch-user.lucene.apache.org

Subject: Re: crawl-tool.xml

It's not only confusing me,
it's also confusing FrankMcCown, the author of the Nutch tutorial:

http://wiki.apache.org/nutch/NutchTutorial


Crawl Command: Configuration

To configure things for the crawl command you must:

* Create a directory with a flat file of root urls. For example, to
  crawl the nutch site you might start with a file named urls/nutch
  containing the url of just the Nutch home page. All other Nutch
  pages should be reachable from this page. The urls/nutch file
  would thus contain:

http://lucene.apache.org/nutch/

* Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME
  with the name of the domain you wish to crawl. For example, if you
  wished to limit the crawl to the apache.org domain, the line
  should read:

+^http://([a-z0-9]*\.)*apache.org/

This will include any url in the domain apache.org.

* Until someone can explain this: when I use the file
  crawl-urlfilter.txt, the filter doesn't work. Use the file
  conf/regex-urlfilter.txt instead and change the last line from "+." to "-."
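
To make the effect of that last line concrete, here is a minimal sketch of how such a filter file might be evaluated. It assumes the behaviour described in the comments of the stock filter files: each line is "+" or "-" followed by a regex, the first matching rule decides, and a URL that matches no rule is dropped. The class and method names are hypothetical, and this is plain java.util.regex, not Nutch's own filter code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class FirstMatchFilterSketch {

    // One rule line: '+' (include) or '-' (exclude) followed by a regex.
    static final class Rule {
        final boolean include;
        final Pattern pattern;

        Rule(String line) {
            this.include = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    static List<Rule> parse(String... lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            rules.add(new Rule(line));
        }
        return rules;
    }

    // Assumption: the first matching rule decides, and a URL that
    // matches no rule at all is rejected.
    static boolean accepts(String url, String... ruleLines) {
        for (Rule rule : parse(ruleLines)) {
            if (rule.pattern.matcher(url).find()) {
                return rule.include;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The domain line from the tutorial, written as a Java string literal.
        String domainRule = "+^http://([a-z0-9]*\\.)*apache.org/";
        String[] urls = {
            "http://lucene.apache.org/nutch/",
            "http://www.example.com/"
        };
        for (String url : urls) {
            System.out.println(url
                + "  last line \"+.\": " + accepts(url, domainRule, "+.")
                + "  last line \"-.\": " + accepts(url, domainRule, "-."));
        }
    }
}
```

With the last line left as "+.", a URL outside apache.org falls through to the catch-all and is still accepted; with "-." only URLs matching the domain line pass.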


reinhard schwab wrote:
> i have tried the recrawl script of susam pal and have wondered why
> url filtering no longer works.
> http://wiki.apache.org/nutch/Crawl
>
> the mystery is
>
> only Crawl.java adds crawl-tool.xml to the NutchConfiguration.
>
> Configuration conf = NutchConfiguration.create();
> conf.addResource("crawl-tool.xml");
>
> Fetcher.java and all the other tools which filter the outlinks do not
> add this.
> this is really confusing me and i have spent some time to figure this out.
>
> regards
> reinhard
>
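
A rough model of why this would happen, as a sketch only: it assumes that the default configuration points the property urlfilter.regex.file at regex-urlfilter.txt and that crawl-tool.xml overrides it with crawl-urlfilter.txt. Both the property name and the values are assumptions here and are worth checking against your own conf/ directory. The class below is hypothetical and uses plain maps instead of the real configuration classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LayeredConfigSketch {

    // Assumed default: the regex filter reads regex-urlfilter.txt.
    static Map<String, String> defaults() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("urlfilter.regex.file", "regex-urlfilter.txt");
        return m;
    }

    // Assumed content of crawl-tool.xml: it overrides the filter file.
    static Map<String, String> crawlTool() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("urlfilter.regex.file", "crawl-urlfilter.txt");
        return m;
    }

    // A resource added later overrides earlier values, which is how
    // layered configuration is modelled in this sketch.
    static String filterFile(boolean addCrawlTool) {
        Map<String, String> conf = new LinkedHashMap<>(defaults());
        if (addCrawlTool) {
            conf.putAll(crawlTool());
        }
        return conf.get("urlfilter.regex.file");
    }

    public static void main(String[] args) {
        System.out.println("Crawl (adds crawl-tool.xml): " + filterFile(true));
        System.out.println("Other tools (defaults only): " + filterFile(false));
    }
}
```

If those assumptions hold, a tool that never adds crawl-tool.xml would read regex-urlfilter.txt, so edits made only to crawl-urlfilter.txt would have no effect outside the crawl command.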
