Re: Why did my crawl fail?: msg#00261 nutch-user.lucene.apache.org
You must have crawled several times, and some of those runs failed before the parse phase, so their parse data was never generated. You'd better delete the whole directory file:/Users/ptomblin/nutch-1.0/crawl.blog and recrawl it; the output will then tell you exactly why it failed in the parse phase.

Xiao

On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin <ptomblin@xxxxxxxxx> wrote:
> I installed nutch 1.0 on my laptop last night and set it running to crawl my
> blog with the command: bin/nutch crawl urls -dir crawl.blog -depth 10
> it was still running strong when I went to bed several hours later, and this
> morning I woke up to this:
>
> activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl.blog/crawldb
> CrawlDb update: segments: [crawl.blog/segments/20090724010303]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl.blog/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment:
> file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
> LinkDb: adding segment:
> file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
> LinkDb: adding segment:
> file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
> LinkDb: adding segment:
> file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
> LinkDb: adding segment:
> file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
> LinkDb: adding segment:
> file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
> LinkDb: adding segment:
> file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
> LinkDb: adding segment:
> file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
> LinkDb: adding segment:
> file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
> LinkDb: adding segment:
> file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
> LinkDb: adding segment:
> file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist:
> file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
> at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> at
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
>
>
> --
> http://www.linkedin.com/in/paultomblin
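
Before deleting the whole crawl directory, it may help to see which segments actually lack parse output. Here is a minimal sketch in Python. It assumes the usual Nutch 1.0 segment layout, where each segment directory under crawl.blog/segments gets a parse_data subdirectory once parsing has finished. The function name segments_missing_parse_data is made up for illustration and is not part of Nutch.

```python
from pathlib import Path


def segments_missing_parse_data(segments_dir):
    """Return the names of segment directories that have no parse_data subdirectory.

    Assumption: segments_dir is a local directory such as crawl.blog/segments,
    with one subdirectory per segment (named by timestamp).
    """
    root = Path(segments_dir)
    return sorted(
        seg.name
        for seg in root.iterdir()
        # Only look at segment directories; a segment counts as incomplete
        # when its parse_data directory is absent.
        if seg.is_dir() and not (seg / "parse_data").is_dir()
    )
```

Calling segments_missing_parse_data("crawl.blog/segments") from the nutch-1.0 directory would list the segments that the LinkDb step cannot read. Judging by the trace above, 20090723154530 should be among them.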