Re: Why did my crawl fail?: msg#00261

nutch-user.lucene.apache.org

Subject: Re: Why did my crawl fail?

You must have crawled several times, and some of those runs failed
before the parse phase, so the parse data was never generated for
their segments. You'd better delete the whole directory
file:/Users/ptomblin/nutch-1.0/crawl.blog and recrawl. If it fails
in the parse phase again, the output will show you the exact reason.
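
Before deleting anything, it can help to see which segments are actually
incomplete. The following is only a minimal sketch, not part of Nutch: it
assumes the layout visible in the log below (a segments/ directory under the
crawl directory, with one parse_data subdirectory per fully parsed segment).
The function name segments_missing_parse_data is made up for illustration.

```python
from pathlib import Path
import sys


def segments_missing_parse_data(crawl_dir):
    """Return the names of segment directories that have no parse_data.

    Assumes the layout <crawl_dir>/segments/<timestamp>/parse_data,
    as suggested by the paths in the error message.
    """
    segments_root = Path(crawl_dir) / "segments"
    if not segments_root.is_dir():
        return []
    return sorted(
        seg.name
        for seg in segments_root.iterdir()
        if seg.is_dir() and not (seg / "parse_data").is_dir()
    )


if __name__ == "__main__":
    # Default to the crawl directory used in the original command.
    crawl_dir = sys.argv[1] if len(sys.argv) > 1 else "crawl.blog"
    for name in segments_missing_parse_data(crawl_dir):
        print(name)
```

Run from the nutch-1.0 directory, it prints one incomplete segment name per
line. Any segment it lists would trigger the same "Input path does not exist"
error when LinkDb tries to read it.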

Xiao

On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin<ptomblin@xxxxxxxxx> wrote:
> I installed Nutch 1.0 on my laptop last night and set it running to crawl my
> blog with the command:  bin/nutch crawl urls -dir crawl.blog -depth 10
> It was still running strong when I went to bed several hours later, and this
> morning I woke up to this:
>
> activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl.blog/crawldb
> CrawlDb update: segments: [crawl.blog/segments/20090724010303]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl.blog/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
>
>
> --
> http://www.linkedin.com/in/paultomblin
>
