
RE: Why did my crawl fail?: msg#00258

nutch-user.lucene.apache.org

Subject: RE: Why did my crawl fail?

This is a very interesting issue. I guess that the absence of parse_data means
that no content was fetched for that segment. Am I wrong?

This has happened in my crawls a few times. In theory (I am guessing again),
it may happen when all URLs selected for fetching in an iteration are either
blocked by the filters or fail to be fetched, for whatever reason.

I got around this problem by checking each segment for parse_data and
deleting the segment if it is absent. This seems to work, but I am not 100%
sure that it is a good thing to do. Is it safe? I would appreciate it if
someone with expert knowledge commented on this issue.
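
For reference, here is a minimal sketch of that check in Python. It is not part of Nutch. It assumes the segments live on the local filesystem (as the file: paths in the quoted log suggest) and that a segment without a parse_data subdirectory is the only case to handle. The function names and the dry_run flag are my own, chosen for illustration.

```python
import shutil
from pathlib import Path


def find_incomplete_segments(segments_dir):
    """Return segment directories that have no parse_data subdirectory."""
    root = Path(segments_dir)
    return sorted(
        seg for seg in root.iterdir()
        if seg.is_dir() and not (seg / "parse_data").is_dir()
    )


def remove_incomplete_segments(segments_dir, dry_run=True):
    """Delete segments lacking parse_data; with dry_run=True only report them."""
    incomplete = find_incomplete_segments(segments_dir)
    for seg in incomplete:
        if dry_run:
            print(f"would remove {seg}")
        else:
            # Irreversible: the whole segment directory is deleted.
            shutil.rmtree(seg)
            print(f"removed {seg}")
    return incomplete
```

Calling remove_incomplete_segments("crawl.blog/segments") only lists the candidates; passing dry_run=False actually deletes them. Whether deleting is safe is exactly the open question above, so the default is to report and leave the segments in place.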

Regards,

Arkadi


> -----Original Message-----
> From: ptomblin@xxxxxxxxx [mailto:ptomblin@xxxxxxxxx] On Behalf Of Paul
> Tomblin
> Sent: Saturday, July 25, 2009 12:54 AM
> To: nutch-user
> Subject: Why did my crawl fail?
>
> I installed nutch 1.0 on my laptop last night and set it running to crawl my
> blog with the command: bin/nutch crawl urls -dir crawl.blog -depth 10
> it was still running strong when I went to bed several hours later, and this
> morning I woke up to this:
>
> activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl.blog/crawldb
> CrawlDb update: segments: [crawl.blog/segments/20090724010303]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl.blog/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
>
>
> --
> http://www.linkedin.com/in/paultomblin
