Re: Nutch 1.0 Fetch failure...: msg#00221

nutch-user.lucene.apache.org

Subject: Re: Nutch 1.0 Fetch failure...

Hi,

On Mon, Jul 20, 2009 at 19:55, Fred Kuipers<mr.fredk@xxxxxxxxx> wrote:
> Hello all,
>
> I'm attempting to index a large internal website with 6.7 m urls and I'm
> running into a map failure after fetching (for 5+ days):
>
> 2009-07-20 07:09:23,316 INFO  fetcher.Fetcher - -activeThreads=0
> 2009-07-20 07:09:23,806 WARN  mapred.LocalJobRunner - job_local_0005
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/file.out
>     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>     at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1209)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:867)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>
> hadoop-site.xml:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
> <!--
> We need LOTS of memory... And we need to disable the gc overhead limit, per
> this page:
> http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom
> -->
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx4096m -XX:-UseGCOverheadLimit</value>
> </property>
>
> </configuration>
>
> nutch-site.xml (excluding http.agent directives for brevity):
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
> <!-- http.agent properties excluded -->
>
> <property>
>   <name>http.timeout</name>
>   <value>20000</value>
>   <description>The default network timeout, in milliseconds.</description>
> </property>
>
> <property>
>   <name>fetcher.threads.fetch</name>
>   <value>20</value>
>   <description>The number of FetcherThreads the fetcher should use.
>   This also determines the maximum number of requests that are
>   made at once (each FetcherThread handles one connection).</description>
> </property>
>
> <property>
>   <name>fetcher.threads.per.host</name>
>   <value>20</value>
>   <description>This number is the maximum number of threads that
>   should be allowed to access a host at one time.</description>
> </property>
>
> <property>
>   <name>fetcher.server.delay</name>
>   <value>0.1</value>
>   <description>The number of seconds the fetcher will delay between
>   successive requests to the same server.</description>
> </property>
>
> </configuration>
>
> Relevant environment variables:
> NUTCH_JAVA_HOME=/usr/lib/jvm/jre-1.7.0-icedtea.x86_64
> NUTCH_HEAPSIZE=3072
> JAVA_HOME=/usr/lib/jvm/jre-1.7.0-icedtea.x86_64
>
> I ran nutch with the following command/cwd:
> [/home/fred/nutch-1.0]$ bin/nutch crawl urls_wiki_mirror -dir
> crawl_wiki_mirror -threads 3 -depth 1
>
> The seed file in urls_wiki_mirror contains 6739469 urls... Those are the
> only urls I wish to crawl -- hence depth 1. The configuration I have set up
> allows me to crawl this local server with 3 fetchers at the same time, at a
> rate that doesn't overwhelm the server.
>
> I'm using defaults for temp directories. Thus, /tmp/hadoop-fred/ is the temp
> file location. The error message notes the following partial path:
> taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/file.out
>
> I figure that equates to this full path:
> /tmp/hadoop-fred/mapred/local/taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/
>
> The contents of this directory are spill[0-906].out... Nothing else. No
> file.out. There is 68G of data in this folder (i.e. it looks to have
> downloaded everything I need)... There is 9+ GB of free space on the
> filesystem -- is it possible this is insufficient?
>

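It is possible that you ran out of space; it is also possible that you hit
a Hadoop bug. The trace fails in MapTask's mergeParts, which merges the
spill files into a single file.out and asks for a local directory with
about as much free space as the spills occupy. With 68G of spills and 9 GB
free, that request would likely fail. From the logs, it doesn't look like a
Nutch bug.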
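
One way to check the disk-space theory is to compare the size of the
spill directory with the free space on the same filesystem. A minimal
sketch in plain sh, assuming only du, df and awk are available; the
function names are made up for illustration and are not part of Nutch or
Hadoop:

```shell
#!/bin/sh
# Sketch only: helper names below are hypothetical, not Nutch/Hadoop tools.

# Print the size of directory $1 in KB.
dir_size_kb() {
    du -sk "$1" | awk '{print $1}'
}

# Print the free space, in KB, of the filesystem that holds $1.
# -P forces the one-line-per-filesystem POSIX layout; column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 {print $4}'
}

# Succeed (exit status 0) if the filesystem holding $1 has more free
# space than $1 itself occupies, i.e. a merged copy of it could fit.
has_room_for_merge() {
    [ "$(free_kb "$1")" -gt "$(dir_size_kb "$1")" ]
}
```

Calling has_room_for_merge on the output/ directory you listed above and
getting a non-zero exit status would point to disk space rather than a bug.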

> So, what happened? Is there a way I can recover without re-crawling?
>

You can try this tool:

http://issues.apache.org/jira/browse/NUTCH-451

There is no guarantee that it will work though.

> I am running on a Fedora Core 8 virtual machine with two cores, 4 GB memory.
>
> Let me know if any more information is needed...
>

Can you try crawling in smaller units, i.e., crawl the first 1M docs, then
the next 1M, and so on?
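
A rough sketch of how the seed list could be cut into chunks, assuming the
seed file holds one URL per line and that each chunk gets its own seed
directory. The function name, the seeds.txt file name and the urls_XX
directory names are made up for illustration:

```shell
#!/bin/sh
# Sketch only: split_seeds and the naming scheme are hypothetical.

# split_seeds SEED_FILE OUT_DIR LINES_PER_CHUNK
# Splits SEED_FILE into pieces of at most LINES_PER_CHUNK lines and puts
# each piece in its own directory: OUT_DIR/urls_aa/seeds.txt, urls_ab/...
split_seeds() {
    seed_file=$1
    out_dir=$2
    lines=$3

    mkdir -p "$out_dir"
    split -l "$lines" "$seed_file" "$out_dir/seeds_"

    for f in "$out_dir"/seeds_*; do
        # ${f##*_} keeps only the suffix that split generated (aa, ab, ...).
        d="$out_dir/urls_${f##*_}"
        mkdir -p "$d" && mv "$f" "$d/seeds.txt"
    done
}

# Example, mirroring the original command but once per chunk:
#   split_seeds /path/to/seed-file chunks 1000000
#   bin/nutch crawl chunks/urls_aa -dir crawl_wiki_mirror_aa -threads 3 -depth 1
```

With about 6.7M urls and 1,000,000 lines per chunk this gives seven seed
directories, so each fetch produces far less map output at a time.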

> Thanks,
> /FjK
>



--
Doğacan Güney