Using Nutch (w/custom plugin) to crawl vs. custom Lucene app: msg#00272

nutch-user.lucene.apache.org

Subject: Using Nutch (w/custom plugin) to crawl vs. custom Lucene app

Hi,

I've been familiarizing myself with Nutch in preparation for putting together
a proof-of-concept (POC). Basically, we have some files of a proprietary file
type, and we want to be able to search on specific "fields" within these
files. The files are physically stored on the local filesystem.

Thus far, I've gotten an initial Nutch instance working, and also a second
Nutch instance configured for crawling the local filesystem. These test
instances just use the "out-of-box" Nutch and Nutch plugins (e.g., the PDF
plugin), just to allow me to get familiar with the Nutch software.

Having done that, my original idea was to write some Nutch plugins that could
be used with a Nutch crawl.

However, we already have some previously built apps that basically "crawl"
the local filesystem (i.e., they do a recursive directory search) and find
all of these files. These are Java apps that we built earlier for various
purposes.
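
For reference, the recursive directory search those apps do amounts to
roughly the following. This is only a minimal sketch, not our actual code:
the class name FileFinder, the method findFiles, and the ".xyz" extension
used in the example call are placeholders I made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: walks a directory tree and collects matching files.
class FileFinder {

    // Returns every regular file under root whose name ends with the given
    // extension (e.g. ".xyz", a placeholder for the proprietary type),
    // sorted by path so the result order is stable.
    static List<Path> findFiles(Path root, String extension) throws IOException {
        // Files.walk visits root and everything below it; the stream holds
        // open directory handles, so close it with try-with-resources.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(extension))
                .sorted()
                .collect(Collectors.toList());
        }
    }
}
```

A call such as FileFinder.findFiles(Paths.get("/data"), ".xyz") would return
the list of files to process; the indexing step would then run over that list.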

So, I'm wondering if it might make more sense (and, I think, be easier) to
take one of those existing apps and just enhance it to build Lucene indexes,
which could then be used by the Nutch web app (as a web-based search web
app)?

As I said, I'm really new to Nutch, and also to Lucene, but from what I've
researched so far, it *looks like* it'd be fairly easy to extend some of our
existing apps to generate Lucene indexes, and I have some questions:

- If my custom Java app can be extended to "just" build indexes using Lucene,
is that all it needs to do in order for these indexes to work OK with the
Nutch web app?

- Am I underestimating the effort needed to build the Lucene indexes that the
Nutch web app could use?

I was wondering if anyone here has had to go through a similar situation
(Nutch plugin for custom file type vs. custom crawl app to build Lucene indexes
that the Nutch web app can use)?

Any other thoughts on all of this would be greatly appreciated from the
Nutch/Lucene experts here!!

Thanks,
Jim
