Nutch is an open source Java implementation of a search engine. It
provides all of the tools you need to run your own search engine. But why
would anyone want to run their own search engine? After all, there's
always Google. There are at least three reasons.

Transparency. Nutch is open source, so anyone can see how the ranking
algorithms work. With commercial search engines, the precise details of
the algorithms are secret so you can never know why a particular search
result is ranked as it is. Furthermore, some search engines allow rankings
to be based on payments, rather than on the relevance of the site's
contents. Nutch is a good fit for academic and government organizations,
where the perception of fairness of rankings may be more important.

Understanding. We don't have the source code to Google, so Nutch is
probably the best we have. It's interesting to see how a large search
engine works. Nutch has been built using ideas from academia and industry:
for instance, core parts of Nutch are currently being re-implemented to
use the MapReduce distributed processing model, which emerged from Google
Labs last year. Nutch is also attractive for researchers who want to try
out new search algorithms, since it is so easy to extend.

Extensibility. Don't like the way other search engines display their
results? Write your own search engine--using Nutch! Nutch is very
flexible: it can be customized and incorporated into your application. For
developers, Nutch is a great platform for adding search to heterogeneous
collections of information, customizing the search interface, and
extending the out-of-the-box functionality through the plugin
mechanism. For example, you can integrate it into your site to add a
search capability.

Nutch installations typically operate at one of three scales: local
filesystem, intranet, or whole web. All three have different
characteristics. For instance, crawling a local filesystem is reliable
compared to the other two, since network errors don't occur and caching
copies of the page content is unnecessary (and actually a waste of disk
space). Whole-web crawling lies at the other extreme. Crawling billions of
pages creates a whole host of engineering problems to be solved: which
pages do we start with? How do we partition the work between a set of
crawlers? How often do we re-crawl? How do we cope with broken links,
unresponsive sites, and unintelligible or duplicate content? There is
another set of challenges to solve to deliver scalable search--how do we
cope with hundreds of concurrent queries on such a large dataset? Building
a whole-web search engine is a major investment. In "Building Nutch: Open
Source Search," authors Mike Cafarella and Doug Cutting (the prime movers
behind Nutch) conclude that:

    ... a complete system might cost anywhere between $800 per month for
    two-search-per-second performance over 100 million pages, to $30,000
    per month for 50-page-per-second performance over 1 billion pages.

This series of two articles shows you how to use Nutch at the more modest
intranet scale. (You may see this term used to cover sites that are
actually on the public internet--the point is the size of the crawl being
undertaken, which ranges from a single site to tens, or possibly
hundreds, of sites.) This first article concentrates on crawling:
the architecture of the Nutch crawler, how to run a crawl, and
understanding what it generates. The second looks at searching, and shows
you how to run the Nutch search application, ways to customize it, and
considerations for running a real-world system.