
Meditation: throw-away mentality: msg#00027

Subject: Meditation: throw-away mentality
Greets,

Every time a document gets added to Lucene/Plucene, lots and lots of objects get created then destroyed. This makes sense for Documents and Fields, but it's worth exploring whether we can apply the principle of "reduce, reuse, recycle" to all those DocumentWriters, FieldsWriters, and TermInfosWriters. I'll focus on DocumentWriter for simplicity's sake.

There are two reasons why a fresh DocumentWriter is required for every document.

First, Lucene allows several states to be changed at the IndexWriter level, and these must propagate down to the DocumentWriter instance: maxFieldLength, similarity, and termIndexInterval may all be modified in the middle of an indexing pass, and if a new field or a newly-redefined field is encountered, fieldInfos must be updated, as described in my last meditation. The behavior of DocumentWriter depends on these states, so when the IndexWriter gets updated, DocumentWriter has to change, too. It's easier to create a new DocumentWriter each time the IndexWriter's add_document method is called than it would be to install, in each setter, the apparatus needed to propagate changes to a persistent DocumentWriter.
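
To make that trade-off concrete, here's a rough Python sketch. It is not Lucene's or Plucene's actual code: the class names mirror the ones above, but the method bodies, the token-list "document", the placeholder default values, and the doc_writer_count counter are all assumptions made up for illustration. It shows how a fresh DocumentWriter per add_document call picks up a mid-pass setter change without any propagation logic.

```python
class DocumentWriter:
    """Hypothetical stand-in: captures IndexWriter state at construction."""

    def __init__(self, max_field_length, similarity, term_index_interval):
        self.max_field_length = max_field_length
        self.similarity = similarity
        self.term_index_interval = term_index_interval

    def add_document(self, tokens):
        # Truncate the field to the limit captured at construction time.
        return tokens[: self.max_field_length]


class IndexWriter:
    """Hypothetical stand-in: its settings may change in mid-pass."""

    def __init__(self):
        self.max_field_length = 10000   # placeholder value
        self.similarity = "default"     # placeholder value
        self.term_index_interval = 128  # placeholder value
        self.doc_writer_count = 0       # illustration only

    def set_max_field_length(self, n):
        # No propagation apparatus here: the next DocumentWriter sees it.
        self.max_field_length = n

    def add_document(self, tokens):
        # A fresh DocumentWriter for every document: simple, but wasteful.
        writer = DocumentWriter(
            self.max_field_length, self.similarity, self.term_index_interval
        )
        self.doc_writer_count += 1
        return writer.add_document(tokens)


index_writer = IndexWriter()
first = index_writer.add_document(["a", "b", "c"])
index_writer.set_max_field_length(2)  # changed in the middle of the pass
second = index_writer.add_document(["a", "b", "c"])
```

The second document is truncated to two tokens even though no setter ever touched a DocumentWriter, at the price of one throw-away object per document.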

The problem goes away if those states are set once when the IndexWriter is initialized, then fixed. There is no performance penalty for doing this. If you want to apply different Similarity models to different documents, create multiple IndexWriters, then merge the indexes via IndexWriter's add_indexes method -- the merging process is exactly the same. Of course, this is an API change...
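
A sketch of what "set once, then fixed" might look like, again in Python and again with invented internals: IndexConfig, the similarity labels, and the list-based add_indexes are assumptions for illustration, not the real API. The point is only that an immutable configuration lets one DocumentWriter be created up front and reused, while different Similarity models live in separate IndexWriters that get merged afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexConfig:
    """Hypothetical: settings fixed when the IndexWriter is created."""
    max_field_length: int
    similarity: str
    term_index_interval: int


class DocumentWriter:
    def __init__(self, config):
        self.config = config  # safe to hold on to: it can never change

    def add_document(self, tokens):
        truncated = tokens[: self.config.max_field_length]
        return (self.config.similarity, truncated)


class IndexWriter:
    def __init__(self, config):
        self.config = config
        self.doc_writer = DocumentWriter(config)  # created once, reused
        self.docs = []

    def add_document(self, tokens):
        self.docs.append(self.doc_writer.add_document(tokens))

    def add_indexes(self, *others):
        # Stand-in for a real merge: concatenates the other writers' docs.
        for other in others:
            self.docs.extend(other.docs)


# One writer per Similarity model, then a merge at the end.
titles = IndexWriter(IndexConfig(100, "title_sim", 128))
bodies = IndexWriter(IndexConfig(10000, "body_sim", 128))
titles.add_document(["short", "title"])
bodies.add_document(["a", "much", "longer", "body"])

merged = IndexWriter(IndexConfig(10000, "body_sim", 128))
merged.add_indexes(titles, bodies)
```

Because IndexConfig is frozen, there are no setters to wire up, and each IndexWriter keeps the same DocumentWriter for its whole life.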

Second, Lucene builds indexes by writing each document to its own mini-inverted-index, then merging indexes of increasing size on a schedule determined by mergefactor. Since each document must be written to its own unique segment, the segment name must propagate to DocumentWriter, and unique I/O streams based on that segment name must be opened, written, and closed. It would be possible to supply the segment name as an argument to add_document instead of the constructor, but one way or another, the segment name has to propagate, and lots of I/O streams have to pass through their life cycle.

This problem goes away if you create an indexer class which departs from the 1:1 document:inverted-index model. Instead of opening new I/O streams for each document, you open one set of I/O streams and write multiple documents.

This is what KinoSearch's Kindexer does -- it writes all documents into a single segment, and only merges segments after the last document has been added and the output segment has been finalized. Of course, this means a radical, radical change to the indexing process...
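
The difference in stream traffic between the two models can be sketched like so. This is a toy in Python, not Kindexer's or Lucene's code: StreamFactory, the "_seg" names, and the one-stream-per-segment simplification are all assumptions (a real segment involves several files), and the "index" is just text written to an in-memory buffer.

```python
import io


class StreamFactory:
    """Hypothetical: hands out in-memory streams and counts them."""

    def __init__(self):
        self.opened = 0
        self.closed = 0

    def open(self, name):
        self.opened += 1
        return io.StringIO()

    def close(self, stream):
        stream.close()
        self.closed += 1


def index_per_document(docs, factory):
    # 1:1 document:segment model: a new stream for every document.
    for i, doc in enumerate(docs):
        stream = factory.open(f"_seg{i}")
        stream.write(doc + "\n")
        factory.close(stream)


def index_single_segment(docs, factory):
    # One stream, opened once, shared by every document.
    stream = factory.open("_seg0")
    for doc in docs:
        stream.write(doc + "\n")
    factory.close(stream)


docs = ["first doc", "second doc", "third doc"]

per_doc = StreamFactory()
index_per_document(docs, per_doc)

single = StreamFactory()
index_single_segment(docs, single)
```

With three documents, per_doc goes through three open/close cycles and single goes through one; the gap grows linearly with the number of documents.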

That's what it'll take if we want to stop throwing away all those writers.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


