Meditation: throw-away mentality: msg#00027
Greets,
Every time a document gets added to Lucene/Plucene, lots and lots of
objects get created then destroyed. This makes sense for Documents
and Fields, but it's worth exploring whether we can apply the
principle of "reduce, reuse, recycle" to all those DocumentWriters,
FieldsWriters, and TermInfosWriters. I'll focus on DocumentWriter
for simplicity's sake.
There are two reasons why a fresh DocumentWriter is required for
every document.
First, there are a number of states which Lucene allows to be changed
at the IndexWriter level which must propagate down to the
DocumentWriter instance: maxFieldLength, similarity, and
termIndexInterval may all be modified in the middle of an indexing
pass, and if a new field or a newly-redefined field is encountered,
fieldInfos must be updated, as described in my last meditation. The
behavior of DocumentWriter changes based on these states, so when the
IndexWriter gets updated, DocumentWriter has to change, too. It's
easier to create a new DocumentWriter each time the IndexWriter's
add_document method is called than it would be to install the
necessary apparatus in each setter for propagating changes to a
long-lived DocumentWriter.
The problem goes away if those states are set once when the
IndexWriter is initialized, then fixed. There is no performance
penalty for doing this. If you want to apply different Similarity
models to different documents, create multiple IndexWriters, then
merge the indexes via IndexWriter's add_indexes method -- the merging
process is exactly the same. Of course, this is an API change...
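To make the idea concrete, here is a minimal sketch in Python rather
than Java or Perl. Every name in it (IndexSettings, the simplified
IndexWriter and DocumentWriter, the instance counter) is made up for
illustration and is not the Lucene or Plucene API. It only assumes
that the settings are frozen at construction time, which is what lets
a single DocumentWriter be reused for every document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexSettings:
    """Hypothetical settings object; frozen, so it cannot change mid-pass."""
    max_field_length: int = 10000
    term_index_interval: int = 128
    similarity: str = "default"


class DocumentWriter:
    instances_created = 0  # counts constructions, for illustration only

    def __init__(self, settings):
        DocumentWriter.instances_created += 1
        self.settings = settings

    def add_document(self, doc):
        # Tokenize naively and truncate each field to max_field_length tokens.
        limit = self.settings.max_field_length
        return {name: text.split()[:limit] for name, text in doc.items()}


class IndexWriter:
    def __init__(self, settings):
        self.settings = settings
        # One DocumentWriter for the lifetime of this IndexWriter.
        self._doc_writer = DocumentWriter(settings)
        self.docs = []

    def add_document(self, doc):
        self.docs.append(self._doc_writer.add_document(doc))


writer = IndexWriter(IndexSettings(max_field_length=2))
for doc in ({"title": "a b c"}, {"title": "d e f"}, {"title": "g"}):
    writer.add_document(doc)
```

Because nothing can change after construction, no setter needs to
notify the DocumentWriter, and three documents cost one writer
instead of three.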
Second, Lucene builds indexes by writing each document to its own
mini-inverted-index, then merging indexes of increasing size on a
schedule determined by mergefactor. Since each document must be
written to its own unique segment, the segment name must propagate to
DocumentWriter, and unique I/O streams based on that segment name
must be opened, written, and closed. It would be possible to supply
the segment name as an argument to add_document instead of the
constructor, but one way or another, the segment name has to
propagate, and lots of I/O streams have to pass through their life
cycle.
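A rough simulation of that model, in Python, shows how many segments
get written along the way. The merge rule here is a deliberate
simplification and an assumption on my part (merge whenever the
newest merge_factor segments hold the same number of documents); it
is not Lucene's actual merge code, and the function name is
hypothetical.

```python
def add_documents(num_docs, merge_factor):
    """Simulate one-segment-per-document indexing with a simplified merge rule.

    Returns the doc counts of the surviving segments (oldest first) and the
    total number of segments ever written, merge outputs included.
    """
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")

    segments = []
    segments_written = 0
    for _ in range(num_docs):
        # Each document becomes its own single-document segment.
        segments.append(1)
        segments_written += 1
        # Merge while the newest merge_factor segments are all the same size.
        while (len(segments) >= merge_factor
               and len(set(segments[-merge_factor:])) == 1):
            merged = sum(segments[-merge_factor:])
            del segments[-merge_factor:]
            segments.append(merged)
            segments_written += 1
    return segments, segments_written
```

Under this rule, 100 documents with a merge factor of 10 end up as
one segment, but 111 segments are written to get there, and each of
them needs its own set of streams.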
This problem goes away if you create an indexer class which departs
from the 1:1 document:inverted-index model. Instead of opening new
I/O streams for each document, you open one set of I/O streams and
write multiple documents.
This is what KinoSearch's Kindexer does -- it writes all documents
into a single segment, and only merges segments after the last
document has been added and the output segment has been finalized.
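The difference in stream churn can be sketched in a few lines of
Python. This is not KinoSearch's code: the stream factory, the two
indexing functions, and the choice of three file suffixes per
segment are all assumptions made for illustration, and the "index"
is just the raw document text written to one stream.

```python
import io


class CountingStreamFactory:
    """Hypothetical stand-in for a directory; counts how many streams it opens."""

    def __init__(self):
        self.opened = 0

    def open(self, name):
        self.opened += 1
        return io.StringIO()


# Illustrative subset of the per-segment files.
STREAM_SUFFIXES = ("fdt", "fdx", "tis")


def index_per_document(docs, factory):
    # 1:1 model: every document opens, writes, and closes its own streams.
    for n, doc in enumerate(docs):
        streams = {s: factory.open(f"_{n}.{s}") for s in STREAM_SUFFIXES}
        streams["fdt"].write(doc + "\n")
        for stream in streams.values():
            stream.close()


def index_single_segment(docs, factory):
    # Single-segment model: one set of streams serves every document.
    streams = {s: factory.open(f"_0.{s}") for s in STREAM_SUFFIXES}
    for doc in docs:
        streams["fdt"].write(doc + "\n")
    for stream in streams.values():
        stream.close()


docs = ["first doc", "second doc", "third doc", "fourth doc"]
per_doc, single = CountingStreamFactory(), CountingStreamFactory()
index_per_document(docs, per_doc)
index_single_segment(docs, single)
```

In the first function the number of streams opened grows with the
number of documents; in the second it stays fixed at one set no
matter how many documents go in.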
Of course, this means a radical, radical change to the indexing
process...
That's what it'll take if we want to stop throwing away all those
writers.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/