Greets,
I'd gotten the Minty benchmark down into the 16 second range,
knocking 2 seconds off by moving the deserialization algo to XS.
However, my app had basically inlined everything in TermInfosWriter
into the one write_postings() method. It was already a little messy,
and that was before I started trying to figure out how to re-enable
skipdata.
I concluded there was no choice but to reproduce TermInfosWriter. I
did that, cutting as many corners as possible: I inlined writeTerm,
and merged Term and TermInfo into a single hash (not even an
object). Boom, we're back to over 20 seconds.
Java Lucene finishes indexing in 9 seconds. (This is with
mergeFactor set to 1000, which is only fair because KinoSearch has a
high -mem_threshold setting for Sort::External, so the "external"
sort actually gets run "internally", that is, in RAM. If I set
mergeFactor to 10, Java Lucene takes 13 seconds to finish indexing.)
If I make Term and TermInfo separate objects again, use accessor
methods rather than direct hash access to the variables, and abstract
out writeTerm again, it's no mystery what's going to happen.
Multiply those 4 seconds by every other place in Plucene where you
have objects instead of procedural code, accessors instead of
direct access, etc., and you're well on your way to 270 seconds.
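
The accessor-versus-direct-access cost is easy to see in isolation. What follows is only a rough sketch in Python, standing in for Perl, and not code from KinoSearch or Plucene. The TermInfo class, its two fields, and the function names are hypothetical and exist only for this illustration. Both loops compute the same total; one goes through accessor methods on an object, the other reads a plain hash (a dict here) directly. Absolute timings will vary by machine and interpreter, so no numbers are claimed.

```python
import timeit


class TermInfo:
    """Hypothetical object-style record, read through accessor methods."""

    def __init__(self, doc_freq, freq_pointer):
        self._doc_freq = doc_freq
        self._freq_pointer = freq_pointer

    def get_doc_freq(self):
        return self._doc_freq

    def get_freq_pointer(self):
        return self._freq_pointer


def sum_with_accessors(infos):
    # Two method calls per record: the layered, object-oriented style.
    total = 0
    for info in infos:
        total += info.get_doc_freq() + info.get_freq_pointer()
    return total


def sum_with_direct_access(records):
    # Two hash lookups per record and no method calls: the inlined style.
    total = 0
    for rec in records:
        total += rec["doc_freq"] + rec["freq_pointer"]
    return total


N = 100_000
infos = [TermInfo(i, i * 2) for i in range(N)]
records = [{"doc_freq": i, "freq_pointer": i * 2} for i in range(N)]

if __name__ == "__main__":
    t_acc = timeit.timeit(lambda: sum_with_accessors(infos), number=10)
    t_dir = timeit.timeit(lambda: sum_with_direct_access(records), number=10)
    print(f"accessors:     {t_acc:.3f}s")
    print(f"direct access: {t_dir:.3f}s")
```

The gap on any single call is tiny. The point is that an indexer makes calls like these for every term and every posting, so the overhead compounds.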
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/