Greets,
Currently, Plucene::Store only contains two modules: InputStream and
OutputStream; the store package in Java Lucene ~1.9 has many:

    BufferedIndexInput.java
    BufferedIndexOutput.java
    Directory.java
    FSDirectory.java      # contains FSIndexInput, FSIndexOutput
    IndexInput.java
    IndexOutput.java
    InputStream.java      # deprecated
    Lock.java
    MMapDirectory.java    # contains MMapIndexInput, MultiMMapIndexInput
    OutputStream.java     # deprecated
    RAMDirectory.java
    RAMFile.java
    RAMInputStream.java
    RAMOutputStream.java

In Java Lucene, InputStream and OutputStream have been deprecated in
favor of IndexInput and IndexOutput. This was done because
InputStream and OutputStream require buffering, making it impossible
to create unbuffered implementations. MMapDirectory, a memory-mapped
implementation of the abstract Directory class, needs unbuffered IO.
There hasn't been a lot of discussion on the Lucene list about
MMapDirectory, but I gather that it is there primarily to maximize
performance in extremely demanding deployments where many queries
must be fielded concurrently. There is actually an :mmap PerlIO
layer available <http://perldoc.perl.org/PerlIO.html>; at this time I
don't plan to port MMapDirectory, but it may not be too hard to add
it at some point in the future. The other two Java Lucene Directory
classes both use buffering.
In Java Lucene's buffered IO classes, data is read into a 1024-byte
buffer, from which readByte and other methods grab data. The
seemingly straightforward way to port this is to keep a scalar buffer
inside a hash-based object:

    my $obj = bless {
        fh              => $fh,
        buffer          => '',
        buffer_position => 0,
        buffer_start    => 0,
        # etc...
    }, __PACKAGE__;

However, because of the way PerlIO works, manual buffer maintenance
is probably unnecessary and even counter-productive. PerlIO *is*
buffered by default. When you request 8 bytes from a filehandle,
Perl grabs 4096 (on my machine), and stores that in a buffer that you
never see or care about. When you request more bytes from the
filehandle, Perl serves it from the buffer if it can, and if it needs
to grab more from the system, the buffer-filling happens automatically.
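
One way to watch this happen is to compare the logical position Perl reports with the position of the underlying file descriptor. The following is only a sketch: the temp file and variable names are made up for illustration, and calling sysseek on a buffered handle is done here purely to peek at the OS-level offset, not something to do in production code.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_CUR);
use File::Temp qw(tempfile);

# Create a scratch file comfortably larger than any likely PerlIO buffer.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} 'x' x 20_000;
close $tmp_fh or die "Can't close $path: $!";

open my $fh, '<', $path or die "Can't open $path: $!";
binmode $fh;

# Ask for just 8 bytes through the buffered interface.
my $got = read($fh, my $bytes, 8);
die "short read" unless defined $got && $got == 8;

# tell() reports where *we* are; sysseek() with a zero offset reports
# where the file descriptor is, i.e. how far PerlIO has already read.
my $logical  = tell($fh);
my $physical = sysseek($fh, 0, SEEK_CUR) + 0;

print "requested 8 bytes: logical position $logical, ",
      "descriptor position $physical\n";
close $fh;
```

The descriptor position should come out well past 8, by however large the PerlIO buffer is on the machine in question.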
The difference between the buffered and unbuffered Stream classes in
Java Lucene is pretty substantial. In Perl, that plumbing is already
in place, and the classes would differ only in a setting on the
filehandle. If we duplicate Java Lucene's buffering code, we'll
actually be putting in a third buffer, on top of kernel buffering and
PerlIO buffering. I doubt that will benefit us. Therefore...
The two abstract base classes for Plucene IO will be:
Plucene::Store::InStream
Plucene::Store::OutStream
They will not contain any obvious buffer-management apparatus. There
will be no "buffer" member variable or refill() method. But because
PerlIO is buffered by default, these will be buffered by default.
In Java Lucene, FSIndexInput and RAMInputStream extend
BufferedIndexInput, which in turn extends IndexInput; in Plucene,
FSInStream and RAMInStream will inherit directly from InStream. The
OutStream subclasses will follow the same pattern.
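
To make the shape concrete, here is a rough sketch of what such a pair of classes might look like. Everything in it is an assumption for illustration: the Sketch:: package names, the method names, and the choice of an in-memory filehandle for the RAM class are mine, not settled Plucene API. The point is only that the base class holds a filehandle and nothing resembling a buffer.

```perl
use strict;
use warnings;

{
    package Sketch::InStream;    # hypothetical stand-in for the base class

    sub new {
        my ($class, $fh) = @_;
        binmode $fh;             # index data is binary
        return bless { fh => $fh }, $class;
    }

    # No buffer member, no refill(): PerlIO buffers behind read().
    sub read_bytes {
        my ($self, $len) = @_;
        my $got = read($self->{fh}, my $buf, $len);
        die "read failed: $!" unless defined $got;
        die "unexpected end of stream" unless $got == $len;
        return $buf;
    }

    sub read_byte { my $self = shift; return unpack('C', $self->read_bytes(1)) }

    # Four bytes, high-order byte first (treated as unsigned here).
    sub read_int { my $self = shift; return unpack('N', $self->read_bytes(4)) }

    # Variable-length int: 7 data bits per byte, low-order group first,
    # high bit set on every byte except the last.
    sub read_vint {
        my $self  = shift;
        my $byte  = $self->read_byte;
        my $value = $byte & 0x7f;
        for (my $shift = 7; $byte & 0x80; $shift += 7) {
            $byte = $self->read_byte;
            $value |= ($byte & 0x7f) << $shift;
        }
        return $value;
    }
}

{
    package Sketch::RAMInStream;    # hypothetical subclass
    our @ISA = ('Sketch::InStream');

    sub new {
        my ($class, $content) = @_;
        # An in-memory filehandle opened on a scalar.
        open my $fh, '<', \$content or die "Can't open in-memory handle: $!";
        my $self = $class->SUPER::new($fh);
        $self->{content_ref} = \$content;    # keep the data alive
        return $self;
    }
}

my $stream = Sketch::RAMInStream->new(pack('C N', 7, 300) . "\xAC\x02");
my @values = ($stream->read_byte, $stream->read_int, $stream->read_vint);
print "@values\n";    # prints "7 300 300"
```

A file-backed subclass would presumably differ only in how it obtains the filehandle.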
If at some point in the future it makes sense to create an unbuffered
MMapInStream class, that can also inherit from InStream; PerlIO's
buffering will be turned off automatically simply by enabling
the :mmap layer.
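
As a sketch of how little would have to change, the helper below picks the PerlIO layers at open time and leaves the reading code untouched. The helper name, the %MODE_FOR table, and the scratch file are hypothetical, and I am assuming a perl built with the :mmap layer available.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical mapping from a stream "kind" to an open() mode string.
my %MODE_FOR = (
    buffered   => '<',         # default layers, PerlIO buffering on
    unbuffered => '<:unix',    # bare system-call layer, no PerlIO buffer
    mmap       => '<:mmap',    # memory-mapped reads
);

sub open_instream_fh {
    my ($path, $kind) = @_;
    my $mode = $MODE_FOR{$kind} or die "unknown stream kind '$kind'";
    open my $fh, $mode, $path or die "Can't open $path: $!";
    binmode $fh;
    return $fh;
}

# Scratch file to read back through each kind of handle.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "Plucene\n" x 1000;
close $tmp_fh or die "Can't close $path: $!";

my %first_bytes;
for my $kind (sort keys %MODE_FOR) {
    my $fh  = open_instream_fh($path, $kind);
    my $got = read($fh, my $buf, 7);    # identical reading code each time
    die "short read on $kind handle" unless defined $got && $got == 7;
    $first_bytes{$kind} = $buf;
    print "$kind: layers = ", join(' ', PerlIO::get_layers($fh)), "\n";
    close $fh;
}
```

The layer lists it prints will vary by platform and perl build; what matters is that the read() call is the same in all three cases.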
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/