Porting org.apache.lucene.store: msg#00016

Subject: Porting org.apache.lucene.store
Greets,

Currently, Plucene::Store only contains two modules: InputStream and OutputStream; the store package in Java Lucene ~1.9 has many:

    BufferedIndexInput.java
    BufferedIndexOutput.java
    Directory.java
    FSDirectory.java    # contains FSIndexInput, FSIndexOutput
    IndexInput.java
    IndexOutput.java
    InputStream.java    # deprecated
    Lock.java
    MMapDirectory.java  # contains MMapIndexInput, MultiMMapIndexInput
    OutputStream.java   # deprecated
    RAMDirectory.java
    RAMFile.java
    RAMInputStream.java
    RAMOutputStream.java

In Java Lucene, InputStream and OutputStream have been deprecated in favor of IndexInput and IndexOutput. This was done because InputStream and OutputStream require buffering, making it impossible to create unbuffered implementations. MMapDirectory, a memory-mapped implementation of the abstract Directory class, needs unbuffered IO.

There hasn't been a lot of discussion on the Lucene list about MMapDirectory, but I gather that it is there primarily to maximize performance in extremely demanding deployments where many queries must be fielded concurrently. There is actually an :mmap PerlIO layer available <http://perldoc.perl.org/PerlIO.html>; at this time I don't plan to port MMapDirectory, but it may not be too hard to add it at some point in the future. The other two Java Lucene Directory classes both use buffering.

In Java Lucene's buffered IO classes, data is read into a 1024-byte buffer, from which readByte and other methods grab data. The seemingly straightforward way to port this is to keep a scalar buffer inside a hash-based object:

    my $obj = bless {
        fh              => $fh,
        buffer          => '',
        buffer_position => 0,
        buffer_start    => 0,
        # etc...
    }, __PACKAGE__;

However, because of the way PerlIO works, manual buffer maintenance is probably unnecessary and even counter-productive. PerlIO *is* buffered by default. When you request 8 bytes from a filehandle, Perl grabs 4096 bytes (on my machine) and stores them in a buffer that you never see or care about. When you request more bytes from the filehandle, Perl serves them from the buffer if it can, and if it needs to grab more from the system, the buffer-filling happens automatically.
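As a minimal sketch of what that looks like from the caller's side (the scratch file and variable names are made up for illustration), two small read() calls are all we write; any chunking happens inside PerlIO:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a scratch file to read back (illustrative data only).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} join '', map { chr($_ % 256) } 0 .. 9999;
close $out or die "close failed: $!";

open my $in, '<', $path or die "Can't open $path: $!";
binmode $in;

# Ask for 8 bytes; PerlIO may fetch a larger chunk from the OS behind the scenes.
my $got = read($in, my $first, 8);

# A second small request is normally satisfied from PerlIO's internal buffer.
$got += read($in, my $second, 8);

close $in or die "close failed: $!";
print "read $got bytes without managing a buffer ourselves\n";
```

The size of the chunk PerlIO fetches is not visible here and varies by build, which is rather the point.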

The difference between the buffered and unbuffered Stream classes in Java Lucene is pretty substantial. In Perl, that plumbing is already in place, and the classes would differ only in a setting on the filehandle. If we duplicate Java Lucene's buffering code, we'll actually be putting in a 3rd buffer, behind kernel buffering and PerlIO buffering. I doubt that will benefit us. Therefore...
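To illustrate the "setting on the filehandle", here is a rough sketch that opens the same file twice and inspects the layer stack with PerlIO::get_layers. It assumes a Unix-like perl with the default PerlIO configuration; the scratch file is hypothetical, and the exact layer names can differ by platform and by the PERLIO environment variable.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "plucene";
close $tmp or die "close failed: $!";

# Default open: the stack normally includes the buffering "perlio" layer.
open my $buffered, '<', $path or die "Can't open $path: $!";
my @buffered_layers = PerlIO::get_layers($buffered);

# Asking for only the :unix layer leaves the buffering layer out.
open my $unbuffered, '<:unix', $path or die "Can't open $path: $!";
my @unbuffered_layers = PerlIO::get_layers($unbuffered);

print "buffered:   @buffered_layers\n";
print "unbuffered: @unbuffered_layers\n";

close $buffered;
close $unbuffered;
```

The reading code after the open() call would be identical in both cases.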

The two abstract base classes for Plucene IO will be:

    Plucene::Store::InStream
    Plucene::Store::OutStream

They will not contain any obvious buffer-management apparatus. There will be no "buffer" member variable or refill() method. But because PerlIO is buffered by default, these will be buffered by default.
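A rough sketch of how such a class might look follows. The method names read_byte and read_int, the constructor signature, and the unsigned big-endian integer format are assumptions made for illustration, not a settled API, and the class is shown as a concrete class rather than an abstract base to keep the example short:

```perl
package Plucene::Store::InStream;
use strict;
use warnings;

# Hypothetical constructor: wrap an already-open, readable filehandle.
sub new {
    my ($class, $fh) = @_;
    binmode $fh;    # index data is binary
    return bless { fh => $fh }, $class;
}

# Return the next byte as an integer in the range 0..255.
sub read_byte {
    my $self = shift;
    my $got = read($self->{fh}, my $buf, 1);
    die "read_byte failed or hit EOF" unless $got;
    return ord $buf;
}

# Return the next 4 bytes as a big-endian unsigned integer.
sub read_int {
    my $self = shift;
    my $got = read($self->{fh}, my $buf, 4);
    die "read_int failed or hit EOF" unless defined $got && $got == 4;
    return unpack 'N', $buf;
}

package main;
use File::Temp qw(tempfile);

# Usage: write one byte and one integer, then read them back.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} chr(7), pack('N', 42);
close $tmp or die "close failed: $!";

open my $fh, '<', $path or die "Can't open $path: $!";
my $instream = Plucene::Store::InStream->new($fh);
my $byte = $instream->read_byte;
my $int  = $instream->read_int;
print "$byte $int\n";    # prints "7 42"
```

Note that the object holds only the filehandle; there is no buffer, buffer_position, or buffer_start to maintain.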

In Java Lucene, FSIndexInput and RAMInputStream extend BufferedIndexInput, which in turn extends IndexInput; in Plucene, FSInStream and RAMInStream will inherit only from InStream. The OutStream subclasses will follow the same pattern.

If at some point in the future it makes sense to create an unbuffered MMapInStream class, that can also inherit from InStream; PerlIO's buffering will be turned off automatically simply by enabling the :mmap layer.
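For what it's worth, here is a minimal sketch of opening a file through the :mmap layer. The scratch file and the fallback to a plain open are my own assumptions; the layer only exists on perls built for platforms with mmap(), so the code tries :mmap first and falls back if the open fails:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} pack('N', 1234);
close $tmp or die "close failed: $!";

# Try the :mmap layer first; fall back to a default open if it is unavailable.
my $fh;
{
    no warnings 'layer';    # silence the "Unknown PerlIO layer" warning on fallback
    open($fh, '<:mmap', $path)
        or open($fh, '<', $path)
        or die "Can't open $path: $!";
}

# The reading code does not change with the layer.
my $got = read($fh, my $buf, 4);
die "short read" unless defined $got && $got == 4;
my $value = unpack 'N', $buf;

print "layers: ", join(' ', PerlIO::get_layers($fh)), "\n";
print "value:  $value\n";
close $fh;
```

Only the open() call differs from the buffered case, which is why an MMapInStream could share everything else with InStream.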

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


