Greetings,
I've tried several techniques for loading index data into RAM, but each
has drawbacks ranging from annoying to debilitating. As a test case,
I'm working on a low-powered system (Mac G4 Cube 450 running OS X
10.3.4 with 1.5 GB of RAM), but the principle should extend outwards. The
chunk of data I'm working with is 380 MB or so. It is not a Plucene
index, but a solid binary file containing statistical data for a custom
search engine I've been coding. I may go back to Plucene; I may submit
pieces of this project as Plucene plugins; in any case, I still have to
deal with getting data into RAM in order to squeeze out maximum
performance. Here are the options I've explored so far.
1) Preload the data using mod_perl and a startup.pl that calls a module
where the data is loaded. In theory, if a given variable is never
modified, it can be shared among all child processes. In practice,
variables tend to become "dirty" as the child processes modify them, at
which point the memory holding the variable is copied into the child's
private memory space. (For a good long explanation, see
http://www.serverwatch.com/tutorials/article.php/10825_1132061_1 ) In
my specific case, I have never been able to get the data chunk to stay
shared for even a single session. I'm not sure if that's because of its
raw size, or because of the way I load the data in from a filehandle.
Baffling.
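
A minimal sketch of the preload approach in option 1 might look like the
following. The package name My::IndexData and its load/fetch methods are
hypothetical, made up for illustration; this is not the actual module in
question. The idea is to slurp the file once in the parent process and
afterwards only read from the buffer.

```perl
package My::IndexData;    # hypothetical module name
use strict;
use warnings;

my $DATA;    # file-scoped buffer, filled once in the parent process

# Slurp the whole binary file into $DATA and return its length in bytes.
sub load {
    my ( $class, $path ) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    binmode $fh;
    local $/;    # undefined record separator: read everything at once
    $DATA = <$fh>;
    close $fh;
    return length $DATA;
}

# Return a copy of $len bytes starting at byte $offset.
# substr() used as an rvalue reads from the buffer without assigning to it.
sub fetch {
    my ( $class, $offset, $len ) = @_;
    return substr( $DATA, $offset, $len );
}

1;
```

From startup.pl one would load the module and call
My::IndexData->load('/path/to/data') once, before Apache forks its
children (the path is a placeholder). As far as I understand, read-only
substr() calls should leave the big string buffer alone, but Perl can
still write to a scalar's small header (flags, cached values), so this
is no guarantee that every page stays shared.
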
2) Use IPC::Shareable to load a variable into shared memory. I had to
hack some system settings in /etc/rc (SHMMAX, SHMMIN, SHMALL, SHMSEG,
SHMMNI) to get even parts of this to work. Even when I tweaked the
settings to allow shared memory segments of over 1 GB in size, I could
never get any one segment to hold more than 65506 bytes. Maybe
IPC::Shareable can't deal with the fact that in OS X some of those
settings are in bytes while others are in pages of 4096 bytes each.
Shared variables worked as advertised in test scripts for small pieces
of data, but I never got anywhere near the size I wanted. I thought
about breaking the data up into several thousand scalars, but that would
mean a lot of extra overhead. (For details on how to hack shared memory
settings for various Unix flavors, see
http://afni.nimh.nih.gov/old/afni/parallize.html ).
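
The bytes-versus-pages mix-up is easy to get wrong, so here is the
arithmetic for a 380 MB segment as a small sketch. It assumes SHMMAX is
the setting expressed in bytes and SHMALL the one expressed in 4096-byte
pages; that assignment is an assumption and should be checked against
the system's own documentation. The helper name shm_limits is
hypothetical.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

my $PAGE_SIZE = 4096;    # page size quoted above
my $want_mb   = 380;     # approximate size of the data file

# Return the smallest limits that would admit one segment of $bytes bytes.
# ASSUMPTION: shmmax is counted in bytes, shmall in pages.
sub shm_limits {
    my ( $bytes, $page_size ) = @_;
    return (
        shmmax => $bytes,                          # bytes
        shmall => ceil( $bytes / $page_size ),     # pages, rounded up
    );
}

my %limits = shm_limits( $want_mb * 1024 * 1024, $PAGE_SIZE );
printf "%s = %d\n", $_, $limits{$_} for sort keys %limits;
```

For 380 MB this works out to 398458880 bytes, or 97280 pages. Giving a
byte count to a setting that expects pages (or the reverse) would be off
by a factor of 4096.
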
3) The last option is a straight-up RAM disk, which is an undocumented
but easy-to-use feature in OS X. The RAM disk technique actually
works. I was hoping not to have to go through system calls to access
the data; I have a nagging feeling that OS X's virtual memory / disk
caching intelligence is going to do something less than ideal, like
caching data from the RAM disk in hard-disk swap. But if I can't find
a solution that allows me to preload the data into a Perl variable, I'll
go with what I've got. (Related links:
http://www.macosxhints.com/article.php?story=20020530084607311
http://www.hartill.net/OSX/ramdisk )
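
With the RAM disk, data access boils down to seek-and-read on an open
filehandle. A rough sketch, assuming the caller already knows the byte
offset and length of what it wants (the file layout is not described
here, so read_at and its arguments are hypothetical):

```perl
use strict;
use warnings;

# Read exactly $len bytes at byte offset $offset from an open filehandle.
# Dies on a failed seek, a failed read, or a short read.
sub read_at {
    my ( $fh, $offset, $len ) = @_;
    seek( $fh, $offset, 0 ) or die "seek failed: $!";    # 0 = from start of file
    my $got = read( $fh, my $buf, $len );
    die "read failed: $!" unless defined $got;
    die "short read: wanted $len bytes, got $got" unless $got == $len;
    return $buf;
}
```

The filehandle would be opened once on the file sitting on the RAM disk
and put into binmode, then reused for every lookup. Perl's read() goes
through its own I/O buffering; sysseek() and sysread() are the
unbuffered equivalents if that layer turns out to get in the way.
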
Any other options I should consider?
-- Marvin Humphrey