Subject: Loading large amounts of data into ram
Greetings,

I've tried several techniques for loading index data into RAM, but each seems to have drawbacks ranging from annoying to debilitating. As a test case, I'm working on a low-powered system (Mac G4 Cube 450 running OS X 10.3.4 with 1.5 GB of RAM), but the principle should extend outwards. The chunk of data I'm working with is 380 MB or so. It is not a Plucene index, but a solid binary file containing statistical data for a custom search engine I've been coding. I may go back to Plucene; I may submit pieces of this project as Plucene plugins; in any case, I still have to get the data into RAM in order to squeeze out maximum performance. Here are the options I've explored so far.

1) Preload data using mod_perl and a startup.pl that calls a module where the data is loaded. In theory, if a given variable is not modified, it can be shared among all child processes. In general practice, variables tend to become "dirty" as they are modified by the child processes, in which case the variable is moved into a private memory space. (For a good long explanation, see http://www.serverwatch.com/tutorials/article.php/10825_1132061_1 ) In my specific case, I have never been able to get the data chunk to stay shared for even a single session. I'm not sure if that's because of its raw size, or because of the way I load the data in from a filehandle. Baffling.

2) Use IPC::Shareable to load a variable into shared memory. I had to hack some system settings to even get parts of this to work: SHMMAX, SHMMIN, SHMALL, SHMSEG, and SHMMNI in /etc/rc. Even when I tweaked the settings to allow shared memory segments of over a GB in size, I could never get any one segment to hold more than 65506 bytes. Maybe IPC::Shareable can't deal with the fact that in OS X some of those settings are in bytes while others are in pages of 4096 bytes each. Shared variables worked as advertised in test scripts for small pieces of data, but I never got anywhere near the size I wanted. I thought about busting the data up into several thousand scalars, but that would mean a lot of extra overhead. (For details on how to hack shared memory settings for various Unix flavors, see http://afni.nimh.nih.gov/old/afni/parallize.html ).

3) The last option is a straight-up ramdisk, which is an undocumented but easy-to-use feature in OS X. The ramdisk technique actually works. I was hoping not to have to go through system calls to access data; I have this nagging feeling that OS X's virtual memory / disk caching intelligence is going to do something less than ideal, like caching data from the ramdisk in hard-disk swap. But if I can't find a solution that allows me to preload data into a Perl variable, I'll go with what I've got. (Related links: http://www.macosxhints.com/article.php?story=20020530084607311 http://www.hartill.net/OSX/ramdisk )
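
To make the bytes-versus-pages mismatch from option 2 concrete, here is a rough sketch of the arithmetic for a single segment big enough to hold the 380 MB file. It is written in Python for brevity, not in the Perl used for this project. The assumption that SHMMAX is a byte count and SHMALL is a page count is mine; the post above only says that some settings are in bytes and others in 4096-byte pages. The function name shm_settings_for is made up for illustration.

```python
PAGE_SIZE = 4096  # bytes per page, the figure quoted in option 2


def shm_settings_for(segment_bytes, page_size=PAGE_SIZE):
    """Return (shmmax_bytes, shmall_pages) for one segment of segment_bytes.

    Assumption (not confirmed by the post): SHMMAX is expressed in bytes
    and SHMALL in pages. The byte figure is rounded up to a whole number
    of pages so that the two values describe the same amount of memory.
    """
    # Integer ceiling division: number of pages needed to cover the segment.
    pages = (segment_bytes + page_size - 1) // page_size
    return pages * page_size, pages


if __name__ == "__main__":
    data_bytes = 380 * 1024 * 1024  # the 380 MB data file, taken as binary megabytes
    shmmax, shmall = shm_settings_for(data_bytes)
    print("SHMMAX (bytes):", shmmax)
    print("SHMALL (pages):", shmall)
```

If the two values are set in mismatched units (say, a page count where a byte count is expected), the effective limit ends up far smaller than intended. That is one possible reason for a segment that refuses to grow, though it does not by itself explain the 65506-byte ceiling.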

Any other options I should consider?

-- Marvin Humphrey

