RE: Question about streaming to memorymapped files
Antoine, fair point. I just ran some perf stats using FileOutputStream vs my growing mmap impl.
It seems that in most cases you are correct: the runtimes are basically equivalent. The only time mmap beats FileOutputStream significantly is when there are many Flush calls. I have a parameter to control how many rows to buffer before finishing a record batch and writing it out. Note that my mmap impl currently doubles its size every time it's requested to grow.
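A minimal sketch of a doubling, auto-growing mapping of the kind described above, assuming Linux (mremap() is Linux-specific). GrowingMmapWriter and all of its methods are hypothetical names chosen for illustration; this is not an Arrow class and not the implementation that was benchmarked.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for mremap() and MREMAP_MAYMOVE (Linux-specific)
#endif
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper, not an Arrow class: an append-only file mapping
// that doubles its capacity whenever a write does not fit.
class GrowingMmapWriter {
 public:
  GrowingMmapWriter(const std::string& path, std::size_t initial_capacity)
      : capacity_(initial_capacity) {
    if (capacity_ == 0) throw std::invalid_argument("capacity must be > 0");
    fd_ = ::open(path.c_str(), O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd_ < 0) throw std::runtime_error("open failed");
    // The file must cover the whole mapping: touching mapped pages that
    // lie entirely beyond end-of-file raises SIGBUS.
    if (::ftruncate(fd_, static_cast<off_t>(capacity_)) != 0) {
      ::close(fd_);
      throw std::runtime_error("ftruncate failed");
    }
    void* p = ::mmap(nullptr, capacity_, PROT_READ | PROT_WRITE, MAP_SHARED, fd_, 0);
    if (p == MAP_FAILED) {
      ::close(fd_);
      throw std::runtime_error("mmap failed");
    }
    data_ = static_cast<char*>(p);
  }
  GrowingMmapWriter(const GrowingMmapWriter&) = delete;
  GrowingMmapWriter& operator=(const GrowingMmapWriter&) = delete;
  ~GrowingMmapWriter() { Close(); }  // errors are ignored here; call Close() to check

  // Appends n bytes, growing the file and the mapping first if needed.
  void Write(const void* src, std::size_t n) {
    assert(fd_ >= 0 && "Write() after Close()");
    if (n == 0) return;
    if (n > capacity_ - size_) Grow(size_ + n);
    std::memcpy(data_ + size_, src, n);
    size_ += n;
  }

  // Unmaps, trims the file to the bytes actually written, and closes it.
  bool Close() {
    if (fd_ < 0) return true;
    bool ok = ::munmap(data_, capacity_) == 0;
    ok = (::ftruncate(fd_, static_cast<off_t>(size_)) == 0) && ok;
    ok = (::close(fd_) == 0) && ok;
    fd_ = -1;
    data_ = nullptr;
    return ok;
  }

  std::size_t size() const { return size_; }
  std::size_t capacity() const { return capacity_; }
  int grow_count() const { return grow_count_; }

 private:
  // Doubles the capacity until `needed` bytes fit, then extends the file
  // and remaps. Each call costs one ftruncate() and one mremap().
  void Grow(std::size_t needed) {
    std::size_t new_capacity = capacity_;
    while (new_capacity < needed) new_capacity *= 2;
    if (::ftruncate(fd_, static_cast<off_t>(new_capacity)) != 0)
      throw std::runtime_error("ftruncate failed");
    // MREMAP_MAYMOVE lets the kernel relocate the mapping, so any raw
    // pointers previously taken into data_ become invalid after this call.
    void* p = ::mremap(data_, capacity_, new_capacity, MREMAP_MAYMOVE);
    if (p == MAP_FAILED) throw std::runtime_error("mremap failed");
    data_ = static_cast<char*>(p);
    capacity_ = new_capacity;
    ++grow_count_;
  }

  int fd_ = -1;
  char* data_ = nullptr;
  std::size_t size_ = 0;
  std::size_t capacity_ = 0;
  int grow_count_ = 0;
};
```

With doubling, the number of ftruncate()/mremap() calls grows only logarithmically with the final file size, so the growth cost is amortized over many writes. The sketch never unmaps data that has already been written, so the mapping stays as large as the file.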
Testing by writing 5 double columns of 10 million rows, I get the following:
From: Antoine Pitrou [mailto:antoine@xxxxxxxxxx]
Sent: Friday, May 11, 2018 4:54 AM
Subject: Re: Question about streaming to memorymapped files
If you write your own auto-growing memory mapped file implementation,
I'd be curious about performance measurements vs. FileOutputStream (and
mremap() and truncate() calls are not free. Also, at some point you'll
want to unmap data already written to prevent the map from growing
Le 09/05/2018 à 17:55, Ambalu, Robert a écrit :
> I don’t use the output stream objects directly though, right? Just to take a step back a bit, what I'm trying to do is stream rows into a table in real time ( with the ability to control how many rows to batch up before writing out a record batch ).
> My understanding is that to properly stream table data I need to:
> a) create an outputstream instance
> b) create a RecordBatchStreamWriter, binding my stream object to it
> c) create a RecordBatchBuilder. As rows are added, add them to the record batch builder. When we're ready to flush, call Flush on the batch builder to create a record batch and pass the batch to the RecordBatchStreamWriter.
> I was hoping to use MemoryMappedFile for (a), but since it doesn’t support dynamically growing the mmap file I'll have to write my own impl.
> -----Original Message-----
> From: Antoine Pitrou [mailto:antoine@xxxxxxxxxx]
> Sent: Wednesday, May 09, 2018 11:42 AM
> To: dev@xxxxxxxxxxxxxxxx
> Subject: Re: Question about streaming to memorymapped files
> As for buffering data before making a call to write(): in Arrow 0.10.0
> you'll be able to use BufferedOutputStream for this:
> Le 09/05/2018 à 17:39, Ambalu, Robert a écrit :
>> I don’t have any offhand, no, but I would imagine that direct file writes will at some point need to make a system call, which is expensive ( fwrite might buffer before eventually making the sys call, looks like FileOutputStream uses the raw system write for every write call).
>> The current MMap io interface isn’t usable as a streaming output unfortunately, though I suppose I could just implement my own
>> -----Original Message-----
>> From: Antoine Pitrou [mailto:solipsis@xxxxxxxxxx]
>> Sent: Wednesday, May 09, 2018 11:11 AM
>> To: dev@xxxxxxxxxxxxxxxx
>> Subject: Re: Question about streaming to memorymapped files
>> Do you know of any benchmark numbers / performance studies about this?
>> While it's true that a memory-mapped file avoids explicit system calls,
>> I've heard file I/O is quite well optimized, at least on Linux,
>> On Wed, 9 May 2018 14:47:53 +0000
>> "Ambalu, Robert" <Robert.Ambalu@xxxxxxxxxxx> wrote:
>>> Antoine, thanks for the quick reply.
>>> You can actually grow memorymapped files with a mremap call ( and I think a seek/write on the file ), I do this in my applications and it works fine.
>>> I want the efficiency of writing via memory maps, so would prefer to avoid FileOutputStream
>>> -----Original Message-----
>>> From: Antoine Pitrou [mailto:antoine@xxxxxxxxxx]
>>> Sent: Wednesday, May 09, 2018 10:37 AM
>>> To: dev@xxxxxxxxxxxxxxxx
>>> Subject: Re: Question about streaming to memorymapped files
>>> If you don't know the output size upfront then you should probably use a
>>> FileOutputStream instead. By definition, memory mapped files must have
>>> a fixed size (since they are mapped to a fixed area in virtual memory).
>>> Le 09/05/2018 à 16:31, Ambalu, Robert a écrit :
>>>> Hey, I'm looking into streaming table updates into a memory mapped file ( C++ )
>>>> I think I have everything I need ( MemoryMappedFile output streamer, RecordBatchStreamWriter ) but I don't understand how to properly create the memmap file. It looks like it requires you to preset the size of the file when you create it, but since I'll be streaming I don't actually know how big a file I'm going to need...
>>>> Am I missing some other API point here? Any reason why size is required up front and the memmap doesn't auto-grow as needed?
>>>> Thanks in advance
>>>> - Rob