Re: Table of tensors with Arrow

Hi Yevgeni,

The approach you describe is not unreasonable if the objective is to
embed ndarrays in a Parquet file.

On Tue, Oct 23, 2018 at 1:45 AM Yevgeni Litvin <selitvin@xxxxxxxxx> wrote:
> In Petastorm we operate with tables of tensors. We are trying to map this
> data structure into Arrow's primitives. One way is to use a pa.array of
> BinaryValue type, using FixedSizeBufferWriter to serialize a pa.Tensor
> into it and deserializing it on read. This feels somewhat awkward and I
> guess does not achieve the zero-copy behavior(?)
> This is what we do to deserialize the tensor from a single binary value:
>         buffer = value.as_py()
>         reader = pa.BufferReader(memoryview(buffer))
>         tensor = pa.read_tensor(reader)
>         n = tensor.to_numpy()

The call `value.as_py()` copies the value into a new Python bytes
object -- this should be avoidable. I opened
https://issues.apache.org/jira/browse/ARROW-3592 about adding an
option to get a zero-copy buffer from such a value.

> And this is how a numpy array is serialized into a BinaryValue written to a
> parquet store:
>         tensor = pa.Tensor.from_numpy(array)
>         buffer = pa.allocate_buffer(pa.get_tensor_size(tensor))
>         stream = pa.FixedSizeBufferWriter(buffer)
>         pa.write_tensor(tensor, stream)
>         bytes = bytearray(buffer.to_pybytes())

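The `to_pybytes()` call should not be necessary here: a
`pyarrow.Buffer` can be passed to `bytearray()` directly, which gives
the same result without the intermediate bytes object.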

In [2]: buf = pa.allocate_buffer(10)

In [3]: buf
Out[3]: <pyarrow.lib.Buffer at 0x7f4209af64c8>

In [4]: bytearray(buf)
Out[4]: bytearray(b'x\x88G6B\x7f\x00\x00x\x88')

In [5]: bytearray(buf.to_pybytes())
Out[5]: bytearray(b'x\x88G6B\x7f\x00\x00x\x88')

It would be useful to provide some conveniences for building a
BinaryArray containing serialized arrow::Tensor values (or other
serializable objects).

> Is there a better, more Arrow native approach, to model our data?
> Thanks!
> - Yevgeni