Table of tensors with Arrow


In Petastorm we operate on tables of tensors, and we are trying to map this
data structure onto Arrow's primitives. One way is to use a pa.array of
BinaryValue elements: on write we serialize a pa.Tensor into each element
using FixedSizeBufferWriter, and on read we deserialize it. This feels
somewhat awkward, and I guess it does not achieve zero-copy behavior(?)

This is what we do to deserialize the tensor from a single binary value:

        # as_py() copies the binary value into a Python bytes object
        buffer = value.as_py()
        reader = pa.BufferReader(memoryview(buffer))
        tensor = pa.read_tensor(reader)
        n = tensor.to_numpy()


And this is how a numpy array is serialized into a BinaryValue that is
written to a Parquet store:

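        tensor = pa.Tensor.from_numpy(array)
        # allocate exactly the number of bytes the serialized tensor needs
        buffer = pa.allocate_buffer(pa.get_tensor_size(tensor))
        stream = pa.FixedSizeBufferWriter(buffer)
        pa.write_tensor(tensor, stream)
        # named `serialized` to avoid shadowing the builtin `bytes`
        serialized = bytearray(buffer.to_pybytes())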

Is there a better, more Arrow-native approach to modeling our data?

Thanks!

- Yevgeni