
Re: How to model massive nested data

hi Tyler,

I am not sure the Arrow Java libraries have yet been used for
interacting with larger-than-memory datasets, but this would be a good
opportunity to try to get that working.

In the C++ libraries, any Arrow data structures can easily reference
memory-mapped data on disk; none of the data needs to be in-memory.

There has been some discussion of adding binary, string, and list
types with 64-bit offsets for extremely large values:
https://issues.apache.org/jira/browse/ARROW-750. Adding this to the
columnar format seems inevitable, so the fact that it isn't there now
does not mean it is out of scope.


On Thu, May 10, 2018 at 4:31 PM, Martin Durant
<martin.durant@xxxxxxxxxxx> wrote:
> This is not directly relevant here, but has anyone looked into oamap (
> https://github.com/diana-hep/oamap ), which is capable of using numba to
> compile python functions which traverse nested data structures down to the
> basic leaf nodes, without creating intermediate python objects. Then the person
> doing the analysis may not need to go to C++ at all. oamap has POC loaders
> for arrow and parquet, but its original focus was ROOT, from the high-energy
> physics world.
> —
> Martin Durant
> martin.durant@xxxxxxxxxxx
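The offsets-and-values traversal Martin describes can be sketched without oamap itself; this numpy-only version (the function name and data are mine) walks nested records through an offset buffer, touching only flat leaf arrays — oamap's trick is compiling loops like this to machine code with numba:

```python
import numpy as np

# Nested data [[1.0, 2.0], [], [3.0, 4.0, 5.0]] stored columnar-style:
offsets = np.array([0, 2, 2, 5], dtype=np.int64)  # record boundaries
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # flat leaf values

def sums_per_record(offsets, values):
    # One slice of the leaf array per record; no intermediate Python
    # objects for the nested lists are ever materialized.
    out = np.empty(len(offsets) - 1)
    for i in range(len(offsets) - 1):
        out[i] = values[offsets[i]:offsets[i + 1]].sum()
    return out
```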