
[DISCUSS] Developing a standard memory layout for in-memory records / "row-oriented" data

hi folks,

Some time ago I opened ARROW-1790 based on some discussions I'd had
with users on the mailing list or in person about how to deal with
data similar to a C array of struct types. Indeed, while we have
Structs in the Arrow columnar format, our structs are stored in "fully
shredded" columnar form, with each field in its own column rather than
as contiguous records.
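To make the distinction concrete, here is a minimal sketch using only the
Python standard library. The record type (an int32 `id` plus a float64
`score`) and the unpadded little-endian layout are assumptions chosen for
illustration; they are not taken from ARROW-1790 or from any existing
system.

```python
import struct
from array import array

# Hypothetical record type: struct { int32 id; float64 score; }
# "<" means little-endian with no padding, so each record is 12 bytes.
RECORD = struct.Struct("<id")

rows = [(1, 0.5), (2, 1.5), (3, 2.5)]

# Row-oriented: records laid end to end, like a C array of structs.
row_buffer = b"".join(RECORD.pack(*r) for r in rows)

# Fully shredded: one contiguous buffer per field.
id_column = array("i", (r[0] for r in rows))
score_column = array("d", (r[1] for r in rows))

# Random access to record 1 in the row layout is one offset computation.
second = RECORD.unpack_from(row_buffer, 1 * RECORD.size)
```

In the row layout all fields of one record are adjacent; in the shredded
layout all values of one field are adjacent.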

Many systems such as Apache Impala (TupleRow, used in row batches),
Apache Kudu (used in client RPCs), Apache Spark (off-heap "unsafe row"
aka Tungsten), NumPy (structured dtypes), and others have in-memory
data structures supporting record-oriented data. As far as I know,
there is no open standard for this type of data.

Developing this within Apache Arrow would serve a couple of
purposes:

* To have an open standard for in-memory records under ASF community
governance. Achieving consensus in this setting would have a lot of
long-term value and accelerate adoption

* To provide a means to embed sequences of records in the Arrow columnar format

In light of efforts to create LLVM codegen infrastructure for Arrow
(Gandiva), it would stand to reason that we could develop LLVM IR for
manipulating columns of records in a coherent algebraic expression
framework. For example: efficient LLVM code generation for "shredding"
or "pivoting" records into fully-shredded columnar format.
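As a rough illustration of what "shredding" means here, the following
sketch pivots a buffer of packed records into one column per field and
back again. It is plain interpreted Python, not LLVM code generation, and
the record layout and the names `shred` and `unshred` are hypothetical; a
generated kernel would presumably do the same loop with the field offsets
baked in.

```python
import struct
from array import array

# Hypothetical record: int32 id, float64 score, little-endian, no padding.
RECORD = struct.Struct("<id")

def shred(row_buffer):
    """Pivot packed records into one column per field."""
    ids, scores = array("i"), array("d")
    for rec_id, score in RECORD.iter_unpack(row_buffer):
        ids.append(rec_id)
        scores.append(score)
    return ids, scores

def unshred(ids, scores):
    """Inverse pivot: interleave the columns back into packed records."""
    return b"".join(RECORD.pack(i, s) for i, s in zip(ids, scores))
```

Round-tripping a buffer through `shred` and then `unshred` should return
the original bytes, which is a handy property to test against.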

If this sounds interesting to the community, I could help to kickstart
a design process which would likely take a significant amount of time.
The requirements could be complex (e.g. we might want to support
variable-size record fields while also providing random access
guarantees). We could use the ASF's Confluence wiki to house the
documents and facilitate discussion.
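To show why variable-size fields and random access pull in different
directions, here is one possible layout, sketched under stated
assumptions: each record gets a fixed-width slot holding an int32 id plus
an (offset, length) pair that points into a separate byte heap. The
layout and the names `SLOT`, `encode` and `get` are hypothetical and are
not a proposed design.

```python
import struct

# Hypothetical fixed-width slot: int32 id, uint32 offset, uint32 length.
SLOT = struct.Struct("<iII")

def encode(records):
    """Pack (id, name) pairs into a fixed-width slot region plus a byte heap."""
    slots, heap = bytearray(), bytearray()
    for rec_id, name in records:
        data = name.encode("utf-8")
        slots += SLOT.pack(rec_id, len(heap), len(data))
        heap += data
    return bytes(slots), bytes(heap)

def get(slots, heap, index):
    """Read record `index` without scanning the records before it."""
    rec_id, offset, length = SLOT.unpack_from(slots, index * SLOT.size)
    return rec_id, heap[offset:offset + length].decode("utf-8")
```

Because every slot has the same size, record `index` is found by
multiplication, and only the variable-length payload needs one extra
indirection.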