osdir.com

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Contribute "RowSet" mechanism from Apache Drill?


New Jira sounds good.

Many times algorithms interact directly with vectors but there also many
times this is not the case. Would be great to see more detail about an
example use. Maybe propose as a new module so people can use if they want
but don't have to consume unless they need to?

On Mon, Aug 27, 2018 at 6:28 PM Paul Rogers <par0328@xxxxxxxxx.invalid>
wrote:

> Hi Jacques,
>
> Thanks much for the note. I wonder, when reading data into, or out of,
> Arrow, are not the interfaces often row-wise? For example, it is somewhat
> difficult to read a CSV file column-wise. Similarly, when serving a BI tool
> (for tables or charts), data must be presented row-wise. (JDBC, for
> example, is a row-wise interface.) The abstractions help with these cases.
>
> Perhaps much of the emphasis in Arrow is in cross-tool compatibility in
> which data is passed column-wise as a set of vectors? The abstractions
> wouldn't be needed in this data transfer case.
>
> The batch size component is an essential part of row-wise loading. When
> reading data into vectors, even from Parquet, we found it necessary to 1)
> control the overall amount of memory used by the batch, and 2) read the
> same number of rows for every column. The RowSet abstractions encapsulate
> this coordinated cross-column work.
>
> The memory limits in the "RowSet" abstraction are not estimates. (There
> was a separate Drill project for that, which is why it might be confusing.)
> Instead, the memory limits are based on knowing the current write offset
> into each vector.  In Drill, when a vector becomes full, we automatically
> resize the vector by doubling the memory for that vector. The RowSet
> abstraction tracks when doubling the vector would exceed the "budget" set
> for that vector or batch. When the limit occurs, the abstraction marks the
> batch complete. (The "overflow" row is saved for later to avoid exceeding
> the limit, and to keep the details of overflow hidden from the client.) The
> same logic can be applied, I would assume, to whatever memory allocation
> technique is used in Arrow, if Arrow has evolved beyond Drill's technique.
>
> A size estimate (when available) helps by allowing the client code to
> pre-allocate vectors to their final size. Doing so avoids growing vectors
> during data loads. In this case, the abstractions simply pack data into
> those pre-allocated vectors until one of them becomes full.
>
> The idea of separating memory from reading/writing is sound. In fact,
> that's how the code is structured. The memory-unaware version is heavily
> used in unit tests where we know how much memory is used. The memory-aware
> version is used in production to handle whatever strange data sets present
> themselves.
>
> Of course, none of this was clear from my terse description. I'll go ahead
> and create a JIRA ticket to provide additional context and to gather
> detailed comments so we can figure out the best way to proceed.
>
> Thanks,
>
> - Paul
>
>
>
>     On Monday, August 27, 2018, 5:52:19 PM PDT, Jacques Nadeau <
> jacques@xxxxxxxxxx> wrote:
>
>  This seems like it could be a useful addition. In general, our experience
> with writing Arrow structures is that the most optimal path is using
> columnar interaction rather than rowwise. That being said, most people
> start out by interacting with Arrow rowwise first and having an interface
> like this could be helpful in allowing people to start writing Arrow
> datasets with less effort and mistakes.
>
> In terms of record batch sizing/estimations, I think that should probably
> be uncoupled from writing/reading vectors.
>
>
>
> On Mon, Aug 27, 2018 at 7:00 AM Li Jin <ice.xelloss@xxxxxxxxx> wrote:
>
> > Hi Paul,
> >
> > Thank you for the email. I think this is interesting.
> >
> > Arrow (Java API) currently doesn't have the capability of automatically
> > limiting the memory size of record batches. In Spark we have similar
> needs
> > to limit the size of record batches and have talked about implementing
> some
> > kind of size estimator for record batches but haven't started to work on
> > it.
> >
> > I personally think it makes sense for Arrow to incorporate such
> > capabilities.
> >
> >
> >
> > On Mon, Aug 27, 2018 at 1:33 AM Paul Rogers <par0328@xxxxxxxxx.invalid>
> > wrote:
> >
> > > Hi All,
> > >
> > > Over in the Apache Drill project, we developed some handy vector
> > > reader/writer abstractions. I wonder if they might be of interest to
> > Apache
> > > Arrow. Key contributions of the "RowSet" abstractions:
> > >
> > > * Control row batch size: the aggregate memory taken by a set of
> vectors
> > > (and all their sub-vectors for structured types.)
> > > * Control the maximum per-vector size.
> > > * Simple, highly optimized read/write interface that handles vector
> > offset
> > > accounting, even for deeply nested types.
> > > * Minimize vector internal fragmentation (wasted space.)
> > >
> > > More information is available in [1]. Arrow improved and simplified
> > > Drill's original vector and metadata abstractions. As a result, work
> > would
> > > be required to port the RowSet code from Drill's version of these
> classes
> > > to the Arrow versions.
> > >
> > > Does Arrow already have a similar solution? If not, would the above be
> > > useful for Arrow?
> > >
> > > Thanks,
> > > - Paul
> > >
> > >
> > > Apache Drill PMC member
> > > Co-author of the upcoming O'Reilly book "Learning Apache Drill"
> > > [1]
> > >
> https://github.com/paul-rogers/drill/wiki/RowSet-Abstractions-for-Arrow
> > >
> > >
> > >
> >
>