Re: Contribute "RowSet" mechanism from Apache Drill?


Hey Paul, it looks like ScalarReader is simply a renamed version of
FieldReader, which the Arrow Vector module already contains.

On Mon, Sep 3, 2018 at 4:07 PM Paul Rogers <par0328@xxxxxxxxx.invalid>
wrote:

> Filed a JIRA ticket: ARROW-3164.
>
> The original e-mail linked to a wiki that explains the Row Set abstraction
> in the Drill context. The ticket points to a new GitHub wiki that discusses
> the abstraction in the Arrow context, including examples. The Wiki also
> explains the motivation: the challenges Drill faced when reading
> row-oriented data into vectors and reading that data out, and how those may
> apply to the Arrow context.
>
> Looks like recent Arrow work has greatly improved the interoperability
> features of the project. Still, at some point, code must write data into
> vectors and read data out. Often the interface is row-oriented. If that
> code is in Java, the Row Set abstractions can help.
>
> A new top-level Java module is a great idea. Looks like there might be
> some dependency issues to resolve to leverage material in the "vector"
> module which we'll resolve as we hit them.
>
> Here is a quick example on the read side. Jacques recently posted a code
> example to retrieve data from a list vector of VARCHAR values:
>
> int recordIndexToRead = ...
> ListVector lv = ...
> ArrowBuf offsetVector = lv.getOffsetBuffer();
> VarCharVector vc = (VarCharVector) lv.getDataVector();
> int listStart = offsetVector.getInt(recordIndexToRead * 4);
> int listEnd = offsetVector.getInt((recordIndexToRead + 1) * 4);
> NullableVarCharHolder nvh = new NullableVarCharHolder();
> for(int i = listStart; i < listEnd; i++){
>   vc.get(i, nvh);
>   // do something with data.
> }
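
To make the offset arithmetic above concrete, here is a minimal sketch in plain Java with no Arrow dependency. It assumes the same layout the snippet relies on: one 32-bit offset per list plus a trailing one, so list i spans entries offsets[i] up to (but not including) offsets[i + 1]. The class ListOffsetsSketch and its method are made up for illustration; in the Arrow code the offsets live in a byte buffer, which is why the index is multiplied by 4.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a list-of-VARCHAR vector: plain arrays, not Arrow classes.
final class ListOffsetsSketch {

    // offsets has (listCount + 1) entries; values holds the flattened child
    // entries, where a null entry stands for a SQL NULL.
    static List<String> readList(int[] offsets, byte[][] values, int recordIndex) {
        int listStart = offsets[recordIndex];
        int listEnd = offsets[recordIndex + 1];
        List<String> out = new ArrayList<>();
        for (int i = listStart; i < listEnd; i++) {
            byte[] v = values[i];
            out.add(v == null ? null : new String(v, StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        // Three lists: ["a", "b"], [], ["c"]
        int[] offsets = {0, 2, 2, 3};
        byte[][] values = {
            "a".getBytes(StandardCharsets.UTF_8),
            "b".getBytes(StandardCharsets.UTF_8),
            "c".getBytes(StandardCharsets.UTF_8)
        };
        for (int i = 0; i < offsets.length - 1; i++) {
            System.out.println(readList(offsets, values, i));
        }
    }
}
```

An empty list is simply two equal adjacent offsets, which is why the loop needs no special case for it.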
>
> Here is how to iterate over a record batch, accessing a single VARCHAR
> column, using the Row Set abstractions. The e-mail mentioned a byte array,
> so let's use that here:
>
> RowSet rowSet = // create row set from record batch
> RowSetReader reader = rowSet.reader();
> ScalarReader vcReader = reader.scalar("colName"); // Get your VARCHAR column
> while (reader.next()) {
>   byte[] data = vcReader.getBytes();
>   // Do something with the data
> }
>
> Data can also be retrieved as a Java String, if that is more convenient in
> this use case:
>
>   String data = vcReader.getString();
>
> In either case, if the value is a SQL NULL, the above methods (because
> they return Java objects) will return a Java null. (For primitive types,
> you can call the ScalarReader.isNull() method.)
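
For readers who want to see the shape of the pattern without the Drill code at hand, here is a rough sketch in plain Java. It is not the RowSet implementation: RowCursorSketch and VarCharColumnReader are hypothetical names, and the column is just an array of byte arrays in which a null entry stands for SQL NULL. The point it illustrates is that the row reader owns the position, and each column reader reads at that shared position.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical row cursor: owns the current row index shared by all column readers.
final class RowCursorSketch {
    private final int rowCount;
    private int row = -1; // positioned before the first row until next() is called

    RowCursorSketch(int rowCount) {
        this.rowCount = rowCount;
    }

    // Advances to the next row; returns false once all rows are consumed.
    boolean next() {
        if (row + 1 >= rowCount) {
            return false;
        }
        row++;
        return true;
    }

    int index() {
        return row;
    }

    // Hypothetical column reader bound to the cursor; a null entry models SQL NULL.
    // Reading before the first successful next() throws ArrayIndexOutOfBoundsException.
    static final class VarCharColumnReader {
        private final RowCursorSketch cursor;
        private final byte[][] values;

        VarCharColumnReader(RowCursorSketch cursor, byte[][] values) {
            this.cursor = cursor;
            this.values = values;
        }

        boolean isNull() {
            return values[cursor.index()] == null;
        }

        byte[] getBytes() {
            return values[cursor.index()];
        }

        String getString() {
            byte[] b = getBytes();
            return b == null ? null : new String(b, StandardCharsets.UTF_8);
        }
    }
}
```

Usage mirrors the loop above: create the cursor and the column reader once, then call next() in a while loop and read from the column reader inside it. No index is passed around by the client code.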
>
> Thanks,
> - Paul
>
>
>
> On Thursday, August 30, 2018, 7:44:51 PM PDT, Jacques Nadeau <jacques@xxxxxxxxxx> wrote:
>
>  New Jira sounds good.
>
> Many times algorithms interact directly with vectors, but there are also
> many times when this is not the case. Would be great to see more detail
> about an example use. Maybe propose it as a new module so people can use it
> if they want but don't have to consume it unless they need to?
>
> On Mon, Aug 27, 2018 at 6:28 PM Paul Rogers <par0328@xxxxxxxxx.invalid>
> wrote:
>
> > Hi Jacques,
> >
> > Thanks much for the note. I wonder, when reading data into, or out of,
> > Arrow, are not the interfaces often row-wise? For example, it is somewhat
> > difficult to read a CSV file column-wise. Similarly, when serving a BI
> > tool (for tables or charts), data must be presented row-wise. (JDBC, for
> > example, is a row-wise interface.) The abstractions help with these
> > cases.
> >
> > Perhaps much of the emphasis in Arrow is in cross-tool compatibility in
> > which data is passed column-wise as a set of vectors? The abstractions
> > wouldn't be needed in this data transfer case.
> >
> > The batch size component is an essential part of row-wise loading. When
> > reading data into vectors, even from Parquet, we found it necessary to 1)
> > control the overall amount of memory used by the batch, and 2) read the
> > same number of rows for every column. The RowSet abstractions encapsulate
> > this coordinated cross-column work.
> >
> > The memory limits in the "RowSet" abstraction are not estimates. (There
> > was a separate Drill project for that, which is why it might be
> > confusing.) Instead, the memory limits are based on knowing the current
> > write offset into each vector. In Drill, when a vector becomes full, we
> > automatically resize the vector by doubling the memory for that vector.
> > The RowSet abstraction tracks when doubling the vector would exceed the
> > "budget" set for that vector or batch. When the limit is reached, the
> > abstraction marks the batch complete. (The "overflow" row is saved for
> > later to avoid exceeding the limit, and to keep the details of overflow
> > hidden from the client.) The same logic can be applied, I would assume,
> > to whatever memory allocation technique is used in Arrow, if Arrow has
> > evolved beyond Drill's technique.
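
The budget check described above might look roughly like the following in plain Java. This is only a sketch under simplifying assumptions: a single int column, a per-vector budget in bytes, and a heap array instead of direct memory. BudgetedIntVectorSketch and its methods are hypothetical names, not the Drill RowSet classes.

```java
import java.util.Arrays;

// Hypothetical single-column writer that doubles its storage until doubling
// would exceed a byte budget, then holds the next value back as "overflow".
final class BudgetedIntVectorSketch {
    private int[] data;
    private int count;
    private final long maxBytes;
    private Integer overflow; // value that did not fit in the current batch

    BudgetedIntVectorSketch(int initialCapacity, long maxBytes) {
        if (initialCapacity < 1) {
            throw new IllegalArgumentException("initialCapacity must be >= 1");
        }
        this.data = new int[initialCapacity];
        this.maxBytes = maxBytes;
    }

    // Returns true if the value went into the current batch, false if the
    // batch is complete and the value was saved for the next one.
    boolean write(int value) {
        if (overflow != null) {
            throw new IllegalStateException("batch complete; call harvest() first");
        }
        if (count == data.length) {
            long doubledBytes = 2L * data.length * Integer.BYTES;
            if (doubledBytes > maxBytes) {
                overflow = value; // keep the overflow value hidden from the client
                return false;
            }
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[count++] = value;
        return true;
    }

    // Returns the completed batch and seeds the next batch with the overflow value.
    int[] harvest() {
        int[] batch = Arrays.copyOf(data, count);
        count = 0;
        if (overflow != null) {
            data[count++] = overflow;
            overflow = null;
        }
        return batch;
    }
}
```

The client only sees write() return false and then calls harvest(); the saved value shows up as the first entry of the next batch. A multi-column version would also have to roll the other columns back to the same row, which this sketch leaves out.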
> >
> > A size estimate (when available) helps by allowing the client code to
> > pre-allocate vectors to their final size. Doing so avoids growing vectors
> > during data loads. In this case, the abstractions simply pack data into
> > those pre-allocated vectors until one of them becomes full.
> >
> > The idea of separating memory from reading/writing is sound. In fact,
> > that's how the code is structured. The memory-unaware version is heavily
> > used in unit tests where we know how much memory is used. The
> > memory-aware version is used in production to handle whatever strange
> > data sets present themselves.
> >
> > Of course, none of this was clear from my terse description. I'll go
> > ahead and create a JIRA ticket to provide additional context and to
> > gather detailed comments so we can figure out the best way to proceed.
> >
> > Thanks,
> >
> > - Paul
> >
> >
> >
> > On Monday, August 27, 2018, 5:52:19 PM PDT, Jacques Nadeau <jacques@xxxxxxxxxx> wrote:
> >
> > This seems like it could be a useful addition. In general, our
> > experience with writing Arrow structures is that the optimal path is
> > columnar interaction rather than row-wise. That being said, most people
> > start out by interacting with Arrow row-wise, and having an interface
> > like this could help people start writing Arrow datasets with less
> > effort and fewer mistakes.
> >
> > In terms of record batch sizing/estimations, I think that should probably
> > be uncoupled from writing/reading vectors.
> >
> >
> >
> > On Mon, Aug 27, 2018 at 7:00 AM Li Jin <ice.xelloss@xxxxxxxxx> wrote:
> >
> > > Hi Paul,
> > >
> > > Thank you for the email. I think this is interesting.
> > >
> > > Arrow (Java API) currently doesn't have the capability of automatically
> > > limiting the memory size of record batches. In Spark we have similar
> > > needs to limit the size of record batches and have talked about
> > > implementing some kind of size estimator for record batches but haven't
> > > started to work on it.
> > >
> > > I personally think it makes sense for Arrow to incorporate such
> > > capabilities.
> > >
> > >
> > >
> > > On Mon, Aug 27, 2018 at 1:33 AM Paul Rogers <par0328@xxxxxxxxx.invalid> wrote:
> > >
> > > > Hi All,
> > > >
> > > > Over in the Apache Drill project, we developed some handy vector
> > > > reader/writer abstractions. I wonder if they might be of interest to
> > > > Apache Arrow. Key contributions of the "RowSet" abstractions:
> > > >
> > > > * Control row batch size: the aggregate memory taken by a set of
> > > >   vectors (and all their sub-vectors for structured types).
> > > > * Control the maximum per-vector size.
> > > > * Simple, highly optimized read/write interface that handles vector
> > > >   offset accounting, even for deeply nested types.
> > > > * Minimize vector internal fragmentation (wasted space).
> > > >
> > > > More information is available in [1]. Arrow improved and simplified
> > > > Drill's original vector and metadata abstractions. As a result, work
> > > > would be required to port the RowSet code from Drill's version of
> > > > these classes to the Arrow versions.
> > > >
> > > > Does Arrow already have a similar solution? If not, would the above
> > > > be useful for Arrow?
> > > >
> > > > Thanks,
> > > > - Paul
> > > >
> > > >
> > > > Apache Drill PMC member
> > > > Co-author of the upcoming O'Reilly book "Learning Apache Drill"
> > > > [1] https://github.com/paul-rogers/drill/wiki/RowSet-Abstractions-for-Arrow
> > > >
> > > >
> > > >
> > >
> >
>