osdir.com

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Refactoring the Rust API


On Sun, May 20, 2018 at 2:32 PM, Andy Grove <andygrove73@xxxxxxxxx> wrote:
> Wes,
>
> Lots of good questions. I'll answer as much as I can but I'm really not
> familiar with the C++ implementation and have built the Rust implementation
> based on the specs. One of my goals for the IPC and integration testing is
> to get familiar with the other implementations so that I can make sure the
> Rust implementation becomes more aligned.
>
> Maybe before I tackle IPC I should get familiar with the C++ and/or Java
> API and start comparing how things are done. Java would be easier for me
> given my background but is that as far along as the C++ API?
>
> To answer your specific questions as best I can:
>
> - Memory pool: There was a contribution of a pluggable pool that isn't
> integrated with the rest of the code base. We should integrate it or remove
> it. I filed ARROW-2620 for this. A pluggable memory pool isn't something
> that I'm personally interested in since there are upcoming features in Rust
> for this anyway as far as I know.
>
> - Arrays: I didn't know 32 bit support was a requirement. Array len() and
> null_count() actually return usize now (that changed as part of the
> refactor) so I assume that should work on 32 bit but I will need to confirm
> that.
>
> - Lists: The spec only has an example for List with a fixed width primitive
> so I'm still a little unclear on how lists and nested lists should really
> work with other types. I'll have to study the C++ impl I guess to
> understand this better. As a general comment, it would be nice to add more
> detail to the specs and I'm happy to contribute to that as I learn more.
>
> - List<T> vs ListArray<T>: These should probably be combined but I need to
> learn more about List. I'm actually working on a PoC for my DataFusion
> project right now that requires lists of user-defined types so this is
> becoming a priority for me anyway.

The relevant section in the spec is:

https://github.com/apache/arrow/blob/master/format/Layout.md#list-type

This states that a List is achieved by combining a sequence of offsets
with another array of any type (primitive or nested). It would be
helpful if you could review this section carefully and identify what
parts are not clear so that we can improve the language. Adding more
examples is fine, but I'd like the spec to be intelligible without
having to scrutinize examples.

So if you have an array of say Struct<a: T1, b: T2>, you can make it
into List<Struct<a: T1, b: T2>> by adding a buffer of offsets and a
validity bitmap for the list slots. I think it would be a good idea to
add a list-of-nested-type to the spec to make the compositional aspect
as clear as possible.

>
> - Should min/max be methods on PrimitiveArray vs top-level functions: The
> issue is that min/max are strongly typed so they work well for
> PrimitiveArray<T>. I suppose they could be added to the Array trait and
> they would return another Array. I'll take a look.
>
> - Null support: This is sketchy at best right now, but NULL support is the
> last feature that I need before releasing the next version of my project so
> this is a priority for me and I'll be fixing this up over the next 2 weeks.
>
> * What is a "BufferArray": This got renamed to PrimitiveArray and is simply
> an array of fixed-width primitives.
>
> - Struct: column vs field: I filed ARROW-2617 to address this.
>
> - Bitmap 1 vs 0: A flag makes sense. I filed ARROW-2618.
>
> - JSON serde should nove to separate module: Makes sense. ARROW-2619.
>
> - List / PrimitiveList / ListBuilder : See earlier comments on my (lack of)
> understanding on List types
>
> I hope that helps.
>
> Thanks,
>
> Andy.
>
>
> On Fri, May 18, 2018 at 3:08 PM, Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
>
>> hi Andy,
>>
>> I gave a read through the Rust implementation. I have never programmed
>> in Rust (hope to change that someday!), so some of the programming
>> constructs are lost on me, but I focused on the Arrow columnar
>> questions. My notes / questions follow
>>
>> cheers
>> Wes
>>
>> ## High level comments
>>
>> * Do you plan to provide support for memory pools in the same style as
>> C++? I
>>   see some work already along these lines, but it will impact the API in a
>>   number of places
>>
>> ## array.rs
>>
>> L35
>>
>> * Should len and null_count always return u64, even on 32-bit arch?
>> * Do you intend to support zero-copy slicing in the way that C++ supports
>> it?
>>   Probably doesn't need to get done now
>>
>> L47 (ListArray)
>>
>> * In the format docs, for List<T>, T can be any type, including other
>> nested
>>   types. What does this mean here?
>> * What is the difference between List<T> and ListArray<T> in Rust, can you
>> add
>>   a comment?
>>
>> L157
>>
>> * Are there benefits to having min and max be methods on PrimitiveArray
>>   vs. top-level functions?
>>
>> L214
>>
>> * This Add operation does not account for nulls
>>
>> L247
>>
>> * What is a "BufferArray"?
>>
>> L295
>>
>> * Maybe call Struct attribute "columns" instead "fields"?
>> * Same with changing column(i) -> field(i)
>>
>> ## bitmap.rs
>>
>> L35
>>
>> * I would expect a bitmap to be initialized with all 0's instead of all
>> 1's, or
>>   perhaps this should be a flag to do one or the other
>>
>> ## builder.rs
>>
>> * Similar question as above re: usize vs u64
>>
>> ## datatypes.rs
>>
>> L68
>>
>> * It feels like JSON serde for data types, fields, schemas belongs in its
>> own
>>   module (this thinking might just be influenced by the design of the
>> Java/C++
>>   libraries)
>>
>> ## list.rs
>>
>> L31
>>
>> * Seems like you might want to call this PrimitiveList, since it only
>> supports
>>   primitive child types. How will lists of arbitrary arrays be handled?
>> (per
>>   notes above, and per mailing list discussions)
>> * The inner cells of a list may be null; this interface does not appear to
>>   support that
>>
>> ## list_builder.rs
>>
>> * Same comment, should this be PrimitiveListBuilder?
>> * How to build lists with a nested type child?
>> * How to append null lists?
>>
>> On Fri, May 4, 2018 at 6:07 AM, Andy Grove <andygrove73@xxxxxxxxx> wrote:
>> > Here's my blog post explaining the refactor:
>> > https://andygrove.io/2018/05/apache-arrow-traits-generics/
>> >
>> > The Reddit thread is going to be here for anyone wanting to see the
>> > feedback:
>> > https://www.reddit.com/r/rust/comments/8gy45t/refactoring_
>> apache_arrow_to_use_traits_and/
>> >
>> > Thanks,
>> >
>> > Andy.
>> >
>> > On Wed, May 2, 2018 at 5:10 PM, Andy Grove <andygrove73@xxxxxxxxx>
>> wrote:
>> >
>> >> There was an interesting blog post posted to Reddit a couple days ago
>> that
>> >> is very relevant to this refactor. The author is building a dataframe
>> >> library in Rust and started out with an enum to represent arrays and
>> then
>> >> moved to using generic traits with the enum.
>> >>
>> >> https://www.reddit.com/r/rust/comments/8g2274/dataframes_
>> >> traits_enums_generics_and_dynamic/
>> >>
>> >> I don't think that approach would work for us and I'm tempted to write a
>> >> up a blog post myself explaining the current refactor and why it is
>> needed.
>> >> I think I'll try and do that this weekend. I'm keen to get a discussion
>> >> going around the refactor to make sure we don't need to do another
>> refactor
>> >> in the future.
>> >>
>> >> Andy.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Sun, Apr 29, 2018 at 9:59 AM, Andy Grove <andygrove73@xxxxxxxxx>
>> wrote:
>> >>
>> >>> So it turns out this refactor isn't as disruptive as I thought and I
>> >>> mostly have it working already.
>> >>>
>> >>> The buffer/builder/list types barely change at all other than the fact
>> >>> that we no longer need all those macros after moving to generics.
>> >>>
>> >>> It really is only array.rs that is pretty much a rewrite.
>> >>>
>> >>> Also, in my earlier email I got my dates wrong. I am aiming to have
>> this
>> >>> PR ready by Monday May 7th. The real test for me is integrating it with
>> >>> DataFusion to make sure I haven't missed anything.
>> >>>
>> >>> Here's the branch where I'm working on this:
>> >>> https://github.com/andygrove/arrow/tree/refactor_rust_api
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Andy.
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Sat, Apr 28, 2018 at 2:10 PM, Andy Grove <andygrove73@xxxxxxxxx>
>> >>> wrote:
>> >>>
>> >>>> I filed a PR to track this (https://issues.apache.org/jir
>> >>>> a/browse/ARROW-2521) but thought it was worth raising on the mailing
>> >>>> list too.
>> >>>>
>> >>>> I am running into limitations now of the way that Array is represented
>> >>>> as an enum and I am unable to implement List<List<T>> with the current
>> >>>> design.
>> >>>>
>> >>>> When Krisztian Szucs and I were working on the initial code we had two
>> >>>> different approaches and we went with this enum approach at the time
>> >>>> because we weren't able to make the other approach (traits +
>> generics) work.
>> >>>>
>> >>>> Now that I'm further along the Rust learning curve, I can make the
>> trait
>> >>>> + generic approach work and I'm currently prototyping in a separate
>> repo,
>> >>>> and it is looking good so far. I have been able to create a struct
>> array
>> >>>> containing different type fields including List<List<T>>.
>> >>>>
>> >>>> I think I'm ready to start the refactor for real in my fork. We only
>> >>>> have ~1k LOC so I don't think it will take too long, but because I'm
>> doing
>> >>>> this in my spare time I am going to estimate that I will have it
>> complete
>> >>>> in just over one week, aiming for having it complete by 4/30.
>> >>>>
>> >>>> I think it's fine to continue merging small PRs in the meanwhile but I
>> >>>> think we should hold off any major changes in the coming week.
>> >>>>
>> >>>> Thanks,
>> >>>>
>> >>>> Andy.
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>
>> >>
>>