Re: [JAVA] Arrow performance measurement


Hi Li - thanks for setting up the jiras. I'll prepare the pull requests
soon.

Hi Wes, I think it is a good idea to write a blog post about performance,
pointing to the improvements in the code base (fixes and microbenchmarks).
I can prepare a first draft as we go along. This will definitely serve the
large Java community well.

Cheers,
--
Animesh


On Thu, 11 Oct 2018, 17:55 Li Jin, <ice.xelloss@xxxxxxxxx> wrote:

> I have created these as the first step. Animesh, feel free to submit PR for
> these. I will look into your micro benchmarks soon.
>
>
>
>    1. ARROW-3497 [Java] Add user documentation for achieving better
>    performance <https://jira.apache.org/jira/browse/ARROW-3497>
>    2. ARROW-3496 [Java] Add microbenchmark code to Java
>    <https://jira.apache.org/jira/browse/ARROW-3496>
>    3. ARROW-3495 [Java] Optimize bit operations performance
>    <https://jira.apache.org/jira/browse/ARROW-3495>
>    4. ARROW-3493 [Java] Document BOUNDS_CHECKING_ENABLED
>    <https://jira.apache.org/jira/browse/ARROW-3493>
>
>
> On Thu, Oct 11, 2018 at 10:00 AM Li Jin <ice.xelloss@xxxxxxxxx> wrote:
>
> > Hi Wes and Animesh,
> >
> > Thanks for the analysis and discussion. I am happy to look into this. I
> > will create some Jiras soon.
> >
> > Li
> >
> > On Thu, Oct 11, 2018 at 5:49 AM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> >
> >> hey Animesh,
> >>
> >> Thank you for doing this analysis. If you'd like to share some of the
> >> analysis more broadly e.g. on the Apache Arrow blog or social media,
> >> let us know.
> >>
> >> Seems like there might be a few follow ups here for the Arrow Java
> >> community:
> >>
> >> * Documentation about achieving better performance
> >> * Writing some microperformance benchmarks
> >> * Making some improvements to the code to facilitate better performance
> >>
> >> Feel free to create some JIRA issues. Are any Java developers
> >> interested in digging a little more into these issues?
> >>
> >> Thanks,
> >> Wes
> >> On Tue, Oct 9, 2018 at 7:18 AM Animesh Trivedi
> >> <animesh.trivedi@xxxxxxxxx> wrote:
> >> >
> >> > Hi Wes and all,
> >> >
> >> > Here is another round of updates:
> >> >
> >> > Quick recap - previously we established that for 1kB binary blobs, Arrow
> >> > can deliver > 160 Gbps performance from in-memory buffers.
> >> >
> >> > In this round I looked at the performance of materializing "integers". In
> >> > my benchmarks, I found that with careful optimizations/code-rewriting we
> >> > can push the performance of integer reading from 5.42 Gbps/core to 13.61
> >> > Gbps/core (~2.5x). The peak performance with 16 cores scales up to 110+
> >> > Gbps. The key things to do are:
> >> >
> >> > 1) Disable memory access checks in Arrow and Netty buffers. This gave a
> >> > significant performance boost. However, for such an important performance
> >> > flag, it is very poorly documented
> >> > ("drill.enable_unsafe_memory_access=true").
> >> >
> >> > 2) Materialize values from the Validity and Value direct buffers instead of
> >> > calling the getInt() function on the IntVector. This is implemented as a new
> >> > Unsafe reader type (
> >> > https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L31
> >> > )
> >> >
> >> > 3) Optimize the bitmap operation that checks whether a bit is set (
> >> > https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L23
> >> > )
> >> >
> >> > A detailed write-up of these steps is available here:
> >> > https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-09-arrow-int.md
> >> >
> >> > I have 2 follow-up questions:
> >> >
> >> > 1) Regarding the `isSet` function, why does it have to calculate the number
> >> > of bits set? (
> >> > https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L797
> >> > ). Wouldn't just checking whether the result of the AND operation is zero
> >> > or not be sufficient? Like what I did:
> >> > https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L28
> >> >
> >> >
> >> > 2) What is the reason behind this bitmap generation optimization here
> >> > https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java#L179
> >> > ? At the point when this function is called, the bitmap vector has already
> >> > been read from storage and contains the right values (either all null, all
> >> > set, or whatever). Generating this mask here for the special cases when the
> >> > values are all NULL or all set (this was the case in my benchmark) can be
> >> > slower than just returning what one has read from storage.
> >> >
> >> > Collectively, optimizing these two bitmap operations gives more than 1 Gbps
> >> > of gains in my benchmarking code.
> >> >
> >> > Cheers,
> >> > --
> >> > Animesh
> >> >
> >> >
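The direct-buffer read (step 2) and the zero/non-zero bit test (step 3 and
question 1) in the message above can be sketched roughly as follows. This is
only an illustration, not the code from the linked benchmark: it uses plain
java.nio.ByteBuffer as a stand-in for the Arrow validity and value buffers,
and the class and method names (DirectBufferIntReader, isValid, sumNonNull,
demoChecksum) are made up for this example. It assumes the Arrow layout of a
least-significant-bit-first validity bitmap and little-endian 32-bit values.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class DirectBufferIntReader {
    /** True if bit `index` is set; only tests the AND result against zero. */
    static boolean isValid(ByteBuffer validity, int index) {
        // Byte index is index / 8, bit position inside that byte is index % 8.
        return (validity.get(index >>> 3) & (1 << (index & 7))) != 0;
    }

    /** Sums the non-null 32-bit values by reading both buffers directly. */
    static long sumNonNull(ByteBuffer validity, ByteBuffer values, int count) {
        long checksum = 0;
        for (int i = 0; i < count; i++) {
            if (isValid(validity, i)) {
                checksum += values.getInt(i * Integer.BYTES);
            }
        }
        return checksum;
    }

    /** Builds 10 values (1..10), marks even indices non-null, returns the sum. */
    static long demoChecksum() {
        int count = 10;
        ByteBuffer values = ByteBuffer.allocateDirect(count * Integer.BYTES)
                                      .order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer validity = ByteBuffer.allocateDirect((count + 7) / 8);
        for (int i = 0; i < count; i++) {
            values.putInt(i * Integer.BYTES, i + 1);
            if (i % 2 == 0) {
                int byteIndex = i >>> 3;
                byte updated = (byte) (validity.get(byteIndex) | (1 << (i & 7)));
                validity.put(byteIndex, updated);
            }
        }
        return sumNonNull(validity, values, count);
    }

    public static void main(String[] args) {
        System.out.println(demoChecksum()); // 1 + 3 + 5 + 7 + 9 = 25
    }
}
```

ByteBuffer accessors still perform their own bounds checks, so this sketch
shows the access pattern only and says nothing about the speedups reported
above.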
> >> > On Thu, Oct 4, 2018 at 12:52 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> >> >
> >> > > See e.g.
> >> > >
> >> > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222
> >> > >
> >> > >
> >> > > On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi
> >> > > <animesh.trivedi@xxxxxxxxx> wrote:
> >> > > >
> >> > > > Primarily, I want to write the same microbenchmark that I have in Java
> >> > > > in C++, for table reading and value materialization. So just an example
> >> > > > of the C++ equivalent of the ArrowFileReader example code. Unit tests
> >> > > > are a good starting point, thanks for the tip :)
> >> > > >
> >> > > > On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> >> > > >
> >> > > > > > 3. Are there examples of Arrow in C++ read/write code that I can have a
> >> > > > > > look?
> >> > > > >
> >> > > > > What kind of code are you looking for? I would direct you to relevant
> >> > > > > unit tests that exhibit certain functionality, but it depends on what
> >> > > > > you are trying to do
> >> > > > > On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi
> >> > > > > <animesh.trivedi@xxxxxxxxx> wrote:
> >> > > > > >
> >> > > > > > Hi all - quick update on the performance investigation:
> >> > > > > >
> >> > > > > > - I spent some time looking at the performance profile for a binary
> >> > > > > > blob column (1024 bytes of byte[]) and found a few favorable settings
> >> > > > > > for delivering up to 168 Gbps from an in-memory reading benchmark on
> >> > > > > > 16 cores. These settings (NUMA, JVM settings, Arrow holder API, batch
> >> > > > > > size, etc.) are documented here:
> >> > > > > > https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
> >> > > > > > - These settings also helped to improve the last number that I
> >> > > > > > reported (but not by much) for the in-memory TPC-DS store_sales table,
> >> > > > > > from ~39 Gbps up to ~45-47 Gbps (note: this number is just the
> >> > > > > > in-memory benchmark, i.e., w/o any networking or storage links)
> >> > > > > >
> >> > > > > > A few follow-up questions that I have:
> >> > > > > > 1. Arrow reads a batch size worth of data in one go. Are there any
> >> > > > > > recommended batch sizes? In my investigation, small batch sizes help
> >> > > > > > with a better cache profile but increase the number of instructions
> >> > > > > > required (more looping). Larger ones do the opposite. Somehow
> >> > > > > > ~10MB/thread seems to be the best performing configuration, which is
> >> > > > > > also a bit counter-intuitive, as for 16 threads this will lead to
> >> > > > > > 160 MB of memory footprint. Maybe this is also tied to the memory
> >> > > > > > management logic, which is my next question.
> >> > > > > > 2. Arrow uses Netty's memory manager. (i) What are decent Netty memory
> >> > > > > > management settings for the "io.netty.allocator.*" parameters? I can't
> >> > > > > > find any decent write-up on them; (ii) Is there a provision for an
> >> > > > > > ArrowBuf being re-used once a batch is consumed? As it looks for now,
> >> > > > > > each read allocates a new buffer to read the whole batch size.
> >> > > > > > 3. Are there examples of Arrow in C++ read/write code that I can have a
> >> > > > > > look?
> >> > > > > >
> >> > > > > > Cheers,
> >> > > > > > --
> >> > > > > > Animesh
> >> > > > > >
> >> > > > > >
> >> > > > > > On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> >> > > > > >
> >> > > > > > > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi
> >> > > > > > > <animesh.trivedi@xxxxxxxxx> wrote:
> >> > > > > > > >
> >> > > > > > > > Hi Johan, Wes, and Jacques - many thanks for your
> comments:
> >> > > > > > > >
> >> > > > > > > > @Johan -
> >> > > > > > > > 1. I also do not suspect that there is any inherent
> >> drawback in
> >> > > Java
> >> > > > > or
> >> > > > > > > C++
> >> > > > > > > > due to the Arrow format. I mentioned C++ because Wes
> >> pointed out
> >> > > that
> >> > > > > > > Java
> >> > > > > > > > routines are not the most optimized ones (yet!). And
> >> naturally
> >> > > one
> >> > > > > would
> >> > > > > > > > expect better performance in a native language with all
> >> > > > > > > pointer/memory/SIMD
> >> > > > > > > > instruction optimizations that you mentioned. As far as I
> >> know,
> >> > > the
> >> > > > > > > > off-heap buffers are managed in ArrowBuf which implements
> an
> >> > > abstract
> >> > > > > > > netty
> >> > > > > > > > class. But there is nothing unusual, i.e., netty specific,
> >> about
> >> > > > > these
> >> > > > > > > > unsafe routines, they are used by many projects. Though
> >> there is
> >> > > cost
> >> > > > > > > > associated with materializing on-heap Java values from
> >> off-heap
> >> > > > > memory
> >> > > > > > > > regions. I need to benchmark that more carefully.
> >> > > > > > > >
> >> > > > > > > > 2. When you say "I've so far always been able to get
> similar
> >> > > > > performance
> >> > > > > > > > numbers" - do you mean the same performance of my case 3
> >> where 16
> >> > > > > cores
> >> > > > > > > > drive close to 40 Gbps, or the same performance between
> >> your C++
> >> > > and
> >> > > > > Java
> >> > > > > > > > benchmarks. Do you have some write-up? I would be
> >> interested to
> >> > > read
> >> > > > > up
> >> > > > > > > :)
> >> > > > > > > >
> >> > > > > > > > 3. "Can you get to 100 Gbps starting from primitive arrays
> >> in
> >> > > Java"
> >> > > > > ->
> >> > > > > > > that
> >> > > > > > > > is a good idea. Let me try and report back.
> >> > > > > > > >
> >> > > > > > > > @Wes -
> >> > > > > > > >
> >> > > > > > > > Is there some benchmark template for C++ routines I can
> >> have a
> >> > > look?
> >> > > > > I
> >> > > > > > > > would be happy to get some input from Java-Arrow experts
> on
> >> how
> >> > > to
> >> > > > > write
> >> > > > > > > > these benchmarks more efficiently. I will have a closer
> >> look at
> >> > > the
> >> > > > > JIRA
> >> > > > > > > > tickets that you mentioned.
> >> > > > > > > >
> >> > > > > > > > So, for now I am focusing on the case 3, which is about
> >> > > establishing
> >> > > > > > > > performance when reading from a local in-memory I/O stream
> >> that I
> >> > > > > > > > implemented (
> >> > > > > > > >
> >> > > > > > >
> >> > > > >
> >> > >
> >>
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
> >> > > > > > > ).
> >> > > > > > > > In this case I first read data from parquet files, convert
> >> them
> >> > > into
> >> > > > > > > Arrow,
> >> > > > > > > > and write-out to a MemoryIOChannel, and then read back
> from
> >> it.
> >> > > So,
> >> > > > > the
> >> > > > > > > > performance has nothing to do with Crail or HDFS in the
> >> case 3.
> >> > > > > Once, I
> >> > > > > > > > establish the base performance in this setup (which is
> >> around ~40
> >> > > > > Gbps
> >> > > > > > > with
> >> > > > > > > > 16 cores) I will add Crail to the mix. Perhaps Crail I/O
> >> streams
> >> > > can
> >> > > > > take
> >> > > > > > > > ArrowBuf as src/dst buffers. That should be doable.
> >> > > > > > >
> >> > > > > > > Right, in any case what you are testing is the performance of using a
> >> > > > > > > particular Java accessor layer to JVM off-heap Arrow memory to sum the
> >> > > > > > > non-null values of each column. I'm not sure that a single bandwidth
> >> > > > > > > number produced by this benchmark is very informative for people
> >> > > > > > > contemplating what memory format to use in their system due to the
> >> > > > > > > current state of the implementation (Java) and workload measured
> >> > > > > > > (summing the non-null values with a naive algorithm). I would guess
> >> > > > > > > that a C++ version with raw pointers and a loop-unrolled, branch-free
> >> > > > > > > vectorized sum is going to be a lot faster.
> >> > > > > > >
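To make the "branch-free" idea in the paragraph above concrete, here is a
rough sketch in Java rather than C++ (to stay with the language of this
thread). It works on plain int[] and byte[] arrays instead of Arrow buffers,
is not unrolled, and whether the JIT vectorizes it is not guaranteed. The
class and method names are invented for this example.

```java
class BranchFreeSum {
    /** Baseline: one data-dependent branch per element. */
    static long branchySum(int[] values, byte[] validity) {
        long sum = 0;
        for (int i = 0; i < values.length; i++) {
            if ((validity[i >>> 3] & (1 << (i & 7))) != 0) {
                sum += values[i];
            }
        }
        return sum;
    }

    /** Same result with no branch: mask each value by its validity bit. */
    static long branchFreeSum(int[] values, byte[] validity) {
        long sum = 0;
        for (int i = 0; i < values.length; i++) {
            int bit = (validity[i >>> 3] >>> (i & 7)) & 1; // 0 or 1
            sum += values[i] & -bit; // -bit is all zeros or all ones
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] values = {5, -3, 7, 11, 2, 9, 4, 6, 8};
        // Bits 0, 2, 3 of the first byte and bit 0 of the second byte are set,
        // so indices 0, 2, 3 and 8 are non-null.
        byte[] validity = {0x0D, 0x01};
        System.out.println(branchySum(values, validity));    // 5 + 7 + 11 + 8 = 31
        System.out.println(branchFreeSum(values, validity)); // 31
    }
}
```

The masked version does the same amount of arithmetic for null and non-null
slots, which is the property that lets a compiler or hand-written SIMD code
process several elements per instruction.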
> >> > > > > > > >
> >> > > > > > > > @Jacques -
> >> > > > > > > >
> >> > > > > > > > That is a good point that "Arrow's implementation is more
> >> > > focused on
> >> > > > > > > > interacting with the structure than transporting it".
> >> However,
> >> > > in any
> >> > > > > > > > distributed system one needs to move data/structure around
> >> - as
> >> > > far
> >> > > > > as I
> >> > > > > > > > understand that is another goal of the project. My
> >> investigation
> >> > > > > started
> >> > > > > > > > within the context of Spark/SQL data processing. Spark
> >> converts
> >> > > > > incoming
> >> > > > > > > > data into its own in-memory UnsafeRow representation for
> >> > > processing.
> >> > > > > So
> >> > > > > > > > naturally the performance of this data ingestion pipeline
> >> cannot
> >> > > > > > > outperform
> >> > > > > > > > the read performance of the used file format. I
> benchmarked
> >> > > Parquet,
> >> > > > > ORC,
> >> > > > > > > > Avro, JSON (for the specific TPC-DS store_sales table).
> And
> >> then
> >> > > > > > > curiously
> >> > > > > > > > benchmarked Arrow as well because its design choices are a
> >> better
> >> > > > > fit for
> >> > > > > > > > modern high-performance RDMA/NVMe/100+Gbps devices I am
> >> > > > > investigating.
> >> > > > > > > From
> >> > > > > > > > this point of view, I am trying to find out can Arrow be
> >> the file
> >> > > > > format
> >> > > > > > > > for the next generation of storage/networking devices (see
> >> Apache
> >> > > > > Crail
> >> > > > > > > > project) delivering close to the hardware speed
> >> reading/writing
> >> > > > > rates. As
> >> > > > > > > > Wes pointed out that a C++ library implementation should
> be
> >> > > memory-IO
> >> > > > > > > > bound, so what would it take to deliver the same
> >> performance in
> >> > > Java
> >> > > > > ;)
> >> > > > > > > > (and then, from across the network).
> >> > > > > > > >
> >> > > > > > > > I hope this makes sense.
> >> > > > > > > >
> >> > > > > > > > Cheers,
> >> > > > > > > > --
> >> > > > > > > > Animesh
> >> > > > > > > >
> >> > > > > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <
> >> > > jacques@xxxxxxxxxx>
> >> > > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > My big question is what is the use case and how/what are
> >> you
> >> > > > > trying to
> >> > > > > > > > > compare? Arrow's implementation is more focused on
> >> interacting
> >> > > > > with the
> >> > > > > > > > > structure than transporting it. Generally speaking, when
> >> we're
> >> > > > > working
> >> > > > > > > with
> >> > > > > > > > > Arrow data we frequently are just interacting with
> memory
> >> > > > > locations and
> >> > > > > > > > > doing direct operations. If you have a layer that
> >> supports that
> >> > > > > type of
> >> > > > > > > > > semantic, create a movement technique that depends on
> >> that.
> >> > > Arrow
> >> > > > > > > doesn't
> >> > > > > > > > > force a particular API since the data itself is defined
> >> by its
> >> > > > > > > in-memory
> >> > > > > > > > > layout so if you have a custom use or pattern, just work
> >> with
> >> > > the
> >> > > > > > > in-memory
> >> > > > > > > > > structures.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <
> >> > > wesmckinn@xxxxxxxxx>
> >> > > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > > hi Animesh,
> >> > > > > > > > > >
> >> > > > > > > > > > Per Johan's comments, the C++ library is essentially
> >> going
> >> > > to be
> >> > > > > > > > > > IO/memory bandwidth bound since you're interacting
> with
> >> raw
> >> > > > > pointers.
> >> > > > > > > > > >
> >> > > > > > > > > > I'm looking at your code
> >> > > > > > > > > >
> >> > > > > > > > > > private void consumeFloat4(FieldVector fv) {
> >> > > > > > > > > >     Float4Vector accessor = (Float4Vector) fv;
> >> > > > > > > > > >     int valCount = accessor.getValueCount();
> >> > > > > > > > > >     for(int i = 0; i < valCount; i++){
> >> > > > > > > > > >         if(!accessor.isNull(i)){
> >> > > > > > > > > >             float4Count+=1;
> >> > > > > > > > > >             checksum+=accessor.get(i);
> >> > > > > > > > > >         }
> >> > > > > > > > > >     }
> >> > > > > > > > > > }
> >> > > > > > > > > >
> >> > > > > > > > > > You'll want to get a Java-Arrow expert from Dremio to advise you on
> >> > > > > > > > > > the fastest way to iterate over this data -- my understanding is
> >> > > > > > > > > > that much code in Dremio interacts with the wrapped Netty ArrowBuf
> >> > > > > > > > > > objects rather than going through the higher level APIs. You're
> >> > > > > > > > > > also dropping performance because memory mapping is not yet
> >> > > > > > > > > > implemented in Java, see
> >> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
> >> > > > > > > > > >
> >> > > > > > > > > > Furthermore, the IPC reader class you are using could be made more
> >> > > > > > > > > > efficient. I described the problem in
> >> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 -- this will be
> >> > > > > > > > > > required as soon as we have the ability to do memory mapping in Java
> >> > > > > > > > > >
> >> > > > > > > > > > Could Crail use the Arrow data structures in its runtime rather than
> >> > > > > > > > > > copying? If not, how are Crail's runtime data structures different?
> >> > > > > > > > > >
> >> > > > > > > > > > - Wes
> >> > > > > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> >> > > > > > > > > > <J.W.Peltenburg@xxxxxxxxxx> wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > Hello Animesh,
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > I browsed a bit in your sources, thanks for sharing.
> >> We
> >> > > have
> >> > > > > > > performed
> >> > > > > > > > > > some similar measurements to your third case in the
> >> past for
> >> > > > > C/C++ on
> >> > > > > > > > > > collections of various basic types such as primitives
> >> and
> >> > > > > strings.
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > I can say that in terms of consuming data from the
> >> Arrow
> >> > > format
> >> > > > > > > versus
> >> > > > > > > > > > language native collections in C++, I've so far always
> >> been
> >> > > able
> >> > > > > to
> >> > > > > > > get
> >> > > > > > > > > > similar performance numbers (e.g. no drawbacks due to
> >> the
> >> > > Arrow
> >> > > > > > > format
> >> > > > > > > > > > itself). Especially when accessing the data through
> >> Arrow's
> >> > > raw
> >> > > > > data
> >> > > > > > > > > > pointers (and using for example std::string_view-like
> >> > > > > constructs).
> >> > > > > > > > > > >
> >> > > > > > > > > > > In C/C++ the fast data structures are engineered in
> >> such a
> >> > > way
> >> > > > > > > that as
> >> > > > > > > > > > little pointer traversals are required and they take
> up
> >> an as
> >> > > > > small
> >> > > > > > > as
> >> > > > > > > > > > possible memory footprint. Thus each memory access is
> >> > > relatively
> >> > > > > > > > > efficient
> >> > > > > > > > > > (in terms of obtaining the data of interest). The same
> >> can
> >> > > > > > > absolutely be
> >> > > > > > > > > > said for Arrow, if not even more efficient in some
> cases
> >> > > where
> >> > > > > object
> >> > > > > > > > > > fields are of variable length.
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > In the JVM case, the Arrow data is stored off-heap.
> >> This
> >> > > > > requires
> >> > > > > > > the
> >> > > > > > > > > > JVM to interface to it through some calls to Unsafe
> >> hidden
> >> > > under
> >> > > > > the
> >> > > > > > > > > Netty
> >> > > > > > > > > > layer (but please correct me if I'm wrong, I'm not an
> >> expert
> >> > > on
> >> > > > > > > this).
> >> > > > > > > > > > Those calls are the only reason I can think of that
> >> would
> >> > > > > degrade the
> >> > > > > > > > > > performance a bit compared to a pure Java case. I
> don't
> >> know
> >> > > if
> >> > > > > the
> >> > > > > > > > > Unsafe
> >> > > > > > > > > > calls are inlined during JIT compilation. If they
> >> aren't they
> >> > > > > will
> >> > > > > > > > > increase
> >> > > > > > > > > > access latency to any data a little bit.
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > I don't have a similar machine so it's not easy to
> >> relate
> >> > > my
> >> > > > > > > numbers to
> >> > > > > > > > > > yours, but if you can get that data consumed with 100
> >> Gbps
> >> > > in a
> >> > > > > pure
> >> > > > > > > Java
> >> > > > > > > > > > case, I don't see any reason (resulting from Arrow
> >> format /
> >> > > > > off-heap
> >> > > > > > > > > > storage) why you wouldn't be able to get at least
> really
> >> > > close.
> >> > > > > Can
> >> > > > > > > you
> >> > > > > > > > > get
> >> > > > > > > > > > to 100 Gbps starting from primitive arrays in Java
> with
> >> your
> >> > > > > > > consumption
> >> > > > > > > > > > functions in the first place?
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > I'm interested to see your progress on this.
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > Kind regards,
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > Johan Peltenburg
> >> > > > > > > > > > >
> >> > > > > > > > > > > ________________________________
> >> > > > > > > > > > > From: Animesh Trivedi <animesh.trivedi@xxxxxxxxx>
> >> > > > > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
> >> > > > > > > > > > > To: dev@xxxxxxxxxxxxxxxx; dev@xxxxxxxxxxxxxxxx
> >> > > > > > > > > > > Subject: [JAVA] Arrow performance measurement
> >> > > > > > > > > > >
> >> > > > > > > > > > > Hi all,
> >> > > > > > > > > > >
> >> > > > > > > > > > > A week ago, Wes and I had a discussion about the
> >> > > performance
> >> > > > > of the
> >> > > > > > > > > > > Arrow/Java implementation on the Apache Crail
> >> (Incubating)
> >> > > > > mailing
> >> > > > > > > > > list (
> >> > > > > > > > > > >
> >> > > > > > >
> >> > >
> >> http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
> >> > > > > > > > > ).
> >> > > > > > > > > > In
> >> > > > > > > > > > > a nutshell: I am investigating the performance of
> >> various
> >> > > file
> >> > > > > > > formats
> >> > > > > > > > > > > (including Arrow) on high-performance NVMe and
> >> > > > > RDMA/100Gbps/RoCE
> >> > > > > > > > > setups.
> >> > > > > > > > > > I
> >> > > > > > > > > > > benchmarked how long does it take to materialize
> >> values
> >> > > (ints,
> >> > > > > > > longs,
> >> > > > > > > > > > > doubles) of the store_sales table, the largest table
> >> in the
> >> > > > > TPC-DS
> >> > > > > > > > > > dataset
> >> > > > > > > > > > > stored on different file formats. Here is a write-up
> >> on
> >> > > this -
> >> > > > > > > > > > >
> >> > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
> >> > > > > > > found
> >> > > > > > > > > > that
> >> > > > > > > > > > > between a pair of machines connected over a 100 Gbps
> >> link,
> >> > > Arrow
> >> > > > > > > (using
> >> > > > > > > > > > as a
> >> > > > > > > > > > > file format on HDFS) delivered close to ~30 Gbps of
> >> > > bandwidth
> >> > > > > with
> >> > > > > > > all
> >> > > > > > > > > 16
> >> > > > > > > > > > > cores engaged. Wes pointed out that (i) Arrow is
> >> in-memory
> >> > > IPC
> >> > > > > > > format,
> >> > > > > > > > > > and
> >> > > > > > > > > > > has not been optimized for storage interfaces/APIs
> >> like
> >> > > HDFS;
> >> > > > > (ii)
> >> > > > > > > the
> >> > > > > > > > > > > performance I am measuring is for the java
> >> implementation.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Wes, I hope I summarized our discussion correctly.
> >> > > > > > > > > > >
> >> > > > > > > > > > > That brings us to this email where I promised to
> >> follow up
> >> > > on
> >> > > > > the
> >> > > > > > > Arrow
> >> > > > > > > > > > > mailing list to understand and optimize the
> >> performance of
> >> > > > > > > Arrow/Java
> >> > > > > > > > > > > implementation on high-performance devices. I wrote
> a
> >> small
> >> > > > > > > stand-alone
> >> > > > > > > > > > > benchmark (
> >> > > > > https://github.com/animeshtrivedi/benchmarking-arrow)
> >> > > > > > > with
> >> > > > > > > > > > three
> >> > > > > > > > > > > implementations of WritableByteChannel,
> >> SeekableByteChannel
> >> > > > > > > interfaces:
> >> > > > > > > > > > >
> >> > > > > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives
> me
> >> ~30
> >> > > Gbps
> >> > > > > > > > > > performance
> >> > > > > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives
> me
> >> > > ~35-36
> >> > > > > Gbps
> >> > > > > > > > > > > performance
> >> > > > > > > > > > > 3. Arrow data is stored in on-heap byte[] - this
> >> gives me
> >> > > ~39
> >> > > > > Gbps
> >> > > > > > > > > > > performance
> >> > > > > > > > > > >
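For readers who want to relate figures like the ~30 to ~39 Gbps above to
their own runs, a bandwidth number of this kind can be derived as payload
bits divided by wall-clock time. That this is how the numbers in this thread
were computed is an assumption; the helper below (hypothetical name
Throughput.gbps) just shows the arithmetic.

```java
class Throughput {
    /** Gigabits per second for `bytes` of payload consumed in `nanos` ns. */
    static double gbps(long bytes, long nanos) {
        // Bits per nanosecond equals 10^9 bits per second, i.e. Gbps.
        return (bytes * 8.0) / nanos;
    }

    public static void main(String[] args) {
        // 4.875 GB consumed in one second corresponds to 39 Gbps.
        System.out.println(gbps(4_875_000_000L, 1_000_000_000L));
    }
}
```

Under this definition, ~39 Gbps across 16 cores is roughly 2.4 Gbps per
core, which makes it easier to compare with the per-core figures quoted
later in the thread.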
> >> > > > > > > > > > > I think the order makes sense. To better understand
> >> the
> >> > > > > > > performance of
> >> > > > > > > > > > > Arrow/Java we can focus on the option 3.
> >> > > > > > > > > > >
> >> > > > > > > > > > > The key question I am trying to answer is "what
> would
> >> it
> >> > > take
> >> > > > > for
> >> > > > > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is
> it
> >> > > > > possible? If
> >> > > > > > > > > yes,
> >> > > > > > > > > > > then what is missing/or mis-interpreted by me? If
> >> not, then
> >> > > > > where
> >> > > > > > > is
> >> > > > > > > > > the
> >> > > > > > > > > > > performance lost? Does anyone have any performance
> >> > > measurements
> >> > > > > > > for C++
> >> > > > > > > > > > > implementation? if they have seen/expect better
> >> numbers.
> >> > > > > > > > > > >
> >> > > > > > > > > > > As a next step, I will profile the read path of
> >> Arrow/Java
> >> > > for
> >> > > > > the
> >> > > > > > > > > option
> >> > > > > > > > > > > 3. I will report my findings.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Any thoughts and feedback on this investigation are
> >> very
> >> > > > > welcome.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Cheers,
> >> > > > > > > > > > > --
> >> > > > > > > > > > > Animesh
> >> > > > > > > > > > >
> >> > > > > > > > > > > PS~ Cross-posting on the dev@xxxxxxxxxxxxxxxx list
> as
> >> > > well.
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > >
> >> > > > >
> >> > >
> >>
> >
>
>