Re: [JAVA] Arrow performance measurement


See e.g.

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222


On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi
<animesh.trivedi@xxxxxxxxx> wrote:
>
> Primarily, I want to write the same microbenchmark that I have in Java in
> C++, for table reading and value materialization. So just an example of the
> equivalent ArrowFileReader code in C++ would do. Unit tests are a good
> starting point, thanks for the tip :)
>
> On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
>
> > > 3. Are there examples of Arrow C++ read/write code that I can have a look at?
> >
> > What kind of code are you looking for? I would direct you to relevant
> > unit tests that exhibit certain functionality, but it depends on what
> > you are trying to do.
> >
> > On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi
> > <animesh.trivedi@xxxxxxxxx> wrote:
> > >
> > > Hi all - quick update on the performance investigation:
> > >
> > > - I spent some time looking at the performance profile for a binary blob
> > > column (1024 bytes of byte[]) and found a few favorable settings for
> > > delivering up to 168 Gbps from an in-memory reading benchmark on 16 cores.
> > > These settings (NUMA, JVM settings, Arrow holder API, batch size, etc.)
> > > are documented here:
> > > https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
> > > - These settings also help to improve the last number that I reported (but
> > > not by much) for the in-memory TPC-DS store_sales table, from ~39 Gbps up
> > > to ~45-47 Gbps (note: this number is just the in-memory benchmark, i.e.,
> > > w/o any networking or storage links)
> > >
> > > A few follow-up questions that I have:
> > > 1. Arrow reads a batch size worth of data in one go. Are there any
> > > recommended batch sizes? In my investigation, small batch sizes help with
> > > a better cache profile but increase the number of instructions required
> > > (more looping). Larger ones do the opposite. Somehow ~10MB/thread seems to
> > > be the best performing configuration, which is also a bit counterintuitive,
> > > as for 16 threads this leads to a 160 MB memory footprint. Maybe this is
> > > also tied to the memory management logic, which is my next question.
> > > 2. Arrow uses Netty's memory manager. (i) What are decent Netty memory
> > > management settings for the "io.netty.allocator.*" parameters? I can't find
> > > any decent write-up on them; (ii) Is there a provision for an ArrowBuf being
> > > re-used once a batch is consumed? As it looks for now, each read allocates
> > > a new buffer to read the whole batch.
> > > 3. Are there examples of Arrow C++ read/write code that I can have a
> > > look at?
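
To make the batch-size arithmetic in question 1 concrete, here is a minimal sketch that turns a per-thread byte budget into a row count for the 1024-byte binary column and multiplies out the total footprint. The class and method names are hypothetical and not part of Arrow. The per-row cost assumes Arrow's variable-width binary layout (payload bytes, one 32-bit offset, one validity bit) and ignores buffer padding and the one extra trailing offset.

```java
final class BatchSizing {
    /** Approximate bytes per row for a non-null binary value of the given size. */
    static double bytesPerBinaryRow(int payloadBytes) {
        // payload + 4-byte offset entry + 1 validity bit (1/8 byte); an assumption, see lead-in
        return payloadBytes + 4 + 1.0 / 8;
    }

    /** Number of whole rows that fit into a batch of batchBytes. */
    static long rowsPerBatch(long batchBytes, double bytesPerRow) {
        return (long) (batchBytes / bytesPerRow);
    }

    /** Memory held by in-flight batches if every thread holds one batch. */
    static long totalFootprint(long batchBytes, int threads) {
        return batchBytes * threads;
    }

    public static void main(String[] args) {
        long batchBytes = 10_000_000L; // ~10 MB per thread, as in the message above
        int threads = 16;
        double perRow = bytesPerBinaryRow(1024);
        System.out.println("rows per batch:  " + rowsPerBatch(batchBytes, perRow));
        System.out.println("total footprint: " + totalFootprint(batchBytes, threads) + " bytes");
    }
}
```

This only covers one in-flight batch per thread; any extra buffers the allocator keeps pooled would come on top of that.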
> > >
> > > Cheers,
> > > --
> > > Animesh
> > >
> > >
> > > On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> > >
> > > > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi
> > > > <animesh.trivedi@xxxxxxxxx> wrote:
> > > > >
> > > > > Hi Johan, Wes, and Jacques - many thanks for your comments:
> > > > >
> > > > > @Johan -
> > > > > 1. I also do not suspect that there is any inherent drawback in Java or
> > > > > C++ due to the Arrow format. I mentioned C++ because Wes pointed out that
> > > > > the Java routines are not the most optimized ones (yet!). And naturally
> > > > > one would expect better performance in a native language with all the
> > > > > pointer/memory/SIMD instruction optimizations that you mentioned. As far
> > > > > as I know, the off-heap buffers are managed in ArrowBuf, which extends an
> > > > > abstract netty class. But there is nothing unusual, i.e., netty specific,
> > > > > about these unsafe routines; they are used by many projects. There is,
> > > > > though, a cost associated with materializing on-heap Java values from
> > > > > off-heap memory regions. I need to benchmark that more carefully.
> > > > >
> > > > > 2. When you say "I've so far always been able to get similar performance
> > > > > numbers" - do you mean the same performance as my case 3, where 16 cores
> > > > > drive close to 40 Gbps, or the same performance between your C++ and Java
> > > > > benchmarks? Do you have a write-up? I would be interested to read it :)
> > > > >
> > > > > 3. "Can you get to 100 Gbps starting from primitive arrays in Java" ->
> > > > > that is a good idea. Let me try and report back.
> > > > >
> > > > > @Wes -
> > > > >
> > > > > Is there some benchmark template for C++ routines that I can have a look
> > > > > at? I would be happy to get some input from Java-Arrow experts on how to
> > > > > write these benchmarks more efficiently. I will have a closer look at the
> > > > > JIRA tickets that you mentioned.
> > > > >
> > > > > So, for now I am focusing on case 3, which is about establishing the
> > > > > performance when reading from a local in-memory I/O stream that I
> > > > > implemented (
> > > > > https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
> > > > > ). In this case I first read data from parquet files, convert it into
> > > > > Arrow, write it out to a MemoryIOChannel, and then read it back. So the
> > > > > performance has nothing to do with Crail or HDFS in case 3. Once I
> > > > > establish the base performance in this setup (which is ~40 Gbps with 16
> > > > > cores), I will add Crail to the mix. Perhaps Crail I/O streams can take
> > > > > ArrowBuf as src/dst buffers. That should be doable.
> > > >
> > > > Right, in any case what you are testing is the performance of using a
> > > > particular Java accessor layer to JVM off-heap Arrow memory to sum the
> > > > non-null values of each column. I'm not sure that a single bandwidth
> > > > number produced by this benchmark is very informative for people
> > > > contemplating what memory format to use in their system due to the
> > > > current state of the implementation (Java) and workload measured
> > > > (summing the non-null values with a naive algorithm). I would guess
> > > > that a C++ version with raw pointers and a loop-unrolled, branch-free
> > > > vectorized sum is going to be a lot faster.
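
As a rough illustration of the difference between the naive null-checking loop and a branch-free one, here is a minimal Java sketch. It works on a plain float[] plus an Arrow-style validity bitmap (least-significant bit first) rather than on real Arrow vectors, and all class and method names are hypothetical. It is not the C++ raw-pointer version described above, it does no manual unrolling, and whether the JIT actually vectorizes the second loop is not guaranteed.

```java
final class NullAwareSum {
    /** Branchy version: tests the validity bit of every element before adding it. */
    static double sumBranchy(float[] values, byte[] validity, int count) {
        double sum = 0;
        for (int i = 0; i < count; i++) {
            if (((validity[i >> 3] >> (i & 7)) & 1) != 0) {
                sum += values[i];
            }
        }
        return sum;
    }

    /**
     * Branch-free version: turns the validity bit into an all-ones or all-zeros
     * mask and applies it to the raw float bits, so a null slot contributes +0.0f
     * even if it happens to hold NaN or infinity.
     */
    static double sumBranchFree(float[] values, byte[] validity, int count) {
        double sum = 0;
        for (int i = 0; i < count; i++) {
            int bit = (validity[i >> 3] >> (i & 7)) & 1; // 0 or 1
            int mask = -bit;                             // 0x00000000 or 0xFFFFFFFF
            sum += Float.intBitsToFloat(Float.floatToRawIntBits(values[i]) & mask);
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] values = {1f, Float.NaN, 3f, 4f};
        byte[] validity = {0b0000_0101}; // slots 0 and 2 are valid
        System.out.println(sumBranchy(values, validity, values.length));
        System.out.println(sumBranchFree(values, validity, values.length));
    }
}
```

Masking the bits instead of multiplying by the validity bit matters because the contents of a null slot are undefined, and NaN multiplied by zero is still NaN.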
> > > >
> > > > >
> > > > > @Jacques -
> > > > >
> > > > > That is a good point that "Arrow's implementation is more focused on
> > > > > interacting with the structure than transporting it". However, in any
> > > > > distributed system one needs to move data/structure around - as far as I
> > > > > understand, that is another goal of the project. My investigation started
> > > > > within the context of Spark/SQL data processing. Spark converts incoming
> > > > > data into its own in-memory UnsafeRow representation for processing. So
> > > > > naturally the performance of this data ingestion pipeline cannot
> > > > > outperform the read performance of the file format used. I benchmarked
> > > > > Parquet, ORC, Avro, and JSON (for the specific TPC-DS store_sales table).
> > > > > And then, out of curiosity, I benchmarked Arrow as well, because its
> > > > > design choices are a better fit for the modern high-performance
> > > > > RDMA/NVMe/100+Gbps devices I am investigating. From this point of view, I
> > > > > am trying to find out whether Arrow can be the file format for the next
> > > > > generation of storage/networking devices (see the Apache Crail project),
> > > > > delivering reading/writing rates close to the hardware speed. As Wes
> > > > > pointed out, a C++ library implementation should be memory-IO bound, so
> > > > > what would it take to deliver the same performance in Java ;) (and then,
> > > > > from across the network).
> > > > >
> > > > > I hope this makes sense.
> > > > >
> > > > > Cheers,
> > > > > --
> > > > > Animesh
> > > > >
> > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <jacques@xxxxxxxxxx> wrote:
> > > > >
> > > > > > My big question is what is the use case and how/what are you trying to
> > > > > > compare? Arrow's implementation is more focused on interacting with the
> > > > > > structure than transporting it. Generally speaking, when we're working
> > > > > > with Arrow data we frequently are just interacting with memory locations
> > > > > > and doing direct operations. If you have a layer that supports that type
> > > > > > of semantic, create a movement technique that depends on that. Arrow
> > > > > > doesn't force a particular API since the data itself is defined by its
> > > > > > in-memory layout, so if you have a custom use or pattern, just work with
> > > > > > the in-memory structures.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> > > > > >
> > > > > > > hi Animesh,
> > > > > > >
> > > > > > > Per Johan's comments, the C++ library is essentially going to be
> > > > > > > IO/memory bandwidth bound since you're interacting with raw pointers.
> > > > > > >
> > > > > > > I'm looking at your code
> > > > > > >
> > > > > > > private void consumeFloat4(FieldVector fv) {
> > > > > > >     Float4Vector accessor = (Float4Vector) fv;
> > > > > > >     int valCount = accessor.getValueCount();
> > > > > > >     for(int i = 0; i < valCount; i++){
> > > > > > >         if(!accessor.isNull(i)){
> > > > > > >             float4Count+=1;
> > > > > > >             checksum+=accessor.get(i);
> > > > > > >         }
> > > > > > >     }
> > > > > > > }
> > > > > > >
> > > > > > > You'll want to get a Java-Arrow expert from Dremio to advise you on the
> > > > > > > fastest way to iterate over this data -- my understanding is that much
> > > > > > > code in Dremio interacts with the wrapped Netty ArrowBuf objects
> > > > > > > rather than going through the higher level APIs. You're also dropping
> > > > > > > performance because memory mapping is not yet implemented in Java, see
> > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
> > > > > > >
> > > > > > > Furthermore, the IPC reader class you are using could be made more
> > > > > > > efficient. I described the problem in
> > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 -- this will be
> > > > > > > required as soon as we have the ability to do memory mapping in Java.
> > > > > > >
> > > > > > > Could Crail use the Arrow data structures in its runtime rather than
> > > > > > > copying? If not, how are Crail's runtime data structures different?
> > > > > > >
> > > > > > > - Wes
> > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> > > > > > > <J.W.Peltenburg@xxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > Hello Animesh,
> > > > > > > >
> > > > > > > >
> > > > > > > > I browsed a bit in your sources, thanks for sharing. We have performed
> > > > > > > > some similar measurements to your third case in the past for C/C++ on
> > > > > > > > collections of various basic types such as primitives and strings.
> > > > > > > >
> > > > > > > > I can say that in terms of consuming data from the Arrow format versus
> > > > > > > > language-native collections in C++, I've so far always been able to get
> > > > > > > > similar performance numbers (i.e. no drawbacks due to the Arrow format
> > > > > > > > itself), especially when accessing the data through Arrow's raw data
> > > > > > > > pointers (and using for example std::string_view-like constructs).
> > > > > > > >
> > > > > > > > In C/C++ the fast data structures are engineered in such a way that as
> > > > > > > > few pointer traversals as possible are required and they take up as
> > > > > > > > small a memory footprint as possible. Thus each memory access is
> > > > > > > > relatively efficient (in terms of obtaining the data of interest). The
> > > > > > > > same can absolutely be said for Arrow, which may be even more efficient
> > > > > > > > in some cases where object fields are of variable length.
> > > > > > > >
> > > > > > > > In the JVM case, the Arrow data is stored off-heap. This requires the
> > > > > > > > JVM to interface to it through some calls to Unsafe hidden under the
> > > > > > > > Netty layer (but please correct me if I'm wrong, I'm not an expert on
> > > > > > > > this). Those calls are the only reason I can think of that would degrade
> > > > > > > > the performance a bit compared to a pure Java case. I don't know if the
> > > > > > > > Unsafe calls are inlined during JIT compilation. If they aren't, they
> > > > > > > > will increase access latency to any data a little bit.
> > > > > > > >
> > > > > > > > I don't have a similar machine so it's not easy to relate my numbers to
> > > > > > > > yours, but if you can get that data consumed at 100 Gbps in a pure Java
> > > > > > > > case, I don't see any reason (resulting from the Arrow format / off-heap
> > > > > > > > storage) why you wouldn't be able to get at least really close. Can you
> > > > > > > > get to 100 Gbps starting from primitive arrays in Java with your
> > > > > > > > consumption functions in the first place?
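
A minimal sketch of such a primitive-array baseline, assuming a single thread, a 64 MiB on-heap float[] and a plain summing loop as the consumption function. The class name, sizes and round counts are hypothetical choices, and a hand-rolled timing loop like this is only a rough indicator; a harness such as JMH would give more trustworthy numbers.

```java
import java.util.Arrays;

final class PrimitiveArrayBaseline {
    /** Bytes processed in the given number of nanoseconds, as gigabits per second. */
    static double gbps(long bytes, long nanos) {
        return (bytes * 8.0) / nanos; // bits per nanosecond equals Gbit/s
    }

    /** The consumption function: sum every element of an on-heap array. */
    static double sum(float[] data) {
        double s = 0;
        for (float v : data) {
            s += v;
        }
        return s;
    }

    public static void main(String[] args) {
        float[] data = new float[64 * 1024 * 1024 / Float.BYTES]; // 64 MiB of floats
        Arrays.fill(data, 1.0f);

        double checksum = 0;
        for (int i = 0; i < 20; i++) {   // warm-up so the JIT compiles sum()
            checksum += sum(data);
        }

        int rounds = 20;
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            checksum += sum(data);
        }
        long elapsed = System.nanoTime() - start;

        long bytes = (long) rounds * data.length * Float.BYTES;
        // Print the checksum so the summing work cannot be optimized away.
        System.out.printf("checksum=%.1f, %.2f Gbps on one thread%n", checksum, gbps(bytes, elapsed));
    }
}
```

Running one instance of this per core gives an upper bound to compare the Arrow accessor path against on the same machine.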
> > > > > > > >
> > > > > > > >
> > > > > > > > I'm interested to see your progress on this.
> > > > > > > >
> > > > > > > >
> > > > > > > > Kind regards,
> > > > > > > >
> > > > > > > >
> > > > > > > > Johan Peltenburg
> > > > > > > >
> > > > > > > > ________________________________
> > > > > > > > From: Animesh Trivedi <animesh.trivedi@xxxxxxxxx>
> > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
> > > > > > > > To: dev@xxxxxxxxxxxxxxxx; dev@xxxxxxxxxxxxxxxx
> > > > > > > > Subject: [JAVA] Arrow performance measurement
> > > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > A week ago, Wes and I had a discussion about the performance of the
> > > > > > > > Arrow/Java implementation on the Apache Crail (Incubating) mailing list (
> > > > > > > > http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
> > > > > > > > ). In a nutshell: I am investigating the performance of various file
> > > > > > > > formats (including Arrow) on high-performance NVMe and
> > > > > > > > RDMA/100Gbps/RoCE setups. I benchmarked how long it takes to materialize
> > > > > > > > values (ints, longs, doubles) of the store_sales table, the largest
> > > > > > > > table in the TPC-DS dataset, stored in different file formats. Here is a
> > > > > > > > write-up on this -
> > > > > > > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I found
> > > > > > > > that between a pair of machines connected over a 100 Gbps link, Arrow
> > > > > > > > (used as a file format on HDFS) delivered close to ~30 Gbps of bandwidth
> > > > > > > > with all 16 cores engaged. Wes pointed out that (i) Arrow is an in-memory
> > > > > > > > IPC format and has not been optimized for storage interfaces/APIs like
> > > > > > > > HDFS; (ii) the performance I am measuring is for the Java implementation.
> > > > > > > >
> > > > > > > > Wes, I hope I summarized our discussion correctly.
> > > > > > > >
> > > > > > > > That brings us to this email, where I promised to follow up on the Arrow
> > > > > > > > mailing list to understand and optimize the performance of the Arrow/Java
> > > > > > > > implementation on high-performance devices. I wrote a small stand-alone
> > > > > > > > benchmark (https://github.com/animeshtrivedi/benchmarking-arrow) with
> > > > > > > > three implementations of the WritableByteChannel and SeekableByteChannel
> > > > > > > > interfaces:
> > > > > > > >
> > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives me ~30 Gbps
> > > > > > > > performance
> > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives me ~35-36 Gbps
> > > > > > > > performance
> > > > > > > > 3. Arrow data is stored in on-heap byte[] - this gives me ~39 Gbps
> > > > > > > > performance
> > > > > > > >
> > > > > > > > I think the order makes sense. To better understand the performance of
> > > > > > > > Arrow/Java we can focus on option 3.
> > > > > > > >
> > > > > > > > The key question I am trying to answer is "what would it take for
> > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is it possible? If yes,
> > > > > > > > then what is missing or misinterpreted by me? If not, then where is the
> > > > > > > > performance lost? Does anyone have any performance measurements for the
> > > > > > > > C++ implementation, and if so, have they seen (or do they expect) better
> > > > > > > > numbers?
> > > > > > > >
> > > > > > > > As a next step, I will profile the read path of Arrow/Java for option 3
> > > > > > > > and report my findings.
> > > > > > > >
> > > > > > > > Any thoughts and feedback on this investigation are very welcome.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > --
> > > > > > > > Animesh
> > > > > > > >
> > > > > > > > PS~ Cross-posting on the dev@xxxxxxxxxxxxxxxx list as well.
> > > > > > >
> > > > > >
> > > >
> >