osdir.com


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [JAVA] Arrow performance measurement


Hi Wes and all,

Here is another round of updates:

Quick recap - previously we established that for 1kB binary blobs, Arrow
can deliver > 160 Gbps performance from in-memory buffers.

In this round I looked at the performance of materializing "integers". In
my benchmarks, I found that with careful optimizations/code-rewriting we
can push the performance of integer reading from 5.42 Gbps/core to 13.61
Gbps/core (~2.5x). The peak performance with 16 cores, scale up to 110+
Gbps. Key things to do is:

1) Disable memory access checks in Arrow and Netty buffers. This gave
significant performance boost. However, for such an important performance
flag, it is very poorly documented
("drill.enable_unsafe_memory_access=true").

2) Materialize values from Validity and Value direct buffers instead of
calling getInt() function on the IntVector. This is implemented as a new
Unsafe reader type (
https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L31
)

3) Optimize bitmap operation to check if a bit is set or not (
https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L23
)

A detailed write up of these steps is available here:
https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-09-arrow-int.md

I have 2 follow-up questions:

1) Regarding the `isSet` function, why does it has to calculate number of
bits set? (
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L797).
Wouldn't just checking if the result of the AND operation is zero or not be
sufficient? Like what I did :
https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L28


2) What is the reason behind this bitmap generation optimization here
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java#L179
? At this point when this function is called, the bitmap vector is already
read from the storage, and contains the right values (either all null, all
set, or whatever). Generating this mask here for the special cases when the
values are all NULL or all set (this was the case in my benchmark), can be
slower than just returning what one has read from the storage.

Collectively optimizing these two bitmap operations give more than 1 Gbps
gains in my bench-marking code.

Cheers,
--
Animesh


On Thu, Oct 4, 2018 at 12:52 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:

> See e.g.
>
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222
>
>
> On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi
> <animesh.trivedi@xxxxxxxxx> wrote:
> >
> > Primarily write the same microbenchmark as I have in Java in C++ for
> table
> > reading and value materialization. So just an example of equivalent
> > ArrowFileReader example code in C++. Unit tests are a good starting
> point,
> > thanks for the tip :)
> >
> > On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <wesmckinn@xxxxxxxxx>
> wrote:
> >
> > > > 3. Are there examples of Arrow in C++ read/write code that I can
> have a
> > > look?
> > >
> > > What kind of code are you looking for? I would direct you to relevant
> > > unit tests that exhibit certain functionality, but it depends on what
> > > you are trying to do
> > > On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi
> > > <animesh.trivedi@xxxxxxxxx> wrote:
> > > >
> > > > Hi all - quick update on the performance investigation:
> > > >
> > > > - I spent some time looking at performance profile for a binary blob
> > > column
> > > > (1024 bytes of byte[]) and found a few favorable settings for
> delivering
> > > up
> > > > to 168 Gbps from in-memory reading benchmark on 16 cores. These
> settings
> > > > (NUMA, JVM settings, Arrow holder API, and batch size, etc.) are
> > > documented
> > > > here:
> > > >
> > >
> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
> > > > - these setting also help to improved the last number that reported
> (but
> > > > not by much) for the in-memory TPC-DS store_sales table from ~39
> Gbps up
> > > to
> > > > ~45-47 Gbps (note: this number is just in-memory benchmark, i.e.,
> w/o any
> > > > networking or storage links)
> > > >
> > > > A few follow up questions that I have:
> > > > 1. Arrow reads a batch size worth of data in one go. Are there any
> > > > recommended batch sizes? In my investigation, small batch size help
> with
> > > a
> > > > better cache profile but increase number of instructions required
> (more
> > > > looping). Larger one do otherwise. Somehow ~10MB/thread seem to be
> the
> > > best
> > > > performing configuration, which is also a bit counter intuitive as
> for 16
> > > > threads this will lead to 160 MB of memory footprint. May be this is
> also
> > > > tired to the memory management logic which is my next question.
> > > > 2. Arrow use's netty's memory manager. (i) what are decent netty
> memory
> > > > management settings for "io.netty.allocator.*" parameters? I don't
> find
> > > any
> > > > decent write-up on them; (ii) Is there a provision for ArrowBuf being
> > > > re-used once a batch is consumed? As it looks for now, read read
> > > allocates
> > > > a new buffer to read the whole batch size.
> > > > 3. Are there examples of Arrow in C++ read/write code that I can
> have a
> > > > look?
> > > >
> > > > Cheers,
> > > > --
> > > > Animesh
> > > >
> > > >
> > > > On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <wesmckinn@xxxxxxxxx>
> > > wrote:
> > > >
> > > > > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi
> > > > > <animesh.trivedi@xxxxxxxxx> wrote:
> > > > > >
> > > > > > Hi Johan, Wes, and Jacques - many thanks for your comments:
> > > > > >
> > > > > > @Johan -
> > > > > > 1. I also do not suspect that there is any inherent drawback in
> Java
> > > or
> > > > > C++
> > > > > > due to the Arrow format. I mentioned C++ because Wes pointed out
> that
> > > > > Java
> > > > > > routines are not the most optimized ones (yet!). And naturally
> one
> > > would
> > > > > > expect better performance in a native language with all
> > > > > pointer/memory/SIMD
> > > > > > instruction optimizations that you mentioned. As far as I know,
> the
> > > > > > off-heap buffers are managed in ArrowBuf which implements an
> abstract
> > > > > netty
> > > > > > class. But there is nothing unusual, i.e., netty specific, about
> > > these
> > > > > > unsafe routines, they are used by many projects. Though there is
> cost
> > > > > > associated with materializing on-heap Java values from off-heap
> > > memory
> > > > > > regions. I need to benchmark that more carefully.
> > > > > >
> > > > > > 2. When you say "I've so far always been able to get similar
> > > performance
> > > > > > numbers" - do you mean the same performance of my case 3 where 16
> > > cores
> > > > > > drive close to 40 Gbps, or the same performance between your C++
> and
> > > Java
> > > > > > benchmarks. Do you have some write-up? I would be interested to
> read
> > > up
> > > > > :)
> > > > > >
> > > > > > 3. "Can you get to 100 Gbps starting from primitive arrays in
> Java"
> > > ->
> > > > > that
> > > > > > is a good idea. Let me try and report back.
> > > > > >
> > > > > > @Wes -
> > > > > >
> > > > > > Is there some benchmark template for C++ routines I can have a
> look?
> > > I
> > > > > > would be happy to get some input from Java-Arrow experts on how
> to
> > > write
> > > > > > these benchmarks more efficiently. I will have a closer look at
> the
> > > JIRA
> > > > > > tickets that you mentioned.
> > > > > >
> > > > > > So, for now I am focusing on the case 3, which is about
> establishing
> > > > > > performance when reading from a local in-memory I/O stream that I
> > > > > > implemented (
> > > > > >
> > > > >
> > >
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
> > > > > ).
> > > > > > In this case I first read data from parquet files, convert them
> into
> > > > > Arrow,
> > > > > > and write-out to a MemoryIOChannel, and then read back from it.
> So,
> > > the
> > > > > > performance has nothing to do with Crail or HDFS in the case 3.
> > > Once, I
> > > > > > establish the base performance in this setup (which is around ~40
> > > Gbps
> > > > > with
> > > > > > 16 cores) I will add Crail to the mix. Perhaps Crail I/O streams
> can
> > > take
> > > > > > ArrowBuf as src/dst buffers. That should be doable.
> > > > >
> > > > > Right, in any case what you are testing is the performance of
> using a
> > > > > particular Java accessor layer to JVM off-heap Arrow memory to sum
> the
> > > > > non-null values of each column. I'm not sure that a single
> bandwidth
> > > > > number produced by this benchmark is very informative for people
> > > > > contemplating what memory format to use in their system due to the
> > > > > current state of the implementation (Java) and workload measured
> > > > > (summing the non-null values with a naive algorithm). I would guess
> > > > > that a C++ version with raw pointers and a loop-unrolled,
> branch-free
> > > > > vectorized sum is going to be a lot faster.
> > > > >
> > > > > >
> > > > > > @Jacques -
> > > > > >
> > > > > > That is a good point that "Arrow's implementation is more
> focused on
> > > > > > interacting with the structure than transporting it". However,
> in any
> > > > > > distributed system one needs to move data/structure around - as
> far
> > > as I
> > > > > > understand that is another goal of the project. My investigation
> > > started
> > > > > > within the context of Spark/SQL data processing. Spark converts
> > > incoming
> > > > > > data into its own in-memory UnsafeRow representation for
> processing.
> > > So
> > > > > > naturally the performance of this data ingestion pipeline cannot
> > > > > outperform
> > > > > > the read performance of the used file format. I benchmarked
> Parquet,
> > > ORC,
> > > > > > Avro, JSON (for the specific TPC-DS store_sales table). And then
> > > > > curiously
> > > > > > benchmarked Arrow as well because its design choices are a better
> > > fit for
> > > > > > modern high-performance RDMA/NVMe/100+Gbps devices I am
> > > investigating.
> > > > > From
> > > > > > this point of view, I am trying to find out can Arrow be the file
> > > format
> > > > > > for the next generation of storage/networking devices (see Apache
> > > Crail
> > > > > > project) delivering close to the hardware speed reading/writing
> > > rates. As
> > > > > > Wes pointed out that a C++ library implementation should be
> memory-IO
> > > > > > bound, so what would it take to deliver the same performance in
> Java
> > > ;)
> > > > > > (and then, from across the network).
> > > > > >
> > > > > > I hope this makes sense.
> > > > > >
> > > > > > Cheers,
> > > > > > --
> > > > > > Animesh
> > > > > >
> > > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <
> jacques@xxxxxxxxxx>
> > > > > wrote:
> > > > > >
> > > > > > > My big question is what is the use case and how/what are you
> > > trying to
> > > > > > > compare? Arrow's implementation is more focused on interacting
> > > with the
> > > > > > > structure than transporting it. Generally speaking, when we're
> > > working
> > > > > with
> > > > > > > Arrow data we frequently are just interacting with memory
> > > locations and
> > > > > > > doing direct operations. If you have a layer that supports that
> > > type of
> > > > > > > semantic, create a movement technique that depends on that.
> Arrow
> > > > > doesn't
> > > > > > > force a particular API since the data itself is defined by its
> > > > > in-memory
> > > > > > > layout so if you have a custom use or pattern, just work with
> the
> > > > > in-memory
> > > > > > > structures.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <
> wesmckinn@xxxxxxxxx>
> > > > > wrote:
> > > > > > >
> > > > > > > > hi Animesh,
> > > > > > > >
> > > > > > > > Per Johan's comments, the C++ library is essentially going
> to be
> > > > > > > > IO/memory bandwidth bound since you're interacting with raw
> > > pointers.
> > > > > > > >
> > > > > > > > I'm looking at your code
> > > > > > > >
> > > > > > > > private void consumeFloat4(FieldVector fv) {
> > > > > > > >     Float4Vector accessor = (Float4Vector) fv;
> > > > > > > >     int valCount = accessor.getValueCount();
> > > > > > > >     for(int i = 0; i < valCount; i++){
> > > > > > > >         if(!accessor.isNull(i)){
> > > > > > > >             float4Count+=1;
> > > > > > > >             checksum+=accessor.get(i);
> > > > > > > >         }
> > > > > > > >     }
> > > > > > > > }
> > > > > > > >
> > > > > > > > You'll want to get a Java-Arrow expert from Dremio to advise
> you
> > > the
> > > > > > > > fastest way to iterate over this data -- my understanding is
> that
> > > > > much
> > > > > > > > code in Dremio interacts with the wrapped Netty ArrowBuf
> objects
> > > > > > > > rather than going through the higher level APIs. You're also
> > > dropping
> > > > > > > > performance because memory mapping is not yet implemented in
> > > Java,
> > > > > see
> > > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
> > > > > > > >
> > > > > > > > Furthermore, the IPC reader class you are using could be made
> > > more
> > > > > > > > efficient. I described the problem in
> > > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 -- this
> will be
> > > > > > > > required as soon as we have the ability to do memory mapping
> in
> > > Java
> > > > > > > >
> > > > > > > > Could Crail use the Arrow data structures in its runtime
> rather
> > > than
> > > > > > > > copying? If not, how are Crail's runtime data structures
> > > different?
> > > > > > > >
> > > > > > > > - Wes
> > > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> > > > > > > > <J.W.Peltenburg@xxxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > Hello Animesh,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I browsed a bit in your sources, thanks for sharing. We
> have
> > > > > performed
> > > > > > > > some similar measurements to your third case in the past for
> > > C/C++ on
> > > > > > > > collections of various basic types such as primitives and
> > > strings.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I can say that in terms of consuming data from the Arrow
> format
> > > > > versus
> > > > > > > > language native collections in C++, I've so far always been
> able
> > > to
> > > > > get
> > > > > > > > similar performance numbers (e.g. no drawbacks due to the
> Arrow
> > > > > format
> > > > > > > > itself). Especially when accessing the data through Arrow's
> raw
> > > data
> > > > > > > > pointers (and using for example std::string_view-like
> > > constructs).
> > > > > > > > >
> > > > > > > > > In C/C++ the fast data structures are engineered in such a
> way
> > > > > that as
> > > > > > > > little pointer traversals are required and they take up an as
> > > small
> > > > > as
> > > > > > > > possible memory footprint. Thus each memory access is
> relatively
> > > > > > > efficient
> > > > > > > > (in terms of obtaining the data of interest). The same can
> > > > > absolutely be
> > > > > > > > said for Arrow, if not even more efficient in some cases
> where
> > > object
> > > > > > > > fields are of variable length.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > In the JVM case, the Arrow data is stored off-heap. This
> > > requires
> > > > > the
> > > > > > > > JVM to interface to it through some calls to Unsafe hidden
> under
> > > the
> > > > > > > Netty
> > > > > > > > layer (but please correct me if I'm wrong, I'm not an expert
> on
> > > > > this).
> > > > > > > > Those calls are the only reason I can think of that would
> > > degrade the
> > > > > > > > performance a bit compared to a pure JAva case. I don't know
> if
> > > the
> > > > > > > Unsafe
> > > > > > > > calls are inlined during JIT compilation. If they aren't they
> > > will
> > > > > > > increase
> > > > > > > > access latency to any data a little bit.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I don't have a similar machine so it's not easy to relate
> my
> > > > > numbers to
> > > > > > > > yours, but if you can get that data consumed with 100 Gbps
> in a
> > > pure
> > > > > Java
> > > > > > > > case, I don't see any reason (resulting from Arrow format /
> > > off-heap
> > > > > > > > storage) why you wouldn't be able to get at least really
> close.
> > > Can
> > > > > you
> > > > > > > get
> > > > > > > > to 100 Gbps starting from primitive arrays in Java with your
> > > > > consumption
> > > > > > > > functions in the first place?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm interested to see your progress on this.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Kind regards,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Johan Peltenburg
> > > > > > > > >
> > > > > > > > > ________________________________
> > > > > > > > > From: Animesh Trivedi <animesh.trivedi@xxxxxxxxx>
> > > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
> > > > > > > > > To: dev@xxxxxxxxxxxxxxxx; dev@xxxxxxxxxxxxxxxx
> > > > > > > > > Subject: [JAVA] Arrow performance measurement
> > > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > A week ago, Wes and I had a discussion about the
> performance
> > > of the
> > > > > > > > > Arrow/Java implementation on the Apache Crail (Incubating)
> > > mailing
> > > > > > > list (
> > > > > > > > >
> > > > >
> http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
> > > > > > > ).
> > > > > > > > In
> > > > > > > > > a nutshell: I am investigating the performance of various
> file
> > > > > formats
> > > > > > > > > (including Arrow) on high-performance NVMe and
> > > RDMA/100Gbps/RoCE
> > > > > > > setups.
> > > > > > > > I
> > > > > > > > > benchmarked how long does it take to materialize values
> (ints,
> > > > > longs,
> > > > > > > > > doubles) of the store_sales table, the largest table in the
> > > TPC-DS
> > > > > > > > dataset
> > > > > > > > > stored on different file formats. Here is a write-up on
> this -
> > > > > > > > >
> https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
> > > > > found
> > > > > > > > that
> > > > > > > > > between a pair of machine connected over a 100 Gbps link,
> Arrow
> > > > > (using
> > > > > > > > as a
> > > > > > > > > file format on HDFS) delivered close to ~30 Gbps of
> bandwidth
> > > with
> > > > > all
> > > > > > > 16
> > > > > > > > > cores engaged. Wes pointed out that (i) Arrow is in-memory
> IPC
> > > > > format,
> > > > > > > > and
> > > > > > > > > has not been optimized for storage interfaces/APIs like
> HDFS;
> > > (ii)
> > > > > the
> > > > > > > > > performance I am measuring is for the java implementation.
> > > > > > > > >
> > > > > > > > > Wes, I hope I summarized our discussion correctly.
> > > > > > > > >
> > > > > > > > > That brings us to this email where I promised to follow up
> on
> > > the
> > > > > Arrow
> > > > > > > > > mailing list to understand and optimize the performance of
> > > > > Arrow/Java
> > > > > > > > > implementation on high-performance devices. I wrote a small
> > > > > stand-alone
> > > > > > > > > benchmark (
> > > https://github.com/animeshtrivedi/benchmarking-arrow)
> > > > > with
> > > > > > > > three
> > > > > > > > > implementations of WritableByteChannel, SeekableByteChannel
> > > > > interfaces:
> > > > > > > > >
> > > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives me ~30
> Gbps
> > > > > > > > performance
> > > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives me
> ~35-36
> > > Gbps
> > > > > > > > > performance
> > > > > > > > > 3. Arrow data is stored in on-heap byte[] - this gives me
> ~39
> > > Gbps
> > > > > > > > > performance
> > > > > > > > >
> > > > > > > > > I think the order makes sense. To better understand the
> > > > > performance of
> > > > > > > > > Arrow/Java we can focus on the option 3.
> > > > > > > > >
> > > > > > > > > The key question I am trying to answer is "what would it
> take
> > > for
> > > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is it
> > > possible? If
> > > > > > > yes,
> > > > > > > > > then what is missing/or mis-interpreted by me? If not, then
> > > where
> > > > > is
> > > > > > > the
> > > > > > > > > performance lost? Does anyone have any performance
> measurements
> > > > > for C++
> > > > > > > > > implementation? if they have seen/expect better numbers.
> > > > > > > > >
> > > > > > > > > As a next step, I will profile the read path of Arrow/Java
> for
> > > the
> > > > > > > option
> > > > > > > > > 3. I will report my findings.
> > > > > > > > >
> > > > > > > > > Any thoughts and feedback on this investigation are very
> > > welcome.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > --
> > > > > > > > > Animesh
> > > > > > > > >
> > > > > > > > > PS~ Cross-posting on the dev@xxxxxxxxxxxxxxxx list as
> well.
> > > > > > > >
> > > > > > >
> > > > >
> > >
>