osdir.com


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [JAVA] Arrow performance measurement


Hi Johan, Wes, and Jacques - many thanks for your comments:

@Johan -
1. I also do not suspect that there is any inherent drawback in Java or C++
due to the Arrow format. I mentioned C++ because Wes pointed out that Java
routines are not the most optimized ones (yet!). And naturally one would
expect better performance in a native language with all pointer/memory/SIMD
instruction optimizations that you mentioned. As far as I know, the
off-heap buffers are managed in ArrowBuf which implements an abstract netty
class. But there is nothing unusual, i.e., netty specific, about these
unsafe routines, they are used by many projects. Though there is cost
associated with materializing on-heap Java values from off-heap memory
regions. I need to benchmark that more carefully.

2. When you say "I've so far always been able to get similar performance
numbers" - do you mean the same performance of my case 3 where 16 cores
drive close to 40 Gbps, or the same performance between your C++ and Java
benchmarks. Do you have some write-up? I would be interested to read up :)

3. "Can you get to 100 Gbps starting from primitive arrays in Java" -> that
is a good idea. Let me try and report back.

@Wes -

Is there some benchmark template for C++ routines I can have a look? I
would be happy to get some input from Java-Arrow experts on how to write
these benchmarks more efficiently. I will have a closer look at the JIRA
tickets that you mentioned.

So, for now I am focusing on the case 3, which is about establishing
performance when reading from a local in-memory I/O stream that I
implemented (
https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java).
In this case I first read data from parquet files, convert them into Arrow,
and write-out to a MemoryIOChannel, and then read back from it. So, the
performance has nothing to do with Crail or HDFS in the case 3. Once, I
establish the base performance in this setup (which is around ~40 Gbps with
16 cores) I will add Crail to the mix. Perhaps Crail I/O streams can take
ArrowBuf as src/dst buffers. That should be doable.

@Jacques -

That is a good point that "Arrow's implementation is more focused on
interacting with the structure than transporting it". However, in any
distributed system one needs to move data/structure around - as far as I
understand that is another goal of the project. My investigation started
within the context of Spark/SQL data processing. Spark converts incoming
data into its own in-memory UnsafeRow representation for processing. So
naturally the performance of this data ingestion pipeline cannot outperform
the read performance of the used file format. I benchmarked Parquet, ORC,
Avro, JSON (for the specific TPC-DS store_sales table). And then curiously
benchmarked Arrow as well because its design choices are a better fit for
modern high-performance RDMA/NVMe/100+Gbps devices I am investigating. From
this point of view, I am trying to find out can Arrow be the file format
for the next generation of storage/networking devices (see Apache Crail
project) delivering close to the hardware speed reading/writing rates. As
Wes pointed out that a C++ library implementation should be memory-IO
bound, so what would it take to deliver the same performance in Java ;)
(and then, from across the network).

I hope this makes sense.

Cheers,
--
Animesh

On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <jacques@xxxxxxxxxx> wrote:

> My big question is what is the use case and how/what are you trying to
> compare? Arrow's implementation is more focused on interacting with the
> structure than transporting it. Generally speaking, when we're working with
> Arrow data we frequently are just interacting with memory locations and
> doing direct operations. If you have a layer that supports that type of
> semantic, create a movement technique that depends on that. Arrow doesn't
> force a particular API since the data itself is defined by its in-memory
> layout so if you have a custom use or pattern, just work with the in-memory
> structures.
>
>
>
> On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
>
> > hi Animesh,
> >
> > Per Johan's comments, the C++ library is essentially going to be
> > IO/memory bandwidth bound since you're interacting with raw pointers.
> >
> > I'm looking at your code
> >
> > private void consumeFloat4(FieldVector fv) {
> >     Float4Vector accessor = (Float4Vector) fv;
> >     int valCount = accessor.getValueCount();
> >     for(int i = 0; i < valCount; i++){
> >         if(!accessor.isNull(i)){
> >             float4Count+=1;
> >             checksum+=accessor.get(i);
> >         }
> >     }
> > }
> >
> > You'll want to get a Java-Arrow expert from Dremio to advise you the
> > fastest way to iterate over this data -- my understanding is that much
> > code in Dremio interacts with the wrapped Netty ArrowBuf objects
> > rather than going through the higher level APIs. You're also dropping
> > performance because memory mapping is not yet implemented in Java, see
> > https://issues.apache.org/jira/browse/ARROW-3191.
> >
> > Furthermore, the IPC reader class you are using could be made more
> > efficient. I described the problem in
> > https://issues.apache.org/jira/browse/ARROW-3192 -- this will be
> > required as soon as we have the ability to do memory mapping in Java
> >
> > Could Crail use the Arrow data structures in its runtime rather than
> > copying? If not, how are Crail's runtime data structures different?
> >
> > - Wes
> > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> > <J.W.Peltenburg@xxxxxxxxxx> wrote:
> > >
> > > Hello Animesh,
> > >
> > >
> > > I browsed a bit in your sources, thanks for sharing. We have performed
> > some similar measurements to your third case in the past for C/C++ on
> > collections of various basic types such as primitives and strings.
> > >
> > >
> > > I can say that in terms of consuming data from the Arrow format versus
> > language native collections in C++, I've so far always been able to get
> > similar performance numbers (e.g. no drawbacks due to the Arrow format
> > itself). Especially when accessing the data through Arrow's raw data
> > pointers (and using for example std::string_view-like constructs).
> > >
> > > In C/C++ the fast data structures are engineered in such a way that as
> > little pointer traversals are required and they take up an as small as
> > possible memory footprint. Thus each memory access is relatively
> efficient
> > (in terms of obtaining the data of interest). The same can absolutely be
> > said for Arrow, if not even more efficient in some cases where object
> > fields are of variable length.
> > >
> > >
> > > In the JVM case, the Arrow data is stored off-heap. This requires the
> > JVM to interface to it through some calls to Unsafe hidden under the
> Netty
> > layer (but please correct me if I'm wrong, I'm not an expert on this).
> > Those calls are the only reason I can think of that would degrade the
> > performance a bit compared to a pure JAva case. I don't know if the
> Unsafe
> > calls are inlined during JIT compilation. If they aren't they will
> increase
> > access latency to any data a little bit.
> > >
> > >
> > > I don't have a similar machine so it's not easy to relate my numbers to
> > yours, but if you can get that data consumed with 100 Gbps in a pure Java
> > case, I don't see any reason (resulting from Arrow format / off-heap
> > storage) why you wouldn't be able to get at least really close. Can you
> get
> > to 100 Gbps starting from primitive arrays in Java with your consumption
> > functions in the first place?
> > >
> > >
> > > I'm interested to see your progress on this.
> > >
> > >
> > > Kind regards,
> > >
> > >
> > > Johan Peltenburg
> > >
> > > ________________________________
> > > From: Animesh Trivedi <animesh.trivedi@xxxxxxxxx>
> > > Sent: Wednesday, September 19, 2018 2:08:50 PM
> > > To: dev@xxxxxxxxxxxxxxxx; dev@xxxxxxxxxxxxxxxx
> > > Subject: [JAVA] Arrow performance measurement
> > >
> > > Hi all,
> > >
> > > A week ago, Wes and I had a discussion about the performance of the
> > > Arrow/Java implementation on the Apache Crail (Incubating) mailing
> list (
> > > http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
> ).
> > In
> > > a nutshell: I am investigating the performance of various file formats
> > > (including Arrow) on high-performance NVMe and RDMA/100Gbps/RoCE
> setups.
> > I
> > > benchmarked how long does it take to materialize values (ints, longs,
> > > doubles) of the store_sales table, the largest table in the TPC-DS
> > dataset
> > > stored on different file formats. Here is a write-up on this -
> > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I found
> > that
> > > between a pair of machine connected over a 100 Gbps link, Arrow (using
> > as a
> > > file format on HDFS) delivered close to ~30 Gbps of bandwidth with all
> 16
> > > cores engaged. Wes pointed out that (i) Arrow is in-memory IPC format,
> > and
> > > has not been optimized for storage interfaces/APIs like HDFS; (ii) the
> > > performance I am measuring is for the java implementation.
> > >
> > > Wes, I hope I summarized our discussion correctly.
> > >
> > > That brings us to this email where I promised to follow up on the Arrow
> > > mailing list to understand and optimize the performance of Arrow/Java
> > > implementation on high-performance devices. I wrote a small stand-alone
> > > benchmark (https://github.com/animeshtrivedi/benchmarking-arrow) with
> > three
> > > implementations of WritableByteChannel, SeekableByteChannel interfaces:
> > >
> > > 1. Arrow data is stored in HDFS/tmpfs - this gives me ~30 Gbps
> > performance
> > > 2. Arrow data is stored in Crail/DRAM - this gives me ~35-36 Gbps
> > > performance
> > > 3. Arrow data is stored in on-heap byte[] - this gives me ~39 Gbps
> > > performance
> > >
> > > I think the order makes sense. To better understand the performance of
> > > Arrow/Java we can focus on the option 3.
> > >
> > > The key question I am trying to answer is "what would it take for
> > > Arrow/Java to deliver 100+ Gbps of performance"? Is it possible? If
> yes,
> > > then what is missing/or mis-interpreted by me? If not, then where is
> the
> > > performance lost? Does anyone have any performance measurements for C++
> > > implementation? if they have seen/expect better numbers.
> > >
> > > As a next step, I will profile the read path of Arrow/Java for the
> option
> > > 3. I will report my findings.
> > >
> > > Any thoughts and feedback on this investigation are very welcome.
> > >
> > > Cheers,
> > > --
> > > Animesh
> > >
> > > PS~ Cross-posting on the dev@xxxxxxxxxxxxxxxx list as well.
> >
>