[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[JAVA] Arrow performance measurement

Hi all,

A week ago, Wes and I had a discussion about the performance of the
Arrow/Java implementation on the Apache Crail (Incubating) mailing list (
http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser). In
a nutshell: I am investigating the performance of various file formats
(including Arrow) on high-performance NVMe and RDMA/100Gbps/RoCE setups. I
benchmarked how long does it take to materialize values (ints, longs,
doubles) of the store_sales table, the largest table in the TPC-DS dataset
stored on different file formats. Here is a write-up on this -
https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I found that
between a pair of machine connected over a 100 Gbps link, Arrow (using as a
file format on HDFS) delivered close to ~30 Gbps of bandwidth with all 16
cores engaged. Wes pointed out that (i) Arrow is in-memory IPC format, and
has not been optimized for storage interfaces/APIs like HDFS; (ii) the
performance I am measuring is for the java implementation.

Wes, I hope I summarized our discussion correctly.

That brings us to this email where I promised to follow up on the Arrow
mailing list to understand and optimize the performance of Arrow/Java
implementation on high-performance devices. I wrote a small stand-alone
benchmark (https://github.com/animeshtrivedi/benchmarking-arrow) with three
implementations of WritableByteChannel, SeekableByteChannel interfaces:

1. Arrow data is stored in HDFS/tmpfs - this gives me ~30 Gbps performance
2. Arrow data is stored in Crail/DRAM - this gives me ~35-36 Gbps
3. Arrow data is stored in on-heap byte[] - this gives me ~39 Gbps

I think the order makes sense. To better understand the performance of
Arrow/Java we can focus on the option 3.

The key question I am trying to answer is "what would it take for
Arrow/Java to deliver 100+ Gbps of performance"? Is it possible? If yes,
then what is missing/or mis-interpreted by me? If not, then where is the
performance lost? Does anyone have any performance measurements for C++
implementation? if they have seen/expect better numbers.

As a next step, I will profile the read path of Arrow/Java for the option
3. I will report my findings.

Any thoughts and feedback on this investigation are very welcome.


PS~ Cross-posting on the dev@xxxxxxxxxxxxxxxx list as well.