[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [JAVA] Arrow performance measurement

hi Animesh,

Per Johan's comments, the C++ library is essentially going to be
IO/memory bandwidth bound since you're interacting with raw pointers.

I'm looking at your code

private void consumeFloat4(FieldVector fv) {
    Float4Vector accessor = (Float4Vector) fv;
    int valCount = accessor.getValueCount();
    for(int i = 0; i < valCount; i++){

You'll want to get a Java-Arrow expert from Dremio to advise you the
fastest way to iterate over this data -- my understanding is that much
code in Dremio interacts with the wrapped Netty ArrowBuf objects
rather than going through the higher level APIs. You're also dropping
performance because memory mapping is not yet implemented in Java, see

Furthermore, the IPC reader class you are using could be made more
efficient. I described the problem in
https://issues.apache.org/jira/browse/ARROW-3192 -- this will be
required as soon as we have the ability to do memory mapping in Java

Could Crail use the Arrow data structures in its runtime rather than
copying? If not, how are Crail's runtime data structures different?

- Wes
On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
<J.W.Peltenburg@xxxxxxxxxx> wrote:
> Hello Animesh,
> I browsed a bit in your sources, thanks for sharing. We have performed some similar measurements to your third case in the past for C/C++ on collections of various basic types such as primitives and strings.
> I can say that in terms of consuming data from the Arrow format versus language native collections in C++, I've so far always been able to get similar performance numbers (e.g. no drawbacks due to the Arrow format itself). Especially when accessing the data through Arrow's raw data pointers (and using for example std::string_view-like constructs).
> In C/C++ the fast data structures are engineered in such a way that as little pointer traversals are required and they take up an as small as possible memory footprint. Thus each memory access is relatively efficient (in terms of obtaining the data of interest). The same can absolutely be said for Arrow, if not even more efficient in some cases where object fields are of variable length.
> In the JVM case, the Arrow data is stored off-heap. This requires the JVM to interface to it through some calls to Unsafe hidden under the Netty layer (but please correct me if I'm wrong, I'm not an expert on this). Those calls are the only reason I can think of that would degrade the performance a bit compared to a pure JAva case. I don't know if the Unsafe calls are inlined during JIT compilation. If they aren't they will increase access latency to any data a little bit.
> I don't have a similar machine so it's not easy to relate my numbers to yours, but if you can get that data consumed with 100 Gbps in a pure Java case, I don't see any reason (resulting from Arrow format / off-heap storage) why you wouldn't be able to get at least really close. Can you get to 100 Gbps starting from primitive arrays in Java with your consumption functions in the first place?
> I'm interested to see your progress on this.
> Kind regards,
> Johan Peltenburg
> ________________________________
> From: Animesh Trivedi <animesh.trivedi@xxxxxxxxx>
> Sent: Wednesday, September 19, 2018 2:08:50 PM
> To: dev@xxxxxxxxxxxxxxxx; dev@xxxxxxxxxxxxxxxx
> Subject: [JAVA] Arrow performance measurement
> Hi all,
> A week ago, Wes and I had a discussion about the performance of the
> Arrow/Java implementation on the Apache Crail (Incubating) mailing list (
> http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser). In
> a nutshell: I am investigating the performance of various file formats
> (including Arrow) on high-performance NVMe and RDMA/100Gbps/RoCE setups. I
> benchmarked how long does it take to materialize values (ints, longs,
> doubles) of the store_sales table, the largest table in the TPC-DS dataset
> stored on different file formats. Here is a write-up on this -
> https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I found that
> between a pair of machine connected over a 100 Gbps link, Arrow (using as a
> file format on HDFS) delivered close to ~30 Gbps of bandwidth with all 16
> cores engaged. Wes pointed out that (i) Arrow is in-memory IPC format, and
> has not been optimized for storage interfaces/APIs like HDFS; (ii) the
> performance I am measuring is for the java implementation.
> Wes, I hope I summarized our discussion correctly.
> That brings us to this email where I promised to follow up on the Arrow
> mailing list to understand and optimize the performance of Arrow/Java
> implementation on high-performance devices. I wrote a small stand-alone
> benchmark (https://github.com/animeshtrivedi/benchmarking-arrow) with three
> implementations of WritableByteChannel, SeekableByteChannel interfaces:
> 1. Arrow data is stored in HDFS/tmpfs - this gives me ~30 Gbps performance
> 2. Arrow data is stored in Crail/DRAM - this gives me ~35-36 Gbps
> performance
> 3. Arrow data is stored in on-heap byte[] - this gives me ~39 Gbps
> performance
> I think the order makes sense. To better understand the performance of
> Arrow/Java we can focus on the option 3.
> The key question I am trying to answer is "what would it take for
> Arrow/Java to deliver 100+ Gbps of performance"? Is it possible? If yes,
> then what is missing/or mis-interpreted by me? If not, then where is the
> performance lost? Does anyone have any performance measurements for C++
> implementation? if they have seen/expect better numbers.
> As a next step, I will profile the read path of Arrow/Java for the option
> 3. I will report my findings.
> Any thoughts and feedback on this investigation are very welcome.
> Cheers,
> --
> Animesh
> PS~ Cross-posting on the dev@xxxxxxxxxxxxxxxx list as well.