Re: Performance of BeamFnData between Python and Java


I'd assume you're compiling the code with Cython as well? (If you're
using the default containers, that should be fine.)
On Fri, Nov 9, 2018 at 12:09 AM Robert Bradshaw <robertwb@xxxxxxxxxx> wrote:
>
> Very cool to hear of this progress on Samza!
>
> Python protocol buffers are extraordinarily slow (lots of reflection,
> attribute lookups, and bit fiddling for serialization/deserialization
> that is certainly not Python's strong point). Each bundle processed
> involves multiple protos being constructed and sent/received (notably
> the particularly nested and branchy monitoring info one). While there
> are still some improvements that could be made for making bundles
> lighter-weight, amortizing this cost over many elements is essential
> for performance. (Note that elements within a bundle are packed into a
> single byte buffer, and so avoid this overhead.)
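The amortization argument above can be illustrated with a small cost model. This is only a sketch: the function name and both overhead figures are made-up placeholders, not measurements from Beam or Samza, so only the shape of the curve is meaningful.

```python
def per_element_cost_us(bundle_size, bundle_overhead_us=2000.0, element_cost_us=5.0):
    """Average cost per element when a fixed per-bundle overhead is shared.

    bundle_overhead_us and element_cost_us are hypothetical values chosen
    only for illustration.
    """
    if bundle_size < 1:
        raise ValueError("bundle_size must be at least 1")
    return bundle_overhead_us / bundle_size + element_cost_us


for size in (1, 10, 100, 1000):
    cost = per_element_cost_us(size)
    print(f"bundle_size={size:>4}: {cost:8.1f} us/element, ~{1e6 / cost:,.0f} elements/s")
```

With single-element bundles the fixed proto and RPC overhead dominates; as the bundle grows, the per-element cost approaches the cost of processing the element itself.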
>
> Also, it may be good to guarantee you're at least using the C++
> bindings: https://developers.google.com/protocol-buffers/docs/reference/python-generated
> (still slow, but not as slow).
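One way to check which protobuf backend is active is sketched below. It relies on `google.protobuf.internal.api_implementation`, which is an internal module rather than a stable public API, so treat it as a diagnostic aid only. The helper name `protobuf_implementation` is hypothetical.

```python
def protobuf_implementation():
    """Return the active protobuf backend name, or None if protobuf is missing."""
    try:
        # Internal module: useful for diagnostics, not a stable public API.
        from google.protobuf.internal import api_implementation
    except ImportError:
        return None
    return api_implementation.Type()


impl = protobuf_implementation()
if impl is None:
    print("protobuf is not installed")
elif impl == "python":
    print("pure-Python protobuf in use; expect slow (de)serialization")
else:
    print(f"protobuf backend: {impl}")
```

The `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION` environment variable can be used to request a specific backend, provided that backend is available in the installed package.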
>
> And, of course, due to the GIL one may want many python workers for
> multi-core machines.
>
> On Thu, Nov 8, 2018 at 9:18 PM Thomas Weise <thw@xxxxxxxxxx> wrote:
> >
> > We have been doing some end to end testing with Python and Flink (streaming). You could take a look at the following and possibly replicate it for your work:
> >
> > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/flink/flink_streaming_impulse.py
> >
> > We found that in order to get acceptable performance, we need larger bundles (we started with single-element bundles). The default in the Flink runner is now to cap bundles at 1000 elements or 1 second, whichever comes first. With that, I have seen decent throughput for the pipeline above (~ 5000k elements per second with a single SDK worker).
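The "1000 elements or 1 second, whichever comes first" rule can be sketched as a small buffer. This is not the Flink runner's implementation; the class `BundleBuffer` and its parameters are hypothetical, and the time limit here is only checked when an element arrives, whereas a real runner would also need a timer to close idle bundles.

```python
import time


class BundleBuffer:
    """Collects elements and closes a bundle on a size or age limit."""

    def __init__(self, max_elements=1000, max_seconds=1.0, clock=time.monotonic):
        self.max_elements = max_elements
        self.max_seconds = max_seconds
        self._clock = clock  # injectable so the limit can be tested without sleeping
        self._elements = []
        self._started = None

    def add(self, element):
        """Add an element; return the finished bundle if a cap was hit, else None."""
        if self._started is None:
            self._started = self._clock()
        self._elements.append(element)
        too_many = len(self._elements) >= self.max_elements
        too_old = self._clock() - self._started >= self.max_seconds
        return self.flush() if (too_many or too_old) else None

    def flush(self):
        """Return the buffered elements as a bundle and reset the buffer."""
        bundle, self._elements, self._started = self._elements, [], None
        return bundle


buffer = BundleBuffer()
bundles = [b for b in (buffer.add(i) for i in range(2500)) if b is not None]
bundles.append(buffer.flush())  # emit the final partial bundle
print([len(b) for b in bundles])
```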
> >
> > The Flink runner also has support to run multiple SDK workers per Flink task manager.
> >
> > Thomas
> >
> >
> > On Thu, Nov 8, 2018 at 11:13 AM Xinyu Liu <xinyuliu.us@xxxxxxxxx> wrote:
> >>
> >> 19mb/s throughput is enough for us. It seems the bottleneck is the rate of RPC calls. Our message size is usually 1k ~ 10k. So if we can reach 19mb/s, we will be able to process ~4k qps, which meets our requirements. I guess increasing the size of the bundles will help. Do you guys have any results from running Python with Flink? We are curious about the performance there.
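The throughput estimate above can be checked with a quick calculation. The sketch assumes that "19mb/s" means 19 megabytes per second in decimal units and that "1k ~ 10k" means 1,000 to 10,000 bytes per message; neither is stated explicitly in the thread.

```python
def messages_per_second(throughput_bytes_per_s, message_bytes):
    """Messages per second that fit into a given byte throughput."""
    return throughput_bytes_per_s / message_bytes


# Assumption: "19mb/s" is read as 19 megabytes per second (decimal units).
THROUGHPUT = 19 * 1000 * 1000

for size in (1_000, 10_000):
    rate = messages_per_second(THROUGHPUT, size)
    print(f"{size} B messages: {rate:,.0f} msg/s")
```

Under these assumptions the rate lies between 1,900 and 19,000 messages per second, so the quoted ~4k qps corresponds to an average message size of roughly 4.75 KB.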
> >>
> >> Thanks,
> >> Xinyu
> >>
> >> On Thu, Nov 8, 2018 at 10:13 AM Lukasz Cwik <lcwik@xxxxxxxxxx> wrote:
> >>>
> >>> This benchmark[1] shows that Python is getting about 19mb/s.
> >>>
> >>> Yes, running more python sdk_worker processes will improve performance since Python is limited to a single CPU core.
> >>>
> >>> [1] https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5652536396611584&widget=490377658&container=1286539696
> >>>
> >>>
> >>>
> >>> On Wed, Nov 7, 2018 at 5:24 PM Xinyu Liu <xinyuliu.us@xxxxxxxxx> wrote:
> >>>>
> >>>> Looking at the gRPC dashboard published by the benchmark[1], it seems the streaming ping-pong rate for gRPC in Python is around 2k ~ 3k qps. This seems quite low compared to gRPC performance in other languages, e.g. 600k qps for Java and Go. Are we expected to run multiple sdk_worker processes to improve performance?
> >>>>
> >>>> [1] https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5652536396611584&widget=713624174&container=1012810333&maximized
> >>>>
> >>>> On Wed, Nov 7, 2018 at 11:14 AM Lukasz Cwik <lcwik@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> gRPC folks provide a bunch of benchmarks for different scenarios: https://grpc.io/docs/guides/benchmarking.html
> >>>>> You would be most interested in the streaming throughput benchmarks since the Data API is written on top of the gRPC streaming APIs.
> >>>>>
> >>>>> 200KB/s does seem pretty small. Have you captured any Python profiles[1] and looked at them?
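For capturing a Python profile, a minimal sketch with the standard library's `cProfile` and `pstats` is shown here. The `process_bundle` function is a hypothetical stand-in for whatever per-bundle work the SDK worker does; it is not a Beam API, and `profile_call` is likewise an illustrative helper.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()


def process_bundle(elements):
    # Hypothetical stand-in for the per-bundle work done in an SDK worker.
    return sum(len(str(e)) for e in elements)


result, report = profile_call(process_bundle, range(10_000))
print(report)
```

Sorting by cumulative time tends to make it obvious whether the time goes into proto construction, gRPC calls, or the user code itself.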
> >>>>>
> >>>>> 1: https://lists.apache.org/thread.html/f8488faede96c65906216c6b4bc521385abeddc1578c99b85937d2f2@%3Cdev.beam.apache.org%3E
> >>>>>
> >>>>>
> >>>>> On Wed, Nov 7, 2018 at 10:18 AM Hai Lu <lhaiesp@xxxxxxxxx> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> This is Hai from LinkedIn. I'm currently working on the Portable API for the Samza Runner. I was able to make Python work with a Samza container reading from Kafka. However, I'm seeing a severe performance issue with my setup, achieving only ~200KB/s throughput between the Samza runner on the Java side and the sdk_worker on the Python side.
> >>>>>>
> >>>>>> While I'm digging into this, I wonder whether someone has benchmarked the data channel between Java and Python and has results on how much throughput can be reached, assuming a single worker thread and a single JobBundleFactory.
> >>>>>>
> >>>>>> I might be missing some very basic gRPC setting, which leads to these unsatisfactory results. So another question is whether there are any good articles or documentation about gRPC tuning dedicated to IPC.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Hai
> >>>>>>
> >>>>>>