Re: Performance issue in Beam 2.4 onwards
I fully agree, that's why it makes sense to wait Eugene's feedback.
I remember we had some performance regression on the direct runner
identified thanks to Nexmark, but it has been addressed by reverting a
Good catch anyway !
On 09/07/2018 17:20, Vojtech Janota wrote:
> Hi Reuven,
> I'm not really complaining about DirectRunner. In fact it seems to me as
> if what previously was considered as part of the "expensive extra
> checks" done by the DirectRunner is now done within the
> beam-runners-core-java library. Considering that all objects involved
> are immutable (in our case at least) and simple assignment is
> sufficient, the serialization-deserialization really seems as unwanted
> and hugely expensive correctness check. If there was a problem with
> identity copy, wasn't DirectRunner supposed to reveal it?
> On Mon, Jul 9, 2018 at 4:46 PM, Reuven Lax <relax@xxxxxxxxxx
> <mailto:relax@xxxxxxxxxx>> wrote:
> Hi Vojita,
> One problem is that the DirectRunner is designed for testing, not
> for performance. The DirectRunner currently does many
> purposely-inefficient things, the point of which is to better expose
> potential bugs in tests. For example, the DirectRunner will randomly
> shuffle the order of PCollections to ensure that your code does not
> rely on ordering. All of this adds cost, because the current runner
> is designed for testing. There have been requests in the past for an
> "optimized" local runner, however we don't currently have such a thing.
> In this case, using coders to clone values is more correct. In a
> distributed environment using encode/decode is the only way to copy
> values, and the DirectRunner is trying to ensure that your code is
> correct in a distributed environment.
> On Mon, Jul 9, 2018 at 7:22 AM Vojtech Janota
> <vojta.janota@xxxxxxxxx <mailto:vojta.janota@xxxxxxxxx>> wrote:
> We are using Apache Beam in our project for some time now. Since
> our datasets are of modest size, we have so far used
> DirectRunner as the computation easily fits onto a single
> machine. Recently we upgraded Beam from 2.2 to 2.4 and found out
> that performance of our pipelines drastically deteriorated.
> Pipelines that took ~3 minutes with 2.2 do not finish within
> hours now. We tried to isolate the change that causes the
> slowdown and came to the commits into the
> "InMemoryStateInternals" class:
> * https://github.com/apache/beam/commit/32a427c
> * https://github.com/apache/beam/commit/8151d82
> In a nutshell where previously the copy() method simply assigned:
> that.value = this.value
> There is now coder encode/decode combo hidden behind:
> that.value = uncheckedClone(coder, this.value)
> Can somebody explain the purpose of this change? Is it meant as
> an additional "enforcement" point, similar to DirectRunner's
> enforceImmutability and enforceEncodability? Or is it something
> that is genuinely needed to provide correct behaviour of the
> Any hints or thoughts are appreciated.
Talend - http://www.talend.com