Re: Performance issue in Beam 2.4 onwards
Do you use specific/complex coders in your pipeline?
I'm sure Eugene can offer some insight into this change. AFAIR, the
purpose was to make coder usage cleaner and to identify identity copies.
On 09/07/2018 16:22, Vojtech Janota wrote:
> We have been using Apache Beam in our project for some time now. Since our
> datasets are of modest size, we have so far used DirectRunner as the
> computation easily fits onto a single machine. Recently we upgraded Beam
> from 2.2 to 2.4 and found out that performance of our pipelines
> drastically deteriorated. Pipelines that took ~3 minutes with 2.2 do not
> finish within hours now. We tried to isolate the change that causes the
> slowdown and narrowed it down to these commits to the "InMemoryStateInternals" class:
> * https://github.com/apache/beam/commit/32a427c
> * https://github.com/apache/beam/commit/8151d82
> In a nutshell, where previously the copy() method simply assigned:
>     that.value = this.value
> there is now a coder encode/decode round trip hidden behind:
>     that.value = uncheckedClone(coder, this.value)
> Can somebody explain the purpose of this change? Is it meant as an
> additional "enforcement" point, similar to DirectRunner's
> enforceImmutability and enforceEncodability? Or is it something that is
> genuinely needed to provide correct behaviour of the pipeline?
> Any hints or thoughts are appreciated.
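
To make the difference concrete, here is a minimal sketch of the two copy strategies described above. It does not use Beam: SimpleCoder, StringListCoder, copyByReference and cloneViaCoder are hypothetical stand-ins for illustration only, and the real uncheckedClone may well differ in detail. The point is only that a reference assignment is constant time, while an encode/decode round trip costs time proportional to the size of the state value on every copy.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a coder; NOT Beam's actual Coder<T> interface.
interface SimpleCoder<T> {
  void encode(T value, DataOutputStream out) throws IOException;
  T decode(DataInputStream in) throws IOException;
}

// Hypothetical coder for a list of strings: element count, then each element.
class StringListCoder implements SimpleCoder<List<String>> {
  @Override
  public void encode(List<String> value, DataOutputStream out) throws IOException {
    out.writeInt(value.size());
    for (String s : value) {
      out.writeUTF(s);
    }
  }

  @Override
  public List<String> decode(DataInputStream in) throws IOException {
    int n = in.readInt();
    List<String> result = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      result.add(in.readUTF());
    }
    return result;
  }
}

class CopyVsClone {
  // Old behaviour as described in the thread: the copy aliases the same object.
  static <T> T copyByReference(T value) {
    return value;
  }

  // New behaviour as described in the thread: encode to bytes, then decode a fresh object.
  // Assumption: this mirrors the idea of uncheckedClone, not its actual implementation.
  static <T> T cloneViaCoder(SimpleCoder<T> coder, T value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      coder.encode(value, out);
      out.flush();
      return coder.decode(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    List<String> state = new ArrayList<>();
    for (int i = 0; i < 100_000; i++) {
      state.add("element-" + i);
    }
    SimpleCoder<List<String>> coder = new StringListCoder();

    long t0 = System.nanoTime();
    List<String> alias = copyByReference(state);
    long t1 = System.nanoTime();
    List<String> clone = cloneViaCoder(coder, state);
    long t2 = System.nanoTime();

    System.out.println("alias is same object: " + (alias == state));
    System.out.println("clone is same object: " + (clone == state));
    System.out.println("clone equals original: " + clone.equals(state));
    System.out.println("reference copy ns: " + (t1 - t0));
    System.out.println("coder clone ns:    " + (t2 - t1));
  }
}
```

The clone is an independent object, so later mutations of the original do not leak into the copy. That isolation is the benefit being paid for. If copy() runs often on large state values, the per-copy cost adds up quickly, which would be consistent with the slowdown you report.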
Talend - http://www.talend.com