Performance issue in Beam 2.4 onwards


We have been using Apache Beam in our project for some time now. Since our datasets are of modest size, we have so far used the DirectRunner, as the computation easily fits onto a single machine. Recently we upgraded Beam from 2.2 to 2.4 and found that the performance of our pipelines deteriorated drastically. Pipelines that took ~3 minutes with 2.2 now do not finish within hours. We tried to isolate the change that causes the slowdown and traced it to these commits to the "InMemoryStateInternals" class:

* https://github.com/apache/beam/commit/32a427c
* https://github.com/apache/beam/commit/8151d82

In a nutshell, where the copy() method previously just assigned the reference:

  that.value = this.value

there is now a coder encode/decode round trip hidden behind:

  that.value = uncheckedClone(coder, this.value)
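
To illustrate why the two variants differ so much in cost, here is a minimal sketch. It does not use Beam at all: cloneViaEncoding is a hypothetical stand-in that uses plain Java serialization in place of a Beam coder, and CopySketch is just an illustrative name. The point is only that assigning a reference is constant-time, while an encode/decode round trip walks the whole value on every copy (and, as a side effect, yields an independent object).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;

final class CopySketch {

    // Hypothetical stand-in for a coder-based clone (NOT Beam's code):
    // encode the value to bytes, then decode a fresh instance from them.
    // The cost grows with the encoded size of the value.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T cloneViaEncoding(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("clone via encoding failed", e);
        }
    }

    public static void main(String[] args) {
        ArrayList<Integer> value = new ArrayList<>(Arrays.asList(1, 2, 3));

        // Old behaviour: both names point at the same object, O(1).
        ArrayList<Integer> shared = value;

        // New behaviour (sketched): full encode/decode, O(size of value).
        ArrayList<Integer> cloned = cloneViaEncoding(value);

        value.add(4);
        System.out.println(shared.size()); // 4, sees the later mutation
        System.out.println(cloned.size()); // 3, independent copy
    }
}
```

If copy() is invoked often on large state values, a round trip like this on every call could plausibly account for the kind of slowdown we observe, although we have not confirmed that this is the only factor.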

Can somebody explain the purpose of this change? Is it meant as an additional "enforcement" point, similar to DirectRunner's enforceImmutability and enforceEncodability? Or is it something that is genuinely needed to provide correct behaviour of the pipeline?

Any hints or thoughts are appreciated.