
Re: Stream collector serialization performance

Hi Mingliang,

first of all, the POJO serializer is not very performant; Tuple or Row types serialize faster. If you want to improve the performance of a collect() between operators, you could also enable object reuse. You can read more about this here [1] (section "Issue 2: Object Reuse"), but make sure your implementation is correct, because with object reuse enabled operators share object instances: an operator that modifies a record in place can corrupt records that other operators still hold references to.
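To make the pitfall concrete, here is a plain-Java sketch (no Flink involved, with a hypothetical Event class) of what can go wrong when the runtime hands the same instance to chained operators instead of a copy: one "operator" buffers the record while another mutates it in place.

```java
import java.util.ArrayList;
import java.util.List;

public class ObjectReuseHazard {
    static class Event {
        String payload;
        Event(String payload) { this.payload = payload; }
    }

    public static void main(String[] args) {
        Event reused = new Event("original");

        // "Operator 1" buffers the record, e.g. for a window.
        List<Event> buffer = new ArrayList<>();
        buffer.add(reused); // stores the reference, not a copy

        // "Operator 2" mutates the shared instance in place.
        reused.payload = "mutated";

        // Operator 1's buffered record has silently changed too.
        System.out.println(buffer.get(0).payload); // prints "mutated"
    }
}
```

This is why object reuse is only safe if your functions neither cache references to input records across invocations nor modify records after emitting them.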

I hope this helps.


[1] https://data-artisans.com/blog/curious-case-broken-benchmark-revisiting-apache-flink-vs-databricks-runtime
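For reference, both switches discussed in this thread live on the ExecutionConfig. A configuration sketch (assumes a standard StreamExecutionEnvironment and Flink on the classpath; not runnable standalone):

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().enableForceKryo();   // force Kryo instead of the POJO serializer
env.getConfig().enableObjectReuse(); // skip defensive copies between chained operators
```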

On 15.08.18 at 09:06, 祁明良 wrote:
Hi all,

I’m currently using a keyed process function, and I see serialization happening both when I collect an object and when I update it in RocksDB. For me, serialization seems to be the performance bottleneck.
By default the POJO serializer is used, and the time cost of collect vs. update-to-RocksDB is roughly 1:1. I then switched to Kryo via getConfig.enableForceKryo(). Now the time cost of the RocksDB update decreases significantly, to roughly 0.3, but the collect method does not seem to improve. Can someone help explain this?

My object looks somewhat like this:

class A {
    String f1;     // 20 String fields
    List<B> f2;    // 20 lists of another POJO
    int f3;        // 20 int fields
}

class B {
    String f;      // 5 String fields
}

