Parallelism is 1584.

flink run -m yarn-cluster -yn 792 -ys 2 -ytm 14000 -yjm 114736 -p 1584
The code or the execution plan (ExecutionEnvironment.getExecutionPlan()) of the job would be interesting.

2018-08-08 10:26 GMT+02:00 Chesnay Schepler <chesnay@xxxxxxxxxx>:

What have you tried so far to increase performance? (Did you try different combinations of -yn and -ys?)
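As a reference for the plan request above, this is a minimal sketch of how the JSON execution plan can be printed instead of running the job (the dataset and map function are placeholders, not the author's actual pipeline; requires the flink-java dependency on the classpath):

```java
// Minimal sketch: dump a batch job's execution plan as JSON.
// The plan is derived from the sinks, so the job needs at least one sink.
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.DiscardingOutputFormat;

public class PrintPlan {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder job body standing in for the real pipeline.
        env.fromElements(1, 2, 3)
           .map(i -> i * 2)
           .output(new DiscardingOutputFormat<>());

        // JSON description of the optimized plan; this can be pasted
        // into the Flink plan visualizer instead of calling execute().
        System.out.println(env.getExecutionPlan());
    }
}
```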
Can you provide us with your application? What source/sink are you using?
On 08.08.2018 07:59, Ravi Bhushan Ratnakar wrote:
Currently I am working on a project where I need to write a Flink batch application that has to process around 400 GB of compressed sequence-file data every hour. After processing, it has to write the output as compressed Parquet to S3.
I have managed to write the application in Flink; it successfully processes the whole hour of data and writes it in Parquet format to S3. The problem is that it is not able to match the performance of the existing application, a Spark batch job that is already running in production.
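The shape of such a pipeline can be sketched as follows (a hedged sketch, not the author's code: the path, key/value types and the transform are placeholders, and it assumes flink-hadoop-compatibility is on the classpath for readSequenceFile):

```java
// Sketch of an hourly batch job: read compressed SequenceFiles,
// transform, and write results out. Paths and types are placeholders.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;

public class HourlyBatchJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read one hour of compressed SequenceFile input.
        DataSet<Tuple2<Text, BytesWritable>> input =
            env.readSequenceFile(Text.class, BytesWritable.class,
                                 "s3://my-bucket/input/2018-08-08-10/"); // placeholder

        // ... parse, transform, aggregate ...

        // Writing compressed Parquet to S3 is typically done by wrapping
        // parquet-hadoop's ParquetOutputFormat in Flink's HadoopOutputFormat;
        // the exact wiring depends on the record schema and is omitted here.

        env.execute("hourly-batch");
    }
}
```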
Current Spark batch
Cluster size - AWS EMR - 1 master + 100 worker nodes of m4.4xlarge (16 vCPU, 64 GB RAM), each instance with a 160 GB disk volume
Input data - around 400 GB
Time taken to process - around 36 mins
Flink batch
Cluster size - AWS EMR - 1 master + 100 worker nodes of r4.4xlarge (16 vCPU, 122 GB RAM), each instance with a 630 GB disk volume
Transient job - flink run -m yarn-cluster -yn 792 -ys 2 -ytm 14000 -yjm 114736
Input data - around 400 GB
Time taken to process - around 1 hour
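The flags above decompose as a bit of back-of-the-envelope arithmetic (a sketch; the numbers are taken directly from the command line, and slot/memory semantics follow the legacy Flink-on-YARN flags):

```java
// Back-of-the-envelope arithmetic for the YARN session flags above.
// -yn  = number of TaskManager containers
// -ys  = slots per TaskManager
// -ytm = memory per TaskManager in MB
public class FlinkYarnSizing {
    static final int TASK_MANAGERS = 792;   // -yn
    static final int SLOTS_PER_TM  = 2;     // -ys
    static final int TM_MEMORY_MB  = 14000; // -ytm

    // Maximum usable parallelism: one parallel subtask per slot.
    static int totalSlots() {
        return TASK_MANAGERS * SLOTS_PER_TM;
    }

    // Total TaskManager memory requested from YARN, in MB.
    static long totalTmMemoryMb() {
        return (long) TASK_MANAGERS * TM_MEMORY_MB;
    }

    public static void main(String[] args) {
        System.out.println(totalSlots() + " slots");         // 1584, hence -p 1584
        System.out.println(totalTmMemoryMb() + " MB total"); // 11,088,000 MB, roughly 11 TB
    }
}
```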
I have given all of one node's memory to the jobmanager, just to make sure that the jobmanager has a dedicated node and doesn't run into any resource issues.
We are already running the Flink batch job with double the RAM compared to the Spark batch job, but we are still not able to get the same performance.
Kindly suggest how we can achieve the same performance as we are getting from Spark batch.
[Attachment: Screen Shot 2018-08-08 at 14.18.52.png (PNG image)]