Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

    Thanks for your reply.
    Our team plans to use Beam instead of Spark, so I'm testing the performance of the Beam API.
    I'm writing some examples with both the Spark API and the Beam API, such as "WordCount", "Join", "OrderBy", "Union" ...
    I use the same resources and configuration to run these jobs.
   Tim said I should remove "withNumShards(1)" and set spark.default.parallelism=32. I did that and tried again, but the Beam job still runs very slowly.
    Here is my Beam code and Spark code:
   Beam "WordCount":
   Spark "WordCount":

   I will try the other examples later.
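
   As a rough sketch, the run configuration described in this thread could be written as Spark properties like this. It assumes the settings are placed in spark-defaults.conf, which the thread does not confirm; the same keys can be passed to spark-submit with --conf instead. The last four entries mirror the --deploy-mode, --executor-memory, --num-executors and --driver-memory flags quoted below.

```
# Assumed location: conf/spark-defaults.conf (not stated in the thread)
# Parallelism value suggested by Tim
spark.default.parallelism   32
# Equivalents of the spark-submit flags from the original message
spark.submit.deployMode     client
spark.executor.memory       1g
spark.executor.instances    1
spark.driver.memory         1g
```

   Note that with a single 1g executor, the number of tasks that actually run at the same time is still limited by that executor's cores, whatever the parallelism setting.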
Date: 2018-09-18 22:43
Subject: Re: How to optimize the performance of Beam on Spark(Internet mail)


The first big difference is that the Spark runner still uses RDDs,
whereas when you use Spark directly, you are using Datasets. A bunch of
optimizations in Spark are related to Datasets.

I started a large refactoring of the Spark runner to leverage Spark 2.x
(and Datasets).
It's not ready yet, as it includes other improvements (the portability
layer with the Job API, a first check of the State API, ...).

Anyway, by Spark WordCount, do you mean the one included in the Spark
distribution?


On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> Hi,
>     I'm testing Beam on Spark. 
>     The Spark example WordCount takes 1 minute to process a 1 GB data
> file.
>     However, the Beam example WordCount takes 30 minutes to process the
> same file.
>     My Spark parameters are: --deploy-mode client --executor-memory 1g
> --num-executors 1 --driver-memory 1g
>     My Spark version is 2.3.1 and my Beam version is 2.5.
>     Is there any way to optimize this?
> Thank you.

Jean-Baptiste Onofré
