Thanks for your help! I will test other examples of Beam on Spark in the future and then report back the results.
Regards
devin
Thanks for the details.
I will take a look later tomorrow (I have another issue to investigate
on the Spark runner today for Beam 2.7.0 release).
On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> I tested a 300 MB data file,
> using a command like:
> ./spark-submit --master yarn --deploy-mode client --class
> com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory 1g
> I set only one executor, so the tasks run in sequence. One Beam task costs 10s.
> However, the equivalent Spark task costs only 0.4s.
> *From:* Jean-Baptiste Onofré <mailto:jb@xxxxxxxxxxxx>
> *Date:* 2018-09-19 12:22
> *To:* dev@xxxxxxxxxxxxxxx <mailto:dev@xxxxxxxxxxxxxxx>
> *Subject:* Re: How to optimize the performance of Beam on
> Spark(Internet mail)
> Did you compare the stages in the Spark UI in order to identify which
> stage is taking the time?
> Do you use spark-submit in both cases for the bootstrapping?
> I will do a test here as well.
> On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> > Hi,
> > Thanks for your reply.
> > Our team plans to use Beam instead of Spark, so I'm testing the
> > performance of the Beam API.
> > I'm coding some examples with the Spark API and the Beam API, like
> > "WordCount", "Join", "OrderBy", "Union", ...
> > I use the same resources and configuration to run these jobs.
> > Tim said I should remove "withNumShards(1)" and
> > set spark.default.parallelism=32. I did that and tried again, but the
> > Beam job is still running very slowly.
> > Here is my Beam code and Spark code:
> > Beam "WordCount":
> > Spark "WordCount":
> > I will try the other example later.
> > Regards
> > devin
> > *From:* Jean-Baptiste Onofré <mailto:jb@xxxxxxxxxxxx>
> > *Date:* 2018-09-18 22:43
> > *To:* dev@xxxxxxxxxxxxxxx <mailto:dev@xxxxxxxxxxxxxxx>
> > *Subject:* Re: How to optimize the performance of Beam on
> > Spark(Internet mail)
> > Hi,
> > The first huge difference is the fact that the Spark runner still uses
> > RDDs, whereas when using Spark directly you are using Datasets. A
> > bunch of optimizations in Spark are related to Datasets.
> > I started a large refactoring of the Spark runner to leverage Spark 2.x
> > (and Datasets).
> > It's not yet ready, as it includes other improvements (the
> > layer with the Job API, a first check of the state API, ...).
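(To illustrate the point above: a native Spark 2.x job expressed on the Dataset API goes through the Catalyst optimizer, which an RDD-based translation cannot benefit from. A rough sketch of a Dataset-based WordCount follows; the class name and paths are illustrative assumptions, not code from this thread.)

```java
// Hypothetical Dataset-based WordCount on Spark 2.x. Because it uses the
// Dataset API, the query plan is optimized by Catalyst, unlike an
// equivalent RDD-based job.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.split;

public class SparkDatasetWordCount {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("WordCount").getOrCreate();

    // textFile() yields a Dataset<String> with one row per line.
    Dataset<String> lines = spark.read().textFile("hdfs:///tmp/wordcount/input.txt");

    // split() produces an array column; explode() flattens it to one row
    // per word, then a simple groupBy/count does the aggregation.
    Dataset<Row> counts = lines
        .select(explode(split(col("value"), "[^\\p{L}]+")).as("word"))
        .groupBy("word")
        .count();

    counts.write().csv("hdfs:///tmp/wordcount/output");
    spark.stop();
  }
}
```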
> > Anyway, by Spark WordCount, do you mean the one included in the Spark
> > distribution?
> > Regards
> > JB
> > On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > > Hi,
> > > I'm testing Beam on Spark.
> > > I used the Spark example WordCount to process a 1 GB data
> > > file; it took 1 minute.
> > > However, using the Beam example WordCount to process the same
> > > file took 30 minutes.
> > > My Spark parameters are: --deploy-mode client
> > > --executor-memory 1g --num-executors 1 --driver-memory 1g
> > > My Spark version is 2.3.1 and my Beam version is 2.5.
> > > Is there any optimization method?
> > > Thank you.
> > >
> > >
> > --
> > Jean-Baptiste Onofré
> > jbonofre@xxxxxxxxxx
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com