The first big difference is that the Spark runner still uses the RDD API, whereas when you use Spark directly you work with Datasets. Many of Spark's optimizations apply only to Datasets.
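As a rough illustration (a sketch, not the runner's actual code; the input path and session setup are placeholders), here is the same word count written against both APIs. The Dataset version goes through Spark's query optimizer before execution, while the RDD version runs exactly as written:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: assumes a local Spark 2.x session and a hypothetical input file.
val spark = SparkSession.builder().master("local[*]").appName("wc").getOrCreate()

// RDD API (what the Spark runner uses): executed as-is, no query optimization.
val rddCounts = spark.sparkContext.textFile("input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)

// Dataset API: the plan is optimized before execution.
import spark.implicits._
val dsCounts = spark.read.textFile("input.txt")
  .flatMap(_.split("\\s+"))
  .groupByKey(identity)
  .count()
```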
I started a large refactoring of the Spark runner to leverage Spark 2.x. It's not yet ready, as it includes other improvements (the portability layer with the Job API, a first take on the State API, ...).
Anyway, by Spark wordcount, do you mean the one included in the Spark distribution?
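For reference, the Beam WordCount example can be launched on the Spark runner with something like the following (a sketch based on the Beam quickstart; the profile name, paths, and pipeline options may differ by version):

```shell
# Hypothetical invocation from the beam-examples project directory.
mvn compile exec:java \
  -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Pspark-runner \
  -Dexec.args="--runner=SparkRunner \
               --inputFile=/path/to/input.txt \
               --output=counts"
```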
On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> I'm testing Beam on Spark.
> I used the Spark example code WordCount to process a 1G data file, which cost 1
> However, the Beam example code WordCount processing the same file
> cost 30 minutes.
> My Spark parameter is : --deploy-mode client --executor-memory 1g
> --num-executors 1 --driver-memory 1g
> My Spark version is 2.3.1 and my Beam version is 2.5.
> Is there any optimization method?
> Thank you.
Talend - http://www.talend.com