September 2018 | Newsletter
What’s been done
CI improvement (by: Etienne Chauchot)
For each new commit on master, the Nexmark suite is run in both batch and streaming mode on Spark, Flink, and Cloud Dataflow (thanks to Andrew), and dashboard graphs are produced to track functional and performance regressions.
Elasticsearch IO Supports Version 6 (by: Dat Tran)
Elasticsearch IO now supports version 6.x in addition to version 2.x and 5.x.
See the merged PR for more details.
KuduIO Added (by: Tim Robertson)
Apache Beam master now has KuduIO, which will be released with Beam 2.7.0.
See BEAM-2661 for more details.
What we’re working on...
Flink Portable Runner (by: Ankur Goenka, Maximilian Michels, Thomas Weise, Ryan Williams)
Support for streaming side inputs merged
Portable Compatibility Matrix tests pass in streaming mode
Many more ValidatesRunner tests pass (ValidatesRunner is a comprehensive suite of Beam test pipelines)
Python Pipelines can be tested without bringing up a JobServer first (it is started in a container)
Experimental support for executing the SDK harnesses in a process instead of a Docker container
Fixes for Beam bugs discovered while working on portability
State and Timer Support in Python SDK (by: Charles Chen, Robert Bradshaw)
This change adds the reference DirectRunner implementation of the Python User State and Timers API. With this change, a user can execute DoFns with state and timers on the DirectRunner.
New IO - HadoopOutputFormatIO (by: Alexey Romanenko)
Adds support for the MapReduce OutputFormat.
See BEAM-5310 for more details.
High-level Java 8 DSL (by: David Moravek, Vaclav Plajt, Marek Simunek)
Performance improvements for HDFS file writing operations (by: Tim Robertson)
Autocreate directories when doing an HDFS rename
See PR for more details
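To illustrate what "autocreate directories on rename" means, here is a minimal sketch on the local filesystem using only the Python standard library. It is an analogy, not the Beam HDFS code from the PR: the helper name rename_with_parents and the file paths are hypothetical, chosen only for this example.

```python
import os
import tempfile


def rename_with_parents(src: str, dst: str) -> None:
    """Move src to dst, creating dst's parent directories first if needed.

    Hypothetical helper for illustration only; it works on the local
    filesystem and is not part of Beam or HDFS.
    """
    parent = os.path.dirname(dst)
    if parent:
        # exist_ok=True makes this a no-op when the directory already exists.
        os.makedirs(parent, exist_ok=True)
    os.replace(src, dst)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # A temporary file standing in for a finished output shard.
        src = os.path.join(root, "tmp-part-00000")
        with open(src, "w") as f:
            f.write("record\n")

        # The destination directory tree does not exist yet.
        dst = os.path.join(root, "output", "2018", "09", "part-00000")
        rename_with_parents(src, dst)
        print(os.path.exists(dst))
```

Without the makedirs step, the rename would fail whenever the destination directory is missing, so a caller would have to check for and create it separately before each move.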
Recognition of non-code contributions (by: Gris Cuevas)
Reached consensus on recognizing non-code contributions
See discussion for more details
Planned launch date: Beam Summit London (October 2nd)
Weekly Community Updates (by: Gris Cuevas)
Some of the project’s subcomponents post weekly updates on the mailing list. We’ll be consolidating best practices to share a weekly community update with all project-related must-knows in a nutshell
Beam Cookbook (by: Austin Bennett, David Cavazos, Gris Cuevas, Andrea Foegler, Rose Nguyen, Connell O'Callaghan, and you!)
We are creating a cookbook for common data science tasks in Beam and have started brainstorming
We want to have a hackathon after the London Summit to generate content from the community
There will be a session at the summit to gather more ideas and input. Watch the dev and users mailing lists for a call for contributions soon!
Beam 2.7.0 release (by: Charles Chen)
Beam Mascot (by: Gris Cuevas & Community!)
We got approval to launch a contest to create a new Apache Beam mascot
See discussion for more details. If you’re interested in driving this, reach out in the thread!
Planned launch date: Last week of September
Đạt Trần, Ho Chi Minh City, Vietnam
See BEAM-5107 for more details on “Support ES-6.x for ElasticsearchIO”
Ravi Pathak, Copenhagen, Denmark
Using Beam for indexing open data on species at GBIF.org
Improving robustness of SolrIO
Tim Robertson, Copenhagen, Denmark
Events, Talks & Meetups
[Coming Up] Beam Summit @ London, England
Organized by: Matthias Baetens, Victor Kotai, Alex Van Boxel & Gris Cuevas
The Beam Summit London 2018 will take place on October 1 and 2 in London.
If you’re interested in speaking, reach out to gris@xxxxxxxxxx
[Coming Up] ApacheCon @ Montréal, Canada
Will take place Sep 24-27
Etienne Chauchot will give a talk on Universal Metrics with Beam
Alexey Romanenko and Ismaël Mejía will give a talk on Building portable and evolvable data-intensive applications with Apache
Ismaël Mejía and Eugene Kirpichov will give a talk on Robust, performant and modular APIs for data ingestion with Apache Beam
Gris Cuevas will host a Birds of a Feather session on 9/26: Design Thinking to manage online communities in Open Source Projects… It’ll be a Beam get-together with food & swag. Join us!
[Coming Up] DataEngConf @ Barcelona, Spain
Will take place Sep 25-26
Maximilian Michels will give an introduction to Beam and its portability features.
[Occurred] OSCON @ Portland, OR, USA (by: Holden Karau & Gris Cuevas)
Holden Karau gave a talk on TFT/TFMA + Beam on Flink (and other related adventures).
[Occurred] Open Challenge @ Guadalajara, Mexico (by: OSoM, IBM & Google)
Arianne Navarro, Hector Paredes, Pablo Estrada & Gris Cuevas hosted a hackathon for Apache Beam and BlueXolo. Results include 3 PRs for Beam and 8 software engineers introduced to Apache Beam
[Occurred] Open Source Summit @ Vancouver, Canada
Gris Cuevas gave a talk on active diversification in Open Source, slides here
Ismaël Mejía gave a talk on Apache Beam, see details here
[Occurred] Flink Forward @ Berlin, Germany
Setting up a Java Development Env Beam on GCP (by: Jacob Ferriero)
This post will help you get a development environment up and running to start developing Java Dataflow jobs. By the end you’ll be able to run an Apache Beam job locally in debug mode, execute code in a REPL to speed up your development cycles, and submit your job to Google Cloud Dataflow. Medium Post.
Coding Apache Beam in your Web Browser (by: Daniel De Leo)
But what happens when you’re on the go on a computer which doesn’t support your IDE of choice, or you’re using someone else’s computer and need to develop Apache Beam pipelines? Google has you covered! Google’s Cloud Shell comes with a built-in Code Editor for developing/modifying code (it’s based on Eclipse’s Orion). It’s not as full featured as an IDE but it does beat using Vim or Emacs to edit code! Medium Post.
Building a real time quant trading engine on Dataflow and Beam (by: Lei He)
In this post, we are going to build a data pipeline that analyzes real time stock tick data streamed from gCloud Pub/Sub, runs them through a pair correlation trading algorithm, and outputs trading signals onto Pub/Sub for execution. Medium Post.
Apache Beam: Reading from S3 and writing to BigQuery (by: Asa Harland)
In this article we look at how we can use Apache Beam to extract data from AWS S3 (or Google Cloud Storage), run some aggregations over the data and store the result in BigQuery. Medium Post.
Apache Beam Events & Meetups
Join our Slack channel!
Until Next Time!
This edition was curated by our community of contributors, committers and PMCs. It contains work done in August 2018 and ongoing efforts. We hope to provide visibility to what's going on in the community, so if you have questions, feel free to ask in this thread.
--Rose Thị Nguyễn