Apache Beam Newsletter - September 2018


September 2018 | Newsletter

What’s been done

CI improvement (by: Etienne Chauchot)

  • For each new commit on master, the Nexmark suite is run in both batch and streaming mode on Spark, Flink, and Cloud Dataflow (thanks to Andrew), and dashboard graphs are produced to track functional and performance regressions.

Elasticsearch IO Supports Version 6 (by: Dat Tran)

  • Elasticsearch IO now supports version 6.x in addition to version 2.x and 5.x.

  • See the merged PR for more details.

KuduIO Added (by: Tim Robertson)

  • Apache Beam master now has KuduIO that will be released with Beam 2.7.0.

  • See BEAM-2661 for more details.

What we’re working on...

Flink Portable Runner (by: Ankur Goenka, Maximilian Michels, Thomas Weise, Ryan Williams)

  • Support for streaming side inputs merged

  • Portable Compatibility Matrix tests pass in streaming mode

  • Many more ValidatesRunner tests pass (ValidatesRunner is a comprehensive suite of Beam test pipelines)

  • Python pipelines can be tested without bringing up a JobServer first (it is started in a container)

  • Experimental support for executing the SDK harnesses in a process instead of a Docker container

  • Fixes for Beam bugs discovered while working on portability

State and Timer Support in Python SDK (by: Charles Chen, Robert Bradshaw)

  • This change adds the reference DirectRunner implementation of the Python User State and Timers API. With this change, a user can execute DoFns with state and timers on the DirectRunner.

  • See the design doc and PR for more details.

New IO - HadoopOutputFormatIO (by: Alexey Romanenko)

  • Adds support for MapReduce OutputFormat.

  • See BEAM-5310 for more details.

High-level Java 8 DSL (by: David Moravek, Vaclav Plajt, Marek Simunek)

Performance improvements for HDFS file writing operations (by: Tim Robertson)

  • Autocreate directories when doing an HDFS rename

  • See PR for more details
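
As a rough illustration of the general idea only, and not Beam's actual HDFS code, the sketch below shows a rename that creates the destination's missing parent directories first. It uses Python's standard library against the local filesystem, and `rename_with_parents` is a hypothetical helper name chosen for this example.

```python
import os
import tempfile


def rename_with_parents(src: str, dst: str) -> None:
    """Rename src to dst, creating dst's parent directories first if needed.

    Hypothetical helper for illustration; it works on the local
    filesystem, not on HDFS.
    """
    parent = os.path.dirname(dst)
    if parent:
        # exist_ok=True makes this a no-op when the directory already exists.
        os.makedirs(parent, exist_ok=True)
    os.rename(src, dst)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Write a temporary output file, then move it into a directory
        # tree that does not exist yet.
        src = os.path.join(tmp, "part-00000.tmp")
        with open(src, "w") as f:
            f.write("records")
        dst = os.path.join(tmp, "output", "2018", "09", "part-00000")
        rename_with_parents(src, dst)
        print(os.path.exists(dst))
```

Creating the parent directory as part of the rename spares the caller a separate existence check before every move.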

Recognition of non-code contributions (by: Gris Cuevas)

  • Got consensus about recognizing non-code contributions

  • See discussion for more details

  • Planned launch date: Beam Summit London (October 2nd)

Weekly Community Updates (by: Gris Cuevas)

  • Some of the project’s subcomponents post weekly updates on the mailing list. We’ll be consolidating best practices to share a weekly community update with all the project-related must-knows in a nutshell.

What’s planned

Beam Cookbook (by: Austin Bennett, David Cavazos, Gris Cuevas, Andrea Foegler, Rose Nguyen, Connell O'Callaghan, and you!)

  • We are creating a cookbook for common data science tasks in Beam and have started brainstorming

  • We want to have a hackathon after the London Summit to generate content from the community

  • There will be a session at the summit to gather more ideas and input. Watch the dev and users mailing lists for a call for contributions soon!

Beam 2.7.0 release (by: Charles Chen)

Beam Mascot (by: Gris Cuevas & Community!)

  • We got approval to launch a contest to create a new Apache Beam mascot

  • See discussion for more details. If you’re interested in driving this, reach out in the thread!

  • Planned launch date: Last week of September

New Members

New Contributors

  • Đạt Trần, Ho Chi Minh City, Vietnam

    • See BEAM-5107 for more details on “Support ES-6.x for ElasticsearchIO”

  • Ravi Pathak, Copenhagen, Denmark

    • Using Beam for indexing open data on species at GBIF.org

    • Improving robustness of SolrIO

New Committers

  • Tim Robertson, Copenhagen, Denmark

Events, Talks & Meetups

[Coming Up] Beam Summit @ London, England

  • Organized by: Matthias Baetens, Victor Kotai, Alex Van Boxel & Gris Cuevas

  • The Beam Summit London 2018 will take place on October 1 and 2 in London.

  • If you’re interested in speaking, reach out to gris@xxxxxxxxxx

  • More info can be found in the blog post and you can get your tickets on Eventbrite

[Coming Up] ApacheCon @ Montréal, Canada

[Coming Up] DataEngConf @ Barcelona, Spain

[Occurred] OSCON @ Portland, OR, USA (by: Holden Karau & Gris Cuevas)

[Occurred] Open Challenge @ Guadalajara, Mexico (by: OSoM, IBM & Google)

  • Arianne Navarro, Hector Paredes, Pablo Estrada & Gris Cuevas hosted a hackathon for Apache Beam and BlueXolo. Results include 3 PRs for Beam and 8 software engineers introduced to Apache Beam.

[Occurred] Open Source Summit @ Vancouver, Canada

  • Gris Cuevas gave a talk on active diversification in Open Source, slides here

  • Ismael Mejia gave a talk on Apache Beam, see details here

[Occurred] Flink Forward @ Berlin, Germany

  • Robert Bradshaw and Maximilian Michels gave a talk on Universal Machine Learning with Apache Beam, schedule, slides

  • Aljoscha Krettek and Thomas Weise gave a talk on Python Streaming Pipelines with Beam on Flink, schedule, slides


Setting up a Java Development Env Beam on GCP (by: Jacob Ferriero)

  • This post will help you get a development environment up and running to start developing Java Dataflow jobs. By the end you’ll be able to run an Apache Beam pipeline locally in debug mode, execute code in a REPL to speed up your development cycles, and submit your job to Google Cloud Dataflow. Medium Post.

Coding Apache Beam in your Web Browser (by: Daniel De Leo)

  • But what happens when you’re on the go on a computer which doesn’t support your IDE of choice, or you’re using someone else’s computer and need to develop Apache Beam pipelines? Google has you covered! Google’s Cloud Shell comes with a built-in Code Editor for developing/modifying code (it’s based on Eclipse’s Orion). It’s not as full featured as an IDE but it does beat using Vim or Emacs to edit code! Medium Post.

Building a real time quant trading engine on Dataflow and Beam (by: Lei He)

  • In this post, we are going to build a data pipeline that analyzes real time stock tick data streamed from gCloud Pub/Sub, runs them through a pair correlation trading algorithm, and outputs trading signals onto Pub/Sub for execution. Medium Post.

Apache Beam: Reading from S3 and writing to BigQuery (by: Asa Harland)

  • In this article we look at how we can use Apache Beam to extract data from AWS S3 (or Google Cloud Storage), run some aggregations over the data and store the result in BigQuery. Medium Post.

Apache Beam Events & Meetups

Until Next Time!

This edition was curated by our community of contributors, committers and PMC members. It contains work done in August 2018 and ongoing efforts. We hope to provide visibility into what's going on in the community, so if you have questions, feel free to ask in this thread.
Rose Thị Nguyễn