Re: Beam Custom I/O Read Transform

Currently, the Python SDK doesn't have a transform for reading XML files. Your best bet is probably to use the Python SDK's file system abstraction [1] to read the XML files from a custom ParDo. Adding a Reshuffle transform [2] right after that step will also allow Dataflow to better rebalance the steps that come after reading.
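
A rough sketch of what that could look like is below. It is untested against your data and makes several assumptions: the helper names (parse_records, read_xml_file, run) and the record_tag parameter are made up for illustration, each record is assumed to be a flat element whose children map to fields, and beam.FlatMap is used as a shorthand for a custom ParDo. The Beam imports sit inside the functions so the parsing helper can be exercised without Beam installed.

```python
import xml.etree.ElementTree as ET


def parse_records(file_obj, record_tag):
    """Yield one dict per <record_tag> element (hypothetical flat schema)."""
    # iterparse streams the document instead of loading it all into memory.
    for _, elem in ET.iterparse(file_obj, events=("end",)):
        if elem.tag == record_tag:
            yield {child.tag: child.text for child in elem}
            elem.clear()  # drop the children of the record already emitted


def read_xml_file(path, record_tag):
    """Open `path` via Beam's FileSystems (local paths, gs://, ...) and parse it."""
    from apache_beam.io.filesystems import FileSystems

    with FileSystems.open(path) as f:
        for record in parse_records(f, record_tag):
            yield record


def run(file_pattern, record_tag, pipeline_args=None):
    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems
    from apache_beam.options.pipeline_options import PipelineOptions

    # Expand the glob at pipeline construction time; one element per file.
    matches = FileSystems.match([file_pattern])[0].metadata_list
    paths = [m.path for m in matches]

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        (p
         | "Paths" >> beam.Create(paths)
         | "ReadXml" >> beam.FlatMap(read_xml_file, record_tag)
         # Break fusion so the runner can redistribute records downstream.
         | "Rebalance" >> beam.Reshuffle()
         | "Print" >> beam.Map(print))
```

Note that each file is still parsed by a single worker, so one very large XML file will not be split. The Reshuffle only helps the steps that follow the read.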


On Fri, Nov 9, 2018 at 3:53 PM Sean Schwartz <sean@xxxxxxxxxxx> wrote:

My company, SwiftIQ, uses Google Dataflow for our large-scale data processing pipeline. Our codebase is currently in Java. We are looking at Python, but I'm having trouble determining whether our Dataflow pipeline can be supported using Python.

The first step of our pipeline should be an I/O Read Transform of an XML file. I see that this package exists in Java; however, I'm not finding it as a module in Python.

Is there a Python module that does this? If not, is there a way to write our own custom Read Transform that reads an XML file into a PCollection?

A quick response would be greatly appreciated.


Sean Schwartz
Data Engineer
Cell: 847.772.0240