
Should Beam Python throw an error if DoFn returns a string?

Beam Python expects DoFns to return an iterable that contains the actual output elements. This is documented, and visible in examples, but it is also a bit counter-intuitive.

We should definitely add a check in _OutputProcessor[1] to throw a more expressive error if it receives a non-iterable.
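To make the idea concrete, here is a minimal sketch of such a check. It is not the actual _OutputProcessor code: the helper name check_dofn_result and its dofn_name parameter are made up for illustration, and treating None as "no output" is an assumption of this sketch.

```python
from collections.abc import Iterable


def check_dofn_result(result, dofn_name="<DoFn>"):
    """Hypothetical helper: validate the value a DoFn's process() returned."""
    # Assumption of this sketch: None means "no output elements".
    if result is None:
        return []
    # isinstance(..., Iterable) accepts lists, tuples, generators, etc.
    if not isinstance(result, Iterable):
        raise TypeError(
            f"{dofn_name} returned {result!r} of type "
            f"{type(result).__name__}, which is not iterable. "
            "Return an iterable of output elements instead, "
            "e.g. `return [value]` or `yield value`.")
    return result
```

With this in place, a DoFn that does `return 5` would fail with a message naming the DoFn and suggesting `return [5]`, instead of a bare "object is not iterable" error from deeper in the runner.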

Should we also make Beam raise an error if users return a string?
For example, consider the following pipeline:
p | Create(['abc']) | ParDo(lambda x: x) | WriteToText('myfile')

This pipeline would write three separate elements ('a', 'b', and 'c') rather than the single string 'abc'. Is this not a bit awkward?
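The splitting can be reproduced without Beam. The sketch below uses a plain-Python stand-in for ParDo (the flat_map function is a hypothetical helper, not a Beam API) that flattens whatever the function returns, which is the behavior described above.

```python
def flat_map(fn, elements):
    """Plain-Python stand-in for a ParDo: flatten whatever fn returns."""
    for element in elements:
        # A str is itself iterable, so it is walked character by character.
        for output in fn(element):
            yield output


# Returning the string itself: it is broken into single characters.
split = list(flat_map(lambda x: x, ['abc']))

# Wrapping the string in a list: it is kept as one element.
whole = list(flat_map(lambda x: [x], ['abc']))

print(split)  # ['a', 'b', 'c']
print(whole)  # ['abc']
```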

Raising an error when a string is returned would be the least surprising behavior for users, as opposed to silently breaking their strings down into single-character elements.
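A rough sketch of what the stricter check could look like follows. The function name reject_string_result is hypothetical and this is not existing Beam code; it only handles str, as discussed here.

```python
def reject_string_result(result):
    """Hypothetical strict check: refuse a bare str returned from a DoFn."""
    if isinstance(result, str):
        raise TypeError(
            f"DoFn returned the string {result!r}, which would be split "
            f"into {len(result)} single-character elements. "
            "Wrap it in a list (`return [value]`) or use `yield value`.")
    # Anything else is passed through unchanged.
    return result
```

Users who really do want one element per character could still get that explicitly with `return list(value)`, since a list of characters passes the check.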

The downside is that some users may already rely on this behavior, so this could be a breaking change. But I think it's still worth discussing.


[1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/common.py#L659
Got feedback? go/pabloem-feedback