[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Should Beam Python throw an error if DoFn returns a string?

On Fri, Aug 17, 2018 at 8:16 PM Pablo Estrada <pabloem@xxxxxxxxxx> wrote:
> Beam Python expects DoFns to return an iterable that contains the actual output elements. This is documented, and visible in examples, but it is also a bit counter-intuitive.
> We should definitely add a check in _OutputProcessor[1] to throw a more expressive error if it receives a non-iterable.

Yes. We want to be careful not to slow down the "just works" case though.

> Should we also let Beam error out if users return a string?
> e.g. consider the following pipeline:
> p | Create(['abc']) | ParDo(lambda x: x) | WriteToFile('myfile')
> This pipeline would write three separate elements. Is this not a bit awkward?

One can't do ParDo with lambdas (I think, that was the intent). Only
with Map or FlatMap which forces you to think about this more. It's
still an issue for DoFn.process methods of course.

> Erroring out when a string is returned would be the least surprising solution for users, as opposed to having their strings getting broken down into a bunch of single-char elements.
> A con is that there may be users already relying on this functionality, so that might be a breaking change. But I think it's still worth discussing.

I'm actually fine with this being an error. is It is when (slow) type
checks are enabled. If we do this by default, we should make sure the
check is fast. Note that this doesn't catch, say, returning tuples
like (key, value) that were intended to be elements not iterables of
elements. More typing would be very nice here.

- Robert