[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PROPOSAL] Additional design for the Beam Python State and Timers API

On Fri, Oct 26, 2018 at 6:47 PM Kenneth Knowles <kenn@xxxxxxxxxx> wrote:
> It all sounds very useful but I have basic concerns about item 1. The doc doesn't really seem to go into the design concerns that I have in mind.
>  - map / flatMap are universal functions with definitions that we don't own and shouldn't violate
>  - corollary: map / flatMap have per element parallelism with no dependencies between them
>  - the doc says "map is just implemented as DoFn.process()" which is implementation, not the spec

That's an interesting perspective. I had always thought of Map/FlatMap
as convenient ways of defining a ParDo from a lambda without the
baggage of having to provide a full DoFn class, similar to MapElements
in Java, but your interpretation is that the names themselves imply no
inter-element interaction (e.g. via state) is allowed.

> So suggestions:
>  - how about just give it a new name making it clear it is not map nor flatMap and does not have per-element parallelism
>  - spec out the functionality without reference to DoFn
>  - be explicit about what determines the maximum parallelism / which elements are required to be processed serially (generally key+window)

These probably don't merit a new (pair of) named operation(s). The
motivation to add them was "why can I use state in a DoFn, but not in
Map/FlatMap?" which could be justified by the above.

As for the side input question, definitely (A).

> On Fri, Oct 26, 2018 at 2:49 AM Robert Bradshaw <robertwb@xxxxxxxxxx> wrote:
>> Thanks. They make sense to me (and would have been handy when I was
>> writing the state tests for the Fn API).
>> On Fri, Oct 26, 2018 at 10:48 AM Charles Chen <ccy@xxxxxxxxxx> wrote:
>> >
>> > Hey there,
>> >
>> > A while back, I shared the Beam Python State and Timers API proposal (https://s.apache.org/beam-python-user-state-and-timers) with this list [1]; we reached consensus on the features proposed there and I implemented the API surface described there, along with the reference DirectRunner implementation and some shared code for executing such pipelines (see for example https://github.com/apache/beam/pull/5691 and https://github.com/apache/beam/pull/6304).
>> >
>> > I would like to propose some additional design considerations to improve upon the previous State / Timer API proposal as described here (on pages 14-18 that I have appended to the doc): https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.10nb33sz7u16
>> >
>> > Briefly, these are the additional design considerations proposed to improve upon the API:
>> >
>> > Allowing access to user state / timers in Map / FlatMap callables / lambdas
>> > Allowing access to the key during the timer callback
>> > Allowing access to auxiliary timer data
>> > Allowing access to side inputs during the timer callback
>> >
>> > I would really appreciate any feedback you may have.  Thanks!
>> >
>> > Best,
>> > Charles
>> >
>> >
>> > [1] Previous dev@ discussion thread: https://lists.apache.org/thread.html/51ba1a00027ad8635bc1d2c0df805ce873995170c75d6a08dfe21997@%3Cdev.beam.apache.org%3E