osdir.com


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PROPOSAL] Additional design for the Beam Python State and Timers API


My concerns are around item 4 (left the same comments in the doc).

What window should timers be using when looking up a side input?

A) The window corresponding to the element that set the original timer.
B) The window that would have been assigned based upon when the timer is scheduled to fire.
C) The window that would have been assigned from the "output" watermark hold time

A makes the most sense to me since it represents the side input that the original element would have accessed. This allows people to schedule a timer to "wait" for a side input refresh. It also handles the side input push back issue.

I'm not sure if B or C would allow different useful user scenarios that you would not be able to capture otherwise.

How does any of these strategies impact the side input garbage collection?

On Fri, Oct 26, 2018 at 9:47 AM Kenneth Knowles <kenn@xxxxxxxxxx> wrote:
It all sounds very useful but I have basic concerns about item 1. The doc doesn't really seem to go into the design concerns that I have in mind.

 - map / flatMap are universal functions with definitions that we don't own and shouldn't violate
 - corollary: map / flatMap have per element parallelism with no dependencies between them
 - the doc says "map is just implemented as DoFn.process()" which is implementation, not the spec

So suggestions:

 - how about just give it a new name making it clear it is not map nor flatMap and does not have per-element parallelism
 - spec out the functionality without reference to DoFn
 - be explicit about what determines the maximum parallelism / which elements are required to be processed serially (generally key+window)

Kenn

On Fri, Oct 26, 2018 at 2:49 AM Robert Bradshaw <robertwb@xxxxxxxxxx> wrote:
Thanks. They make sense to me (and would have been handy when I was
writing the state tests for the Fn API).
On Fri, Oct 26, 2018 at 10:48 AM Charles Chen <ccy@xxxxxxxxxx> wrote:
>
> Hey there,
>
> A while back, I shared the Beam Python State and Timers API proposal (https://s.apache.org/beam-python-user-state-and-timers) with this list [1]; we reached consensus on the features proposed there and I implemented the API surface described there, along with the reference DirectRunner implementation and some shared code for executing such pipelines (see for example https://github.com/apache/beam/pull/5691 and https://github.com/apache/beam/pull/6304).
>
> I would like to propose some additional design considerations to improve upon the previous State / Timer API proposal as described here (on pages 14-18 that I have appended to the doc): https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.10nb33sz7u16
>
> Briefly, these are the additional design considerations proposed to improve upon the API:
>
> Allowing access to user state / timers in Map / FlatMap callables / lambdas
> Allowing access to the key during the timer callback
> Allowing access to auxiliary timer data
> Allowing access to side inputs during the timer callback
>
> I would really appreciate any feedback you may have.  Thanks!
>
> Best,
> Charles
>
>
> [1] Previous dev@ discussion thread: https://lists.apache.org/thread.html/51ba1a00027ad8635bc1d2c0df805ce873995170c75d6a08dfe21997@%3Cdev.beam.apache.org%3E