[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Routing events by key

Reshuffle for the purpose of ensuring stable inputs is deprecated, but this seems a valid(ish) usecase. 

Currently, all runners implement GroupByKey by sending everything with the same key to the same machine, however this is not guaranteed by the Beam model, and changing it has been tossed around (e.g. when using fixed windows, one could send different key-window pairs to different machines). Even when this is the case, there's no guarantee that there won't be backup workers processing the same key, or even if the runner doesn't do backups, zombie workers (e.g. where we thought the worker died, and allocated its work elsewhere, but it turns out the worker is till churning away possibly causing side effects). Such is the nature of a distributed system. 

If you go the GBK route, rather than windowing by 1 second, you could us an AfterPane.elementCountAtLeast(1) trigger [1] (even in the global window) for absolute minimal latency. https://beam.apache.org/documentation/programming-guide/#data-driven-triggers . This is essentially what Reshuffle does. 

- Robert

On Fri, Jul 6, 2018 at 11:11 PM Jean-Baptiste Onofré <jb@xxxxxxxxxxxx> wrote:
Hi Raghu,

AFAIR, Reshuffle is considered as deprecated. Maybe it would be better
to avoid to use it no ?


On 06/07/2018 18:28, Raghu Angadi wrote:
> I would use Reshuffle()[1] with entity id as the key. It internally does
> a GroupByKey and sets up windowing such that it does not buffer anything.
> [1]
> : https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java#L51 
> On Fri, Jul 6, 2018 at 8:01 AM Niels Basjes <Niels@xxxxxxxxx
> <mailto:Niels@xxxxxxxxx>> wrote:
>     Hi,
>     I have an unbounded stream of change events each of which has the id
>     of the entity that is changed.
>     To avoid the need for locking in the persistence layer that is
>     needed in part of my processing I want to route all events based on
>     this entity id.
>     That way I know for sure that all events around a single entity go
>     through the same instance of my processing sequentially, hence no
>     need for locking or other synchronization regarding this persistence.
>     At this point my best guess is that I need to use the GroupByKey but
>     that seems to need a Window. 
>     I think I don't want a window because the stream is unbounded and I
>     want the lowest possible latency (i.e. a Window of 1 second would be
>     ok for this usecase).
>     Also I want to be 100% sure that all events for a specific id go to
>     only a single instance because I do not want any race conditions.
>     My simple question is: What does that look like in the Beam Java API?
>     --
>     Best regards / Met vriendelijke groeten,
>     Niels Basjes

Jean-Baptiste Onofré
Talend - http://www.talend.com