Re: SplittableDoFn

Thanks for taking this up. I added some comments to the doc. A European-friendly time for discussion would be great. 

On Fri, Aug 31, 2018 at 3:14 AM Lukasz Cwik <lcwik@xxxxxxxxxx> wrote:
I came up with a proposal[1] for a progress model based solely on the backlog, in which splits are based on the remaining backlog we want the SDK to split at. I also give recommendations to runner authors on how an autoscaling system could work based on the measured backlog. Past discussions around progress reporting and splitting have always been about finding an optimal solution. After reading a lot about work stealing, I don't believe there is a general solution, and it really is up to SplittableDoFns to be well behaved. I did not do much work on classifying what a well-behaved SplittableDoFn is, though. Much of this work builds on ideas that Eugene had documented in the past[2].

I could use the community's wide knowledge of different I/Os to see whether computing the backlog is practical in the way that I'm suggesting, and to gather people's feedback.

If there is a lot of interest, I would like to hold a community video conference between Sept 10th and 14th about this topic. Please reply with your availability by Sept 6th if you're interested.

On Mon, Aug 13, 2018 at 10:21 AM Jean-Baptiste Onofré <jb@xxxxxxxxxxxx> wrote:
Awesome!

Thanks, Luke!

I plan to work with you and others on this one.

On Aug 13, 2018, at 19:14, Lukasz Cwik <lcwik@xxxxxxxxxx> wrote:
I wanted to let you know that I will be continuing from where Eugene left off with SplittableDoFn. I know that many of you have done a bunch of work with IOs and/or runner integration for SplittableDoFn, and I would appreciate your help in advancing this awesome idea. If you have questions or things you want reviewed related to SplittableDoFn, feel free to send them my way or include me on anything SplittableDoFn related.

I was part of several discussions with Eugene, and I think the biggest outstanding design portion is figuring out how dynamic work rebalancing would play out with the portability APIs. This includes reporting of progress from within a bundle. I know that Eugene had shared some documents in this regard, but the position / split models didn't work too cleanly in a unified sense for bounded and unbounded SplittableDoFns. It will likely take me a while to gather my thoughts, but I could use your expertise as to how compatible these ideas are with respect to IOs and runners (Flink/Spark/Dataflow/Samza/Apex/...), and obviously your help during implementation.