Re: SplittableDoFn

Here is the link to join the discussion: https://meet.google.com/idc-japs-hwf
Remember that it is this Friday Sept 14th from 11am-noon PST.

On Mon, Sep 10, 2018 at 7:30 AM Maximilian Michels <mxm@xxxxxxxxxx> wrote:
Thanks for moving forward with this, Lukasz!

Unfortunately, I can't make it on Friday, but I'll sync with somebody on
the call (e.g. Ryan) about your discussion.

On 08.09.18 02:00, Lukasz Cwik wrote:
> Thanks to everyone who filled out the doodle poll. The most
> popular time was Friday Sept 14th from 11am-noon PST. I'll send out a
> calendar invite and meeting link early next week.
> I have received a lot of feedback on the document and have addressed
> some parts of it including:
> * clarifying terminology
> * processing skew due to some restrictions having their watermarks much
> further behind than others, affecting scheduling of bundles by runners
> * external throttling & I/O wait overhead reporting to make sure we
> don't overscale
> Areas that still need additional feedback and details are:
> * reporting progress around the work that is done and is active
> * more examples
> * unbounded restrictions being caused by an unbounded number of splits
> of existing unbounded restrictions (infinite work growth)
> * whether we should be reporting this information at the PTransform
> level or at the bundle level
> On Wed, Sep 5, 2018 at 1:53 PM Lukasz Cwik <lcwik@xxxxxxxxxx
> <mailto:lcwik@xxxxxxxxxx>> wrote:
>     Thanks to all those who have shown interest in this topic through the
>     questions they have already asked on the doc, and to those
>     interested in having this discussion. I have set up this doodle to
>     allow people to provide their availability:
>     https://doodle.com/poll/nrw7w84255xnfwqy
>     I'll send out the chosen time based upon people's availability and a
>     Hangout link by end of day Friday, so please mark your availability
>     using the link above.
>     The agenda of the meeting will be as follows:
>     * Overview of the proposal
>     * Enumerate and discuss/answer questions brought up in the meeting
>     Note that all questions and any discussions/answers provided will be
>     added to the doc for those who are unable to attend.
>     On Fri, Aug 31, 2018 at 9:47 AM Jean-Baptiste Onofré
>     <jb@xxxxxxxxxxxx <mailto:jb@xxxxxxxxxxxx>> wrote:
>         +1
>         Regards
>         JB
>         On Aug 31, 2018, at 18:22, Lukasz Cwik <lcwik@xxxxxxxxxx
>         <mailto:lcwik@xxxxxxxxxx>> wrote:
>             That is possible, I'll take people's date/time suggestions
>             and create a simple online poll with them.
>             On Fri, Aug 31, 2018 at 2:22 AM Robert Bradshaw
>             <robertwb@xxxxxxxxxx <mailto:robertwb@xxxxxxxxxx>> wrote:
>                 Thanks for taking this up. I added some comments to the
>                 doc. A European-friendly time for discussion would
>                 be great.
>                 On Fri, Aug 31, 2018 at 3:14 AM Lukasz Cwik
>                 <lcwik@xxxxxxxxxx <mailto:lcwik@xxxxxxxxxx>> wrote:
>                     I came up with a proposal[1] for a progress model
>                     based solely on the backlog, in which splits are
>                     based upon the remaining backlog we want the SDK to
>                     split at. I also give recommendations to runner
>                     authors as to how an autoscaling system could work
>                     based upon the measured backlog. A lot of the
>                     discussion around progress reporting and splitting
>                     in the past has been about finding an optimal
>                     solution. After reading a lot about work stealing,
>                     I don't believe there is a general solution, and it
>                     really is up to SplittableDoFns to be well behaved.
>                     I did not do much work in classifying what a
>                     well-behaved SplittableDoFn is, though. Much of this
>                     work builds off ideas that Eugene had documented in
>                     the past[2].
>                     I could use the community's wide knowledge of
>                     different I/Os to see if computing the backlog is
>                     practical in the way that I'm suggesting and to
>                     gather people's feedback.
>                     If there is a lot of interest, I would like to hold
>                     a community video conference between Sept 10th and
>                     14th about this topic. Please reply with your
>                     availability by Sept 6th if you're interested.
>                     1: https://s.apache.org/beam-bundles-backlog-splitting
>                     2: https://s.apache.org/beam-breaking-fusion
>                     On Mon, Aug 13, 2018 at 10:21 AM Jean-Baptiste
>                     Onofré <jb@xxxxxxxxxxxx <mailto:jb@xxxxxxxxxxxx>> wrote:
>                         Awesome!
>                         Thanks Luke!
>                         I plan to work with you and others on this one.
>                         Regards
>                         JB
>                         On Aug 13, 2018, at 19:14, Lukasz Cwik
>                         <lcwik@xxxxxxxxxx <mailto:lcwik@xxxxxxxxxx>>
>                         wrote:
>                             I wanted to let everyone know that I will be
>                             continuing from where Eugene left off with
>                             SplittableDoFn. I know that many of you have
>                             done a bunch of work with IOs and/or runner
>                             integration for SplittableDoFn and would
>                             appreciate your help in advancing this
>                             awesome idea. If you have questions or
>                             things you want to get reviewed related to
>                             SplittableDoFn, feel free to send them my
>                             way or include me on anything SplittableDoFn
>                             related.
>                             I was part of several discussions with
>                             Eugene and I think the biggest outstanding
>                             design portion is to figure out how dynamic
>                             work rebalancing would play out with the
>                             portability APIs. This includes reporting of
>                             progress from within a bundle. I know that
>                             Eugene had shared some documents in this
>                             regard but the position / split models
>                             didn't work too cleanly in a unified sense
>                             for bounded and unbounded SplittableDoFns.
>                             It will likely take me a while to gather my
>                             thoughts, but I could use your expertise as
>                             to how compatible these ideas are with
>                             respect to IOs and runners
>                             (Flink/Spark/Dataflow/Samza/Apex/...), and
>                             obviously your help during implementation.