Re: [DISCUSS] Task speculative execution for Flink batch


Hi all,

After some refinement, the detailed design doc is available here:
https://docs.google.com/document/d/1X_Pfo4WcO-TEZmmVTTYNn44LQg5gnFeeaeqM7ZNLQ7M/edit?usp=sharing

Your reviews and comments would be much appreciated and will greatly help
in completing the feature.

Best,
Ryan


On Wed, Nov 7, 2018 at 4:49 PM, Tao Yangyu <ryantaocer@xxxxxxxxx> wrote:

> Thanks so much for all your feedback!
>
> Yes, as Jin Sun mentioned above, the design currently targets batch in
> order to explore the general framework and basic modules. The strategy
> could also be applied to streaming with some extended code, for example
> for result commitment.
>
> On Wed, Nov 7, 2018 at 8:38 AM, Jin Sun <isunjin@xxxxxxxxx> wrote:
>
>> I think this targets batch at the very beginning; the idea should also
>> work for both cases, with different algorithms/strategies.
>>
>> Ryan, since you are working on this, I will assign FLINK-10644 <
>> https://issues.apache.org/jira/browse/FLINK-10644> to you.
>>
>> Jin
>>
>> > On Nov 6, 2018, at 4:45 AM, Till Rohrmann <trohrmann@xxxxxxxxxx> wrote:
>> >
>> > Thanks for starting this discussion, Ryan. I'm looking forward to your
>> > design document about this feature. Quick question: Will it be a
>> > batch-only feature? If not, then it needs to take checkpointing into
>> > account as well.
>> >
>> > Cheers,
>> > Till
>> >
>> > On Tue, Nov 6, 2018 at 4:29 AM zhijiang
>> > <wangzhijiang999@xxxxxxxxxx.invalid> wrote:
>> >
>> >> Thanks, Yangyu, for launching this discussion.
>> >>
>> >> I really like this proposal. In production we have frequently seen
>> >> a few long-tail tasks delay the total execution time of a batch job.
>> >> We also have some thoughts on bringing in this mechanism. Looking
>> >> forward to your detailed design doc, then we can discuss further.
>> >>
>> >> Best,
>> >> Zhijiang
>> >> ------------------------------------------------------------------
>> >> From: Tao Yangyu <ryantaocer@xxxxxxxxx>
>> >> Sent: Tuesday, Nov 6, 2018, 11:01
>> >> To: dev <dev@xxxxxxxxxxxxxxxx>
>> >> Subject: [DISCUSS] Task speculative execution for Flink batch
>> >>
>> >> Hi everyone,
>> >>
>> >> We propose task speculative execution for Flink batch in this message
>> >> as follows.
>> >>
>> >> In batch mode, a job is usually divided into multiple parallel tasks
>> >> executed across many nodes in the cluster. It is common to encounter
>> >> performance degradation on some nodes due to hardware problems,
>> >> incidental I/O contention, or high CPU load. This kind of degradation
>> >> can cause the tasks running on the affected node to be very slow;
>> >> these are the so-called long-tail tasks. Although long-tail tasks
>> >> will not fail, they can severely affect the total job running time.
>> >> The Flink task scheduler does not currently take this long-tail
>> >> problem into account.
>> >>
>> >>
>> >>
>> >> Here we propose a speculative execution strategy to handle the
>> >> problem. The basic idea is to run a copy of a task on another node
>> >> when the original task is identified as a long-tail task. In more
>> >> detail, the speculative task is triggered when the scheduler detects
>> >> that the data processing throughput of a task is much lower than
>> >> that of the others. The speculative task is executed in parallel
>> >> with the original one and shares the same failure retry mechanism.
>> >> Once either task completes, the scheduler accepts its output as the
>> >> final result and cancels the other one that is still running.
>> >> Preliminary experiments have demonstrated the effectiveness of this
>> >> approach.
>> >>
>> >>
>> >> The detailed design doc will be ready soon. Your reviews and comments
>> >> will be much appreciated.
>> >>
>> >>
>> >> Thanks!
>> >>
>> >> Ryan
>> >>
>> >>
>>
>>
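
The strategy described in the original proposal quoted above (detect a task whose throughput is much lower than the others, run a copy, keep whichever attempt finishes first, cancel the other) can be illustrated with a minimal sketch. This is not Flink code and not the design from the linked doc: the function names, the median-based comparison, and the `slow_ratio` threshold are all assumptions made purely for illustration.

```python
from statistics import median


def find_long_tail_tasks(throughputs, slow_ratio=0.5):
    """Return ids of tasks whose throughput is far below the typical one.

    throughputs: dict mapping task id -> records processed per second.
    slow_ratio:  hypothetical threshold (assumption, not from the proposal);
                 a task is flagged when its rate < slow_ratio * median rate.
    """
    if len(throughputs) < 2:
        # Nothing to compare against, so no task can be called "slow".
        return []
    typical = median(throughputs.values())
    return sorted(
        task for task, rate in throughputs.items()
        if rate < slow_ratio * typical
    )


def resolve_attempts(finish_times):
    """Pick the attempt that completes first and list the ones to cancel.

    finish_times: dict mapping attempt id -> simulated completion time.
    Returns (winner, attempts_to_cancel).
    """
    winner = min(finish_times, key=finish_times.get)
    to_cancel = sorted(a for a in finish_times if a != winner)
    return winner, to_cancel


if __name__ == "__main__":
    rates = {"t1": 100.0, "t2": 95.0, "t3": 20.0, "t4": 110.0}
    for task in find_long_tail_tasks(rates):
        # Simulated completion times for the original and the speculative copy.
        attempts = {f"{task}-original": 120.0, f"{task}-speculative": 45.0}
        winner, cancelled = resolve_attempts(attempts)
        print(task, "->", winner, "wins; cancel", cancelled)
```

Comparing against the median rather than the mean keeps a single very slow task from dragging down the baseline it is measured against; a real scheduler would also need to account for data skew and for how long a task has been running before it is judged.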