Re: [DISCUSS] Task speculative execution for Flink batch


Hi,

+1 for the speculative execution.

It would be even better if it worked well with the existing checkpointing
and pipelined execution. That way, we can move a step further towards the
unification of batch and stream processing.

Regards,
Xiaogang

On Wed, Nov 7, 2018 at 9:40 AM, Jeff Zhang <zjffdu@xxxxxxxxx> wrote:

> +1 for speculative execution for Flink batch. Speculative execution is
> used in many batch execution engines such as MapReduce, Tez and Spark.
> This would be a great improvement for Flink in batch scenarios.
>
> On Wed, Nov 7, 2018 at 8:38 AM, Jin Sun <isunjin@xxxxxxxxx> wrote:
>
> > I think this targets batch at the very beginning, but the idea should
> > also work for both cases, with different algorithms/strategies.
> >
> > Ryan, since you are working on this, I will assign FLINK-10644 <
> > https://issues.apache.org/jira/browse/FLINK-10644> to you.
> >
> > Jin
> >
> > > On Nov 6, 2018, at 4:45 AM, Till Rohrmann <trohrmann@xxxxxxxxxx> wrote:
> > >
> > > Thanks for starting this discussion, Ryan. I'm looking forward to your
> > > design document about this feature. Quick question: will it be a
> > > batch-only feature? If not, then it needs to take checkpointing into
> > > account as well.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, Nov 6, 2018 at 4:29 AM zhijiang <wangzhijiang999@xxxxxxxxxx.invalid>
> > > wrote:
> > >
> > > >> Thanks, Yangyu, for launching this discussion.
> > > >>
> > > >> I really like this proposal. In production we have frequently seen
> > > >> long-tail tasks delay the total execution time of a batch job.
> > > >> We also have some thoughts on introducing this mechanism. Looking
> > > >> forward to your detailed design doc; then we can discuss further.
> > >>
> > >> Best,
> > >> Zhijiang
> > >> ------------------------------------------------------------------
> > > >> From: Tao Yangyu <ryantaocer@xxxxxxxxx>
> > > >> Sent: Tuesday, November 6, 2018, 11:01
> > > >> To: dev <dev@xxxxxxxxxxxxxxxx>
> > > >> Subject: [DISCUSS] Task speculative execution for Flink batch
> > >>
> > >> Hi everyone,
> > >>
> > > >> We propose task speculative execution for Flink batch, as described
> > > >> in this message.
> > >>
> > > >> In batch mode, a job is usually divided into multiple parallel tasks
> > > >> executed across many nodes in the cluster. It is common to encounter
> > > >> performance degradation on some nodes due to hardware problems,
> > > >> occasional heavy I/O, or high CPU load. Such degradation can make the
> > > >> tasks running on the affected node very slow; these are the so-called
> > > >> long-tail tasks. Although long-tail tasks do not fail, they can
> > > >> severely increase the total job running time. The Flink task
> > > >> scheduler does not currently take this long-tail problem into account.
> > >>
> > >>
> > >>
> > > >> Here we propose a speculative execution strategy to handle the problem.
> > > >> The basic idea is to run a copy of a task on another node when the
> > > >> original task is identified as a long-tail task. In more detail, a
> > > >> speculative task is triggered when the scheduler detects that the data
> > > >> processing throughput of a task is much lower than that of the others.
> > > >> The speculative task is executed in parallel with the original one and
> > > >> shares the same failure retry mechanism. Once either task completes,
> > > >> the scheduler accepts its output as the final result and cancels the
> > > >> other running one. Preliminary experiments have demonstrated the
> > > >> effectiveness of this approach.
> > >>
> > >>
> > > >> The detailed design doc will be ready soon. Your reviews and comments
> > > >> will be much appreciated.
> > >>
> > >>
> > >> Thanks!
> > >>
> > >> Ryan
> > >>
> > >>
> >
> >
>
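
The strategy described in the original proposal (detect a task whose
throughput is much lower than its peers, run a copy on another node, accept
whichever attempt finishes first and cancel the other) can be illustrated with
a small sketch. This is not Flink code and not the design from the thread:
the names TaskAttempt, find_long_tail, launch_speculative and
on_attempt_finished, the "throughput below half the median" rule, and the
slow_ratio parameter are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class TaskAttempt:
    """One execution attempt of a task (hypothetical model, not a Flink class)."""
    task_id: str
    node: str
    throughput: float        # records/second observed so far (assumed metric)
    state: str = "RUNNING"   # RUNNING, FINISHED or CANCELED


def find_long_tail(attempts: List[TaskAttempt],
                   slow_ratio: float = 0.5) -> List[TaskAttempt]:
    """Return running attempts whose throughput is below slow_ratio * median.

    The median-based threshold is an assumed detection rule; the proposal only
    says "much slower than others".
    """
    running = [a for a in attempts if a.state == "RUNNING"]
    if len(running) < 2:
        return []  # nothing to compare against
    threshold = slow_ratio * median(a.throughput for a in running)
    return [a for a in running if a.throughput < threshold]


def launch_speculative(original: TaskAttempt,
                       free_nodes: List[str]) -> Optional[TaskAttempt]:
    """Create a copy of the task on a different node, or None if none is free."""
    for node in free_nodes:
        if node != original.node:
            return TaskAttempt(original.task_id, node, throughput=0.0)
    return None


def on_attempt_finished(winner: TaskAttempt,
                        attempts_of_task: List[TaskAttempt]) -> TaskAttempt:
    """Accept the first finisher's output and cancel the other running attempts."""
    winner.state = "FINISHED"
    for other in attempts_of_task:
        if other is not winner and other.state == "RUNNING":
            other.state = "CANCELED"
    return winner


# Example: four parallel tasks, one of them on a degraded node.
attempts = [
    TaskAttempt("map-0", "node-a", 100.0),
    TaskAttempt("map-1", "node-b", 110.0),
    TaskAttempt("map-2", "node-c", 95.0),
    TaskAttempt("map-3", "node-d", 10.0),
]
slow = find_long_tail(attempts)
copy = launch_speculative(slow[0], ["node-d", "node-e"])
# Suppose the speculative copy finishes before the original attempt.
result = on_attempt_finished(copy, [slow[0], copy])
```

In this example only map-3 falls below the threshold, its copy is placed on
node-e rather than the degraded node-d, and the original attempt is canceled
once the copy finishes. A real scheduler would also have to handle the shared
failure retry mechanism and, as raised in the replies, checkpointing and
pipelined execution, none of which this sketch attempts.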