Re: [DISCUSS]Enhancing flink scheduler by implementing blacklist mechanism

Thanks for sharing this design document with the community, Yingjie.

I like the design of passing the job-specific blacklisted TMs as a scheduling
constraint. This makes a lot of sense to me.
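
To make the idea concrete, here is a minimal sketch of how I read it: failures
are counted per TaskManager for one job, and the resulting blacklist is handed
to slot selection as a constraint. Every name in it (BlacklistSketch,
JobBlacklist, selectTaskManager, the maxFailures threshold) is hypothetical and
is neither taken from the design document nor an existing Flink API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical illustration only; none of these types exist in Flink.
public class BlacklistSketch {

    /** Counts task failures per TaskManager for a single job. */
    static final class JobBlacklist {
        private final int maxFailures; // assumed threshold for "too many times"
        private final Map<String, Integer> failuresPerTm = new HashMap<>();

        JobBlacklist(int maxFailures) {
            this.maxFailures = maxFailures;
        }

        /** Records one task failure observed on the given TaskManager. */
        void recordFailure(String tmId) {
            failuresPerTm.merge(tmId, 1, Integer::sum);
        }

        /** Returns the TaskManagers that reached the failure threshold. */
        Set<String> blacklistedTms() {
            Set<String> result = new HashSet<>();
            for (Map.Entry<String, Integer> entry : failuresPerTm.entrySet()) {
                if (entry.getValue() >= maxFailures) {
                    result.add(entry.getKey());
                }
            }
            return result;
        }
    }

    /** Picks the first candidate that is not excluded by the blacklist constraint. */
    static Optional<String> selectTaskManager(List<String> candidates, Set<String> blacklisted) {
        return candidates.stream()
                .filter(tm -> !blacklisted.contains(tm))
                .findFirst();
    }

    public static void main(String[] args) {
        JobBlacklist blacklist = new JobBlacklist(2);
        List<String> candidates = List.of("tm-1", "tm-2");

        blacklist.recordFailure("tm-1");
        // One failure is below the threshold, so tm-1 is still eligible.
        System.out.println(selectTaskManager(candidates, blacklist.blacklistedTms()));

        blacklist.recordFailure("tm-1");
        // The second failure blacklists tm-1 for this job, so tm-2 is chosen.
        System.out.println(selectTaskManager(candidates, blacklist.blacklistedTms()));
    }
}
```

Keeping the blacklist as a plain input to slot selection means the scheduler
itself stays unaware of why a TM is excluded, which is what I find appealing
about the constraint-based approach.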


On Fri, Nov 2, 2018 at 4:51 PM yingjie <kevin.yingjie@xxxxxxxxx> wrote:

> Hi everyone,
> This post proposes a blacklist mechanism as an enhancement to the Flink
> scheduler. The motivation is as follows.
> In our clusters, jobs occasionally encounter hardware and software
> environment problems, including missing software libraries, bad hardware,
> and resource shortages such as running out of disk space. These problems
> lead to task failures; the failover strategy takes care of that and
> redeploys the relevant tasks. But because of location preference and
> limited total resources, the failed task gets scheduled onto the same
> host again, where it fails again and again. The primary cause of this
> problem is the mismatch between task and resource. Currently, the
> resource allocation algorithm does not take this into consideration.
> We introduce the blacklist mechanism to solve this problem. The basic idea
> is that when a task fails too many times on some resource, the scheduler
> will no longer assign that resource to the task. We have implemented this
> feature in our internal version of Flink, and it currently works fine.
> The following is the design draft; we would really appreciate it if you
> could review and comment.
> https://docs.google.com/document/d/1Qfb_QPd7CLcGT-kJjWSCdO8xFeobSCHF0vNcfiO4Bkw
> Best,
> Yingjie
> --
> Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/