
Default Restart Strategy Does Not Work With Checkpointing

I’m running a Flink 1.5.0 standalone cluster with `restart-strategy` set to `failure-rate`. The web frontend shows that the JobManager and the TaskManagers have picked up this configuration, but streaming jobs with checkpointing enabled still use the fixed-delay strategy and ignore the configured default restart strategy (the user code does not explicitly override it).

I read the source code and found a possible explanation, though I’m not entirely sure about it. The client generates the JobGraph without consulting flink-conf.yaml and sets the restart strategy to fixed delay whenever checkpointing is on. The server side (JobMaster) does read the default restart strategy from flink-conf.yaml, but it gives the strategy stored in the JobGraph a higher priority, so the configured default is always overridden by the fixed-delay strategy.
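
To make the suspected precedence concrete, here is a minimal sketch in plain Java. It is only a model of the behavior as I understand it, not Flink code: the class, enum, and method names (`RestartStrategyPrecedence`, `clientSideStrategy`, `serverSideStrategy`) are hypothetical and do not exist in Flink.

```java
import java.util.Optional;

// Hypothetical model of the suspected behavior; none of these names are Flink APIs.
class RestartStrategyPrecedence {

    enum Strategy { FIXED_DELAY, FAILURE_RATE }

    // Client side (assumed): build the strategy stored in the JobGraph.
    // flink-conf.yaml is not consulted here.
    static Optional<Strategy> clientSideStrategy(Optional<Strategy> userCodeStrategy,
                                                 boolean checkpointingEnabled) {
        if (userCodeStrategy.isPresent()) {
            return userCodeStrategy;                  // explicit setting in user code
        }
        if (checkpointingEnabled) {
            return Optional.of(Strategy.FIXED_DELAY); // implicit default with checkpointing
        }
        return Optional.empty();                      // nothing stored in the JobGraph
    }

    // Server side (assumed): the JobGraph's strategy wins over the cluster default.
    static Strategy serverSideStrategy(Optional<Strategy> jobGraphStrategy,
                                       Strategy clusterDefault) {
        return jobGraphStrategy.orElse(clusterDefault);
    }

    public static void main(String[] args) {
        Strategy clusterDefault = Strategy.FAILURE_RATE; // from flink-conf.yaml

        // Checkpointing on, no explicit strategy in user code: the case I am seeing.
        Optional<Strategy> fromClient = clientSideStrategy(Optional.empty(), true);
        System.out.println(serverSideStrategy(fromClient, clusterDefault));

        // Checkpointing off: the cluster default would be used.
        fromClient = clientSideStrategy(Optional.empty(), false);
        System.out.println(serverSideStrategy(fromClient, clusterDefault));
    }
}
```

If this model is right, the cluster default can only take effect when the client puts no strategy into the JobGraph, which never happens once checkpointing is enabled.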

If I understand correctly, this might be a bug. Are there any suggestions for how to avoid it for now?

Best regards,
Paul Lam