[jira] [Created] (FLINK-10288) Failover Strategy improvement
JIN SUN created FLINK-10288:
Summary: Failover Strategy improvement
Issue Type: Improvement
Reporter: JIN SUN
Assignee: JIN SUN
Flink invests significant effort in making streaming jobs fault tolerant. The checkpoint mechanism and exactly-once semantics set Flink apart from other systems. However, some cases are still not handled well. These cases apply to both streaming and batch scenarios, and they are orthogonal to the current fault-tolerance mechanism. Here is a summary of those cases:
# Some failures are non-recoverable, such as a user error like a DivideByZeroException. We shouldn't restart the task, as it will never succeed. DivideByZeroException is just a simple case; such errors are sometimes hard to reproduce or predict, as they may be triggered only by specific input data. More generally, we shouldn't retry for any user-code exception.
# There is no limit on task retries today: unless a SuppressRestartException is encountered, a task keeps retrying until it succeeds. As mentioned above, some failures shouldn't be retried at all; and for the exceptions we can retry, such as network exceptions, should there be a retry limit? We need retries for transient issues, but we also need a limit to avoid infinite retries and wasted resources. Batch and streaming workloads might need different strategies.
# Some exceptions are due to hardware issues, such as disk or network malfunction. When a task or TaskManager fails for such a reason, we should detect this and avoid scheduling onto that machine next time.
# If a task reads from a blocking result partition and its input is not available, we can 'revoke' the producer task: mark it as failed and rerun the upstream task to regenerate the data. The revoke can propagate up through the chain. In Spark, revoking is naturally supported by lineage.
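Points 1 and 2 above could be sketched as follows. This is an illustrative sketch only, under the assumption that failures can be classified by exception type; `FailureClassifier` and `RestartTracker` are hypothetical names, not Flink APIs.

```java
// Illustrative sketch only; class and method names are hypothetical, not Flink APIs.
public class FailureClassifier {

    public enum Category {
        NON_RECOVERABLE, // user-code bug (e.g. ArithmeticException from a divide by zero)
        RECOVERABLE      // transient environment issue (e.g. a network IOException)
    }

    // Classify a failure cause; unknown causes are treated as recoverable by default.
    public static Category classify(Throwable cause) {
        if (cause instanceof ArithmeticException || cause instanceof NullPointerException) {
            return Category.NON_RECOVERABLE;
        }
        return Category.RECOVERABLE;
    }

    // Tracks restart attempts for one task and enforces an upper bound (point 2).
    public static class RestartTracker {
        private final int maxAttempts;
        private int attempts = 0;

        public RestartTracker(int maxAttempts) {
            this.maxAttempts = maxAttempts;
        }

        // Restart only if the failure is worth retrying and the budget is not exhausted.
        public boolean shouldRestart(Throwable cause) {
            if (classify(cause) == Category.NON_RECOVERABLE) {
                return false;
            }
            return ++attempts <= maxAttempts;
        }
    }
}
```

A real classifier would likely distinguish user-code exceptions from system exceptions by inspecting where the throwable originated, and the attempt budget could differ between batch and streaming jobs.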
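The machine-avoidance idea in point 3 could take the shape of a simple per-host failure counter that the scheduler consults before placement. Again a hypothetical sketch, not a Flink API; the threshold and the attribution of a failure to a host are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not a Flink API. Counts failures suspected to be
// hardware-related per host and blacklists a host once a threshold is crossed.
public class HostBlacklist {
    private final Map<String, Integer> failureCounts = new HashMap<>();
    private final int threshold;

    public HostBlacklist(int threshold) {
        this.threshold = threshold;
    }

    // Record a task/TaskManager failure attributed to the given host.
    public void recordFailure(String host) {
        failureCounts.merge(host, 1, Integer::sum);
    }

    // The scheduler would check this before placing a new task on the host.
    public boolean isBlacklisted(String host) {
        return failureCounts.getOrDefault(host, 0) >= threshold;
    }
}
```

In practice the counts would need to decay over time so that a host is not blacklisted forever after a transient problem.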
To make fault tolerance easier, we should keep behavior deterministic as much as possible. For user code this is hard to control; for system-related code, however, we can fix it. For example, we should at least ensure that different attempts of the same task receive the same inputs (the current codebase has a bug in DataSourceTask that cannot guarantee this). Note that this is tracked by [FLINK-10205].
For details, see this proposal:
This message was sent by Atlassian JIRA