Having a backoff while experiencing checkpointing failures


Hello all,

Are there any recommendations on using a backoff when experiencing checkpointing failures?
What we have seen is that when checkpoints start to expire, the next checkpoint doesn't take the previous failure into account and starts soon after. We experimented with min_pause_between_checkpoints, but that seems to apply only after successful checkpoints (the same is discussed on this thread).

Are there any recommendations on how to add a backoff, or is there something in the works to add a backoff in case of checkpointing failures? This seems very valuable when checkpointing to an external location like S3, where one can potentially be throttled or get errors like TooBusyException from S3 (for example, as in this Jira).
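
To illustrate the kind of behavior we have in mind, here is a rough sketch of an exponential backoff keyed on consecutive checkpoint failures. It is plain Python and not a Flink API; the function name, parameters, and default values are all hypothetical and only meant to show the idea.

```python
def next_checkpoint_delay(consecutive_failures, base_pause_ms=1000, max_pause_ms=60000):
    """Return the pause (in ms) to wait before triggering the next checkpoint.

    Hypothetical helper, not part of Flink: the pause doubles with every
    consecutive failed or expired checkpoint and is capped at max_pause_ms.
    """
    if consecutive_failures <= 0:
        # Last checkpoint succeeded: fall back to the normal minimum pause.
        return base_pause_ms
    return min(base_pause_ms * 2 ** consecutive_failures, max_pause_ms)


# Example: pauses after 0, 1, 2, 3 consecutive failures.
delays = [next_checkpoint_delay(n) for n in range(4)]
```

With these assumed defaults, a run of failures against a throttling store would space the attempts out further and further until the cap is reached, and a single success would reset the pause to the base value.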

Please let us know!
Thanks,
Vipul