Questions about Savepoints
I have the following questions regarding savepoint recovery.
- In my job, it takes over 30 minutes to take a savepoint of over 100GB
on 3 TMs. Most of the time is spent after the alignment, which I
assume is serialization and uploading to S3. However, when I start a
new job from the savepoint, it takes only seconds to recover, which
seems too fast to me. Resuming from the savepoint with a different
parallelism was also very fast. Is this expected?
- Are there any log messages on the JM or the TMs indicating when a job
or operator has restored state from a savepoint? It would be very
helpful to know whether state was restored, especially when the
"--allowNonRestoredState" flag is set.
- If a checkpoint was successfully taken after a savepoint, will
resuming a job from the savepoint try to leverage the checkpoint?
- The job uses Kafka as the source. When I resume it from a savepoint,
when will the job start consuming from Kafka again? Does it wait until
all operators have finished restoring state, or does it start as soon
as the source operator finishes restoring? I assume it waits for all
of them, because that's the only way to guarantee transactionality.
- When cancelling a job with a savepoint, is there any way to prevent the
job from being cancelled if the savepoint fails? Otherwise, it sounds
too dangerous to use this operation.