I had a look at the logs from the restoring job and couldn’t find anything suspicious in them. Everything looks as expected, and the state files are properly found and transferred from S3. We are now including rescaling in some end-to-end tests, so let’s see what that turns up.
When you say that you can reproduce the problem, do you mean reproducing it from the single existing checkpoint, or can you also create other problematic checkpoints? I am asking because a log from the job that produces the problematic checkpoint might be more helpful. You can create a ticket if you want.