Thanks for the great write-up.
If I understood you correctly, there are two different issues that are caused by the disabled checkpointing.
1) Recovery from a failure without restarting all operators, so that the state in the running tasks is preserved.
2) Planned restarts of an application without losing all state (even with checkpointing disabled).
Ad 1) The community is constantly working on reducing the time for checkpointing and recovery.
For 1.5, local task recovery was added, which basically stores a copy of the state on the local disk and reads it in case of a recovery. So tasks are still restarted, but they read the state to restore from the local disk instead of from distributed storage.
AFAIK, this can only be used together with remote checkpoints. It might be an interesting option for you if it were possible to write checkpoints only to local disk and not to remote storage. AFAIK, there are also other efforts to reduce the number of tasks that are restarted in case of a failure. I guess you've already played with other features such as the RocksDBStateBackend, incremental checkpoints, and async checkpoints.
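For reference, here is a rough sketch of how these options could be combined in flink-conf.yaml for 1.5. The checkpoint path is a placeholder, and I'm writing the keys from memory, so please double check them against the configuration docs of your version before relying on them.

```yaml
# Sketch only: the path below is a placeholder, not a recommendation.

# Keep working state in RocksDB instead of on the JVM heap.
state.backend: rocksdb

# Remote (distributed) checkpoint storage; local recovery is used in addition to it.
state.checkpoints.dir: hdfs:///path/to/checkpoints

# Upload only the changes since the previous checkpoint.
state.backend.incremental: true

# Keep a secondary state copy on the TaskManager's local disk for faster recovery.
state.backend.local-recovery: true
```

With such a setup, checkpoints are still written to the remote directory. The local copy only shortens the restore step for tasks that are rescheduled to the same TaskManager.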
Ad 2) It sounds as if savepoints are exactly the feature you are looking for. It would be good to know what exactly did not work for you. The MemoryStateBackend is not suitable for large state sizes because it backs up the state into the heap memory of the JobManager.
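If the savepoint attempt failed because of the MemoryStateBackend, it might be worth retrying with a default savepoint directory on a file system that both the JobManager and the TaskManagers can reach. A minimal sketch, assuming a placeholder path:

```yaml
# Sketch only: the path is a placeholder.
# Default target directory for savepoints, used when none is given on trigger.
state.savepoints.dir: hdfs:///path/to/savepoints
```

A savepoint can then be triggered with `bin/flink savepoint <jobId>`, and the job can be resumed from it with `bin/flink run -s <savepointPath> ...`.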