Making Airflow Fault-Tolerant When Running on Kubernetes

Hi all,

We currently run Airflow as a Deployment in a Kubernetes cluster. We also
use a variant of KubernetesOperator to run our DAGs.

We are investigating how best to make Airflow fault-tolerant, in part
because we are considering preemptible VMs [1]. *Has there been much
discussion about how to deploy Airflow in a fault-tolerant way? Are
there any best practices? Ideally we'd like our Kubernetes-hosted Airflow
to support rolling updates for Docker image updates and also to recover
from components (worker, scheduler, web) going down temporarily, including
when DAGs are in flight.*

Any advice, ideas and/or feedback appreciated!

[1] https://cloud.google.com/kubernetes-engine/docs/how-to/preemptible-vms