Making Airflow Fault-Tolerant When Running on Kubernetes
We currently run Airflow as a Deployment in a Kubernetes cluster. We also
use a variant of KubernetesOperator to run our DAGs.
We are investigating how best to make Airflow fault-tolerant, in part
because we are considering preemptible VMs. *Has there been much
discussion about how to deploy Airflow in a fault-tolerant way? Are
there any best practices? Ideally, we'd like our Kubernetes-hosted Airflow
to support rolling updates when the Docker image changes, and to recover when
components (worker, scheduler, web) go down temporarily, including while
DAGs are in flight.*
Any advice, ideas and/or feedback appreciated!