Thank you Xiaodong for bringing this up, and pardon me for being late on this thread. Sharing the setup within Airbnb and some ideas/progress, which should benefit people who are interested in this topic.
*- Version*: 1.8 with cherry-picks (a one-time fork); planning to move to containerization after releasing 1.10 internally.
*- # of worker nodes*: 400 worker nodes, with celery concurrency on each node varying from 2 to 200 (depending on the queue it is serving).
*- # of queues*: We have 9 queues; 2 of them serve tasks with special env dependencies (need GPU, need special packages, etc.) and the other
7 serve tasks with different resource consumption, which leads to worker nodes with different celery concurrency and cgroup sizes (a queue-pinning sketch follows this list).
*- SLA*: 5 mins. 3 mins is possible (our current scheduling delay stays in the 0.5-3 min range) but we wanted some more headroom. (Our SLA is evaluated by a monitoring task that runs
on the cluster at a 5m interval. It compares the current timestamp against the expected scheduling timestamp (execution_date + 5m)
and sends the time diff in minutes as one data point; a sketch of this canary also follows the list below.)
*- # of DAGs/Tasks*:
We're maintaining ~9,500 active DAGs produced by ~1,600 DAG files (the number of DAG files is actually the biggest scheduling bottleneck now; a quick way to measure per-file parse time is sketched below). During peak hours we have ~15,000 tasks running at the same time.
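
To make the queue setup above concrete, here is a minimal sketch of how a task gets pinned to a dedicated queue. The DAG and queue name are hypothetical; a worker serving that queue would be started with a matching concurrency, e.g. "airflow worker -q gpu -c 2":

    # Sketch: pinning a task to a dedicated queue (DAG and queue name are
    # hypothetical). A worker subscribed to this queue would be started
    # with a matching concurrency, e.g.: airflow worker -q gpu -c 2
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG('queue_example', start_date=datetime(2019, 1, 1),
              schedule_interval='@daily')

    # Runs only on workers subscribed to the 'gpu' queue (special env deps).
    train = BashOperator(
        task_id='train_model',
        bash_command='python train.py',
        queue='gpu',
        dag=dag,
    )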
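
And here is roughly what the SLA canary mentioned above looks like. The metric emission is a placeholder (in reality we send it to our metrics backend), and it assumes naive UTC datetimes as in 1.8; on 1.10, airflow.utils.timezone.utcnow() should be used instead of datetime.utcnow():

    # Sketch of the SLA canary: runs every 5 minutes and reports how far
    # behind the expected scheduling time (execution_date + 5m) the run is.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def report_delay(execution_date, **kwargs):
        expected = execution_date + timedelta(minutes=5)
        delay_min = (datetime.utcnow() - expected).total_seconds() / 60.0
        # Placeholder: in reality this is emitted to a metrics backend.
        print('scheduling delay: %.2f min' % delay_min)

    dag = DAG(
        'scheduling_delay_canary',
        start_date=datetime(2019, 1, 1),
        schedule_interval=timedelta(minutes=5),
        catchup=False,
    )

    PythonOperator(
        task_id='report_delay',
        python_callable=report_delay,
        provide_context=True,  # passes execution_date etc. as kwargs on 1.x
        dag=dag,
    )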
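
Lastly, since DAG file count/parse time is the bottleneck, per-file parse time is easy to measure by pointing DagBag at a single file (the path here is hypothetical):

    # Quick sketch for measuring how long one DAG file takes to parse
    # (file path is hypothetical).
    import time

    from airflow.models import DagBag

    start = time.time()
    dagbag = DagBag(dag_folder='/path/to/dags/some_big_dag_file.py',
                    include_examples=False)
    print('parsed %d DAGs in %.1fs (%d import errors)'
          % (len(dagbag.dags), time.time() - start, len(dagbag.import_errors)))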
Xiaodong, you had a very good point about the scheduler being the performance bottleneck and needing HA. Looking forward to your contribution on the scheduler HA topic!
About scheduler performance/scaling, I've previously sent a proposal titled "[Proposal] Scale Airflow" to the dev mailing list. Currently I have one open PR
improving the performance of celery executor state querying,
one WIP PR
(fixing the CI, ETA this weekend) improving scheduler performance, and one PR in my backlog (we have an internal PR; it needs to be open sourced) improving enqueuing performance in both the scheduler and the
celery executor. From our internal stress test results, with all three PRs together our cluster can handle 4,000 DAG files and 30,000 peak concurrent running tasks (even when they all need to be scheduled at the same time)
within our 5 min SLA (calculated using our real DAG file parsing times, which can be as long as 90 seconds for a single DAG file). I think we will have enough headroom after these changes have been merged, so
some longer-term improvements (separate DAG parsing component/service, scheduler sharding, distributing scheduler responsibility to workers, etc.) can wait a bit. But of course I'm open to hearing other ideas that can
better scale Airflow.
Also I do want to mention that with faster scheduling and DAG parsing, the DB might become the performance bottleneck. With our stress test setup we can handle the DB load with an AWS RDS r3.4xlarge instance (only
with the improvement PRs applied). And the webserver is not scaling very well, as it parses all DAG files in a single-process fashion, which is my planned next item to work on.