Basic modeling question
Hello Airflow community,
I have a basic question about how best to model a common data pipeline
pattern here at Dropbox.
At Dropbox, all of our logs are ingested and written into Hive in hourly
and/or daily rollups. On top of this data we build many weekly and monthly
rollups, which typically run on a daily cadence and compute results over a
If we have a metric X, it seems natural to put the daily, weekly, and
monthly rollups for metric X all in the same DAG.
However, the different rollups have different dependency structures. The
daily job depends on only a single day partition, whereas the weekly job
depends on 7 day partitions and the monthly job on 28.
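To make the fan-in concrete, here is a minimal sketch in plain Python (no Airflow required) of which daily partitions each rollup would read. It assumes the window ends on the run date and includes it. The helper name `partitions_needed`, the `WINDOWS` mapping, and the example date are hypothetical, chosen only for illustration.

```python
from datetime import date, timedelta


def partitions_needed(run_date, window_days):
    """Return the daily partition dates in a trailing window, oldest first.

    Assumption: the window ends on run_date and includes it.
    """
    return [
        run_date - timedelta(days=offset)
        for offset in range(window_days - 1, -1, -1)
    ]


# Window sizes taken from the description above: 1, 7, and 28 day partitions.
WINDOWS = {"daily": 1, "weekly": 7, "monthly": 28}

if __name__ == "__main__":
    run_date = date(2016, 3, 28)  # arbitrary example date
    for name, days in WINDOWS.items():
        parts = partitions_needed(run_date, days)
        print(name, len(parts), parts[0], parts[-1])
```

A single run of the monthly job therefore fans in from 28 separate runs of the upstream daily load, which is the part that does not map onto a one-run-to-one-run dependency.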
In Airflow, it seems the two paradigms for modeling dependencies are:
1) Depend on a *single run of a task* within the same DAG
2) Depend on *multiple runs of a task* by using an ExternalTaskSensor
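For reference, here is a rough sketch of what I imagine approach #2 would look like: one sensor per upstream daily run, each looking back a different number of days through the sensor's `execution_delta` argument. To keep the snippet runnable without Airflow installed, it only builds the keyword arguments. The DAG id `metric_x_daily`, the task id `daily_rollup`, and the helper `sensor_kwargs` are hypothetical names, and I am assuming the window includes the current run.

```python
from datetime import timedelta


def sensor_kwargs(rollup, window_days,
                  external_dag_id="metric_x_daily",   # hypothetical DAG id
                  external_task_id="daily_rollup"):   # hypothetical task id
    """Build one set of ExternalTaskSensor arguments per upstream daily run.

    offset=0 is the current run; offset=k waits on the run k days earlier.
    """
    return [
        {
            "task_id": f"wait_{rollup}_minus_{offset}d",
            "external_dag_id": external_dag_id,
            "external_task_id": external_task_id,
            "execution_delta": timedelta(days=offset),
        }
        for offset in range(window_days)
    ]


weekly_sensors = sensor_kwargs("weekly", 7)
monthly_sensors = sensor_kwargs("monthly", 28)

# Inside a real DAG definition, each dict would presumably be passed as
# ExternalTaskSensor(dag=dag, **kwargs) and set upstream of the rollup task.
```

This means 7 sensor tasks for the weekly rollup and 28 for the monthly one, per metric, which is what makes me doubt that this is the intended pattern.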
I'm not sure how I could possibly model this scenario using approach #1,
and I'm not sure approach #2 is the most elegant or performant way to model
Any thoughts or suggestions?