
Basic modeling question

Hello Airflow community,

I have a basic question about how best to model a common data pipeline
pattern here at Dropbox.

At Dropbox, all of our logs are ingested and written into Hive in hourly
and/or daily rollups. On top of this data we build many weekly and monthly
rollups, which typically run on a daily cadence and compute results over a
rolling window.

If we have a metric X, it seems natural to put the daily, weekly, and
monthly rollups for metric X all in the same DAG.

However, the rollups have different dependency structures: the daily job
depends on a single day partition, whereas the weekly job depends on 7
and the monthly job on 28.
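
To make the dependency structure concrete, here is a minimal sketch in plain
Python of which day partitions each rollup would read for a given run. The
names WINDOW_DAYS and required_partitions are hypothetical, and the sketch
assumes the rolling window ends on the run date and includes it.

```python
from datetime import date, timedelta

# Window sizes in days, taken from the description above.
WINDOW_DAYS = {"daily": 1, "weekly": 7, "monthly": 28}


def required_partitions(rollup, run_date):
    """Return the day partitions a rollup reads, oldest first.

    Assumption: the window ends on run_date (inclusive).
    """
    days = WINDOW_DAYS[rollup]
    return [run_date - timedelta(days=offset) for offset in range(days - 1, -1, -1)]


# The weekly rollup for 2016-01-28 reads the 7 partitions 2016-01-22 .. 2016-01-28.
weekly = required_partitions("weekly", date(2016, 1, 28))
```

The daily rollup is then just the degenerate case of a one-day window, so all
three rollups differ only in how far back they look.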

In Airflow, it seems the two paradigms for modeling dependencies are:
1) Depend on a *single run of a task* within the same DAG
2) Depend on *multiple runs of a task* by using an ExternalTaskSensor

I don't see how to model this scenario with approach #1, and I'm not sure
approach #2 is the most elegant or performant way to do it.
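
For reference, this is roughly what I imagine approach #2 would involve: one
sensor per day in the window, each looking back a different number of days.
The sketch below only builds the keyword arguments (it does not import
Airflow). The helper name sensor_specs and the IDs "metric_x" and
"daily_rollup" are hypothetical, and I am assuming each dict would be passed
to ExternalTaskSensor, whose execution_delta parameter selects an earlier run
of the upstream task.

```python
from datetime import timedelta


def sensor_specs(window_days, external_dag_id, external_task_id):
    """Build one set of sensor keyword arguments per day in the window.

    Assumption: each dict is later passed as ExternalTaskSensor(**spec);
    an execution_delta of N days points at the upstream run N days earlier.
    """
    return [
        {
            "task_id": f"wait_for_{external_task_id}_minus_{offset}d",
            "external_dag_id": external_dag_id,
            "external_task_id": external_task_id,
            "execution_delta": timedelta(days=offset),
        }
        for offset in range(window_days)
    ]


# Hypothetical IDs; the weekly rollup would need 7 sensors, the monthly 28.
weekly_specs = sensor_specs(7, "metric_x", "daily_rollup")
```

That is 7 sensors for the weekly rollup and 28 for the monthly one, per
metric, which is the part that makes me doubt this approach.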

Any thoughts or suggestions?