osdir.com

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Basic modeling question


If you want to run (daily, rolling weekly, rolling monthly) backups on a
daily basis, and they're mostly the same but have some additional
dependencies, you can write a DAG factory method, which you call three
times. Certain nodes only get added to the longer-than-daily backups.

On Wed, Aug 8, 2018 at 2:03 PM Gabriel Silk <gsilk@xxxxxxxxxxx.invalid>
wrote:

> Thanks Andy and Taylor for the suggestions --
>
> I see how that would work for the case where you want a weekly rollup that
> runs on a weekly cadence.
>
> But what about a rolling weekly or monthly rollup that runs each day?
>
> On Wed, Aug 8, 2018 at 11:00 AM, Andy Cooper <andy.cooper@xxxxxxxxxxxxx>
> wrote:
>
> > To expand on Taylor's idea
> >
> > I recently wrote a ScheduleBlackoutSensor that would allow you to
> prevent a
> > task from running if it meets the criteria provided. It accepts an array
> of
> > args for any number of the criteria so you could leverage this sensor to
> > provide "blackout" runs for a range of days of the week.
> >
> > https://github.com/apache/incubator-airflow/pull/3702/files
> >
> > For example,
> >
> > task = ScheduleBlackoutSensor(day_of_week=[0,1,2,3,4,5], dag=dag)
> >
> > Would prevent a task from running Monday - Saturday, allowing it to run
> on
> > Sunday.
> >
> > You could leverage this Sensor as you would any other sensor or you could
> > invert the logic so that you would only need to specify
> >
> > task = ScheduleBlackoutSensor(day_of_week=6, dag=dag)
> >
> > To "whitelist" a task to run on Sundays.
> >
> >
> > Let me know if you have any questions
> >
> > On Wed, Aug 8, 2018 at 1:47 PM Taylor Edmiston <tedmiston@xxxxxxxxx>
> > wrote:
> >
> > > Gabriel -
> > >
> > > One approach I've seen for a similar use case is to have multiple
> related
> > > rollups in one DAG that runs daily, then have the non-daily tasks skip
> > most
> > > of the time (e.g., weekly only actually executes on Sundays and is
> > > parameterized to look at the last 7 days).
> > >
> > > You could implement that not running part a few ways, but one idea is a
> > > sensor in front of the weekly rollup task.  Imagine a SundaySensor like
> > > return
> > > execution_date.weekday() == 6.  One thing to keep in mind here is
> > > dependence on the DAG's cron schedule being more granular than the
> tasks.
> > >
> > > I think this could generalize into a DayOfWeekSensor / DayOfMonthSensor
> > > that would be nice to have.
> > >
> > > Of course this does mean some scheduler inefficiency on the skip days,
> > but
> > > as long as those skips are fast and the overall number of tasks is
> > small, I
> > > can accept that.
> > >
> > > *Taylor Edmiston*
> > > Blog <https://blog.tedmiston.com/> | CV
> > > <https://stackoverflow.com/cv/taylor> | LinkedIn
> > > <https://www.linkedin.com/in/tedmiston/> | AngelList
> > > <https://angel.co/taylor> | Stack Overflow
> > > <https://stackoverflow.com/users/149428/taylor-edmiston>
> > >
> > >
> > > On Wed, Aug 8, 2018 at 1:11 PM, Gabriel Silk <gsilk@xxxxxxxxxxx.invalid
> >
> > > wrote:
> > >
> > > > Hello Airflow community,
> > > >
> > > > I have a basic question about how best to model a common data
> pipeline
> > > > pattern here at Dropbox.
> > > >
> > > > At Dropbox, all of our logs are ingested and written into Hive in
> > hourly
> > > > and/or daily rollups. On top of this data we build many weekly and
> > > monthly
> > > > rollups, which typically run on a daily cadence and compute results
> > over
> > > a
> > > > rolling window.
> > > >
> > > > If we have a metric X, it seems natural to put the daily, weekly, and
> > > > monthly rollups for metric X all in the same DAG.
> > > >
> > > > However, the different rollups have different dependency structures.
> > The
> > > > daily job only depends on a single day partition, whereas the weekly
> > job
> > > > depends on 7, the monthly on 28.
> > > >
> > > > In Airflow, it seems the two paradigms for modeling dependencies are:
> > > > 1) Depend on a *single run of a task* within the same DAG
> > > > 2) Depend on *multiple runs of task* by using an ExternalTaskSensor
> > > >
> > > > I'm not sure how I could possibly model this scenario using approach
> > #1,
> > > > and I'm not sure approach #2 is the most elegant or performant way to
> > > model
> > > > this scenario.
> > > >
> > > > Any thoughts or suggestions?
> > > >
> > >
> >
>