Re: Basic modeling question


Thanks Andy and Taylor for the suggestions --

I see how that would work for the case where you want a weekly rollup that
runs on a weekly cadence.

But what about a rolling weekly or monthly rollup that runs each day?
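
To make the rolling case concrete, here is a minimal sketch in plain Python
(no Airflow API involved) of the daily partitions a single daily run would
need: 7 for the weekly rollup and 28 for the monthly, as in the original
question. The name window_partitions is hypothetical, and it assumes the
window ends on, and includes, the execution date.

```python
from datetime import date, timedelta


def window_partitions(execution_date, days):
    """Return the daily partition dates in a rolling window.

    Assumption: the window is `days` long and ends on (and includes)
    execution_date. Dates are returned oldest first.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    return [execution_date - timedelta(days=offset)
            for offset in range(days - 1, -1, -1)]


# A run for Wed, Aug 8, 2018 would depend on these partitions.
weekly = window_partitions(date(2018, 8, 8), 7)
monthly = window_partitions(date(2018, 8, 8), 28)
```

Every daily run has a different set of upstream partitions, which is why a
gate that only fires on Sundays does not seem to cover this case.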

On Wed, Aug 8, 2018 at 11:00 AM, Andy Cooper <andy.cooper@xxxxxxxxxxxxx>
wrote:

> To expand on Taylor's idea
>
> I recently wrote a ScheduleBlackoutSensor that would allow you to prevent a
> task from running if it meets the criteria provided. It accepts an array of
> args for any number of the criteria so you could leverage this sensor to
> provide "blackout" runs for a range of days of the week.
>
> https://github.com/apache/incubator-airflow/pull/3702/files
>
> For example,
>
> task = ScheduleBlackoutSensor(day_of_week=[0,1,2,3,4,5], dag=dag)
>
> Would prevent a task from running Monday - Saturday, allowing it to run on
> Sunday.
>
> You could leverage this Sensor as you would any other sensor or you could
> invert the logic so that you would only need to specify
>
> task = ScheduleBlackoutSensor(day_of_week=6, dag=dag)
>
> To "whitelist" a task to run on Sundays.
>
>
> Let me know if you have any questions
>
> On Wed, Aug 8, 2018 at 1:47 PM Taylor Edmiston <tedmiston@xxxxxxxxx>
> wrote:
>
> > Gabriel -
> >
> > One approach I've seen for a similar use case is to have multiple related
> > rollups in one DAG that runs daily, then have the non-daily tasks skip most
> > of the time (e.g., weekly only actually executes on Sundays and is
> > parameterized to look at the last 7 days).
> >
> > You could implement that not-running part a few ways, but one idea is a
> > sensor in front of the weekly rollup task.  Imagine a SundaySensor whose
> > check is: return execution_date.weekday() == 6.  One thing to keep in mind
> > here is that this depends on the DAG's cron schedule being more granular
> > than the tasks.
> >
> > I think this could generalize into a DayOfWeekSensor / DayOfMonthSensor
> > that would be nice to have.
> >
> > Of course this does mean some scheduler inefficiency on the skip days, but
> > as long as those skips are fast and the overall number of tasks is small, I
> > can accept that.
> >
> > *Taylor Edmiston*
> > Blog <https://blog.tedmiston.com/> | CV
> > <https://stackoverflow.com/cv/taylor> | LinkedIn
> > <https://www.linkedin.com/in/tedmiston/> | AngelList
> > <https://angel.co/taylor> | Stack Overflow
> > <https://stackoverflow.com/users/149428/taylor-edmiston>
> >
> >
> > On Wed, Aug 8, 2018 at 1:11 PM, Gabriel Silk <gsilk@xxxxxxxxxxx.invalid>
> > wrote:
> >
> > > Hello Airflow community,
> > >
> > > I have a basic question about how best to model a common data pipeline
> > > pattern here at Dropbox.
> > >
> > > At Dropbox, all of our logs are ingested and written into Hive in hourly
> > > and/or daily rollups. On top of this data we build many weekly and monthly
> > > rollups, which typically run on a daily cadence and compute results over a
> > > rolling window.
> > >
> > > If we have a metric X, it seems natural to put the daily, weekly, and
> > > monthly rollups for metric X all in the same DAG.
> > >
> > > However, the different rollups have different dependency structures. The
> > > daily job only depends on a single day partition, whereas the weekly job
> > > depends on 7, the monthly on 28.
> > >
> > > In Airflow, it seems the two paradigms for modeling dependencies are:
> > > 1) Depend on a *single run of a task* within the same DAG
> > > 2) Depend on *multiple runs of a task* by using an ExternalTaskSensor
> > >
> > > I'm not sure how I could possibly model this scenario using approach #1,
> > > and I'm not sure approach #2 is the most elegant or performant way to model
> > > this scenario.
> > >
> > > Any thoughts or suggestions?
> > >
> >
>
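
For reference, the day-of-week gating quoted above boils down to a check like
the one below. This is only a sketch in plain Python: should_run and
allowed_weekdays are hypothetical names, not part of Airflow or of the linked
pull request. It relies on date.weekday() returning 0 for Monday through 6
for Sunday.

```python
from datetime import date


def should_run(execution_date, allowed_weekdays=(6,)):
    """Return True if a run for execution_date is allowed.

    allowed_weekdays uses Python's numbering: 0 is Monday, 6 is Sunday.
    The default allows Sundays only, like the SundaySensor idea.
    """
    return execution_date.weekday() in allowed_weekdays


# Sun, Aug 5, 2018 passes the Sunday-only gate; Wed, Aug 8, 2018 does not.
sunday_ok = should_run(date(2018, 8, 5))
wednesday_ok = should_run(date(2018, 8, 8))

# A rolling rollup that runs every day would allow all seven weekdays,
# so the gate never skips anything.
rolling_ok = should_run(date(2018, 8, 8), allowed_weekdays=range(7))
```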