OSDir

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Improving Airflow SLAs


Hi all,

Since the response so far has been positive or neutral, I intend to submit
one or more PRs targeting 2.0 (I think that some parts will be separable
from a larger SLA refactor). I intend to address at least the following
JIRA issues:

https://issues.apache.org/jira/browse/AIRFLOW-2236
https://issues.apache.org/jira/browse/AIRFLOW-1472
https://issues.apache.org/jira/browse/AIRFLOW-1360
https://issues.apache.org/jira/browse/AIRFLOW-557
https://issues.apache.org/jira/browse/AIRFLOW-133

Regards,

-James M.



On Thu, May 3, 2018 at 12:13 PM, Maxime Beauchemin <
maximebeauchemin@xxxxxxxxx> wrote:

> About de-coupling the SLA management process, I've had conversations in the
> direction of renaming the scheduler to "supervisor" to reflect the fact
> that it's not just scheduling processes, it does a lot more tasks than just
> that, SLA management being one of them.
>
> I still think the default should be to require a single supervisor that
> would do all the "supervision" work though. I'm generally against requiring
> more types of nodes on the cluster. But perhaps the supervisor could have
> switches to be started in modes where it would only do a subset of its
> tasks, so that people can run multiple specialized supervisor nodes if they
> want to.
>
> For the record, I was thinking that renaming the scheduler to supervisor
> would likely happen as we re-write it to enable multiple concurrent
> supervisor processes. It turns out that parallelizing the scheduler hasn't
> been as critical as I thought it would be originally, especially with the
> current multi-process scheduler. Sounds like the community is getting a lot
> of mileage out of this current multi-process scheduler.
>
> Max
>
> On Thu, May 3, 2018 at 7:31 AM, Jiening Wen <jieningwen@xxxxxxxxxxx>
> wrote:
>
> > I would love to see this proposal gets implemented in airflow.
> > In our case duration based SLA makes much more sense and I ended up
> adding
> > a decorator to the execute method in our custom operators.
> >
> > Best regards,
> > Jiening
> >
> > -----Original Message-----
> > From: James Meickle [mailto:jmeickle@xxxxxxxxxxxxxx]
> > Sent: Wednesday 02 May 2018 9:00 PM
> > To: dev@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > Subject: Improving Airflow SLAs [External]
> >
> > At Quantopian we use Airflow to produce artifacts based on the previous
> > day's stock market data. These artifacts are required for us to trade on
> > today's stock market. Therefore, I've been investing time in improving
> > Airflow notifications (such as writing PagerDuty and Slack integrations).
> > My attention has turned to Airflow's SLA system, which has some drawbacks
> > for our use case:
> >
> > 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
> > skipped for this execution date will still trigger emails/callbacks. This
> > is a huge problem for us because we run almost no tasks on weekends
> (since
> > the stock market isn't open).
> >
> > 2) Defining SLAs can be awkward because they are relative to the
> execution
> > date instead of the task start time. There's no way to alert if a task
> runs
> > for "more than an hour", for any non-trivial DAG. Instead you can only
> > express "more than an hour from execution date".  The financial data we
> use
> > varies in when it arrives, and how long it takes to process (data volume
> > changes frequently); we also have tight timelines that make retries
> > difficult, so we want to alert an operator while leaving the task
> running,
> > rather than failing and then alerting.
> >
> > 3) SLA miss emails don't have a subject line containing the instance URL
> > (important for us because we run the same DAGs in both
> staging/production)
> > or the execution date they apply to. When opened, they can get hard to
> read
> > for even a moderately sized DAG because they include a flat list of task
> > instances that are unsorted (neither alpha nor topo). They are also
> lacking
> > any links back to the Airflow instance.
> >
> > 4) SLA emails are not callbacks, and can't be turned off (other than
> either
> > removing the SLA or removing the email attribute on the task instance).
> The
> > way that SLA miss callbacks are defined is not intuitive, as in contrast
> to
> > all other callbacks, they are DAG-level rather than task-level. Also, the
> > call signature is poorly defined: for instance, two of the arguments are
> > just strings produced from the other two arguments.
> >
> > I have some thoughts about ways to fix these issues:
> >
> > 1) I just consider this one a bug. If a task instance is skipped, that
> was
> > intentional, and it should not trigger any alerts.
> >
> > 2) I think that the `sla=` parameter should be split into something like
> > this:
> >
> > `expected_start`: Timedelta after execution date, representing when this
> > task must have started by.
> > `expected_finish`: Timedelta after execution date, representing when this
> > task must have finished by.
> > `expected_duration`: Timedelta after task start, representing how long it
> > is expected to run including all retries.
> >
> > This would give better operator control over SLAs, particularly for tasks
> > deeper in larger DAGs where exact ordering may be hard to predict.
> >
> > 3) The emails should be improved to be more operator-friendly, and take
> > into account that someone may get a callback for a DAG they don't know
> very
> > well, or be paged by this notification.
> >
> > 4.1) All Airflow callbacks should support a list, rather than requiring a
> > single function. (I've written a wrapper that does this, but it would be
> > better for Airflow to just handle this itself.)
> >
> > 4.2) SLA miss callbacks should be task callbacks that receive context,
> like
> > all the other callbacks. Having a DAG figure out which tasks have missed
> > SLAs collectively is fine, but getting SLA failures in a batched callback
> > doesn't really make much sense. Per-task callbacks can be fired
> > individually within a batch of failures detected at the same time.
> >
> > 4.3) SLA emails should be the default SLA miss callback function, rather
> > than being hardcoded.
> >
> > Also, overall, the SLA miss logic is very complicated. It's stuffed into
> > one overloaded function that is responsible for checking for SLA misses,
> > creating database objects for them, filtering tasks, selecting emails,
> > rendering, and sending. Refactoring it would be a good maintainability
> win.
> >
> > I am already implementing some of the above in a private branch, but I'd
> be
> > curious to hear community feedback as to which of these suggestions might
> > be desirable upstream. I could have this ready for Airflow 2.0 if there
> is
> > interest beyond my own use case.
> >
>