Re: Dealing with data latency


Yes, exactly. Sensors are ultimately just a few methods on top of a
standard operator:
https://airflow.apache.org/_modules/airflow/operators/sensors.html

The BaseSensorOperator doesn't modify how retries work. You definitely want
a retry in case the worker running the sensor dies. But even a temporary DNS
outage or a dropped SSH connection might merit a retry too, depending on how
the operator was implemented (whether it performs any retrying itself before
causing a task failure).
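
As a rough sketch of the retry settings discussed here (not taken from
anyone's actual DAG): Airflow operators, sensors included, accept `retries`
and `retry_delay`, and these are commonly supplied through a DAG's
`default_args` dict. The values below and the `max_retry_window` helper are
assumptions for illustration only.

```python
from datetime import timedelta

# Illustrative values only; tune them to how long an outage you want to ride out.
default_args = {
    "retries": 3,                         # re-run the sensor task if its worker dies
    "retry_delay": timedelta(minutes=5),  # pause between attempts
}


def max_retry_window(args):
    """Total time spent waiting between attempts if every retry is used.

    Hypothetical helper, not part of Airflow.
    """
    return args["retries"] * args["retry_delay"]


print(max_retry_window(default_args))
```

With these numbers a sensor gets three more attempts spread over fifteen
minutes of delay, which covers a short service restart but not a long outage.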

On Tue, Jun 5, 2018 at 8:12 PM, Pedro Machado <pedro@xxxxxxxxxxxxxx> wrote:

> Hi James,
> I've noticed that some dags fail if the services are restarted while a
> sensor is waiting. Originally I didn't think retries would be relevant for
> a time sensor but it sounds like if the worker crashes, the only way for
> the sensor to rerun is if the retry count hasn't been met. Is this one of
> the points you are making?
> Thanks.
>
> On Tue, Jun 5, 2018 at 9:41 AM James Meickle <jmeickle@xxxxxxxxxxxxxx>
> wrote:
>
> > We have to use a lot of time sensors like this, for reports that
> > shouldn't be filed to a third party before a certain time of day. Since
> > these sensors are themselves tasks, they can fail to be scheduled or can
> > fail, for example if the underlying worker instance dies. I would recommend
> > double-checking your concurrency settings (esp. since you will have
> > multiple days' worth of DAGs running concurrently) and your retry settings.
> >
> > On Tue, Jun 5, 2018 at 10:34 AM, Pedro Machado <pedro@xxxxxxxxxxxxxx>
> > wrote:
> >
> > > Thanks, Max!
> > >
> > > On Mon, Jun 4, 2018 at 12:47 PM Maxime Beauchemin <
> > > maximebeauchemin@xxxxxxxxx> wrote:
> > >
> > > > > The common standard is to have the execution_date aligned with the
> > > > > partition date in the database (say 2018-08-08) and contain data from
> > > > > 2018-08-08T00:00:00.000 to 2018-08-08T23:59:59.999.
> > > >
> > > > > The partition date and execution_date match and correspond to the left
> > > > > bound of the time interval processed.
> > > > >
> > > > > Then you'd use some sensors to make sure this cannot run until the
> > > > > desired time or conditions are met.
> > > >
> > > > Max
> > > >
> > > > On Mon, Jun 4, 2018 at 5:46 AM Pedro Machado <pedro@xxxxxxxxxxxxxx>
> > > wrote:
> > > >
> > > > > > Hi. What is the recommended way to deal with data latency? For
> > > > > > example, I have a feed that is not considered final until 72 hours
> > > > > > have passed after the end of the daily period.
> > > > >
> > > > > For example, Monday's data would be ready by Thursday at 23:59.
> > > > >
> > > > > > Should I pull data based on the execution date minus a 72 hour offset
> > > > > > or use the execution date and somehow delay the data pull for 72 hours?
> > > > > >
> > > > > > The latter would be more intuitive (data pull date = execution date)
> > > > > > but I am not sure if it's a good pattern.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Pedro
> > > > >
> > > >
> > >
> >
>
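
For the original question at the bottom of the thread, here is a minimal
sketch of the "delay the pull" arithmetic: keep execution_date at the start
of the daily period and compute the earliest moment the feed counts as final.
The function name and the two constants are assumptions based on the example
in the thread (daily period, 72 hours of latency). As far as I know, Airflow's
TimeDeltaSensor waits for a delta after execution_date + schedule_interval,
which is the same calculation.

```python
from datetime import datetime, timedelta

PERIOD = timedelta(days=1)     # daily schedule, as in the thread's example
LATENCY = timedelta(hours=72)  # feed is final 72 hours after the period ends


def earliest_pull_time(execution_date):
    """Return the first moment the data for execution_date's period is final."""
    period_end = execution_date + PERIOD
    return period_end + LATENCY


# Monday 2018-06-04 becomes final when Thursday ends, i.e. 2018-06-08 00:00.
print(earliest_pull_time(datetime(2018, 6, 4)))
```

A sensor placed ahead of the pull task would simply wait until this time,
so the data pull date still equals the execution date.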