
Re: Dealing with data latency


We use a lot of time sensors like this for reports that shouldn't be
filed to a third party before a certain time of day. Since these sensors
are themselves tasks, they can fail to be scheduled, or can fail outright,
e.g. if the underlying worker instance dies. I would recommend
double-checking your concurrency settings (especially since you will have
multiple days' worth of DAG runs executing concurrently) and your retry
settings.
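For what it's worth, the pattern can be sketched roughly like this (assuming Airflow 1.10 import paths; the DAG id, task ids, schedule, and the retry/concurrency values are all illustrative, not a drop-in):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.time_delta_sensor import TimeDeltaSensor

default_args = {
    "owner": "airflow",
    "retries": 3,                          # the sensor is a task too; let it retry
    "retry_delay": timedelta(minutes=15),
}

dag = DAG(
    dag_id="third_party_report",           # illustrative name
    default_args=default_args,
    start_date=datetime(2018, 6, 1),
    schedule_interval="@daily",
    max_active_runs=4,                     # several runs will be waiting at once
)

# TimeDeltaSensor waits until execution_date + schedule_interval + delta,
# i.e. until some wall-clock delay after the period closes.
wait = TimeDeltaSensor(
    task_id="wait_for_filing_window",
    delta=timedelta(hours=2),              # illustrative delay
    dag=dag,
)

def file_report(ds, **kwargs):
    # ds is the execution_date: the left bound of the processed interval
    print("filing report for %s" % ds)

file_task = PythonOperator(
    task_id="file_report",
    python_callable=file_report,
    provide_context=True,
    dag=dag,
)

wait >> file_task
```

Keeping retries on the sensor itself is what protects you when a worker dies mid-poke, and max_active_runs bounds how many waiting runs pile up.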

On Tue, Jun 5, 2018 at 10:34 AM, Pedro Machado <pedro@xxxxxxxxxxxxxx> wrote:

> Thanks, Max!
>
> On Mon, Jun 4, 2018 at 12:47 PM Maxime Beauchemin <maximebeauchemin@xxxxxxxxx> wrote:
>
> > The common standard is to have the execution_date aligned with the
> > partition date in the database (say 2018-08-08) and contain data from
> > 2018-08-08T00:00:00.000 to 2018-08-08T23:59:59.999.
> >
> > The partition date and execution_date match and correspond to the left
> > bound of the time interval processed.
> >
> > Then you'd use some sensors to make sure this cannot run until the
> > desired time or conditions are met.
> >
> > Max
> >
> > On Mon, Jun 4, 2018 at 5:46 AM Pedro Machado <pedro@xxxxxxxxxxxxxx> wrote:
> >
> > > Hi. What is the recommended way to deal with data latency? For
> > > example, I have a feed that is not considered final until 72 hours
> > > have passed after the end of the daily period.
> > >
> > > For example, Monday's data would be ready by Thursday at 23:59.
> > >
> > > Should I pull data based on the execution date minus a 72-hour offset
> > > or use the execution date and somehow delay the data pull for 72
> > > hours?
> > >
> > > The latter would be more intuitive (data pull date = execution date)
> > > but I am not sure if it's a good pattern.
> > >
> > > Thanks,
> > >
> > > Pedro
> > >
> >
>
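P.S. To make the 72-hour timing concrete in plain Python (no Airflow needed; `data_is_final` is a hypothetical helper, not an Airflow API): the data for a daily interval becomes final 72 hours after the interval ends, which lines up with Monday's data being ready by Thursday at 23:59.

```python
from datetime import datetime, timedelta

def data_is_final(execution_date, now, latency=timedelta(hours=72)):
    """True once `latency` has elapsed after the end of the daily
    interval [execution_date, execution_date + 1 day)."""
    period_end = execution_date + timedelta(days=1)
    return now >= period_end + latency

# Monday 2018-06-04's interval ends Tuesday 00:00; 72 hours later is
# Friday 2018-06-08 00:00, i.e. "Thursday at 23:59" for practical purposes.
monday = datetime(2018, 6, 4)
print(data_is_final(monday, datetime(2018, 6, 7, 23, 0)))   # False: still waiting
print(data_is_final(monday, datetime(2018, 6, 8, 0, 0)))    # True: data is final
```

With this framing, "execution date = data pull date" plus a sensor that waits out the latency (the second option above) keeps partitions aligned with execution dates, which is the pattern Max describes.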