[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Capturing data changes that happen after the initial data pull

One way to do this would be to have your DAG file return two nearly
identical DAGs (like put it in a factory function and call it twice). The
difference would be that the "final" run would have a conditional to add an
extra time sensor at the DAG root, to wait N days for the data to finalize.
The effect would be that two DAG runs would start each day, but the latter
would stay uncompleted for a while.

Doing it that way has the advantage of still aligning with daily updates
rather than having a daily DAG and a weekly DAG; keeping execution dates
correct to which data is being processed; and not having the primary DAG
runs stay open for days (better for status tracking).

It has the disadvantage that any extremely long running tasks (including
sensors) will be more likely to fail, and that you would have several more
tasks and DAG runs open and taking resources/concurrency.

On Jun 6, 2018 3:34 PM, "Pedro Machado" <pedro@xxxxxxxxxxxxxx> wrote:

This is a similar case. My idea was to rerun the whole data pull. The
current DAG is idempotent so there is no issue with inserting duplicates.
Now, I'm trying to figure out the best way to code it in Airflow. Thanks

On Wed, Jun 6, 2018 at 2:29 PM Ben Gregory <ben@xxxxxxxxxxxxx> wrote:

> I've seen a similar use case with DoubleClick/Google Analytics (
> https://support.google.com/ds/answer/2791195?hl=en), where the reporting
> metrics have a "lookback window" of up to 30 days to mark conversion
> attribution (so if a user converts on the 14th day of clicking on an ad it
> will still be counted.
> What we ended up doing in that case is setting the start/stop params of
> query to the the full window (pulled daily) and then upserted in Redshift
> based on a primary key (in our case actually a composite key with multiple
> attributes). So you end up pulling a lot of redundant data but since
> there's no way to pull only updated records (which sounds like the case
> you're in), it's the best way to ensure your reporting is up-to-date.
> On Wed, Jun 6, 2018 at 1:10 PM Pedro Machado <pedro@xxxxxxxxxxxxxx> wrote:
> > Yes. It's literally the same API calls with the same dates, only done a
> few
> > days later. It's just redoing the same data pull but instead of pulling
> one
> > date each dag run, it would pull all dates for the previous week on
> > Tuesdays.
> >
> > Thanks!
> >
> --
> [image: Astronomer Logo] <https://www.astronomer.io/>
> *Ben Gregory*
> Data Engineer
> Mobile: +1-615-483-3653 • Online: astronomer.io <
> https://www.astronomer.io/>
> Download our new ebook. <http://marketing.astronomer.io/guide/> From
> Volume
> to Value - A Guide to Data Engineering.