Re: Deployment / Execution Model


I can see how my first email was confusing, where I said:

"Our first attempt at productionizing Airflow used the vanilla DAGs folder,
including all the deps of all the DAGs with the airflow binary itself"

What I meant is that we do have a separate DAG deployment, but we are
forced to package the *dependencies of the DAGs* with the Airflow binary,
because that is the only way to make the DAG definitions work.
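
To make the failure mode concrete, here is a minimal sketch that does not use
Airflow at all. It only assumes that the parsing process loads each DAG file
as an ordinary Python module, so top-level imports run inside that process.
The file name, the `try_load` helper, and the `DAG_ID` variable are
hypothetical, and `mylib` stands in for any library that is not installed
next to the parsing process.

```python
import importlib.util
import tempfile
import textwrap
from pathlib import Path

# Hypothetical DAG file. "mylib" stands in for any dependency that is not
# installed alongside the process that parses the file.
DAG_SOURCE = textwrap.dedent("""
    import mylib.foobar  # dependency of the DAG definition

    DAG_ID = "example"
""")


def try_load(path):
    """Import a file as a module; return the ImportError raised, or None."""
    spec = importlib.util.spec_from_file_location("dag_under_test", str(path))
    module = importlib.util.module_from_spec(spec)
    try:
        # Executing the module runs its top-level imports in this process.
        spec.loader.exec_module(module)
    except ImportError as exc:
        return exc
    return None


with tempfile.TemporaryDirectory() as tmp:
    dag_path = Path(tmp) / "example_dag.py"
    dag_path.write_text(DAG_SOURCE)
    error = try_load(dag_path)

print(type(error).__name__, error)
```

Unless `mylib` happens to be installed, the load fails before any DAG object
is created, which is why the dependency has to ship with whatever process
reads the file.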

On Wed, Oct 31, 2018 at 11:18 PM, Gabriel Silk <gsilk@xxxxxxxxxxx> wrote:

> Our DAG deployment is already a separate deployment from Airflow itself.
>
> The issue is that the Airflow binary (whether acting as webserver,
> scheduler, or worker) is the one that *reads* the DAG files. So if you
> have, for example, a DAG that has this import statement in it:
>
> import mylib.foobar
>
> Then the only way to successfully interpret this DAG definition in the
> Airflow process is to package the Airflow binary with the mylib.foobar
> dependency.
>
> This implies that every time you add a new dependency in one of your DAG
> definitions, you have to re-deploy Airflow itself, not just the DAG
> definitions.
>
>
> On Wed, Oct 31, 2018 at 2:45 PM, Maxime Beauchemin <
> maximebeauchemin@xxxxxxxxx> wrote:
>
>> Deploying the DAGs should be decoupled from deploying Airflow itself. You
>> can just use a resource that syncs the DAGs repo to the boxes on the
>> Airflow cluster periodically (say every minute). Resource orchestrators
>> like Chef, Ansible, Puppet, should have some easy way to do that. Either
>> that or some sort of mount or mount-equivalent (k8s has constructs for
>> that, EFS on Amazon).
>>
>> Also note that the DagFetcher abstraction that's been discussed before on
>> the mailing list would solve this and more.
>>
>> Max
>>
>> On Wed, Oct 31, 2018 at 2:37 PM Gabriel Silk <gsilk@xxxxxxxxxxx.invalid>
>> wrote:
>>
>> > Hello Airflow community,
>> >
>> >
>> > I'm currently putting Airflow into production at my company of 2000+
>> > people. The most significant sticking point so far is the deployment /
>> > execution model. I wanted to write up my experience so far in this
>> > matter and see how other people are dealing with this issue.
>> >
>> > First of all, our goal is to allow engineers to author DAGs and easily
>> > deploy them. That means they should be able to make changes to their
>> > DAGs, add/remove dependencies, and not have to redeploy any of the core
>> > components (scheduler, webserver, workers).
>> >
>> > Our first attempt at productionizing Airflow used the vanilla DAGs
>> > folder, including all the deps of all the DAGs with the airflow binary
>> > itself. Unfortunately, that meant we had to redeploy the scheduler,
>> > webserver and/or workers every time a dependency changed, which will
>> > definitely not work for us long term.
>> >
>> > The next option we considered was to use the "packaged DAGs" approach,
>> > whereby you place dependencies in a zip file. This would not work for
>> > us, due to the lack of support for dynamic libraries (see
>> > https://airflow.apache.org/concepts.html#packaged-dags).
>> >
>> > We have finally arrived at an option that seems reasonable, which is to
>> > use a single Operator that shells out to various binary targets that we
>> > build independently of Airflow, and which include their own
>> > dependencies. Configuration is serialized via protobuf and passed over
>> > stdin to the subprocess. The parent process (which is in Airflow's
>> > memory space) streams the logs from stdout and stderr.
>> >
>> > This approach has the advantage of working seamlessly with our build
>> > system, and allowing us to redeploy DAGs even when dependencies in the
>> > operator implementations change.
>> >
>> > Any thoughts / comments / feedback? Have people faced similar issues out
>> > there?
>> >
>> > Many thanks,
>> >
>> >
>> > -G Silk
>> >
>>
>
>
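
For readers who want a picture of the single-operator approach described in
the original message, here is a rough sketch of the parent-process side. It
is not the author's implementation. JSON stands in for protobuf so the example
needs only the standard library, stderr is merged into stdout to keep the
streaming loop short, and the `run_target` helper, the `table` config key, and
the inline child script are all hypothetical. In an operator, a function like
this would presumably be called from the task's execute step.

```python
import json
import subprocess
import sys
import threading


def run_target(argv, config, log=print):
    """Run an external binary, send config on stdin, stream its output."""
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # simplification: one merged log stream
    )
    # Assumption: JSON stands in for the protobuf serialization.
    payload = json.dumps(config).encode("utf-8")

    def feed():
        # Write from a separate thread so a large payload cannot deadlock
        # against a child that is already producing output.
        try:
            proc.stdin.write(payload)
            proc.stdin.close()
        except BrokenPipeError:
            pass  # child exited without reading its configuration

    writer = threading.Thread(target=feed)
    writer.start()
    for raw in proc.stdout:
        log(raw.decode("utf-8", errors="replace").rstrip())
    writer.join()
    return proc.wait()


# Stand-in for an independently built binary target with its own deps.
CHILD = (
    "import json, sys\n"
    "cfg = json.load(sys.stdin)\n"
    "print('processing', cfg['table'])\n"
    "print('warning: demo only', file=sys.stderr)\n"
)

lines = []
code = run_target([sys.executable, "-c", CHILD], {"table": "events"}, log=lines.append)
print(code, lines)
```

The parent only needs the standard library and the serialization format, so
the target's dependencies never have to be importable from the Airflow
process.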