Re: Mocking airflow (similar to moto for AWS)


Thanks! I like the suggestion about testing hooks rather than whole DAGs -
we will certainly use it in the future. And BDD is the approach I really
like - thanks for the code examples! We might also use it in the near
future. Super helpful!

So far we have mocked hooks only in our unit tests (for example here
<https://github.com/PolideaInternal/incubator-airflow/blob/master/tests/contrib/operators/test_gcp_compute_operator.py#L241>),
which helps to test the logic of the more complex operators.
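
To illustrate the idea of mocking a hook so that only the operator logic is
exercised, here is a minimal sketch. It does not use Airflow itself:
ComputeHook, StartInstanceOperator and their arguments are hypothetical
stand-ins invented for this example, and only unittest.mock from the standard
library is assumed.

```python
import unittest
from unittest import mock


class ComputeHook:
    """Hypothetical hook; a real one would call an external service."""

    def start_instance(self, project_id, zone, resource_id):
        raise RuntimeError("this sketch never talks to a real service")


class StartInstanceOperator:
    """Hypothetical operator that holds the logic under test."""

    def __init__(self, project_id, zone, resource_id):
        # Argument validation is operator logic, so it is testable
        # without any hook at all.
        if not resource_id:
            raise ValueError("resource_id must not be empty")
        self.project_id = project_id
        self.zone = zone
        self.resource_id = resource_id

    def execute(self, context):
        hook = ComputeHook()
        return hook.start_instance(self.project_id, self.zone, self.resource_id)


class StartInstanceOperatorTest(unittest.TestCase):
    def test_execute_delegates_to_hook(self):
        # Replace the hook method for the duration of the test only.
        with mock.patch.object(
            ComputeHook, "start_instance", return_value={"status": "DONE"}
        ) as start:
            op = StartInstanceOperator("my-project", "europe-west1-b", "vm-1")
            result = op.execute(context={})
        start.assert_called_once_with("my-project", "europe-west1-b", "vm-1")
        self.assertEqual(result, {"status": "DONE"})

    def test_empty_resource_id_is_rejected(self):
        with self.assertRaises(ValueError):
            StartInstanceOperator("my-project", "europe-west1-b", "")
```

Saved to a file, this runs with `python -m unittest <file>` and needs neither
a scheduler nor a database, because the mocked method never reaches a real
system.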
@Anthony - we also use a modified docker-based environment to run the tests
(https://github.com/PolideaInternal/airflow-breeze/tree/integration-tests),
including running full DAGs. And yes, the missing import was just an
exaggerated example :) We also use the IDE and linters to catch those early :D.

I still think there is a need to run whole DAGs on top of testing operators
and hooks separately, in order to test the more complex interactions between
the operators. In our case we use example DAGs both for documentation and for
running full e2e integration tests (for example here
https://github.com/PolideaInternal/incubator-airflow/blob/master/airflow/contrib/example_dags/example_gcp_compute.py).
Those are simple examples, but we will have somewhat more complex
interactions and it would be great to be able to run them quicker. However,
if we get the hook tests automated/unit-testable as well, maybe our current
approach of running them in the full dockerized environment will be good
enough.
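
As a rough picture of what "running a whole DAG quickly" could look like in a
test, here is a toy sketch. It deliberately does not use Airflow's API: tasks
are plain callables, dependencies are a plain dict, and the xcom dict only
imitates XCom passing. run_dag and the task names are hypothetical, and
graphlib requires Python 3.9 or newer.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, upstream):
    """Run every task once, in dependency order, in the current process.

    tasks:    mapping of task_id -> callable(xcom) returning a value
    upstream: mapping of task_id -> set of task_ids that must run first
    Returns a dict with each task's return value, keyed by task_id.
    """
    graph = {task_id: upstream.get(task_id, set()) for task_id in tasks}
    xcom = {}
    for task_id in TopologicalSorter(graph).static_order():
        print("running", task_id)  # log to the console, not per-task files
        xcom[task_id] = tasks[task_id](xcom)
    return xcom


# Three toy tasks; each one reads the value "pushed" by its upstream task.
tasks = {
    "create_instance": lambda xcom: "vm-1",
    "start_instance": lambda xcom: xcom["create_instance"] + ":RUNNING",
    "check_instance": lambda xcom: xcom["start_instance"].endswith(":RUNNING"),
}
upstream = {
    "start_instance": {"create_instance"},
    "check_instance": {"start_instance"},
}

results = run_dag(tasks, upstream)
```

A real test would build the task callables from the operators' execute
methods with mocked hooks. Whether that is enough to replace a run in the
dockerized environment is an open question.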

J.


On Thu, Oct 18, 2018 at 5:44 PM Anthony Brown <anthony.brown@xxxxxxxxxxxxxxx>
wrote:

> I have pylint set up in my IDE, which catches most silly errors like
> missing imports.
> I also use a docker image so I can start up airflow locally and manually
> test any changes before trying to deploy them. I use a slightly modified
> version of https://github.com/puckel/docker-airflow to control it. This
> only works on connections I have access to from my machine.
> Finally, I have a suite of tests based on
> https://blog.usejournal.com/testing-in-airflow-part-1-dag-validation-tests-dag-definition-tests-and-unit-tests-2aa94970570c
> which I can run to check that DAGs are valid, along with any unit tests I
> can put in. The tests are run in a docker container which runs a local db
> instance so I have access to xcoms etc.
>
> As part of my deployment pipeline, I run pylint and the tests again before
> deploying anywhere to make sure nobody has forgotten to run them locally.
>
> Gerard - I like the suggestion about using mocked hooks and BDD. I will
> look into this further.
>
> On Thu, 18 Oct 2018 at 15:12, Gerard Toonstra <gtoonstra@xxxxxxxxx> wrote:
>
> > There was a discussion about a unit testing approach last year (2017, I
> > believe). If you dig through the mail archives, you can find it.
> >
> > My take is:
> >
> > - You should test "hooks" against some real system, which can be a docker
> > container. Make sure the behavior is predictable when talking to that
> > system. Hook tests are not part of general CI tests because of the
> > complexity of the CI setup you'd have to make, so they are run on local
> > boxes.
> > - Maybe add additional "mock" hook tests, mocking out the connected
> > systems.
> > - When hooks are tested, operators can use 'mocked' hooks that no longer
> > need access to actual systems. You can then set up an environment where
> > you have predictable inputs and outputs and test how the operators act
> > on them. I've used "behave" to do that with very simple record sets, but
> > you can make these as complex as you want.
> > - Then you know your hooks and operators work functionally. Testing if
> > your workflow works in general can be implemented by adding "check"
> > operators. The benefit here is that you don't test the workflow once,
> > but you test for data consistency every time the DAG runs. If you have
> > complex workflows where the correct behavior of the flow is worrisome,
> > then you may need to go deeper into it.
> >
> > The above doesn't depend on DAGs that need to be scheduled and the delays
> > involved in that.
> >
> > All of the above is implemented in my repo
> > https://github.com/gtoonstra/airflow-hovercraft , using "behave" as a
> > BDD method of testing, so you can peruse that.
> >
> > Rgds,
> >
> > G>
> >
> >
> > On Thu, Oct 18, 2018 at 2:43 PM Jarek Potiuk <Jarek.Potiuk@xxxxxxxxxxx>
> > wrote:
> >
> > > I am also looking to have (I think) a similar workflow. Maybe someone
> > > has done something similar and can give some hints on how to do it the
> > > easiest way?
> > >
> > > Context:
> > >
> > > While developing operators I am using example test DAGs that talk to
> > > GCP. So far our "integration tests" require copying the dag folder and
> > > restarting the airflow servers, unpausing the dag and waiting for it to
> > > start. That takes a lot of time, sometimes just to find out that you
> > > missed one import.
> > >
> > > Ideal workflow:
> > >
> > > Ideally I'd love to have a "unit" test (i.e. possible to run via
> > > nosetests or IDE integration/PyCharm) that:
> > >
> > >    - should not need to have airflow scheduler/webserver started. I
> > >    guess we need a DB but possibly an in-memory, on-demand created
> > >    database might be a good solution
> > >    - load the DAG from a file specified (not really from the /dags
> > >    directory)
> > >    - build internal dependencies between the DAG tasks (as specified in
> > >    the DAG)
> > >    - run the DAG immediately and fully (i.e. run all the "execute"
> > >    methods as needed and pass XCOM between tasks).
> > >    - ideally produce log output in console rather than in per-task
> > >    files.
> > >
> > > I thought about using DagRun/DagBag but have not tried it yet and am
> > > not sure if you need to have the whole environment set up (which
> > > parts?). Any help appreciated :)
> > >
> > > J.
> > >
> > > On Thu, Oct 18, 2018 at 1:08 AM bielllobera@xxxxxxxxx <
> > > bielllobera@xxxxxxxxx>
> > > wrote:
> > >
> > > > I think it would be great to have a way to mock airflow for unit
> > > > tests. The way I approached this was to create a context manager
> > > > that creates a temporary directory, sets the AIRFLOW_HOME environment
> > > > variable to this directory (only within the scope of the context
> > > > manager) and then renders an airflow.cfg to that location. This
> > > > creates an SQLite database just for the test so you can add variables
> > > > and connections needed for the test without affecting the real
> > > > Airflow installation.
> > > >
> > > > The first thing I realized is that this didn't work if the imports
> > > > were outside the context manager, since airflow.configuration and
> > > > airflow.settings perform all the initialization when they are
> > > > imported, so the AIRFLOW_HOME variable is already set to the real
> > > > installation before getting inside the context manager.
> > > >
> > > > The workaround for this was to reload those modules and this works
> > > > for the tests I have written. However, when I tried to use it for
> > > > something more complex (I have a plugin that I'm importing) I noticed
> > > > that inside the operator in this plugin, AIRFLOW_HOME is still set to
> > > > the real installation, not the temporary one for the test. I thought
> > > > this must be related to the imports but I haven't been able to figure
> > > > out a way to fix the issue. I tried patching some methods but I must
> > > > have been missing something because the database initialization
> > > > failed.
> > > >
> > > > Does anyone have an idea on the best way to mock/patch airflow so
> > > > that EVERYTHING that is executed inside the context manager uses the
> > > > temporary installation?
> > > >
> > > > PS: This is my current attempt which works for the tests I defined
> > > > but not for external plugins:
> > > > https://github.com/biellls/airflow_testing
> > > >
> > > > For an example of how it works:
> > > > https://github.com/biellls/airflow_testing/blob/master/tests/mock_airflow_test.py
> > > >
> > >
> > >
> > > --
> > >
> > > *Jarek Potiuk, Principal Software Engineer*
> > > Mobile: +48 660 796 129
> > >
> >
>
>
> --
> --
>
> Anthony Brown
> Data Engineer BI Team - John Lewis
> Tel : 0787 215 7305
>


-- 

*Jarek Potiuk, Principal Software Engineer*
Mobile: +48 660 796 129