osdir.com

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: The need for LocalTaskJob


Yes clearly this area needs TLC. Thanks for getting the ball rolling.

Max

On Sat, Aug 4, 2018 at 1:58 PM Ash Berlin-Taylor <
ash_airflowlist@xxxxxxxxxxxxxx> wrote:

>
> > On 4 Aug 2018, at 21:25, Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:
> >
> > We can just execute “python” just fine. Because it will run in a
> separate interpreter no issues will come from sys.modules as that is not
> inherited. Will still parse DAGs in a separate process then. Forking (@ash)
> probably does not work as that does share sys.modules.
>
> Some sharing of modules was my idea - if we are careful about what modules
> we load, and we only load the airflow core pre fork, and don't parse any
> DAG pre-fork, then forking sharing currently loaded modules is a good thing
> for speed. Think of it like the preload_app option to a gunicorn worker,
> where the master loads the app and then forks.
>
> > [snip]
> >
> > I’m writing AIP-2
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-2+Simplify+process+launching
> to work this out.
>
> Sounds good. I'm not proposing we try my forking idea yet, and your
> proposal is a definite improvement from where we are now.
>
> >
> > B.
> >
> > Verstuurd vanaf mijn iPad
> >
> >> Op 4 aug. 2018 om 19:40 heeft Ash Berlin-Taylor <
> ash_airflowlist@xxxxxxxxxxxxxx> het volgende geschreven:
> >>
> >> Comments inline.
> >>
> >>> On 4 Aug 2018, at 18:28, Maxime Beauchemin <maximebeauchemin@xxxxxxxxx>
> wrote:
> >>>
> >>> Let me confirm I'm understanding this right, we're talking specifically
> >>> about the CeleryExecutor not starting and `airflow run` (not --raw)
> >>> command, and fire up a LocalTaskJob instead? Then we'd still have the
> >>> worker fire up the `airflow run --raw` command?
> >>>
> >>> Seems reasonable. One thing to keep in mind is the fact that shelling
> out
> >>> guarantees no `sys.module` caching, which is a real issue for slowly
> >>> changing DAG definitions. That's the reason why we'd have to reboot the
> >>> scheduler periodically before it used sub-processes to evaluate DAGs.
> Any
> >>> code that needs to evaluate a DAG should probably be done in a
> subprocess.
> >>
> >>>
> >>> Shelling out also allows for doing things like unix impersonation and
> >>> applying CGROUPS. This currently happens between `airflow run` and
> `airflow
> >>> run --raw`. The parent process also does heartbeat and listen for
> external
> >>> kill signal (kill pills).
> >>>
> >>> I think what we want is smarter executors and only one level of bash
> >>> command: the `airflow run --raw`, and ideally the system that fires
> this up
> >>> is not Airflow itself, and cannot be DAG-aware (or it will need to get
> >>> restarted to flush the cache).
> >>
> >> Rather than shelling out to `airflow run` could we instead fork and run
> the CLI code directly? This involves parsing the config twice, loading all
> of the airflow and SQLAlchemy deps twice etc. This I think would account
> for a not-insignificant speed difference for the unit tests. In the case of
> impersonation we'd probably have no option but to exec `airflow`, but
> most(?) people don't use that?
> >>
> >> Avoiding the extra parsing pentalty and process when we don't need it
> might be worth it for test speed up alone. And we've already got
> impersonation covered in the tests so we'll know that it still works.
> >>
> >>>
> >>> To me that really brings up the whole question of what should be
> handled by
> >>> the Executor, and what belongs in core Airflow. The Executor needs to
> do
> >>> more, and Airflow core less.
> >>
> >> I agree with the sentiment that Core should do less and Executors more
> -- many parts of the core are reimplementing what Celery itself could do.
> >>
> >>
> >>>
> >>> When you think about how this should all work on Kubernetes, it looks
> >>> something like this:
> >>> * the scheduler, through KubeExecutor, calls the k8s API, tells it to
> fire
> >>> up and Airflow task
> >>> * container boots up and starts an `airflow run --raw` command
> >>> * k8s handles heartbeats, monitors tasks, knows how to kill a running
> task
> >>> * the scheduler process (call it supervisor), talks with k8s through
> >>> KubeExecutor
> >>> and handles zombie cleanup and sending kill pills
> >>>
> >>> Now because Celery doesn't offer as many guarantees it gets a bit more
> >>> tricky. Is there even a way to send a kill pill through Celery? Are
> there
> >>> other ways than using a parent process to accomplish this?
> >>
> >> It does
> http://docs.celeryproject.org/en/latest/userguide/workers.html#revoke-revoking-tasks
> (at least it does now)
> >>
> >>>
> >>> At a higher level, it seems like we need to move more logic from core
> >>> Airflow into the executors. For instance, the heartbeat construct
> should
> >>> probably be 100% handled by the executor, and not an assumption in the
> core
> >>> code base.
> >>>
> >>> I think I drifted a bit, hopefully that's still helpful.
> >>>
> >>> Max
>
>