OSDir


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: The need for LocalTaskJob


Comments inline.

> On 4 Aug 2018, at 18:28, Maxime Beauchemin <maximebeauchemin@xxxxxxxxx> wrote:
> 
> Let me confirm I'm understanding this right, we're talking specifically
> about the CeleryExecutor not starting and `airflow run` (not --raw)
> command, and fire up a LocalTaskJob instead? Then we'd still have the
> worker fire up the `airflow run --raw` command?
> 
> Seems reasonable. One thing to keep in mind is the fact that shelling out
> guarantees no `sys.module` caching, which is a real issue for slowly
> changing DAG definitions. That's the reason why we'd have to reboot the
> scheduler periodically before it used sub-processes to evaluate DAGs. Any
> code that needs to evaluate a DAG should probably be done in a subprocess.

> 
> Shelling out also allows for doing things like unix impersonation and
> applying CGROUPS. This currently happens between `airflow run` and `airflow
> run --raw`. The parent process also does heartbeat and listen for external
> kill signal (kill pills).
> 
> I think what we want is smarter executors and only one level of bash
> command: the `airflow run --raw`, and ideally the system that fires this up
> is not Airflow itself, and cannot be DAG-aware (or it will need to get
> restarted to flush the cache).

Rather than shelling out to `airflow run` could we instead fork and run the CLI code directly? This involves parsing the config twice, loading all of the airflow and SQLAlchemy deps twice etc. This I think would account for a not-insignificant speed difference for the unit tests. In the case of impersonation we'd probably have no option but to exec `airflow`, but most(?) people don't use that?

Avoiding the extra parsing pentalty and process when we don't need it might be worth it for test speed up alone. And we've already got impersonation covered in the tests so we'll know that it still works.

> 
> To me that really brings up the whole question of what should be handled by
> the Executor, and what belongs in core Airflow. The Executor needs to do
> more, and Airflow core less.

I agree with the sentiment that Core should do less and Executors more -- many parts of the core are reimplementing what Celery itself could do.


> 
> When you think about how this should all work on Kubernetes, it looks
> something like this:
> * the scheduler, through KubeExecutor, calls the k8s API, tells it to fire
> up and Airflow task
> * container boots up and starts an `airflow run --raw` command
> * k8s handles heartbeats, monitors tasks, knows how to kill a running task
> * the scheduler process (call it supervisor), talks with k8s through
> KubeExecutor
> and handles zombie cleanup and sending kill pills
> 
> Now because Celery doesn't offer as many guarantees it gets a bit more
> tricky. Is there even a way to send a kill pill through Celery? Are there
> other ways than using a parent process to accomplish this?

It does http://docs.celeryproject.org/en/latest/userguide/workers.html#revoke-revoking-tasks (at least it does now)

> 
> At a higher level, it seems like we need to move more logic from core
> Airflow into the executors. For instance, the heartbeat construct should
> probably be 100% handled by the executor, and not an assumption in the core
> code base.
> 
> I think I drifted a bit, hopefully that's still helpful.
> 
> Max