[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Concurrency Settings for Celery Executor

"celeryd_concurrency" and "parallelism" serve different purposes

"celeryd_concurrency" determines how many worker subprocesses will be spun
up on an airflow worker instance (how many concurrent tasks per machine)

"parallelism" determines how many tasks airflow will schedule at once;
running and queued tasks count against this limit

Generally, you will want to set "celeryd_concurrency" to as high a value as
your worker nodes can handle (or a lower value and use low powered workers)

As a general rule of thumb, you will want to set "parallelism" to be
(number of workers) * ("celeryd_concurrency"). A lower "parallelism" value
will leave workers idle. If you increase "parallelism", you will get tasks
backed up in your celery queue; this is great for high throughput and makes
sure workers aren't starved for work but can slightly increase the latency
of individual tasks.

Once you have those values tuned, you also need to be aware of dag
concurrency limits and task pool limits. Each DAG has a concurrency
parameter, which defaults to the "dag_concurrency" configuration value.
Each Task has a pool parameter, which defaults to `None`. These each
decrease the number of running tasks below the parallelism limit.

DAG concurrency behaves exactly the same as parallelism but sets a limit on
the individual DAG. Depending on how many DAGs you have, when they're
scheduled, etc., you will want to set the "dag_concurrency" to a level that
doesn't prevent DAGs from stepping on each others' resources too much. You
can also set individual limits for individual DAGs.

Pools also behave the same as parallelism but operate across DAGs. They are
useful for things like limiting the number of concurrent tasks hitting a
single database. If you do not specify a pool, the tasks will be placed in
the default pool, which has a size defined by
the "non_pooled_task_slot_count" configuration variable.

If you don't want to think too much about things, try:

celeryd_concurrency = 4 or 8
parallelism = celeryd_concurrency * number of workers
num_pooled_task_slot_count = parallelism
dag_concurrency = parallelism / 2

and then pretend pools and individual dag concurrency limits don't exist

On Wed, Jun 13, 2018 at 4:11 AM ramandumcs@xxxxxxxxx <ramandumcs@xxxxxxxxx>

> Hi All,
> There seems to be couple of settings in airflow.cfg which controls the
> number of tasks that can run in parallel on Airflow( "parallelism" and
> "celeryd_concurrency")
> In case of celeryExecutor which one is honoured. Do we need to set both or
> setting only  celeryd_concurrency would work.
> Thanks,
> Raman