Re: Concurrency Settings for Celery Executor
On 2018/06/20 20:06:38, George Leslie-Waksman <george@xxxxxxxxxxxxxxxx.INVALID> wrote:
> "celeryd_concurrency" and "parallelism" serve different purposes
> "celeryd_concurrency" determines how many worker subprocesses will be spun
> up on an airflow worker instance (how many concurrent tasks per machine)
> "parallelism" determines how many tasks airflow will schedule at once;
> running and queued tasks count against this limit
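In airflow.cfg terms, those two settings would look something like this (section placement assumes an Airflow 1.x-era config file; the values are placeholders):

```ini
[core]
# Global cap: running + queued tasks across the whole cluster
parallelism = 32

[celery]
# Worker subprocesses per airflow worker machine
celeryd_concurrency = 8
```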
> Generally, you will want to set "celeryd_concurrency" to as high a value as
> your worker nodes can handle (or a lower value and use low powered workers)
> As a general rule of thumb, you will want to set "parallelism" to be
> (number of workers) * ("celeryd_concurrency"). A lower "parallelism" value
> will leave workers idle. If you increase "parallelism" beyond that, tasks will
> back up in your celery queue; this is great for high throughput and makes
> sure workers aren't starved for work, but it can slightly increase the latency
> of individual tasks.
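That rule of thumb can be sketched as simple arithmetic (the worker count and concurrency here are made-up example numbers):

```python
# Hypothetical cluster: 4 worker machines, each running 8 Celery subprocesses.
num_workers = 4
celeryd_concurrency = 8

# Rule of thumb from above: size parallelism so every worker slot can be busy.
parallelism = num_workers * celeryd_concurrency

print(parallelism)  # 32
```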
> Once you have those values tuned, you also need to be aware of dag
> concurrency limits and task pool limits. Each DAG has a concurrency
> parameter, which defaults to the "dag_concurrency" configuration value.
> Each Task has a pool parameter, which defaults to `None`. These each
> decrease the number of running tasks below the parallelism limit.
> DAG concurrency behaves exactly the same as parallelism but sets a limit on
> the individual DAG. Depending on how many DAGs you have, when they're
> scheduled, etc., you will want to set "dag_concurrency" to a level that
> keeps DAGs from stepping on each other's resources too much. You
> can also set individual limits for individual DAGs.
> Pools also behave the same as parallelism but operate across DAGs. They are
> useful for things like limiting the number of concurrent tasks hitting a
> single database. If you do not specify a pool, the tasks will be placed in
> the default pool, which has a size defined by
> the "non_pooled_task_slot_count" configuration variable.
> If you don't want to think too much about things, try:
> celeryd_concurrency = 4 or 8
> parallelism = celeryd_concurrency * number of workers
> non_pooled_task_slot_count = parallelism
> dag_concurrency = parallelism / 2
> and then pretend pools and individual dag concurrency limits don't exist
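For a hypothetical 4-worker cluster, those suggestions would translate into something like the following (section layout again assumes a 1.x-era airflow.cfg):

```ini
[core]
parallelism = 32                  ; 8 * 4 workers
dag_concurrency = 16              ; parallelism / 2
non_pooled_task_slot_count = 32   ; = parallelism

[celery]
celeryd_concurrency = 8
```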
> On Wed, Jun 13, 2018 at 4:11 AM ramandumcs@xxxxxxxxx <ramandumcs@xxxxxxxxx> wrote:
> > Hi All,
> > There seem to be a couple of settings in airflow.cfg that control the
> > number of tasks that can run in parallel on Airflow ("parallelism" and
> > "celeryd_concurrency").
> > In the case of the CeleryExecutor, which one is honoured? Do we need to
> > set both, or would setting only "celeryd_concurrency" work?
> > Thanks,
> > Raman