airflow.exceptions.AirflowException dag_id not found
We’re using Airflow in our startup and it’s been great in many ways, thanks for the work you guys are doing!
Unfortunately, we’re hitting a number of issues: ops timing out and DAGs failing for unclear reasons, either with no logs at all or with the following error: "airflow.exceptions.AirflowException: dag_id could not be found". This seems to happen when enough DAGs are running at the same time, though it also happens occasionally at other times. The most reliable way to reproduce the error with our setup is to run enough DAGs at once. Most of the time, clearing the failed DAG run or ops and letting them re-run is enough to fix the problem.
I found resources (e.g., https://stackoverflow.com/questions/43235130/airflow-dag-id-could-not-be-found) pointing to the dagbag_import_timeout setting.
I experimented with that parameter and with others as well. They do seem to help, i.e., I can run more DAGs at once, but:
(1) if I run enough DAGs at once, I still see ops and DAGs failing, so the problem is not fixed;
(2) more importantly, I don’t fully understand the problem. I have some ideas on what is happening, but maybe I’m totally wrong?
Any recommendations on how I should investigate that?
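One check I can think of, assuming slow DAG file parsing is what trips dagbag_import_timeout, is to time the import of each DAG file in isolation. Below is a minimal sketch of that idea. The function name time_dag_imports is hypothetical, and it uses plain importlib rather than Airflow’s own DAG loader, so the timings are only indicative and will not reflect load on a busy worker:

```python
import importlib.util
import time
from pathlib import Path


def time_dag_imports(dag_folder):
    """Import every .py file under dag_folder in isolation.

    Returns a list of (path, seconds, error) tuples, slowest first.
    `error` is None when the file imported cleanly.
    NOTE: this is a hypothetical helper, not part of Airflow.
    """
    results = []
    for path in sorted(Path(dag_folder).rglob("*.py")):
        # Give each file a unique, throwaway module name.
        module_name = f"_timed_dag_{path.stem}"
        spec = importlib.util.spec_from_file_location(module_name, path)
        module = importlib.util.module_from_spec(spec)

        start = time.perf_counter()
        error = None
        try:
            # Executes the file's top-level code, as an import would.
            spec.loader.exec_module(module)
        except Exception as exc:
            error = repr(exc)
        elapsed = time.perf_counter() - start

        results.append((str(path), elapsed, error))

    # Slowest files first, since those are the likeliest to hit a timeout.
    return sorted(results, key=lambda r: r[1], reverse=True)
```

Calling time_dag_imports on our DAG folder and comparing the slowest entries against the configured dagbag_import_timeout should at least show whether any single file comes close to the limit when the machine is idle.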
Thank you very much!
Have a nice rest of the day,