Performance issues in Apache Airflow rarely appear as clear failures. Instead, they surface as subtle signals: longer task queue times, slower DAG parsing, scheduler lag, or workers hitting limits as workloads grow.

In this talk, we share lessons from profiling real production deployments across Airflow 2.x and 3.x. Combining frontline operational insights with focused technical investigation, we analysed task latency, DAG parsing time, worker behaviour, and metadata database performance under sustained load.

We show how configuration choices such as parallelism, max active runs, and worker resources can amplify or limit version-level improvements. We also discuss performance drift in long-running environments, where accumulated DAG runs expose slow queries or missing indexes that fresh deployments do not reveal.
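As a concrete reference point, the knobs mentioned above map to Airflow configuration entries, which can be set in `airflow.cfg` or via `AIRFLOW__SECTION__KEY` environment variables. The sketch below is illustrative only — the specific values are placeholders, not recommendations, and the right numbers depend entirely on your workload and worker sizing.

```python
import os

# Illustrative values only — tune against your own profiling data.
ILLUSTRATIVE_CONFIG = {
    # [core] parallelism: max task instances running at once, deployment-wide
    "AIRFLOW__CORE__PARALLELISM": "32",
    # [core] max_active_runs_per_dag: concurrent runs allowed per DAG
    "AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG": "4",
    # [core] max_active_tasks_per_dag: concurrent tasks within a single DAG run
    "AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG": "16",
    # [celery] worker_concurrency: tasks one Celery worker will run in parallel
    "AIRFLOW__CELERY__WORKER_CONCURRENCY": "8",
}

# Apply before the scheduler/worker process starts, e.g. in a container entrypoint.
os.environ.update(ILLUSTRATIVE_CONFIG)
print(os.environ["AIRFLOW__CORE__PARALLELISM"])
```

Raising one knob (say, worker concurrency) without the others can simply move the bottleneck — e.g. more concurrent tasks put more pressure on the metadata database — which is why these settings are profiled together rather than tuned in isolation.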

Finally, we examine how dynamic DAG generation (e.g. with Astronomer Cosmos-generated dbt DAGs) and custom user code can unintentionally impact parsing and execution performance.
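A minimal way to see why user code affects parsing: the DAG processor repeatedly executes each DAG file top to bottom, so any module-level work (network calls, heavy imports, file I/O) is paid on every parse. The sketch below — a stand-in, not Airflow's own tooling — times the execution of a Python file as a rough proxy for its parse cost, using a temporary file with a deliberate top-level delay to mimic expensive generation logic.

```python
import os
import runpy
import tempfile
import time

def time_dag_parse(path: str) -> float:
    """Execute a DAG file top-to-bottom and return wall-clock seconds.

    Rough proxy for scheduler parse cost: everything the file does at
    module level is paid here, on every parsing loop iteration.
    """
    start = time.perf_counter()
    runpy.run_path(path)  # runs the file the way an import would
    return time.perf_counter() - start

# Stand-in "DAG" file that sleeps at top level, mimicking e.g. an API
# call made during dynamic DAG generation.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("import time\ntime.sleep(0.2)  # expensive work at parse time\n")
    demo_path = f.name

elapsed = time_dag_parse(demo_path)
print(f"parse took {elapsed:.2f}s")
os.unlink(demo_path)
```

Multiplied across hundreds of files and a parsing loop that runs continuously, even a few hundred milliseconds per file adds up — which is why moving expensive work out of module level (or caching generation results) is a common fix.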

Attendees leave with a practical framework to profile existing deployments, isolate bottlenecks, optimise performance, reduce recurring issues, and approach upgrades with confidence.

Pankaj Koti

Astronomer, Software Engineer

Vara Prasad Regani

Astronomer, Senior Airflow Reliability Engineer