This talk covers migrating a production Airflow platform that orchestrates a large VM fleet: provisioning, OS patching, and decommissioning at high concurrency. This is not a data pipeline; it is infrastructure operations at fleet scale.

We’ll share workflow patterns that make fleet-scale orchestration possible in Airflow, then cover how we moved from an Airflow 2 monolith (all components on every node, with fixed worker counts) to Airflow 3 with independently scalable services, each with its own release cycle.

We’ll dig into a silent breaking change in Airflow 3’s XCom behavior: xcom_pull(key=…) without task_ids no longer searches upstream tasks and returns None with no warning. We’ll present three iterations of solving this, from O(n) DAG traversal to a custom XCom backend that restores Airflow 2 semantics with zero DAG code changes, and the design tradeoffs at each stage.

Attendees will learn how Airflow powers infrastructure operations beyond data pipelines, how Airflow 3’s XCom change silently breaks Airflow 2 workflows, three approaches to the same migration problem, and lessons from running both versions in parallel.
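To make the XCom change described above concrete, here is a minimal toy model in plain Python. It does not use Airflow's API: the store, the `UPSTREAM` map, the task names, and both function names are hypothetical stand-ins chosen for illustration. It contrasts a lookup that only checks the current task when task_ids is omitted with a fallback that walks upstream tasks, roughly in the spirit of the O(n) traversal approach mentioned in the abstract. The speaker's actual implementations may differ.

```python
from typing import Any, Optional

# Toy stand-in for XCom storage: (task_id, key) -> value. Not Airflow's real backend.
XComStore = dict[tuple[str, str], Any]

# Hypothetical DAG: each task maps to its direct upstream tasks.
UPSTREAM: dict[str, list[str]] = {
    "provision": [],
    "patch": ["provision"],
    "verify": ["patch"],
}


def pull_current_task_only(
    store: XComStore, current_task: str, key: str, task_ids: Optional[str] = None
) -> Any:
    """Models the behavior the abstract describes: with no task_ids,
    only the current task is checked, so a missing value yields None silently."""
    target = task_ids if task_ids is not None else current_task
    return store.get((target, key))


def pull_with_upstream_search(
    store: XComStore, current_task: str, key: str, task_ids: Optional[str] = None
) -> Any:
    """Fallback sketch: with no task_ids, walk upstream tasks until the key is found.
    Worst case visits every upstream task once, hence O(n) in the DAG size."""
    if task_ids is not None:
        return store.get((task_ids, key))
    seen: set[str] = set()
    stack = list(UPSTREAM[current_task])
    while stack:
        task = stack.pop()
        if task in seen:
            continue
        seen.add(task)
        if (task, key) in store:
            return store[(task, key)]
        stack.extend(UPSTREAM[task])
    return None


# Usage: "provision" pushed a value; "verify" pulls it without naming a task.
store: XComStore = {("provision", "vm_id"): "vm-123"}
silent_miss = pull_current_task_only(store, "verify", "vm_id")
recovered = pull_with_upstream_search(store, "verify", "vm_id")
```

In this sketch the first call comes back empty without any error, which is the kind of silent failure the abstract warns about, while the traversal recovers the value at the cost of visiting upstream tasks on every pull.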
Rumeysa Ozaydin
SWE@Bloomberg