We built a centralized Gateway that sits in front of our entire scheduling fleet and solves three problems no single-cluster Airflow setup ever faces.
Composite Routing — Workflows are bound to clusters either by a tag or by their workspace.
Global Concurrency Control — Each cluster enforces its own Airflow pool locally, unaware of what the other five are running. Shared downstream systems — rate-limited APIs, licensed compute engines — can be overwhelmed even when every individual pool looks healthy. The Gateway acts as a platform-wide slot broker: operators acquire a slot before doing real work. A built-in heartbeat scheduler reconciles stale slots against each cluster’s REST API, handling crashes and OOM kills transparently.
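A minimal in-memory sketch of the slot-broker idea, under stated assumptions. The class name `SlotBroker`, the `TaskKey` tuple, and the `is_alive` callback are hypothetical and not taken from the Gateway itself. In a real deployment `is_alive` would presumably ask the owning cluster's REST API whether the task is still running; here it is a plain function so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set, Tuple

# Hypothetical holder identity: (cluster, dag_id, run_id, task_id).
TaskKey = Tuple[str, str, str, str]


@dataclass
class SlotBroker:
    """Platform-wide slot accounting for shared downstream resources."""

    limits: Dict[str, int]                 # resource name -> max concurrent holders
    is_alive: Callable[[TaskKey], bool]    # assumption: stands in for a REST lookup
    held: Dict[str, Set[TaskKey]] = field(default_factory=dict)

    def acquire(self, resource: str, holder: TaskKey) -> bool:
        """Grant a slot if the fleet-wide limit allows it."""
        holders = self.held.setdefault(resource, set())
        if holder in holders:
            return True                    # re-acquire by the same task is a no-op
        if len(holders) >= self.limits[resource]:
            return False                   # caller should wait and retry
        holders.add(holder)
        return True

    def release(self, resource: str, holder: TaskKey) -> None:
        """Give the slot back after the real work finishes."""
        self.held.get(resource, set()).discard(holder)

    def reconcile(self) -> int:
        """Heartbeat pass: drop slots whose holder is no longer running."""
        reclaimed = 0
        for holders in self.held.values():
            for holder in list(holders):
                if not self.is_alive(holder):
                    holders.discard(holder)
                    reclaimed += 1
        return reclaimed


if __name__ == "__main__":
    running = {("cluster-a", "etl", "run-1", "load")}
    broker = SlotBroker(limits={"vendor_api": 1}, is_alive=lambda k: k in running)
    first = ("cluster-a", "etl", "run-1", "load")
    second = ("cluster-b", "ml", "run-9", "score")
    print(broker.acquire("vendor_api", first))    # slot granted
    print(broker.acquire("vendor_api", second))   # limit reached on another cluster
    running.clear()                               # simulate an OOM kill of the holder
    print(broker.reconcile(), broker.acquire("vendor_api", second))
```

The reconcile pass is what lets a crashed task's slot return to circulation without the task ever calling `release`; how often it runs and how liveness is determined are left open here.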
Transparent Version Upgrade — Each cluster carries version tags. During an Airflow upgrade, re-tag the routing rules so that new submissions go to the higher-version cluster while existing runs drain on the old cluster undisturbed. Once it has drained, upgrade the old cluster and rejoin it to the fleet. No maintenance window is needed.
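Routing by tag or workspace and the drain-then-upgrade flow can be sketched together. Everything below is illustrative: the `Gateway` class, its rule dictionaries, the tag strings, and the choice that a workflow tag takes precedence over the workspace are assumptions made for the example, not details of the actual system.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Gateway:
    """Toy router: rules map a workflow tag or workspace to a cluster tag."""

    cluster_tags: Dict[str, Set[str]]        # cluster name -> tags it carries
    tag_rules: Dict[str, str]                # workflow tag -> required cluster tag
    workspace_rules: Dict[str, str]          # workspace -> required cluster tag
    run_cluster: Dict[str, str] = field(default_factory=dict)  # run_id -> cluster

    def route(self, workspace: str, workflow_tag: Optional[str] = None) -> str:
        # Assumed precedence: an explicit workflow tag wins over the workspace.
        required = self.tag_rules.get(workflow_tag) if workflow_tag else None
        if required is None:
            required = self.workspace_rules.get(workspace)
        if required is None:
            raise LookupError(f"no routing rule for workspace {workspace!r}")
        for name in sorted(self.cluster_tags):
            if required in self.cluster_tags[name]:
                return name
        raise LookupError(f"no cluster carries tag {required!r}")

    def submit(self, run_id: str, workspace: str,
               workflow_tag: Optional[str] = None) -> str:
        """Pin the run to whichever cluster the rules select right now."""
        cluster = self.route(workspace, workflow_tag)
        self.run_cluster[run_id] = cluster
        return cluster

    def finish(self, run_id: str) -> None:
        self.run_cluster.pop(run_id, None)

    def is_drained(self, cluster: str) -> bool:
        """True once no in-flight run is pinned to the cluster."""
        return cluster not in self.run_cluster.values()


if __name__ == "__main__":
    gw = Gateway(
        cluster_tags={"cluster-old": {"version:old"}, "cluster-new": {"version:new"}},
        tag_rules={},
        workspace_rules={"analytics": "version:old"},
    )
    print(gw.submit("run-1", "analytics"))           # lands on the old cluster
    gw.workspace_rules["analytics"] = "version:new"  # re-tag the routing rule
    print(gw.submit("run-2", "analytics"))           # new submissions move over
    print(gw.is_drained("cluster-old"))              # run-1 is still in flight
    gw.finish("run-1")
    print(gw.is_drained("cluster-old"))              # safe to upgrade and rejoin
```

Because a run is pinned at submit time, changing a rule only affects later submissions, which is the property that lets the old cluster drain without interruption.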
Takeaway: a thin routing layer makes your Airflow fleet elastic and upgradable without touching the scheduler or any pipeline code.
Huanjie Guo
Sr Software Engineer at eBay