Stabilizing LinkedIn Continuous Deployment on Airflow

By Wensi Hu and Pooja Pal

Last year, we showed how LinkedIn’s continuous deployment (LCD) runs on Apache Airflow to orchestrate safe, repeatable releases across thousands of services—powering everyday deployments for 10,000+ engineers.

This year, we’ll dive into the hard‑won patterns that keep those deployments stable at scale: preserving DAG consistency during live updates, routing across multiple clusters for graceful failover, enforcing high‑availability (HA) guardrails on the control plane, and using dynamic task mapping to deliver faster rollbacks and reduce deployment overhead. You’ll see how we abstract Airflow for a cleaner user experience, what actually made task launches faster, and which portable observability practices cut on‑call toil.

Key takeaways:

Abstracting Airflow UX: define and update workflows without Airflow internals
Multi‑cluster routing and failover: keep pipelines running during degradation
DAG consistency without per‑run versioning: controlled ingestion and safe re‑ingestion
Dynamic task mapping: faster rollbacks and practical rerun strategies
Observability: health metrics, alerts, SLOs, and incident playbooks

Wensi Hu

Senior Software Engineer @ LinkedIn

Pooja Pal

Software Engineer @ LinkedIn