Last year, we showed how LinkedIn’s continuous deployment (LCD) runs on Apache Airflow to orchestrate safe, repeatable releases across thousands of services—powering everyday deployments for 10,000+ engineers.

This year, we’ll dive into the hard‑won patterns that keep those deployments stable at scale: preserving DAG consistency during live updates; routing across multiple clusters for graceful failover; enforcing high‑availability (HA) guardrails on the control plane; and using dynamic task mapping to speed up rollbacks and reduce deployment overhead. You’ll see how we abstract Airflow for a cleaner user experience, which changes did the most to cut task launch time, and the portable observability practices that reduce on‑call toil.

Key takeaways:

  • Abstracting Airflow UX: define and update workflows without Airflow internals
  • Multi‑cluster routing and failover: keep pipelines running during degradation
  • DAG consistency without per‑run versioning: controlled ingestion and safe re‑ingestion
  • Dynamic task mapping: faster rollbacks and practical rerun strategies
  • Observability: health metrics, alerts, SLOs, and incident playbooks
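
To make the multi‑cluster routing and failover takeaway concrete, here is a minimal sketch in plain Python of health‑based routing. It is not LinkedIn's implementation and it does not use any Airflow API; the `Cluster` type, its fields, and `pick_cluster` are hypothetical names chosen for illustration, and the health flag is assumed to come from an external check.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    name: str       # hypothetical cluster identifier
    healthy: bool   # assumed to be set by an external health check
    priority: int   # lower value means more preferred


def pick_cluster(clusters: List[Cluster]) -> Optional[Cluster]:
    """Return the most preferred healthy cluster, or None if none is healthy."""
    healthy = [c for c in clusters if c.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda c: c.priority)


# Example: the primary is degraded, so new runs are routed to the secondary.
clusters = [
    Cluster("primary", healthy=False, priority=0),
    Cluster("secondary", healthy=True, priority=1),
]
target = pick_cluster(clusters)
print(target.name if target else "no healthy cluster")
```

A real router would also have to decide what happens to runs already in flight on the degraded cluster, which this sketch leaves out.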

Wensi Hu

Senior Software Engineer

Pooja Pal

Software Engineer - Systems & Infrastructure