Designing Self-Healing Airflow Platforms: Autonomous DAG Recovery at Scale

By Kumuda Sreenivasa Sandeep Bommisetti

Most Airflow failures are still handled manually — retries, Slack alerts, and late-night debugging. This talk shows how to design Airflow as a self-healing platform that detects problems early, limits blast radius, and automatically recovers. We’ll cover practical patterns for DAG, schema, and dependency-drift detection; safe, selective backfills; predictive failure modeling using metadata; lineage-aware rollbacks; and canary deployment for DAGs. You’ll learn how to isolate unstable workloads before they impact others and how to turn Airflow into an intelligent control plane — not just a scheduler.

Kumuda Sreenivasa

Sr Data Architect ,ATC Drivetrain Founder ,Unimonk & GoIcure

Sandeep Bommisetti

Cybersecurity Leader