Data pipelines often fail due to upstream issues, infrastructure instability, or unexpected spikes in data volume. While Apache Airflow provides retries and alerts, many failures still require manual intervention. As data platforms scale, this reactive approach becomes difficult to manage.

In this session, we will explore how teams can design self-healing data pipelines using Airflow combined with AI-driven insights. We will look at common pipeline failure patterns and at how anomaly detection on metrics such as task runtimes, retry counts, and changes in data volume can help identify issues early. By combining intelligent monitoring with automated recovery strategies, engineers can reduce downtime and operational overhead while building more resilient data workflows.
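To make the anomaly detection idea concrete, here is a minimal sketch of flagging an unusual task runtime with a modified z-score (median and median absolute deviation). It is an illustration under stated assumptions, not the method presented in the session: the function name `is_anomalous`, the threshold of 3.5, the minimum history length of 5, and the sample runtimes are all hypothetical, and collecting the runtimes from Airflow is left out.

```python
from statistics import median


def is_anomalous(history, latest, threshold=3.5, min_points=5):
    """Return True if `latest` deviates strongly from `history`.

    Uses the modified z-score: 0.6745 * (x - median) / MAD.
    `threshold` and `min_points` are illustrative defaults (assumptions).
    """
    if len(history) < min_points:
        # Too little history to judge; do not flag anything.
        return False
    med = median(history)
    mad = median([abs(v - med) for v in history])
    if mad == 0:
        # All historical values are effectively identical,
        # so any different value counts as a deviation.
        return latest != med
    score = 0.6745 * (latest - med) / mad
    return abs(score) > threshold


# Hypothetical runtimes (in seconds) of one task over recent runs.
recent_runtimes = [120, 118, 125, 122, 119, 121, 123]

normal_run = is_anomalous(recent_runtimes, 124)
slow_run = is_anomalous(recent_runtimes, 600)
```

The same check could be applied to retry counts or row counts per run. A median-based score is used here because a single earlier outlier distorts it less than a mean and standard deviation would; a flagged run could then feed an alert or an automated recovery step.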

Naga Durga Rao Dindi

Data Engineer at Citi