Data Pipeline HealthCheck for Correctness, Performance, and Cost Efficiency

Presented at Airflow Summit 2021

The number of mission-critical data pipelines that leaders of data products are responsible for is growing rapidly. When more than 200 leaders of data products from various industries were asked “Are your data pipelines healthy?”, the answers ranged from “unfortunately, no” to “they are mostly fine, but I am always afraid that something or the other will cause a pipeline to break”.

This talk presents the concept of Pipeline HealthCheck (PHC), which gives leaders of data products high confidence in the correctness, performance, and cost efficiency of their data pipelines. More importantly, PHC gives these leaders, along with their development and operations teams, high confidence that they can quickly detect, troubleshoot, and fix the problems that make data pipelines unhealthy. The talk also includes a demo of how PHC helps handle common pipeline problems such as incorrect results, missed SLAs, and overshot cost budgets.