At Stripe, we process petabytes of data daily across thousands of pipelines powering financial reporting, fraud detection, and merchant analytics. As our data estate grew, so did the complexity of authoring, scheduling, and operating these pipelines. Engineers spent more time wrangling Airflow DAG boilerplate and managing dependencies than writing transformation logic.
To address this, we built a declarative platform that generates Airflow DAGs from YAML and SQL definitions. Authors specify what they want — source tables, SQL transformations, incremental mode, output schema — and the platform handles the rest: generating Airflow tasks, wiring upstream sensors, registering Iceberg tables, and configuring scheduling parameters. A key piece is an in-house dataset-to-task mapping service that resolves upstream dataset dependencies to their producing Airflow tasks. When an author declares an input dataset, the platform automatically looks up which task produces it and generates the appropriate sensor — no manual DAG cross-referencing required. This eliminates an entire class of misconfigured dependency bugs common in hand-wired Airflow deployments.
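The dependency-resolution step described above can be sketched in plain Python. Everything here is illustrative, not Stripe's actual API: the `CATALOG` dict stands in for the in-house dataset-to-task mapping service, and `resolve_sensors` emits the configuration an `ExternalTaskSensor`-style wait task would be built from.

```python
# Hypothetical sketch of dataset-to-task dependency resolution.
# All names (CATALOG, resolve_sensors, the spec layout) are assumptions
# for illustration, not the platform's real interfaces.

from dataclasses import dataclass

@dataclass(frozen=True)
class ProducerTask:
    dag_id: str
    task_id: str

# Stand-in for the dataset-to-task mapping service:
# dataset name -> the Airflow task that produces it.
CATALOG = {
    "payments.charges": ProducerTask("payments_etl", "write_charges"),
    "risk.fraud_scores": ProducerTask("risk_scoring", "publish_scores"),
}

def resolve_sensors(pipeline_spec: dict) -> list[dict]:
    """For each declared input dataset, look up its producing task and
    emit the config for a sensor that waits on that task."""
    sensors = []
    for dataset in pipeline_spec["inputs"]:
        producer = CATALOG.get(dataset)
        if producer is None:
            raise ValueError(f"no known producer for dataset {dataset!r}")
        sensors.append({
            "sensor_task_id": f"wait_for__{dataset.replace('.', '_')}",
            "external_dag_id": producer.dag_id,
            "external_task_id": producer.task_id,
        })
    return sensors

# A pipeline author declares only inputs; the sensors are derived.
spec = {
    "name": "merchant_daily_rollup",
    "inputs": ["payments.charges", "risk.fraud_scores"],
}
sensors = resolve_sensors(spec)
```

Because authors never name the upstream DAG or task directly, a rename of the producing task is absorbed by updating the mapping service, which is what removes the class of misconfigured-dependency bugs mentioned above.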
Jiayu Yi
Senior Software Engineer, Stripe