These are the confirmed sessions for Airflow Summit 2025.

5 Simple Strategies To Enhance Your DAGs For Data Processing

by William Orgertrice

Want to take your DAGs in Apache Airflow to the next level? Join this insightful session where we’ll uncover 5 transformative strategies to enhance your data workflows. Whether you’re a data engineering pro or just getting started, this presentation is packed with practical tips and actionable insights that you can apply right away.

We’ll dive into the magic of using powerful libraries like Pandas, share techniques to trim down data volumes for faster processing, and highlight the importance of modularizing your code for easier maintenance. Plus, you’ll discover efficient ways to monitor and debug your DAGs, and how to make the most of Airflow’s built-in features.
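
To make a couple of these ideas concrete, here is a minimal sketch (not from the talk) of a modular TaskFlow DAG that uses Pandas and trims the data it passes between tasks; the file path and column names are invented, and the imports follow the Airflow 2 style.

  from datetime import datetime

  import pandas as pd
  from airflow.decorators import dag, task


  @dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False)
  def trimmed_processing():
      @task
      def extract() -> str:
          # Read only the columns we need, keeping memory use and payloads small.
          df = pd.read_csv("/data/events.csv", usecols=["user_id", "amount"])
          out = "/data/events_trimmed.parquet"
          df.to_parquet(out)
          return out

      @task
      def aggregate(path: str) -> None:
          # A separate, reusable step: easier to test and maintain than one big task.
          df = pd.read_parquet(path)
          print(df.groupby("user_id")["amount"].sum())

      aggregate(extract())


  trimmed_processing()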

Agentic AI Automating Semantic Layer Updates with Airflow 3

by Scott Mitchell

In today’s dynamic data environments, tables and schemas are constantly evolving and keeping semantic layers up to date has become a critical operational challenge. Manual updates don’t scale, and delays can quickly lead to broken dashboards, failed pipelines, and lost trust.

We’ll show how to harness Apache Airflow 3 and its new event-driven scheduling capabilities to automate the entire lifecycle: detecting table and schema changes in real time, parsing and interpreting those changes, and shifting left the updating of semantic models across dbt, Looker, or custom metadata layers. AI agents add intelligence and automation that rationalize schema diffs, assess the impact of changes, and propose targeted updates to semantic layers, reducing manual work and minimizing the risk of errors.

Airflow & Bigtop: Modernize and integrate time-proven OSS stack with Apache Airflow

by Kengo Seki

Apache Bigtop is a time-proven open-source software stack for building data platforms, built around the Hadoop and Spark ecosystem since 2011. Its software composition has changed over that long period, and its job scheduler was recently removed, mainly due to inactivity in its development. The speaker believes that Airflow fits this gap perfectly and is proposing to incorporate it into the Bigtop stack. This presentation will introduce how easily users can build a data platform with Bigtop including Airflow, and how Airflow can integrate that software through its wide range of providers and enterprise-ready features such as Kerberos support.

Airflow 3 - An Open Heart Surgery

by M Waqas Shahid

Curious how code truly flows inside Airflow? Join me for a unique visualisation journey into Airflow’s inner workings (the first of its kind) — the code blocks and modules called as certain operations run.

A walkthrough that unveils task execution, observability, and debugging like never before. This session will demystify Airflow’s architecture, showcasing real-time task flows and the heartbeat of pipelines in action.

Perfect for engineers looking to optimize workflows, troubleshoot efficiently, and gain a new perspective on Airflow’s powerful core. See Airflow running live with detailed insights and unlock the secrets to better pipeline management!

Airflow 3 UI is not enough? Add a Plugin!

by Jens Scheffler, Brent Bovenzi & Pierre Jeambrun

In Airflow 2, a plugin mechanism made it possible to extend the UI with new functions as well as to add hooks and other features.

Because Airflow 3 rewrote the UI, the old plugins no longer work in all cases. Airflow 3.1 now provides a revamped option to extend the UI with a new plugin schema based on native React components and embedded iframes, following the AIP-68 definitions.

In this session we will provide an overview of the capabilities and a short introduction to rolling your own.

Airflow 3’s Trigger UI: Evolution of Params

by Shubham Raj & Jens Scheffler

Are you looking to build slick, dynamic trigger forms for your DAGs? It all starts with mastering params.

Params are the gold standard for adding execution options to your DAGs, allowing you to create dynamic, user-friendly trigger forms with descriptions, validation, and now, with Airflow 3, bidirectional support for conf data!

In this talk, we’ll break down how to use params effectively, share best practices, and explore what’s new since the 2023 Airflow Summit talk (https://airflowsummit.org/sessions/2023/flexible-dag-trigger-forms-aip-50/). If you want to make DAG execution more flexible, intuitive, and powerful, this session is a must-attend!
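
For context, a trigger form typically grows out of a params block like the following rough sketch; the param names and values are invented, and the Param import path may differ slightly between Airflow 2 and Airflow 3.

  from datetime import datetime

  from airflow.decorators import dag, task
  from airflow.models.param import Param  # Airflow 3 may also expose Param via airflow.sdk


  @dag(
      schedule=None,
      start_date=datetime(2025, 1, 1),
      catchup=False,
      params={
          "environment": Param("dev", enum=["dev", "staging", "prod"], description="Target environment"),
          "row_limit": Param(1000, type="integer", minimum=1, description="Maximum rows to process"),
      },
  )
  def triggerable_pipeline():
      @task
      def report(**context):
          # Values chosen in the trigger form arrive through the rendered params.
          print(context["params"]["environment"], context["params"]["row_limit"])

      report()


  triggerable_pipeline()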

Airflow as an AI Agent’s Toolkit: Unlocking 1000+ Integrations with MCP

by Kaxil Naik

AI agents transform conversational prompts into actionable automation, provided they have reliable access to essential tools like data warehouses, cloud storage, and APIs.

Now imagine exposing Airflow’s rich integration layer directly to AI agents via the emerging Model Context Protocol (MCP). This isn’t just gluing AI into Airflow; it’s turning Airflow into a structured execution layer for adaptive, agentic logic with full observability, retries, and audit trails built in.

We’ll demonstrate a real-world fraud detection pipeline powered by agents: suspicious transactions are analyzed, enriched dynamically with external customer data via MCP, and escalated based on validated, structured outputs. Every prompt, decision, and action is auditable and compliant.

Airflow at Zoox: A journey to orchestrate heterogeneous workflows

by Justin Wang & Saurabh Gupta

The workflow orchestration team at Zoox aims to build a solution for orchestrating heterogeneous workflows encompassing data, ML, and QA pipelines. We have encountered two primary challenges: first, the steep learning curve for new Airflow users and the need for a user-friendly yet scalable development process; second, integrating and migrating existing pipelines with established solutions.

This presentation will detail our approach, as a small team at Zoox, to address these challenges. The discussion will cover the scope and scale of Airflow within Zoox, including current applications and future directions. Furthermore, we will share our strategies for simplifying the Airflow DAG creation process and enhancing user experience. Finally, we will present a case study illustrating the onboarding of a heterogeneous workflow across Databricks, AWS, and a Zoox in-house platform to manage both on-prem and cloud services.

Airflow That Remembers: The Dag Versioning Era is here!

by Jed Cunningham & Ephraim Anierobi

Airflow 3 introduced a game-changing feature: Dag versioning.

Gone are the days of “latest only” Dags and confusing, inconsistent UI views when pipelines change mid-flight. This talk covers:

  • Visualizing Dag changes over time in the UI
  • How Dag code is versioned and can be grabbed from external sources
  • Executing a whole Dag run against the same code version
  • Dynamic Dags? Where do they fit in?!

You’ll see real-world scenarios, UI demos, and learn how these advancements will help avoid “Airflow amnesia”.

Allegro's Airflow Journey: From On-Prem to Cloud Orchestration at Scale

by Piotr Dziuba & Marek Gawinski

This session will detail the Apache Airflow journey of Allegro, a leading e-commerce company in Poland. It will chart our evolution from a custom, on-premises Airflow-as-a-Service solution through a significant expansion to over 300 Cloud Composer instances in Google Cloud, culminating in Airflow becoming the core of our data processing. We orchestrate over 64,000 regular tasks spanning over 6,000 active DAGs on more than 200 Airflow instances, from feeding business-supporting dashboards to managing main data marts, handling ML pipelines, and more.

Applying Airflow to drive the digital workforce in the Enterprise

by Shoubhik Bose

Red Hat’s unified data and AI platform relies on Apache Airflow for orchestration, alongside Snowflake, Fivetran, and Atlan. The platform prioritizes building a dependable data foundation, recognizing that effective AI depends on quality data. Airflow was selected for its predictability, extensive connectivity, reliability, and scalability.

The platform now supports business analytics, transitioning from ETL to ELT processes. This has resulted in a remarkable improvement in how we make data available for business decisions.

Assets: Past, Present, Future

by Tzu-ping Chung

Airflow’s Asset concept originated from data lineage and evolved into its current role as a scheduling concept (data-aware, event-based scheduling). It has even more potential. This talk discusses how other parts of Airflow, namely Connection and Object Storage, contain concepts related to Asset, and how we can tie them all together to make task authoring flow even more naturally.

Planned topics:

  • Brief history on Asset and related constructs.
  • Current state of Asset concepts.
  • Inlets, anyone?
  • Finding inspiration from Pydantic et al.
  • My next step for Asset.
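
For readers new to the concept, a data-aware schedule looks roughly like this sketch of a producer/consumer pair; the asset URI is invented, and the airflow.sdk import path assumes Airflow 3 (in Airflow 2 the equivalent class was Dataset).

  from datetime import datetime

  from airflow.sdk import Asset, dag, task

  raw_orders = Asset("s3://example-bucket/raw/orders.parquet")  # invented URI


  @dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
  def produce_orders():
      @task(outlets=[raw_orders])
      def write_orders():
          ...  # writing the file marks the Asset as updated

      write_orders()


  @dag(schedule=[raw_orders], start_date=datetime(2025, 1, 1), catchup=False)
  def consume_orders():
      @task
      def transform():
          ...  # runs whenever raw_orders is updated by the producer

      transform()


  produce_orders()
  consume_orders()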

Automating Business Intelligence with Airflow: A Practical Guide

by Chinni Krishna Abburi

In today’s fast-paced business world, timely and reliable insights are crucial — but manual BI workflows can’t keep up. This session offers a practical guide to automating business intelligence processes using Apache Airflow. We’ll walk through real-world examples of automating data extraction, transformation, dashboard refreshes, and report distribution. Learn how to design DAGs that align with business SLAs, trigger workflows based on events, integrate with popular BI tools like Tableau and Power BI, and implement alerting and failure recovery mechanisms. Whether you’re new to Airflow or looking to scale your BI operations, this session will equip you with actionable strategies to save time, reduce errors, and supercharge your organization’s decision-making capabilities.

Automating Threat Intelligence with Airflow, XDR, and LLMs using the MITRE ATT&CK Framework

by Karan Alang

Security teams often face alert fatigue from massive volumes of raw log data. This session demonstrates how to combine Apache Airflow, Wazuh, and LLMs to build automated pipelines for smarter threat triage—grounded in the MITRE ATT&CK framework.

We’ll explore how Airflow can orchestrate a full workflow: ingesting Wazuh alerts, using LLMs to summarize log events, matching behavior to ATT&CK tactics and techniques, and generating enriched incident summaries. With AI-powered interpretation layered on top of structured threat intelligence, teams can reduce manual effort while increasing context and clarity.

AWS Lambda Executor: The Speed of Local Execution with the Advantages of Remote

by Niko Oliveira

Apache Airflow’s executor landscape has traditionally presented users with a clear trade-off: choose either the speed of local execution or the scalability, isolation and configurability of remote execution. The AWS Lambda Executor introduces a new paradigm that bridges this gap, offering near-local execution speeds with the benefits of remote containerization.

This talk will begin with a brief overview of Airflow’s executors, how they work and what they are responsible for, highlighting the compromises between different executors. We will explore the emerging niche for fast, yet remote execution and demonstrate how the AWS Lambda Executor fills this space. We will also address practical considerations when using such an executor, such as working within Lambda’s 15 minute execution limit, and how to mitigate this using multi-executor configuration.
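
As a rough illustration of the multi-executor setup mentioned above (not the speaker’s configuration), Airflow’s hybrid executor support lets you register several executors and pin individual tasks to one of them; the Lambda executor’s module path below is an assumption, so verify it against the Amazon provider documentation.

  # Hybrid executors: list more than one executor (the first is the default),
  # then pin individual tasks to a specific one, e.g. via
  #   AIRFLOW__CORE__EXECUTOR=CeleryExecutor,airflow.providers.amazon.aws.executors.aws_lambda.AwsLambdaExecutor
  # The Lambda executor path above is an assumption; check the provider docs.
  from datetime import datetime

  from airflow.decorators import dag, task


  @dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False)
  def hybrid_execution():
      @task(executor="airflow.providers.amazon.aws.executors.aws_lambda.AwsLambdaExecutor")
      def short_and_bursty():
          ...  # a quick task that fits comfortably under Lambda's 15-minute limit

      @task  # no executor given, so this falls back to the default (Celery) executor
      def long_running():
          ...

      short_and_bursty() >> long_running()


  hybrid_execution()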

Behind the Scenes: How We Tested Airflow 3 for Stability and Reliability

by Rahul Vats & Phani Kumar

Ensuring the stability of a major release like Airflow 3 required extensive testing across multiple dimensions. In this session, we will dive into the testing strategies and validation techniques used to guarantee a smooth rollout. From unit and integration tests to real-world DAG validations, this talk will cover the challenges faced, key learnings, and best practices for testing Airflow. Whether you’re a contributor, QA engineer, or Airflow user preparing for migration, this session will offer valuable takeaways to improve your own testing approach.

Benchmarking the Performance of Dynamically Generated DAGs

by Tatiana Al-Chueyr Martins & Rahul Vats

As teams scale their Airflow workflows, a common question is: “My DAG has 5,000 tasks—how long will it take to run in Airflow?”

Beyond execution time, users often face challenges with dynamically generated DAGs, such as:

  • Delayed visualization in the Airflow UI after deployment.
  • High resource consumption, leading to Kubernetes pod evictions and out-of-memory errors.

While estimating the resource utilization in a distributed data platform is complex, benchmarking can provide crucial insights.
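
For reference, the kind of dynamically generated DAG under discussion usually comes from a factory loop like this toy sketch; the counts are deliberately tiny and the naming is invented.

  from datetime import datetime

  from airflow.decorators import dag, task

  # One Python file fanning out into many DAGs; the counts here are tiny on purpose,
  # while the benchmarked scenarios reach thousands of tasks.
  for i in range(3):

      @dag(
          dag_id=f"generated_dag_{i}",
          schedule=None,
          start_date=datetime(2025, 1, 1),
          catchup=False,
      )
      def generated():
          @task
          def work(n: int):
              print(f"processing shard {n}")

          work.expand(n=list(range(10)))  # dynamic task mapping adds further fan-out

      globals()[f"generated_dag_{i}"] = generated()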

Beyond Execution Dates: Empowering inference execution and hyper-parameter tuning with Airflow 3

by Ankit Chaurasia & Rahul Vats

In legacy Airflow 2.x, each DAG run was tied to a unique “execution_date.” By removing this requirement, Airflow can now directly support a variety of new use cases, such as model training and generative AI inference, without the need for hacks and workarounds typically used by machine learning and AI engineers.

In this talk, we will delve into the significant advancements in Airflow 3 that enable GenAI and MLOps use cases, particularly through the changes outlined in AIP 83. We’ll cover key changes like the renaming of “execution_date” to “logical_date,” along with the allowance for it to be null, and the introduction of the new “run_after” field which provides a more meaningful mechanism for scheduling and sorting. Furthermore, we’ll discuss how by removing the uniqueness constraint, Airflow 3 enables multiple parallel runs, empowering diverse triggering mechanisms and easing backfill logic with a real-world demo.

Beyond Logs: Unlocking Airflow 3.0 Observability with OpenTelemetry Traces

by Christos Bisias

Using OpenTelemetry tracing, users can gain full visibility into tasks and calls to outside services. This is an increasingly important skill, especially as tasks in an Airflow DAG involve multiple complex computations which take hours or days to complete. Airflow lets users easily monitor how long entire DAG runs or individual tasks take, but offers little visibility into the internal actions behind them. OpenTelemetry gives users much more operational awareness and metrics they can use to improve operations.

Beyond the bundle - evolving DAG parsing in Airflow 3

by Igor Kholopov

Airflow 3 made great strides with AIP-66, introducing the concept of a DAG bundle. This successfully challenged one of the fundamental architectural limitations of the original Airflow design, how DAGs are deployed, bringing structure to something that often had to be operated as a pile of files. However, we believe this should by no means be the end of the road when it comes to making DAG management easier, authoring more accessible to a broader audience, and integration with data agents smoother. We believe the next step in Airflow’s evolution is a native option to break away from the necessity of having a real file in the file systems of multiple components to get your DAG up and running. This is what we hope to achieve as part of AIP-85, extendable DAG parsing control. In this talk I’d like to give a detailed overview of how we want to make it happen and show examples of the valuable integrations we hope to unblock with it.

Boosting dbt-core workflows performance with Airflow’s Deferrable capabilities

by Pankaj Koti, Tatiana Al-Chueyr Martins & Pankaj Singh

Efficiently handling long-running workflows is crucial for scaling modern data pipelines. Apache Airflow’s deferrable operators help offload tasks during idle periods — freeing worker slots while tracking progress.

This session explores how Cosmos 1.9 (https://github.com/astronomer/astronomer-cosmos) integrates Airflow’s deferrable capabilities to enhance orchestrating dbt (https://github.com/dbt-labs/dbt-core) in production, with insights from recent contributions that introduced this functionality.

Key takeaways:

  • Deferrable Operators: How they work and why they’re ideal for long-running dbt tasks.
  • Integrating with Cosmos: Refactoring and enhancements to enable deferrable behaviour across platforms.
  • Performance Gains: Resource savings and task throughput improvements from deferrable execution.
  • Challenges & Future Enhancements: Lessons learned, compatibility, and ideas for broader support.

Whether orchestrating dbt models on a cloud warehouse or managing large-scale transformations, this session offers practical strategies to reduce resource contention and boost pipeline performance.
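
For readers unfamiliar with the mechanism, deferrable operators in general follow the shape sketched below (this is not the Cosmos implementation): execute() hands off to an asynchronous trigger and frees the worker slot, then resumes in a callback when the trigger fires. The class and module names are hypothetical.

  import asyncio
  from typing import Any, AsyncIterator

  from airflow.models.baseoperator import BaseOperator
  from airflow.triggers.base import BaseTrigger, TriggerEvent


  class WaitForJobTrigger(BaseTrigger):
      """Waits asynchronously in the triggerer process instead of blocking a worker."""

      def __init__(self, job_id: str):
          super().__init__()
          self.job_id = job_id

      def serialize(self) -> tuple[str, dict[str, Any]]:
          # The module path is hypothetical; it must point at wherever this class lives.
          return ("my_dags.triggers.WaitForJobTrigger", {"job_id": self.job_id})

      async def run(self) -> AsyncIterator[TriggerEvent]:
          while not await self._job_done():
              await asyncio.sleep(30)
          yield TriggerEvent({"job_id": self.job_id, "status": "done"})

      async def _job_done(self) -> bool:
          return True  # placeholder: poll the warehouse, dbt Cloud, etc.


  class WaitForJobOperator(BaseOperator):
      def __init__(self, job_id: str, **kwargs):
          super().__init__(**kwargs)
          self.job_id = job_id

      def execute(self, context):
          # Hands control to the trigger; no worker slot is held while waiting.
          self.defer(trigger=WaitForJobTrigger(self.job_id), method_name="execute_complete")

      def execute_complete(self, context, event):
          self.log.info("Job %s finished with status %s", event["job_id"], event["status"])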

Breaking News with Data Pipelines: How Airflow and AI Power Investigative Journalism

by Zdravko Hvarlingov & Ivan Nikolov

Investigative journalism often relies on uncovering hidden patterns in vast amounts of unstructured and semi-structured data. At the FT, we leverage Airflow to orchestrate AI-powered pipelines that transform complex, fragmented datasets into structured insights. Our Storyfinding team works closely with journalists to automate tedious data processing, enabling them to tell stories that might otherwise go untold.

This talk will explore how we use Airflow to process and analyze text, documents, and other difficult-to-structure data sources combining AI, machine learning, and advanced computational techniques to extract meaningful entities, relationships, and patterns. We’ll also showcase our connection analysis workflows, which link various datasets to reveal previously hidden chains of people and companies, a crucial capability for investigative reporting.

Bringing Apache Airflow to a Security-First Organization: A Battle Plan for Automation

by Oluwafemi Olawoyin

What happens when you introduce Apache Airflow in an environment where every change must pass compliance gates, infrastructure is tightly controlled, and public cloud is off-limits?

This talk shares the journey of implementing Airflow within a high-security, regulation-heavy setting, navigating legacy systems, manual workflows, and cautious stakeholders.

What You’ll Learn:

  1. How Airflow was deployed on-premise without Docker or cloud dependencies
  2. How GitHub Actions were used to bridge Windows-based engineers with a Linux-hosted Airflow instance
  3. How we worked through security reviews and compliance requirements to gain production approval
  4. What worked, what didn’t, and lessons for teams facing similar constraints

Who Should Attend:

Building a Transparent Data Workflow with Airflow and Data Catalog

by John Robert

As modern data ecosystems grow in complexity, ensuring transparency, discoverability, and governance in data workflows becomes critical. Apache Airflow, a powerful workflow orchestration tool, enables data engineers to build scalable pipelines, but without proper visibility into data lineage, ownership, and quality, teams risk operating in a black box.

In this talk, we will explore how integrating Airflow with a data catalog can bring clarity and transparency to data workflows. We’ll discuss how metadata-driven orchestration enhances data governance, enables lineage tracking, and improves collaboration across teams. Through real-world use cases, we will demonstrate how Airflow can automate metadata collection, update data catalogs dynamically, and ensure data quality at every stage of the pipeline.

Building Airflow 3 setups resilient to zonal/regional down events, ready for Disaster Recovery event

by Khaled Hassan

Want to be resilient to any zonal/regional down events when building Airflow in a cloud environment? Unforeseen disruptions in cloud infrastructure, whether isolated to specific zones or impacting entire regions, pose a tangible threat to the continuous operation of critical data workflows managed by Airflow. These outages, though often technical in nature, translate directly into real-world consequences, potentially causing interruptions in essential services, delays in crucial information delivery, and ultimately impacting the reliability and efficiency of various operational processes that businesses and individuals depend upon daily. The inability to process data reliably due to infrastructure instability can cascade into tangible setbacks across diverse sectors, highlighting the urgent need for resilient and robust Airflow deployments.

Building an Airflow Center of Excellence: Lessons from the Frontlines

by Jonathan Leek & Michelle Winters

As organizations scale their data infrastructure, Apache Airflow becomes a mission-critical component for orchestrating workflows efficiently. But scaling Airflow successfully isn’t just about running pipelines—it’s about building a Center of Excellence (CoE) that empowers teams with the right strategy, best practices, and long-term enablement. Join Jon Leek and Michelle Winters as they share their experiences helping customers design and implement Airflow Centers of Excellence. They’ll walk through real-world challenges, best practices, and the structured approach Astronomer takes to ensure teams have the right plan, resources, and support to succeed. Whether you’re just starting with Airflow or looking to optimize and scale your workflows, this session will give you a proven framework to build a sustainable Airflow Center of Excellence within your organization. 🚀

Building an MLOps Platform for 300+ ML/DS Specialists on Top of Airflow

by Aleksandr Shirokov, Roman Khomenko & Tarasov Alexey

As your organization scales to 20+ data science teams and 300+ DS/ML/DE engineers, you face a critical challenge: how to build a secure, reliable, and scalable orchestration layer that supports both fast experimentation and stable production workflows. We chose Airflow — and didn’t regret it! But to make it truly work at our scale, we had to rethink its architecture from the ground up.

In this talk, we’ll share how we turned Airflow into a powerful MLOps platform through its core capability: running pipelines across multiple K8s GPU clusters from a single UI (!) using per-cluster worker pools. To support ease of use, we developed MLTool — our own library for fast and standardized DAG development, integrated Vault for secure secret management across teams, enabled real-time logging with S3 persistence and built a custom SparkSubmitOperator for Kerberos-authenticated Spark/Hadoop jobs in Kubernetes. We also streamlined the developer experience — users can generate a GitLab repo and deploy a versioned pipeline to prod in under 10 minutes!

Cloud Composer: Introduction to Advanced Features

by Eugene Kosteev

During this workshop you will learn about the latest features published in Cloud Composer, a managed service for Apache Airflow on Google Cloud Platform.

Common provider abstractions: Key for multi-cloud data handling

by Vikram Koka

Enterprises want the flexibility to operate across multiple clouds, whether to optimize costs, improve resiliency, avoid vendor lock-in, or ensure data sovereignty. But for developers, that flexibility usually comes at the cost of extra complexity and redundant code. The goal here is simple: write once, run anywhere, with minimum boilerplate.

In Apache Airflow, we’ve already begun tackling this problem with abstractions like Common-SQL, which lets you write database queries once and run them on 20+ databases, from Snowflake to Postgres to SQLite to SAP HANA. Similarly, Common-IO standardizes cloud blob storage interactions across all public clouds. With Airflow 3.0, we are pushing this further by introducing a Common Message Bus provider, an abstraction initially supporting Amazon SQS and expanding to Google Pub/Sub and Apache Kafka soon after. We expect additional implementations such as Amazon Kinesis and Managed Kafka over time.
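
As a small illustration of the write-once-run-anywhere idea (not taken from the talk), the same Common-SQL task definition can target different databases simply by swapping the connection id; the connection ids below are assumed to exist.

  from datetime import datetime

  from airflow import DAG
  from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

  with DAG("portable_sql_example", schedule=None, start_date=datetime(2025, 1, 1), catchup=False):
      for conn_id in ("postgres_default", "snowflake_default"):  # assumed connection ids
          SQLExecuteQueryOperator(
              task_id=f"daily_rowcount_{conn_id}",
              conn_id=conn_id,
              sql="SELECT COUNT(*) FROM orders WHERE order_date = '{{ ds }}'",
          )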

Creating DuoFactory: An Orchestration Ecosystem with Airflow

by Belle Romea

Duolingo has built an internal tool, DuoFactory, to orchestrate AI-generated content using Airflow. The tool has been used to generate example sentences per lesson, math exercises, and Duoradio lessons. The ecosystem is flexible enough for various company needs. Some of these use cases involve end-to-end generation, where one click of a button generates content in the app. We have also created a Workflow Builder to orchestrate and iterate on generative AI workflows by creating one-time DAG instances, with a UI easy enough for non-engineers to use.

DAGLint: Elevating Airflow DAG Quality Through Automated Linting

by Snir Israeli

Maintaining consistency, code quality, and best practices for writing Airflow DAGs between teams and individual developers can be a significant challenge. Trying to achieve it using manual code reviews is both time-consuming and error-prone.

To solve this at Next, we decided to build a custom, internally developed linting tool for Airflow DAGs to help us evaluate their quality and uniformity. We call it DAGLint.

In this talk I am going to share why we chose to implement it, how we built it, and how we use it to elevate our code quality and standards throughout the entire data engineering group.

Data Quality and Observability with Airflow

by Ipsa Trivedi & Chirag Tailor

Tekmetric is the largest cloud based auto shop management system in the United States. We process vast amounts of data from various integrations with internal and external systems. Data quality and governance are crucial for both our internal operations and the success of our customers.

We leverage multi-step data processing pipelines using AWS services and Airflow. While we utilize traditional data pipeline workflows to manage and move data, we go beyond standard orchestration. After data is processed, we apply tailored quality checks for schema validation, record completeness, freshness, duplication and more.

Deadline Alerts in Airflow 3.1

by Dennis Ferruzzi

Do you have a DAG that needs to be done by a certain time? Have you tried to use Airflow 2’s SLA feature and found it restrictive or complicated? You aren’t alone! Come learn about the all-new Deadline Alerts feature in Airflow 3.1, which replaces SLAs. We will discuss how Deadline Alerts work and how they improve on the retired SLA feature. Then we will look at some examples of workflows you can build with the new feature, including some of the callback options and how they work, and finally look ahead to some future use cases for Deadlines on Tasks and even Assets.

Designing Scalable Retrieval-Augmented Generation (RAG) Pipelines at SAP with Apache Airflow

by Sagar Sharma

At SAP Business AI, we’ve transformed Retrieval-Augmented Generation (RAG) pipelines into enterprise-grade powerhouses using Apache Airflow. Our Generative AI Foundations Team developed a cutting-edge system that effectively grounds Large Language Models (LLMs) with rich SAP enterprise data. Powering Joule for Consultants, our innovative AI copilot, this pipeline manages the seamless ingestion, sophisticated metadata enrichment, and efficient lifecycle management of over a million structured and unstructured documents. By leveraging Airflow’s Dynamic DAGs, TaskFlow API, XCom, and Kubernetes Event-Driven Autoscaling (KEDA), we achieved unprecedented scalability and flexibility. Join our session to discover actionable insights, innovative scaling strategies, and a forward-looking vision for Pipeline-as-a-Service, empowering seamless integration of customer-generated content into scalable AI workflows.

Do you trust Airflow with your money? (We do!)

by Nick Bilozerov, Daniel Melchor & Sabrina Liu

Airflow is wonderfully, frustratingly complex - and so is global finance! Stripe has very specific needs all over the planet, and we have customized Airflow to adapt to the variety and rigor that we need to grow the GDP of the internet.

In this talk, you’ll learn:

  • How we support independent DAG change management for over 500 different teams running over 150k tasks.

  • How we’ve customized Airflow’s Kubernetes integration to comply with Stripe’s unique compliance requirements.

Dynamic Data Pipelines with DBT and Airflow

by Miquel Angel Andreu Febrer

This session showcases Okta’s innovative approach to data pipeline orchestration with dbt and Airflow. We’ll show how we’ve implemented dynamically generated Airflow DAGs based on dbt’s dependency graph. This allows us to enforce strict data quality standards by automatically executing downstream model tests before upstream model deployments, effectively preventing error cascades. The entire CI/CD pipeline, from dbt model changes to production DAG deployment, is fully automated. The result? Accelerated development cycles, reduced operational overhead, and bulletproof data reliability.
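
A simplified sketch of the general pattern, not Okta’s implementation, might read the dependency graph from dbt’s manifest.json and mirror it as Airflow tasks; the manifest path is an assumption and the imports follow the Airflow 2 style.

  import json
  from datetime import datetime

  from airflow import DAG
  from airflow.operators.bash import BashOperator

  with open("/opt/dbt/target/manifest.json") as f:  # path is an assumption
      manifest = json.load(f)

  models = {k: v for k, v in manifest["nodes"].items() if v["resource_type"] == "model"}

  with DAG("dbt_from_manifest", schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False):
      tasks = {
          node_id: BashOperator(
              task_id=node["name"],
              bash_command=f"dbt run --select {node['name']}",
          )
          for node_id, node in models.items()
      }
      # Mirror the dependency graph recorded in each node's depends_on section.
      for node_id, node in models.items():
          for upstream in node["depends_on"]["nodes"]:
              if upstream in tasks:
                  tasks[upstream] >> tasks[node_id]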

EdgeExecutor / Edge Worker - The new option to run anywhere

by Jens Scheffler & Daniel Wolf

Airflow 3 extends the deployment options so you can run your workload anywhere. You don’t need to bring your data to Airflow; you can bring the execution to where it needs to be.

You can connect any cloud and on-prem location together and generate a hybrid workflow from one central Airflow instance. Only a HTTP connection is needed.

We will present the use cases and concepts of the Edge deployment and how it also works in a hybrid setup with Celery or other executors.

ELT, AI, and Elections: Leveraging Airflow and Machine Learning to Analyze Voting Behavior at INTRVL

by Kyle McCluskey

Discover how Apache Airflow powers scalable ELT pipelines, enabling seamless data ingestion, transformation, and machine learning-driven insights. This session will walk through:

Automating Data Ingestion: Using Airflow to orchestrate raw data ingestion from third-party sources into your data lake (S3, GCP), ensuring a steady pipeline of high-quality training and prediction data.

Optimizing Transformations with Serverless Computing: Offloading intensive transformations to serverless functions (GCP Cloud Run, AWS Lambda) and machine learning models (BigQuery ML, Sagemaker), integrating their outputs seamlessly into Airflow workflows.
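
As a hedged illustration of the serverless offload pattern (not INTRVL’s code), a DAG can hand a heavy transformation to a Lambda function via the Amazon provider; the function name and payload below are invented.

  import json
  from datetime import datetime

  from airflow import DAG
  from airflow.providers.amazon.aws.operators.lambda_function import LambdaInvokeFunctionOperator

  with DAG("serverless_transform", schedule="@hourly", start_date=datetime(2025, 1, 1), catchup=False):
      LambdaInvokeFunctionOperator(
          task_id="transform_raw_events",
          function_name="transform-raw-events",  # hypothetical Lambda function
          payload=json.dumps({"input_prefix": "s3://example-bucket/raw/{{ ds }}/"}),
      )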

Empowering Precision Healthcare with Apache Airflow-iKang Healthcare Group’s DataHub Journey

by Yuan Luo & Huiliang Zhang

iKang Healthcare Group, serving nearly 10 million patients annually, built a centralized healthcare data hub powered by Apache Airflow to support its large-scale, real-time clinical operations. The platform integrates batch and streaming data in a lakehouse architecture, orchestrating complex workflows from data ingestion (HL7/FHIR) to clinical decision support.

Healthcare data’s inherent complexity—spanning structured lab results to unstructured clinical notes—requires dynamic, reliable orchestration. iKang uses Airflow’s DAGs, extensibility, and workflow-as-code capabilities to address challenges like multi-system coordination, semantic data linking, and fault-tolerant automation.

Enabling SQL testing in Airflow workflows using Pydantic types

by Gurmeet Saran & Kushal Thakkar

This session explores how to bring unit testing to SQL pipelines using Airflow. I’ll walk through the development of a SQL testing library that allows isolated testing of SQL logic by injecting mock data into base tables. To support this, we built a type system for AWS Glue tables using Pydantic, enabling schema validation and mock data generation. Over time, this type system also powered production data quality checks via a custom Airflow operator. Learn how this approach improves reliability, accelerates development, and scales testing across data workflows.
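
To illustrate the general idea rather than the library from the talk, a Pydantic model can double as a table schema definition, a row validator, and a mock-data factory; the table and columns below are invented.

  from datetime import date
  from decimal import Decimal

  from pydantic import BaseModel


  class OrdersTable(BaseModel):
      """Schema for a hypothetical orders table; doubles as a validator and mock factory."""

      order_id: int
      customer_id: int
      order_date: date
      amount: Decimal


  def mock_rows(n: int = 3) -> list[dict]:
      # Deterministic mock data to inject into a base table before running the SQL under test.
      return [
          OrdersTable(
              order_id=i,
              customer_id=100 + i,
              order_date=date(2025, 1, 1),
              amount=Decimal("9.99"),
          ).model_dump()
          for i in range(n)
      ]


  print(mock_rows())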

Enhancing Airflow REST API: From Basic Integration to Enterprise Scale

by Vishal Vijayvargiya

Apache Airflow’s REST API has evolved to support diverse orchestration needs, with managed services like MWAA introducing custom enhancements. One such feature, InvokeRestApi, enables dynamic interactions with external services while maintaining Airflow’s core orchestration capabilities.

In this talk, we will explore the architectural design behind InvokeRestApi, detailing how it enhances API-driven workflows. Beyond the architecture, we’ll share key challenges and learnings from implementing and scaling Airflow’s REST API in production environments. Topics include authentication, performance considerations, error handling, and best practices for integrating external APIs efficiently.

Enhancing DAG Management with DMS: A Scalable Solution for Airflow

by Sungji Yang & DaeHoon Song

In this talk, we will introduce the DAG Management Service (DMS), developed to address critical challenges in managing Airflow clusters. With over 10,000 active DAGs, a single Airflow cluster faces scaling limits and noisy neighbor issues, impacting task scheduling SLAs. DMS enhances reliability by distributing DAGs across multiple clusters and enforcing proper configurations.

We will also discuss how DMS streamlines Airflow version upgrades. Upgrading from an old Airflow version to the latest requires sequential updates and code modifications for over 10,000 DAGs. DMS proposes an efficient upgrade method, reducing dependency on users.

Enhancing Small Retailer Visibility

by Hannah Lundrigan & Alberto Hernandez

Small retailers often lack the data visibility that larger companies rely on for decision-making. In this session, we’ll dive into how Apache Airflow powers end-to-end machine learning pipelines that process inventory and sales data, enabling retailers and suppliers to gain valuable industry insights. We’ll cover feature engineering, model training, and automated inference workflows, along with strategies for handling messy, incomplete retail data. We will discuss how Airflow enables scalable ML-driven insights that improve demand forecasting, product categorization, and supply chain optimization.

Ensuring Data Accuracy & Consistency with Airflow and dbt Tests

by Bao Nguyen

As analytics engineers, ensuring data accuracy and consistency is critical, but how do we systematically catch errors before they impact stakeholders? This session will explore how to integrate Airflow with dbt tests to build reliable and automated data validation workflows.

We’ll cover:

  • How to orchestrate dbt tests with Airflow DAGs for real-time data quality checks.
  • Handling test failures with alerting and retry strategies.
  • Using custom dbt tests for advanced validation beyond built-in checks.
  • Best practices for data observability, logging, and monitoring failed runs.
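
One common shape for this kind of orchestration, offered here as a sketch rather than the speaker’s setup, is a dbt run followed by dbt test with retries and an alerting callback on failure; the commands, paths, and callback are placeholders.

  from datetime import datetime, timedelta

  from airflow import DAG
  from airflow.operators.bash import BashOperator


  def notify_on_failure(context):
      # Placeholder: push the failing task's details to Slack, PagerDuty, etc.
      print(f"Data quality failure in {context['task_instance'].task_id}")


  with DAG(
      "dbt_run_and_test",
      schedule="@daily",
      start_date=datetime(2025, 1, 1),
      catchup=False,
      default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
  ):
      run = BashOperator(task_id="dbt_run", bash_command="dbt run --project-dir /opt/dbt")
      test = BashOperator(
          task_id="dbt_test",
          bash_command="dbt test --project-dir /opt/dbt",
          on_failure_callback=notify_on_failure,
      )
      run >> test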

Event-Driven Airflow 3.0: Real-Time Orchestration with Pub/Sub

by Andrea Bombino & Nawfel Bacha

Traditional time-based scheduling in Airflow can lead to inefficiencies and delays. With Airflow 3.0, we can now leverage native event-driven DAG execution, enabling workflows to trigger instantly when data arrives—eliminating polling-based sensors and rigid schedules. This talk explores real-time orchestration using Airflow 3.0 and Google Cloud Pub/Sub. We’ll showcase how to build an event-driven pipeline where DAGs automatically trigger as new data lands, ensuring faster and more efficient processing. Through a live demo, we’ll demonstrate how Airflow listens to Pub/Sub messages and dynamically triggers dbt transformations only when fresh data is available. This approach improves scalability, reduces costs, and enhances orchestration efficiency.

Key takeaways:

  • How event-driven DAGs work vs. traditional scheduling
  • Best practices for integrating Airflow with Pub/Sub
  • Eliminating polling-based sensors for efficiency
  • Live demo: event-driven pipeline with Airflow 3.0, Pub/Sub & dbt

Event-Driven, Partition-Aware: Modern Orchestration with Airflow at Datadog

by Julien Le Dem & Zach Gottesman

Datadog is a world-class data platform ingesting more than 100 trillion events a day and providing real-time insights.

Before Airflow’s prominence, we built batch processing on Luigi, Spotify’s open-source orchestrator. As Airflow gained wide adoption, we evaluated adopting the major improvements of release 2.0, but opted for building our own orchestrator instead to realize our dataset-centric, event-driven vision.

Meanwhile, the 3.0 release aligned Airflow with the same vision we pursued internally, as a modern asset-driven orchestrator. It showed how futile it was to keep building our own given the momentum of the community. We evaluated several orchestrators and decided to join forces with the Airflow project.

From Centralization to Autonomy: Managing Airflow Pipelines through Multi-Tenancy

by Silver Pang

At the enterprise level, managing Airflow deployments across multiple teams can become complex, leading to bottlenecks and slowed development cycles. We will share our journey of decentralizing Airflow repositories to empower data engineering teams with multi-tenancy, clean folder structures, and streamlined DevOps processes.

We dive into how restructuring our Airflow architecture and utilizing repository templates allowed teams to generate new data pipelines effortlessly. This approach enables engineers to focus on business logic without worrying about underlying Airflow configurations. By automating deployments and reducing manual errors through CI/CD pipelines, we minimized operational overhead.

From Chaos to Cosmos: Automating DBT Workflows with Airflow at Riot

by Zach Ward

Breaking Barriers in Data Orchestration: Discover how Riot Games has integrated DBT, Airflow, Astronomer Cosmos, and custom automation to transform data workflows from a complex, code-heavy process into a simple, config-driven experience.

From Complexity to Simplicity: Learn how Riot has dramatically reduced the technical overhead of building and orchestrating DBT models, slashing Time to Production and accelerating Time to Insights.

Building a Seamless Data Pipeline Ecosystem: Get a high-level overview of how we’ve stitched together multiple technologies to create a unified, scalable, and developer-friendly pipelining system.

From Complexity to Simplicity with TaskHarbor: Trendyol's Path to a Unified Orchestration Platform

by Salih Goktug Kose & Burak Ozdemir

At Trendyol, Turkey’s leading e-commerce company, Apache Airflow powers our task orchestration, handling DAGs with 500+ tasks, complex interdependencies, and diverse environments. Managing on-prem Airflow instances posed challenges in scalability, maintenance, and deployment. To address these, we built TaskHarbor, a fully managed orchestration platform with a hybrid architecture—combining Airflow on GKE with on-prem resources for optimal performance and efficiency.

This talk covers how we:

  • Enabled seamless DAG synchronization across environments using GCS Fuse.
  • Optimized workload distribution via GCP’s HTTPS & TCP Load Balancers.
  • Automated infrastructure provisioning (GKE, CloudSQL, Kubernetes) using Terraform.
  • Simplified Airflow deployments by replacing Helm YAML files with a custom templating tool, reducing configurations to 10-15 lines.
  • Built a fully automated deployment pipeline, ensuring zero developer intervention.

We enhanced efficiency, reliability, and automation in hybrid orchestration by embracing a scalable, maintainable, and cloud-native strategy. Attendees will obtain practical insights into architecting Airflow at scale and optimizing deployments.

From DAGs to Insights: Business-Driven Airflow Use Cases

by Tala Karadsheh

Airflow is integral to GitHub’s data and insight generation. This session dives into use cases from GitHub where key business decisions are driven, at the root, with the help of Airflow. The session will also highlight how both GitHub and Airflow celebrate, promote, and nurture OSS innovations in their own ways.

From Legacy to Leading Edge: How Airflow Migration Unlocked Cross-Team Business Value

by Blagoy Kaloferov

At TrueCar, migrating hundreds of legacy Oozie workflows and in-house orchestration tools to Apache Airflow required key technical decisions that transformed our data platform architecture and organizational capabilities. We consolidated individual chained tasks into optimized DAGs leveraging native Airflow functionality to trigger compute across cloud environments. A crucial breakthrough was developing DAG generators to scale migration—essential for efficiently migrating hundreds of workflows while maintaining consistency.

By decoupling orchestration from compute, we gained flexibility to select optimal tools for specific outcomes—programmatic processing, analytics, batch jobs, or AI/ML pipelines. This resulted in cost reductions, performance improvements, and team agility. We also gained unprecedented visibility into DAG performance and dependency patterns previously invisible across fragmented systems.

Attendees will learn how we redesigned complex workflows into efficient DAGs using dynamic task generation, the architectural decisions that enabled platform innovation, and the decision framework that made our migration transformational.

From Oops to Secure Ops: Self-Hosted AI for Airflow Failure Diagnosis

by Nathan Hadfield

Last year, ‘From Oops to Ops’ showed how AI-powered failure analysis could help diagnose why Airflow tasks fail. But do we really need large, expensive cloud-based AI models to answer simple diagnostic questions? Relying on external AI APIs introduces privacy risks, unpredictable costs, and latency, often without clear benefits for this use case.

With the rise of distilled, open-source models, self-hosted failure analysis is now a practical alternative. This talk will explore how to deploy an AI service on infrastructure you control, compare cost, speed, and accuracy between OpenAI’s API and self-hosted models, and showcase a live demo of AI-powered task failure diagnosis using DeepSeek and Llama—running without external dependencies to keep data private and costs predictable.

Get Certified: DAG Authoring for Apache Airflow 3

by Marc Lamberti

We’re excited to offer Airflow Summit 2025 attendees an exclusive opportunity to earn their DAG Authoring certification in person, now updated to include all the latest Airflow 3.0 features. This certification workshop comes at no additional cost to summit attendees.

The DAG Authoring for Apache Airflow certification validates your expertise in advanced Airflow concepts and demonstrates your ability to build production-grade data pipelines. It covers TaskFlow API, Dynamic task mapping, Templating, Asset-driven scheduling, Best practices for production DAGs, and new Airflow 3.0 features and optimizations.

GitHub's Airflow Journey: Lessons, Mistakes, and Insights

by Oleksandr Slynko

This session explores how GitHub uses Apache Airflow for efficient data engineering. We will share nearly 9 years of experiences, including lessons learnt, mistakes made, and the ways we reduced our on-call and engineering burden. We’ll demonstrate how we keep data flowing smoothly while continuously evolving Airflow and other components of our data platform, ensuring safety and reliability. The session will touch on how we migrate Airflow between clouds without user impact. We’ll also cover how we cut down the time from idea to running a DAG in production, despite our Airflow repo being among the top 15 by number of PRs within GitHub.

How Airflow can help with Data Management and Governance

by Kunal Jain

Metadata management is a cornerstone of effective data governance, yet it presents unique challenges distinct from traditional data engineering. At scale, efficiently extracting metadata from relational and NoSQL databases demands specialized solutions. To address this, our team has developed custom Airflow operators that scan and extract metadata across various database technologies, orchestrating 100+ production jobs to ensure continuous and reliable metadata collection.

Now, we’re expanding beyond databases to tackle non-traditional data sources such as file repositories and message queues. This shift introduces new complexities, including processing structured and unstructured files, managing schema evolution in streaming data, and maintaining metadata consistency across heterogeneous sources. In this session, we’ll share our approach to building scalable metadata scanners, optimizing performance, and ensuring adaptability across diverse data environments. Attendees will gain insights into designing efficient metadata pipelines, overcoming common pitfalls, and leveraging Airflow to drive metadata governance at scale.

How Airflow Runs The Weather

by Eloi Codina Torras

Forecasting the weather and air quality is a logistical challenge. Numerical simulations are complex, resource-hungry, and sometimes fail without warning. Yet, our clients depend on accurate forecasts delivered daily and on time. At the heart of this operation is Airflow: the orchestration engine that keeps everything running.

In this session, we’ll dive into the world behind weather and air quality forecasts. In particular, we’ll explore:

  • The atmospheric modeling pipeline, to understand the unique demands it places on infrastructure
  • How we use Airflow to orchestrate complex simulations reliably and at scale, to inspire new ways of managing time-critical, compute-heavy workflows.
  • Our integration of Airflow with a high-performance computing (HPC) environment using Slurm, to run resource-intensive workloads efficiently on bare-metal machines.

At Meteosim we are experts on weather and air quality intelligence. With projects in over 80 countries, we support decision-making in industries where weather and air quality matter most: from daily operations to long-term sustainability.

How Airflow solves the coordination of decentralised teams at Vinted

by Oscar Ligthart & Rodrigo Loredo

Vinted is the biggest second-hand marketplace in Europe with multiple business verticals. Our data ecosystem has over 20 decentralized teams responsible for generating, transforming, and building Data Products from petabytes of data. This creates a challenging environment where inter-team dependencies, varied expertise with scheduling tools, and diverse use cases need to be managed efficiently. To tackle these challenges, we have centralized our approach by leveraging Apache Airflow to orchestrate data dependencies across teams.

How Pinterest Uses AI to Empower Airflow Users for Troubleshooting

by Rachel Sun

At Pinterest, there are over 10,000 DAGs supporting various use cases across different teams and roles. With this scale and diversity, user support has been an ongoing challenge to unlock productivity. As Airflow increasingly serves as a user interface to a variety of data and ML infrastructure behind the scenes, it’s common for issues from multiple areas to surface in Airflow, making triage and troubleshooting a challenge.

In this session, we will discuss the scale of the problem we are facing, how we have addressed it so far, and how we are introducing LLM AI to help solve this problem.

Implementing Operations Research Problems with Apache Airflow: From Modelling to Production

by Philippe Gagnon

This workshop will provide an overview of implementing operations research problems using Apache Airflow. This is a hands-on session where attendees will gain experience creating DAGs to define and manage workflows for classical operations research problems. The workshop will include several examples of how Airflow can be used to optimize and automate various decision-making processes, including:

Inventory management: How to use Airflow to optimize inventory levels and reduce stockouts by analyzing demand patterns, lead times, and other factors.

Introducing Apache Airflow® 3 – The Next Evolution in Orchestration

by Amogh Desai, Ash Berlin-Taylor, Brent Bovenzi, Bugra Ozturk, Daniel Standish, Jed Cunningham, Jens Scheffler, Kaxil Naik, Pierre Jeambrun, Tzu-ping Chung, Vikram Koka, Vincent Beck & Constance Martineau

Apache Airflow® 3 is here, bringing major improvements to data orchestration. In this keynote, core Airflow contributors will walk through key enhancements that boost flexibility, efficiency, and user experience.

Vikram Koka will kick things off with an overview of Airflow 3, followed by deep dives into DAG versioning (Jed Cunningham), enhanced backfilling (Daniel Standish), and a modernized UI (Brent Bovenzi & Pierre Jeambrun).

Next, Ash Berlin-Taylor, Kaxil Naik, and Amogh Desai will introduce the Task Execution Interface and Task SDK, enabling tasks in any environment and language. Jens Scheffler will showcase the Edge Executor, while Tzu-ping Chung and Vincent Beck will demo event-driven scheduling and data assets. Finally, Buğra Öztürk will unveil CLI enhancements for automation and debugging.

Learn from Deutsche Bank: Using Apache Airflow in Regulated Environments

by Christian Foernges

Operating within the stringent regulatory landscape of Corporate Banking, Deutsche Bank relies heavily on robust data orchestration. This session explores how Deutsche Bank’s Corporate Bank leverages Apache Airflow across diverse environments, including both on-premises infrastructure and cloud platforms. Discover their approach to managing critical data & analytics workflows, encompassing areas like regulatory reporting, data integration and complex data processing pipelines. Gain insights into the architectural patterns and operational best practices employed to ensure compliance, security, and scalability when running Airflow at scale in a highly regulated, hybrid setting.

Lessons learned for scaling up Airflow 3 in Public Cloud

by Przemek Wiech & Augusto Hidalgo

Apache Airflow 3 is the new state-of-the-art version of Airflow. For many users who plan to adopt Airflow 3, it’s important to understand how it behaves from a performance perspective compared to Airflow 2.

This presentation will share performance results for various Airflow 3 configurations and provide information that should give Airflow 3 adopters a good understanding of its performance.

The reference Airflow 3 configuration uses a Kubernetes cluster as the compute layer and PostgreSQL as the Airflow database, with the tests performed on Google Cloud Platform. Performance tests will be run using the community version of the performance test framework, and there may be references to Cloud Composer (a managed service for Apache Airflow). The tests will be done in production-grade configurations that should be good references for Airflow community users.

Lessons learned from migrating to Airflow @ LI Scale

by Arthur Chen, Trevor DeVore & Deng Pan

At LinkedIn, our data pipelines process exabytes of data, with our offline infrastructure executing 300K ETL workflows daily and 10K concurrent executions. Historically, these workloads ran on our legacy system, Azkaban, which faced UX, scalability, and operational challenges. To modernize our infra, we built a managed Airflow service, leveraging its enhanced developer & operator experience, rich feature set, and strong OSS community support. That initiated LinkedIn’s largest-ever infrastructure migration—transitioning thousands of legacy workflows to Airflow.

LinkedIn's Journey on Scaling Airflow

by Rahul Gade & Arun Kumar

Last year, we shared how LinkedIn’s continuous deployment platform (LCD) leveraged Apache Airflow to streamline and automate deployment workflows. LCD is the deployment platform inside LinkedIn, actively used by all 10,000+ engineers at LinkedIn.

This year, we take a deeper dive into the challenges, solutions, and engineering innovations that helped us scale Airflow to support thousands of concurrent tasks while maintaining usability and reliability.

Key Takeaways: Abstracting Airflow for a Better User Experience – How we designed a system where users could define and update their workflows without directly interacting with Airflow.

LLM-Powered Review Analysis: Optimising Data Engineering using Airflow

by Naseem Shah

A real-world journey of how my small team at Xena Intelligence built robust data pipelines for our enterprise customers using Airflow. If you’re a data engineer, or part of a small team, this talk is for you. Learn how we orchestrated a complex workflow to process millions of public reviews.

What You’ll Learn:

  1. Cost-Efficient DAG Design: Decomposing complex processes into atomic tasks using the TaskFlow API, XComs, mapped tasks, and task groups. Diving into one of our DAGs as a concrete example of how our approach optimizes parallelism, error handling, delivery speed, and reliability.

LLMOps with Airflow 3.0 and the Airflow AI SDK

by Ryan Hatter

Airflow 3 brings several exciting new features that better support MLOps:

  • Native, intuitive backfills
  • Removal of the unique execution date for dag runs
  • Native support for event-driven scheduling

These features, combined with the Airflow AI SDK, enable dag authors to easily build scalable, maintainable, and performant LLMOps pipelines.

In this talk, we’ll go through a series of workflows that use the Airflow AI SDK to empower Astronomer’s support staff to more quickly resolve problems faced by Astronomer’s customers.

Managing Airflow DAGs Across DAG and ETL Repos

by Yunhao Qing

At Lyft, we manage Airflow DAGs across both the ETL and DAG repos, each serving distinct needs. The ETL repo is ideal for simple use cases and users with only a few DAGs, offering a streamlined workflow. Meanwhile, the DAG repo supports power users with numerous DAGs, custom dependencies, and complex ML pipelines. In this session, I’ll share how we structure these repos, the trade-offs involved, and best practices for scaling Airflow DAG management across diverse teams and workloads.

Mastering Event-Driven in Airflow 3: Building Scalable Data Pipelines

by Luan Moreno Medeiros Maciel

Transform your data pipelines with event-driven scheduling in Airflow 3. In this hands-on workshop, you’ll:

  • Set up AssetWatchers to track S3, Kafka, or database events
  • Build DAGs that trigger instantly on new data
  • Master scaling techniques for high-volume workflows

Create a live pipeline—process logs or IoT data in real time—and adapt it to your needs. No event-driven experience required; just bring a laptop and Airflow basics. Gain practical skills to make your pipelines responsive and efficient.
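
As a rough sketch of the workshop’s starting point (import paths and argument names follow the Airflow 3 documentation as best understood and may differ slightly in released providers), an AssetWatcher ties a message-queue trigger to an Asset that schedules a DAG; the queue URL is invented.

  from datetime import datetime

  from airflow.providers.common.messaging.triggers.msg_queue import MessageQueueTrigger
  from airflow.sdk import Asset, AssetWatcher, dag, task

  # The queue URL is invented; the watcher fires whenever a message lands on it.
  trigger = MessageQueueTrigger(queue="https://sqs.us-east-1.amazonaws.com/0123456789/new-files")
  new_files = Asset("new_files", watchers=[AssetWatcher(name="sqs_watcher", trigger=trigger)])


  @dag(schedule=[new_files], start_date=datetime(2025, 1, 1), catchup=False)
  def event_driven_pipeline():
      @task
      def process():
          ...  # runs as soon as the watcher sees a new message

      process()


  event_driven_pipeline()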

Model Context Protocol with Airflow

by Abhishek Bhakat & Sudarshan Chaudhari

In today’s data-driven world, effective workflow management and AI are crucial for success. However, there’s a notable gap between Airflow and AI. Our presentation offers a solution to close this gap.

We propose an MCP (Model Context Protocol) server to act as a bridge. We’ll dive into two paths:

  • AI-Augmented Airflow: Enhancing Airflow with AI to improve error handling, automate DAG generation, proactively detect issues, and optimize resource use.
  • Airflow-Powered AI: Utilizing Airflow’s reliability to empower LLMs in executing complex tasks, orchestrating AI agents, and supporting decision-making with real-time data.

Key takeaways:

Multi-Instance Asset Synchronization - push or pull ?

by Sebastien Crocquevieille

As Data Engineers, our jobs regularly include scheduling or scaling workflows.

But have you ever asked yourself: can I scale my scheduling?

It turns out that you can! But doing so raises a number of issues that need to be addressed.

In this talk we’ll be:

  • Recapping Asset-aware scheduling in Apache Airflow
  • Discussing diverse methods to upscale our scheduling
  • Solving the issue of keeping our Airflow Assets synchronized between instances
  • Comparing our own push-based solution with the built-in solution from AIP-82, and the pros and cons of each method.

I hope you will enjoy it!

Navigating Secure and Cost-Efficient Flink Batch on Kubernetes with Airflow

by Purshotam Shah & David Scherba

New Tools, Same Craft: The Developer's Toolbox in 2025

by Brooke Jamieson

Our development workflows look dramatically different than they did a year ago. Code generation, automated testing, and AI-assisted documentation tools are now part of many developers’ daily work. Yet as these tools reshape how we code, I’ve noticed something worth examining: while our toolbox is changing rapidly, the core of being a good developer hasn’t. Problem-solving, collaborative debugging, and systems thinking remain as crucial as ever.

In this keynote, I’ll share observations about:

No More Missed Beats: How Airflow Rescued Our Analytics Pipeline

by Pei-Chi (Miko) Chen

Before Airflow, our BigQuery pipelines at Create Music Group operated like musicians without a conductor—each playing on its own schedule, regardless of whether upstream data was ready. As our data platform grew, this chaos led to spiralling costs, performance bottlenecks, and became utterly unsustainable.

This talk tells the story of how Create Music Group brought harmony to its data workflows by adopting Apache Airflow and the Medallion architecture, ultimately slashing our data processing costs by 50%. We’ll show how moving to event-driven scheduling with datasets helped eliminate stale data issues, dramatically improved performance, and unlocked faster iteration across teams. Discover how we replaced repetitive SQL with standardized dimension/fact tables, empowering analysts in a safer sandbox.

Operation Airlift: Uber's ongoing journey of migrating 200K pipelines to a single Airflow 3 instance

by Sumit Maheshwari

Yes, you read that right — 200,000 pipelines, nearly 1 million task executions per day, all powered by a single Airflow instance.

In this session, we’ll take you behind the scenes of one of the boldest orchestration projects ever attempted: how Uber’s data platform team is executing what might be the largest Apache Airflow migration in history — and doing it straight to Airflow 3.

From scaling challenges and architectural choices to lessons learned in high-throughput orchestration, this is a deep dive into the tech, the chaos, and the strategy behind making data fly at unprecedented scale.

Orchestrating AI Knowledge Bases with Apache Airflow

by Theo Lebrun

In the age of Generative AI, knowledge bases are the backbone of intelligent systems, enabling them to deliver accurate and context-aware responses. But how do you ensure that these knowledge bases remain up-to-date and relevant in a rapidly changing world? Enter Apache Airflow, a robust orchestration tool that streamlines the automation of data workflows.

This talk will explore how Airflow can be leveraged to manage and update AI knowledge bases across multiple data sources. We’ll dive into the architecture, demonstrate how Airflow enables efficient data extraction, transformation, and loading (ETL), and share insights on tackling challenges like data consistency, scheduling, and scalability.
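
As a simple illustration of the shape such a pipeline can take (the helpers below are hypothetical, not the speaker’s code), a knowledge-base refresh often reduces to an extract/embed/load DAG:

```python
# Sketch of a knowledge-base refresh DAG; the extract/embed/load helpers are hypothetical.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@hourly", start_date=datetime(2025, 1, 1))
def refresh_knowledge_base():
    @task
    def extract_documents() -> list[dict]:
        # Pull new or changed documents from source systems (wiki, tickets, S3, ...)
        return [{"id": "doc-1", "text": "..."}]

    @task
    def embed(documents: list[dict]) -> list[dict]:
        # Compute embeddings with the model of your choice (placeholder values here)
        return [{**d, "embedding": [0.0, 0.1]} for d in documents]

    @task
    def load(vectors: list[dict]) -> None:
        # Upsert into the vector store backing the knowledge base (hypothetical sink)
        print(f"upserting {len(vectors)} vectors")

    load(embed(extract_documents()))

refresh_knowledge_base()
```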

Orchestrating Apache Airflow ML Workflows at Scale with SageMaker Unified Studio

by Vinod Jayendra, Suba Palanisamy, Sean Bjurstrom & Anurag Srivastava

As organizations increasingly rely on data-driven applications, managing the diverse tools, data, and teams involved can create challenges. Amazon SageMaker Unified Studio addresses this by providing an integrated, governed platform to orchestrate end-to-end data and AI/ML workflows.

In this workshop, we’ll explore how to leverage Amazon SageMaker Unified Studio to build and deploy scalable Apache Airflow workflows that span the data and AI/ML lifecycle. We’ll walk through real-world examples showcasing how this AWS service brings together familiar Airflow capabilities with SageMaker’s data processing, model training, and inference features - all within a unified, collaborative workspace.

Orchestrating Data Quality - Quality Data Brought To You By Airflow

by Maggie Stark & Marion Azoulai

Ensuring high-quality data is essential for building user trust and enabling data teams to work efficiently. In this talk, we’ll explore how the Astronomer data team leverages Airflow to uphold data quality across complex pipelines; minimizing firefighting and maximizing confidence in reported metrics.

Maintaining data quality requires a multi-faceted approach: safeguarding the integrity of source data, orchestrating pipelines reliably, writing robust code, and maintaining consistency in outputs. We’ve embedded data quality into the developer experience, so it’s always at the forefront instead of in the backlog of tech debt.
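
The talk covers the Astronomer team’s own approach; purely to illustrate what in-pipeline checks can look like, here is a sketch using the common-sql provider’s column checks, where the table name, connection id, and thresholds are assumptions:

```python
# Sketch: declarative column-level data quality checks with the common-sql provider.
# Table name, connection id, and thresholds are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLColumnCheckOperator

with DAG(dag_id="orders_quality_checks", schedule="@daily", start_date=datetime(2025, 1, 1)) as dag:
    SQLColumnCheckOperator(
        task_id="check_orders_quality",
        conn_id="warehouse",
        table="analytics.orders",
        column_mapping={
            "order_id": {"unique_check": {"equal_to": 0}, "null_check": {"equal_to": 0}},
            "amount": {"min": {"geq_to": 0}},
        },
    )
```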

Orchestrating Global Market Data Pipelines with Airflow

by Di Wu

In this presentation, I will highlight how Apache Airflow addresses key data management challenges for Exchange-Traded Funds (ETFs) in the global financial market. ETFs, which combine features of mutual funds and stocks, track indexes, commodities, or baskets of assets and trade on major stock exchanges. Because they operate around the clock across multiple time zones, ETF managers must navigate diverse regulations, coordinate complex operational constraints, and ensure accurate valuations.

This often requires integrating data from vendors for pricing and reference details. These data sets arrive at different times, can conflict, and must pass rigorous quality checks before being published for global investors. Managing updates, orchestrating workflows, and maintaining high data quality present significant hurdles.

Apache Airflow tackles these issues by scheduling repetitive tasks and enabling event-triggered job runs for immediate data checks. It offers monitoring and alerting, thus reducing manual intervention and errors. Using DAGs, Airflow scales efficiently, streamlining complex data ingestion, validation, and publication processes.

Orchestrating MLOps and Data Transformation at EDB with Airflow

by Karthik Dulam

This talk explores EDB’s journey from siloed reporting to a unified data platform, powered by Airflow. We’ll delve into the architectural evolution, showcasing how Airflow orchestrates a diverse range of use cases, from Analytics Engineering to complex MLOps pipelines.

Learn how EDB leverages Airflow and Cosmos to integrate dbt for robust data transformations, ensuring data quality and consistency.

We’ll provide a detailed case study of our MLOps implementation, demonstrating how Airflow manages training, inference, and model monitoring pipelines for Azure Machine Learning models.

Productionising dbt-core with Airflow

by Tatiana Al-Chueyr Martins, Pankaj Singh & Pankaj Koti

As a popular open-source library for analytics engineering, dbt is often combined with Airflow. Orchestrating and executing dbt models as Airflow DAGs adds a layer of control over tasks and observability, and provides a reliable, scalable environment in which to run dbt models.

This workshop will cover a step-by-step guide to Cosmos (https://github.com/astronomer/astronomer-cosmos), a popular open-source package from Astronomer that helps you quickly run your dbt Core projects as Airflow DAGs and Task Groups, all with just a few lines of code. We’ll walk through:
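
For orientation before the workshop, a minimal Cosmos DAG looks roughly like the sketch below; the project path, profile mapping, connection id, and schedule are assumptions, not the workshop’s exact setup.

```python
# Rough sketch of rendering a dbt Core project as an Airflow DAG with Cosmos.
# Project path, profile mapping, and connection id are illustrative assumptions.
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import PostgresUserPasswordProfileMapping

jaffle_shop = DbtDag(
    dag_id="jaffle_shop",
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/jaffle_shop"),
    profile_config=ProfileConfig(
        profile_name="jaffle_shop",
        target_name="dev",
        profile_mapping=PostgresUserPasswordProfileMapping(
            conn_id="postgres_default",
            profile_args={"schema": "public"},
        ),
    ),
)
```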

Purple is the new green: harnessing deferrable operators to improve performance & reduce costs

by Ethan Shalev

Airflow’s traditional execution model often leads to wasted resources: worker nodes sitting idle, waiting on external systems. At Wix, we tackled this inefficiency head-on by refactoring our in-house operators to support Airflow’s deferrable execution model.

Join us on a walk through Wix’s journey to a more efficient Airflow setup, from identifying bottlenecks to implementing deferrable operators and reaping their benefits. We’ll share the alternatives considered, the refactoring process, and how the team seamlessly integrated deferrable execution with no disruption to data engineers’ workflows.
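
As a generic illustration of the pattern (not Wix’s operators), a blocking wait can be handed off to the triggerer roughly like this:

```python
# Generic sketch of the deferrable pattern: free the worker slot while waiting.
from datetime import timedelta

from airflow.models.baseoperator import BaseOperator
from airflow.triggers.temporal import TimeDeltaTrigger

class WaitThenContinueOperator(BaseOperator):
    """Waits without occupying a worker slot, then resumes on a worker."""

    def execute(self, context):
        # Hand control to the triggerer instead of sleeping on a worker
        self.defer(
            trigger=TimeDeltaTrigger(timedelta(minutes=10)),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Called back on a worker once the trigger fires
        self.log.info("Wait finished, continuing with event: %s", event)
        return event
```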

Run Airflow tasks on your coffee machine

by Cedrik Neumann

Airflow 3 comes with two new features: Edge execution and the task SDK. Powered by an HTTP API, these make it possible to write and execute Airflow tasks in any language from anywhere.

In this session I will explain some of the APIs needed and show how to interact with them based on an embedded toy worker written in Rust and running on an ESP32-C3. Furthermore I will provide practical tips on writing your own edge worker and how to develop against a running instance of Airflow.

Scaling Airflow with MWAA: A Multi-Tenant Enterprise Data Platform Journey

by Srinivas Podila & Venkat Sadineni

We use Amazon MWAA to orchestrate our enterprise data warehouse and MDM solutions. Our DAGs extract data from Salesforce, Oracle, Workday, and SFTP, transform it using Mulesoft, Informatica, and DBT, and load it into Salesforce Data Cloud and Snowflake. MWAA is configured as a multi-tenant platform, supporting more than 10 teams and managing thousands of DAGs per environment. Each team follows a full SDLC and has a dedicated Git repo integrated with Jenkins-based CI/CD pipelines for independent deployments.

Scaling and Unifying Multiple Airflow Instances with Orchestration Frederator

by Chirag Todarka & Alvin Zhang

In large organizations, multiple Apache Airflow instances often arise organically—driven by team-specific needs, distinct use cases, or tiered workloads. This fragmentation introduces complexity, operational overhead, and higher infrastructure costs. To address these challenges, we developed the “Orchestration Frederator,” a solution designed to unify and horizontally scale multiple Airflow deployments seamlessly.

This session will detail our journey in implementing Orchestration Frederator, highlighting how we achieved:

  • Horizontal Scalability: Seamlessly scaling Airflow across multiple instances without operational overhead.

Seamless Airflow Upgrades: Migrating from 2.x to 3

by Ankit Chaurasia

Airflow 3 has officially arrived! In this session, we’ll start by discussing prerequisites for a smooth upgrade from Airflow 2.x to Airflow 3, including Airflow version requirements, removing deprecated SubDAGs, and backing up and cleaning your metadata database prior to migration. We’ll then explore the new CLI utility airflow config update [--fix] for auto-applying configuration changes. We’ll demo cleaning old XCom data to speed up schema migration.

During this session, attendees will learn to verify and adapt their pipelines for Airflow 3 using a Ruff-based upgrade utility. I will demo running ruff check dag/ --select AIR301 to surface scheduling issues, inspecting fixes via ruff check dag/ --select AIR301 --show-fixes, and applying corrections with ruff check dag/ --select AIR301 --fix. We’ll also examine rules AIR302 for deprecated config and AIR303 for provider package migrations. By the end, your DAGs will pass all AIR3xx checks error-free.

Seamless Integration: Building Applications That Leverage Airflow's Database Migration Framework

by Ephraim Anierobi

This session presents a comprehensive guide to building applications that integrate with Apache Airflow’s database migration system. We’ll explore how to harness Airflow’s robust Alembic-based migration toolchain to maintain schema compatibility between Airflow and custom applications, enabling developers to create solutions that evolve alongside the Airflow ecosystem without disruption.
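
As a small, assumption-laden illustration of the kind of integration discussed (not the presenter’s code), an external application can run its own Alembic migrations against the same database while keeping a separate version table so its history never collides with Airflow’s alembic_version table:

```python
# Sketch: an application running its own Alembic migrations alongside Airflow's.
# The ini path, database URL, and version table name are assumptions.
from alembic import command
from alembic.config import Config

cfg = Config("my_app/alembic.ini")  # the application's own migration environment
cfg.set_main_option("sqlalchemy.url", "postgresql://airflow:***@db:5432/airflow")

# In the application's env.py, configure a distinct version table, e.g.
#   context.configure(connection=conn, target_metadata=metadata,
#                     version_table="my_app_alembic_version")
# so the app's revision history stays separate from Airflow's `alembic_version`.

command.upgrade(cfg, "head")  # apply the application's migrations up to the latest revision
```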

Seamless Migration: Leveraging Ruff for a Smooth Transition from Airflow 2 to Airflow 3

by Wei Lee

Migrating from Airflow 2 to the newly released Airflow 3 may seem intimidating due to numerous breaking changes and the introduction of new features. Although a backward compatibility layer has been implemented and most existing DAGs should work fine, some features—such as SubDAGs and execution_date—have been removed based on community consensus.

To support this transition, we worked with Ruff to establish rules that automatically identify removed or deprecated features and even assist in fixing them. In this presentation, I will outline our current Ruff features, the migration rules from Airflow 2 to 3, and how this experience opens the door for us to promote best practices in Airflow through Ruff in the future.

Security made us do it: Airflow’s new Task Execution Architecture

by Amogh Desai & Ash Berlin-Taylor

The Airflow 2 architecture has strong coupling between the Airflow core & the user code running in an Airflow task. This poses barriers in security, maintenance, and adoption. One such threat is that user code can access Airflow’s source of truth - the metadata DB - and run any query against it! From a scalability angle, ‘n’ tasks create ‘n’ DB connections, limiting Airflow’s ability to scale effectively.

To address this we proposed AIP-72 – a client-server model for task execution. The new architecture addresses several long-standing issues, including DB isolation from workers, dependency conflicts between Airflow core & workers, and the ‘n’ DB connections problem. The new architecture has two parts:

Semiconductor (Chip) Design Workflow Orchestration with Airflow

by Dheeraj Turaga

The design of Qualcomm’s Snapdragon System-On-Chip (SoCs) involves several hundred complex workflows orchestrated across multiple data centers, taking the design from RTL to GDS. In the Snapdragon Oryon Custom CPU team, we introduced Airflow about 2 years ago to orchestrate design, verification, emulation, CI/CD, and physical implementation of our CPUs.

Use cases:

  • Standardization and Templatization: We standardize and templatize common workflows, allowing designers to verify their designs by customizing YAML parameters.
  • Custom Shell Operators: We created custom shell operators (tcshrc) to source project environments and work with internal tooling.
  • Smart Retries: We use pre/post-execute hooks to trigger smart retries on failure.
  • Dynamic Celery Workers: We auto-create Celery workers on the fly on our High-Performance Compute (HPC) clusters to launch and manage Electronic Design Automation (EDA) workloads.
  • Hybrid Executor Strategy: We use a hybrid executor strategy (CeleryExecutor and EdgeExecutor) to orchestrate tasks across multiple data centers.
  • EdgeExecutor for Remote Testing: We leverage EdgeExecutor to access post-silicon hardware in remote locations.

Simplifying Data Lineage: How OpenLineage Empowers Airflow and Beyond

by Harel Shein & Maciej Obuchowski

OpenLineage has simplified collecting lineage metadata across the data ecosystem by standardizing its representation in an extensible model. It has enabled a whole ecosystem of tools that improve data pipeline reliability and ease troubleshooting in production environments. In this talk, we’ll briefly introduce the OpenLineage model and explore how this metadata is collected from Airflow, Spark, dbt, and Flink. We’ll demonstrate how to extract valuable insights and outline practical benefits and common challenges when building ingestion, processing, and storage for OpenLineage data. We will also briefly show how OpenLineage events can be used to observe data pipelines exhaustively and the benefits that brings.

Simplifying Data Management with DAG Factory

by Katarzyna Kalek & Jakub Orlowski

At OLX, we connect millions of people daily through our online marketplace while relying on robust data pipelines. In this talk, we explore how the DAG Factory concept elevates data governance, lineage, and discovery by centralizing operator logic and restricting direct DAG creation. This approach enforces code quality, optimizes resources, maintains infrastructure hygiene and enables smooth version upgrades. We then leverage consistent naming conventions in Airflow to build targeted namespaces, aligning teams with global policies while preserving autonomy. Integrating external tools like AWS Lake Formation and Open Metadata further unifies governance, making it straightforward to manage and secure data. This is critical when handling hundreds or even thousands of active DAGs.

If the idea of storing 1,600 pipelines in one folder seems overwhelming, join us to learn how the DAG Factory concept simplifies pipeline management. We’ll also share insights from OLX, highlighting how thoughtful design fosters oversight, efficiency, and discoverability across diverse use cases.
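
The specifics at OLX are their own; as a generic flavour of the concept, a factory that turns declarative configuration into DAGs can look like the sketch below, where the config shape and helper names are hypothetical.

```python
# Generic sketch of a DAG factory: teams declare config, the factory owns DAG creation.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

PIPELINES = {  # hypothetical declarative config, e.g. loaded from YAML
    "ads_daily": {"schedule": "@daily", "steps": ["extract", "transform", "load"]},
    "users_hourly": {"schedule": "@hourly", "steps": ["extract", "load"]},
}

def build_dag(name: str, spec: dict) -> DAG:
    with DAG(dag_id=name, schedule=spec["schedule"], start_date=datetime(2025, 1, 1)) as dag:
        previous = None
        for step in spec["steps"]:
            op = EmptyOperator(task_id=step)  # a real factory would map steps to vetted operators
            if previous is not None:
                previous >> op
            previous = op
    return dag

# Register one DAG per config entry so Airflow's DAG processor discovers them
for _name, _spec in PIPELINES.items():
    globals()[_name] = build_dag(_name, _spec)
```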

Single Pane Orchestration using Airflow for multiple teams at GoDaddy

by Ankit Sahu & Brandon Abear

As the adoption of Airflow increases within large enterprises to orchestrate their data pipelines, more than one team needs to create, manage, and run their workflows in isolation. With multi-tenancy not supported natively in Airflow, customers are adopting alternative ways to allow multiple teams to use the same infrastructure. In this session, we will explore how GoDaddy uses MWAA to build a single-pane Airflow setup for multiple teams with a common observability platform. Attendees will gain insights into the use case, the solution, and its implementation challenges and benefits.

Supercharging Apache Airflow: Enhancing Core Components with Rust

by Shahar Epstein

Apache Airflow is a powerful workflow orchestrator, but as workloads grow, its Python-based components can become performance bottlenecks. This talk explores how Rust, with its speed, safety, and concurrency advantages, can enhance Airflow’s core components (e.g., the scheduler, DAG processor, etc.). We’ll dive into the motivations behind using Rust, architectural trade-offs, and the challenges of bridging the gap between Python and Rust. A proof-of-concept showcasing an Airflow scheduler rewritten in Rust will demonstrate the potential benefits of this approach.

Sustainable Computing in Airflow: Reducing Emissions with Carbon Aware Scheduling

by Ryan Singman

As the climate impact of cloud computing grows, carbon aware computing offers a promising way to cut emissions without compromising performance. By shifting workloads to times of lower carbon intensity on the power grid, we can achieve significant emissions reductions—often 10–30%—with no code changes to the underlying task.

In this talk, we’ll explore the principles behind carbon-aware computing, walk through how these ideas translate to actionable reductions in Airflow, and introduce the open-source CarbonAware provider for Airflow. We’ll also highlight how Airflow’s deferrable operators, task metadata, and flexible execution model make it uniquely well suited for temporal shifting based on grid carbon intensity.

Task failures troubleshooting based on Airflow & Kubernetes signals

by Khadija Al Ahyane

Per the Airflow community survey, Kubernetes is the most popular compute platform for running Airflow. When run on Kubernetes, Airflow gains many benefits out of the box, such as monitoring, reliability, ease of deployment, scalability, and autoscaling. On the other hand, running Airflow on Kubernetes means running one sophisticated distributed system on top of another, which makes troubleshooting Airflow task and DAG failures harder.

This session tackles that bottleneck head-on, introducing a practical approach to building an automated diagnostic pipeline for Airflow on Kubernetes. Imagine offloading tedious investigations to a system that, on task failure, automatically collects and correlates key signals from Kubernetes components (linking Airflow tasks to specific Pods and their events), Kubernetes/GKE monitoring, and relevant logs—pinpointing root causes and suggesting actionable fixes.
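
To make the idea more tangible, here is a rough sketch (under stated assumptions, not the presented pipeline) of a task-level failure callback that starts the correlation by pulling recent Kubernetes events from the namespace the task ran in:

```python
# Sketch: on task failure, collect recent Kubernetes events for correlation.
# The namespace, pod-name matching heuristic, and report destination are assumptions.
from kubernetes import client, config

def collect_k8s_signals(context):
    ti = context["task_instance"]
    config.load_incluster_config()  # assumes the callback runs inside the cluster
    v1 = client.CoreV1Api()
    events = v1.list_namespaced_event(namespace="airflow")
    # Naive correlation: keep events whose involved object name mentions the task id
    related = [
        e for e in events.items
        if ti.task_id in (e.involved_object.name or "")
    ]
    for e in related:
        print(f"{e.last_timestamp} {e.reason}: {e.message}")
    # A fuller pipeline would also fetch pod specs and logs, then ship a report somewhere

default_args = {"on_failure_callback": collect_k8s_signals}  # attach to DAGs as needed
```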

The Secret to Airflow's Evergreen Build: CI/CD magic

by Amogh Desai, Jarek Potiuk & Pavan kumar Gopidesu

Have you ever wondered why Apache Airflow builds are asymptotically(*) green? That striving for a “perennial green build” is not magic; it’s the result of continuous, often unseen engineering effort within our CI/CD pipelines & dev environments. This dedication ensures that maintainers can work efficiently & contributors can onboard smoothly.

To support the ever-growing contributor base, we have a CI/CD team run by volunteers putting significant work into the foundational tooling. In this talk, we reveal some of the innovative solutions we have implemented, like:

Transforming Data Engineering: Achieving Efficiency and Ease with an Intuitive Orchestration Solution

by Rakesh Kumar Tai & Mili Tripathi

In the rapidly evolving field of data engineering and data science, efficiency and ease of use are crucial. Our innovative solution offers a user-friendly interface to manage and schedule custom PySpark, PySQL, Python, and SQL code, streamlining the process from development to production. Using Airflow at the backend, this tool eliminates the complexities of infrastructure management, version control, CI/CD processes, and workflow orchestration.

The intuitive UI allows users to upload code, configure job parameters, and set schedules effortlessly, without the need for additional scripting or coding. Additionally, users have the flexibility to bring their own custom artifactory solution and run their code. In summary, our solution significantly enhances the orchestration and scheduling of custom code, breaking down traditional barriers and empowering organizations to maximize their data’s potential and drive innovation efficiently. Whether you are an individual data scientist or part of a large data engineering team, this tool provides the resources needed to streamline your workflow and achieve your goals faster than ever before.

Transforming Insurance underwriting with Agentic AI

by Peeyush Rai

The weav.ai platform is built on top of Apache Airflow, chosen for its deterministic, predictable execution coupled with extreme developer customizability. weav.ai has seamlessly integrated its AI agents with Airflow to enable unified AI orchestration, bringing together scalability, robustness, and the intelligence of AI in a single process. This talk will focus on the use cases being served, an architecture overview of the key Airflow capabilities being leveraged, and how Agentic AI has been seamlessly integrated to deliver AI-powered workflows. Weav.ai’s platform is agnostic to any specific cloud or LLM and can orchestrate across those based on the use case.

Unleash Airflow's Potential with hands-on Performance Optimization workshop

by Mike Ellis

This interactive workshop session empowers you to unlock the full potential of Apache Airflow through performance optimization techniques. Gain hands-on experience identifying performance bottlenecks and implementing best practices to overcome them.

Unlocking Event-Driven Scheduling in Airflow 3: A New Era of Reactive Data Pipelines

by Vincent Beck

Airflow 3 introduces a major evolution in orchestration: native support for external event-driven scheduling. In this talk, I’ll share the journey behind AIP-82—why we needed it, how we built it, and what it unlocks. I’ll dive into how the new AssetWatcher enables pipelines to respond immediately to events like file arrivals, API calls, or pub/sub messages. You’ll see how this drastically reduces latency and infrastructure overhead while improving reactivity and resource efficiency.

We’ll explore how it works under the hood, real-world use cases, best practices, and migration tips for teams ready to shift from time-based to event-driven workflows. If you’re looking to make your Airflow DAGs more dynamic, this is the talk that shows you how. Whether you’re an operator or contributor, you’ll walk away with a deep understanding of one of Airflow 3’s most impactful features.

Uses in an on Prem Research Setting

by Lawrence Gerstley

KP Division of Research uses Airflow as a central technology for integrating diverse technologies in an agile setting. We wish to present a set of use cases for AI/ML workloads, including imaging analysis (tissue segmentation, mammography), NLP (early identification of psychosis), LLM processing (identification of vessel diameter from radiological impressions), and other large data processing tasks. We create these “short-lived” project workflows to accomplish specific aims, and may never run a job again, so leveraging generalized patterns is crucial to implementing these jobs quickly.

Our Advanced Computational Infrastructure comprises multiple Kubernetes clusters, and we use Airflow to democratize the use of our batch-level resources in those clusters. We use Airflow form-based parameters to deploy pods running R and Python scripts, where generalized parameters are injected into scripts that follow internal programming patterns. Finally, we also leverage Airflow to create headless services inside Kubernetes for large computational workloads (Spark & H2O) that subsequent pods consume ephemerally.

Using Apache Airflow with Trino for (almost) all your data problems

by Philippe Gagnon

Trino is incredibly effective at enabling users to quickly extract insights from large amounts of data located in dispersed, heterogeneous federated data systems.

However, some business data problems are more complex than interactive analytics use cases, and are best broken down into a sequence of interdependent steps, a.k.a. a workflow. For these use cases, dedicated software is often required in order to schedule and manage these processes with a principled approach.
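
For those workflow-style cases, one common approach (shown here as an assumption-heavy sketch, not the speaker’s setup) is to sequence each Trino SQL step as an Airflow task against a Trino connection:

```python
# Sketch: sequencing Trino SQL steps as Airflow tasks via the common-sql operator.
# Connection id, catalogs/schemas, and SQL statements are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(dag_id="trino_daily_rollup", schedule="@daily", start_date=datetime(2025, 1, 1)) as dag:
    stage = SQLExecuteQueryOperator(
        task_id="stage_raw_events",
        conn_id="trino_default",
        sql="INSERT INTO lake.staging.events SELECT * FROM kafka.raw.events",
    )
    rollup = SQLExecuteQueryOperator(
        task_id="build_daily_rollup",
        conn_id="trino_default",
        sql=(
            "INSERT INTO lake.marts.daily_events "
            "SELECT date(ts) AS day, count(*) AS events FROM lake.staging.events GROUP BY 1"
        ),
    )
    stage >> rollup
```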

When Airflow Meets Yunikorn: Enhancing Airflow with Yunikorn for Higher Efficiency

by Xiaodong Deng & Chaoran Yu

Apache Airflow’s Kubernetes integration enables flexible workload execution on Kubernetes but lacks advanced resource management features such as application queueing, tenant isolation, and gang scheduling. These features are increasingly critical for data engineering as well as AI/ML use cases, particularly GPU utilization optimization. Apache Yunikorn, a Kubernetes-native scheduler, addresses these gaps by offering a high-performance alternative to the default Kubernetes scheduler. In this talk, we’ll demonstrate how to conveniently leverage Yunikorn’s power in Airflow, along with practical use cases and examples.
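
As a rough sketch of how a task’s pod can be routed to a Yunikorn queue (the queue name, namespace, and image are assumptions, and Yunikorn’s admission controller or an explicit schedulerName in a pod template must be in place):

```python
# Sketch: routing a KubernetesPodOperator pod to a Yunikorn queue via pod labels.
# Queue name, namespace, and image are assumptions; Yunikorn must be installed and
# configured (e.g. via its admission controller) to pick up labelled pods.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(dag_id="yunikorn_example", schedule=None, start_date=datetime(2025, 1, 1)) as dag:
    train = KubernetesPodOperator(
        task_id="train_model",
        name="train-model",
        namespace="ml",
        image="python:3.12-slim",
        cmds=["python", "-c", "print('training...')"],
        labels={
            "applicationId": "airflow-train-model",  # groups pods into one Yunikorn application
            "queue": "root.ml.gpu",                  # target Yunikorn queue (assumed)
        },
    )
```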

Why AWS chose Apache Airflow to power workflows for the next generation of Amazon SageMaker

by John Jackson

On March 13th, 2025, Amazon Web Services announced General Availability of Amazon SageMaker Unified Studio, bringing together AWS machine learning and analytics capabilities. At the heart of this next generation of Amazon SageMaker sits Apache Airflow. All SageMaker Unified Studio users have a personal, open-source Airflow deployment, running alongside their Jupyter notebook, enabling those users to easily develop Airflow DAGs that have unified access to all of their data.

In this talk, I will go into details around the motivations for choosing Airflow for this capability, the challenges with incorporating Airflow into such a large and diverse experience, the key role that open-source plays, how we’re leveraging GenAI to make that open source development experience better, and the goals for the future of Airflow in SageMaker Unified Studio.

Why Data Teams Keep Reinventing the Wheel: The Struggle for Code Reuse in the Data Transformation La

by Maxime Beauchemin

Data teams have a bad habit: reinventing the wheel. Despite the explosion of open-source tooling, best practices, and managed services, teams still find themselves building bespoke data platforms from scratch—often hitting the same roadblocks as those before them. Why does this keep happening, and more importantly, how can we break the cycle?

In this talk, we’ll unpack the key reasons data teams default to building rather than adopting, from technical nuances to cultural and organizational dynamics. We’ll discuss why fragmentation in the modern data stack, the pressure to “own” infrastructure, and the allure of in-house solutions make this problem so persistent.

Workshop: Get started with Airflow 3.0

by Kenten Danas

Airflow 3.0 is the most significant release in the project’s history, and brings a better user experience, stronger security, and the ability to run tasks anywhere, at any time. In this workshop, you’ll get hands-on experience with the new release and learn how to leverage new features like DAG versioning, backfills, data assets, and a new React-based UI.

Whether you’re writing traditional ELT/ETL pipelines or complex ML and GenAI workflows, you’ll learn how Airflow 3 will make your day-to-day work smoother and your pipelines even more flexible. This workshop is suitable for intermediate to advanced Airflow users. Beginning users should consider taking the Airflow fundamentals course on the Astronomer Academy before attending this workshop.

Your first Apache Airflow Contribution

by Ryan Hatter, Amogh Desai & Phani Kumar

Ready to contribute to Apache Airflow? In this hands-on workshop, you’ll be expected to come prepared with your development environment already configured (Breeze installed is strongly recommended, but Codespaces works if you can’t install Docker). We’ll dive straight into finding issues that match your skills and walk you through the entire contribution process—from creating your first pull request to receiving community feedback. Whether you’re writing code, enhancing documentation, or offering feedback, there’s a place for you. Let’s get started and see your name among Airflow contributors!

Your privacy or our progress: rethinking telemetry in Airflow

by Bolke de Bruin

We face a paradox: we could use usage data to build better software, but collecting that data seems to contradict the very principles of user freedom that open source represents. Apache Airflow’s telemetry system - already purged - has become a battleground for this conflict, with some users voicing privacy concerns while maintainers struggle to make informed decisions without data. What can we do to strike the right balance?