Check out the full program for Airflow Summit.

If you prefer, you can also view this as a Sessionize layout or a list of sessions.

Thursday, October 9, 2025

Day overview: keynote at 09:15; sessions from 10:30; coffee breaks in the morning and at 15:30; lunch at 13:00; lightning-talk slots at 17:35, 17:40, 17:45, and 17:50 (sign up for these slots at the registration desk); event wrap-up at 18:00.
09:15 - 10:00.
By Peeyush Rai & Vikram Koka
Track: Keynote
Room: Columbia A
Airflow as a Platform for Agentic AI Digital Products Within Enterprises

In this keynote, Peeyush Rai and Vikram Koka will walk through how Airflow is being used as part of an agentic AI platform serving insurance companies, which runs on all the major public clouds and leverages models from OpenAI, Google (Gemini), and AWS (Claude via Bedrock).

This talk walks through the details of the actual end-user business workflow, including gathering relevant financial data to make a decision, as well as the tricky challenge of handling AI hallucinations with new Airflow capabilities such as “Human in the loop”.

This talk offers something for both business and technical audiences. Business users will get a clear view of what it takes to bring an AI application into production and how to align their operations and business teams with an AI-enabled workflow. Meanwhile, technical users will walk away with practical insights on how to orchestrate complex business processes, enabling seamless collaboration between Airflow, AI agents, and humans in the loop.


10:30 - 10:55.
By Zhe-You Liu
Track: Airflow intro/overview
Room: Beckler
Becoming an Apache Airflow Committer from 0

How a Complete Beginner in Data Engineering / Junior Computer Science Student Became an Apache Airflow Committer in Just 5 Months—With 70+ PRs and 300 Hours of Contributions

This talk is aimed at those who are still hesitant about contributing to Apache Airflow. I hope to inspire and encourage anyone to take the first step and start their journey in open-source—let’s build together!


10:30 - 13:00.
By Jon Fink & Amy Pitcher
Track: Workshop
Room: 305
Bridging Data Pipelines and Business Applications with Airflow and Control-M

AI and ML pipelines built in Airflow often power critical business outcomes, but they rarely operate in isolation. In this hands-on workshop, learn how Control-M integrates with Airflow to orchestrate end-to-end workflows that include upstream and downstream enterprise systems like Supply Chain and Billing. Gain visibility, reliability, and seamless coordination across your data pipelines and the business operations they support.

10:30 - 10:55.
By Hannah Lundrigan
Track: Use cases
Room: Columbia C
Enhancing Small Retailer Visibility: Machine Learning Pipelines with Apache Airflow

Small retailers often lack the data visibility that larger companies rely on for decision-making. In this session, we’ll dive into how Apache Airflow powers end-to-end machine learning pipelines that process inventory and sales data, enabling retailers and suppliers to gain valuable industry insights. We’ll cover feature engineering, model training, and automated inference workflows, along with strategies for handling messy, incomplete retail data. We will discuss how Airflow enables scalable ML-driven insights that improve demand forecasting, product categorization, and supply chain optimization.


10:30 - 13:00.
By Marc Lamberti
Track: Workshop
Room: 301
Get Certified: DAG Authoring for Apache Airflow 3

We’re excited to offer Airflow Summit 2025 attendees an exclusive opportunity to earn their DAG Authoring certification in person, now updated to include all the latest Airflow 3.0 features. This certification workshop comes at no additional cost to summit attendees.

The DAG Authoring for Apache Airflow certification validates your expertise in advanced Airflow concepts and demonstrates your ability to build production-grade data pipelines. It covers the TaskFlow API, dynamic task mapping, templating, asset-driven scheduling, best practices for production DAGs, and new Airflow 3.0 features and optimizations.

The certification session includes:

  • 20-minute preparation period with expert guidance
  • Live Q&A session with Marc Lamberti from Astronomer
  • 60-minute examination period
  • Real-time results and immediate feedback

To prepare for the Airflow Certification, visit the Astronomer Academy (https://academy.astronomer.io/page/astronomer-certification).
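For a taste of two of the covered topics, the TaskFlow API and dynamic task mapping, here is a minimal self-contained sketch (the DAG and task names are hypothetical, not exam material):

    from airflow.decorators import dag, task
    from pendulum import datetime


    @dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False)
    def taskflow_mapping_demo():
        """Hypothetical DAG showing TaskFlow and dynamic task mapping."""

        @task
        def list_partitions() -> list[str]:
            # In a real pipeline this might query object storage or a catalog.
            return ["2025-10-01", "2025-10-02", "2025-10-03"]

        @task
        def process(partition: str) -> int:
            # expand() below creates one task instance per partition at runtime.
            print(f"processing {partition}")
            return len(partition)

        @task
        def summarize(counts: list[int]) -> None:
            print(f"processed {len(counts)} partitions")

        summarize(process.expand(partition=list_partitions()))


    taskflow_mapping_demo()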

10:30 - 10:55.
By Karthik Dulam
Track: Use cases
Room: Columbia A
Orchestrating MLOps and Data Transformation at EDB with Airflow

This talk explores EDB’s journey from siloed reporting to a unified data platform, powered by Airflow. We’ll delve into the architectural evolution, showcasing how Airflow orchestrates a diverse range of use cases, from Analytics Engineering to complex MLOps pipelines.

Learn how EDB leverages Airflow and Cosmos to integrate dbt for robust data transformations, ensuring data quality and consistency.

We’ll provide a detailed case study of our MLOps implementation, demonstrating how Airflow manages training, inference, and model monitoring pipelines for Azure Machine Learning models.

Discover the design considerations driven by our internal data governance framework and gain insights into our future plans for AIOps integration with Airflow.


10:30 - 10:55.
By Ashok Prakash
Track: Best practices
Room: Columbia D
Scaling ML Infrastructure: Lessons from Building Distributed Systems

In today’s data-driven world, scalable ML infrastructure is mission-critical. As ML workloads grow, orchestration tools like Apache Airflow become essential for managing pipelines, training, deployment, and observability. In this talk, I’ll share lessons from building distributed ML systems across cloud platforms, including GPU-based training and AI-powered healthcare. We’ll cover patterns for scaling Airflow DAGs, integrating telemetry and auto-healing, and aligning cross-functional teams. Whether you’re launching your first pipeline or managing ML at scale, you’ll gain practical strategies to make Airflow the backbone of your ML infrastructure.


11:00 - 11:25.
By Jonathan Leek & Michelle Winters
Track: Best practices
Room: Columbia C
Building an Airflow Center of Excellence: Lessons from the Frontlines

As organizations scale their data infrastructure, Apache Airflow becomes a mission-critical component for orchestrating workflows efficiently. But scaling Airflow successfully isn’t just about running pipelines—it’s about building a Center of Excellence (CoE) that empowers teams with the right strategy, best practices, and long-term enablement. Join Jon Leek and Michelle Winters as they share their experiences helping customers design and implement Airflow Centers of Excellence. They’ll walk through real-world challenges, best practices, and the structured approach Astronomer takes to ensure teams have the right plan, resources, and support to succeed. Whether you’re just starting with Airflow or looking to optimize and scale your workflows, this session will give you a proven framework to build a sustainable Airflow Center of Excellence within your organization. 🚀


11:00 - 11:25.
By Rachel Sun
Track: Airflow & ...
Room: Columbia D
How Pinterest Uses AI to Empower Airflow Users for Troubleshooting

At Pinterest, there are over 10,000 DAGs supporting various use cases across different teams and roles. With this scale and diversity, user support has been an ongoing challenge to unlock productivity. As Airflow increasingly serves as a user interface to a variety of data and ML infrastructure behind the scenes, it’s common for issues from multiple areas to surface in Airflow, making triage and troubleshooting a challenge.

In this session, we will discuss the scale of the problem we are facing, how we have addressed it so far, and how we are introducing LLM-based AI to help solve it.


11:00 - 11:25.
By Bolke de Bruin
Track: Community
Room: Columbia A
Your privacy or our progress: rethinking telemetry in Airflow

We face a paradox: we could use usage data to build better software, but collecting that data seems to contradict the very principles of user freedom that open source represents. Apache Airflow’s telemetry system - already purged - has become a battleground for this conflict, with some users voicing concerns over privacy while maintainers struggle to make informed decisions without data. What can we do to strike the right balance?


11:30 - 11:55.
By Shalabh Agarwal
Track: Airflow & ...
Room: Beckler
Custom Operators in Action: A Guide to Extending Airflow's Capabilities

Custom operators are the secret weapon for solving unique and challenging orchestration problems in Airflow.

This session will cover:

  • When to build custom operators vs. using existing solutions
  • Architecture patterns for creating maintainable, reusable operators
  • Live coding demonstration: Building a custom operator from scratch
  • Real-world examples: How custom operators solve specific business challenges

Through practical code examples and architecture patterns, attendees will walk away with the knowledge to implement custom operators that enhance their Airflow deployments.

This session is ideal for experienced Airflow users looking to extend functionality beyond out-of-the-box solutions.
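For orientation before the session, a minimal custom operator follows the pattern below (an illustrative sketch, not the code presented in the talk; the operator name and fields are hypothetical): subclass BaseOperator, accept configuration in __init__, and do the work in execute().

    from airflow.models.baseoperator import BaseOperator
    from airflow.utils.context import Context


    class RowCountOperator(BaseOperator):
        """Hypothetical operator: count rows in an object and push the result to XCom."""

        # Fields rendered with Jinja before execute() runs.
        template_fields = ("key",)

        def __init__(self, *, bucket: str, key: str, **kwargs):
            super().__init__(**kwargs)
            self.bucket = bucket
            self.key = key

        def execute(self, context: Context) -> int:
            # A real operator would delegate I/O to a hook (e.g. S3Hook); this is a stub.
            self.log.info("Counting rows in %s/%s", self.bucket, self.key)
            row_count = 42  # placeholder for the actual computation
            return row_count  # return values are pushed to XCom automatically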


11:30 - 11:55.
By Brandon Abear
Track: Airflow & ...
Room: Columbia C
Data & AI Orchestration at GoDaddy

As the adoption of Airflow increases within large enterprises to orchestrate their data pipelines, more than one team needs to create, manage, and run their workflows in isolation. With multi-tenancy not yet supported natively in Airflow, customers are adopting alternate ways to enable multiple teams to share infrastructure. In this session, we will explore how GoDaddy uses MWAA to build a Single Pane Airflow setup for multiple teams with a common observability platform, and how this foundation enables orchestration expansion beyond data workflows to AI workflows as well. We’ll discuss our roadmap for leveraging upcoming Airflow 3 features, including the task execution API for enhanced workflow management and DAG versioning capabilities for comprehensive auditing and governance. This session will help attendees gain insights into the use case, the solution architecture, implementation challenges and benefits, and our strategic vision for unified orchestration across data and AI workloads.

Outline:

  • About GoDaddy
  • GoDaddy Data & AI Orchestration Vision
  • Current State & Airflow Usage
  • Airflow Monitoring & Observability
  • Lessons Learned & Best Practices
  • Airflow 3 Adoption

11:30 - 11:55.
By Nathan Hadfield
Track: Airflow & ...
Room: Columbia D
From Oops to Secure Ops: Self-Hosted AI for Airflow Failure Diagnosis

Last year, ‘From Oops to Ops’ showed how AI-powered failure analysis could help diagnose why Airflow tasks fail. But do we really need large, expensive cloud-based AI models to answer simple diagnostic questions? Relying on external AI APIs introduces privacy risks, unpredictable costs, and latency, often without clear benefits for this use case.

With the rise of distilled, open-source models, self-hosted failure analysis is now a practical alternative. This talk will explore how to deploy an AI service on infrastructure you control, compare cost, speed, and accuracy between OpenAI’s API and self-hosted models, and showcase a live demo of AI-powered task failure diagnosis using DeepSeek and Llama—running without external dependencies to keep data private and costs predictable.
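One lightweight way to wire this pattern up (an illustrative sketch, not the speaker's implementation; the endpoint, port, and model name are assumptions) is an on_failure_callback that sends the failed task's context to a locally hosted, OpenAI-compatible model server:

    import requests


    def diagnose_failure(context):
        """on_failure_callback: ask a self-hosted model to explain a task failure."""
        ti = context["task_instance"]
        prompt = (
            f"Airflow task {ti.task_id} in DAG {ti.dag_id} failed.\n"
            f"Exception: {context.get('exception')}\n"
            "Suggest likely root causes and fixes."
        )
        # Assumes a local OpenAI-compatible server (e.g. Ollama or vLLM) on this port.
        resp = requests.post(
            "http://localhost:11434/v1/chat/completions",
            json={
                "model": "llama3",  # hypothetical locally pulled model
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=60,
        )
        ti.log.info(resp.json()["choices"][0]["message"]["content"])

    # Attach to tasks via default_args:
    # default_args = {"on_failure_callback": diagnose_failure}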


11:30 - 11:55.
By Theo Lebrun
Track: Use cases
Room: Columbia A
Orchestrating AI Knowledge Bases with Apache Airflow

In the age of Generative AI, knowledge bases are the backbone of intelligent systems, enabling them to deliver accurate and context-aware responses. But how do you ensure that these knowledge bases remain up-to-date and relevant in a rapidly changing world? Enter Apache Airflow, a robust orchestration tool that streamlines the automation of data workflows.

This talk will explore how Airflow can be leveraged to manage and update AI knowledge bases across multiple data sources. We’ll dive into the architecture, demonstrate how Airflow enables efficient data extraction, transformation, and loading (ETL), and share insights on tackling challenges like data consistency, scheduling, and scalability.

Whether you’re building your own AI-driven systems or looking to optimize existing workflows, this session will provide practical takeaways to make the most of Apache Airflow in orchestrating intelligent solutions.


12:00 - 12:25.
By William Orgertrice
Track: Best practices
Room: Columbia D
5 Simple Strategies To Enhance Your DAGs For Data Processing

Ready to take your Apache Airflow DAGs to the next level? This is an insightful session where we’ll uncover five transformative strategies to enhance your data workflows. Whether you’re a data engineering pro or just getting started, this presentation is packed with practical tips and actionable insights that you can apply right away.

We’ll dive into the magic of using powerful libraries like Pandas, share techniques to trim down data volumes for faster processing, and highlight the importance of modularizing your code for easier maintenance. Plus, you’ll discover efficient ways to monitor and debug your DAGs, and how to make the most of Airflow’s built-in features.

By the end of this session, you’ll have a toolkit of strategies to boost the efficiency and performance of your DAGs, making your data processing tasks smoother and more effective. Don’t miss out on this opportunity to elevate your Airflow DAGs!


12:00 - 12:25.
By Lawrence Gerstley
Track: Use cases
Room: Columbia A
Airflow Uses in an on-prem Research Setting

KP Division of Research uses Airflow as a central technology for integrating diverse technologies in an agile setting. We wish to present a set of use cases for AI/ML workloads, including imaging analysis (tissue segmentation, mammography), NLP (early identification of psychosis), LLM processing (identification of vessel diameter from radiological impressions), and other large data processing tasks. We create these “short-lived” project workflows to accomplish specific aims, and then may never run the job again, so leveraging generalized patterns is crucial to quickly implementing these jobs. Our Advanced Computational Infrastructure comprises multiple Kubernetes clusters, and we use Airflow to democratize the use of our batch-level resources in those clusters. We use Airflow form-based parameters to deploy pods running R and Python scripts, where generalized parameters are injected into scripts that follow internal programming patterns. Finally, we also leverage Airflow to create headless services inside Kubernetes for large computational workloads (Spark & H2O) that subsequent pods consume ephemerally.
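As an illustration of that form-parameters-to-pod pattern (a hypothetical sketch; the image, script, and parameter names are invented), Airflow params render a trigger form whose values are injected into a pod's arguments via Jinja:

    from airflow import DAG
    from airflow.models.param import Param
    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
    from pendulum import datetime

    with DAG(
        dag_id="research_batch_job",  # hypothetical
        schedule=None,
        start_date=datetime(2025, 1, 1),
        catchup=False,
        params={
            "cohort": Param("all", type="string", description="Cohort to score"),
            "model_version": Param("v1", type="string"),
        },
    ):
        # Values entered in the trigger form are injected via Jinja templating.
        KubernetesPodOperator(
            task_id="run_r_script",
            name="run-r-script",
            image="registry.example.com/research/r-runner:latest",  # hypothetical image
            cmds=["Rscript", "score.R"],
            arguments=["--cohort", "{{ params.cohort }}",
                       "--model", "{{ params.model_version }}"],
            get_logs=True,
        )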


12:00 - 12:25.
By Vishal Vijayvargiya
Track: Airflow & ...
Room: Beckler
Enhancing Airflow REST API: From Basic Integration to Enterprise Scale

Apache Airflow’s REST API has evolved to support diverse orchestration needs, with managed services like MWAA introducing custom enhancements. One such feature, InvokeRestApi, enables dynamic interactions with external services while maintaining Airflow’s core orchestration capabilities.

In this talk, we will explore the architectural design behind InvokeRestApi, detailing how it enhances API-driven workflows. Beyond the architecture, we’ll share key challenges and learnings from implementing and scaling Airflow’s REST API in production environments. Topics include authentication, performance considerations, error handling, and best practices for integrating external APIs efficiently.

Attendees will gain a deeper understanding of Airflow’s API extensibility, its implications for workflow automation, and actionable insights for building robust, API-driven orchestration solutions. Whether you’re an Airflow user or an architect, this session will provide valuable takeaways for simplifying API interactions across Airflow environments.
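For orientation, MWAA surfaces InvokeRestApi through the boto3 mwaa client; a minimal call might look roughly like the following (the environment name and API path are hypothetical):

    import boto3

    # Assumes AWS credentials with permission to invoke the environment's REST API.
    client = boto3.client("mwaa", region_name="us-east-1")

    response = client.invoke_rest_api(
        Name="my-mwaa-environment",        # hypothetical environment name
        Path="/dags/example_dag/dagRuns",  # Airflow stable REST API path
        Method="POST",
        Body={"conf": {}},                 # forwarded to the Airflow API
    )
    print(response["RestApiStatusCode"], response["RestApiResponse"])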


12:00 - 12:25.
By Philippe Gagnon
Track: Airflow & ...
Room: Columbia C
Using Apache Airflow with Trino for (almost) all your data problems

Trino is incredibly effective at enabling users to extract insights quickly and effectively from large amounts of data located in dispersed and heterogeneous federated data systems.

However, some business data problems are more complex than interactive analytics use cases, and are best broken down into a sequence of interdependent steps, a.k.a. a workflow. For these use cases, dedicated software is often required in order to schedule and manage these processes with a principled approach.

In this session, we will look at how we can leverage Apache Airflow to orchestrate Trino queries into complex workflows that solve practical batch processing problems, all the while avoiding the use of repetitive, redundant data movement.
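In practice this often boils down to chaining SQL steps against a Trino connection so the data never leaves the engine; a minimal sketch (the connection ID, catalogs, and tables are hypothetical) using the common SQL operator:

    from airflow import DAG
    from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator
    from pendulum import datetime

    with DAG("trino_batch_workflow", schedule="@daily",
             start_date=datetime(2025, 1, 1), catchup=False):
        # Each step executes inside Trino, so no data moves through Airflow itself.
        stage = SQLExecuteQueryOperator(
            task_id="stage_orders",
            conn_id="trino_default",
            sql="""CREATE TABLE IF NOT EXISTS hive.staging.orders AS
                   SELECT * FROM postgresql.public.orders
                   WHERE order_date = DATE '{{ ds }}'""",
        )
        aggregate = SQLExecuteQueryOperator(
            task_id="aggregate_orders",
            conn_id="trino_default",
            sql="""INSERT INTO hive.marts.daily_totals
                   SELECT order_date, sum(amount)
                   FROM hive.staging.orders GROUP BY order_date""",
        )
        stage >> aggregate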


12:30 - 12:55.
By Vikram Koka
Track: Best practices
Room: Columbia A
Common provider abstractions: Key for multi-cloud data handling

Enterprises want the flexibility to operate across multiple clouds, whether to optimize costs, improve resiliency, avoid vendor lock-in, or meet data sovereignty requirements. But for developers, that flexibility usually comes at the cost of extra complexity and redundant code. The goal here is simple: write once, run anywhere, with minimum boilerplate. In Apache Airflow, we’ve already begun tackling this problem with abstractions like Common-SQL, which lets you write database queries once and run them on 20+ databases, from Snowflake to Postgres to SQLite to SAP HANA. Similarly, Common-IO standardizes cloud blob storage interactions across all public clouds. With Airflow 3.0, we are pushing this further by introducing a Common Message Bus provider: an abstraction initially supporting Amazon SQS, with Google Pub/Sub and Apache Kafka to follow soon after, and additional implementations such as Amazon Kinesis and Managed Kafka expected over time.

This talk will dive into why these abstractions matter, how they reduce friction for developers while giving enterprises true multi-cloud optionality, and what’s next for Airflow’s evolving provider ecosystem.
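As a concrete taste of the write-once idea, the same Common-SQL operator runs unchanged against any supported database; only the connection changes (a minimal sketch with hypothetical connection IDs and table):

    from airflow import DAG
    from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator
    from pendulum import datetime

    with DAG("rowcount_demo", schedule="@daily",
             start_date=datetime(2025, 1, 1), catchup=False):
        # The task definition never changes; pointing conn_id at Snowflake,
        # Postgres, or SQLite runs the same query on that backend.
        SQLExecuteQueryOperator(
            task_id="count_rows",
            conn_id="snowflake_default",  # swap for "postgres_default", "sqlite_default", ...
            sql="SELECT count(*) FROM orders",  # hypothetical table
        )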


12:30 - 12:55.
By Ashir Alam & Gangfeng Huang
Track: Use cases
Room: Columbia C
Dynamic DAGs and Data Quality using DAGFactory

We run a similar pattern of DAGs for different data quality dimensions like accuracy, timeliness, and completeness. Building these again and again would mean duplicating code and potentially introducing human error through copy-pasting, or making people write the same code repeatedly.

To solve for this, we are doing a few things:

  • Run DAGs via DAGFactory to dynamically generate DAGs from just some YAML code for all the steps we want to run in our DQ checks.
  • Hide this behind a UI hooked to a GitHub PR-open step: the user just provides some inputs or selects from a dropdown in the UI, and a YAML DAG is generated for them.

This highlights the potential for DAGFactory to hide Airflow Python code from users and make it more accessible to data analysts and business intelligence teams alongside software engineers, while reducing human error. YAML is the perfect format for generating code and creating a PR, and DAGFactory is the perfect fit for that. All of this runs in GCP Cloud Composer.
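For readers unfamiliar with the library, dag-factory DAGs are declared in YAML and materialized by a small Python loader; a minimal sketch based on the open-source dag-factory API (the file names, owner, and check command are hypothetical):

    # dags/load_dags.py -- materializes one Airflow DAG per YAML entry.
    #
    # Example YAML (e.g. dags/dq_checks.yml) that a UI could generate:
    #
    #   dq_completeness_check:
    #     default_args:
    #       owner: data-quality
    #       start_date: 2025-01-01
    #     schedule_interval: "@daily"
    #     tasks:
    #       run_check:
    #         operator: airflow.operators.bash.BashOperator
    #         bash_command: "python run_dq_check.py --dimension completeness"
    #
    from dagfactory import load_yaml_dags

    # Scans for YAML configs and registers the generated DAGs with the scheduler.
    load_yaml_dags(globals_dict=globals())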


12:30 - 12:55.
By Chirag Todarka & Alvin Zhang
Track: Airflow & ...
Room: Columbia D
Scaling and Unifying Multiple Airflow Instances with Orchestration Frederator

In large organizations, multiple Apache Airflow instances often arise organically—driven by team-specific needs, distinct use cases, or tiered workloads. This fragmentation introduces complexity, operational overhead, and higher infrastructure costs. To address these challenges, we developed the “Orchestration Frederator,” a solution designed to unify and horizontally scale multiple Airflow deployments seamlessly.

This session will detail our journey in implementing Orchestration Frederator, highlighting how we achieved:

  • Horizontal Scalability: Seamlessly scaling Airflow across multiple instances without operational overhead.

  • End-to-End Data Lineage: Constructing comprehensive data lineage across disparate Airflow deployments to simplify monitoring and debugging.

  • Multi-Region Support: Introducing multi-region capabilities, enhancing reliability and disaster recovery.

  • Unified Ecosystem: Consolidating previously fragmented Airflow environments into a cohesive orchestration platform.

Join us to explore practical strategies, technical challenges, lessons learned, and best practices for enhancing scalability, reliability, and maintainability in large-scale Airflow deployments.


12:30 - 12:55.
By Rakesh Kumar Tai & Mili Tripathi
Track: Use cases
Room: Beckler
Transforming Data Engineering: Achieving Efficiency and Ease with an Intuitive Orchestration Solution

In the rapidly evolving field of data engineering and data science, efficiency and ease of use are crucial. Our innovative solution offers a user-friendly interface to manage and schedule custom PySpark, PySQL, Python, and SQL code, streamlining the process from development to production. Using Airflow at the backend, this tool eliminates the complexities of infrastructure management, version control, CI/CD processes, and workflow orchestration. The intuitive UI allows users to upload code, configure job parameters, and set schedules effortlessly, without the need for additional scripting or coding. Additionally, users have the flexibility to bring their own custom artifactory solution and run their code. In summary, our solution significantly enhances the orchestration and scheduling of custom code, breaking down traditional barriers and empowering organizations to maximize their data’s potential and drive innovation efficiently. Whether you are an individual data scientist or part of a large data engineering team, this tool provides the resources needed to streamline your workflow and achieve your goals faster than ever before.


14:00 - 14:25.
By Ben Rogojan
Track: Community
Room: Columbia D
A Decade in Data Engineering - Lessons, Realities and Where We Go From Here

There was a post on the data engineering subreddit recently that discussed how difficult it is to keep up with the data engineering world.

Did you learn Hadoop? Great, we’re on Snowflake, BigQuery, and Databricks now.

Just learned Airflow? Well, now we have Airflow 3.0.

And the list goes on.

But what doesn’t change? What have the lessons been over the past decade? That’s what I’ll be covering in this talk: real lessons and realities that come up time and time again, whether you’re working for a start-up or a large enterprise.


14:00 - 14:25.
By Yunhao Qing
Track: Use cases
Room: Columbia A
From Cron to Data-Aware: Evolving Airflow Scheduling at Scale

As data platforms grow in complexity, so do the orchestration needs behind them. Time-based (cron) scheduling has long been the default in Airflow, but dataset-based scheduling promises a more data-aware, efficient alternative. In this session, I’ll share lessons learned from operating Airflow at scale—supporting thousands of DAGs across teams with varied use cases, from simple ETL to complex ML workflows. We’ll explore when dataset scheduling makes sense, the challenges it introduces, and how to evolve your DAG design and platform architecture to make the most of it. Whether you’re migrating legacy workflows or designing new ones, this talk will help you evaluate the right scheduling model for your needs.
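As background for that comparison, a dataset-scheduled DAG declares the data it consumes instead of a cron string; a minimal sketch using the Airflow 2.x Dataset API (the URIs and DAG IDs are hypothetical; Airflow 3 evolves these into assets):

    from airflow import DAG
    from airflow.datasets import Dataset
    from airflow.operators.empty import EmptyOperator
    from pendulum import datetime

    orders = Dataset("s3://lake/orders.parquet")  # hypothetical URI

    # Producer: declares the dataset as an outlet, marking it updated on success.
    with DAG("produce_orders", schedule="@hourly",
             start_date=datetime(2025, 1, 1), catchup=False):
        EmptyOperator(task_id="write_orders", outlets=[orders])

    # Consumer: no cron expression; it runs whenever the dataset is updated.
    with DAG("consume_orders", schedule=[orders],
             start_date=datetime(2025, 1, 1), catchup=False):
        EmptyOperator(task_id="transform_orders")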


14:00 - 16:30.
By Philippe Gagnon
Track: Workshop
Room: 305
Implementing Operations Research Problems with Apache Airflow: From Modelling to Production

This workshop will provide an overview of implementing operations research problems using Apache Airflow. This is a hands-on session where attendees will gain experience creating DAGs to define and manage workflows for classical operations research problems. The workshop will include several examples of how Airflow can be used to optimize and automate various decision-making processes, including:

  • Inventory management: How to use Airflow to optimize inventory levels and reduce stockouts by analyzing demand patterns, lead times, and other factors.
  • Production planning: How to use Airflow to create optimized production schedules that minimize downtime, reduce costs, and increase throughput.
  • Logistics optimization: How to use Airflow to optimize transportation routes and other factors to improve the efficiency of logistics operations.

Attendees will come away with a solid understanding of using Airflow to automate decision-making processes with optimization solvers.

14:00 - 14:25.
By Arthur Chen, Trevor DeVore & Deng Pan
Track: Airflow & ...
Room: Columbia C
Lessons learned from migrating to Airflow @ LI Scale

At LinkedIn, our data pipelines process exabytes of data, with our offline infrastructure executing 300K ETL workflows daily and 10K concurrent executions. Historically, these workloads ran on our legacy system, Azkaban, which faced UX, scalability, and operational challenges. To modernize our infra, we built a managed Airflow service, leveraging its enhanced developer & operator experience, rich feature set, and strong OSS community support. That initiated LinkedIn’s largest-ever infrastructure migration—transitioning thousands of legacy workflows to Airflow.

In this talk, we will share key lessons from migrating massive-scale pipelines with minimal production disruption. We will discuss:

  • Overall Migration Strategy
  • Custom Tooling Enhancements on testing, deployment, and observability
  • Architectural Innovations decoupling orchestration and compute
  • GenAI-powered Migration automating code rewrites
  • Post-Migration Challenges & Airflow 3.0.

Attendees will walk away with battle-tested strategies for large-scale Airflow adoption and practical insights into scaling Airflow in enterprise environments.


14:00 - 16:30.
By Pankaj Singh, Tatiana Al-Chueyr Martins & Pankaj Koti
Track: Workshop
Room: 301
Productionising dbt-core with Airflow

As a popular open-source library for analytics engineering, dbt is often combined with Airflow. Orchestrating and executing dbt models as DAGs ensures an additional layer of control over tasks, observability, and provides a reliable, scalable environment to run dbt models.

This workshop will cover a step-by-step guide to Cosmos, a popular open-source package from Astronomer that helps you quickly run your dbt Core projects as Airflow DAGs and Task Groups, all with just a few lines of code. We’ll walk through:

  • Running and visualising your dbt transformations
  • Managing dependency conflicts
  • Defining database credentials (profiles)
  • Configuring source and test nodes
  • Using dbt selectors
  • Customising arguments per model
  • Addressing performance challenges
  • Leveraging deferrable operators
  • Visualising dbt docs in the Airflow UI
  • Example of how to deploy to production
  • Troubleshooting

We encourage participants to bring their dbt project to follow this step-by-step workshop.
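As a preview of the pattern the workshop builds up, a dbt project becomes an Airflow DAG in a few lines of Cosmos; this sketch follows Cosmos's documented API, but the project path, connection, and profile names are placeholders:

    from cosmos import DbtDag, ProjectConfig, ProfileConfig
    from cosmos.profiles import PostgresUserPasswordProfileMapping
    from pendulum import datetime

    dbt_dag = DbtDag(
        dag_id="jaffle_shop",  # placeholder dbt project
        project_config=ProjectConfig("/usr/local/airflow/dags/dbt/jaffle_shop"),
        profile_config=ProfileConfig(
            profile_name="jaffle_shop",
            target_name="dev",
            # Builds the dbt profile from an Airflow connection at runtime.
            profile_mapping=PostgresUserPasswordProfileMapping(
                conn_id="postgres_default",
                profile_args={"schema": "public"},
            ),
        ),
        schedule="@daily",
        start_date=datetime(2025, 1, 1),
        catchup=False,
    )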

14:00 - 14:25.
By Khadija Al Ahyane
Track: Airflow & ...
Room: Beckler
Task failures troubleshooting based on Airflow & Kubernetes signals

Per the Airflow community survey, Kubernetes is the most popular compute platform used to run Airflow, and when run on Kubernetes, Airflow gains lots of benefits out of the box, like monitoring, reliability, ease of deployment, scalability, and autoscaling. On the other hand, running Airflow on Kubernetes means running a sophisticated distributed system on another distributed system, which makes troubleshooting Airflow task and DAG failures harder.

This session tackles that bottleneck head-on, introducing a practical approach to building an automated diagnostic pipeline for Airflow on Kubernetes. Imagine offloading tedious investigations to a system that, on task failure, automatically collects and correlates key signals from Kubernetes components (linking Airflow tasks to specific Pods and their events), Kubernetes/GKE monitoring, and relevant logs—pinpointing root causes and suggesting actionable fixes.

Attendees will leave with a clear understanding of common Airflow-on-Kubernetes failure patterns—and more importantly, a blueprint and practical strategies to reduce MTTR and boost team efficiency.


14:00 - 16:30.
By Ryan Hatter, Amogh Desai, Phani Kumar & Kalyan Reddy
Track: Workshop
Room: 306
Your first Apache Airflow Contribution

Ready to contribute to Apache Airflow? In this hands-on workshop, you’ll be expected to come prepared with your development environment already configured (Breeze installed is strongly recommended, but Codespaces works if you can’t install Docker). We’ll dive straight into finding issues that match your skills and walk you through the entire contribution process—from creating your first pull request to receiving community feedback. Whether you’re writing code, enhancing documentation, or offering feedback, there’s a place for you. Let’s get started and see your name among Airflow contributors!

14:30 - 14:55.
By Shoubhik Bose
Track: Use cases
Room: Columbia C
Applying Airflow to drive the digital workforce in the Enterprise

Red Hat’s unified data and AI platform relies on Apache Airflow for orchestration, alongside Snowflake, Fivetran, and Atlan. The platform prioritizes building a dependable data foundation, recognizing that effective AI depends on quality data. Airflow was selected for its predictability, extensive connectivity, reliability, and scalability.

The platform now supports business analytics, transitioning from ETL to ELT processes. This has resulted in a remarkable improvement in how we make data available for business decisions.

The platform’s capabilities are being extended to power Digital Workers (AI agents) using large language models, encompassing model training, fine-tuning, and inference. Two Digital Workers are currently deployed, with more in development.

This presentation will detail the rationale and background of this evolution, followed by an explanation of the architectural decisions made and the challenges encountered and resolved throughout the process of transforming into an AI-enabled data platform to power Red Hat’s business.


14:30 - 14:55.
By Christian Foernges
Track: Use cases
Room: Columbia A
Learn from Deutsche Bank: Using Apache Airflow in Regulated Environments

Operating within the stringent regulatory landscape of Corporate Banking, Deutsche Bank relies heavily on robust data orchestration. This session explores how Deutsche Bank’s Corporate Bank leverages Apache Airflow across diverse environments, including both on-premises infrastructure and cloud platforms. Discover their approach to managing critical data & analytics workflows, encompassing areas like regulatory reporting, data integration and complex data processing pipelines. Gain insights into the architectural patterns and operational best practices employed to ensure compliance, security, and scalability when running Airflow at scale in a highly regulated, hybrid setting.


14:30 - 14:55.
By Purshotam Shah & Prakash Nandha Mukunthan
Track: Use cases
Room: Beckler
Navigating Secure and Cost-Efficient Flink Batch on Kubernetes with Airflow

At Yahoo, we built a secure, scalable, and cost-efficient batch processing platform using Amazon MWAA to orchestrate Apache Flink jobs on EKS, managed by the Flink Kubernetes Operator. This setup enables dynamic job orchestration while meeting strict enterprise compliance standards.

In this session, we’ll share how Airflow DAGs:

  • Dynamically launch, monitor, and clean up isolated Flink clusters per batch job, improving resource efficiency.

  • Securely fetch EKS kubeconfig, submit FlinkDeployment CRDs using FlinkKubernetesOperator, and poll job status using Airflow sensors.

  • Integrate IAM for access control and meet Yahoo’s security requirements, including mutual TLS (mTLS) with Athenz.

  • Optimize for cost and resilience through automated cleanup of jobs and the operator, and handle job failures and retries.

Join us for practical strategies and lessons from Yahoo’s production-scale Flink workflows in a Kubernetes environment.
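In outline, that pattern pairs a FlinkDeployment submission with a status sensor; the sketch below is a hedged illustration using the apache-flink provider's operator and sensor, with the manifest file, namespaces, connection IDs, and application name all hypothetical:

    from airflow import DAG
    from airflow.providers.apache.flink.operators.flink_kubernetes import (
        FlinkKubernetesOperator,
    )
    from airflow.providers.apache.flink.sensors.flink_kubernetes import (
        FlinkKubernetesSensor,
    )
    from pendulum import datetime

    with DAG("flink_batch_job", schedule=None,
             start_date=datetime(2025, 1, 1), catchup=False):
        # Submits a FlinkDeployment CRD for the Flink Kubernetes Operator to reconcile.
        submit = FlinkKubernetesOperator(
            task_id="submit_flink_job",
            application_file="flink-deployment.yaml",  # hypothetical manifest
            namespace="flink-jobs",
            kubernetes_conn_id="eks_default",
        )
        # Polls the custom resource until the job reaches a terminal state.
        wait = FlinkKubernetesSensor(
            task_id="wait_for_flink_job",
            application_name="flink-batch-job",  # must match metadata.name in the manifest
            namespace="flink-jobs",
            kubernetes_conn_id="eks_default",
            attach_log=True,
        )
        submit >> wait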

14:30 - 14:55.
By Katarzyna Kalek & Jakub Orlowski
Track: Airflow & ...
Room: Columbia D
Simplifying Data Management with DAG Factory

At OLX, we connect millions of people daily through our online marketplace while relying on robust data pipelines. In this talk, we explore how the DAG Factory concept elevates data governance, lineage, and discovery by centralizing operator logic and restricting direct DAG creation. This approach enforces code quality, optimizes resources, maintains infrastructure hygiene and enables smooth version upgrades. We then leverage consistent naming conventions in Airflow to build targeted namespaces, aligning teams with global policies while preserving autonomy. Integrating external tools like AWS Lake Formation and Open Metadata further unifies governance, making it straightforward to manage and secure data. This is critical when handling hundreds or even thousands of active DAGs. If the idea of storing 1,600 pipelines in one folder seems overwhelming, join us to learn how the DAG Factory concept simplifies pipeline management. We’ll also share insights from OLX, highlighting how thoughtful design fosters oversight, efficiency, and discoverability across diverse use cases.


15:00 - 15:25.
By Niko Oliveira
Track: Airflow intro/overview
Room: Beckler
10/09/2025 3:00 PM 10/09/2025 3:25 PM America/Los_Angeles AS25: AWS Lambda Executor: The Speed of Local Execution with the Advantages of Remote

Apache Airflow’s executor landscape has traditionally presented users with a clear trade-off: choose either the speed of local execution or the scalability, isolation and configurability of remote execution. The AWS Lambda Executor introduces a new paradigm that bridges this gap, offering near-local execution speeds with the benefits of remote containerization.

This talk will begin with a brief overview of Airflow’s executors, how they work and what they are responsible for, highlighting the compromises between different executors. We will explore the emerging niche for fast, yet remote execution and demonstrate how the AWS Lambda Executor fills this space. We will also address practical considerations when using such an executor, such as working within Lambda’s 15-minute execution limit, and how to mitigate this using multi-executor configuration.

Whether you’re new to Airflow or an experienced user, this session will provide valuable insights into task execution and how you can combine the best of both local and remote execution paradigms.
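As a rough illustration of the multi-executor mitigation mentioned above: Airflow 2.10+ lets individual tasks opt into a non-default executor. A minimal sketch follows; the executor alias is an assumption, not the provider's exact name.

    import pendulum
    from airflow.decorators import dag, task

    @dag(schedule=None, start_date=pendulum.datetime(2025, 1, 1), catchup=False)
    def hybrid_execution():
        @task
        def long_running():
            # Runs on the cluster's default executor, free of Lambda's
            # 15-minute cap.
            ...

        @task(executor="AwsLambdaExecutor")  # hypothetical alias from [core] executor
        def quick_task():
            # Short task that benefits from near-local startup latency.
            ...

        long_running() >> quick_task()

    hybrid_execution()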

15:00 - 15:25.
By Yifan Wang
Track: Roadmap
Room: Columbia D
10/09/2025 3:00 PM 10/09/2025 3:25 PM America/Los_Angeles AS25: DAGnostics: Shift-Left Airflow Governance with Policy Enforcement Framework

DAGnostics seamlessly integrates Airflow Cluster Policy hooks to enforce governance from local DAG authoring through CI pipelines to production runtime. Learn how it closes validation gaps, collapses feedback loops from hours to seconds, and ensures consistent policies across stages. We examine current runtime-only enforcement and fractured CI checks, then unveil our architecture: a pluggable policy registry via Airflow entry points, local static analysis for pre-commit validation, GitHub Actions CI integration, and runtime hook enforcement. See real-world use cases: alerting standards, resource quotas, naming conventions, and exemption handling. Next, dive into implementation: authoring policies in Python, auto-discovery, cross-environment enforcement, upstream contribution, and testing strategies. We share LinkedIn’s metrics—2,000+ DAG repos, 10,000+ daily executions supporting trunk-based development across isolated teams/use-cases, and 78% fewer runtime violations—and lessons learned scaling policy-as-code at enterprise scale. Leave with a blueprint to adopt DAGnostics and strengthen your Airflow governance while preserving full compatibility with existing systems.
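To make the policy-as-code idea concrete, here is a minimal sketch of the Airflow cluster policy hook such a framework builds on; the specific tag and retry rules are illustrative, not DAGnostics' actual policies.

    # airflow_local_settings.py
    from airflow.exceptions import AirflowClusterPolicyViolation

    def dag_policy(dag):
        """Reject DAGs that are missing basic governance metadata."""
        if not dag.tags:
            raise AirflowClusterPolicyViolation(
                f"DAG {dag.dag_id} must declare at least one ownership tag"
            )
        if not (dag.default_args or {}).get("retries"):
            raise AirflowClusterPolicyViolation(
                f"DAG {dag.dag_id} must configure default retries"
            )

The same function can also be imported by a pre-commit hook or CI job that parses DAG files locally, which is how one policy definition can cover authoring, CI, and runtime.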

15:00 - 15:25.
By Kunal Jain
Track: Use cases
Room: Columbia C
10/09/2025 3:00 PM 10/09/2025 3:25 PM America/Los_Angeles AS25: How Airflow can help with Data Management and Governance

Metadata management is a cornerstone of effective data governance, yet it presents unique challenges distinct from traditional data engineering. At scale, efficiently extracting metadata from relational and NoSQL databases demands specialized solutions. To address this, our team has developed custom Airflow operators that scan and extract metadata across various database technologies, orchestrating 100+ production jobs to ensure continuous and reliable metadata collection.

Now, we’re expanding beyond databases to tackle non-traditional data sources such as file repositories and message queues. This shift introduces new complexities, including processing structured and unstructured files, managing schema evolution in streaming data, and maintaining metadata consistency across heterogeneous sources. In this session, we’ll share our approach to building scalable metadata scanners, optimizing performance, and ensuring adaptability across diverse data environments. Attendees will gain insights into designing efficient metadata pipelines, overcoming common pitfalls, and leveraging Airflow to drive metadata governance at scale.

15:00 - 15:25.
By Oluwafemi Olawoyin
Track: Use cases
Room: Columbia A
10/09/2025 3:00 PM 10/09/2025 3:25 PM America/Los_Angeles AS25: Modernizing Automation in Secure, Regulated Environments: Lessons from Deploying Airflow

This session details practical strategies for introducing Apache Airflow in strict, compliance-heavy organizations. Learn how on-premise deployment and hybrid tooling can help modernize legacy workflows when public cloud solutions and container technologies are restricted. Discover how cross-platform engineering teams can collaborate securely using CI/CD bridges, and what it takes to meet rigorous security and governance standards. Key lessons address navigating resistance to change, achieving production sign-off, and avoiding common compliance pitfalls, relevant to anyone automating in public sector settings.

15:45 - 16:10.
By John Robert
Track: Airflow & ...
Room: Columbia C
10/09/2025 3:45 PM 10/09/2025 4:10 PM America/Los_Angeles AS25: Building a Transparent Data Workflow with Airflow and Data Catalog

As modern data ecosystems grow in complexity, ensuring transparency, discoverability, and governance in data workflows becomes critical. Apache Airflow, a powerful workflow orchestration tool, enables data engineers to build scalable pipelines, but without proper visibility into data lineage, ownership, and quality, teams risk operating in a black box.

In this talk, we will explore how integrating Airflow with a data catalog can bring clarity and transparency to data workflows. We’ll discuss how metadata-driven orchestration enhances data governance, enables lineage tracking, and improves collaboration across teams. Through real-world use cases, we will demonstrate how Airflow can automate metadata collection, update data catalogs dynamically, and ensure data quality at every stage of the pipeline.

Attendees will walk away with practical strategies for implementing a transparent data workflow that fosters trust, efficiency, and compliance in their data infrastructure.

15:45 - 16:10.
By Gurmeet Saran & Kushal Thakkar
Track: Airflow & ...
Room: Columbia D
10/09/2025 3:45 PM 10/09/2025 4:10 PM America/Los_Angeles AS25: Enabling SQL testing in Airflow workflows using Pydantic types

This session explores how to bring unit testing to SQL pipelines using Airflow. I’ll walk through the development of a SQL testing library that allows isolated testing of SQL logic by injecting mock data into base tables. To support this, we built a type system for AWS Glue tables using Pydantic, enabling schema validation and mock data generation. Over time, this type system also powered production data quality checks via a custom Airflow operator. Learn how this approach improves reliability, accelerates development, and scales testing across data workflows.
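A minimal sketch of the idea, assuming a hypothetical OrdersRow table type rather than the speakers' actual schema: one Pydantic definition can both validate the schema and generate mock rows for SQL unit tests.

    from datetime import date
    from pydantic import BaseModel

    class OrdersRow(BaseModel):
        """Typed row for a (hypothetical) Glue table used in SQL unit tests."""
        order_id: int
        customer_id: int
        order_date: date
        amount_usd: float

    def mock_rows() -> list[dict]:
        """Build schema-valid mock data to inject into base tables under test."""
        rows = [
            OrdersRow(order_id=1, customer_id=10, order_date=date(2025, 1, 1), amount_usd=9.99),
            OrdersRow(order_id=2, customer_id=11, order_date=date(2025, 1, 2), amount_usd=19.50),
        ]
        return [row.model_dump() for row in rows]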

15:45 - 16:10.
By Yu Lung Law & Ivan Sayapin
Track: Use cases
Room: Columbia A
10/09/2025 3:45 PM 10/09/2025 4:10 PM America/Los_Angeles AS25: Fine-Tuning Airflow: Parameters You May Not Know About

The Bloomberg Data Platform Engineering team is responsible for managing, storing, and providing access to business and financial data used by financial professionals across the global capital markets. Our team utilizes Apache Airflow to orchestrate data workflows across various applications and Bloomberg Terminal functions. Over the years, we have fine-tuned our Airflow cluster to handle more than 1,000 ingestion DAGs, which has presented unique scalability challenges. In this session, we will share insights into several key Airflow parameters — some of which you may not be all that familiar with — that our team uses to optimize and scale the platform effectively.

15:45 - 16:10.
By Steven Woods
Track: Best practices
Room: Beckler
10/09/2025 3:45 PM 10/09/2025 4:10 PM America/Los_Angeles AS25: From Repetition to Refactor: Smarter DAG Design in Airflow 3

We will explore how Apache Airflow 3 unlocks new possibilities for smarter, more flexible DAG design. We’ll start by breaking down common anti-patterns in early DAG implementations, such as hardcoded operators, duplicated task logic, and rigid sequencing, that lead to brittle, unscalable workflows. From there, we’ll show how refactoring with the D.R.Y. (Don’t Repeat Yourself) principle, using techniques like task factories, parameterization, dynamic task mapping, and modular DAG construction, transforms these workflows into clean, reusable patterns.

With Airflow 3, these strategies go further: enabling DAGs that are reusable across both batch pipelines and streaming/event-driven workloads, while also supporting ad-hoc runs for testing, one-off jobs, or backfills. The result is not just more concise code, but workflows that can flexibly serve different data processing modes without duplication. Attendees will leave with concrete patterns and best practices for building maintainable, production-grade DAGs that are scalable, observable, and aligned with modern data engineering standards.
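As a flavor of the refactorings discussed, a minimal sketch combining a task factory with dynamic task mapping (the table names are illustrative):

    import pendulum
    from airflow.decorators import dag, task

    @dag(schedule="@daily", start_date=pendulum.datetime(2025, 1, 1), catchup=False)
    def dry_pipeline():
        @task
        def list_tables() -> list[str]:
            return ["orders", "customers", "payments"]  # illustrative

        @task
        def load_table(table: str) -> None:
            print(f"loading {table}")

        # One mapped task per table at runtime, instead of three
        # hand-wired copies of the same operator.
        load_table.expand(table=list_tables())

    dry_pipeline()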

16:15 - 16:40.
By Annie Friedman & Caitlin Petro
Track: Best practices
Room: Beckler
10/09/2025 4:15 PM 10/09/2025 4:40 PM America/Los_Angeles AS25: Lessons from Airflow gone wrong: How to set yourself up to scale successfully

Ever seen a DAG go rogue and deploy itself? Or try to time travel back to 1999? Join us for a light-hearted yet painfully relatable look at how not to scale your Airflow deployment to avoid chaos and debugging nightmares.

We’ll cover the classics: hardcoded secrets, unbounded retries (hello, immortal task!), and the infamous spaghetti DAG where 200 tasks are lovingly connected by hand and no one dares open the Airflow UI anymore. If you’ve ever used datetime.now() in your DAG definition and watched your backfills implode, this talk is for you.

From the BashOperator that became sentient to the XCom that tried to pass a whole Pandas DataFrame and the key to your mother’s house, we’ll walk through real-world bloopers with practical takeaways. You’ll learn why overusing PythonOperator is a recipe for a mess, why not to overuse sensors unless you enjoy resource starvation, and why scheduling in local timezones is basically asking for a daylight saving time horror story. Other highlights include:

  • Over-provisioning resources in KubernetesPodOperator: many teams allocate excessive memory/CPU “just in case”, leading to cluster contention and resource waste.

  • Dynamic task mapping gone wild: 10,000 mapped tasks later… the scheduler is still crying.

  • SLAs used as data quality guarantees: creating alerts so noisy, nobody listens.

  • Design-free DAGs: no docs, no comments, no idea why a task has a 3-day timeout.

Finally, we’ll round it out with some dos and don’ts: using environment variables, avoiding memory-hungry monolith DAGs, skipping global imports, and not allocating 10x more memory “just in case.” Whether you’re new to Airflow or battle-hardened from a thousand failed backfills, come learn how to scale your pipelines without losing your mind (or your cluster).
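If the datetime.now() pitfall above rings a bell, here is a minimal sketch of the fix: pin the start date and read the run's logical date from the task context instead of the wall clock.

    import pendulum
    from airflow.decorators import dag, task

    @dag(
        schedule="@daily",
        # BAD: start_date=datetime.now() changes on every parse, so the
        # scheduler never settles on stable data intervals.
        start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),  # GOOD: fixed and tz-aware
        catchup=False,
    )
    def stable_schedule():
        @task
        def report(logical_date=None):  # injected from the task context
            # GOOD: deterministic across retries and backfills.
            print(f"processing data for {logical_date}")

        report()

    stable_schedule()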

16:15 - 16:40.
By Abhishek Bhakat & Sudarshan Chaudhari
Track: Airflow & ...
Room: Columbia D
10/09/2025 4:15 PM 10/09/2025 4:40 PM America/Los_Angeles AS25: Model Context Protocol with Airflow

In today’s data-driven world, effective workflow management and AI are crucial for success. However, there’s a notable gap between Airflow and AI. Our presentation offers a solution to close this gap.

We propose an MCP (Model Context Protocol) server to act as the bridge. We’ll dive into two paths:

  • AI-Augmented Airflow: Enhancing Airflow with AI to improve error handling, automate DAG generation, proactively detect issues, and optimize resource use.
  • Airflow-Powered AI: Utilizing Airflow’s reliability to empower LLMs in executing complex tasks, orchestrating AI agents, and supporting decision-making with real-time data.

Key takeaways:

  • Understanding how to integrate AI insights directly into your workflow orchestration.
  • Learning how MCP empowers AI with robust orchestration capabilities, offering full logging, monitoring, and auditability.
  • Gaining insights into how to transform LLMs from reactive responders into proactive, intelligent, and reliable executors.

We invite you to explore how MCP can help workflow management, making AI-driven decisions more reliable and turning workflow systems into intelligent, autonomous agents.
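As one possible shape of that bridge, a minimal sketch of an MCP server exposing a single Airflow action as a tool. It assumes the MCP Python SDK and Airflow's stable REST API; the host and credentials are illustrative.

    import requests
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("airflow-bridge")

    @mcp.tool()
    def trigger_dag(dag_id: str) -> str:
        """Trigger a DAG run via Airflow's REST API and return its run id."""
        resp = requests.post(
            f"http://localhost:8080/api/v1/dags/{dag_id}/dagRuns",
            json={"conf": {}},
            auth=("admin", "admin"),  # illustrative credentials
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["dag_run_id"]

    if __name__ == "__main__":
        mcp.run()  # serve the tool to any MCP-capable agent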

16:15 - 16:40.
By Sebastien Crocquevieille
Track: Use cases
Room: Columbia C
10/09/2025 4:15 PM 10/09/2025 4:40 PM America/Los_Angeles AS25: Multi-Instance Asset Synchronization - push or pull?

As Data Engineers, our jobs regularly include scheduling or scaling workflows.

But have you ever asked yourself: can I scale my scheduling?

It turns out that you can! But doing so raises a number of issues that need to be addressed.

In this talk we’ll be:

  • Recapping Asset-aware scheduling in Apache Airflow
  • Discussing diverse methods to upscale our scheduling
  • Solving the issue of maintaining our Airflow Asset synchronized between instances
  • Comparing our production push-based solution with the built-in solution from AIP-82, and the pros and cons of each method.

I hope you will enjoy it!
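For those who want the recap in code, a minimal sketch of asset-aware scheduling on a single instance (Airflow 3 import path assumed; in Airflow 2.x the equivalent is airflow.datasets.Dataset):

    import pendulum
    from airflow.decorators import dag, task
    from airflow.sdk import Asset

    orders = Asset("s3://datalake/orders")  # illustrative URI

    @dag(schedule="@hourly", start_date=pendulum.datetime(2025, 1, 1), catchup=False)
    def producer():
        @task(outlets=[orders])
        def write_orders():
            ...  # a successful run marks the asset as updated

        write_orders()

    @dag(schedule=[orders], start_date=pendulum.datetime(2025, 1, 1), catchup=False)
    def consumer():
        @task
        def read_orders():
            ...

        read_orders()

    producer()
    consumer()

The multi-instance question the talk tackles is exactly what happens when the producer and consumer live on different Airflow deployments.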

16:15 - 16:40.
By Pei-Chi (Miko) Chen
Track: Use cases
Room: Columbia A
10/09/2025 4:15 PM 10/09/2025 4:40 PM America/Los_Angeles AS25: No More Missed Beats: How Airflow Rescued Our Analytics Pipeline

Before Airflow, our BigQuery pipelines at Create Music Group operated like musicians without a conductor—each playing on its own schedule, regardless of whether upstream data was ready. As our data platform grew, this chaos led to spiralling costs and performance bottlenecks, and became utterly unsustainable.

This talk tells the story of how Create Music Group brought harmony to its data workflows by adopting Apache Airflow and the Medallion architecture, ultimately slashing our data processing costs by 50%. We’ll show how moving to event-driven scheduling with datasets helped eliminate stale data issues, dramatically improved performance, and unlocked faster iteration across teams. Discover how we replaced repetitive SQL with standardized dimension/fact tables, empowering analysts in a safer sandbox.

16:45 - 17:10.
By Yuhang Huang & Arunav Gupta
Track: Community
Room: Columbia A
10/09/2025 4:45 PM 10/09/2025 5:10 PM America/Los_Angeles AS25: Lessons learned for building open source Airflow operators at AWS

In this talk, we’ll share our journey and lessons learned from developing a new open-source Airflow operator that integrates a newly-launched AWS service with the Airflow ecosystem. This real-world case study will illuminate the complete lifecycle of building an Airflow operator, from initial design to successful community contribution.

We’ll dive deep into the practical challenges and solutions encountered throughout the journey, including:

  • Evaluating when to build a new operator versus extending existing ones
  • Navigating the Apache Airflow Open-source contribution process
  • Best practices for operator design and implementation
  • Key learnings and common pitfalls to avoid during the testing and release process

Whether you’re looking to contribute to Apache Airflow or build custom operators, this session will provide valuable insights into the development process, common pitfalls to avoid, and best practices when contributing to and collaborating with the Apache Airflow community.

Expect to leave with a practical roadmap for your own contributions and the confidence to successfully engage with the Apache Airflow ecosystem.

17:30 - 17:35.
By Shahar Epstein
Track: Airflow & ...
Room: Columbia A
10/09/2025 5:30 PM 10/09/2025 5:35 PM America/Los_Angeles AS25: Lightning talk: Supercharging Apache Airflow: Enhancing Core Components with Rust

Apache Airflow is a powerful workflow orchestrator, but as workloads grow, its Python-based components can become performance bottlenecks. This talk explores how Rust, with its speed, safety, and concurrency advantages, can enhance Airflow’s core components (e.g., the scheduler, DAG processor, etc.). We’ll dive into the motivations behind using Rust, architectural trade-offs, and the challenges of bridging the gap between Python and Rust. A proof-of-concept showcasing an Airflow scheduler rewritten in Rust will demonstrate the potential benefits of this approach.

10:00 - 10:30
Coffee break
13:00 - 14:00
Lunch
15:30 - 15:45
Coffee break
17:35 - 17:40
Lightning talk (sign up for this slot at registration desk)
17:40 - 17:45
Lightning talk (sign up for this slot at registration desk)
17:45 - 17:50
Lightning talk (sign up for this slot at registration desk)
17:50 - 17:55
Lightning talk (sign up for this slot at registration desk)
18:00 - 18:15
Event wrap-up
09:15 - 10:00. Columbia A
By Peeyush Rai & Vikram Koka
Track: Keynote
10/09/2025 9:15 AM 10/09/2025 10:00 AM America/Los_Angeles AS25: Airflow as a Platform for Agentic AI Digital Products Within Enterprises

In this keynote, Peeyush Rai and Vikram Koka will walk through how Airflow is being used as part of an Agentic AI platform serving insurance companies, which runs on all the major public clouds, leveraging models from OpenAI, Google (Gemini), and AWS (Claude on Bedrock).

This talk walks through the details of the actual end-user business workflow, including gathering relevant financial data to make a decision, as well as the tricky challenge of handling AI hallucinations with new Airflow capabilities such as “Human in the loop”.

This talk offers something for both business and technical audiences. Business users will get a clear view of what it takes to bring an AI application into production and how to align their operations and business teams with an AI enabled workflow. Meanwhile, technical users will walk away with practical insights on how to orchestrate complex business processes enabling a seamless collaboration between Airflow, AI Agents and Human in the loop.

10:30 - 10:55. Columbia A
By Karthik Dulam
Track: Use cases
10/09/2025 10:30 AM 10/09/2025 10:55 AM America/Los_Angeles AS25: Orchestrating MLOps and Data Transformation at EDB with Airflow

This talk explores EDB’s journey from siloed reporting to a unified data platform, powered by Airflow. We’ll delve into the architectural evolution, showcasing how Airflow orchestrates a diverse range of use cases, from Analytics Engineering to complex MLOps pipelines.

Learn how EDB leverages Airflow and Cosmos to integrate dbt for robust data transformations, ensuring data quality and consistency.

We’ll provide a detailed case study of our MLOps implementation, demonstrating how Airflow manages training, inference, and model monitoring pipelines for Azure Machine Learning models.

Discover the design considerations driven by our internal data governance framework and gain insights into our future plans for AIOps integration with Airflow.
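A minimal sketch of the Cosmos-plus-dbt piece (the paths, profile, and dag id are illustrative; see the astronomer-cosmos docs for the full configuration):

    import pendulum
    from cosmos import DbtDag, ProfileConfig, ProjectConfig

    # Each dbt model becomes an Airflow task with its own retries and logs.
    transformations = DbtDag(
        dag_id="dbt_transformations",
        project_config=ProjectConfig("/usr/local/airflow/dbt/analytics"),
        profile_config=ProfileConfig(
            profile_name="analytics",
            target_name="prod",
            profiles_yml_filepath="/usr/local/airflow/dbt/profiles.yml",
        ),
        schedule="@daily",
        start_date=pendulum.datetime(2025, 1, 1),
        catchup=False,
    )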

10:30 - 10:55. Columbia C
By Hannah Lundrigan
Track: Use cases
10/09/2025 10:30 AM 10/09/2025 10:55 AM America/Los_Angeles AS25: Enhancing Small Retailer Visibility: Machine Learning Pipelines with Apache Airflow

Small retailers often lack the data visibility that larger companies rely on for decision-making. In this session, we’ll dive into how Apache Airflow powers end-to-end machine learning pipelines that process inventory and sales data, enabling retailers and suppliers to gain valuable industry insights. We’ll cover feature engineering, model training, and automated inference workflows, along with strategies for handling messy, incomplete retail data. We will discuss how Airflow enables scalable ML-driven insights that improve demand forecasting, product categorization, and supply chain optimization.

10:30 - 10:55. Columbia D
By Ashok Prakash
Track: Best practices
10/09/2025 10:30 AM 10/09/2025 10:55 AM America/Los_Angeles AS25: Scaling ML Infrastructure: Lessons from Building Distributed Systems

In today’s data-driven world, scalable ML infrastructure is mission-critical. As ML workloads grow, orchestration tools like Apache Airflow become essential for managing pipelines, training, deployment, and observability. In this talk, I’ll share lessons from building distributed ML systems across cloud platforms, including GPU-based training and AI-powered healthcare. We’ll cover patterns for scaling Airflow DAGs, integrating telemetry and auto-healing, and aligning cross-functional teams. Whether you’re launching your first pipeline or managing ML at scale, you’ll gain practical strategies to make Airflow the backbone of your ML infrastructure.

10:30 - 10:55. Beckler
By Zhe-You Liu
Track: Airflow intro/overview
10/09/2025 10:30 AM 10/09/2025 10:55 AM America/Los_Angeles AS25: Becoming an Apache Airflow Committer from 0

How a Complete Beginner in Data Engineering / Junior Computer Science Student Became an Apache Airflow Committer in Just 5 Months—With 70+ PRs and 300 Hours of Contributions

This talk is aimed at those who are still hesitant about contributing to Apache Airflow. I hope to inspire and encourage anyone to take the first step and start their journey in open-source—let’s build together!

10:30 - 13:00. 301
By Marc Lamberti
Track: Workshop
10/09/2025 10:30 AM 10/09/2025 1:00 PM America/Los_Angeles AS25: Get Certified: DAG Authoring for Apache Airflow 3

We’re excited to offer Airflow Summit 2025 attendees an exclusive opportunity to earn their DAG Authoring certification in person, now updated to include all the latest Airflow 3.0 features. This certification workshop comes at no additional cost to summit attendees.

The DAG Authoring for Apache Airflow certification validates your expertise in advanced Airflow concepts and demonstrates your ability to build production-grade data pipelines. It covers TaskFlow API, Dynamic task mapping, Templating, Asset-driven scheduling, Best practices for production DAGs, and new Airflow 3.0 features and optimizations.

The certification session includes:

  • 20-minute preparation period with expert guidance
  • Live Q&A session with Marc Lamberti from Astronomer
  • 60-minute examination period
  • Real-time results and immediate feedback

To prepare for the Airflow Certification, visit the Astronomer Academy (https://academy.astronomer.io/page/astronomer-certification).

10:30 - 13:00. 305
By Jon Fink & Amy Pitcher
Track: Workshop
10/09/2025 10:30 AM 10/09/2025 1:00 PM America/Los_Angeles AS25: Bridging Data Pipelines and Business Applications with Airflow and Control-M

AI and ML pipelines built in Airflow often power critical business outcomes, but they rarely operate in isolation. In this hands-on workshop, learn how Control-M integrates with Airflow to orchestrate end-to-end workflows that include upstream and downstream enterprise systems like Supply Chain and Billing. Gain visibility, reliability, and seamless coordination across your data pipelines and the business operations they support.

11:00 - 11:25. Columbia A
By Bolke de Bruin
Track: Community
10/09/2025 11:00 AM 10/09/2025 11:25 AM America/Los_Angeles AS25: Your privacy or our progress: rethinking telemetry in Airflow

We face a paradox: we could use usage data to build better software, but collecting that data seems to contradict the very principles of user freedom that open source represents. Apache Airflow’s telemetry system - already purged - has become a battleground for this conflict, with some users voicing concerns over privacy while maintainers struggle to make informed decisions without data. What can we do to strike the right balance?

11:00 - 11:25. Columbia C
By Jonathan Leek & Michelle Winters
Track: Best practices
10/09/2025 11:00 AM 10/09/2025 11:25 AM America/Los_Angeles AS25: Building an Airflow Center of Excellence: Lessons from the Frontlines

As organizations scale their data infrastructure, Apache Airflow becomes a mission-critical component for orchestrating workflows efficiently. But scaling Airflow successfully isn’t just about running pipelines—it’s about building a Center of Excellence (CoE) that empowers teams with the right strategy, best practices, and long-term enablement. Join Jon Leek and Michelle Winters as they share their experiences helping customers design and implement Airflow Centers of Excellence. They’ll walk through real-world challenges, best practices, and the structured approach Astronomer takes to ensure teams have the right plan, resources, and support to succeed. Whether you’re just starting with Airflow or looking to optimize and scale your workflows, this session will give you a proven framework to build a sustainable Airflow Center of Excellence within your organization. 🚀

11:00 - 11:25. Columbia D
By Rachel Sun
Track: Airflow & ...
10/09/2025 11:00 AM 10/09/2025 11:25 AM America/Los_Angeles AS25: How Pinterest Uses AI to Empower Airflow Users for Troubleshooting

At Pinterest, there are over 10,000 DAGs supporting various use cases across different teams and roles. With this scale and diversity, user support has been an ongoing challenge to unlock productivity. As Airflow increasingly serves as a user interface to a variety of data and ML infrastructure behind the scenes, it’s common for issues from multiple areas to surface in Airflow, making triage and troubleshooting a challenge.

In this session, we will discuss the scale of the problem we are facing, how we have addressed it so far, and how we are introducing LLM AI to help solve this problem.

11:30 - 11:55. Columbia A
By Theo Lebrun
Track: Use cases
10/09/2025 11:30 AM 10/09/2025 11:55 AM America/Los_Angeles AS25: Orchestrating AI Knowledge Bases with Apache Airflow

In the age of Generative AI, knowledge bases are the backbone of intelligent systems, enabling them to deliver accurate and context-aware responses. But how do you ensure that these knowledge bases remain up-to-date and relevant in a rapidly changing world? Enter Apache Airflow, a robust orchestration tool that streamlines the automation of data workflows.

This talk will explore how Airflow can be leveraged to manage and update AI knowledge bases across multiple data sources. We’ll dive into the architecture, demonstrate how Airflow enables efficient data extraction, transformation, and loading (ETL), and share insights on tackling challenges like data consistency, scheduling, and scalability.

Whether you’re building your own AI-driven systems or looking to optimize existing workflows, this session will provide practical takeaways to make the most of Apache Airflow in orchestrating intelligent solutions.

11:30 - 11:55. Columbia C
By Brandon Abear
Track: Airflow & ...
10/09/2025 11:30 AM 10/09/2025 11:55 AM America/Los_Angeles AS25: Data & AI Orchestration at GoDaddy

As the adoption of Airflow increases within large enterprises to orchestrate their data pipelines, more than one team needs to create, manage, and run their workflows in isolation. With multi-tenancy not yet supported natively in Airflow, customers are adopting alternate ways to enable multiple teams to share infrastructure. In this session, we will explore how GoDaddy uses MWAA to build a Single Pane Airflow setup for multiple teams with a common observability platform, and how this foundation enables orchestration expansion beyond data workflows to AI workflows as well. We’ll discuss our roadmap for leveraging upcoming Airflow 3 features, including the task execution API for enhanced workflow management and DAG versioning capabilities for comprehensive auditing and governance. This session will help attendees gain insights into the use case, the solution architecture, implementation challenges and benefits, and our strategic vision for unified orchestration across data and AI workloads.

Outline:

  • About GoDaddy
  • GoDaddy Data & AI Orchestration Vision
  • Current State & Airflow Usage
  • Airflow Monitoring & Observability
  • Lessons Learned & Best Practices
  • Airflow 3 Adoption

11:30 - 11:55. Columbia D
By Nathan Hadfield
Track: Airflow & ...
10/09/2025 11:30 AM 10/09/2025 11:55 AM America/Los_Angeles AS25: From Oops to Secure Ops: Self-Hosted AI for Airflow Failure Diagnosis

Last year, ‘From Oops to Ops’ showed how AI-powered failure analysis could help diagnose why Airflow tasks fail. But do we really need large, expensive cloud-based AI models to answer simple diagnostic questions? Relying on external AI APIs introduces privacy risks, unpredictable costs, and latency, often without clear benefits for this use case.

With the rise of distilled, open-source models, self-hosted failure analysis is now a practical alternative. This talk will explore how to deploy an AI service on infrastructure you control, compare cost, speed, and accuracy between OpenAI’s API and self-hosted models, and showcase a live demo of AI-powered task failure diagnosis using DeepSeek and Llama—running without external dependencies to keep data private and costs predictable.
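To make the self-hosted approach tangible, here is a minimal sketch of a failure callback that asks a locally hosted model for a first diagnosis. It assumes an Ollama-style endpoint on localhost; the model name and prompt are illustrative.

    import requests

    def diagnose_failure(context):
        """on_failure_callback: summarize the error with a local LLM."""
        exc = context.get("exception")
        task_id = context["ti"].task_id
        prompt = f"Airflow task {task_id} failed with: {exc}. Suggest likely causes."
        resp = requests.post(
            "http://localhost:11434/api/generate",  # local model server, no external API
            json={"model": "llama3", "prompt": prompt, "stream": False},
            timeout=60,
        )
        resp.raise_for_status()
        print(resp.json().get("response", "no diagnosis available"))

    # usage: DAG(..., default_args={"on_failure_callback": diagnose_failure})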

11:30 - 11:55. Beckler
By Shalabh Agarwal
Track: Airflow & ...
10/09/2025 11:30 AM 10/09/2025 11:55 AM America/Los_Angeles AS25: Custom Operators in Action: A Guide to Extending Airflow's Capabilities

Custom operators are the secret weapon for solving Airflow’s unique & challenging orchestration problems.

This session will cover:

  • When to build custom operators vs. using existing solutions
  • Architecture patterns for creating maintainable, reusable operators
  • Live coding demonstration: Building a custom operator from scratch
  • Real-world examples: How custom operators solve specific business challenges

Through practical code examples and architecture patterns, attendees will walk away with the knowledge to implement custom operators that enhance their Airflow deployments.

This session is ideal for experienced Airflow users looking to extend functionality beyond out-of-the-box solutions.
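In the spirit of the live demo, a minimal sketch of a from-scratch operator; the HTTP-fetching behavior and names are illustrative, not the speaker's example.

    from airflow.models.baseoperator import BaseOperator

    class HttpToLogOperator(BaseOperator):
        """Toy custom operator: fetch a URL and log the response size."""

        template_fields = ("endpoint",)  # lets the URL use Jinja templating

        def __init__(self, endpoint: str, **kwargs):
            super().__init__(**kwargs)
            self.endpoint = endpoint

        def execute(self, context):
            import requests  # keep heavy imports out of DAG parse time

            resp = requests.get(self.endpoint, timeout=30)
            resp.raise_for_status()
            self.log.info("Fetched %d bytes from %s", len(resp.content), self.endpoint)
            return resp.status_code  # return value lands in XCom automatically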

12:00 - 12:25. Columbia A
By Lawrence Gerstley
Track: Use cases
10/09/2025 12:00 PM 10/09/2025 12:25 PM America/Los_Angeles AS25: Airflow Uses in an on-prem Research Setting

KP Division of Research uses Airflow as a central technology for integrating diverse technologies in an agile setting. We wish to present a set of use-cases for AI/ML workloads, including imaging analysis (tissue segmentation, mammography), NLP (early identification of psychosis), LLM processing (identification of vessel diameter from radiological impressions), and other large data processing tasks. We create these “short-lived” project workflows to accomplish specific aims, and then may never run the job again, so leveraging generalized patterns is crucial to quickly implementing these jobs. Our Advanced Computational Infrastructure is comprised of multiple Kubernetes clusters, and we use Airflow to democratize the use of our batch-level resources in those clusters. We use Airflow form-based parameters to deploy pods running R and Python scripts, where generalized parameters are injected into scripts that follow internal programming patterns. Finally, we also leverage Airflow to create headless services inside Kubernetes for large computational workloads (Spark & H2O) that subsequent pods consume ephemerally.
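A minimal sketch of that form-based pattern, with an illustrative image, script, and parameter: trigger-time params render into the pod's arguments via templating.

    import pendulum
    from airflow import DAG
    from airflow.models.param import Param
    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

    with DAG(
        dag_id="research_batch",
        schedule=None,  # triggered manually, with a params form in the UI
        start_date=pendulum.datetime(2025, 1, 1),
        params={"cohort": Param("default", type="string")},
    ):
        KubernetesPodOperator(
            task_id="run_r_script",
            image="registry.local/research/r-runner:latest",  # illustrative image
            cmds=["Rscript", "/scripts/analyze.R"],
            arguments=["{{ params.cohort }}"],  # form input injected at runtime
        )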

12:00 - 12:25. Columbia C
By Philippe Gagnon
Track: Airflow & ...
10/09/2025 12:00 PM 10/09/2025 12:25 PM America/Los_Angeles AS25: Using Apache Airflow with Trino for (almost) all your data problems

Trino is incredibly effective at enabling users to extract insights quickly and effectively from large amounts of data located in dispersed and heterogeneous federated data systems.

However, some business data problems are more complex than interactive analytics use cases, and are best broken down into a sequence of interdependent steps, a.k.a. a workflow. For these use cases, dedicated software is often required in order to schedule and manage these processes with a principled approach.

In this session, we will look at how we can leverage Apache Airflow to orchestrate Trino queries into complex workflows that solve practical batch processing problems, all the while avoiding the use of repetitive, redundant data movement.
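A minimal sketch of one such workflow step, assuming a Trino connection named trino_default and illustrative catalog/schema names; the query executes inside Trino, so no data moves through the Airflow worker.

    import pendulum
    from airflow import DAG
    from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

    with DAG(
        dag_id="trino_daily_summary",
        schedule="@daily",
        start_date=pendulum.datetime(2025, 1, 1),
        catchup=False,
    ):
        SQLExecuteQueryOperator(
            task_id="build_daily_summary",
            conn_id="trino_default",
            sql="""
                INSERT INTO lake.reporting.daily_summary
                SELECT order_date, count(*) AS orders
                FROM lake.raw.orders
                WHERE order_date = DATE '{{ ds }}'
                GROUP BY order_date
            """,
        )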

12:00 - 12:25. Columbia D
By William Orgertrice
Track: Best practices
10/09/2025 12:00 PM 10/09/2025 12:25 PM America/Los_Angeles AS25: 5 Simple Strategies To Enhance Your DAGs For Data Processing

Ready to take your DAGs in Apache Airflow to the next level? Join this insightful session where we’ll uncover 5 transformative strategies to enhance your data workflows. Whether you’re a data engineering pro or just getting started, this presentation is packed with practical tips and actionable insights that you can apply right away.

We’ll dive into the magic of using powerful libraries like Pandas, share techniques to trim down data volumes for faster processing, and highlight the importance of modularizing your code for easier maintenance. Plus, you’ll discover efficient ways to monitor and debug your DAGs, and how to make the most of Airflow’s built-in features.

By the end of this session, you’ll have a toolkit of strategies to boost the efficiency and performance of your DAGs, making your data processing tasks smoother and more effective. Don’t miss out on this opportunity to elevate your Airflow DAGs!

12:00 - 12:25. Beckler
By Vishal Vijayvargiya
Track: Airflow & ...
10/09/2025 12:00 PM 10/09/2025 12:25 PM America/Los_Angeles AS25: Enhancing Airflow REST API: From Basic Integration to Enterprise Scale

Apache Airflow’s REST API has evolved to support diverse orchestration needs, with managed services like MWAA introducing custom enhancements. One such feature, InvokeRestApi, enables dynamic interactions with external services while maintaining Airflow’s core orchestration capabilities.

In this talk, we will explore the architectural design behind InvokeRestApi, detailing how it enhances API-driven workflows. Beyond the architecture, we’ll share key challenges and learnings from implementing and scaling Airflow’s REST API in production environments. Topics include authentication, performance considerations, error handling, and best practices for integrating external APIs efficiently.

Attendees will gain a deeper understanding of Airflow’s API extensibility, its implications for workflow automation, and actionable insights for building robust, API-driven orchestration solutions. Whether you’re an Airflow user or an architect, this session will provide valuable takeaways for simplifying API interactions across Airflow environments.
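For orientation, a minimal sketch of calling MWAA's InvokeRestApi from boto3; the environment name is illustrative, and the response shape shown is an assumption worth checking against the AWS docs.

    import boto3

    client = boto3.client("mwaa")
    response = client.invoke_rest_api(
        Name="my-mwaa-environment",       # illustrative environment name
        Method="GET",
        Path="/dags",                     # forwarded to the Airflow REST API
        QueryParameters={"limit": 10},
    )
    # The Airflow API's JSON payload is returned under RestApiResponse.
    for dag in response["RestApiResponse"]["dags"]:
        print(dag["dag_id"])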

12:30 - 12:55. Columbia A
By Vikram Koka
Track: Best practices
10/09/2025 12:30 PM 10/09/2025 12:55 PM America/Los_Angeles AS25: Common provider abstractions: Key for multi-cloud data handling

Enterprises want the flexibility to operate across multiple clouds, whether to optimize costs, improve resiliency, to avoid vendor lock-in, or for data sovereignty. But for developers, that flexibility usually comes at the cost of extra complexity and redundant code. The goal here is simple: write once, run anywhere, with minimum boilerplate. In Apache Airflow, we’ve already begun tackling this problem with abstractions like Common-SQL, which lets you write database queries once and run them on 20+ databases, from Snowflake to Postgres to SQLite to SAP HANA. Similarly, Common-IO standardizes cloud blob storage interactions across all public clouds. With Airflow 3.0, we are pushing this further by introducing a Common Message Bus provider, which is an abstraction, initially supporting Amazon SQS and expanding to Google PubSub and Apache Kafka soon after. We expect additional implementations such as Amazon Kinesis and Managed Kafka over time.

This talk will dive into why these abstractions matter, how they reduce friction for developers while giving enterprises true multi-cloud optionality, and what’s next for Airflow’s evolving provider ecosystem.
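
To make “write once, run anywhere” concrete, here is a minimal sketch using the Common-SQL provider’s generic operator; swapping the conn_id is all it takes to retarget the query (connection names are illustrative):

    from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

    daily_rollup = SQLExecuteQueryOperator(
        task_id="daily_rollup",
        conn_id="snowflake_default",  # change to "postgres_default" and the task runs unchanged
        sql="SELECT date_trunc('day', ts) AS day, count(*) AS n FROM events GROUP BY 1",
    )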

12:30 - 12:55. Columbia C
By Ashir Alam & Gangfeng Huang
Track: Use cases
10/09/2025 12:30 PM 10/09/2025 12:55 PM America/Los_Angeles AS25: Dynamic DAGs and Data Quality using DAGFactory

We have a similar pattern of DAGs running for different data quality dimensions, like accuracy, timeliness, and completeness. Building these again and again means duplicating code, and copy-pasting (or asking people to rewrite the same logic) risks introducing human error.

To solve this, we are doing a few things:

  • Run DAGs via DagFactory, dynamically generating them from a small amount of YAML that describes all the steps we want to run in our DQ checks.
  • Hide this behind a UI hooked into a GitHub pull-request flow: the user provides a few inputs or selects from dropdowns, and a YAML DAG is generated and opened as a PR for them.

This highlights the potential for DAGFactory to hide Airflow Python code from users, making pipelines accessible to data analysts and business intelligence teams as well as software engineers, while reducing human error. YAML is the perfect format for generating code and creating a PR, and DagFactory is the perfect fit for that. All of this runs in GCP Cloud Composer; a sketch of the pattern follows.
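
As an illustration (not the presenters’ actual config), a dag-factory setup typically pairs a YAML spec with a small loader module in the DAGs folder:

    # dags/load_dq_dags.py — one small loader turns YAML specs into real DAGs
    import os

    import dagfactory

    # dq_checks.yaml (illustrative) might contain:
    #   dq_accuracy:
    #     default_args: {owner: data-platform, start_date: 2025-01-01}
    #     schedule_interval: "@daily"
    #     tasks:
    #       accuracy_check:
    #         operator: airflow.operators.bash.BashOperator
    #         bash_command: "python run_dq_check.py --dimension accuracy"
    CONFIG = os.path.join(os.path.dirname(__file__), "dq_checks.yaml")

    factory = dagfactory.DagFactory(CONFIG)
    factory.clean_dags(globals())
    factory.generate_dags(globals())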

12:30 - 12:55. Columbia D
By Chirag Todarka & Alvin Zhang
Track: Airflow & ...
10/09/2025 12:30 PM 10/09/2025 12:55 PM America/Los_Angeles AS25: Scaling and Unifying Multiple Airflow Instances with Orchestration Frederator

In large organizations, multiple Apache Airflow instances often arise organically—driven by team-specific needs, distinct use cases, or tiered workloads. This fragmentation introduces complexity, operational overhead, and higher infrastructure costs. To address these challenges, we developed the “Orchestration Frederator,” a solution designed to unify and horizontally scale multiple Airflow deployments seamlessly.

This session will detail our journey in implementing Orchestration Frederator, highlighting how we achieved:

  • Horizontal Scalability: Seamlessly scaling Airflow across multiple instances without operational overhead.

  • End-to-End Data Lineage: Constructing comprehensive data lineage across disparate Airflow deployments to simplify monitoring and debugging.

  • Multi-Region Support: Introducing multi-region capabilities, enhancing reliability and disaster recovery.

  • Unified Ecosystem: Consolidating previously fragmented Airflow environments into a cohesive orchestration platform.

Join us to explore practical strategies, technical challenges, lessons learned, and best practices for enhancing scalability, reliability, and maintainability in large-scale Airflow deployments.

12:30 - 12:55. Beckler
By Rakesh Kumar Tai & Mili Tripathi
Track: Use cases
10/09/2025 12:30 PM 10/09/2025 12:55 PM America/Los_Angeles AS25: Transforming Data Engineering: Achieving Efficiency and Ease with an Intuitive Orchestration Solution

In the rapidly evolving field of data engineering and data science, efficiency and ease of use are crucial. Our innovative solution offers a user-friendly interface to manage and schedule custom PySpark, PySQL, Python, and SQL code, streamlining the process from development to production. Using Airflow at the backend, this tool eliminates the complexities of infrastructure management, version control, CI/CD processes, and workflow orchestration. The intuitive UI allows users to upload code, configure job parameters, and set schedules effortlessly, without the need for additional scripting or coding. Additionally, users have the flexibility to bring their own custom artifact repository and run their code. In summary, our solution significantly enhances the orchestration and scheduling of custom code, breaking down traditional barriers and empowering organizations to maximize their data’s potential and drive innovation efficiently. Whether you are an individual data scientist or part of a large data engineering team, this tool provides the resources needed to streamline your workflow and achieve your goals faster than ever before.

14:00 - 14:25. Columbia A
By Yunhao Qing
Track: Use cases
10/09/2025 2:00 PM 10/09/2025 2:25 PM America/Los_Angeles AS25: From Cron to Data-Aware: Evolving Airflow Scheduling at Scale

As data platforms grow in complexity, so do the orchestration needs behind them. Time-based (cron) scheduling has long been the default in Airflow, but dataset-based scheduling promises a more data-aware, efficient alternative. In this session, I’ll share lessons learned from operating Airflow at scale—supporting thousands of DAGs across teams with varied use cases, from simple ETL to complex ML workflows. We’ll explore when dataset scheduling makes sense, the challenges it introduces, and how to evolve your DAG design and platform architecture to make the most of it. Whether you’re migrating legacy workflows or designing new ones, this talk will help you evaluate the right scheduling model for your needs.
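
For readers new to data-aware scheduling, this is the basic shape of dataset-based scheduling in Airflow 2.4+ (DAG and dataset names are illustrative):

    import pendulum
    from airflow import DAG
    from airflow.datasets import Dataset
    from airflow.operators.empty import EmptyOperator

    events = Dataset("s3://lake/raw/events")

    with DAG(dag_id="producer", schedule="@hourly",
             start_date=pendulum.datetime(2025, 1, 1, tz="UTC")):
        # Declaring the dataset as an outlet marks it "updated" when the task succeeds
        EmptyOperator(task_id="land_events", outlets=[events])

    with DAG(dag_id="consumer", schedule=[events],  # runs whenever the dataset is updated
             start_date=pendulum.datetime(2025, 1, 1, tz="UTC")):
        EmptyOperator(task_id="build_report")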

14:00 - 14:25. Columbia C
By Arthur Chen, Trevor DeVore & Deng Pan
Track: Airflow & ...
10/09/2025 2:00 PM 10/09/2025 2:25 PM America/Los_Angeles AS25: Lessons learned from migrating to Airflow @ LI Scale

At LinkedIn, our data pipelines process exabytes of data, with our offline infrastructure executing 300K ETL workflows daily and 10K concurrent executions. Historically, these workloads ran on our legacy system, Azkaban, which faced UX, scalability, and operational challenges. To modernize our infra, we built a managed Airflow service, leveraging its enhanced developer & operator experience, rich feature set, and strong OSS community support. That initiated LinkedIn’s largest-ever infrastructure migration—transitioning thousands of legacy workflows to Airflow.

In this talk, we will share key lessons from migrating massive-scale pipelines with minimal production disruption. We will discuss:

  • Overall Migration Strategy
  • Custom Tooling Enhancements on testing, deployment, and observability
  • Architectural Innovations decoupling orchestration and compute
  • GenAI-powered Migration automating code rewrites
  • Post-Migration Challenges & Airflow 3.0.

Attendees will walk away with battle-tested strategies for large-scale Airflow adoption and practical insights into scaling Airflow in enterprise environments.

14:00 - 14:25. Columbia D
By Ben Rogojan
Track: Community
10/09/2025 2:00 PM 10/09/2025 2:25 PM America/Los_Angeles AS25: A Decade in Data Engineering - Lessons Realities and Where We Go From Here

There was a post on the data engineering subreddit recently that discussed how difficult it is to keep up with the data engineering world.

Did you learn Hadoop? Great, we’re on Snowflake, BigQuery, and Databricks now.

Just learned Airflow? Well, now we have Airflow 3.0.

And the list goes on.

But what doesn’t change? And what have the lessons of the past decade been? That’s what I’ll be covering in this talk: real lessons and realities that come up time and time again, whether you’re working for a start-up or a large enterprise.

14:00 - 14:25. Beckler
By Khadija Al Ahyane
Track: Airflow & ...
10/09/2025 2:00 PM 10/09/2025 2:25 PM America/Los_Angeles AS25: Task failures troubleshooting based on Airflow & Kubernetes signals

Per the Airflow community survey, Kubernetes is the most popular compute platform for running Airflow. When run on Kubernetes, Airflow gains lots of benefits out of the box, such as monitoring, reliability, ease of deployment, scalability, and autoscaling. On the other hand, running Airflow on Kubernetes means running one sophisticated distributed system on top of another, which makes troubleshooting Airflow task and DAG failures harder.

This session tackles that bottleneck head-on, introducing a practical approach to building an automated diagnostic pipeline for Airflow on Kubernetes. Imagine offloading tedious investigations to a system that, on task failure, automatically collects and correlates key signals from Kubernetes components (linking Airflow tasks to specific Pods and their events), Kubernetes/GKE monitoring, and relevant logs—pinpointing root causes and suggesting actionable fixes.

Attendees will leave with a clear understanding of common Airflow-on-Kubernetes failure patterns—and more importantly, a blueprint and practical strategies to reduce MTTR and boost team efficiency.
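
As a flavor of the approach, here is a hedged sketch of a failure callback that pulls the Kubernetes events for the failed task’s pod (the namespace, in-cluster config, and KubernetesExecutor hostname-equals-pod-name assumption are illustrative):

    from kubernetes import client, config

    def collect_pod_events(context):
        """on_failure_callback: correlate a failed task with its pod's K8s events."""
        config.load_incluster_config()  # assumes Airflow itself runs in the cluster
        pod_name = context["task_instance"].hostname  # with KubernetesExecutor this is the pod name
        events = client.CoreV1Api().list_namespaced_event(
            namespace="airflow",
            field_selector=f"involvedObject.name={pod_name}",
        )
        for ev in events.items:
            print(ev.reason, ev.message)  # e.g. FailedScheduling, BackOff, Evicted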

14:00 - 16:30. 301
By Pankaj Singh, Tatiana Al-Chueyr Martins & Pankaj Koti
Track: Workshop
10/09/2025 2:00 PM 10/09/2025 4:30 PM America/Los_Angeles AS25: Productionising dbt-core with Airflow

As a popular open-source library for analytics engineering, dbt is often combined with Airflow. Orchestrating and executing dbt models as DAGs ensures an additional layer of control over tasks, observability, and provides a reliable, scalable environment to run dbt models.

This workshop will cover a step-by-step guide to Cosmos, a popular open-source package from Astronomer that helps you quickly run your dbt Core projects as Airflow DAGs and Task Groups, all with just a few lines of code. We’ll walk through:

  • Running and visualising your dbt transformations
  • Managing dependency conflicts
  • Defining database credentials (profiles)
  • Configuring source and test nodes
  • Using dbt selectors
  • Customising arguments per model
  • Addressing performance challenges
  • Leveraging deferrable operators
  • Visualising dbt docs in the Airflow UI
  • Example of how to deploy to production
  • Troubleshooting

We encourage participants to bring their dbt project to follow this step-by-step workshop.
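
To set expectations, rendering a dbt project with Cosmos looks roughly like this (paths and profile details are illustrative, not the workshop materials):

    import pendulum
    from cosmos import DbtDag, ProfileConfig, ProjectConfig

    jaffle_shop = DbtDag(
        dag_id="jaffle_shop",
        project_config=ProjectConfig("/usr/local/airflow/dbt/jaffle_shop"),
        profile_config=ProfileConfig(
            profile_name="jaffle_shop",
            target_name="dev",
            profiles_yml_filepath="/usr/local/airflow/dbt/profiles.yml",
        ),
        schedule="@daily",
        start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
    )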

14:00 - 16:30. 305
By Philippe Gagnon
Track: Workshop
10/09/2025 2:00 PM 10/09/2025 4:30 PM America/Los_Angeles AS25: Implementing Operations Research Problems with Apache Airflow: From Modelling to Production

This workshop will provide an overview of implementing operations research problems using Apache Airflow. This is a hands-on session where attendees will gain experience creating DAGs to define and manage workflows for classical operations research problems. The workshop will include several examples of how Airflow can be used to optimize and automate various decision-making processes, including:

  • Inventory management: How to use Airflow to optimize inventory levels and reduce stockouts by analyzing demand patterns, lead times, and other factors.
  • Production planning: How to use Airflow to create optimized production schedules that minimize downtime, reduce costs, and increase throughput.
  • Logistics optimization: How to use Airflow to optimize transportation routes and other factors to improve the efficiency of logistics operations.

Attendees will come away with a solid understanding of using Airflow to automate decision-making processes with optimization solvers.
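
As a hint of what the hands-on part looks like, a toy production-planning task might wrap a solver call like this (the model, using the PuLP library, is illustrative rather than taken from the workshop):

    from airflow.decorators import task
    from pulp import LpMaximize, LpProblem, LpVariable, value

    @task
    def plan_production():
        # Maximize profit for two products under a shared machine-hours budget
        model = LpProblem("production_plan", LpMaximize)
        x = LpVariable("widgets", lowBound=0)
        y = LpVariable("gadgets", lowBound=0)
        model += 20 * x + 30 * y       # objective: total profit
        model += 2 * x + 4 * y <= 100  # constraint: machine hours available
        model.solve()
        return {"widgets": value(x), "gadgets": value(y)}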

14:00 - 16:30. 306
By Ryan Hatter, Amogh Desai, Phani Kumar & Kalyan Reddy
Track: Workshop
10/09/2025 2:00 PM 10/09/2025 4:30 PM America/Los_Angeles AS25: Your first Apache Airflow Contribution

Ready to contribute to Apache Airflow? In this hands-on workshop, you’ll be expected to come prepared with your development environment already configured (Breeze installed is strongly recommended, but Codespaces works if you can’t install Docker). We’ll dive straight into finding issues that match your skills and walk you through the entire contribution process—from creating your first pull request to receiving community feedback. Whether you’re writing code, enhancing documentation, or offering feedback, there’s a place for you. Let’s get started and see your name among Airflow contributors!

14:30 - 14:55. Columbia A
By Christian Foernges
Track: Use cases
10/09/2025 2:30 PM 10/09/2025 2:55 PM America/Los_Angeles AS25: Learn from Deutsche Bank: Using Apache Airflow in Regulated Environments

Operating within the stringent regulatory landscape of Corporate Banking, Deutsche Bank relies heavily on robust data orchestration. This session explores how Deutsche Bank’s Corporate Bank leverages Apache Airflow across diverse environments, including both on-premises infrastructure and cloud platforms. Discover their approach to managing critical data & analytics workflows, encompassing areas like regulatory reporting, data integration and complex data processing pipelines. Gain insights into the architectural patterns and operational best practices employed to ensure compliance, security, and scalability when running Airflow at scale in a highly regulated, hybrid setting.

14:30 - 14:55. Columbia C
By Shoubhik Bose
Track: Use cases
10/09/2025 2:30 PM 10/09/2025 2:55 PM America/Los_Angeles AS25: Applying Airflow to drive the digital workforce in the Enterprise

Red Hat’s unified data and AI platform relies on Apache Airflow for orchestration, alongside Snowflake, Fivetran, and Atlan. The platform prioritizes building a dependable data foundation, recognizing that effective AI depends on quality data. Airflow was selected for its predictability, extensive connectivity, reliability, and scalability.

The platform now supports business analytics, transitioning from ETL to ELT processes. This has resulted in a remarkable improvement in how we make data available for business decisions.

The platform’s capabilities are being extended to power Digital Workers (AI agents) using large language models, encompassing model training, fine-tuning, and inference. Two Digital Workers are currently deployed, with more in development.

This presentation will detail the rationale and background of this evolution, followed by an explanation of the architectural decisions made and the challenges encountered and resolved throughout the process of transforming into an AI-enabled data platform to power Red Hat’s business.

14:30 - 14:55. Columbia D
By Katarzyna Kalek & Jakub Orlowski
Track: Airflow & ...
10/09/2025 2:30 PM 10/09/2025 2:55 PM America/Los_Angeles AS25: Simplifying Data Management with DAG Factory

At OLX, we connect millions of people daily through our online marketplace while relying on robust data pipelines. In this talk, we explore how the DAG Factory concept elevates data governance, lineage, and discovery by centralizing operator logic and restricting direct DAG creation. This approach enforces code quality, optimizes resources, maintains infrastructure hygiene and enables smooth version upgrades. We then leverage consistent naming conventions in Airflow to build targeted namespaces, aligning teams with global policies while preserving autonomy. Integrating external tools like AWS Lake Formation and Open Metadata further unifies governance, making it straightforward to manage and secure data. This is critical when handling hundreds or even thousands of active DAGs. If the idea of storing 1,600 pipelines in one folder seems overwhelming, join us to learn how the DAG Factory concept simplifies pipeline management. We’ll also share insights from OLX, highlighting how thoughtful design fosters oversight, efficiency, and discoverability across diverse use cases.

14:30 - 14:55. Beckler
By Purshotam Shah & Prakash Nandha Mukunthan
Track: Use cases
10/09/2025 2:30 PM 10/09/2025 2:55 PM America/Los_Angeles AS25: Navigating Secure and Cost-Efficient Flink Batch on Kubernetes with Airflow

At Yahoo, we built a secure, scalable, and cost-efficient batch processing platform using Amazon MWAA to orchestrate Apache Flink jobs on EKS, managed by the Flink Kubernetes Operator. This setup enables dynamic job orchestration while meeting strict enterprise compliance standards.

In this session, we’ll share how Airflow DAGs:

  • Dynamically launch, monitor, and clean up isolated Flink clusters per batch job, improving resource efficiency.

  • Securely fetch EKS kubeconfig, submit FlinkDeployment CRDs using FlinkKubernetesOperator, and poll job status using Airflow sensors.

  • Integrate IAM for access control and meet Yahoo’s security requirements, including mutual TLS (mTLS) with Athenz.

  • Optimize for cost and resilience through automated cleanup of jobs and the operator, and handle job failures and retries.

Join us for practical strategies and lessons from Yahoo’s production-scale Flink workflows in a Kubernetes environment.
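
For orientation, submitting a FlinkDeployment CRD from a DAG with the Apache Flink provider looks roughly like this (the manifest file, namespace, and connection id are illustrative):

    from airflow.providers.apache.flink.operators.flink_kubernetes import FlinkKubernetesOperator

    submit_batch_job = FlinkKubernetesOperator(
        task_id="submit_batch_job",
        namespace="flink-jobs",
        application_file="flink_deployment.yaml",  # FlinkDeployment CRD manifest
        kubernetes_conn_id="kubernetes_default",
    )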

15:00 - 15:25. Columbia A
By Oluwafemi Olawoyin
Track: Use cases
10/09/2025 3:00 PM 10/09/2025 3:25 PM America/Los_Angeles AS25: Modernizing Automation in Secure, Regulated Environments: Lessons from Deploying Airflow

This session details practical strategies for introducing Apache Airflow in strict, compliance-heavy organizations. Learn how on-premise deployment and hybrid tooling can help modernize legacy workflows when public cloud solutions and container technologies are restricted. Discover how cross-platform engineering teams can collaborate securely using CI/CD bridges, and what it takes to meet rigorous security and governance standards. Key lessons address navigating resistance to change, achieving production sign-off, and avoiding common compliance pitfalls, relevant to anyone automating in public sector settings.

15:00 - 15:25. Columbia C
By Kunal Jain
Track: Use cases
10/09/2025 3:00 PM 10/09/2025 3:25 PM America/Los_Angeles AS25: How Airflow can help with Data Management and Governance

Metadata management is a cornerstone of effective data governance, yet it presents unique challenges distinct from traditional data engineering. At scale, efficiently extracting metadata from relational and NoSQL databases demands specialized solutions. To address this, our team has developed custom Airflow operators that scan and extract metadata across various database technologies, orchestrating 100+ production jobs to ensure continuous and reliable metadata collection.

Now, we’re expanding beyond databases to tackle non-traditional data sources such as file repositories and message queues. This shift introduces new complexities, including processing structured and unstructured files, managing schema evolution in streaming data, and maintaining metadata consistency across heterogeneous sources. In this session, we’ll share our approach to building scalable metadata scanners, optimizing performance, and ensuring adaptability across diverse data environments. Attendees will gain insights into designing efficient metadata pipelines, overcoming common pitfalls, and leveraging Airflow to drive metadata governance at scale.
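
As an illustration of the pattern (not the team’s actual code), a minimal metadata-scanning operator built on a Common-SQL-compatible hook might look like:

    from airflow.hooks.base import BaseHook
    from airflow.models import BaseOperator

    class ColumnMetadataScanOperator(BaseOperator):
        """Pulls column-level metadata from information_schema for one schema."""

        def __init__(self, conn_id: str, target_schema: str, **kwargs):
            super().__init__(**kwargs)
            self.conn_id = conn_id
            self.target_schema = target_schema

        def execute(self, context):
            # Resolve the right DB-API hook from the connection type
            hook = BaseHook.get_connection(self.conn_id).get_hook()
            rows = hook.get_records(
                "SELECT table_name, column_name, data_type "
                "FROM information_schema.columns WHERE table_schema = %s",
                parameters=(self.target_schema,),
            )
            return [{"table": t, "column": c, "type": d} for t, c, d in rows]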

15:00 - 15:25. Columbia D
By Yifan Wang
Track: Roadmap
10/09/2025 3:00 PM 10/09/2025 3:25 PM America/Los_Angeles AS25: DAGnostics: Shift-Left Airflow Governance with Policy Enforcement Framework

DAGnostics seamlessly integrates Airflow Cluster Policy hooks to enforce governance from local DAG authoring through CI pipelines to production runtime. Learn how it closes validation gaps, collapses feedback loops from hours to seconds, and ensures consistent policies across stages. We examine current runtime-only enforcement and fractured CI checks, then unveil our architecture: a pluggable policy registry via Airflow entry points, local static analysis for pre-commit validation, GitHub Actions CI integration, and runtime hook enforcement. See real-world use cases: alerting standards, resource quotas, naming conventions, and exemption handling. Next, dive into implementation: authoring policies in Python, auto-discovery, cross-environment enforcement, upstream contribution, and testing strategies. We share LinkedIn’s metrics—2,000+ DAG repos, 10,000+ daily executions supporting trunk-based development across isolated teams/use-cases, and 78% fewer runtime violations—and lessons learned scaling policy-as-code at enterprise scale. Leave with a blueprint to adopt DAGnostics and strengthen your Airflow governance while preserving full compatibility with existing systems.
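
For context, Airflow’s cluster policy hooks are plain Python functions defined in airflow_local_settings.py; a minimal naming-and-tagging policy (illustrative, not DAGnostics itself) looks like:

    # airflow_local_settings.py
    from airflow.exceptions import AirflowClusterPolicyViolation

    def dag_policy(dag):
        # Reject DAGs that don't carry a registered team prefix or any tags
        if dag.dag_id.split("_")[0] not in {"core", "growth", "ml"}:
            raise AirflowClusterPolicyViolation(
                f"DAG id {dag.dag_id!r} must start with a registered team prefix"
            )
        if not dag.tags:
            raise AirflowClusterPolicyViolation(f"DAG {dag.dag_id!r} must declare tags")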

15:00 - 15:25. Beckler
By Niko Oliveira
Track: Airflow intro/overview
10/09/2025 3:00 PM 10/09/2025 3:25 PM America/Los_Angeles AS25: AWS Lambda Executor: The Speed of Local Execution with the Advantages of Remote

Apache Airflow’s executor landscape has traditionally presented users with a clear trade-off: choose either the speed of local execution or the scalability, isolation and configurability of remote execution. The AWS Lambda Executor introduces a new paradigm that bridges this gap, offering near-local execution speeds with the benefits of remote containerization.

This talk will begin with a brief overview of Airflow’s executors, how they work and what they are responsible for, highlighting the compromises between different executors. We will explore the emerging niche for fast, yet remote execution and demonstrate how the AWS Lambda Executor fills this space. We will also address practical considerations when using such an executor, such as working within Lambda’s 15 minute execution limit, and how to mitigate this using multi-executor configuration.

Whether you’re new to Airflow or an experienced user, this session will provide valuable insights into task execution and how you can combine the best of both local and remote execution paradigms.
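
To illustrate the multi-executor mitigation mentioned above: since Airflow 2.10 you can register several executors and pin individual tasks to one of them (the executor name in the config comment is an assumption; check the Amazon provider docs for the exact module path):

    # airflow.cfg — the first executor listed is the default; others are opt-in per task
    # [core]
    # executor = LambdaExecutor,KubernetesExecutor

    from airflow.operators.bash import BashOperator

    long_report = BashOperator(
        task_id="long_report",
        bash_command="python build_report.py",
        executor="KubernetesExecutor",  # route tasks that may exceed Lambda's 15-minute cap
    )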

15:45 - 16:10. Columbia A
By Yu Lung Law & Ivan Sayapin
Track: Use cases
10/09/2025 3:45 PM 10/09/2025 4:10 PM America/Los_Angeles AS25: Fine-Tuning Airflow: Parameters You May Not Know About

The Bloomberg Data Platform Engineering team is responsible for managing, storing, and providing access to business and financial data used by financial professionals across the global capital markets. Our team utilizes Apache Airflow to orchestrate data workflows across various applications and Bloomberg Terminal functions. Over the years, we have fine-tuned our Airflow cluster to handle more than 1,000 ingestion DAGs, which has presented unique scalability challenges. In this session, we will share insights into several key Airflow parameters — some of which you may not be all that familiar with — that our team uses to optimize and scale the platform effectively.
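
A few of the usual suspects, shown here as environment-variable overrides (the values are illustrative starting points, not Bloomberg’s settings):

    AIRFLOW__CORE__PARALLELISM=512                         # cluster-wide running-task ceiling
    AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG=64             # per-DAG concurrency cap
    AIRFLOW__SCHEDULER__PARSING_PROCESSES=4                # parallel DAG-file parsing
    AIRFLOW__SCHEDULER__MAX_DAGRUNS_TO_CREATE_PER_LOOP=10  # throttle run creation per loop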

15:45 - 16:10. Columbia C
By John Robert
Track: Airflow & ...
10/09/2025 3:45 PM 10/09/2025 4:10 PM America/Los_Angeles AS25: Building a Transparent Data Workflow with Airflow and Data Catalog

As modern data ecosystems grow in complexity, ensuring transparency, discoverability, and governance in data workflows becomes critical. Apache Airflow, a powerful workflow orchestration tool, enables data engineers to build scalable pipelines, but without proper visibility into data lineage, ownership, and quality, teams risk operating in a black box.

In this talk, we will explore how integrating Airflow with a data catalog can bring clarity and transparency to data workflows. We’ll discuss how metadata-driven orchestration enhances data governance, enables lineage tracking, and improves collaboration across teams. Through real-world use cases, we will demonstrate how Airflow can automate metadata collection, update data catalogs dynamically, and ensure data quality at every stage of the pipeline.

Attendees will walk away with practical strategies for implementing a transparent data workflow that fosters trust, efficiency, and compliance in their data infrastructure.

15:45 - 16:10. Columbia D
By Gurmeet Saran & Kushal Thakkar
Track: Airflow & ...
10/09/2025 3:45 PM 10/09/2025 4:10 PM America/Los_Angeles AS25: Enabling SQL testing in Airflow workflows using Pydantic types

This session explores how to bring unit testing to SQL pipelines using Airflow. I’ll walk through the development of a SQL testing library that allows isolated testing of SQL logic by injecting mock data into base tables. To support this, we built a type system for AWS Glue tables using Pydantic, enabling schema validation and mock data generation. Over time, this type system also powered production data quality checks via a custom Airflow operator. Learn how this approach improves reliability, accelerates development, and scales testing across data workflows.
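
As a sketch of the idea (the fields are illustrative, not the speakers’ library), a Pydantic v2 model can both validate a table’s schema and stamp out deterministic mock rows for tests:

    from datetime import datetime

    from pydantic import BaseModel

    class OrderRow(BaseModel):
        # Mirrors the Glue table schema for `orders`
        order_id: int
        customer_id: int
        amount: float
        created_at: datetime

    def mock_rows() -> list[dict]:
        # Deterministic fixtures to inject into the base table under test
        return [
            OrderRow(order_id=i, customer_id=100 + i, amount=9.99 * i,
                     created_at=datetime(2025, 1, 1)).model_dump()
            for i in range(1, 4)
        ]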

15:45 - 16:10. Beckler
By Steven Woods
Track: Best practices
10/09/2025 3:45 PM 10/09/2025 4:10 PM America/Los_Angeles AS25: From Repetition to Refactor: Smarter DAG Design in Airflow 3

We will explore how Apache Airflow 3 unlocks new possibilities for smarter, more flexible DAG design. We’ll start by breaking down common anti-patterns in early DAG implementations, such as hardcoded operators, duplicated task logic, and rigid sequencing, that lead to brittle, unscalable workflows. From there, we’ll show how refactoring with the D.R.Y. (Don’t Repeat Yourself) principle, using techniques like task factories, parameterization, dynamic task mapping, and modular DAG construction, transforms these workflows into clean, reusable patterns.

With Airflow 3, these strategies go further: enabling DAGs that are reusable across both batch pipelines and streaming/event-driven workloads, while also supporting ad-hoc runs for testing, one-off jobs, or backfills. The result is not just more concise code, but workflows that can flexibly serve different data processing modes without duplication. Attendees will leave with concrete patterns and best practices for building maintainable, production-grade DAGs that are scalable, observable, and aligned with modern data engineering standards.
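
One concrete flavor of these patterns: a parameterized task plus dynamic task mapping collapses N hand-wired copies into one definition (names are illustrative):

    import pendulum
    from airflow.decorators import dag, task

    @dag(schedule="@daily", start_date=pendulum.datetime(2025, 1, 1, tz="UTC"))
    def ingest_all_sources():
        @task
        def ingest(source: str) -> str:
            # One parameterized task body replaces a copy-pasted operator per source
            return f"loaded {source}"

        # Dynamic task mapping fans out one mapped task instance per source
        ingest.expand(source=["orders", "customers", "payments"])

    ingest_all_sources()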

16:15 - 16:40. Columbia A
By pei-chi-miko-chen
Track: Use cases
10/09/2025 4:15 PM 10/09/2025 4:40 PM America/Los_Angeles AS25: No More Missed Beats: How Airflow Rescued Our Analytics Pipeline

Before Airflow, our BigQuery pipelines at Create Music Group operated like musicians without a conductor—each playing on its own schedule, regardless of whether upstream data was ready. As our data platform grew, this chaos led to spiralling costs, performance bottlenecks, and became utterly unsustainable.

This talk tells the story of how Create Music Group brought harmony to its data workflows by adopting Apache Airflow and the Medallion architecture, ultimately slashing our data processing costs by 50%. We’ll show how moving to event-driven scheduling with datasets helped eliminate stale data issues, dramatically improved performance, and unlocked faster iteration across teams. Discover how we replaced repetitive SQL with standardized dimension/fact tables, empowering analysts in a safer sandbox.

16:15 - 16:40. Columbia C
By Sebastien Crocquevieille
Track: Use cases
10/09/2025 4:15 PM 10/09/2025 4:40 PM America/Los_Angeles AS25: Multi-Instance Asset Synchronization - push or pull?

As Data Engineers, our jobs regularly include scheduling or scaling workflows.

But have you ever asked yourself: can I scale my scheduling?

It turns out that you can! But doing so raises a number of issues that need to be addressed.

In this talk we’ll be:

  • Recapping Asset-aware scheduling in Apache Airflow
  • Discussing diverse methods to upscale our scheduling
  • Solving the issue of maintaining our Airflow Asset synchronized between instances
  • Comparing our push-based solution with the built-in solution from AIP-82, weighing the pros and cons of each method.

I hope you will enjoy it!

16:15 - 16:40. Columbia D
By Abhishek Bhakat & Sudarshan Chaudhari
Track: Airflow & ...
10/09/2025 4:15 PM 10/09/2025 4:40 PM America/Los_Angeles AS25: Model Context Protocol with Airflow

In today’s data-driven world, effective workflow management and AI are crucial for success. However, there’s a notable gap between Airflow and AI. Our presentation offers a solution to close this gap.

We propose an MCP (Model Context Protocol) server to act as that bridge. We’ll dive into two paths:

  • AI-Augmented Airflow: Enhancing Airflow with AI to improve error handling, automate DAG generation, proactively detect issues, and optimize resource use.
  • Airflow-Powered AI: Utilizing Airflow’s reliability to empower LLMs in executing complex tasks, orchestrating AI agents, and supporting decision-making with real-time data.

Key takeaways:

  • Understanding how to integrate AI insights directly into your workflow orchestration.
  • Learning how MCP empowers AI with robust orchestration capabilities, offering full logging, monitoring, and auditability.
  • Gaining insights into how to transform LLMs from reactive responders into proactive, intelligent, and reliable executors.

We invite you to explore how MCP can help with workflow management, making AI-driven decisions more reliable and turning workflow systems into intelligent, autonomous agents.
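
As a taste of the bridge pattern, a minimal MCP tool wrapping Airflow’s REST API might look like this, built on the MCP Python SDK’s FastMCP (the URL, auth, and tool name are illustrative):

    import requests
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("airflow-bridge")

    @mcp.tool()
    def list_failed_dag_runs(dag_id: str) -> list[str]:
        """Return ids of failed runs so an LLM can decide what to investigate."""
        resp = requests.get(
            f"https://airflow.example.com/api/v1/dags/{dag_id}/dagRuns",
            params={"state": "failed"},
            auth=("api_user", "api_password"),
            timeout=30,
        )
        resp.raise_for_status()
        return [run["dag_run_id"] for run in resp.json()["dag_runs"]]

    if __name__ == "__main__":
        mcp.run()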

16:15 - 16:40. Beckler
By Annie Friedman & Caitlin Petro
Track: Best practices
10/09/2025 4:15 PM 10/09/2025 4:40 PM America/Los_Angeles AS25: Lessons from Airflow gone wrong: How to set yourself up to scale successfully

Ever seen a DAG go rogue and deploy itself? Or try to time travel back to 1999? Join us for a light-hearted yet painfully relatable look at how not to scale your Airflow deployment to avoid chaos and debugging nightmares.

We’ll cover the classics: hardcoded secrets, unbounded retries (hello, immortal task!), and the infamous spaghetti DAG where 200 tasks are lovingly connected by hand and no one dares open the Airflow UI anymore. If you’ve ever used datetime.now() in your DAG definition and watched your backfills implode, this talk is for you.
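
(A quick aside: the durable fix for that last one is to lean on the run’s logical data interval instead of wall-clock time; a minimal before/after, with illustrative names:)

    from airflow.decorators import task

    @task
    def export_partition(data_interval_start=None):
        # Anti-pattern: partition = datetime.now().strftime("%Y-%m-%d")
        # breaks backfills because every run sees "today"
        partition = data_interval_start.strftime("%Y-%m-%d")  # reproducible per run
        print(f"exporting partition {partition}")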

From the BashOperator that became sentient to the XCom that tried to pass a whole Pandas DataFrame and the key to your mother’s house, we’ll walk through real-world bloopers with practical takeaways. You’ll learn why overusing PythonOperator is a recipe for mess, how not to use sensors unless you enjoy resource starvation, and why scheduling in local timezones is basically asking for a daylight savings time horror story. Other highlights include:

  • Over-provisioning resources in KubernetesPodOperator: many teams allocate excessive memory/CPU “just in case”, leading to cluster contention and resource waste.
  • Dynamic task mapping gone wild: 10,000 mapped tasks later… the scheduler is still crying.
  • SLAs used as data quality guarantees: creating alerts so noisy, nobody listens.
  • Design-free DAGs: no docs, no comments, no idea why a task has a 3-day timeout.

Finally, we’ll round it out with some dos and don’ts: using environment variables, avoiding memory-hungry monolith DAGs, skipping global imports, and not allocating 10x more memory “just in case.” Whether you’re new to Airflow or battle-hardened from a thousand failed backfills, come learn how to scale your pipelines without losing your mind (or your cluster).

16:45 - 17:10. Columbia A
By Yuhang Huang & Arunav Gupta
Track: Community
10/09/2025 4:45 PM 10/09/2025 5:10 PM America/Los_Angeles AS25: Lessons learned for building open source Airflow operators at AWS

In this talk, we’ll share our journey and lessons learned from developing a new open-source Airflow operator that integrates a newly-launched AWS service with the Airflow ecosystem. This real-world case study will illuminate the complete lifecycle of building an Airflow operator, from initial design to successful community contribution.

We’ll dive deep into the practical challenges and solutions encountered throughout the journey, including:

  • Evaluating when to build a new operator versus extending existing ones
  • Navigating the Apache Airflow Open-source contribution process
  • Best practices for operator design and implementation
  • Key learnings and common pitfalls to avoid during the testing and release process

Whether you’re looking to contribute to Apache Airflow or build custom operators, this session will provide valuable insights into the development process, common pitfalls to avoid, and best practices when contributing to and collaborating with the Apache Airflow community.

Expect to leave with a practical roadmap for your own contributions and the confidence to successfully engage with the Apache Airflow ecosystem.

17:30 - 17:35. Columbia A
By Shahar Epstein
Track: Airflow & ...
10/09/2025 5:30 PM 10/09/2025 5:35 PM America/Los_Angeles AS25: Lightning talk: Supercharging Apache Airflow: Enhancing Core Components with Rust

Apache Airflow is a powerful workflow orchestrator, but as workloads grow, its Python-based components can become performance bottlenecks. This talk explores how Rust, with its speed, safety, and concurrency advantages, can enhance Airflow’s core components (e.g., the scheduler, the DAG processor). We’ll dive into the motivations behind using Rust, architectural trade-offs, and the challenges of bridging the gap between Python and Rust. A proof-of-concept showcasing an Airflow scheduler rewritten in Rust will demonstrate the potential benefits of this approach.
