Check out the full program for Airflow Summit.
If you prefer, you can also view this in the Sessionize layout or as a list of sessions.
Thursday, October 9, 2025
In this keynote, Peeyush Rai and Vikram Koka will walk through how Airflow is being used as part of an Agentic AI platform serving insurance companies, which runs on all the major public clouds, leveraging models from OpenAI, Google (Gemini), and AWS (Claude and Bedrock).
This talk walks through the details of the actual end-user business workflow, including gathering relevant financial data to make a decision, as well as the tricky challenge of handling AI hallucinations with new Airflow capabilities such as “Human in the loop”.
This talk offers something for both business and technical audiences. Business users will get a clear view of what it takes to bring an AI application into production and how to align their operations and business teams with an AI-enabled workflow. Meanwhile, technical users will walk away with practical insights on how to orchestrate complex business processes, enabling seamless collaboration between Airflow, AI agents, and humans in the loop.
How a Complete Beginner in Data Engineering / Junior Computer Science Student Became an Apache Airflow Committer in Just 5 Months—With 70+ PRs and 300 Hours of Contributions
This talk is aimed at those who are still hesitant about contributing to Apache Airflow. I hope to inspire and encourage anyone to take the first step and start their journey in open-source—let’s build together!
AI and ML pipelines built in Airflow often power critical business outcomes, but they rarely operate in isolation. In this hands-on workshop, learn how Control-M integrates with Airflow to orchestrate end-to-end workflows that include upstream and downstream enterprise systems like Supply Chain and Billing. Gain visibility, reliability, and seamless coordination across your data pipelines and the business operations they support.
Small retailers often lack the data visibility that larger companies rely on for decision-making. In this session, we’ll dive into how Apache Airflow powers end-to-end machine learning pipelines that process inventory and sales data, enabling retailers and suppliers to gain valuable industry insights. We’ll cover feature engineering, model training, and automated inference workflows, along with strategies for handling messy, incomplete retail data. We will discuss how Airflow enables scalable ML-driven insights that improve demand forecasting, product categorization, and supply chain optimization.
We’re excited to offer Airflow Summit 2025 attendees an exclusive opportunity to earn their DAG Authoring certification in person, now updated to include all the latest Airflow 3.0 features. This certification workshop comes at no additional cost to summit attendees.
The DAG Authoring for Apache Airflow certification validates your expertise in advanced Airflow concepts and demonstrates your ability to build production-grade data pipelines. It covers the TaskFlow API, dynamic task mapping, templating, asset-driven scheduling, best practices for production DAGs, and new Airflow 3.0 features and optimizations.
The certification session includes:
- 20-minute preparation period with expert guidance
- Live Q&A session with Marc Lamberti from Astronomer
- 60-minute examination period
- Real-time results and immediate feedback
To prepare for the Airflow Certification, visit the Astronomer Academy (https://academy.astronomer.io/page/astronomer-certification).
This talk explores EDB’s journey from siloed reporting to a unified data platform, powered by Airflow. We’ll delve into the architectural evolution, showcasing how Airflow orchestrates a diverse range of use cases, from Analytics Engineering to complex MLOps pipelines.
Learn how EDB leverages Airflow and Cosmos to integrate dbt for robust data transformations, ensuring data quality and consistency.
We’ll provide a detailed case study of our MLOps implementation, demonstrating how Airflow manages training, inference, and model monitoring pipelines for Azure Machine Learning models.
Discover the design considerations driven by our internal data governance framework and gain insights into our future plans for AIOps integration with Airflow.
In today’s data-driven world, scalable ML infrastructure is mission-critical. As ML workloads grow, orchestration tools like Apache Airflow become essential for managing pipelines, training, deployment, and observability. In this talk, I’ll share lessons from building distributed ML systems across cloud platforms, including GPU-based training and AI-powered healthcare. We’ll cover patterns for scaling Airflow DAGs, integrating telemetry and auto-healing, and aligning cross-functional teams. Whether you’re launching your first pipeline or managing ML at scale, you’ll gain practical strategies to make Airflow the backbone of your ML infrastructure.
As organizations scale their data infrastructure, Apache Airflow becomes a mission-critical component for orchestrating workflows efficiently. But scaling Airflow successfully isn’t just about running pipelines—it’s about building a Center of Excellence (CoE) that empowers teams with the right strategy, best practices, and long-term enablement. Join Jon Leek and Michelle Winters as they share their experiences helping customers design and implement Airflow Centers of Excellence. They’ll walk through real-world challenges, best practices, and the structured approach Astronomer takes to ensure teams have the right plan, resources, and support to succeed. Whether you’re just starting with Airflow or looking to optimize and scale your workflows, this session will give you a proven framework to build a sustainable Airflow Center of Excellence within your organization. 🚀
At Pinterest, there are over 10,000 DAGs supporting various use cases across different teams and roles. With this scale and diversity, user support has been an ongoing challenge to unlock productivity. As Airflow increasingly serves as a user interface to a variety of data and ML infrastructure behind the scenes, it’s common for issues from multiple areas to surface in Airflow, making triage and troubleshooting a challenge.
In this session, we will discuss the scale of the problem we are facing, how we have addressed it so far, and how we are introducing LLM AI to help solve this problem.
Columbia DAt Pinterest, there are over 10,000 DAGs supporting various use cases across different teams and roles. With this scale and diversity, user support has been an ongoing challenge to unlock productivity. As Airflow increasingly serves as a user interface to a variety of data and ML infrastructure behind the scenes, it’s common for issues from multiple areas to surface in Airflow, making triage and troubleshooting a challenge.
In this session, we will discuss the scale of the problem we are facing, how we have addressed it so far, and how we are introducing LLM AI to help solve this problem.
We face a paradox: we could use usage data to build better software, but collecting that data seems to contradict the very principles of user freedom that open source represents. Apache Airflow’s telemetry system, since purged, has become a battleground for this conflict, with some users voicing concerns over privacy while maintainers struggle to make informed decisions without data. What can we do to strike the right balance?
Custom operators are the secret weapon for solving Airflow’s unique and challenging orchestration problems.
This session will cover:
- When to build custom operators vs. using existing solutions
- Architecture patterns for creating maintainable, reusable operators
- Live coding demonstration: Building a custom operator from scratch
- Real-world examples: How custom operators solve specific business challenges
Through practical code examples and architecture patterns, attendees will walk away with the knowledge to implement custom operators that enhance their Airflow deployments.
This session is ideal for experienced Airflow users looking to extend functionality beyond out-of-the-box solutions.
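For readers who want a concrete picture before the session, here is a minimal sketch of the custom-operator pattern it builds on; the operator name, hook, and fields below are hypothetical, but the BaseOperator structure (constructor, template_fields, execute()) is standard Airflow.

```python
from airflow.models.baseoperator import BaseOperator
from airflow.utils.context import Context


class AuditSnapshotOperator(BaseOperator):
    """Illustrative custom operator: snapshots a table into an audit location.

    The "audit service" here is hypothetical; a real operator would delegate
    connection handling to a Hook rather than talking to the service directly.
    """

    # Fields rendered with Jinja before execute() runs
    template_fields = ("table_name", "snapshot_date")

    def __init__(self, *, table_name: str, snapshot_date: str = "{{ ds }}", **kwargs):
        super().__init__(**kwargs)
        self.table_name = table_name
        self.snapshot_date = snapshot_date

    def execute(self, context: Context):
        # Business logic lives here; keep it idempotent and retry-safe.
        self.log.info("Snapshotting %s for %s", self.table_name, self.snapshot_date)
        # e.g. hook = MyAuditHook(conn_id=...); hook.snapshot(self.table_name, ...)
        return f"{self.table_name}@{self.snapshot_date}"
```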
As the adoption of Airflow increases within large enterprises to orchestrate their data pipelines, more than one team needs to create, manage, and run their workflows in isolation. With multi-tenancy not yet supported natively in Airflow, customers are adopting alternate ways to enable multiple teams to share infrastructure. In this session, we will explore how GoDaddy uses MWAA to build a Single Pane Airflow setup for multiple teams with a common observability platform, and how this foundation enables orchestration expansion beyond data workflows to AI workflows as well. We’ll discuss our roadmap for leveraging upcoming Airflow 3 features, including the task execution API for enhanced workflow management and DAG versioning capabilities for comprehensive auditing and governance. This session will help attendees gain insights into the use case, the solution architecture, implementation challenges and benefits, and our strategic vision for unified orchestration across data and AI workloads.
Outline:
- About GoDaddy
- GoDaddy Data & AI Orchestration Vision
- Current State & Airflow Usage
- Airflow Monitoring & Observability
- Lessons Learned & Best Practices
- Airflow 3 Adoption
Last year, ‘From Oops to Ops’ showed how AI-powered failure analysis could help diagnose why Airflow tasks fail. But do we really need large, expensive cloud-based AI models to answer simple diagnostic questions? Relying on external AI APIs introduces privacy risks, unpredictable costs, and latency, often without clear benefits for this use case.
With the rise of distilled, open-source models, self-hosted failure analysis is now a practical alternative. This talk will explore how to deploy an AI service on infrastructure you control, compare cost, speed, and accuracy between OpenAI’s API and self-hosted models, and showcase a live demo of AI-powered task failure diagnosis using DeepSeek and Llama—running without external dependencies to keep data private and costs predictable.
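As a rough illustration of the pattern being compared here (not the speaker’s implementation), a task-level on_failure_callback can forward failure context to a self-hosted model; the Ollama-style endpoint, model name, and payload below are assumptions.

```python
import pendulum
import requests
from airflow.decorators import dag, task


def diagnose_failure(context):
    """Send minimal failure context to a self-hosted LLM endpoint (hypothetical URL)."""
    ti = context["task_instance"]
    payload = {
        "model": "llama3",  # assumed local model name
        "prompt": (
            f"Airflow task {ti.task_id} in DAG {ti.dag_id} failed on "
            f"{context['logical_date']}. Exception: {context.get('exception')}. "
            "Suggest likely root causes."
        ),
        "stream": False,
    }
    # Ollama-style local API; adjust for your own serving stack.
    resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=60)
    print(resp.json().get("response", "no diagnosis returned"))


@dag(schedule=None, start_date=pendulum.datetime(2025, 1, 1), catchup=False)
def failure_diagnosis_demo():
    @task(on_failure_callback=diagnose_failure)
    def flaky():
        raise ValueError("simulated failure")

    flaky()


failure_diagnosis_demo()
```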
In the age of Generative AI, knowledge bases are the backbone of intelligent systems, enabling them to deliver accurate and context-aware responses. But how do you ensure that these knowledge bases remain up-to-date and relevant in a rapidly changing world? Enter Apache Airflow, a robust orchestration tool that streamlines the automation of data workflows.
This talk will explore how Airflow can be leveraged to manage and update AI knowledge bases across multiple data sources. We’ll dive into the architecture, demonstrate how Airflow enables efficient data extraction, transformation, and loading (ETL), and share insights on tackling challenges like data consistency, scheduling, and scalability.
Whether you’re building your own AI-driven systems or looking to optimize existing workflows, this session will provide practical takeaways to make the most of Apache Airflow in orchestrating intelligent solutions.
Ready to take your Apache Airflow DAGs to the next level? This is an insightful session where we’ll uncover five transformative strategies to enhance your data workflows. Whether you’re a data engineering pro or just getting started, this presentation is packed with practical tips and actionable insights that you can apply right away.
We’ll dive into the magic of using powerful libraries like Pandas, share techniques to trim down data volumes for faster processing, and highlight the importance of modularizing your code for easier maintenance. Plus, you’ll discover efficient ways to monitor and debug your DAGs, and how to make the most of Airflow’s built-in features.
By the end of this session, you’ll have a toolkit of strategies to boost the efficiency and performance of your DAGs, making your data processing tasks smoother and more effective. Don’t miss out on this opportunity to elevate your Airflow DAGs!
KP Division of Research uses Airflow as a central technology for integrating diverse technologies in an agile setting. We wish to present a set of use-cases for AI/ML workloads, including imaging analysis (tissue segmentation, mammography), NLP (early identification of psychosis), LLM processing (identification of vessel diameter from radiological impressions), and other large data processing tasks. We create these “short-lived” project workflows to accomplish specific aims, and then may never run the job again, so leveraging generalized patterns is crucial to quickly implementing these jobs. Our Advanced Computational Infrastructure comprises multiple Kubernetes clusters, and we use Airflow to democratize the use of our batch-level resources in those clusters. We use Airflow form-based parameters to deploy pods running R and Python scripts where generalized parameters are injected into scripts that follow internal programming patterns. Finally, we also leverage Airflow to create headless services inside Kubernetes for large computational workloads (Spark & H2O) that subsequent pods consume ephemerally.
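A generic sketch of the form-based-parameter pattern described above (not KP’s internal code; the namespace, image, and script path are placeholders): Params exposed in the trigger form are templated into a KubernetesPodOperator.

```python
import pendulum
from airflow import DAG
from airflow.models.param import Param
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="run_r_script_on_k8s",
    schedule=None,
    start_date=pendulum.datetime(2025, 1, 1),
    catchup=False,
    params={
        # Shown as a form when the DAG is triggered manually.
        "script_path": Param("analysis/segmentation.R", type="string"),
    },
) as dag:
    KubernetesPodOperator(
        task_id="run_script",
        name="run-script",
        namespace="research-batch",                   # placeholder namespace
        image="registry.example.com/r-base:latest",   # placeholder image
        cmds=["Rscript"],
        # Trigger-form values are templated into the pod arguments.
        arguments=["{{ params.script_path }}"],
        get_logs=True,
    )
```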
Apache Airflow’s REST API has evolved to support diverse orchestration needs, with managed services like MWAA introducing custom enhancements. One such feature, InvokeRestApi, enables dynamic interactions with external services while maintaining Airflow’s core orchestration capabilities.
In this talk, we will explore the architectural design behind InvokeRestApi, detailing how it enhances API-driven workflows. Beyond the architecture, we’ll share key challenges and learnings from implementing and scaling Airflow’s REST API in production environments. Topics include authentication, performance considerations, error handling, and best practices for integrating external APIs efficiently.
Attendees will gain a deeper understanding of Airflow’s API extensibility, its implications for workflow automation, and actionable insights for building robust, API-driven orchestration solutions. Whether you’re an Airflow user or an architect, this session will provide valuable takeaways for simplifying API interactions across Airflow environments.
Trino is incredibly effective at enabling users to extract insights quickly and effectively from large amounts of data located in dispersed and heterogeneous federated data systems.
However, some business data problems are more complex than interactive analytics use cases, and are best broken down into a sequence of interdependent steps, a.k.a. a workflow. For these use cases, dedicated software is often required in order to schedule and manage these processes with a principled approach.
In this session, we will look at how we can leverage Apache Airflow to orchestrate Trino queries into complex workflows that solve practical batch processing problems, all the while avoiding the use of repetitive, redundant data movement.
Enterprises want the flexibility to operate across multiple clouds, whether to optimize costs, improve resiliency, avoid vendor lock-in, or meet data sovereignty requirements. But for developers, that flexibility usually comes at the cost of extra complexity and redundant code. The goal here is simple: write once, run anywhere, with minimum boilerplate. In Apache Airflow, we’ve already begun tackling this problem with abstractions like Common-SQL, which lets you write database queries once and run them on 20+ databases, from Snowflake to Postgres to SQLite to SAP HANA. Similarly, Common-IO standardizes cloud blob storage interactions across all public clouds. With Airflow 3.0, we are pushing this further by introducing a Common Message Bus provider: an abstraction initially supporting Amazon SQS and expanding to Google Pub/Sub and Apache Kafka soon after. We expect additional implementations such as Amazon Kinesis and Managed Kafka over time.
This talk will dive into why these abstractions matter, how they reduce friction for developers while giving enterprises true multi-cloud optionality, and what’s next for Airflow’s evolving provider ecosystem.
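As a small illustration of the Common-SQL abstraction mentioned above, the same task definition can run against any supported database by swapping the connection ID; the connection and table names below are placeholders.

```python
import pendulum
from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="common_sql_example",
    schedule="@daily",
    start_date=pendulum.datetime(2025, 1, 1),
    catchup=False,
) as dag:
    # The same operator works against Postgres, Snowflake, SQLite, etc.;
    # only conn_id (and the SQL dialect you write) changes.
    SQLExecuteQueryOperator(
        task_id="daily_rollup",
        conn_id="postgres_default",  # swap for e.g. "snowflake_default"
        sql="""
            INSERT INTO daily_orders_rollup
            SELECT order_date, count(*) FROM orders
            WHERE order_date = '{{ ds }}'
            GROUP BY order_date;
        """,
    )
```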
We run a similar pattern of DAGs for different data quality dimensions like accuracy, timeliness, and completeness. Building these again and again by hand would mean duplicating code and potentially introducing human error through copy-paste, or making people write the same code repeatedly.
To solve this, we are doing a few things:
- Generate DAGs dynamically via DAG Factory, using just a little YAML to describe all the steps we want to run in our DQ checks.
- Hide this behind a UI hooked into the GitHub PR-creation step: the user just provides some inputs or selects from dropdowns in the UI, and a YAML DAG is generated for them.
This highlights the potential for DAG Factory to hide Airflow Python code from users, making pipelines accessible to data analysts and business intelligence teams as well as software engineers, while also reducing human error. YAML is the perfect format for generating code and creating a PR, and DAG Factory is the perfect fit for that. All of this runs in GCP Cloud Composer.
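A hedged sketch of the general dag-factory pattern (not the team’s internal setup; the YAML path and task details are placeholders): the UI-generated YAML is loaded by a small Python stub that dag-factory turns into real DAGs.

```python
# dags/dq_checks.py -- stub that materializes DAGs from generated YAML.
# The YAML written by the UI on PR creation looks roughly like:
#
#   dq_accuracy_orders:
#     default_args: {owner: data-quality, retries: 1}
#     schedule_interval: "@daily"
#     tasks:
#       run_accuracy_check:
#         operator: airflow.providers.common.sql.operators.sql.SQLExecuteQueryOperator
#         conn_id: bigquery_default
#         sql: sql/accuracy_orders.sql
#
import dagfactory

config_file = "/home/airflow/gcs/dags/config/dq_checks.yml"  # placeholder path

dag_factory = dagfactory.DagFactory(config_file)
dag_factory.clean_dags(globals())
dag_factory.generate_dags(globals())
```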
In large organizations, multiple Apache Airflow instances often arise organically—driven by team-specific needs, distinct use cases, or tiered workloads. This fragmentation introduces complexity, operational overhead, and higher infrastructure costs. To address these challenges, we developed the “Orchestration Frederator,” a solution designed to unify and horizontally scale multiple Airflow deployments seamlessly.
This session will detail our journey in implementing Orchestration Frederator, highlighting how we achieved:
- Horizontal Scalability: Seamlessly scaling Airflow across multiple instances without operational overhead.
- End-to-End Data Lineage: Constructing comprehensive data lineage across disparate Airflow deployments to simplify monitoring and debugging.
- Multi-Region Support: Introducing multi-region capabilities, enhancing reliability and disaster recovery.
- Unified Ecosystem: Consolidating previously fragmented Airflow environments into a cohesive orchestration platform.
Join us to explore practical strategies, technical challenges, lessons learned, and best practices for enhancing scalability, reliability, and maintainability in large-scale Airflow deployments.
In the rapidly evolving field of data engineering and data science, efficiency and ease of use are crucial. Our innovative solution offers a user-friendly interface to manage and schedule custom PySpark, PySQL, Python, and SQL code, streamlining the process from development to production. Using Airflow at the backend, this tool eliminates the complexities of infrastructure management, version control, CI/CD processes, and workflow orchestration. The intuitive UI allows users to upload code, configure job parameters, and set schedules effortlessly, without the need for additional scripting or coding. Additionally, users have the flexibility to bring their own custom artifact repository and run their code. In summary, our solution significantly enhances the orchestration and scheduling of custom code, breaking down traditional barriers and empowering organizations to maximize their data’s potential and drive innovation efficiently. Whether you are an individual data scientist or part of a large data engineering team, this tool provides the resources needed to streamline your workflow and achieve your goals faster than ever before.
There was a post on the data engineering subreddit recently that discussed how difficult it is to keep up with the data engineering world.
Did you learn Hadoop? Great, we’re on Snowflake, BigQuery, and Databricks now.
Just learned Airflow? Well, now we have Airflow 3.0.
And the list goes on.
But what doesn’t change, and what have the lessons of the past decade been? That’s what I’ll be covering in this talk: real lessons and realities that come up time and time again, whether you’re working for a start-up or a large enterprise.
As data platforms grow in complexity, so do the orchestration needs behind them. Time-based (cron) scheduling has long been the default in Airflow, but dataset-based scheduling promises a more data-aware, efficient alternative. In this session, I’ll share lessons learned from operating Airflow at scale—supporting thousands of DAGs across teams with varied use cases, from simple ETL to complex ML workflows. We’ll explore when dataset scheduling makes sense, the challenges it introduces, and how to evolve your DAG design and platform architecture to make the most of it. Whether you’re migrating legacy workflows or designing new ones, this talk will help you evaluate the right scheduling model for your needs.
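For readers new to the feature, here is a minimal sketch of dataset-based scheduling (the dataset URI and task bodies are placeholders): a producer declares a Dataset as an outlet, and a consumer is scheduled on that Dataset instead of a cron expression.

```python
import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

orders_dataset = Dataset("s3://warehouse/orders/")  # placeholder URI


@dag(schedule="@hourly", start_date=pendulum.datetime(2025, 1, 1), catchup=False)
def produce_orders():
    @task(outlets=[orders_dataset])
    def load_orders():
        ...  # write the latest orders partition

    load_orders()


@dag(schedule=[orders_dataset], start_date=pendulum.datetime(2025, 1, 1), catchup=False)
def consume_orders():
    @task
    def build_report():
        ...  # runs whenever the producer updates the dataset

    build_report()


produce_orders()
consume_orders()
```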
This workshop will provide an overview of implementing operations research problems using Apache Airflow. This is a hands-on session where attendees will gain experience creating DAGs to define and manage workflows for classical operations research problems. The workshop will include several examples of how Airflow can be used to optimize and automate various decision-making processes, including:
- Inventory management: How to use Airflow to optimize inventory levels and reduce stockouts by analyzing demand patterns, lead times, and other factors.
- Production planning: How to use Airflow to create optimized production schedules that minimize downtime, reduce costs, and increase throughput.
- Logistics optimization: How to use Airflow to optimize transportation routes and other factors to improve the efficiency of logistics operations.
Attendees will come away with a solid understanding of using Airflow to automate decision-making processes with optimization solvers.
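As a flavor of what such a workshop DAG can look like (purely illustrative, using a textbook economic-order-quantity formula rather than actual workshop material), a TaskFlow pipeline can chain demand estimation into an optimization step:

```python
import math

import pendulum
from airflow.decorators import dag, task


@dag(schedule="@weekly", start_date=pendulum.datetime(2025, 1, 1), catchup=False)
def inventory_optimization_demo():
    @task
    def estimate_annual_demand() -> float:
        # In a real workflow this would aggregate sales history from a warehouse.
        return 12_000.0

    @task
    def optimal_order_quantity(annual_demand: float) -> float:
        order_cost = 50.0    # cost per order (placeholder)
        holding_cost = 2.5   # holding cost per unit per year (placeholder)
        # Classic economic order quantity: sqrt(2 * D * S / H)
        return math.sqrt(2 * annual_demand * order_cost / holding_cost)

    @task
    def publish(eoq: float):
        print(f"Recommended order quantity: {eoq:.0f} units")

    publish(optimal_order_quantity(estimate_annual_demand()))


inventory_optimization_demo()
```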
At LinkedIn, our data pipelines process exabytes of data, with our offline infrastructure executing 300K ETL workflows daily and 10K concurrent executions. Historically, these workloads ran on our legacy system, Azkaban, which faced UX, scalability, and operational challenges. To modernize our infra, we built a managed Airflow service, leveraging its enhanced developer & operator experience, rich feature set, and strong OSS community support. That initiated LinkedIn’s largest-ever infrastructure migration—transitioning thousands of legacy workflows to Airflow.
In this talk, we will share key lessons from migrating massive-scale pipelines with minimal production disruption. We will discuss:
- Overall Migration Strategy
- Custom Tooling Enhancements on testing, deployment, and observability
- Architectural Innovations decoupling orchestration and compute
- GenAI-powered Migration automating code rewrites
- Post-Migration Challenges & Airflow 3.0.
Attendees will walk away with battle-tested strategies for large-scale Airflow adoption and practical insights into scaling Airflow in enterprise environments.
As a popular open-source library for analytics engineering, dbt is often combined with Airflow. Orchestrating and executing dbt models as DAGs ensures an additional layer of control over tasks, observability, and provides a reliable, scalable environment to run dbt models.
This workshop will cover a step-by-step guide to Cosmos, a popular open-source package from Astronomer that helps you quickly run your dbt Core projects as Airflow DAGs and Task Groups, all with just a few lines of code. We’ll walk through:
- Running and visualising your dbt transformations
- Managing dependency conflicts
- Defining database credentials (profiles)
- Configuring source and test nodes
- Using dbt selectors
- Customising arguments per model
- Addressing performance challenges
- Leveraging deferrable operators
- Visualising dbt docs in the Airflow UI
- Example of how to deploy to production
- Troubleshooting
We encourage participants to bring their dbt project to follow this step-by-step workshop.
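As a preview of what the workshop builds toward, here is a hedged sketch of a Cosmos DbtDag; the project path, profile, and connection ID are placeholders, and the exact config classes may vary by Cosmos version.

```python
import pendulum
from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import PostgresUserPasswordProfileMapping

jaffle_shop_dbt = DbtDag(
    dag_id="jaffle_shop_dbt",
    schedule="@daily",
    start_date=pendulum.datetime(2025, 1, 1),
    catchup=False,
    # Path to the dbt project inside the Airflow deployment (placeholder).
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/jaffle_shop"),
    # Map an Airflow connection to a dbt profile instead of shipping profiles.yml.
    profile_config=ProfileConfig(
        profile_name="jaffle_shop",
        target_name="prod",
        profile_mapping=PostgresUserPasswordProfileMapping(
            conn_id="warehouse_postgres",        # placeholder Airflow connection
            profile_args={"schema": "analytics"},
        ),
    ),
)
```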
Per the Airflow community survey, Kubernetes is the most popular compute platform for running Airflow. When run on Kubernetes, Airflow gains many benefits out of the box, such as monitoring, reliability, ease of deployment, scalability, and autoscaling. On the other hand, running Airflow on Kubernetes means running a sophisticated distributed system on top of another distributed system, which makes troubleshooting Airflow task and DAG failures harder.
This session tackles that bottleneck head-on, introducing a practical approach to building an automated diagnostic pipeline for Airflow on Kubernetes. Imagine offloading tedious investigations to a system that, on task failure, automatically collects and correlates key signals from Kubernetes components (linking Airflow tasks to specific Pods and their events), Kubernetes/GKE monitoring, and relevant logs—pinpointing root causes and suggesting actionable fixes.
Attendees will leave with a clear understanding of common Airflow-on-Kubernetes failure patterns—and more importantly, a blueprint and practical strategies to reduce MTTR and boost team efficiency.
Ready to contribute to Apache Airflow? In this hands-on workshop, you’ll be expected to come prepared with your development environment already configured (Breeze installed is strongly recommended, but Codespaces works if you can’t install Docker). We’ll dive straight into finding issues that match your skills and walk you through the entire contribution process—from creating your first pull request to receiving community feedback. Whether you’re writing code, enhancing documentation, or offering feedback, there’s a place for you. Let’s get started and see your name among Airflow contributors!
Red Hat’s unified data and AI platform relies on Apache Airflow for orchestration, alongside Snowflake, Fivetran, and Atlan. The platform prioritizes building a dependable data foundation, recognizing that effective AI depends on quality data. Airflow was selected for its predictability, extensive connectivity, reliability, and scalability.
The platform now supports business analytics, transitioning from ETL to ELT processes. This has resulted in a remarkable improvement in how we make data available for business decisions.
The platform’s capabilities are being extended to power Digital Workers (AI agents) using large language models, encompassing model training, fine-tuning, and inference. Two Digital Workers are currently deployed, with more in development.
This presentation will detail the rationale and background of this evolution, followed by an explanation of the architectural decisions made and the challenges encountered and resolved throughout the process of transforming into an AI-enabled data platform to power Red Hat’s business.
Operating within the stringent regulatory landscape of Corporate Banking, Deutsche Bank relies heavily on robust data orchestration. This session explores how Deutsche Bank’s Corporate Bank leverages Apache Airflow across diverse environments, including both on-premises infrastructure and cloud platforms. Discover their approach to managing critical data & analytics workflows, encompassing areas like regulatory reporting, data integration and complex data processing pipelines. Gain insights into the architectural patterns and operational best practices employed to ensure compliance, security, and scalability when running Airflow at scale in a highly regulated, hybrid setting.
At Yahoo, we built a secure, scalable, and cost-efficient batch processing platform using Amazon MWAA to orchestrate Apache Flink jobs on EKS, managed by the Flink Kubernetes Operator. This setup enables dynamic job orchestration while meeting strict enterprise compliance standards.
In this session, we’ll share how Airflow DAGs:
- Dynamically launch, monitor, and clean up isolated Flink clusters per batch job, improving resource efficiency.
- Securely fetch EKS kubeconfig, submit FlinkDeployment CRDs using FlinkKubernetesOperator, and poll job status using Airflow sensors.
- Integrate IAM for access control and meet Yahoo’s security requirements, including mutual TLS (mTLS) with Athenz.
- Optimize for cost and resilience through automated cleanup of jobs and the operator, and handle job failures and retries.
Join us for practical strategies and lessons from Yahoo’s production-scale Flink workflows in a Kubernetes environment.
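A rough sketch of the submit-and-wait pattern described above (not Yahoo’s code): it assumes the Apache Flink provider’s FlinkKubernetesOperator and FlinkKubernetesSensor, and the manifest, namespace, and connection names are placeholders.

```python
import pendulum
from airflow import DAG
from airflow.providers.apache.flink.operators.flink_kubernetes import FlinkKubernetesOperator
from airflow.providers.apache.flink.sensors.flink_kubernetes import FlinkKubernetesSensor

with DAG(
    dag_id="flink_batch_job",
    schedule=None,
    start_date=pendulum.datetime(2025, 1, 1),
    catchup=False,
) as dag:
    # Submit a FlinkDeployment CRD for the Flink Kubernetes Operator to reconcile.
    submit = FlinkKubernetesOperator(
        task_id="submit_flink_job",
        namespace="flink-batch",                         # placeholder namespace
        application_file="flink/batch-deployment.yaml",  # placeholder CRD manifest
        kubernetes_conn_id="eks_default",                # placeholder connection
    )

    # Poll the FlinkDeployment until the job reaches a terminal state.
    wait = FlinkKubernetesSensor(
        task_id="wait_for_flink_job",
        application_name="batch-deployment",  # metadata.name from the CRD
        namespace="flink-batch",
        kubernetes_conn_id="eks_default",
        attach_log=True,
    )

    submit >> wait
```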
At OLX, we connect millions of people daily through our online marketplace while relying on robust data pipelines. In this talk, we explore how the DAG Factory concept elevates data governance, lineage, and discovery by centralizing operator logic and restricting direct DAG creation. This approach enforces code quality, optimizes resources, maintains infrastructure hygiene and enables smooth version upgrades. We then leverage consistent naming conventions in Airflow to build targeted namespaces, aligning teams with global policies while preserving autonomy. Integrating external tools like AWS Lake Formation and Open Metadata further unifies governance, making it straightforward to manage and secure data. This is critical when handling hundreds or even thousands of active DAGs. If the idea of storing 1,600 pipelines in one folder seems overwhelming, join us to learn how the DAG Factory concept simplifies pipeline management. We’ll also share insights from OLX, highlighting how thoughtful design fosters oversight, efficiency, and discoverability across diverse use cases.
Apache Airflow’s executor landscape has traditionally presented users with a clear trade-off: choose either the speed of local execution or the scalability, isolation and configurability of remote execution. The AWS Lambda Executor introduces a new paradigm that bridges this gap, offering near-local execution speeds with the benefits of remote containerization.
This talk will begin with a brief overview of Airflow’s executors, how they work and what they are responsible for, highlighting the compromises between different executors. We will explore the emerging niche for fast, yet remote execution and demonstrate how the AWS Lambda Executor fills this space. We will also address practical considerations when using such an executor, such as working within Lambda’s 15-minute execution limit, and how to mitigate this using multi-executor configuration.
Whether you’re new to Airflow or an experienced user, this session will provide valuable insights into task execution and how you can combine the best of both local and remote execution paradigms.
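To make the multi-executor idea concrete, here is a hedged sketch; the Lambda executor’s module path is an assumption to check against the Amazon provider docs for your version, and the task bodies are placeholders.

```python
import pendulum
from airflow.decorators import dag, task

# Assumed environment configuration listing both executors (first one is the default):
#   AIRFLOW__CORE__EXECUTOR="CeleryExecutor,airflow.providers.amazon.aws.executors.aws_lambda.AwsLambdaExecutor"
# The exact Lambda executor module path is an assumption; verify it in the provider docs.

LAMBDA_EXECUTOR = "airflow.providers.amazon.aws.executors.aws_lambda.AwsLambdaExecutor"


@dag(schedule="@hourly", start_date=pendulum.datetime(2025, 1, 1), catchup=False)
def hybrid_executor_demo():
    @task  # runs on the default (first-listed) executor
    def long_running_transform():
        ...

    # Route this quick task to Lambda; keep it well under the 15-minute limit.
    @task(executor=LAMBDA_EXECUTOR)
    def quick_notification():
        ...

    long_running_transform() >> quick_notification()


hybrid_executor_demo()
```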
DAGnostics seamlessly integrates Airflow Cluster Policy hooks to enforce governance from local DAG authoring through CI pipelines to production runtime. Learn how it closes validation gaps, collapses feedback loops from hours to seconds, and ensures consistent policies across stages. We examine current runtime-only enforcement and fractured CI checks, then unveil our architecture: a pluggable policy registry via Airflow entry points, local static analysis for pre-commit validation, GitHub Actions CI integration, and runtime hook enforcement. See real-world use cases: alerting standards, resource quotas, naming conventions, and exemption handling. Next, dive into implementation: authoring policies in Python, auto-discovery, cross-environment enforcement, upstream contribution, and testing strategies. We share LinkedIn’s metrics—2,000+ DAG repos, 10,000+ daily executions supporting trunk-based development across isolated teams/use-cases, and 78% fewer runtime violations—and lessons learned scaling policy-as-code at enterprise scale. Leave with a blueprint to adopt DAGnostics and strengthen your Airflow governance while preserving full compatibility with existing systems.
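For context on the underlying mechanism, cluster policies are plain Python hooks defined in airflow_local_settings.py; below is a minimal sketch of the kind of rules such a system enforces (the specific rules are illustrative, not LinkedIn’s).

```python
# airflow_local_settings.py -- picked up automatically if present on the config path.
from airflow.exceptions import AirflowClusterPolicyViolation
from airflow.models import DAG
from airflow.models.baseoperator import BaseOperator


def dag_policy(dag: DAG) -> None:
    """Reject DAGs that do not declare an owning-team tag (illustrative rule)."""
    if not dag.tags:
        raise AirflowClusterPolicyViolation(
            f"DAG {dag.dag_id} must declare at least one tag identifying its owning team."
        )


def task_policy(task: BaseOperator) -> None:
    """Enforce retry standards on every task (illustrative rule)."""
    if task.retries < 1:
        raise AirflowClusterPolicyViolation(
            f"Task {task.task_id} must configure at least one retry."
        )
```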
Metadata management is a cornerstone of effective data governance, yet it presents unique challenges distinct from traditional data engineering. At scale, efficiently extracting metadata from relational and NoSQL databases demands specialized solutions. To address this, our team has developed custom Airflow operators that scan and extract metadata across various database technologies, orchestrating 100+ production jobs to ensure continuous and reliable metadata collection.
Now, we’re expanding beyond databases to tackle non-traditional data sources such as file repositories and message queues. This shift introduces new complexities, including processing structured and unstructured files, managing schema evolution in streaming data, and maintaining metadata consistency across heterogeneous sources. In this session, we’ll share our approach to building scalable metadata scanners, optimizing performance, and ensuring adaptability across diverse data environments. Attendees will gain insights into designing efficient metadata pipelines, overcoming common pitfalls, and leveraging Airflow to drive metadata governance at scale.
This session details practical strategies for introducing Apache Airflow in strict, compliance-heavy organizations. Learn how on-premise deployment and hybrid tooling can help modernize legacy workflows when public cloud solutions and container technologies are restricted. Discover how cross-platform engineering teams can collaborate securely using CI/CD bridges, and what it takes to meet rigorous security and governance standards. Key lessons address navigating resistance to change, achieving production sign-off, and avoiding common compliance pitfalls, relevant to anyone automating in public sector settings.
As modern data ecosystems grow in complexity, ensuring transparency, discoverability, and governance in data workflows becomes critical. Apache Airflow, a powerful workflow orchestration tool, enables data engineers to build scalable pipelines, but without proper visibility into data lineage, ownership, and quality, teams risk operating in a black box.
In this talk, we will explore how integrating Airflow with a data catalog can bring clarity and transparency to data workflows. We’ll discuss how metadata-driven orchestration enhances data governance, enables lineage tracking, and improves collaboration across teams. Through real-world use cases, we will demonstrate how Airflow can automate metadata collection, update data catalogs dynamically, and ensure data quality at every stage of the pipeline.
Attendees will walk away with practical strategies for implementing a transparent data workflow that fosters trust, efficiency, and compliance in their data infrastructure.
Columbia C
This session explores how to bring unit testing to SQL pipelines using Airflow. I’ll walk through the development of a SQL testing library that allows isolated testing of SQL logic by injecting mock data into base tables. To support this, we built a type system for AWS Glue tables using Pydantic, enabling schema validation and mock data generation. Over time, this type system also powered production data quality checks via a custom Airflow operator. Learn how this approach improves reliability, accelerates development, and scales testing across data workflows.
Columbia D
The Bloomberg Data Platform Engineering team is responsible for managing, storing, and providing access to business and financial data used by financial professionals across the global capital markets. Our team utilizes Apache Airflow to orchestrate data workflows across various applications and Bloomberg Terminal functions. Over the years, we have fine-tuned our Airflow cluster to handle more than 1,000 ingestion DAGs, which has presented unique scalability challenges. In this session, we will share insights into several key Airflow parameters — some of which you may not be all that familiar with — that our team uses to optimize and scale the platform effectively.
Columbia A
We will explore how Apache Airflow 3 unlocks new possibilities for smarter, more flexible DAG design. We’ll start by breaking down common anti-patterns in early DAG implementations, such as hardcoded operators, duplicated task logic, and rigid sequencing, that lead to brittle, unscalable workflows. From there, we’ll show how refactoring with the D.R.Y. (Don’t Repeat Yourself) principle, using techniques like task factories, parameterization, dynamic task mapping, and modular DAG construction, transforms these workflows into clean, reusable patterns.
With Airflow 3, these strategies go further: enabling DAGs that are reusable across both batch pipelines and streaming/event-driven workloads, while also supporting ad-hoc runs for testing, one-off jobs, or backfills. The result is not just more concise code, but workflows that can flexibly serve different data processing modes without duplication. Attendees will leave with concrete patterns and best practices for building maintainable, production-grade DAGs that are scalable, observable, and aligned with modern data engineering standards.
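To make the refactoring patterns above concrete, here is a minimal, hedged sketch (not taken from the talk) of a task factory combined with dynamic task mapping via the TaskFlow API; the source names and partition list are illustrative assumptions.

```python
# A minimal sketch of two of the patterns mentioned above: a parameterized
# "task factory" and dynamic task mapping with the TaskFlow API.
from datetime import datetime

from airflow.decorators import dag, task


def make_extract_task(source: str):
    """Task factory: builds an extract task for a given source instead of copy-pasting operators."""

    @task(task_id=f"extract_{source}")
    def extract():
        # Placeholder for the real extraction logic.
        return f"rows from {source}"

    return extract()


@dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False)
def dry_pipeline():
    @task
    def list_partitions() -> list[str]:
        # In practice this could query a metastore; hardcoded here for the sketch.
        return ["2025-01-01", "2025-01-02"]

    @task
    def process(partition: str) -> str:
        return f"processed {partition}"

    # Dynamic task mapping: one mapped task instance per partition at runtime.
    process.expand(partition=list_partitions())

    # Task factory: the same logic reused for several sources without duplication.
    for source in ["orders", "customers"]:
        make_extract_task(source)


dry_pipeline()
```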
Beckler
Ever seen a DAG go rogue and deploy itself? Or try to time travel back to 1999? Join us for a light-hearted yet painfully relatable look at how not to scale your Airflow deployment to avoid chaos and debugging nightmares.
We’ll cover the classics: hardcoded secrets, unbounded retries (hello, immortal task!), and the infamous spaghetti DAG where 200 tasks are lovingly connected by hand and no one dares open the Airflow UI anymore. If you’ve ever used datetime.now() in your DAG definition and watched your backfills implode, this talk is for you.
From the BashOperator that became sentient to the XCom that tried to pass a whole Pandas DataFrame and the key to your mother’s house, we’ll walk through real-world bloopers with practical takeaways. You’ll learn why overusing PythonOperator is a recipe for mess, how not to use sensors unless you enjoy resource starvation, and why scheduling in local timezones is basically asking for a daylight savings time horror story. Other highlights include:
Over-provisioning resources in KubernetesPodOperator: many teams allocate excessive memory/CPU “just in case”, leading to cluster contention and resource waste.
Dynamic task mapping gone wild: 10,000 mapped tasks later… the scheduler is still crying.
SLAs used as data quality guarantees: creating alerts so noisy, nobody listens.
Design-free DAGs: no docs, no comments, no idea why a task has a 3-day timeout.
Finally, we’ll round it out with some dos and don’ts: using environment variables, avoiding memory-hungry monolith DAGs, skipping global imports, and not allocating 10x more memory “just in case.” Whether you’re new to Airflow or battle-hardened from a thousand failed backfills, come learn how to scale your pipelines without losing your mind (or your cluster).
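To make the datetime.now() pitfall concrete, here is a small illustrative sketch (not from the talk): the anti-pattern is shown only in a comment, and the task reads the scheduler-provided logical date instead; the DAG name and schedule are placeholders.

```python
# Hedged illustration of the datetime.now() pitfall mentioned above.
import pendulum
from airflow.decorators import dag, task

# Anti-pattern: evaluated every time the DAG file is parsed, so backfills and
# retries each see a different "now", e.g.:
#   run_date = datetime.now()   # don't do this at DAG-definition time


@dag(start_date=pendulum.datetime(2025, 1, 1, tz="UTC"), schedule="@daily", catchup=False)
def no_dst_horror_stories():
    @task
    def process(logical_date=None):
        # The scheduler-provided logical date is stable across retries and
        # backfills, and pinning the DAG to UTC avoids daylight-savings surprises.
        return f"processing partition {logical_date.date()}"

    process()


no_dst_horror_stories()
```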
Beckler
In today’s data-driven world, effective workflow management and AI are crucial for success. However, there’s a notable gap between Airflow and AI. Our presentation offers a solution to close this gap.
We propose an MCP (Model Context Protocol) server to act as a bridge. We’ll dive into two paths:
- AI-Augmented Airflow: Enhancing Airflow with AI to improve error handling, automate DAG generation, proactively detect issues, and optimize resource use.
- Airflow-Powered AI: Utilizing Airflow’s reliability to empower LLMs in executing complex tasks, orchestrating AI agents, and supporting decision-making with real-time data.
Key takeaways:
- Understanding how to integrate AI insights directly into your workflow orchestration.
- Learning how MCP empowers AI with robust orchestration capabilities, offering full logging, monitoring, and auditability.
- Gaining insights into how to transform LLMs from a reactive responder into a proactive, intelligent, and reliable executor.
We invite you to explore how MCP can help workflow management, making AI-driven decisions more reliable and turning workflow systems into intelligent, autonomous agents.
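As a rough illustration of the "Airflow-Powered AI" path, here is a hypothetical sketch of the kind of call an MCP tool might wrap so an LLM acts through Airflow rather than around it. It assumes an Airflow 2.x deployment with the stable REST API and basic auth enabled; the URL, credentials, and DAG id are placeholders, not part of the talk.

```python
# Hypothetical MCP-style tool: the LLM decides *what* to run, while Airflow
# provides logging, retries, monitoring, and auditability for *how* it runs.
import requests


def trigger_dag_run(dag_id: str, conf: dict) -> dict:
    resp = requests.post(
        f"http://localhost:8080/api/v1/dags/{dag_id}/dagRuns",  # Airflow 2.x stable REST API
        json={"conf": conf},
        auth=("airflow", "airflow"),  # placeholder credentials
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


# Example: an agent requesting a reprocessing run with explicit parameters.
# trigger_dag_run("reprocess_invoices", {"date": "2025-10-09"})
```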
Columbia D
As Data Engineers, our jobs regularly include scheduling or scaling workflows.
But have you ever asked yourself: can I scale my scheduling?
It turns out that you can! But doing so raises a number of issues that need to be addressed.
In this talk we’ll be:
- Recapping Asset-aware scheduling in Apache Airflow
- Discussing diverse methods to scale up our scheduling
- Solving the issue of keeping our Airflow Assets synchronized between instances
- Comparing our own push-based solution with the built-in solution from AIP-82, and the pros and cons of each method.
I hope you will enjoy it!
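As a quick refresher for the first bullet above, here is a minimal sketch of asset-aware (data-aware) scheduling: the Dataset API in Airflow 2.x, renamed Asset in Airflow 3. The table URI and DAG names are illustrative assumptions, not material from the talk.

```python
# Producer/consumer pair wired together through a dataset rather than a cron schedule.
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

orders_table = Dataset("s3://warehouse/orders")  # placeholder URI


@dag(schedule="@hourly", start_date=datetime(2025, 1, 1), catchup=False)
def producer():
    @task(outlets=[orders_table])
    def load_orders():
        ...  # writing the table marks the asset as updated

    load_orders()


@dag(schedule=[orders_table], start_date=datetime(2025, 1, 1), catchup=False)
def consumer():
    @task
    def aggregate():
        ...  # runs only after the producer updates the asset

    aggregate()


producer()
consumer()
```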
Columbia C
Before Airflow, our BigQuery pipelines at Create Music Group operated like musicians without a conductor—each playing on its own schedule, regardless of whether upstream data was ready. As our data platform grew, this chaos led to spiralling costs, performance bottlenecks, and became utterly unsustainable.
This talk tells the story of how Create Music Group brought harmony to its data workflows by adopting Apache Airflow and the Medallion architecture, ultimately slashing our data processing costs by 50%. We’ll show how moving to event-driven scheduling with datasets helped eliminate stale data issues, dramatically improved performance, and unlocked faster iteration across teams. Discover how we replaced repetitive SQL with standardized dimension/fact tables, empowering analysts in a safer sandbox.
Columbia A
In this talk, we’ll share our journey and lessons learned from developing a new open-source Airflow operator that integrates a newly-launched AWS service with the Airflow ecosystem. This real-world case study will illuminate the complete lifecycle of building an Airflow operator, from initial design to successful community contribution.
We’ll dive deep into the practical challenges and solutions encountered throughout the journey, including:
- Evaluating when to build a new operator versus extending existing ones
- Navigating the Apache Airflow Open-source contribution process
- Best practices for operator design and implementation
- Key learnings and common pitfalls to avoid during the testing and release process
Whether you’re looking to contribute to Apache Airflow or build custom operators, this session will provide valuable insights into the development process, common pitfalls to avoid, and best practices when contributing to and collaborating with the Apache Airflow community.
Expect to leave with a practical roadmap for your own contributions and the confidence to successfully engage with the Apache Airflow ecosystem.
Columbia A
Apache Airflow is a powerful workflow orchestrator, but as workloads grow, its Python-based components can become performance bottlenecks. This talk explores how Rust, with its speed, safety, and concurrency advantages, can enhance Airflow’s core components (e.g., the scheduler, DAG processor, etc.). We’ll dive into the motivations behind using Rust, architectural trade-offs, and the challenges of bridging the gap between Python and Rust. A proof-of-concept showcasing an Airflow scheduler rewritten in Rust will demonstrate the potential benefits of this approach.
Columbia A
In this keynote, Peeyush Rai and Vikram Koka will be walking through how Airflow is being used as part of a Agentic AI platform servicing insurance companies, which runs on all the major public clouds, leveraging models from Open AI, Google (Gemini), AWS (Claude and Bedrock).
This talk walks through the details of the actual end user business workflow including gathering relevant financial data to make a decision, as well as the tricky challenge of handling AI hallucinations, with new Airflow capabilities such as “Human in the loop”.
This talk offers something for both business and technical audiences. Business users will get a clear view of what it takes to bring an AI application into production and how to align their operations and business teams with an AI enabled workflow. Meanwhile, technical users will walk away with practical insights on how to orchestrate complex business processes enabling a seamless collaboration between Airflow, AI Agents and Human in the loop.
Columbia A
This talk explores EDB’s journey from siloed reporting to a unified data platform, powered by Airflow. We’ll delve into the architectural evolution, showcasing how Airflow orchestrates a diverse range of use cases, from Analytics Engineering to complex MLOps pipelines.
Learn how EDB leverages Airflow and Cosmos to integrate dbt for robust data transformations, ensuring data quality and consistency.
We’ll provide a detailed case study of our MLOps implementation, demonstrating how Airflow manages training, inference, and model monitoring pipelines for Azure Machine Learning models.
Discover the design considerations driven by our internal data governance framework and gain insights into our future plans for AIOps integration with Airflow.
Columbia A
Small retailers often lack the data visibility that larger companies rely on for decision-making. In this session, we’ll dive into how Apache Airflow powers end-to-end machine learning pipelines that process inventory and sales data, enabling retailers and suppliers to gain valuable industry insights. We’ll cover feature engineering, model training, and automated inference workflows, along with strategies for handling messy, incomplete retail data. We will discuss how Airflow enables scalable ML-driven insights that improve demand forecasting, product categorization, and supply chain optimization.
Columbia C
In today’s data-driven world, scalable ML infrastructure is mission-critical. As ML workloads grow, orchestration tools like Apache Airflow become essential for managing pipelines, training, deployment, and observability. In this talk, I’ll share lessons from building distributed ML systems across cloud platforms, including GPU-based training and AI-powered healthcare. We’ll cover patterns for scaling Airflow DAGs, integrating telemetry and auto-healing, and aligning cross-functional teams. Whether you’re launching your first pipeline or managing ML at scale, you’ll gain practical strategies to make Airflow the backbone of your ML infrastructure.
Columbia D
How a Complete Beginner in Data Engineering / Junior Computer Science Student Became an Apache Airflow Committer in Just 5 Months—With 70+ PRs and 300 Hours of Contributions
This talk is aimed at those who are still hesitant about contributing to Apache Airflow. I hope to inspire and encourage anyone to take the first step and start their journey in open-source—let’s build together!
Beckler
We’re excited to offer Airflow Summit 2025 attendees an exclusive opportunity to earn their DAG Authoring certification in person, now updated to include all the latest Airflow 3.0 features. This certification workshop comes at no additional cost to summit attendees.
The DAG Authoring for Apache Airflow certification validates your expertise in advanced Airflow concepts and demonstrates your ability to build production-grade data pipelines. It covers TaskFlow API, Dynamic task mapping, Templating, Asset-driven scheduling, Best practices for production DAGs, and new Airflow 3.0 features and optimizations.
The certification session includes:
- 20-minute preparation period with expert guidance
- Live Q&A session with Marc Lamberti from Astronomer
- 60-minute examination period
- Real-time results and immediate feedback
To prepare for the Airflow Certification, visit the Astronomer Academy (https://academy.astronomer.io/page/astronomer-certification).
301AI and ML pipelines built in Airflow often power critical business outcomes, but they rarely operate in isolation. In this hands-on workshop, learn how Control-M integrates with Airflow to orchestrate end-to-end workflows that include upstream and downstream enterprise systems like Supply Chain and Billing. Gain visibility, reliability, and seamless coordination across your data pipelines and the business operations they support.
305
We face a paradox: we could use usage data to build better software, but collecting that data seems to contradict the very principles of user freedom that open source represents. Apache Airflow’s telemetry system (already purged) has become a battleground for this conflict, with some users voicing concerns over privacy while maintainers struggle to make informed decisions without data. What can we do to strike the right balance?
Columbia A
As organizations scale their data infrastructure, Apache Airflow becomes a mission-critical component for orchestrating workflows efficiently. But scaling Airflow successfully isn’t just about running pipelines—it’s about building a Center of Excellence (CoE) that empowers teams with the right strategy, best practices, and long-term enablement. Join Jon Leek and Michelle Winters as they share their experiences helping customers design and implement Airflow Centers of Excellence. They’ll walk through real-world challenges, best practices, and the structured approach Astronomer takes to ensure teams have the right plan, resources, and support to succeed. Whether you’re just starting with Airflow or looking to optimize and scale your workflows, this session will give you a proven framework to build a sustainable Airflow Center of Excellence within your organization. 🚀
Columbia C
At Pinterest, there are over 10,000 DAGs supporting various use cases across different teams and roles. With this scale and diversity, user support has been an ongoing challenge to unlock productivity. As Airflow increasingly serves as a user interface to a variety of data and ML infrastructure behind the scenes, it’s common for issues from multiple areas to surface in Airflow, making triage and troubleshooting a challenge.
In this session, we will discuss the scale of the problem we are facing, how we have addressed it so far, and how we are introducing LLM AI to help solve this problem.
Columbia D
In the age of Generative AI, knowledge bases are the backbone of intelligent systems, enabling them to deliver accurate and context-aware responses. But how do you ensure that these knowledge bases remain up-to-date and relevant in a rapidly changing world? Enter Apache Airflow, a robust orchestration tool that streamlines the automation of data workflows.
This talk will explore how Airflow can be leveraged to manage and update AI knowledge bases across multiple data sources. We’ll dive into the architecture, demonstrate how Airflow enables efficient data extraction, transformation, and loading (ETL), and share insights on tackling challenges like data consistency, scheduling, and scalability.
Whether you’re building your own AI-driven systems or looking to optimize existing workflows, this session will provide practical takeaways to make the most of Apache Airflow in orchestrating intelligent solutions.
Columbia A
As the adoption of Airflow increases within large enterprises to orchestrate their data pipelines, more than one team needs to create, manage, and run their workflows in isolation. With multi-tenancy not yet supported natively in Airflow, customers are adopting alternate ways to enable multiple teams to share infrastructure. In this session, we will explore how GoDaddy uses MWAA to build a Single Pane Airflow setup for multiple teams with a common observability platform, and how this foundation enables orchestration expansion beyond data workflows to AI workflows as well. We’ll discuss our roadmap for leveraging upcoming Airflow 3 features, including the task execution API for enhanced workflow management and DAG versioning capabilities for comprehensive auditing and governance. This session will help attendees gain insights into the use case, the solution architecture, implementation challenges and benefits, and our strategic vision for unified orchestration across data and AI workloads.
Outline:
- About GoDaddy
- GoDaddy Data & AI Orchestration Vision
- Current State & Airflow Usage
- Airflow Monitoring & Observability
- Lessons Learned & Best Practices
- Airflow 3 Adoption
Last year, ‘From Oops to Ops’ showed how AI-powered failure analysis could help diagnose why Airflow tasks fail. But do we really need large, expensive cloud-based AI models to answer simple diagnostic questions? Relying on external AI APIs introduces privacy risks, unpredictable costs, and latency, often without clear benefits for this use case.
With the rise of distilled, open-source models, self-hosted failure analysis is now a practical alternative. This talk will explore how to deploy an AI service on infrastructure you control, compare cost, speed, and accuracy between OpenAI’s API and self-hosted models, and showcase a live demo of AI-powered task failure diagnosis using DeepSeek and Llama—running without external dependencies to keep data private and costs predictable.
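As a rough sketch of how such a setup could be wired in (not the speakers' implementation), the snippet below sends a failed task's exception to a locally hosted model. It assumes something like Ollama serving a model at localhost:11434; the endpoint, model name, and prompt are assumptions.

```python
# Hedged sketch: an on_failure_callback that asks a self-hosted model for a diagnosis.
import requests


def diagnose_failure(context):
    ti = context["task_instance"]
    prompt = (
        f"Airflow task {ti.dag_id}.{ti.task_id} failed with: "
        f"{context.get('exception')}. Suggest likely root causes."
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",  # local model server, no external API
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=60,
    )
    ti.log.info("Self-hosted diagnosis: %s", resp.json().get("response"))


# Attach per task or via default_args, e.g.:
# default_args = {"on_failure_callback": diagnose_failure}
```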
Columbia D
Custom operators are the secret weapon for solving Airflow’s unique & challenging orchestration problems.
This session will cover:
- When to build custom operators vs. using existing solutions
- Architecture patterns for creating maintainable, reusable operators
- Live coding demonstration: Building a custom operator from scratch
- Real-world examples: How custom operators solve specific business challenges
Through practical code examples and architecture patterns, attendees will walk away with the knowledge to implement custom operators that enhance their Airflow deployments.
This session is ideal for experienced Airflow users looking to extend functionality beyond out-of-the-box solutions.
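For orientation, here is a minimal sketch of the custom-operator pattern the session covers; the operator name, parameters, and behavior are hypothetical placeholders rather than a real provider.

```python
# A bespoke business action wrapped behind a reusable, testable operator interface.
from airflow.models.baseoperator import BaseOperator


class PublishReportOperator(BaseOperator):
    """Publishes a named report to a destination (illustrative example)."""

    template_fields = ("report_name",)  # allow Jinja templating, e.g. "report_{{ ds }}"

    def __init__(self, report_name: str, destination: str, **kwargs):
        super().__init__(**kwargs)
        self.report_name = report_name
        self.destination = destination

    def execute(self, context):
        # Real implementations would delegate to a hook here; we just log for the sketch.
        self.log.info("Publishing %s to %s", self.report_name, self.destination)
        return f"{self.destination}/{self.report_name}"
```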
Beckler
KP Division of Research uses Airflow as a central technology for integrating diverse technologies in an agile setting. We wish to present a set of use-cases for AI/ML workloads, including imaging analysis (tissue segmentation, mammography), NLP (early identification of psychosis), LLM processing (identification of vessel diameter from radiological impressions), and other large data processing tasks. We create these “short-lived” project workflows to accomplish specific aims, and then may never run the job again, so leveraging generalized patterns are crucial to quickly implementing these jobs. Our Advanced Computational Infrastructure is comprised of multiple Kubernetes clusters, and we use Airflow to democratize the use of our batch level resources in those clusters. We use Airflow form-based parameters to deploy pods running R and Python scripts where generalized parameters are injected into scripts that follow internal programming patterns. Finally, we also leverage Airflow to create headless services inside Kubernetes for large computational workloads (Spark & H2O) that subsequent pods consume ephemerally.
Columbia A
Trino is incredibly effective at enabling users to extract insights quickly and effectively from large amounts of data located in dispersed and heterogeneous federated data systems.
However, some business data problems are more complex than interactive analytics use cases, and are best broken down into a sequence of interdependent steps, a.k.a. a workflow. For these use cases, dedicated software is often required in order to schedule and manage these processes with a principled approach.
In this session, we will look at how we can leverage Apache Airflow to orchestrate Trino queries into complex workflows that solve practical batch processing problems, all the while avoiding the use of repetitive, redundant data movement.
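As a flavor of what this can look like, here is a hedged sketch of chaining Trino queries as dependent workflow steps. It assumes the common-sql and Trino providers are installed and a `trino_default` connection exists; catalog and table names are illustrative.

```python
# Two Trino statements orchestrated as a dependency, keeping data in the lake
# instead of moving it between tools.
from datetime import datetime

from airflow.decorators import dag
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator


@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def trino_batch():
    stage = SQLExecuteQueryOperator(
        task_id="stage_orders",
        conn_id="trino_default",
        sql="CREATE TABLE IF NOT EXISTS lake.staging.orders AS SELECT * FROM src.raw.orders",
    )
    aggregate = SQLExecuteQueryOperator(
        task_id="aggregate_orders",
        conn_id="trino_default",
        sql="INSERT INTO lake.marts.daily_orders SELECT order_date, count(*) FROM lake.staging.orders GROUP BY 1",
    )
    stage >> aggregate  # downstream step runs only after staging succeeds


trino_batch()
```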
Columbia C
Ready to take your DAGs in Apache Airflow to the next level? This is an insightful session where we’ll uncover 5 transformative strategies to enhance your data workflows. Whether you’re a data engineering pro or just getting started, this presentation is packed with practical tips and actionable insights that you can apply right away.
We’ll dive into the magic of using powerful libraries like Pandas, share techniques to trim down data volumes for faster processing, and highlight the importance of modularizing your code for easier maintenance. Plus, you’ll discover efficient ways to monitor and debug your DAGs, and how to make the most of Airflow’s built-in features.
By the end of this session, you’ll have a toolkit of strategies to boost the efficiency and performance of your DAGs, making your data processing tasks smoother and more effective. Don’t miss out on this opportunity to elevate your Airflow DAGs!
Columbia D
Apache Airflow’s REST API has evolved to support diverse orchestration needs, with managed services like MWAA introducing custom enhancements. One such feature, InvokeRestApi, enables dynamic interactions with external services while maintaining Airflow’s core orchestration capabilities.
In this talk, we will explore the architectural design behind InvokeRestApi, detailing how it enhances API-driven workflows. Beyond the architecture, we’ll share key challenges and learnings from implementing and scaling Airflow’s REST API in production environments. Topics include authentication, performance considerations, error handling, and best practices for integrating external APIs efficiently.
Attendees will gain a deeper understanding of Airflow’s API extensibility, its implications for workflow automation, and actionable insights for building robust, API-driven orchestration solutions. Whether you’re an Airflow user or an architect, this session will provide valuable takeaways for simplifying API interactions across Airflow environments.
Beckler
Enterprises want the flexibility to operate across multiple clouds, whether to optimize costs, improve resiliency, avoid vendor lock-in, or meet data sovereignty requirements. But for developers, that flexibility usually comes at the cost of extra complexity and redundant code. The goal here is simple: write once, run anywhere, with minimum boilerplate. In Apache Airflow, we’ve already begun tackling this problem with abstractions like Common-SQL, which lets you write database queries once and run them on 20+ databases, from Snowflake to Postgres to SQLite to SAP HANA. Similarly, Common-IO standardizes cloud blob storage interactions across all public clouds. With Airflow 3.0, we are pushing this further by introducing a Common Message Bus provider: an abstraction initially supporting Amazon SQS and expanding to Google Pub/Sub and Apache Kafka soon after. We expect additional implementations such as Amazon Kinesis and Managed Kafka over time.
This talk will dive into why these abstractions matter, how they reduce friction for developers while giving enterprises true multi-cloud optionality, and what’s next for Airflow’s evolving provider ecosystem.
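To make the "write once, run anywhere" idea tangible, here is a small sketch using the Common-IO ObjectStoragePath abstraction (available since Airflow 2.8). The bucket name and connection id are assumptions; switching clouds means switching the URL and connection, not the code.

```python
# Same file-handling code regardless of the underlying object store.
from airflow.io.path import ObjectStoragePath

# Connection id embedded in the URL; could be a gs:// or Azure-backed path instead.
base = ObjectStoragePath("s3://aws_default@example-bucket/raw/")


def copy_latest(filename: str) -> str:
    src = base / filename
    dst = ObjectStoragePath("s3://aws_default@example-bucket/processed/") / filename
    with src.open("rb") as f_in, dst.open("wb") as f_out:
        f_out.write(f_in.read())
    return str(dst)
```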
Columbia A
We have a similar pattern of DAGs running for different data quality dimensions like accuracy, timeliness, and completeness. Building these again and again would mean duplicating code, and copy-pasting or asking people to rewrite the same logic introduces the potential for human error.
To solve this, we are doing a few things:
- Run DAGs via DagFactory to dynamically generate DAGs using just some YAML code for all the steps we want to run in our DQ checks.
- Hide this behind a UI hooked into the GitHub PR-creation step: the user just provides some inputs or selects from dropdowns in the UI, and a YAML DAG is generated for them.
This highlights the potential for DagFactory to hide Airflow Python code from users and make it accessible to data analysts and business intelligence teams as well as software engineers, while reducing human error. YAML is the perfect format for generating code and creating a PR, and DagFactory is the perfect fit for that. All of this runs in GCP Cloud Composer.
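For readers unfamiliar with dag-factory, here is a hedged sketch of the general approach: a generated YAML spec plus a tiny loader file in the DAGs folder. The paths, task names, and command are illustrative, and the exact YAML keys depend on the dag-factory version in use.

```python
# --- dq_checks.yml (generated from the UI form / PR), shown here as comments ---
# dq_completeness_orders:
#   default_args:
#     owner: data-quality
#     start_date: 2025-01-01
#   schedule_interval: "@daily"
#   tasks:
#     run_completeness_check:
#       operator: airflow.operators.bash.BashOperator
#       bash_command: "python run_dq_check.py --dimension completeness --table orders"
#
# --- load_dq_dags.py (lives in the DAGs folder) --------------------------------
from pathlib import Path

import dagfactory

config_file = Path(__file__).parent / "dq_checks.yml"
dag_factory = dagfactory.DagFactory(str(config_file))
dag_factory.clean_dags(globals())
dag_factory.generate_dags(globals())  # registers the YAML-defined DAGs with Airflow
```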
Columbia C
In large organizations, multiple Apache Airflow instances often arise organically—driven by team-specific needs, distinct use cases, or tiered workloads. This fragmentation introduces complexity, operational overhead, and higher infrastructure costs. To address these challenges, we developed the “Orchestration Frederator,” a solution designed to unify and horizontally scale multiple Airflow deployments seamlessly.
This session will detail our journey in implementing Orchestration Frederator, highlighting how we achieved:
- Horizontal Scalability: Seamlessly scaling Airflow across multiple instances without operational overhead.
- End-to-End Data Lineage: Constructing comprehensive data lineage across disparate Airflow deployments to simplify monitoring and debugging.
- Multi-Region Support: Introducing multi-region capabilities, enhancing reliability and disaster recovery.
- Unified Ecosystem: Consolidating previously fragmented Airflow environments into a cohesive orchestration platform.
Join us to explore practical strategies, technical challenges, lessons learned, and best practices for enhancing scalability, reliability, and maintainability in large-scale Airflow deployments.
Columbia D
In the rapidly evolving field of data engineering and data science, efficiency and ease of use are crucial. Our innovative solution offers a user-friendly interface to manage and schedule custom PySpark, PySQL, Python, and SQL code, streamlining the process from development to production. Using Airflow at the backend, this tool eliminates the complexities of infrastructure management, version control, CI/CD processes, and workflow orchestration. The intuitive UI allows users to upload code, configure job parameters, and set schedules effortlessly, without the need for additional scripting or coding. Additionally, users have the flexibility to bring their own custom artifactory solution and run their code. In summary, our solution significantly enhances the orchestration and scheduling of custom code, breaking down traditional barriers and empowering organizations to maximize their data’s potential and drive innovation efficiently. Whether you are an individual data scientist or part of a large data engineering team, this tool provides the resources needed to streamline your workflow and achieve your goals faster than ever before.
Beckler
As data platforms grow in complexity, so do the orchestration needs behind them. Time-based (cron) scheduling has long been the default in Airflow, but dataset-based scheduling promises a more data-aware, efficient alternative. In this session, I’ll share lessons learned from operating Airflow at scale—supporting thousands of DAGs across teams with varied use cases, from simple ETL to complex ML workflows. We’ll explore when dataset scheduling makes sense, the challenges it introduces, and how to evolve your DAG design and platform architecture to make the most of it. Whether you’re migrating legacy workflows or designing new ones, this talk will help you evaluate the right scheduling model for your needs.
Columbia A
At LinkedIn, our data pipelines process exabytes of data, with our offline infrastructure executing 300K ETL workflows daily and 10K concurrent executions. Historically, these workloads ran on our legacy system, Azkaban, which faced UX, scalability, and operational challenges. To modernize our infra, we built a managed Airflow service, leveraging its enhanced developer & operator experience, rich feature set, and strong OSS community support. That initiated LinkedIn’s largest-ever infrastructure migration—transitioning thousands of legacy workflows to Airflow.
In this talk, we will share key lessons from migrating massive-scale pipelines with minimal production disruption. We will discuss:
- Overall Migration Strategy
- Custom Tooling Enhancements on testing, deployment, and observability
- Architectural Innovations decoupling orchestration and compute
- GenAI-powered Migration automating code rewrites
- Post-Migration Challenges & Airflow 3.0.
Attendees will walk away with battle-tested strategies for large-scale Airflow adoption and practical insights into scaling Airflow in enterprise environments.
Columbia C
There was a post on the data engineering subreddit recently that discussed how difficult it is to keep up with the data engineering world.
Did you learn Hadoop? Great, we are on Snowflake, BigQuery, and Databricks now.
Just learned Airflow? Well, now we have Airflow 3.0.
And the list goes on.
But what doesn’t change, and what have the lessons been over the past decade? That’s what I’ll be covering in this talk: real lessons and realities that come up time and time again, whether you’re working for a start-up or a large enterprise.
Columbia D
Per the Airflow community survey, Kubernetes is the most popular compute platform for running Airflow, and when run on Kubernetes, Airflow gains lots of benefits out of the box, like monitoring, reliability, ease of deployment, scalability, and autoscaling. On the other hand, running Airflow on Kubernetes means running a sophisticated distributed system on top of another distributed system, which makes troubleshooting Airflow task and DAG failures harder.
This session tackles that bottleneck head-on, introducing a practical approach to building an automated diagnostic pipeline for Airflow on Kubernetes. Imagine offloading tedious investigations to a system that, on task failure, automatically collects and correlates key signals from Kubernetes components (linking Airflow tasks to specific Pods and their events), Kubernetes/GKE monitoring, and relevant logs—pinpointing root causes and suggesting actionable fixes.
Attendees will leave with a clear understanding of common Airflow-on-Kubernetes failure patterns—and more importantly, a blueprint and practical strategies to reduce MTTR and boost team efficiency.
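As a hedged sketch of the kind of signal collection such a diagnostic pipeline could perform on task failure, the snippet below pulls pod events and a log tail with the official Kubernetes Python client. The namespace and the idea of deriving the pod name from the task instance are assumptions about a typical KubernetesExecutor or KubernetesPodOperator setup.

```python
# Collect the Kubernetes-side evidence for a failed Airflow task's pod.
from kubernetes import client, config


def collect_pod_signals(pod_name: str, namespace: str = "airflow") -> dict:
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    v1 = client.CoreV1Api()

    events = v1.list_namespaced_event(
        namespace,
        field_selector=f"involvedObject.name={pod_name}",
    )
    logs = v1.read_namespaced_pod_log(name=pod_name, namespace=namespace, tail_lines=200)

    return {
        # OOMKills, failed scheduling, image pull errors, etc. surface here.
        "events": [f"{e.reason}: {e.message}" for e in events.items],
        "log_tail": logs,
    }
```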
Beckler
As a popular open-source library for analytics engineering, dbt is often combined with Airflow. Orchestrating and executing dbt models as DAGs ensures an additional layer of control over tasks, observability, and provides a reliable, scalable environment to run dbt models.
This workshop will cover a step-by-step guide to Cosmos, a popular open-source package from Astronomer that helps you quickly run your dbt Core projects as Airflow DAGs and Task Groups, all with just a few lines of code. We’ll walk through:
- Running and visualising your dbt transformations
- Managing dependency conflicts
- Defining database credentials (profiles)
- Configuring source and test nodes
- Using dbt selectors
- Customising arguments per model
- Addressing performance challenges
- Leveraging deferrable operators
- Visualising dbt docs in the Airflow UI
- Example of how to deploy to production
- Troubleshooting
We encourage participants to bring their dbt project to follow this step-by-step workshop.
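To set expectations for the workshop, here is a minimal sketch of rendering a dbt Core project as an Airflow DAG with Cosmos; the project path, profile, and connection id are placeholders to adapt to your own setup.

```python
# A dbt project rendered as an Airflow DAG, with one task per dbt node.
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import PostgresUserPasswordProfileMapping

my_dbt_dag = DbtDag(
    dag_id="jaffle_shop_dbt",
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/jaffle_shop"),
    profile_config=ProfileConfig(
        profile_name="jaffle_shop",
        target_name="dev",
        # Maps an Airflow connection to a dbt profile so credentials stay in Airflow.
        profile_mapping=PostgresUserPasswordProfileMapping(
            conn_id="postgres_default",
            profile_args={"schema": "public"},
        ),
    ),
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
)
```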
301
This workshop will provide an overview of implementing operations research problems using Apache Airflow. This is a hands-on session where attendees will gain experience creating DAGs to define and manage workflows for classical operations research problems. The workshop will include several examples of how Airflow can be used to optimize and automate various decision-making processes, including:
- Inventory management: How to use Airflow to optimize inventory levels and reduce stockouts by analyzing demand patterns, lead times, and other factors.
- Production planning: How to use Airflow to create optimized production schedules that minimize downtime, reduce costs, and increase throughput.
- Logistics optimization: How to use Airflow to optimize transportation routes and other factors to improve the efficiency of logistics operations.
Attendees will come away with a solid understanding of using Airflow to automate decision-making processes with optimization solvers.
305
Ready to contribute to Apache Airflow? In this hands-on workshop, you’ll be expected to come prepared with your development environment already configured (having Breeze installed is strongly recommended, but Codespaces works if you can’t install Docker). We’ll dive straight into finding issues that match your skills and walk you through the entire contribution process—from creating your first pull request to receiving community feedback. Whether you’re writing code, enhancing documentation, or offering feedback, there’s a place for you. Let’s get started and see your name among Airflow contributors!
306
Operating within the stringent regulatory landscape of Corporate Banking, Deutsche Bank relies heavily on robust data orchestration. This session explores how Deutsche Bank’s Corporate Bank leverages Apache Airflow across diverse environments, including both on-premises infrastructure and cloud platforms. Discover their approach to managing critical data & analytics workflows, encompassing areas like regulatory reporting, data integration and complex data processing pipelines. Gain insights into the architectural patterns and operational best practices employed to ensure compliance, security, and scalability when running Airflow at scale in a highly regulated, hybrid setting.
Columbia A
Red Hat’s unified data and AI platform relies on Apache Airflow for orchestration, alongside Snowflake, Fivetran, and Atlan. The platform prioritizes building a dependable data foundation, recognizing that effective AI depends on quality data. Airflow was selected for its predictability, extensive connectivity, reliability, and scalability.
The platform now supports business analytics, transitioning from ETL to ELT processes. This has resulted in a remarkable improvement in how we make data available for business decisions.
The platform’s capabilities are being extended to power Digital Workers (AI agents) using large language models, encompassing model training, fine-tuning, and inference. Two Digital Workers are currently deployed, with more in development.
This presentation will detail the rationale and background of this evolution, followed by an explanation of the architectural decisions made and the challenges encountered and resolved throughout the process of transforming into an AI-enabled data platform to power Red Hat’s business.
Columbia C
At OLX, we connect millions of people daily through our online marketplace while relying on robust data pipelines. In this talk, we explore how the DAG Factory concept elevates data governance, lineage, and discovery by centralizing operator logic and restricting direct DAG creation. This approach enforces code quality, optimizes resources, maintains infrastructure hygiene and enables smooth version upgrades. We then leverage consistent naming conventions in Airflow to build targeted namespaces, aligning teams with global policies while preserving autonomy. Integrating external tools like AWS Lake Formation and Open Metadata further unifies governance, making it straightforward to manage and secure data. This is critical when handling hundreds or even thousands of active DAGs. If the idea of storing 1,600 pipelines in one folder seems overwhelming, join us to learn how the DAG Factory concept simplifies pipeline management. We’ll also share insights from OLX, highlighting how thoughtful design fosters oversight, efficiency, and discoverability across diverse use cases.
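As a rough illustration of the concept (not OLX's implementation), a DAG Factory can be as simple as one function that owns DAG construction, so naming conventions, tags, and defaults are enforced in a single place; the domains and conventions below are assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

ALLOWED_DOMAINS = {"ads", "payments", "search"}  # hypothetical namespaces

def build_dag(domain: str, name: str, schedule: str, owner: str) -> DAG:
    """Central factory: every pipeline goes through here, so naming,
    tags, and default_args are enforced in one place."""
    if domain not in ALLOWED_DOMAINS:
        raise ValueError(f"Unknown domain {domain!r}; pick one of {sorted(ALLOWED_DOMAINS)}")
    dag = DAG(
        dag_id=f"{domain}__{name}",           # consistent naming convention
        schedule=schedule,
        start_date=datetime(2025, 1, 1),
        catchup=False,
        tags=[domain, f"owner:{owner}"],      # used downstream for governance and discovery
        default_args={"owner": owner, "retries": 2},
    )
    with dag:
        EmptyOperator(task_id="start")
    return dag

# Teams declare pipelines as configuration; they never instantiate DAG directly.
dag = build_dag(domain="ads", name="daily_clicks", schedule="@daily", owner="ads-team")
```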
Columbia D
At Yahoo, we built a secure, scalable, and cost-efficient batch processing platform using Amazon MWAA to orchestrate Apache Flink jobs on EKS, managed by the Flink Kubernetes Operator. This setup enables dynamic job orchestration while meeting strict enterprise compliance standards.
In this session, we’ll share how Airflow DAGs:
- Dynamically launch, monitor, and clean up isolated Flink clusters per batch job, improving resource efficiency.
- Securely fetch EKS kubeconfig, submit FlinkDeployment CRDs using FlinkKubernetesOperator, and poll job status using Airflow sensors.
- Integrate IAM for access control and meet Yahoo’s security requirements, including mutual TLS (mTLS) with Athenz.
- Optimize for cost and resilience through automated cleanup of jobs and the operator, and handle job failures and retries.
Join us for practical strategies and lessons from Yahoo’s production-scale Flink workflows in a Kubernetes environment.
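A hedged sketch of the submission step, not Yahoo's production code: it assumes the cncf-kubernetes provider's FlinkKubernetesOperator (the import path may vary by provider version) and an EKS connection named eks_default; status polling with a sensor and cleanup would follow the same pattern.

```python
from datetime import datetime

from airflow import DAG
# Import path assumed from the cncf-kubernetes provider; adjust to your provider version.
from airflow.providers.cncf.kubernetes.operators.flink_kubernetes import FlinkKubernetesOperator

# Minimal FlinkDeployment CRD handled by the Flink Kubernetes Operator (values illustrative).
FLINK_DEPLOYMENT = """
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: batch-enrichment
spec:
  image: flink:1.18
  flinkVersion: v1_18
  jobManager:
    resource: {memory: "2048m", cpu: 1}
  taskManager:
    resource: {memory: "2048m", cpu: 1}
  job:
    jarURI: local:///opt/flink/usrlib/enrichment-job.jar
    parallelism: 4
    upgradeMode: stateless
"""

with DAG("flink_batch_on_eks", schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False):
    submit = FlinkKubernetesOperator(
        task_id="submit_flink_job",
        application_file=FLINK_DEPLOYMENT,   # per-job cluster: created for this run, torn down afterwards
        namespace="flink-jobs",
        kubernetes_conn_id="eks_default",    # kubeconfig/credentials resolved from the Airflow connection
    )
    # A sensor task would poll the FlinkDeployment status here, followed by a
    # cleanup task that deletes the resource once the job completes or fails.
```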
Beckler
This session details practical strategies for introducing Apache Airflow in strict, compliance-heavy organizations. Learn how on-premise deployment and hybrid tooling can help modernize legacy workflows when public cloud solutions and container technologies are restricted. Discover how cross-platform engineering teams can collaborate securely using CI/CD bridges, and what it takes to meet rigorous security and governance standards. Key lessons address navigating resistance to change, achieving production sign-off, and avoiding common compliance pitfalls, relevant to anyone automating in public sector settings.
Columbia A
Metadata management is a cornerstone of effective data governance, yet it presents unique challenges distinct from traditional data engineering. At scale, efficiently extracting metadata from relational and NoSQL databases demands specialized solutions. To address this, our team has developed custom Airflow operators that scan and extract metadata across various database technologies, orchestrating 100+ production jobs to ensure continuous and reliable metadata collection.
Now, we’re expanding beyond databases to tackle non-traditional data sources such as file repositories and message queues. This shift introduces new complexities, including processing structured and unstructured files, managing schema evolution in streaming data, and maintaining metadata consistency across heterogeneous sources. In this session, we’ll share our approach to building scalable metadata scanners, optimizing performance, and ensuring adaptability across diverse data environments. Attendees will gain insights into designing efficient metadata pipelines, overcoming common pitfalls, and leveraging Airflow to drive metadata governance at scale.
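For illustration only (not the team's operators), a metadata scanner for relational sources can be a small custom operator built on SQLAlchemy's inspector; the connection URI, schema, and XCom hand-off below are assumptions.

```python
from airflow.models import BaseOperator
from sqlalchemy import create_engine, inspect

class MetadataScanOperator(BaseOperator):
    """Scan one schema of a relational source and return its table/column metadata."""

    def __init__(self, conn_uri: str, schema: str, **kwargs):
        super().__init__(**kwargs)
        self.conn_uri = conn_uri
        self.schema = schema

    def execute(self, context):
        engine = create_engine(self.conn_uri)
        inspector = inspect(engine)
        catalog = {}
        for table in inspector.get_table_names(schema=self.schema):
            catalog[table] = [
                {"name": col["name"], "type": str(col["type"]), "nullable": col["nullable"]}
                for col in inspector.get_columns(table, schema=self.schema)
            ]
        # Returned value lands in XCom; a downstream task would load it into the metadata store.
        return catalog
```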
Columbia C
DAGnostics seamlessly integrates Airflow Cluster Policy hooks to enforce governance from local DAG authoring through CI pipelines to production runtime. Learn how it closes validation gaps, collapses feedback loops from hours to seconds, and ensures consistent policies across stages. We examine current runtime-only enforcement and fractured CI checks, then unveil our architecture: a pluggable policy registry via Airflow entry points, local static analysis for pre-commit validation, GitHub Actions CI integration, and runtime hook enforcement. See real-world use cases: alerting standards, resource quotas, naming conventions, and exemption handling. Next, dive into implementation: authoring policies in Python, auto-discovery, cross-environment enforcement, upstream contribution, and testing strategies. We share LinkedIn’s metrics—2,000+ DAG repos, 10,000+ daily executions supporting trunk-based development across isolated teams/use-cases, and 78% fewer runtime violations—and lessons learned scaling policy-as-code at enterprise scale. Leave with a blueprint to adopt DAGnostics and strengthen your Airflow governance while preserving full compatibility with existing systems.
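While DAGnostics itself is LinkedIn's tooling, the underlying Airflow hooks look roughly like this: a cluster policy module (for example airflow_local_settings.py on the scheduler's path, or a package registered through the airflow.policy entry point) that rejects or mutates non-compliant DAGs and tasks. The specific rules below, a team tag and a retry cap, are illustrative assumptions.

```python
# airflow_local_settings.py (or a package registered via the "airflow.policy" entry point)
from airflow.exceptions import AirflowClusterPolicyViolation

MAX_RETRIES = 5  # illustrative quota

def dag_policy(dag):
    """Runs at parse time for every DAG: enforce ownership and alerting conventions."""
    if not any(tag.startswith("team:") for tag in (dag.tags or [])):
        raise AirflowClusterPolicyViolation(
            f"DAG {dag.dag_id} must carry a 'team:<name>' tag for alert routing."
        )

def task_policy(task):
    """Runs for every task: cap retries and require an execution timeout."""
    if task.retries and task.retries > MAX_RETRIES:
        task.retries = MAX_RETRIES                # clamp instead of failing the DAG
    if task.execution_timeout is None:
        raise AirflowClusterPolicyViolation(
            f"Task {task.task_id} in DAG {task.dag_id} must set execution_timeout."
        )
```

Because the same policy module can be imported by a pre-commit hook or a CI job and applied to parsed DAG objects, the rules only need to be written once to cover local authoring, CI, and runtime.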
Columbia D
Apache Airflow’s executor landscape has traditionally presented users with a clear trade-off: choose either the speed of local execution or the scalability, isolation and configurability of remote execution. The AWS Lambda Executor introduces a new paradigm that bridges this gap, offering near-local execution speeds with the benefits of remote containerization.
This talk will begin with a brief overview of Airflow’s executors, how they work and what they are responsible for, highlighting the compromises between different executors. We will explore the emerging niche for fast yet remote execution and demonstrate how the AWS Lambda Executor fills this space. We will also address practical considerations when using such an executor, such as working within Lambda’s 15-minute execution limit, and how to mitigate it using multi-executor configuration.
Whether you’re new to Airflow or an experienced user, this session will provide valuable insights into task execution and how you can combine the best of both local and remote execution paradigms.
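To illustrate the multi-executor mitigation mentioned above: with Airflow 2.10+ hybrid executor support, individual tasks can be pinned to a specific executor, so anything that might exceed Lambda's 15-minute cap is routed elsewhere. The executor names below must match whatever names or aliases your airflow.cfg registers; they are assumptions here.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Assumes airflow.cfg configures two executors, e.g. the AWS Lambda executor
# aliased as "LAMBDA" and a Celery executor aliased as "CELERY" (aliases are
# illustrative; use whatever names your deployment registers).
with DAG("hybrid_executor_demo", schedule=None, start_date=datetime(2025, 1, 1), catchup=False):
    quick = BashOperator(
        task_id="quick_transform",
        bash_command="python -c 'print(42)'",
        executor="LAMBDA",    # short-lived task: fast, isolated serverless execution
    )
    heavy = BashOperator(
        task_id="long_running_batch",
        bash_command="sleep 1800",
        executor="CELERY",    # would exceed Lambda's 15-minute limit, so keep it on a worker
    )
    quick >> heavy
```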
Beckler
The Bloomberg Data Platform Engineering team is responsible for managing, storing, and providing access to business and financial data used by financial professionals across the global capital markets. Our team utilizes Apache Airflow to orchestrate data workflows across various applications and Bloomberg Terminal functions. Over the years, we have fine-tuned our Airflow cluster to handle more than 1,000 ingestion DAGs, which has presented unique scalability challenges. In this session, we will share insights into several key Airflow parameters — some of which you may not be all that familiar with — that our team uses to optimize and scale the platform effectively.
Columbia A
As modern data ecosystems grow in complexity, ensuring transparency, discoverability, and governance in data workflows becomes critical. Apache Airflow, a powerful workflow orchestration tool, enables data engineers to build scalable pipelines, but without proper visibility into data lineage, ownership, and quality, teams risk operating in a black box.
In this talk, we will explore how integrating Airflow with a data catalog can bring clarity and transparency to data workflows. We’ll discuss how metadata-driven orchestration enhances data governance, enables lineage tracking, and improves collaboration across teams. Through real-world use cases, we will demonstrate how Airflow can automate metadata collection, update data catalogs dynamically, and ensure data quality at every stage of the pipeline.
Attendees will walk away with practical strategies for implementing a transparent data workflow that fosters trust, efficiency, and compliance in their data infrastructure.
Columbia C
This session explores how to bring unit testing to SQL pipelines using Airflow. I’ll walk through the development of a SQL testing library that allows isolated testing of SQL logic by injecting mock data into base tables. To support this, we built a type system for AWS Glue tables using Pydantic, enabling schema validation and mock data generation. Over time, this type system also powered production data quality checks via a custom Airflow operator. Learn how this approach improves reliability, accelerates development, and scales testing across data workflows.
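A rough sketch of the idea, not the library from the talk: a Pydantic model doubles as the declared schema of a Glue table and as a factory for typed mock rows that get injected into a temporary base table before the SQL under test runs. All names are illustrative.

```python
from datetime import date

from pydantic import BaseModel, Field

class Orders(BaseModel):
    """Declared shape of a Glue table such as analytics.orders (assumed)."""
    order_id: int
    customer_id: int
    amount_usd: float = Field(ge=0)   # validation doubles as a data quality rule
    order_date: date

def mock_rows() -> list[dict]:
    """Mock data injected into the base table before running the SQL under test."""
    rows = [
        Orders(order_id=1, customer_id=10, amount_usd=19.99, order_date=date(2025, 1, 2)),
        Orders(order_id=2, customer_id=11, amount_usd=5.00, order_date=date(2025, 1, 3)),
    ]
    return [r.model_dump() for r in rows]

# In a test, these rows would be loaded into a temporary "orders" table, the
# pipeline's SQL run against it, and assertions made on the resulting output.
```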
Columbia D
We will explore how Apache Airflow 3 unlocks new possibilities for smarter, more flexible DAG design. We’ll start by breaking down common anti-patterns in early DAG implementations, such as hardcoded operators, duplicated task logic, and rigid sequencing, that lead to brittle, unscalable workflows. From there, we’ll show how refactoring with the D.R.Y. (Don’t Repeat Yourself) principle, using techniques like task factories, parameterization, dynamic task mapping, and modular DAG construction, transforms these workflows into clean, reusable patterns.
With Airflow 3, these strategies go further: enabling DAGs that are reusable across both batch pipelines and streaming/event-driven workloads, while also supporting ad-hoc runs for testing, one-off jobs, or backfills. The result is not just more concise code, but workflows that can flexibly serve different data processing modes without duplication. Attendees will leave with concrete patterns and best practices for building maintainable, production-grade DAGs that are scalable, observable, and aligned with modern data engineering standards.
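A minimal sketch of two of the patterns mentioned, a parameterized task plus dynamic task mapping, so a single definition fans out over whatever inputs arrive; the table names and processing step are placeholders.

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False)
def dry_pipeline():
    @task
    def list_tables() -> list[str]:
        # In practice this could come from a config file, an Asset event, or a DAG param.
        return ["orders", "customers", "payments"]

    @task
    def process(table: str) -> str:
        # One reusable implementation instead of a hardcoded operator per table.
        print(f"processing {table}")
        return table

    # Dynamic task mapping: one mapped task instance per table, decided at runtime.
    process.expand(table=list_tables())

dry_pipeline()
```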
Beckler
Before Airflow, our BigQuery pipelines at Create Music Group operated like musicians without a conductor—each playing on its own schedule, regardless of whether upstream data was ready. As our data platform grew, this chaos led to spiralling costs, performance bottlenecks, and became utterly unsustainable.
This talk tells the story of how Create Music Group brought harmony to its data workflows by adopting Apache Airflow and the Medallion architecture, ultimately slashing our data processing costs by 50%. We’ll show how moving to event-driven scheduling with datasets helped eliminate stale data issues, dramatically improved performance, and unlocked faster iteration across teams. Discover how we replaced repetitive SQL with standardized dimension/fact tables, empowering analysts in a safer sandbox.
Columbia A
As Data Engineers, our jobs regularly include scheduling or scaling workflows.
But have you ever asked yourself: can I scale my scheduling?
It turns out that you can! But doing so raises a number of issues that need to be addressed.
In this talk we’ll be:
- Recapping Asset-aware scheduling in Apache Airflow
- Discussing diverse methods to upscale our scheduling
- Solving the issue of keeping our Airflow Assets synchronized between instances
- Comparing our in-house push-based solution with the built-in solution from AIP-82, and the pros and cons of each method.
I hope you will enjoy it!
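For those newer to asset-aware scheduling, a minimal sketch of the producer/consumer pattern we will recap looks like this; the imports assume the Airflow 3 task SDK (in Airflow 2.x the equivalent construct is airflow.datasets.Dataset), and the URI is a placeholder.

```python
from datetime import datetime

from airflow.sdk import Asset, dag, task

daily_orders = Asset("s3://datalake/orders/daily")   # illustrative URI

@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def producer():
    @task(outlets=[daily_orders])
    def export_orders():
        ...  # write the data the asset points to

    export_orders()

@dag(schedule=[daily_orders], start_date=datetime(2025, 1, 1), catchup=False)
def consumer():
    @task
    def build_report():
        ...  # runs whenever the producer updates the asset

    build_report()

producer()
consumer()
```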
Columbia C
In today’s data-driven world, effective workflow management and AI are crucial for success. However, there’s a notable gap between Airflow and AI. Our presentation offers a solution to close this gap.
We propose an MCP (Model Context Protocol) server to act as a bridge. We’ll dive into two paths:
- AI-Augmented Airflow: Enhancing Airflow with AI to improve error handling, automate DAG generation, proactively detect issues, and optimize resource use.
- Airflow-Powered AI: Utilizing Airflow’s reliability to empower LLMs in executing complex tasks, orchestrating AI agents, and supporting decision-making with real-time data.
Key takeaways:
- Understanding how to integrate AI insights directly into your workflow orchestration.
- Learning how MCP empowers AI with robust orchestration capabilities, offering full logging, monitoring, and auditability.
- Gaining insights into how to transform LLMs from reactive responders into proactive, intelligent, and reliable executors.
We invite you to explore how MCP can improve workflow management, making AI-driven decisions more reliable and turning workflow systems into intelligent, autonomous agents.
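As a taste of the bridge being proposed, here is a minimal, hedged sketch of an MCP tool that triggers a DAG through Airflow's stable REST API. It assumes the official mcp Python SDK, an Airflow 2.x webserver with basic auth at localhost, and placeholder credentials; it is not the implementation presented in the talk.

```python
import requests
from mcp.server.fastmcp import FastMCP

AIRFLOW_API = "http://localhost:8080/api/v1"   # assumption: local Airflow 2.x webserver
AUTH = ("admin", "admin")                      # assumption: basic auth enabled

mcp = FastMCP("airflow-bridge")

@mcp.tool()
def trigger_dag(dag_id: str) -> dict:
    """Start a new DAG run and return the API response for the agent to inspect."""
    resp = requests.post(
        f"{AIRFLOW_API}/dags/{dag_id}/dagRuns",
        json={"conf": {}},
        auth=AUTH,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    mcp.run()   # serve the tool over stdio to an MCP-capable client
```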
Columbia D
Ever seen a DAG go rogue and deploy itself? Or try to time travel back to 1999? Join us for a light-hearted yet painfully relatable look at how not to scale your Airflow deployment, so you can avoid chaos and debugging nightmares.
We’ll cover the classics: hardcoded secrets, unbounded retries (hello, immortal task!), and the infamous spaghetti DAG where 200 tasks are lovingly connected by hand and no one dares open the Airflow UI anymore. If you’ve ever used datetime.now() in your DAG definition and watched your backfills implode, this talk is for you.
From the BashOperator that became sentient to the XCom that tried to pass a whole Pandas DataFrame and the key to your mother’s house, we’ll walk through real-world bloopers with practical takeaways. You’ll learn why overusing PythonOperator is a recipe for mess, how not to use sensors unless you enjoy resource starvation, and why scheduling in local timezones is basically asking for a daylight savings time horror story. Other highlights include:
- Over-provisioning resources in KubernetesPodOperator: many teams allocate excessive memory/CPU “just in case”, leading to cluster contention and resource waste.
- Dynamic task mapping gone wild: 10,000 mapped tasks later… the scheduler is still crying.
- SLAs used as data quality guarantees: creating alerts so noisy, nobody listens.
- Design-free DAGs: no docs, no comments, no idea why a task has a 3-day timeout.
Finally, we’ll round it out with some dos and don’ts: using environment variables, avoiding memory-hungry monolith DAGs, skipping global imports, and not allocating 10x more memory “just in case.” Whether you’re new to Airflow or battle-hardened from a thousand failed backfills, come learn how to scale your pipelines without losing your mind (or your cluster).
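To pick one example from the list above, the datetime.now() blooper and its fix look roughly like this; the table path and IDs are illustrative.

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def partition_loader():
    # Anti-pattern: evaluated once at parse time, so every run (including
    # backfills) would see "today" instead of the run it belongs to.
    # bad_partition = datetime.now().strftime("%Y-%m-%d")

    @task
    def load(ds=None):
        # Better: "ds" is injected per run, so backfills load the right partition.
        print(f"loading s3://bucket/table/dt={ds}/")

    load()

partition_loader()
```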
Beckler
In this talk, we’ll share our journey and lessons learned from developing a new open-source Airflow operator that integrates a newly-launched AWS service with the Airflow ecosystem. This real-world case study will illuminate the complete lifecycle of building an Airflow operator, from initial design to successful community contribution.
We’ll dive deep into the practical challenges and solutions encountered throughout the journey, including:
- Evaluating when to build a new operator versus extending existing ones
- Navigating the Apache Airflow Open-source contribution process
- Best practices for operator design and implementation
- Key learnings and common pitfalls to avoid during the testing and release process
Whether you’re looking to contribute to Apache Airflow or build custom operators, this session will provide valuable insights into the development process, common pitfalls to avoid, and best practices when contributing to and collaborating with the Apache Airflow community.
Expect to leave with a practical roadmap for your own contributions and the confidence to successfully engage with the Apache Airflow ecosystem.
Columbia A
Apache Airflow is a powerful workflow orchestrator, but as workloads grow, its Python-based components can become performance bottlenecks. This talk explores how Rust, with its speed, safety, and concurrency advantages, can enhance Airflow’s core components (e.g., the scheduler and the DAG processor). We’ll dive into the motivations behind using Rust, architectural trade-offs, and the challenges of bridging the gap between Python and Rust. A proof-of-concept showcasing an Airflow scheduler rewritten in Rust will demonstrate the potential benefits of this approach.
Columbia A