These are the confirmed sessions for Airflow Summit 2025.

5 Simple Strategies To Enhance Your DAGs For Data Processing

by William Orgertrice

Want to take your DAGs in Apache Airflow to the next level? Join this insightful session where we’ll uncover 5 transformative strategies to enhance your data workflows. Whether you’re a data engineering pro or just getting started, this presentation is packed with practical tips and actionable insights that you can apply right away.

We’ll dive into the magic of using powerful libraries like Pandas, share techniques to trim down data volumes for faster processing, and highlight the importance of modularizing your code for easier maintenance. Plus, you’ll discover efficient ways to monitor and debug your DAGs, and how to make the most of Airflow’s built-in features.
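
To make a couple of these ideas concrete, here is a minimal sketch (not from the talk) of a modular TaskFlow DAG that uses Pandas and trims the data it passes between tasks; the file path and column names are invented, and the imports follow the Airflow 2 style.

  from datetime import datetime

  import pandas as pd
  from airflow.decorators import dag, task


  @dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False)
  def trimmed_processing():
      @task
      def extract() -> str:
          # Read only the columns we need, keeping memory use and payloads small.
          df = pd.read_csv("/data/events.csv", usecols=["user_id", "amount"])
          out = "/data/events_trimmed.parquet"
          df.to_parquet(out)
          return out

      @task
      def aggregate(path: str) -> None:
          # A separate, reusable step: easier to test and maintain than one big task.
          df = pd.read_parquet(path)
          print(df.groupby("user_id")["amount"].sum())

      aggregate(extract())


  trimmed_processing()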

Agentic AI Automating Semantic Layer Updates with Airflow 3

by Scott Mitchell

In today’s dynamic data environments, tables and schemas are constantly evolving and keeping semantic layers up to date has become a critical operational challenge. Manual updates don’t scale, and delays can quickly lead to broken dashboards, failed pipelines, and lost trust.

We’ll show how to harness Apache Airflow 3 and its new event-driven scheduling capabilities to automate the entire lifecycle: detecting table and schema changes in real time, parsing and interpreting those changes, and shifting left the updating of semantic models across dbt, Looker, or custom metadata layers. AI agents add intelligence and automation that rationalize schema diffs, assess the impact of changes, and propose targeted updates to semantic layers, reducing manual work and minimizing the risk of errors.

Airflow & Bigtop: Modernize and integrate time-proven OSS stack with Apache Airflow

by Kengo Seki

Apache Bigtop is a time-proven open-source software stack for building data platforms, built around the Hadoop and Spark ecosystem since 2011. Its software composition has changed over that long period, and its job scheduler was recently removed, mainly due to inactivity in its development. The speaker believes that Airflow fits this gap perfectly and is proposing to incorporate it into the Bigtop stack. This presentation will introduce how easily users can build a data platform with Bigtop including Airflow, and how Airflow can integrate that software through its wide range of providers and enterprise-ready features such as Kerberos support.

Airflow 3 - An Open Heart Surgery

by M Waqas Shahid

Curious how code truly flows inside Airflow? Join me for a unique visualisation journey into Airflow’s inner workings (the first of its kind) — the code blocks and modules called as certain operations run.

A walkthrough that unveils task execution, observability, and debugging like never before. This session will demystify Airflow’s architecture, showcasing real-time task flows and the heartbeat of pipelines in action.

Perfect for engineers looking to optimize workflows, troubleshoot efficiently, and gain a new perspective on Airflow’s powerful core. See Airflow running live with detailed insights and unlock the secrets to better pipeline management!

Airflow 3 UI is not enough? Add a Plugin!

by Jens Scheffler, Brent Bovenzi & Pierre Jeambrun

In Airflow 2, a plugin mechanism made it possible to extend the UI with new functions as well as to add hooks and other features.

Because Airflow 3 rewrote the UI, the old plugins no longer work in all cases. Airflow 3.1 now provides a revamped option to extend the UI with a new plugin schema based on native React components and embedded iframes, following the AIP-68 definitions.

In this session we will provide an overview of the capabilities and a short introduction to rolling your own.

Airflow 3’s Trigger UI: Evolution of Params

by Shubham Raj & Jens Scheffler

Are you looking to build slick, dynamic trigger forms for your DAGs? It all starts with mastering params.

Params are the gold standard for adding execution options to your DAGs, allowing you to create dynamic, user-friendly trigger forms with descriptions, validation, and now, with Airflow 3, bidirectional support for conf data!

In this talk, we’ll break down how to use params effectively, share best practices, and explore what’s new since the 2023 Airflow Summit talk (https://airflowsummit.org/sessions/2023/flexible-dag-trigger-forms-aip-50/). If you want to make DAG execution more flexible, intuitive, and powerful, this session is a must-attend!
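
For context, a trigger form typically grows out of a params block like the following rough sketch; the param names and values are invented, and the Param import path may differ slightly between Airflow 2 and Airflow 3.

  from datetime import datetime

  from airflow.decorators import dag, task
  from airflow.models.param import Param  # Airflow 3 may also expose Param via airflow.sdk


  @dag(
      schedule=None,
      start_date=datetime(2025, 1, 1),
      catchup=False,
      params={
          "environment": Param("dev", enum=["dev", "staging", "prod"], description="Target environment"),
          "row_limit": Param(1000, type="integer", minimum=1, description="Maximum rows to process"),
      },
  )
  def triggerable_pipeline():
      @task
      def report(**context):
          # Values chosen in the trigger form arrive through the rendered params.
          print(context["params"]["environment"], context["params"]["row_limit"])

      report()


  triggerable_pipeline()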

Airflow as an AI Agent’s Toolkit: Unlocking 1000+ Integrations with MCP

by Kaxil Naik

AI agents transform conversational prompts into actionable automation, provided they have reliable access to essential tools like data warehouses, cloud storage, and APIs.

Now imagine exposing Airflow’s rich integration layer directly to AI agents via the emerging Model Context Protocol (MCP). This isn’t just gluing AI into Airflow; it’s turning Airflow into a structured execution layer for adaptive, agentic logic with full observability, retries, and audit trails built in.

We’ll demonstrate a real-world fraud detection pipeline powered by agents: suspicious transactions are analyzed, enriched dynamically with external customer data via MCP, and escalated based on validated, structured outputs. Every prompt, decision, and action is auditable and compliant.

Airflow at Zoox: A journey to orchestrate heterogeneous workflows

by Justin Wang & Saurabh Gupta

The workflow orchestration team at Zoox aims to build a solution for orchestrating heterogeneous workflows encompassing data, ML, and QA pipelines. We have encountered two primary challenges: first, the steep learning curve for new Airflow users and the need for a user-friendly yet scalable development process; second, integrating and migrating existing pipelines with established solutions.

This presentation will detail our approach, as a small team at Zoox, to address these challenges. The discussion will cover the scope and scale of Airflow within Zoox, including current applications and future directions. Furthermore, we will share our strategies for simplifying the Airflow DAG creation process and enhancing user experience. Finally, we will present a case study illustrating the onboarding of a heterogeneous workflow across Databricks, AWS, and a Zoox in-house platform to manage both on-prem and cloud services.

Airflow That Remembers: The Dag Versioning Era is here!

by Jed Cunningham & Ephraim Anierobi

Airflow 3 introduced a game-changing feature: Dag versioning.

Gone are the days of “latest only” Dags and confusing, inconsistent UI views when pipelines change mid-flight. This talk covers:

  • Visualizing Dag changes over time in the UI
  • How Dag code is versioned and can be grabbed from external sources
  • Executing a whole Dag run against the same code version
  • Dynamic Dags? Where do they fit in?!

You’ll see real-world scenarios, UI demos, and learn how these advancements will help avoid “Airflow amnesia”.

Allegro's Airflow Journey: From On-Prem to Cloud Orchestration at Scale

by Piotr Dziuba & Marek Gawinski

This session will detail the Apache Airflow journey of Allegro, a leading e-commerce company in Poland. It will chart our evolution from a custom, on-premises Airflow-as-a-Service solution through a significant expansion to over 300 Cloud Composer instances in Google Cloud, culminating in Airflow becoming the core of our data processing. We orchestrate over 64,000 regular tasks spanning over 6,000 active DAGs on more than 200 Airflow instances, from feeding business-supporting dashboards to managing main data marts, handling ML pipelines, and more.

Applying Airflow to drive the digital workforce in the Enterprise

by Shoubhik Bose

Red Hat’s unified data and AI platform relies on Apache Airflow for orchestration, alongside Snowflake, Fivetran, and Atlan. The platform prioritizes building a dependable data foundation, recognizing that effective AI depends on quality data. Airflow was selected for its predictability, extensive connectivity, reliability, and scalability.

The platform now supports business analytics, transitioning from ETL to ELT processes. This has resulted in a remarkable improvement in how we make data available for business decisions.

Assets: Past, Present, Future

by Tzu-ping Chung

Airflow’s Asset concept originated from data lineage and evolved into its current role as a scheduling concept (data-aware, event-based scheduling). It has even more potential. This talk discusses how other parts of Airflow, namely Connection and Object Storage, contain concepts related to Asset, and how we can tie them all together to make task authoring flow even more naturally.

Planned topics:

  • Brief history on Asset and related constructs.
  • Current state of Asset concepts.
  • Inlets, anyone?
  • Finding inspiration from Pydantic et al.
  • My next step for Asset.
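
For readers new to the concept, a data-aware schedule looks roughly like this sketch of a producer/consumer pair; the asset URI is invented, and the airflow.sdk import path assumes Airflow 3 (in Airflow 2 the equivalent class was Dataset).

  from datetime import datetime

  from airflow.sdk import Asset, dag, task

  raw_orders = Asset("s3://example-bucket/raw/orders.parquet")  # invented URI


  @dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
  def produce_orders():
      @task(outlets=[raw_orders])
      def write_orders():
          ...  # writing the file marks the Asset as updated

      write_orders()


  @dag(schedule=[raw_orders], start_date=datetime(2025, 1, 1), catchup=False)
  def consume_orders():
      @task
      def transform():
          ...  # runs whenever raw_orders is updated by the producer

      transform()


  produce_orders()
  consume_orders()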

Automating Business Intelligence with Airflow: A Practical Guide

by Chinni Krishna Abburi

In today’s fast-paced business world, timely and reliable insights are crucial — but manual BI workflows can’t keep up. This session offers a practical guide to automating business intelligence processes using Apache Airflow. We’ll walk through real-world examples of automating data extraction, transformation, dashboard refreshes, and report distribution. Learn how to design DAGs that align with business SLAs, trigger workflows based on events, integrate with popular BI tools like Tableau and Power BI, and implement alerting and failure recovery mechanisms. Whether you’re new to Airflow or looking to scale your BI operations, this session will equip you with actionable strategies to save time, reduce errors, and supercharge your organization’s decision-making capabilities.

Automating Threat Intelligence with Airflow, XDR, and LLMs using the MITRE ATT&CK Framework

by Karan Alang

Security teams often face alert fatigue from massive volumes of raw log data. This session demonstrates how to combine Apache Airflow, Wazuh, and LLMs to build automated pipelines for smarter threat triage—grounded in the MITRE ATT&CK framework.

We’ll explore how Airflow can orchestrate a full workflow: ingesting Wazuh alerts, using LLMs to summarize log events, matching behavior to ATT&CK tactics and techniques, and generating enriched incident summaries. With AI-powered interpretation layered on top of structured threat intelligence, teams can reduce manual effort while increasing context and clarity.

AWS Lambda Executor: The Speed of Local Execution with the Advantages of Remote

by Niko Oliveira

Apache Airflow’s executor landscape has traditionally presented users with a clear trade-off: choose either the speed of local execution or the scalability, isolation and configurability of remote execution. The AWS Lambda Executor introduces a new paradigm that bridges this gap, offering near-local execution speeds with the benefits of remote containerization.

This talk will begin with a brief overview of Airflow’s executors, how they work and what they are responsible for, highlighting the compromises between different executors. We will explore the emerging niche for fast, yet remote execution and demonstrate how the AWS Lambda Executor fills this space. We will also address practical considerations when using such an executor, such as working within Lambda’s 15 minute execution limit, and how to mitigate this using multi-executor configuration.
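
As a rough illustration of the multi-executor setup mentioned above (not the speaker’s configuration), Airflow’s hybrid executor support lets you register several executors and pin individual tasks to one of them; the Lambda executor’s module path below is an assumption, so verify it against the Amazon provider documentation.

  # Hybrid executors: list more than one executor (the first is the default),
  # then pin individual tasks to a specific one, e.g. via
  #   AIRFLOW__CORE__EXECUTOR=CeleryExecutor,airflow.providers.amazon.aws.executors.aws_lambda.AwsLambdaExecutor
  # The Lambda executor path above is an assumption; check the provider docs.
  from datetime import datetime

  from airflow.decorators import dag, task


  @dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False)
  def hybrid_execution():
      @task(executor="airflow.providers.amazon.aws.executors.aws_lambda.AwsLambdaExecutor")
      def short_and_bursty():
          ...  # a quick task that fits comfortably under Lambda's 15-minute limit

      @task  # no executor given, so this falls back to the default (Celery) executor
      def long_running():
          ...

      short_and_bursty() >> long_running()


  hybrid_execution()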

Behind the Scenes: How We Tested Airflow 3 for Stability and Reliability

by Rahul Vats & Phani Kumar

Ensuring the stability of a major release like Airflow 3 required extensive testing across multiple dimensions. In this session, we will dive into the testing strategies and validation techniques used to guarantee a smooth rollout. From unit and integration tests to real-world DAG validations, this talk will cover the challenges faced, key learnings, and best practices for testing Airflow. Whether you’re a contributor, QA engineer, or Airflow user preparing for migration, this session will offer valuable takeaways to improve your own testing approach.

Benchmarking the Performance of Dynamically Generated DAGs

by Tatiana Al-Chueyr Martins & Rahul Vats

As teams scale their Airflow workflows, a common question is: “My DAG has 5,000 tasks—how long will it take to run in Airflow?”

Beyond execution time, users often face challenges with dynamically generated DAGs, such as:

  • Delayed visualization in the Airflow UI after deployment.
  • High resource consumption, leading to Kubernetes pod evictions and out-of-memory errors.

While estimating the resource utilization in a distributed data platform is complex, benchmarking can provide crucial insights.
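
For reference, the kind of dynamically generated DAG under discussion usually comes from a factory loop like this toy sketch; the counts are deliberately tiny and the naming is invented.

  from datetime import datetime

  from airflow.decorators import dag, task

  # One Python file fanning out into many DAGs; the counts here are tiny on purpose,
  # while the benchmarked scenarios reach thousands of tasks.
  for i in range(3):

      @dag(
          dag_id=f"generated_dag_{i}",
          schedule=None,
          start_date=datetime(2025, 1, 1),
          catchup=False,
      )
      def generated():
          @task
          def work(n: int):
              print(f"processing shard {n}")

          work.expand(n=list(range(10)))  # dynamic task mapping adds further fan-out

      globals()[f"generated_dag_{i}"] = generated()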

Beyond Execution Dates: Empowering inference execution and hyper-parameter tuning with Airflow 3

by Ankit Chaurasia & Rahul Vats

In legacy Airflow 2.x, each DAG run was tied to a unique “execution_date.” By removing this requirement, Airflow can now directly support a variety of new use cases, such as model training and generative AI inference, without the need for hacks and workarounds typically used by machine learning and AI engineers.

In this talk, we will delve into the significant advancements in Airflow 3 that enable GenAI and MLOps use cases, particularly through the changes outlined in AIP 83. We’ll cover key changes like the renaming of “execution_date” to “logical_date,” along with the allowance for it to be null, and the introduction of the new “run_after” field which provides a more meaningful mechanism for scheduling and sorting. Furthermore, we’ll discuss how by removing the uniqueness constraint, Airflow 3 enables multiple parallel runs, empowering diverse triggering mechanisms and easing backfill logic with a real-world demo.

Beyond Logs: Unlocking Airflow 3.0 Observability with OpenTelemetry Traces

by Christos Bisias

Using OpenTelemetry tracing, users can gain full visibility into tasks and calls to outside services. This is an increasingly important skill, especially as tasks in an Airflow DAG involve multiple complex computations which take hours or days to complete. Airflow lets users easily monitor how long entire DAG runs or individual tasks take, but offers little visibility into the internal actions behind them. OpenTelemetry gives users much more operational awareness and metrics they can use to improve operations.

Beyond the bundle - evolving DAG parsing in Airflow 3

by Igor Kholopov

Airflow 3 made great strides with AIP-66, introducing the concept of a DAG bundle. This successfully challenged one of the fundamental architectural limitations of the original Airflow design, how DAGs are deployed, bringing structure to something that often had to be operated as a pile of files. However, we believe this should by no means be the end of the road when it comes to making DAG management easier, authoring more accessible to a broader audience, and integration with data agents smoother. We believe the next step in Airflow’s evolution is a native option to break away from the necessity of having a real file in the file systems of multiple components to get your DAG up and running. This is what we hope to achieve as part of AIP-85, extendable DAG parsing control. In this talk I’d like to give a detailed overview of how we want to make it happen and show examples of the valuable integrations we hope to unblock with it.

Boosting dbt-core workflows performance with Airflow’s Deferrable capabilities

by Pankaj Koti, Tatiana Al-Chueyr Martins & Pankaj Singh

Efficiently handling long-running workflows is crucial for scaling modern data pipelines. Apache Airflow’s deferrable operators help offload tasks during idle periods — freeing worker slots while tracking progress.

This session explores how Cosmos 1.9 (https://github.com/astronomer/astronomer-cosmos) integrates Airflow’s deferrable capabilities to enhance orchestrating dbt (https://github.com/dbt-labs/dbt-core) in production, with insights from recent contributions that introduced this functionality.

Key takeaways:

  • Deferrable Operators: How they work and why they’re ideal for long-running dbt tasks.
  • Integrating with Cosmos: Refactoring and enhancements to enable deferrable behaviour across platforms.
  • Performance Gains: Resource savings and task throughput improvements from deferrable execution.
  • Challenges & Future Enhancements: Lessons learned, compatibility, and ideas for broader support.

Whether orchestrating dbt models on a cloud warehouse or managing large-scale transformations, this session offers practical strategies to reduce resource contention and boost pipeline performance.
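
For readers unfamiliar with the mechanism, deferrable operators in general follow the shape sketched below (this is not the Cosmos implementation): execute() hands off to an asynchronous trigger and frees the worker slot, then resumes in a callback when the trigger fires. The class and module names are hypothetical.

  import asyncio
  from typing import Any, AsyncIterator

  from airflow.models.baseoperator import BaseOperator
  from airflow.triggers.base import BaseTrigger, TriggerEvent


  class WaitForJobTrigger(BaseTrigger):
      """Waits asynchronously in the triggerer process instead of blocking a worker."""

      def __init__(self, job_id: str):
          super().__init__()
          self.job_id = job_id

      def serialize(self) -> tuple[str, dict[str, Any]]:
          # The module path is hypothetical; it must point at wherever this class lives.
          return ("my_dags.triggers.WaitForJobTrigger", {"job_id": self.job_id})

      async def run(self) -> AsyncIterator[TriggerEvent]:
          while not await self._job_done():
              await asyncio.sleep(30)
          yield TriggerEvent({"job_id": self.job_id, "status": "done"})

      async def _job_done(self) -> bool:
          return True  # placeholder: poll the warehouse, dbt Cloud, etc.


  class WaitForJobOperator(BaseOperator):
      def __init__(self, job_id: str, **kwargs):
          super().__init__(**kwargs)
          self.job_id = job_id

      def execute(self, context):
          # Hands control to the trigger; no worker slot is held while waiting.
          self.defer(trigger=WaitForJobTrigger(self.job_id), method_name="execute_complete")

      def execute_complete(self, context, event):
          self.log.info("Job %s finished with status %s", event["job_id"], event["status"])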

Breaking News with Data Pipelines: How Airflow and AI Power Investigative Journalism

by Zdravko Hvarlingov & Ivan Nikolov

Investigative journalism often relies on uncovering hidden patterns in vast amounts of unstructured and semi-structured data. At the FT, we leverage Airflow to orchestrate AI-powered pipelines that transform complex, fragmented datasets into structured insights. Our Storyfinding team works closely with journalists to automate tedious data processing, enabling them to tell stories that might otherwise go untold.

This talk will explore how we use Airflow to process and analyze text, documents, and other difficult-to-structure data sources combining AI, machine learning, and advanced computational techniques to extract meaningful entities, relationships, and patterns. We’ll also showcase our connection analysis workflows, which link various datasets to reveal previously hidden chains of people and companies, a crucial capability for investigative reporting.

Bringing Apache Airflow to a Security-First Organization: A Battle Plan for Automation

by Oluwafemi Olawoyin

What happens when you introduce Apache Airflow in an environment where every change must pass compliance gates, infrastructure is tightly controlled, and public cloud is off-limits?

This talk shares the journey of implementing Airflow within a high-security, regulation-heavy setting, navigating legacy systems, manual workflows, and cautious stakeholders.

What You’ll Learn:

  1. How Airflow was deployed on-premise without Docker or cloud dependencies
  2. How GitHub Actions were used to bridge Windows-based engineers with a Linux-hosted Airflow instance
  3. How we worked through security reviews and compliance requirements to gain production approval
  4. What worked, what didn’t, and lessons for teams facing similar constraints

Who Should Attend:

Building a Transparent Data Workflow with Airflow and Data Catalog

by John Robert

As modern data ecosystems grow in complexity, ensuring transparency, discoverability, and governance in data workflows becomes critical. Apache Airflow, a powerful workflow orchestration tool, enables data engineers to build scalable pipelines, but without proper visibility into data lineage, ownership, and quality, teams risk operating in a black box.

In this talk, we will explore how integrating Airflow with a data catalog can bring clarity and transparency to data workflows. We’ll discuss how metadata-driven orchestration enhances data governance, enables lineage tracking, and improves collaboration across teams. Through real-world use cases, we will demonstrate how Airflow can automate metadata collection, update data catalogs dynamically, and ensure data quality at every stage of the pipeline.

Building Airflow 3 setups resilient to zonal/regional down events, ready for Disaster Recovery event

by Khaled Hassan

Want to be resilient to any zonal/regional down events when building Airflow in a cloud environment? Unforeseen disruptions in cloud infrastructure, whether isolated to specific zones or impacting entire regions, pose a tangible threat to the continuous operation of critical data workflows managed by Airflow. These outages, though often technical in nature, translate directly into real-world consequences, potentially causing interruptions in essential services, delays in crucial information delivery, and ultimately impacting the reliability and efficiency of various operational processes that businesses and individuals depend upon daily. The inability to process data reliably due to infrastructure instability can cascade into tangible setbacks across diverse sectors, highlighting the urgent need for resilient and robust Airflow deployments.

Building an Airflow Center of Excellence: Lessons from the Frontlines

by Jonathan Leek & Michelle Winters

As organizations scale their data infrastructure, Apache Airflow becomes a mission-critical component for orchestrating workflows efficiently. But scaling Airflow successfully isn’t just about running pipelines—it’s about building a Center of Excellence (CoE) that empowers teams with the right strategy, best practices, and long-term enablement. Join Jon Leek and Michelle Winters as they share their experiences helping customers design and implement Airflow Centers of Excellence. They’ll walk through real-world challenges, best practices, and the structured approach Astronomer takes to ensure teams have the right plan, resources, and support to succeed. Whether you’re just starting with Airflow or looking to optimize and scale your workflows, this session will give you a proven framework to build a sustainable Airflow Center of Excellence within your organization. 🚀

Building an MLOps Platform for 300+ ML/DS Specialists on Top of Airflow

by Aleksandr Shirokov, Roman Khomenko & Tarasov Alexey

As your organization scales to 20+ data science teams and 300+ DS/ML/DE engineers, you face a critical challenge: how to build a secure, reliable, and scalable orchestration layer that supports both fast experimentation and stable production workflows. We chose Airflow — and didn’t regret it! But to make it truly work at our scale, we had to rethink its architecture from the ground up.

In this talk, we’ll share how we turned Airflow into a powerful MLOps platform through its core capability: running pipelines across multiple K8s GPU clusters from a single UI (!) using per-cluster worker pools. To support ease of use, we developed MLTool — our own library for fast and standardized DAG development, integrated Vault for secure secret management across teams, enabled real-time logging with S3 persistence and built a custom SparkSubmitOperator for Kerberos-authenticated Spark/Hadoop jobs in Kubernetes. We also streamlined the developer experience — users can generate a GitLab repo and deploy a versioned pipeline to prod in under 10 minutes!

Cloud Composer: Introduction to Advanced Features

by Eugene Kosteev

During this workshop you will learn about the latest features published in Cloud Composer, a managed service for Apache Airflow on Google Cloud Platform.

Common provider abstractions: Key for multi-cloud data handling

by Vikram Koka

Enterprises want the flexibility to operate across multiple clouds, whether to optimize costs, improve resiliency, avoid vendor lock-in, or ensure data sovereignty. But for developers, that flexibility usually comes at the cost of extra complexity and redundant code. The goal here is simple: write once, run anywhere, with minimum boilerplate.

In Apache Airflow, we’ve already begun tackling this problem with abstractions like Common-SQL, which lets you write database queries once and run them on 20+ databases, from Snowflake to Postgres to SQLite to SAP HANA. Similarly, Common-IO standardizes cloud blob storage interactions across all public clouds. With Airflow 3.0, we are pushing this further by introducing a Common Message Bus provider, an abstraction initially supporting Amazon SQS and expanding to Google Pub/Sub and Apache Kafka soon after. We expect additional implementations such as Amazon Kinesis and Managed Kafka over time.
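
As a small illustration of the write-once-run-anywhere idea (not taken from the talk), the same Common-SQL task definition can target different databases simply by swapping the connection id; the connection ids below are assumed to exist.

  from datetime import datetime

  from airflow import DAG
  from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

  with DAG("portable_sql_example", schedule=None, start_date=datetime(2025, 1, 1), catchup=False):
      for conn_id in ("postgres_default", "snowflake_default"):  # assumed connection ids
          SQLExecuteQueryOperator(
              task_id=f"daily_rowcount_{conn_id}",
              conn_id=conn_id,
              sql="SELECT COUNT(*) FROM orders WHERE order_date = '{{ ds }}'",
          )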

Creating DuoFactory: An Orchestration Ecosystem with Airflow

by Belle Romea

Duolingo has built an internal tool, DuoFactory, to orchestrate AI-generated content using Airflow. The tool has been used to generate example sentences per lesson, math exercises, and Duoradio lessons. The ecosystem is flexible enough for various company needs. Some of these use cases involve end-to-end generation, where one click of a button generates content in the app. We have also created a Workflow Builder to orchestrate and iterate on generative AI workflows by creating one-time DAG instances, with a UI easy enough for non-engineers to use.

DAGLint: Elevating Airflow DAG Quality Through Automated Linting

by Snir Israeli

Maintaining consistency, code quality, and best practices for writing Airflow DAGs between teams and individual developers can be a significant challenge. Trying to achieve it using manual code reviews is both time-consuming and error-prone.

To solve this at Next, we decided to build a custom, internally developed linting tool for Airflow DAGs to help us evaluate their quality and uniformity. We call it DAGLint.

In this talk I am going to share why we chose to implement it, how we built it, and how we use it to elevate our code quality and standards throughout the entire data engineering group.

Data Quality and Observability with Airflow

by Ipsa Trivedi & Chirag Tailor

Tekmetric is the largest cloud based auto shop management system in the United States. We process vast amounts of data from various integrations with internal and external systems. Data quality and governance are crucial for both our internal operations and the success of our customers.

We leverage multi-step data processing pipelines using AWS services and Airflow. While we utilize traditional data pipeline workflows to manage and move data, we go beyond standard orchestration. After data is processed, we apply tailored quality checks for schema validation, record completeness, freshness, duplication and more.

Deadline Alerts in Airflow 3.1

by Dennis Ferruzzi

Do you have a DAG that needs to be done by a certain time? Have you tried to use Airflow 2’s SLA feature and found it restrictive or complicated? You aren’t alone! Come learn about the all-new Deadline Alerts feature in Airflow 3.1, which replaces SLAs. We will discuss how Deadline Alerts work and how they improve on the retired SLA feature. Then we will look at some examples of workflows you can build with the new feature, including some of the callback options and how they work, and finally look ahead to some future use cases for Deadlines on Tasks and even Assets.

Designing Scalable Retrieval-Augmented Generation (RAG) Pipelines at SAP with Apache Airflow

by Sagar Sharma

At SAP Business AI, we’ve transformed Retrieval-Augmented Generation (RAG) pipelines into enterprise-grade powerhouses using Apache Airflow. Our Generative AI Foundations Team developed a cutting-edge system that effectively grounds Large Language Models (LLMs) with rich SAP enterprise data. Powering Joule for Consultants, our innovative AI copilot, this pipeline manages the seamless ingestion, sophisticated metadata enrichment, and efficient lifecycle management of over a million structured and unstructured documents. By leveraging Airflow’s Dynamic DAGs, TaskFlow API, XCom, and Kubernetes Event-Driven Autoscaling (KEDA), we achieved unprecedented scalability and flexibility. Join our session to discover actionable insights, innovative scaling strategies, and a forward-looking vision for Pipeline-as-a-Service, empowering seamless integration of customer-generated content into scalable AI workflows.

Do you trust Airflow with your money? (We do!)

by Nick Bilozerov, Daniel Melchor & Sabrina Liu

Airflow is wonderfully, frustratingly complex - and so is global finance! Stripe has very specific needs all over the planet, and we have customized Airflow to adapt to the variety and rigor that we need to grow the GDP of the internet.

In this talk, you’ll learn:

  • How we support independent DAG change management for over 500 different teams running over 150k tasks.

  • How we’ve customized Airflow’s Kubernetes integration to comply with Stripe’s unique compliance requirements.

Dynamic Data Pipelines with DBT and Airflow

by Miquel Angel Andreu Febrer

This session showcases Okta’s innovative approach to data pipeline orchestration with dbt and Airflow. We’ll show how we’ve implemented dynamically generated Airflow DAGs based on dbt’s dependency graph. This allows us to enforce strict data quality standards by automatically executing downstream model tests before upstream model deployments, effectively preventing error cascades. The entire CI/CD pipeline, from dbt model changes to production DAG deployment, is fully automated. The result? Accelerated development cycles, reduced operational overhead, and bulletproof data reliability.
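
A simplified sketch of the general pattern, not Okta’s implementation, might read the dependency graph from dbt’s manifest.json and mirror it as Airflow tasks; the manifest path is an assumption and the imports follow the Airflow 2 style.

  import json
  from datetime import datetime

  from airflow import DAG
  from airflow.operators.bash import BashOperator

  with open("/opt/dbt/target/manifest.json") as f:  # path is an assumption
      manifest = json.load(f)

  models = {k: v for k, v in manifest["nodes"].items() if v["resource_type"] == "model"}

  with DAG("dbt_from_manifest", schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False):
      tasks = {
          node_id: BashOperator(
              task_id=node["name"],
              bash_command=f"dbt run --select {node['name']}",
          )
          for node_id, node in models.items()
      }
      # Mirror the dependency graph recorded in each node's depends_on section.
      for node_id, node in models.items():
          for upstream in node["depends_on"]["nodes"]:
              if upstream in tasks:
                  tasks[upstream] >> tasks[node_id]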

EdgeExecutor / Edge Worker - The new option to run anywhere

by Jens Scheffler & Daniel Wolf

Airflow 3 extends the deployment options so you can run your workload anywhere. You don’t need to bring your data to Airflow; you can bring the execution to where it needs to be.

You can connect any cloud and on-prem location together and generate a hybrid workflow from one central Airflow instance. Only a HTTP connection is needed.

We will present the use cases and concepts of the Edge deployment and how it also works in a hybrid setup with Celery or other executors.

ELT, AI, and Elections: Leveraging Airflow and Machine Learning to Analyze Voting Behavior at INTRVL

by Kyle McCluskey

Discover how Apache Airflow powers scalable ELT pipelines, enabling seamless data ingestion, transformation, and machine learning-driven insights. This session will walk through:

Automating Data Ingestion: Using Airflow to orchestrate raw data ingestion from third-party sources into your data lake (S3, GCP), ensuring a steady pipeline of high-quality training and prediction data.

Optimizing Transformations with Serverless Computing: Offloading intensive transformations to serverless functions (GCP Cloud Run, AWS Lambda) and machine learning models (BigQuery ML, Sagemaker), integrating their outputs seamlessly into Airflow workflows.
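
As a hedged illustration of the serverless offload pattern (not INTRVL’s code), a DAG can hand a heavy transformation to a Lambda function via the Amazon provider; the function name and payload below are invented.

  import json
  from datetime import datetime

  from airflow import DAG
  from airflow.providers.amazon.aws.operators.lambda_function import LambdaInvokeFunctionOperator

  with DAG("serverless_transform", schedule="@hourly", start_date=datetime(2025, 1, 1), catchup=False):
      LambdaInvokeFunctionOperator(
          task_id="transform_raw_events",
          function_name="transform-raw-events",  # hypothetical Lambda function
          payload=json.dumps({"input_prefix": "s3://example-bucket/raw/{{ ds }}/"}),
      )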

Empowering Precision Healthcare with Apache Airflow-iKang Healthcare Group’s DataHub Journey

by Yuan Luo & Huiliang Zhang

iKang Healthcare Group, serving nearly 10 million patients annually, built a centralized healthcare data hub powered by Apache Airflow to support its large-scale, real-time clinical operations. The platform integrates batch and streaming data in a lakehouse architecture, orchestrating complex workflows from data ingestion (HL7/FHIR) to clinical decision support.

Healthcare data’s inherent complexity—spanning structured lab results to unstructured clinical notes—requires dynamic, reliable orchestration. iKang uses Airflow’s DAGs, extensibility, and workflow-as-code capabilities to address challenges like multi-system coordination, semantic data linking, and fault-tolerant automation.

Enabling SQL testing in Airflow workflows using Pydantic types

by Gurmeet Saran & Kushal Thakkar

This session explores how to bring unit testing to SQL pipelines using Airflow. I’ll walk through the development of a SQL testing library that allows isolated testing of SQL logic by injecting mock data into base tables. To support this, we built a type system for AWS Glue tables using Pydantic, enabling schema validation and mock data generation. Over time, this type system also powered production data quality checks via a custom Airflow operator. Learn how this approach improves reliability, accelerates development, and scales testing across data workflows.
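
To illustrate the general idea rather than the library from the talk, a Pydantic model can double as a table schema definition, a row validator, and a mock-data factory; the table and columns below are invented.

  from datetime import date
  from decimal import Decimal

  from pydantic import BaseModel


  class OrdersTable(BaseModel):
      """Schema for a hypothetical orders table; doubles as a validator and mock factory."""

      order_id: int
      customer_id: int
      order_date: date
      amount: Decimal


  def mock_rows(n: int = 3) -> list[dict]:
      # Deterministic mock data to inject into a base table before running the SQL under test.
      return [
          OrdersTable(
              order_id=i,
              customer_id=100 + i,
              order_date=date(2025, 1, 1),
              amount=Decimal("9.99"),
          ).model_dump()
          for i in range(n)
      ]


  print(mock_rows())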

Enhancing Airflow REST API: From Basic Integration to Enterprise Scale

by Vishal Vijayvargiya

Apache Airflow’s REST API has evolved to support diverse orchestration needs, with managed services like MWAA introducing custom enhancements. One such feature, InvokeRestApi, enables dynamic interactions with external services while maintaining Airflow’s core orchestration capabilities.

In this talk, we will explore the architectural design behind InvokeRestApi, detailing how it enhances API-driven workflows. Beyond the architecture, we’ll share key challenges and learnings from implementing and scaling Airflow’s REST API in production environments. Topics include authentication, performance considerations, error handling, and best practices for integrating external APIs efficiently.

Enhancing DAG Management with DMS: A Scalable Solution for Airflow

by Sungji Yang & DaeHoon Song

In this talk, we will introduce the DAG Management Service (DMS), developed to address critical challenges in managing Airflow clusters. With over 10,000 active DAGs, a single Airflow cluster faces scaling limits and noisy neighbor issues, impacting task scheduling SLAs. DMS enhances reliability by distributing DAGs across multiple clusters and enforcing proper configurations.

We will also discuss how DMS streamlines Airflow version upgrades. Upgrading from an old Airflow version to the latest requires sequential updates and code modifications for over 10,000 DAGs. DMS proposes an efficient upgrade method, reducing dependency on users.

Enhancing Small Retailer Visibility

by Hannah Lundrigan & Alberto Hernandez

Small retailers often lack the data visibility that larger companies rely on for decision-making. In this session, we’ll dive into how Apache Airflow powers end-to-end machine learning pipelines that process inventory and sales data, enabling retailers and suppliers to gain valuable industry insights. We’ll cover feature engineering, model training, and automated inference workflows, along with strategies for handling messy, incomplete retail data. We will discuss how Airflow enables scalable ML-driven insights that improve demand forecasting, product categorization, and supply chain optimization.

Ensuring Data Accuracy & Consistency with Airflow and dbt Tests

by Bao Nguyen

As analytics engineers, ensuring data accuracy and consistency is critical, but how do we systematically catch errors before they impact stakeholders? This session will explore how to integrate Airflow with dbt tests to build reliable and automated data validation workflows.

We’ll cover:

  • How to orchestrate dbt tests with Airflow DAGs for real-time data quality checks.
  • Handling test failures with alerting and retry strategies.
  • Using custom dbt tests for advanced validation beyond built-in checks.
  • Best practices for data observability, logging, and monitoring failed runs.
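
One common shape for this kind of orchestration, offered here as a sketch rather than the speaker’s setup, is a dbt run followed by dbt test with retries and an alerting callback on failure; the commands, paths, and callback are placeholders.

  from datetime import datetime, timedelta

  from airflow import DAG
  from airflow.operators.bash import BashOperator


  def notify_on_failure(context):
      # Placeholder: push the failing task's details to Slack, PagerDuty, etc.
      print(f"Data quality failure in {context['task_instance'].task_id}")


  with DAG(
      "dbt_run_and_test",
      schedule="@daily",
      start_date=datetime(2025, 1, 1),
      catchup=False,
      default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
  ):
      run = BashOperator(task_id="dbt_run", bash_command="dbt run --project-dir /opt/dbt")
      test = BashOperator(
          task_id="dbt_test",
          bash_command="dbt test --project-dir /opt/dbt",
          on_failure_callback=notify_on_failure,
      )
      run >> test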

Event-Driven Airflow 3.0: Real-Time Orchestration with Pub/Sub

by Andrea Bombino & Nawfel Bacha

Traditional time-based scheduling in Airflow can lead to inefficiencies and delays. With Airflow 3.0, we can now leverage native event-driven DAG execution, enabling workflows to trigger instantly when data arrives—eliminating polling-based sensors and rigid schedules. This talk explores real-time orchestration using Airflow 3.0 and Google Cloud Pub/Sub. We’ll showcase how to build an event-driven pipeline where DAGs automatically trigger as new data lands, ensuring faster and more efficient processing. Through a live demo, we’ll demonstrate how Airflow listens to Pub/Sub messages and dynamically triggers dbt transformations only when fresh data is available. This approach improves scalability, reduces costs, and enhances orchestration efficiency.

Key takeaways:

  • How event-driven DAGs work vs. traditional scheduling
  • Best practices for integrating Airflow with Pub/Sub
  • Eliminating polling-based sensors for efficiency
  • Live demo: event-driven pipeline with Airflow 3.0, Pub/Sub & dbt

Event-Driven, Partition-Aware: Modern Orchestration with Airflow at Datadog

by Julien Le Dem & Zach Gottesman

Datadog is a world-class data platform ingesting more than 100 trillion events a day and providing real-time insights.

Before Airflow’s prominence, we built batch processing on Luigi, Spotify’s open-source orchestrator. As Airflow gained wide adoption, we evaluated adopting the major improvements of release 2.0, but opted for building our own orchestrator instead to realize our dataset-centric, event-driven vision.

Meanwhile, the 3.0 release aligned Airflow with the same vision we pursued internally, as a modern asset-driven orchestrator. It showed how futile it was to keep building our own given the momentum of the community. We evaluated several orchestrators and decided to join forces with the Airflow project.

From Centralization to Autonomy: Managing Airflow Pipelines through Multi-Tenancy

by Silver Pang

At the enterprise level, managing Airflow deployments across multiple teams can become complex, leading to bottlenecks and slowed development cycles. We will share our journey of decentralizing Airflow repositories to empower data engineering teams with multi-tenancy, clean folder structures, and streamlined DevOps processes.

We dive into how restructuring our Airflow architecture and utilizing repository templates allowed teams to generate new data pipelines effortlessly. This approach enables engineers to focus on business logic without worrying about underlying Airflow configurations. By automating deployments and reducing manual errors through CI/CD pipelines, we minimized operational overhead.

From Chaos to Cosmos: Automating DBT Workflows with Airflow at Riot

by Zach Ward

Breaking Barriers in Data Orchestration: Discover how Riot Games has integrated DBT, Airflow, Astronomer Cosmos, and custom automation to transform data workflows from a complex, code-heavy process into a simple, config-driven experience.

From Complexity to Simplicity: Learn how Riot has dramatically reduced the technical overhead of building and orchestrating DBT models, slashing Time to Production and accelerating Time to Insights.

Building a Seamless Data Pipeline Ecosystem: Get a high-level overview of how we’ve stitched together multiple technologies to create a unified, scalable, and developer-friendly pipelining system.

From Complexity to Simplicity with TaskHarbor: Trendyol's Path to a Unified Orchestration Platform

by Salih Goktug Kose & Burak Ozdemir

At Trendyol, Turkey’s leading e-commerce company, Apache Airflow powers our task orchestration, handling DAGs with 500+ tasks, complex interdependencies, and diverse environments. Managing on-prem Airflow instances posed challenges in scalability, maintenance, and deployment. To address these, we built TaskHarbor, a fully managed orchestration platform with a hybrid architecture—combining Airflow on GKE with on-prem resources for optimal performance and efficiency.

This talk covers how we:

  • Enabled seamless DAG synchronization across environments using GCS Fuse.
  • Optimized workload distribution via GCP’s HTTPS & TCP Load Balancers.
  • Automated infrastructure provisioning (GKE, CloudSQL, Kubernetes) using Terraform.
  • Simplified Airflow deployments by replacing Helm YAML files with a custom templating tool, reducing configurations to 10-15 lines.
  • Built a fully automated deployment pipeline, ensuring zero developer intervention.

We enhanced efficiency, reliability, and automation in hybrid orchestration by embracing a scalable, maintainable, and cloud-native strategy. Attendees will obtain practical insights into architecting Airflow at scale and optimizing deployments.

From DAGs to Insights: Business-Driven Airflow Use Cases

by Tala Karadsheh

Airflow is integral to GitHub’s data and insight generation. This session dives into use cases from GitHub where key business decisions are driven, at the root, with the help of Airflow. The session will also highlight how both GitHub and Airflow celebrate, promote, and nurture OSS innovations in their own ways.

From Legacy to Leading Edge: How Airflow Migration Unlocked Cross-Team Business Value

by Blagoy Kaloferov

At TrueCar, migrating hundreds of legacy Oozie workflows and in-house orchestration tools to Apache Airflow required key technical decisions that transformed our data platform architecture and organizational capabilities. We consolidated individual chained tasks into optimized DAGs leveraging native Airflow functionality to trigger compute across cloud environments. A crucial breakthrough was developing DAG generators to scale migration—essential for efficiently migrating hundreds of workflows while maintaining consistency.

By decoupling orchestration from compute, we gained flexibility to select optimal tools for specific outcomes—programmatic processing, analytics, batch jobs, or AI/ML pipelines. This resulted in cost reductions, performance improvements, and team agility. We also gained unprecedented visibility into DAG performance and dependency patterns previously invisible across fragmented systems.

Attendees will learn how we redesigned complex workflows into efficient DAGs using dynamic task generation, the architectural decisions that enabled platform innovation, and the decision framework that made our migration transformational.

From Oops to Secure Ops: Self-Hosted AI for Airflow Failure Diagnosis

by Nathan Hadfield

Last year, ‘From Oops to Ops’ showed how AI-powered failure analysis could help diagnose why Airflow tasks fail. But do we really need large, expensive cloud-based AI models to answer simple diagnostic questions? Relying on external AI APIs introduces privacy risks, unpredictable costs, and latency, often without clear benefits for this use case.

With the rise of distilled, open-source models, self-hosted failure analysis is now a practical alternative. This talk will explore how to deploy an AI service on infrastructure you control, compare cost, speed, and accuracy between OpenAI’s API and self-hosted models, and showcase a live demo of AI-powered task failure diagnosis using DeepSeek and Llama—running without external dependencies to keep data private and costs predictable.

Get Certified: DAG Authoring for Apache Airflow 3

by Marc Lamberti

We’re excited to offer Airflow Summit 2025 attendees an exclusive opportunity to earn their DAG Authoring certification in person, now updated to include all the latest Airflow 3.0 features. This certification workshop comes at no additional cost to summit attendees.

The DAG Authoring for Apache Airflow certification validates your expertise in advanced Airflow concepts and demonstrates your ability to build production-grade data pipelines. It covers TaskFlow API, Dynamic task mapping, Templating, Asset-driven scheduling, Best practices for production DAGs, and new Airflow 3.0 features and optimizations.

GitHub's Airflow Journey: Lessons, Mistakes, and Insights

by Oleksandr Slynko

This session explores how GitHub uses Apache Airflow for efficient data engineering. We will share nearly 9 years of experiences, including lessons learnt, mistakes made, and the ways we reduced our on-call and engineering burden. We’ll demonstrate how we keep data flowing smoothly while continuously evolving Airflow and other components of our data platform, ensuring safety and reliability. The session will touch on how we migrate Airflow between clouds without user impact. We’ll also cover how we cut down the time from idea to running a DAG in production, despite our Airflow repo being among the top 15 by number of PRs within GitHub.

How Airflow can help with Data Management and Governance

by Kunal Jain

Metadata management is a cornerstone of effective data governance, yet it presents unique challenges distinct from traditional data engineering. At scale, efficiently extracting metadata from relational and NoSQL databases demands specialized solutions. To address this, our team has developed custom Airflow operators that scan and extract metadata across various database technologies, orchestrating 100+ production jobs to ensure continuous and reliable metadata collection.

Now, we’re expanding beyond databases to tackle non-traditional data sources such as file repositories and message queues. This shift introduces new complexities, including processing structured and unstructured files, managing schema evolution in streaming data, and maintaining metadata consistency across heterogeneous sources. In this session, we’ll share our approach to building scalable metadata scanners, optimizing performance, and ensuring adaptability across diverse data environments. Attendees will gain insights into designing efficient metadata pipelines, overcoming common pitfalls, and leveraging Airflow to drive metadata governance at scale.

How Airflow Runs The Weather

by Eloi Codina Torras

Forecasting the weather and air quality is a logistical challenge. Numerical simulations are complex, resource-hungry, and sometimes fail without warning. Yet, our clients depend on accurate forecasts delivered daily and on time. At the heart of this operation is Airflow: the orchestration engine that keeps everything running.

In this session, we’ll dive into the world behind weather and air quality forecasts. In particular, we’ll explore:

  • The atmospheric modeling pipeline, to understand the unique demands it places on infrastructure
  • How we use Airflow to orchestrate complex simulations reliably and at scale, to inspire new ways of managing time-critical, compute-heavy workflows.
  • Our integration of Airflow with a high-performance computing (HPC) environment using Slurm, to run resource-intensive workloads efficiently on bare-metal machines.

At Meteosim we are experts on weather and air quality intelligence. With projects in over 80 countries, we support decision-making in industries where weather and air quality matter most: from daily operations to long-term sustainability.

How Airflow solves the coordination of decentralised teams at Vinted

by Oscar Ligthart & Rodrigo Loredo

Vinted is the biggest second-hand marketplace in Europe with multiple business verticals. Our data ecosystem has over 20 decentralized teams responsible for generating, transforming, and building Data Products from petabytes of data. This creates a challenging environment where inter-team dependencies, varied expertise with scheduling tools, and diverse use cases need to be managed efficiently. To tackle these challenges, we have centralized our approach by leveraging Apache Airflow to orchestrate data dependencies across teams.

How Pinterest Uses AI to Empower Airflow Users for Troubleshooting

by Rachel Sun

At Pinterest, there are over 10,000 DAGs supporting various use cases across different teams and roles. With this scale and diversity, user support has been an ongoing challenge to unlock productivity. As Airflow increasingly serves as a user interface to a variety of data and ML infrastructure behind the scenes, it’s common for issues from multiple areas to surface in Airflow, making triage and troubleshooting a challenge.

In this session, we will discuss the scale of the problem we are facing, how we have addressed it so far, and how we are introducing LLM AI to help solve this problem.

Implementing Operations Research Problems with Apache Airflow: From Modelling to Production

by Philippe Gagnon

This workshop will provide an overview of implementing operations research problems using Apache Airflow. This is a hands-on session where attendees will gain experience creating DAGs to define and manage workflows for classical operations research problems. The workshop will include several examples of how Airflow can be used to optimize and automate various decision-making processes, including:

Inventory management: How to use Airflow to optimize inventory levels and reduce stockouts by analyzing demand patterns, lead times, and other factors.

Introducing Apache Airflow® 3 – The Next Evolution in Orchestration

by Amogh Desai, Ash Berlin-Taylor, Brent Bovenzi, Bugra Ozturk, Daniel Standish, Jed Cunningham, Jens Scheffler, Kaxil Naik, Pierre Jeambrun, Tzu-ping Chung, Vikram Koka, Vincent Beck & Constance Martineau

Apache Airflow® 3 is here, bringing major improvements to data orchestration. In this keynote, core Airflow contributors will walk through key enhancements that boost flexibility, efficiency, and user experience.

Vikram Koka will kick things off with an overview of Airflow 3, followed by deep dives into DAG versioning (Jed Cunningham), enhanced backfilling (Daniel Standish), and a modernized UI (Brent Bovenzi & Pierre Jeambrun).

Next, Ash Berlin-Taylor, Kaxil Naik, and Amogh Desai will introduce the Task Execution Interface and Task SDK, enabling tasks in any environment and language. Jens Scheffler will showcase the Edge Executor, while Tzu-ping Chung and Vincent Beck will demo event-driven scheduling and data assets. Finally, Buğra Öztürk will unveil CLI enhancements for automation and debugging.

Learn from Deutsche Bank: Using Apache Airflow in Regulated Environments

by Christian Foernges

Operating within the stringent regulatory landscape of Corporate Banking, Deutsche Bank relies heavily on robust data orchestration. This session explores how Deutsche Bank’s Corporate Bank leverages Apache Airflow across diverse environments, including both on-premises infrastructure and cloud platforms. Discover their approach to managing critical data & analytics workflows, encompassing areas like regulatory reporting, data integration and complex data processing pipelines. Gain insights into the architectural patterns and operational best practices employed to ensure compliance, security, and scalability when running Airflow at scale in a highly regulated, hybrid setting.

Lessons learned for scaling up Airflow 3 in Public Cloud

by Przemek Wiech & Augusto Hidalgo

Apache Airflow 3 is the new state-of-the-art version of Airflow. For many users who plan to adopt Airflow 3, it’s important to understand how it behaves from a performance perspective compared to Airflow 2.

This presentation will share performance results for various Airflow 3 configurations and provide information that should give Airflow 3 adopters a good understanding of its performance.

The reference Airflow 3 configuration uses a Kubernetes cluster as the compute layer and PostgreSQL as the Airflow database, with the tests performed on Google Cloud Platform. Performance tests will be run using the community version of the performance test framework, and there may be references to Cloud Composer (a managed service for Apache Airflow). The tests will be done in production-grade configurations that should be good references for Airflow community users.

Lessons learned from migrating to Airflow @ LI Scale

by Arthur Chen, Trevor DeVore & Deng Pan

At LinkedIn, our data pipelines process exabytes of data, with our offline infrastructure executing 300K ETL workflows daily and 10K concurrent executions. Historically, these workloads ran on our legacy system, Azkaban, which faced UX, scalability, and operational challenges. To modernize our infra, we built a managed Airflow service, leveraging its enhanced developer & operator experience, rich feature set, and strong OSS community support. That initiated LinkedIn’s largest-ever infrastructure migration—transitioning thousands of legacy workflows to Airflow.

LinkedIn's Journey on Scaling Airflow

by Rahul Gade & Arun Kumar

Last year, we shared how LinkedIn’s continuous deployment platform (LCD) leveraged Apache Airflow to streamline and automate deployment workflows. LCD is the deployment platform inside LinkedIn, actively used by all 10,000+ engineers at LinkedIn.

This year, we take a deeper dive into the challenges, solutions, and engineering innovations that helped us scale Airflow to support thousands of concurrent tasks while maintaining usability and reliability.

Key Takeaways: Abstracting Airflow for a Better User Experience – How we designed a system where users could define and update their workflows without directly interacting with Airflow.

LLM-Powered Review Analysis: Optimising Data Engineering using Airflow

by Naseem Shah

A real-world journey of how my small team at Xena Intelligence built robust data pipelines for our enterprise customers using Airflow. If you’re a data engineer, or part of a small team, this talk is for you. Learn how we orchestrated a complex workflow to process millions of public reviews.

What You’ll Learn:

  1. Cost-Efficient DAG Design: Decomposing complex processes into atomic tasks using the TaskFlow API, XComs, mapped tasks, and task groups. Diving into one of our DAGs as a concrete example of how our approach optimizes parallelism, error handling, delivery speed, and reliability.

LLMOps with Airflow 3.0 and the Airflow AI SDK

by Ryan Hatter

Airflow 3 brings several exciting new features that better support MLOps:

  • Native, intuitive backfills
  • Removal of the unique execution date for dag runs
  • Native support for event-driven scheduling

These features, combined with the Airflow AI SDK, enable dag authors to easily build scalable, maintainable, and performant LLMOps pipelines.

In this talk, we’ll go through a series of workflows that use the Airflow AI SDK to empower Astronomer’s support staff to more quickly resolve problems faced by Astronomer’s customers.

Managing Airflow DAGs Across DAG and ETL Repos

by Yunhao Qing

At Lyft, we manage Airflow DAGs across both the ETL and DAG repos, each serving distinct needs. The ETL repo is ideal for simple use cases and users with only a few DAGs, offering a streamlined workflow. Meanwhile, the DAG repo supports power users with numerous DAGs, custom dependencies, and complex ML pipelines. In this session, I’ll share how we structure these repos, the trade-offs involved, and best practices for scaling Airflow DAG management across diverse teams and workloads.

Mastering Event-Driven in Airflow 3: Building Scalable Data Pipelines

by Luan Moreno Medeiros Maciel

Transform your data pipelines with event-driven scheduling in Airflow 3. In this hands-on workshop, you’ll:

  • Set up AssetWatchers to track S3, Kafka, or database events
  • Build DAGs that trigger instantly on new data
  • Master scaling techniques for high-volume workflows

Create a live pipeline—process logs or IoT data in real time—and adapt it to your needs. No event-driven experience required; just bring a laptop and Airflow basics. Gain practical skills to make your pipelines responsive and efficient.
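
As a rough sketch of the workshop’s starting point (import paths and argument names follow the Airflow 3 documentation as best understood and may differ slightly in released providers), an AssetWatcher ties a message-queue trigger to an Asset that schedules a DAG; the queue URL is invented.

  from datetime import datetime

  from airflow.providers.common.messaging.triggers.msg_queue import MessageQueueTrigger
  from airflow.sdk import Asset, AssetWatcher, dag, task

  # The queue URL is invented; the watcher fires whenever a message lands on it.
  trigger = MessageQueueTrigger(queue="https://sqs.us-east-1.amazonaws.com/0123456789/new-files")
  new_files = Asset("new_files", watchers=[AssetWatcher(name="sqs_watcher", trigger=trigger)])


  @dag(schedule=[new_files], start_date=datetime(2025, 1, 1), catchup=False)
  def event_driven_pipeline():
      @task
      def process():
          ...  # runs as soon as the watcher sees a new message

      process()


  event_driven_pipeline()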

Model Context Protocol with Airflow

by Abhishek Bhakat & Sudarshan Chaudhari

In today’s data-driven world, effective workflow management and AI are crucial for success. However, there’s a notable gap between Airflow and AI. Our presentation offers a solution to close this gap.

We propose an MCP (Model Context Protocol) server to act as a bridge. We’ll dive into two paths:

  • AI-Augmented Airflow: Enhancing Airflow with AI to improve error handling, automate DAG generation, proactively detect issues, and optimize resource use.
  • Airflow-Powered AI: Utilizing Airflow’s reliability to empower LLMs in executing complex tasks, orchestrating AI agents, and supporting decision-making with real-time data.

Key takeaways:

Multi-Instance Asset Synchronization - push or pull ?

by Sebastien Crocquevieille

As Data Engineers, our jobs regularly include scheduling or scaling workflows.

But have you ever asked yourself: can I scale my scheduling?

It turns out that you can! But doing so raises a number of issues that need to be addressed.

In this talk we’ll be:

  • Recapping Asset-aware scheduling in Apache Airflow
  • Discussing diverse methods to upscale our scheduling
  • Solving the issue of keeping our Airflow Assets synchronized between instances
  • Comparing our own push-based solution with the built-in solution from AIP-82, and the pros and cons of each method.

I hope you will enjoy it!

Navigating Secure and Cost-Efficient Flink Batch on Kubernetes with Airflow

by Purshotam Shah & David Scherba

New Tools, Same Craft: The Developer's Toolbox in 2025

by Brooke Jamieson

Our development workflows look dramatically different than they did a year ago. Code generation, automated testing, and AI-assisted documentation tools are now part of many developers’ daily work. Yet as these tools reshape how we code, I’ve noticed something worth examining: while our toolbox is changing rapidly, the core of being a good developer hasn’t. Problem-solving, collaborative debugging, and systems thinking remain as crucial as ever.

In this keynote, I’ll share observations about:

No More Missed Beats: How Airflow Rescued Our Analytics Pipeline

by Pei-Chi (Miko) Chen

Before Airflow, our BigQuery pipelines at Create Music Group operated like musicians without a conductor—each playing on its own schedule, regardless of whether upstream data was ready. As our data platform grew, this chaos led to spiralling costs, performance bottlenecks, and became utterly unsustainable.

This talk tells the story of how Create Music Group brought harmony to its data workflows by adopting Apache Airflow and the Medallion architecture, ultimately slashing our data processing costs by 50%. We’ll show how moving to event-driven scheduling with datasets helped eliminate stale data issues, dramatically improved performance, and unlocked faster iteration across teams. Discover how we replaced repetitive SQL with standardized dimension/fact tables, empowering analysts in a safer sandbox.

Operation Airlift: Uber's ongoing journey of migrating 200K pipelines to a single Airflow 3 instance

by Sumit Maheshwari

Yes, you read that right — 200,000 pipelines, nearly 1 million task executions per day, all powered by a single Airflow instance.

In this session, we’ll take you behind the scenes of one of the boldest orchestration projects ever attempted: how Uber’s data platform team is executing what might be the largest Apache Airflow migration in history — and doing it straight to Airflow 3.

From scaling challenges and architectural choices to lessons learned in high-throughput orchestration, this is a deep dive into the tech, the chaos, and the strategy behind making data fly at unprecedented scale.

Orchestrating AI Knowledge Bases with Apache Airflow

by Theo Lebrun

In the age of Generative AI, knowledge bases are the backbone of intelligent systems, enabling them to deliver accurate and context-aware responses. But how do you ensure that these knowledge bases remain up-to-date and relevant in a rapidly changing world? Enter Apache Airflow, a robust orchestration tool that streamlines the automation of data workflows.

This talk will explore how Airflow can be leveraged to manage and update AI knowledge bases across multiple data sources. We’ll dive into the architecture, demonstrate how Airflow enables efficient data extraction, transformation, and loading (ETL), and share insights on tackling challenges like data consistency, scheduling, and scalability.
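
As a simple illustration of the shape such a pipeline can take (the helpers below are hypothetical, not the speaker’s code), a knowledge-base refresh often reduces to an extract/embed/load DAG:

```python
# Sketch of a knowledge-base refresh DAG; the extract/embed/load helpers are hypothetical.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@hourly", start_date=datetime(2025, 1, 1))
def refresh_knowledge_base():
    @task
    def extract_documents() -> list[dict]:
        # Pull new or changed documents from source systems (wiki, tickets, S3, ...)
        return [{"id": "doc-1", "text": "..."}]

    @task
    def embed(documents: list[dict]) -> list[dict]:
        # Compute embeddings with the model of your choice (placeholder values here)
        return [{**d, "embedding": [0.0, 0.1]} for d in documents]

    @task
    def load(vectors: list[dict]) -> None:
        # Upsert into the vector store backing the knowledge base (hypothetical sink)
        print(f"upserting {len(vectors)} vectors")

    load(embed(extract_documents()))

refresh_knowledge_base()
```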

Orchestrating Apache Airflow ML Workflows at Scale with SageMaker Unified Studio

by Vinod Jayendra, Suba Palanisamy, Sean Bjurstrom & Anurag Srivastava

As organizations increasingly rely on data-driven applications, managing the diverse tools, data, and teams involved can create challenges. Amazon SageMaker Unified Studio addresses this by providing an integrated, governed platform to orchestrate end-to-end data and AI/ML workflows.

In this workshop, we’ll explore how to leverage Amazon SageMaker Unified Studio to build and deploy scalable Apache Airflow workflows that span the data and AI/ML lifecycle. We’ll walk through real-world examples showcasing how this AWS service brings together familiar Airflow capabilities with SageMaker’s data processing, model training, and inference features - all within a unified, collaborative workspace.

Orchestrating Data Quality - Quality Data Brought To You By Airflow

by Maggie Stark & Marion Azoulai

Ensuring high-quality data is essential for building user trust and enabling data teams to work efficiently. In this talk, we’ll explore how the Astronomer data team leverages Airflow to uphold data quality across complex pipelines; minimizing firefighting and maximizing confidence in reported metrics.

Maintaining data quality requires a multi-faceted approach: safeguarding the integrity of source data, orchestrating pipelines reliably, writing robust code, and maintaining consistency in outputs. We’ve embedded data quality into the developer experience, so it’s always at the forefront instead of in the backlog of tech debt.
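
The talk covers the Astronomer team’s own approach; purely to illustrate what in-pipeline checks can look like, here is a sketch using the common-sql provider’s column checks, where the table name, connection id, and thresholds are assumptions:

```python
# Sketch: declarative column-level data quality checks with the common-sql provider.
# Table name, connection id, and thresholds are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLColumnCheckOperator

with DAG(dag_id="orders_quality_checks", schedule="@daily", start_date=datetime(2025, 1, 1)) as dag:
    SQLColumnCheckOperator(
        task_id="check_orders_quality",
        conn_id="warehouse",
        table="analytics.orders",
        column_mapping={
            "order_id": {"unique_check": {"equal_to": 0}, "null_check": {"equal_to": 0}},
            "amount": {"min": {"geq_to": 0}},
        },
    )
```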

Orchestrating Global Market Data Pipelines with Airflow

by Di Wu

In this presentation, I will highlight how Apache Airflow addresses key data management challenges for Exchange-Traded Funds (ETFs) in the global financial market. ETFs, which combine features of mutual funds and stocks, track indexes, commodities, or baskets of assets and trade on major stock exchanges. Because they operate around the clock across multiple time zones, ETF managers must navigate diverse regulations, coordinate complex operational constraints, and ensure accurate valuations.

This often requires integrating data from vendors for pricing and reference details. These data sets arrive at different times, can conflict, and must pass rigorous quality checks before being published for global investors. Managing updates, orchestrating workflows, and maintaining high data quality present significant hurdles.

Apache Airflow tackles these issues by scheduling repetitive tasks and enabling event-triggered job runs for immediate data checks. It offers monitoring and alerting, thus reducing manual intervention and errors. Using DAGs, Airflow scales efficiently, streamlining complex data ingestion, validation, and publication processes.

Orchestrating MLOps and Data Transformation at EDB with Airflow

by Karthik Dulam

This talk explores EDB’s journey from siloed reporting to a unified data platform, powered by Airflow. We’ll delve into the architectural evolution, showcasing how Airflow orchestrates a diverse range of use cases, from Analytics Engineering to complex MLOps pipelines.

Learn how EDB leverages Airflow and Cosmos to integrate dbt for robust data transformations, ensuring data quality and consistency.

We’ll provide a detailed case study of our MLOps implementation, demonstrating how Airflow manages training, inference, and model monitoring pipelines for Azure Machine Learning models.

Productionising dbt-core with Airflow

by Tatiana Al-Chueyr Martins, Pankaj Singh & Pankaj Koti

As a popular open-source library for analytics engineering, dbt is often combined with Airflow. Orchestrating and executing dbt models as Airflow DAGs adds a layer of control over tasks and observability, and provides a reliable, scalable environment in which to run dbt models.

This workshop will cover a step-by-step guide to Cosmos (https://github.com/astronomer/astronomer-cosmos), a popular open-source package from Astronomer that helps you quickly run your dbt Core projects as Airflow DAGs and Task Groups, all with just a few lines of code. We’ll walk through:
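
For orientation before the workshop, a minimal Cosmos DAG looks roughly like the sketch below; the project path, profile mapping, connection id, and schedule are assumptions, not the workshop’s exact setup.

```python
# Rough sketch of rendering a dbt Core project as an Airflow DAG with Cosmos.
# Project path, profile mapping, and connection id are illustrative assumptions.
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import PostgresUserPasswordProfileMapping

jaffle_shop = DbtDag(
    dag_id="jaffle_shop",
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/jaffle_shop"),
    profile_config=ProfileConfig(
        profile_name="jaffle_shop",
        target_name="dev",
        profile_mapping=PostgresUserPasswordProfileMapping(
            conn_id="postgres_default",
            profile_args={"schema": "public"},
        ),
    ),
)
```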

Purple is the new green: harnessing deferrable operators to improve performance & reduce costs

by Ethan Shalev

Airflow’s traditional execution model often leads to wasted resources: worker nodes sitting idle, waiting on external systems. At Wix, we tackled this inefficiency head-on by refactoring our in-house operators to support Airflow’s deferrable execution model.

Join us on a walk through Wix’s journey to a more efficient Airflow setup, from identifying bottlenecks to implementing deferrable operators and reaping their benefits. We’ll share the alternatives considered, the refactoring process, and how the team seamlessly integrated deferrable execution with no disruption to data engineers’ workflows.
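
As a generic illustration of the pattern (not Wix’s operators), a blocking wait can be handed off to the triggerer roughly like this:

```python
# Generic sketch of the deferrable pattern: free the worker slot while waiting.
from datetime import timedelta

from airflow.models.baseoperator import BaseOperator
from airflow.triggers.temporal import TimeDeltaTrigger

class WaitThenContinueOperator(BaseOperator):
    """Waits without occupying a worker slot, then resumes on a worker."""

    def execute(self, context):
        # Hand control to the triggerer instead of sleeping on a worker
        self.defer(
            trigger=TimeDeltaTrigger(timedelta(minutes=10)),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Called back on a worker once the trigger fires
        self.log.info("Wait finished, continuing with event: %s", event)
        return event
```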

Run Airflow tasks on your coffee machine

by Cedrik Neumann

Airflow 3 comes with two new features: Edge execution and the task SDK. Powered by an HTTP API, these make it possible to write and execute Airflow tasks in any language from anywhere.

In this session I will explain some of the APIs needed and show how to interact with them based on an embedded toy worker written in Rust and running on an ESP32-C3. Furthermore I will provide practical tips on writing your own edge worker and how to develop against a running instance of Airflow.

Scaling Airflow with MWAA: A Multi-Tenant Enterprise Data Platform Journey

by Srinivas Podila & Venkat Sadineni

We use Amazon MWAA to orchestrate our enterprise data warehouse and MDM solutions. Our DAGs extract data from Salesforce, Oracle, Workday, and SFTP, transform it using Mulesoft, Informatica, and DBT, and load it into Salesforce Data Cloud and Snowflake. MWAA is configured as a multi-tenant platform, supporting more than 10 teams and managing thousands of DAGs per environment. Each team follows a full SDLC and has a dedicated Git repo integrated with Jenkins-based CI/CD pipelines for independent deployments.

Scaling and Unifying Multiple Airflow Instances with Orchestration Frederator

by Chirag Todarka & Alvin Zhang

In large organizations, multiple Apache Airflow instances often arise organically—driven by team-specific needs, distinct use cases, or tiered workloads. This fragmentation introduces complexity, operational overhead, and higher infrastructure costs. To address these challenges, we developed the “Orchestration Frederator,” a solution designed to unify and horizontally scale multiple Airflow deployments seamlessly.

This session will detail our journey in implementing Orchestration Frederator, highlighting how we achieved:

  • Horizontal Scalability: Seamlessly scaling Airflow across multiple instances without operational overhead.

Seamless Airflow Upgrades: Migrating from 2.x to 3

by Ankit Chaurasia

Airflow 3 has officially arrived! In this session, we’ll start by discussing prerequisites for a smooth upgrade from Airflow 2.x to Airflow 3, including Airflow version requirements, removing deprecated SubDAGs, and backing up and cleaning your metadata database prior to migration. We’ll then explore the new CLI utility airflow config update [--fix] for auto-applying configuration changes. We’ll demo cleaning old XCom data to speed up schema migration.

During this session, attendees will learn to verify and adapt their pipelines for Airflow 3 using a Ruff-based upgrade utility. I will demo running ruff check dag/ --select AIR301 to surface scheduling issues, inspecting fixes via ruff check dag/ --select AIR301 --show-fixes, and applying corrections with ruff check dag/ --select AIR301 --fix. We’ll also examine rules AIR302 for deprecated config and AIR303 for provider package migrations. By the end, your DAGs will pass all AIR3xx checks error-free.

Seamless Integration: Building Applications That Leverage Airflow's Database Migration Framework

by Ephraim Anierobi

This session presents a comprehensive guide to building applications that integrate with Apache Airflow’s database migration system. We’ll explore how to harness Airflow’s robust Alembic-based migration toolchain to maintain schema compatibility between Airflow and custom applications, enabling developers to create solutions that evolve alongside the Airflow ecosystem without disruption.
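
As a small, assumption-laden illustration of the kind of integration discussed (not the presenter’s code), an external application can run its own Alembic migrations against the same database while keeping a separate version table so its history never collides with Airflow’s alembic_version table:

```python
# Sketch: an application running its own Alembic migrations alongside Airflow's.
# The ini path, database URL, and version table name are assumptions.
from alembic import command
from alembic.config import Config

cfg = Config("my_app/alembic.ini")  # the application's own migration environment
cfg.set_main_option("sqlalchemy.url", "postgresql://airflow:***@db:5432/airflow")

# In the application's env.py, configure a distinct version table, e.g.
#   context.configure(connection=conn, target_metadata=metadata,
#                     version_table="my_app_alembic_version")
# so the app's revision history stays separate from Airflow's `alembic_version`.

command.upgrade(cfg, "head")  # apply the application's migrations up to the latest revision
```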

Seamless Migration: Leveraging Ruff for a Smooth Transition from Airflow 2 to Airflow 3

by Wei Lee

Migrating from Airflow 2 to the newly released Airflow 3 may seem intimidating due to numerous breaking changes and the introduction of new features. Although a backward compatibility layer has been implemented and most existing DAGs should work fine, some features—such as SubDAGs and execution_date—have been removed based on community consensus.

To support this transition, we worked with Ruff to establish rules that automatically identify removed or deprecated features and even assist in fixing them. In this presentation, I will outline our current Ruff features, the migration rules from Airflow 2 to 3, and how this experience opens the door for us to promote best practices in Airflow through Ruff in the future.

Security made us do it: Airflow’s new Task Execution Architecture

by Amogh Desai & Ash Berlin-Taylor

The Airflow 2 architecture has strong coupling between the Airflow core & the user code running in an Airflow task. This poses barriers in security, maintenance, and adoption. One such threat is that user code can access Airflow’s source of truth - the metadata DB - and run any query against it! From a scalability angle, ‘n’ tasks create ‘n’ DB connections, limiting Airflow’s ability to scale effectively.

To address this we proposed AIP-72 – a client-server model for task execution. The new architecture addresses several long-standing issues, including DB isolation from workers, dependency conflicts between Airflow core & workers, and the ‘n’ DB connections problem. The new architecture has two parts:

Semiconductor (Chip) Design Workflow Orchestration with Airflow

by Dheeraj Turaga

The design of Qualcomm’s Snapdragon System-On-Chip (SoCs) involves several hundred complex workflows orchestrated across multiple data centers, taking the design from RTL to GDS. In the Snapdragon Oryon Custom CPU team, we introduced Airflow about 2 years ago to orchestrate design, verification, emulation, CI/CD, and physical implementation of our CPUs.

Use cases:

  • Standardization and Templatization: We standardize and templatize common workflows, allowing designers to verify their designs by customizing YAML parameters.
  • Custom Shell Operators: We created custom shell operators (tcshrc) to source project environments and work with internal tooling.
  • Smart Retries: We use pre/post-execute hooks to trigger smart retries on failure.
  • Dynamic Celery Workers: We auto-create Celery workers on the fly on our High-Performance Compute (HPC) clusters to launch and manage Electronic Design Automation (EDA) workloads.
  • Hybrid Executor Strategy: We use a hybrid executor strategy (CeleryExecutor and EdgeExecutor) to orchestrate tasks across multiple data centers.
  • EdgeExecutor for Remote Testing: We leverage EdgeExecutor to access post-silicon hardware in remote locations.

Simplifying Data Lineage: How OpenLineage Empowers Airflow and Beyond

by Harel Shein & Maciej Obuchowski

OpenLineage has simplified collecting lineage metadata across the data ecosystem by standardizing its representation in an extensible model. It has enabled a whole ecosystem of tools that improve data pipeline reliability and ease troubleshooting in production environments. In this talk, we’ll briefly introduce the OpenLineage model and explore how this metadata is collected from Airflow, Spark, dbt, and Flink. We’ll demonstrate how to extract valuable insights and outline practical benefits and common challenges when building ingestion, processing, and storage for OpenLineage data. We will also briefly show how OpenLineage events can be used to observe data pipelines exhaustively and the benefits that brings.

Simplifying Data Management with DAG Factory

by Katarzyna Kalek & Jakub Orlowski

At OLX, we connect millions of people daily through our online marketplace while relying on robust data pipelines. In this talk, we explore how the DAG Factory concept elevates data governance, lineage, and discovery by centralizing operator logic and restricting direct DAG creation. This approach enforces code quality, optimizes resources, maintains infrastructure hygiene and enables smooth version upgrades. We then leverage consistent naming conventions in Airflow to build targeted namespaces, aligning teams with global policies while preserving autonomy. Integrating external tools like AWS Lake Formation and Open Metadata further unifies governance, making it straightforward to manage and secure data. This is critical when handling hundreds or even thousands of active DAGs.

If the idea of storing 1,600 pipelines in one folder seems overwhelming, join us to learn how the DAG Factory concept simplifies pipeline management. We’ll also share insights from OLX, highlighting how thoughtful design fosters oversight, efficiency, and discoverability across diverse use cases.
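
The specifics at OLX are their own; as a generic flavour of the concept, a factory that turns declarative configuration into DAGs can look like the sketch below, where the config shape and helper names are hypothetical.

```python
# Generic sketch of a DAG factory: teams declare config, the factory owns DAG creation.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

PIPELINES = {  # hypothetical declarative config, e.g. loaded from YAML
    "ads_daily": {"schedule": "@daily", "steps": ["extract", "transform", "load"]},
    "users_hourly": {"schedule": "@hourly", "steps": ["extract", "load"]},
}

def build_dag(name: str, spec: dict) -> DAG:
    with DAG(dag_id=name, schedule=spec["schedule"], start_date=datetime(2025, 1, 1)) as dag:
        previous = None
        for step in spec["steps"]:
            op = EmptyOperator(task_id=step)  # a real factory would map steps to vetted operators
            if previous is not None:
                previous >> op
            previous = op
    return dag

# Register one DAG per config entry so Airflow's DAG processor discovers them
for _name, _spec in PIPELINES.items():
    globals()[_name] = build_dag(_name, _spec)
```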

Single Pane Orchestration using Airflow for multiple teams at GoDaddy

by Ankit Sahu & Brandon Abear

As the adoption of Airflow increases within large enterprises to orchestrate their data pipelines, more than one team needs to create, manage, and run their workflows in isolation. With multi-tenancy not supported natively in Airflow, customers are adopting alternative ways to allow multiple teams to use the same infrastructure. In this session, we will explore how GoDaddy uses MWAA to build a single-pane Airflow setup for multiple teams with a common observability platform. Attendees will gain insights into the use case, the solution, and its implementation challenges and benefits.

Supercharging Apache Airflow: Enhancing Core Components with Rust

by Shahar Epstein

Apache Airflow is a powerful workflow orchestrator, but as workloads grow, its Python-based components can become performance bottlenecks. This talk explores how Rust, with its speed, safety, and concurrency advantages, can enhance Airflow’s core components (e.g., the scheduler, DAG processor, etc.). We’ll dive into the motivations behind using Rust, architectural trade-offs, and the challenges of bridging the gap between Python and Rust. A proof-of-concept showcasing an Airflow scheduler rewritten in Rust will demonstrate the potential benefits of this approach.

Sustainable Computing in Airflow: Reducing Emissions with Carbon Aware Scheduling

by Ryan Singman

As the climate impact of cloud computing grows, carbon aware computing offers a promising way to cut emissions without compromising performance. By shifting workloads to times of lower carbon intensity on the power grid, we can achieve significant emissions reductions—often 10–30%—with no code changes to the underlying task.

In this talk, we’ll explore the principles behind carbon-aware computing, walk through how these ideas translate to actionable reductions in Airflow, and introduce the open-source CarbonAware provider for Airflow. We’ll also highlight how Airflow’s deferrable operators, task metadata, and flexible execution model make it uniquely well suited for temporal shifting based on grid carbon intensity.

Task failures troubleshooting based on Airflow & Kubernetes signals

by Khadija Al Ahyane

Per the Airflow community survey, Kubernetes is the most popular compute platform for running Airflow. When run on Kubernetes, Airflow gains many benefits out of the box, such as monitoring, reliability, ease of deployment, scalability, and autoscaling. On the other hand, running Airflow on Kubernetes means running one sophisticated distributed system on top of another, which makes troubleshooting Airflow task and DAG failures harder.

This session tackles that bottleneck head-on, introducing a practical approach to building an automated diagnostic pipeline for Airflow on Kubernetes. Imagine offloading tedious investigations to a system that, on task failure, automatically collects and correlates key signals from Kubernetes components (linking Airflow tasks to specific Pods and their events), Kubernetes/GKE monitoring, and relevant logs—pinpointing root causes and suggesting actionable fixes.
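
To make the idea more tangible, here is a rough sketch (under stated assumptions, not the presented pipeline) of a task-level failure callback that starts the correlation by pulling recent Kubernetes events from the namespace the task ran in:

```python
# Sketch: on task failure, collect recent Kubernetes events for correlation.
# The namespace, pod-name matching heuristic, and report destination are assumptions.
from kubernetes import client, config

def collect_k8s_signals(context):
    ti = context["task_instance"]
    config.load_incluster_config()  # assumes the callback runs inside the cluster
    v1 = client.CoreV1Api()
    events = v1.list_namespaced_event(namespace="airflow")
    # Naive correlation: keep events whose involved object name mentions the task id
    related = [
        e for e in events.items
        if ti.task_id in (e.involved_object.name or "")
    ]
    for e in related:
        print(f"{e.last_timestamp} {e.reason}: {e.message}")
    # A fuller pipeline would also fetch pod specs and logs, then ship a report somewhere

default_args = {"on_failure_callback": collect_k8s_signals}  # attach to DAGs as needed
```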

The Secret to Airflow's Evergreen Build: CI/CD magic

by Amogh Desai, Jarek Potiuk & Pavan kumar Gopidesu

Have you ever wondered why Apache Airflow builds are asymptotically(*) green? That striving for a “perennial green build” is not magic; it’s the result of continuous, often unseen engineering effort within our CI/CD pipelines & dev environments. This dedication ensures that maintainers can work efficiently & contributors can onboard smoothly.

To support the ever-growing contributor base, we have a CI/CD team run by volunteers putting significant work into the foundational tooling. In this talk, we reveal some of the innovative solutions we have implemented, like:

Transforming Data Engineering: Achieving Efficiency and Ease with an Intuitive Orchestration Solution

by Rakesh Kumar Tai & Mili Tripathi

In the rapidly evolving field of data engineering and data science, efficiency and ease of use are crucial. Our innovative solution offers a user-friendly interface to manage and schedule custom PySpark, PySQL, Python, and SQL code, streamlining the process from development to production. Using Airflow at the backend, this tool eliminates the complexities of infrastructure management, version control, CI/CD processes, and workflow orchestration.

The intuitive UI allows users to upload code, configure job parameters, and set schedules effortlessly, without the need for additional scripting or coding. Additionally, users have the flexibility to bring their own custom artifactory solution and run their code. In summary, our solution significantly enhances the orchestration and scheduling of custom code, breaking down traditional barriers and empowering organizations to maximize their data’s potential and drive innovation efficiently. Whether you are an individual data scientist or part of a large data engineering team, this tool provides the resources needed to streamline your workflow and achieve your goals faster than ever before.

Transforming Insurance underwriting with Agentic AI

by Peeyush Rai

The weav.ai platform is built on top of Apache Airflow, chosen for its deterministic, predictable execution coupled with extreme developer customizability. weav.ai has seamlessly integrated its AI agents with Airflow to enable unified AI orchestration, bringing together scalability, robustness, and the intelligence of AI in a single process. This talk will focus on the use cases being served, an architecture overview of the key Airflow capabilities being leveraged, and how Agentic AI has been seamlessly integrated to deliver AI-powered workflows. Weav.ai’s platform is agnostic to any specific cloud or LLM and can orchestrate across those based on the use case.

Unleash Airflow's Potential with hands-on Performance Optimization workshop

by Mike Ellis

This interactive workshop session empowers you to unlock the full potential of Apache Airflow through performance optimization techniques. Gain hands-on experience identifying performance bottlenecks and implementing best practices to overcome them.

Unlocking Event-Driven Scheduling in Airflow 3: A New Era of Reactive Data Pipelines

by Vincent Beck

Airflow 3 introduces a major evolution in orchestration: native support for external event-driven scheduling. In this talk, I’ll share the journey behind AIP-82—why we needed it, how we built it, and what it unlocks. I’ll dive into how the new AssetWatcher enables pipelines to respond immediately to events like file arrivals, API calls, or pub/sub messages. You’ll see how this drastically reduces latency and infrastructure overhead while improving reactivity and resource efficiency.

We’ll explore how it works under the hood, real-world use cases, best practices, and migration tips for teams ready to shift from time-based to event-driven workflows. If you’re looking to make your Airflow DAGs more dynamic, this is the talk that shows you how. Whether you’re an operator or contributor, you’ll walk away with a deep understanding of one of Airflow 3’s most impactful features.

Uses in an on Prem Research Setting

by Lawrence Gerstley

KP Division of Research uses Airflow as a central technology for integrating diverse technologies in an agile setting. We wish to present a set of use cases for AI/ML workloads, including imaging analysis (tissue segmentation, mammography), NLP (early identification of psychosis), LLM processing (identification of vessel diameter from radiological impressions), and other large data processing tasks. We create these “short-lived” project workflows to accomplish specific aims, and may never run a job again, so leveraging generalized patterns is crucial to implementing these jobs quickly.

Our Advanced Computational Infrastructure comprises multiple Kubernetes clusters, and we use Airflow to democratize the use of our batch-level resources in those clusters. We use Airflow form-based parameters to deploy pods running R and Python scripts, where generalized parameters are injected into scripts that follow internal programming patterns. Finally, we also leverage Airflow to create headless services inside Kubernetes for large computational workloads (Spark & H2O) that subsequent pods consume ephemerally.

Using Apache Airflow with Trino for (almost) all your data problems

by Philippe Gagnon

Trino is incredibly effective at enabling users to quickly extract insights from large amounts of data located in dispersed, heterogeneous federated data systems.

However, some business data problems are more complex than interactive analytics use cases, and are best broken down into a sequence of interdependent steps, a.k.a. a workflow. For these use cases, dedicated software is often required in order to schedule and manage these processes with a principled approach.
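
For those workflow-style cases, one common approach (shown here as an assumption-heavy sketch, not the speaker’s setup) is to sequence each Trino SQL step as an Airflow task against a Trino connection:

```python
# Sketch: sequencing Trino SQL steps as Airflow tasks via the common-sql operator.
# Connection id, catalogs/schemas, and SQL statements are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(dag_id="trino_daily_rollup", schedule="@daily", start_date=datetime(2025, 1, 1)) as dag:
    stage = SQLExecuteQueryOperator(
        task_id="stage_raw_events",
        conn_id="trino_default",
        sql="INSERT INTO lake.staging.events SELECT * FROM kafka.raw.events",
    )
    rollup = SQLExecuteQueryOperator(
        task_id="build_daily_rollup",
        conn_id="trino_default",
        sql=(
            "INSERT INTO lake.marts.daily_events "
            "SELECT date(ts) AS day, count(*) AS events FROM lake.staging.events GROUP BY 1"
        ),
    )
    stage >> rollup
```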

When Airflow Meets Yunikorn: Enhancing Airflow with Yunikorn for Higher Efficiency

by Xiaodong Deng & Chaoran Yu

Apache Airflow’s Kubernetes integration enables flexible workload execution on Kubernetes but lacks advanced resource management features such as application queueing, tenant isolation, and gang scheduling. These features are increasingly critical for data engineering as well as AI/ML use cases, particularly GPU utilization optimization. Apache Yunikorn, a Kubernetes-native scheduler, addresses these gaps by offering a high-performance alternative to the default Kubernetes scheduler. In this talk, we’ll demonstrate how to conveniently leverage Yunikorn’s power in Airflow, along with practical use cases and examples.
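
As a rough sketch of how a task’s pod can be routed to a Yunikorn queue (the queue name, namespace, and image are assumptions, and Yunikorn’s admission controller or an explicit schedulerName in a pod template must be in place):

```python
# Sketch: routing a KubernetesPodOperator pod to a Yunikorn queue via pod labels.
# Queue name, namespace, and image are assumptions; Yunikorn must be installed and
# configured (e.g. via its admission controller) to pick up labelled pods.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(dag_id="yunikorn_example", schedule=None, start_date=datetime(2025, 1, 1)) as dag:
    train = KubernetesPodOperator(
        task_id="train_model",
        name="train-model",
        namespace="ml",
        image="python:3.12-slim",
        cmds=["python", "-c", "print('training...')"],
        labels={
            "applicationId": "airflow-train-model",  # groups pods into one Yunikorn application
            "queue": "root.ml.gpu",                  # target Yunikorn queue (assumed)
        },
    )
```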

Why AWS chose Apache Airflow to power workflows for the next generation of Amazon SageMaker

by John Jackson

On March 13th, 2025, Amazon Web Services announced General Availability of Amazon SageMaker Unified Studio, bringing together AWS machine learning and analytics capabilities. At the heart of this next generation of Amazon SageMaker sits Apache Airflow. All SageMaker Unified Studio users have a personal, open-source Airflow deployment, running alongside their Jupyter notebook, enabling those users to easily develop Airflow DAGs that have unified access to all of their data.

In this talk, I will go into details around the motivations for choosing Airflow for this capability, the challenges with incorporating Airflow into such a large and diverse experience, the key role that open-source plays, how we’re leveraging GenAI to make that open source development experience better, and the goals for the future of Airflow in SageMaker Unified Studio.

Why Data Teams Keep Reinventing the Wheel: The Struggle for Code Reuse in the Data Transformation La

by Maxime Beauchemin

Data teams have a bad habit: reinventing the wheel. Despite the explosion of open-source tooling, best practices, and managed services, teams still find themselves building bespoke data platforms from scratch—often hitting the same roadblocks as those before them. Why does this keep happening, and more importantly, how can we break the cycle?

In this talk, we’ll unpack the key reasons data teams default to building rather than adopting, from technical nuances to cultural and organizational dynamics. We’ll discuss why fragmentation in the modern data stack, the pressure to “own” infrastructure, and the allure of in-house solutions make this problem so persistent.

Workshop: Get started with Airflow 3.0

by Kenten Danas

Airflow 3.0 is the most significant release in the project’s history, and brings a better user experience, stronger security, and the ability to run tasks anywhere, at any time. In this workshop, you’ll get hands-on experience with the new release and learn how to leverage new features like DAG versioning, backfills, data assets, and a new React-based UI.

Whether you’re writing traditional ELT/ETL pipelines or complex ML and GenAI workflows, you’ll learn how Airflow 3 will make your day-to-day work smoother and your pipelines even more flexible. This workshop is suitable for intermediate to advanced Airflow users. Beginning users should consider taking the Airflow fundamentals course on the Astronomer Academy before attending this workshop.

Your first Apache Airflow Contribution

by Ryan Hatter, Amogh Desai & Phani Kumar

Ready to contribute to Apache Airflow? In this hands-on workshop, you’ll be expected to come prepared with your development environment already configured (Breeze installed is strongly recommended, but Codespaces works if you can’t install Docker). We’ll dive straight into finding issues that match your skills and walk you through the entire contribution process—from creating your first pull request to receiving community feedback. Whether you’re writing code, enhancing documentation, or offering feedback, there’s a place for you. Let’s get started and see your name among Airflow contributors!

Your privacy or our progress: rethinking telemetry in Airflow

by Bolke de Bruin

We face a paradox: we could use usage data to build better software, but collecting that data seems to contradict the very principles of user freedom that open source represents. Apache Airflow’s telemetry system - already purged - has become a battleground for this conflict, with some users voicing privacy concerns while maintainers struggle to make informed decisions without data. What can we do to strike the right balance?