Check out the full program for Airflow Summit.
If you prefer, you can also view this as a Sessionize layout or as a list of sessions.
Monday, August 31, 2026
This session explores the next phase of Dag versioning in Airflow and the practical questions users face in real deployments. Dag versioning moved Airflow beyond a “latest only” model, but it also introduced confusion around why Dag versions keep increasing, what disabling Dag bundle versioning actually does, what creates a new version, and how users should think about clears, reruns, and backfills after a Dag changes. I will examine a common misconception, that disabling bundle versioning stops Dag version changes, and explain that it does not. I will also connect Dag versioning to Dag delivery in Airflow 3, showing how Git-backed Dag bundles provide a more native alternative to git-sync in Helm-based deployments.
Texas Ballroom 5
Migrating a production Airflow deployment from version 2 to 3 without disrupting hundreds of DAGs across multiple teams sounds scary (and it is). In this talk I will share how we migrated versions without a big-bang cutover, without weeks of cross-team change requests, and without leaving our pipelines in a broken state.
I’ll walk through how we built a compatibility layer to make sure our code runs on both versions during the migration, how we used AI tooling to orchestrate 400+ DAG changes, and how our on-demand ephemeral environments (full k8s deployments created for each pull request) helped us experiment and test all the required changes.
Most important of all, I will share what we learned, where we failed and what we would do better next time.
Texas Ballroom 7
AI coding assistants have transformed software development, moving from ad hoc “vibe coding” to rigorous spec-driven development (SDD). The Airflow ecosystem has fully embraced these advancements, but different use cases demand different SDD approaches. This talk compares ETL and ML pipeline patterns, showing how each leverages Airflow’s unique capabilities differently. I then present SDD strategies along a Spec Stability Spectrum. ETL specs are stable and external — schemas, dbt models — making deterministic, template-driven approaches like DAG Factory and the cosmos-dbt-core skill the right fit. ML specs are volatile and internal, as experiments evolve, so LLM-driven hybrid approaches like the Airflow AI SDK and the airflow-hitl skill are better suited. Both approaches are demonstrated live with Claude Code. Examples draw from my work at TXI Digital generating ETL and ML pipelines for heavy industry clients, with a focus on Rail and anecdotes from Renewable Energy.
Texas Ballroom 6
At Stripe, we process petabytes of data daily across thousands of pipelines powering financial reporting, fraud detection, and merchant analytics. As our data estate grew, so did the complexity of authoring, scheduling, and operating these pipelines. Engineers spent more time wrangling Airflow DAG boilerplate and managing dependencies than writing transformation logic.
To address this, we built a declarative platform that generates Airflow DAGs from YAML and SQL definitions. Authors specify what they want — source tables, SQL transformations, incremental mode, output schema — and the platform handles the rest: generating Airflow tasks, wiring upstream sensors, registering Iceberg tables, and configuring scheduling parameters. A key piece is an in-house dataset-to-task mapping service that resolves upstream dataset dependencies to their producing Airflow tasks. When an author declares an input dataset, the platform automatically looks up which task produces it and generates the appropriate sensor — no manual DAG cross-referencing required. This eliminates an entire class of misconfigured dependency bugs common in hand-wired Airflow deployments.
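The dependency resolution described above can be pictured with a small sketch. This is not Stripe's implementation: the `PRODUCERS` registry, the `SensorSpec` type, and the `resolve_sensors` function are hypothetical names, and a plain dictionary stands in for the in-house dataset-to-task mapping service. It only illustrates the idea of turning declared input datasets into sensor definitions.

```python
from dataclasses import dataclass

# Hypothetical registry: dataset name -> (dag_id, task_id) that produces it.
# In the talk this lookup is an in-house service; a dict stands in for it here.
PRODUCERS = {
    "payments.daily": ("payments_etl", "write_daily"),
    "fx.rates": ("fx_ingest", "load_rates"),
}


@dataclass(frozen=True)
class SensorSpec:
    """Description of an upstream sensor a generator could emit."""
    sensor_id: str
    external_dag_id: str
    external_task_id: str


def resolve_sensors(pipeline: dict) -> list:
    """Turn each declared input dataset into one sensor spec for its producer."""
    specs = []
    for dataset in pipeline.get("inputs", []):
        if dataset not in PRODUCERS:
            # Fail at generation time instead of shipping a mis-wired DAG.
            raise KeyError(f"no known producer for dataset {dataset!r}")
        dag_id, task_id = PRODUCERS[dataset]
        sensor_id = "wait_for__" + dataset.replace(".", "_")
        specs.append(SensorSpec(sensor_id, dag_id, task_id))
    return specs


# A pipeline definition as it might look after parsing an author's YAML file.
pipeline = {"name": "merchant_report", "inputs": ["payments.daily", "fx.rates"]}
sensors = resolve_sensors(pipeline)
```

Raising on an unknown dataset when the DAG is generated is one way to surface a missing dependency before anything is scheduled.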
Texas Ballroom 1-2-3
At Meteosim, Airflow is the engine for our entire decision system. It runs daily weather and air quality forecasts on schedule, but it also enables OnaChem React, software that lets users manage chemical emergencies in real time, and helps us manage consultancy projects.
This talk covers how we set up Airflow 3 to handle five very different types of workloads:
1. Daily Forecasts: Running physics simulations for weather and air quality.
2. Sensor Validation: Ingesting data from thousands of sensors and validating it.
3. Human-in-the-Loop: Managing long-running consultancy projects where Dags pause and wait for expert approval.
4. Emergency Response: Helping users manage chemical emergencies using multiple real-time toxic dispersion simulations with pre-defined workflows through our SaaS platform.
5. Training AI models: Tracking multiple experiments.
We will explain why Airflow 3 was necessary to make this work. You will see how we orchestrate physics, AI, and human decisions in a single environment.
Texas Ballroom 1-2-3
Apache Airflow is often perceived as a platform best suited for large organisations with significant infrastructure budgets and dedicated platform teams. In this talk, I want to share how we built and scaled a robust Airflow platform with tight cost constraints whilst still maintaining reliability, governance and developer productivity.
Starting from a small Airflow setup, we have evolved our architecture to support multiple teams and increasingly complex workflows. This includes standardising environments and making sure best practices are adopted around observability, resource management and version control.
I want to walk through the architectural decisions we made, the trade-offs we managed and the open-source solutions we considered. I also want to outline the concrete steps we took to reduce operational overhead and get our cloud spend under control. I will share some practical examples of how to enforce consistency across DAGs, scale Airflow and build a data platform that grows with the business.
This session is aimed at engineers and platform teams who want to run Airflow efficiently and sustainably even with limited resources and budget constraints.
Texas Ballroom 7
Storage usage is a major driver of infrastructure cost for media collaboration platforms. Understanding how storage grows across accounts, assets, and workflows requires analytics pipelines that combine product data with infrastructure metrics.
In this talk, I’ll share how we built storage analytics pipelines that model storage usage across accounts and plan tiers to help leadership understand infrastructure cost drivers. Using warehouse data models orchestrated with Airflow, we developed pipelines that track storage usage over time, identify discrepancies in legacy storage calculations, and resolve edge cases.
These pipelines enabled deeper analysis of storage growth and informed changes to asset lifecycle policies that significantly reduced cloud storage costs.
What attendees will learn:
- How to design analytics pipelines for infrastructure usage data
- Modeling storage usage across accounts and assets
- Using Airflow to orchestrate infrastructure analytics workflows
- Turning analytics insights into infrastructure cost optimization
If you are migrating from self-hosted Airflow to any of the managed platforms, most migration guides you’ll find online assume one environment, one team, one version. Large organizations are never that simple.
This talk comes from four years of assisting customers with real migrations across some of the biggest Airflow deployments out there, from self-hosted open source to managed cloud platforms like MWAA, GCC, and Astro, and between major version upgrades.
The organizations we worked with had multiple teams, multiple Airflow versions running in parallel, and years of decisions baked into their infrastructure and Dags.
We’ll walk through what a migration actually looks like at that scale. The architecture, design, and planning work that has to happen before anyone touches a Dag, and the organizational coordination needed to make it work.
We’ll cover every topic so that you leave this session ready to confidently start your migration projects. You’ll leave with a framework for scoping a migration, a clearer picture of the work to come, and lessons we learned from doing this across organizations that couldn’t afford to get it wrong.
Texas Ballroom 5
As Airflow becomes mission-critical, centralized data teams often become a bottleneck. This session provides a framework for building a Center of Excellence (CoE) that empowers autonomous domain teams while maintaining global standards.
We detail the shift toward “Data Platform Engineering,” treating orchestration as a product. Using case studies from large-scale organizations, we discuss a three-layer model: Strategic (governance), Tactical (platform development), and Operational (business unit execution).
Attendees will learn to design a self-service platform with guardrails that manages multiple teams without interference. We will explore using Airflow 3.0’s architecture for task isolation and conclude with a guide on aligning cross-functional teams and measuring value through consumption-based billing.
Texas Ballroom 7
What if your Airflow DAG could orchestrate robots, thermal chambers, and silicon tests, not just code?
Silicon validation labs rely on scarce, stateful physical resources: robotic handlers, DUT boards, thermal/power systems, instruments, and shared hardware queues. Teams often coordinate these via spreadsheets and ad hoc reservations, causing contention, idle gaps, conflicts, poor observability, and slow triage.
This talk presents a closed-loop orchestration model where Apache Airflow is the control plane for a software-defined validation lab. A central DAG coordinates robotic handling, thermal/power setup, stress and performance runs, and parametric characterization on hosts connected to silicon. It continuously ingests hardware health, measurements, and test outcomes, then feeds results into AI-assisted analysis to choose the next physical action: refine parameters, schedule follow-up experiments, or trigger mitigation.
Using Edge workers on dedicated lab machines, we replace manual coordination with reliable, auditable orchestration. The same pattern extends beyond silicon to robotics labs, device farms, and other cyber-physical environments.
Texas Ballroom 1-2-3
Debugging Airflow failures in production can be harder than building the pipelines themselves. Engineers often encounter issues such as disappearing DAGs, hanging tasks, missing logs, zombie tasks, or sudden performance degradation, often with little visibility into the root cause.
Over the past year, while supporting multiple Airflow deployments and integrations, we investigated several such incidents across different teams and environments. This session shares lessons from these real debugging cases and explains how the issues were diagnosed and resolved.
We will walk through incidents involving scheduler behaviour, concurrency limits, memory pressure, and process-level failures. For each case, we highlight the symptoms, the investigation approach, and the root cause.
Attendees will learn:
- How to systematically debug complex Airflow failures
- Which components commonly hide the root cause
- Practical signals to watch in logs and metrics
Upgrading Apache Airflow in large, production-grade environments can be complex, especially in enterprise setups with hundreds of DAGs, custom plugins, and mission-critical pipelines. The challenge grows in decentralized setups, where platform teams are responsible for the system’s stability but the DAG code lives across multiple teams they don’t directly control.
You will have the chance to get a personalised review of your current organizational setup, assess testing coverage, and identify concrete ways to improve your upgrade process. This hands-on workshop will provide:
- Environment Health Check & Audits (dependency checks, resource sizing)
- DAG refactoring for deprecated features and optimizations
- Database migrations and backward-compatibility strategies
- Improving CI/CD validation using GenAI to increase reliability
- Self-managed and Astronomer upgrade (with no downtime)
The workshop is supported by a battle-tested approach and guided exercises. It is recommended for platform teams, data engineers, and architects managing production Airflow deployments. By the end of the workshop, participants will have actionable strategies tailored to their specific upgrade challenges.
Hill Country CD
Thanks to AI, your data scientists can build models faster than ever. The new bottleneck? Their attention. When your team maintains a zoo of ML models (dbt/SQL scoring models, Python ML on Kubernetes, and point-and-click product UI models), every new species adds feeding schedules, health checks, and habitat needs. The real question becomes: which animals need the zookeeper right now?
At Pendo, we orchestrate 10+ ML models through Airflow, each with its own dbt Cloud feature prep, Kubernetes scoring pods, and downstream monitoring. This talk covers how we keep the zoo running: DAG dependencies across heterogeneous model types, conditional execution for models that only score on certain schedules, and model-specific sub-pipelines that keep each species healthy. Then we’ll demo DS ModelGuard, an agentic monitoring system we built internally that does the morning rounds, tracking API health, output volume, likelihood drift, and feature-level input drift, so your data scientists know which enclosure to check first.
You’ll leave knowing how to wire up a diverse model zoo in Airflow and how to build attention-routing so your team stops checking every cage and starts prioritizing.
Texas Ballroom 6
Today’s pipeline authoring is synchronous: whether you are writing code or chasing errors, every step blocks the engineer until it is resolved. You can’t step away or parallelize. Airflow Autopilot reimagines this to be AI-native and asynchronous. Describe your pipeline’s intent and the agent takes over, orchestrating two classes of purpose-built tools: tools that generate the DAG code and automate setup, and scorer tools that evaluate it across dimensions such as data discovery, auth, compliance, DAG validation, and even end-to-end execution. Every scorer returns a deterministic result and structured, prioritized hints. The agent runs the generate → verify → refine loop (calling scorers, reading hints, fixing code, re-scoring) until every dimension passes. You come back to a PR with DAGs that have been iteratively built and tested and are ready for review. For 10,000+ Airflow users, this shifts the engineer from executor to reviewer: you own the intent and final judgment, the agent owns the execution. Attendees leave with the architecture for an AI-native authoring experience, the principles behind decomposing work into scorer-sized verification units, and what it takes to scale this in production.
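As a rough illustration of the generate → verify → refine loop, the sketch below wires a stand-in generator to two toy scorers. Everything here is an assumption made for illustration: `Score`, `refine_loop`, `toy_generate`, and the scorer functions are hypothetical names, and the "generator" is a plain function rather than an LLM agent. It is not the Airflow Autopilot implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Score:
    """Deterministic scorer result: pass/fail plus hints for the next attempt."""
    passed: bool
    hints: List[str] = field(default_factory=list)


def refine_loop(
    generate: Callable[[List[str]], str],
    scorers: Dict[str, Callable[[str], Score]],
    max_rounds: int = 5,
) -> Tuple[str, int]:
    """Regenerate until every scorer passes; return the code and rounds used."""
    hints: List[str] = []
    for round_no in range(1, max_rounds + 1):
        code = generate(hints)
        results = [scorer(code) for scorer in scorers.values()]
        if all(r.passed for r in results):
            return code, round_no
        # Feed only the hints from failing scorers into the next attempt.
        hints = [h for r in results if not r.passed for h in r.hints]
    raise RuntimeError(f"scorers still failing after {max_rounds} rounds")


def toy_generate(hints: List[str]) -> str:
    """Stand-in for the agent: applies a fix only when a hint asks for it."""
    code = "dag_id = 'demo'"
    if "add schedule" in hints:
        code += "\nschedule = '@daily'"
    return code


def has_dag_id(code: str) -> Score:
    ok = "dag_id" in code
    return Score(ok, [] if ok else ["add dag_id"])


def has_schedule(code: str) -> Score:
    ok = "schedule" in code
    return Score(ok, [] if ok else ["add schedule"])


final_code, rounds = refine_loop(
    toy_generate, {"dag_id": has_dag_id, "schedule": has_schedule}
)
```

Capping the number of rounds keeps the loop from running indefinitely when a scorer can never be satisfied.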
Texas Ballroom 1-2-3
Airflow testing today is a patchwork: you can validate code and catch obvious breakage early, but many production failures live in the seams—runtime state, persistence, serialization boundaries, API behavior, and the way a real deployment executes work across components. The fast tools are valuable, yet they don’t fully model Airflow as a system. Meanwhile, the default development posture nudges you toward single-process behavior and away from realistic concurrency and state interactions. The result is a familiar trade: quick feedback vs. meaningful confidence. “Airflow in a Box” is a step toward collapsing that trade—making deeper, more production-relevant tests accessible without requiring a full, heavyweight instance for every iteration. In this talk, we’ll discuss methodology, quantify slickness, and share real code!
Texas Ballroom 7
Most Airflow failures are still handled manually — retries, Slack alerts, and late-night debugging. This talk shows how to design Airflow as a self-healing platform that detects problems early, limits blast radius, and automatically recovers. We’ll cover practical patterns for DAG, schema, and dependency-drift detection; safe, selective backfills; predictive failure modeling using metadata; lineage-aware rollbacks; and canary deployment for DAGs. You’ll learn how to isolate unstable workloads before they impact others and how to turn Airflow into an intelligent control plane — not just a scheduler.
Texas Ballroom 5
Graph databases are increasingly used for relationship-heavy data such as fraud detection, knowledge graphs and CRM systems, yet integrating them into orchestration workflows has remained difficult. This session introduces the Apache TinkerPop Provider for Airflow, enabling graph databases to be orchestrated as first-class citizens. I will demonstrate how it works with both self-hosted and managed services such as AWS Neptune and Azure Cosmos DB.
Texas Ballroom 6
The industry treats agents and pipelines as opposing paradigms. We think that framing is wrong. Most agentic problem-solving, when you look at what it actually does, has pipeline structure: gather data, process each dimension independently, synthesize, evaluate. The question is not “agents or pipelines?” but where the LLM fits inside the pipeline and what you gain by making each step explicit.
This talk makes that concrete. We start with AIP-99 and the operator library that gives Airflow first-class LLM support: inference, SQL generation, branching, schema validation, and embedding, all backed by PydanticAI with 20+ model providers out of the box. We walk through a real pipeline that analyzes 5,856 survey responses using four parallel LLM-generated queries, DataFusion execution, and a synthesis step, showing exactly where the LLM reasons and where the pipeline handles everything else.
Then we go deeper. Fault-tolerant agentic systems need more than retry counts. AIP-105 introduces pluggable retry policies that classify failures at the exception level, including an LLM-powered variant that distinguishes a rate limit from an auth error from a transient network blip. LLMSchemaCheckOperator validates upstream data before the LLM ever sees it. DAG Result API lets a pipeline expose a semantic output, turning a DAG into a callable function for downstream agents. These are not theoretical. We demo each one.
We close with what is next: persistent task state for agentic workflows that survive retries (AIP-103), and the path toward dynamic execution graphs that support feedback loops while preserving the auditability that makes pipelines worth building in the first place.
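To make the idea of classifying failures at the exception level concrete, here is a minimal sketch in plain Python. It does not use or reproduce the AIP-105 interface: `Decision`, `RateLimitError`, `AuthError`, and `classify` are hypothetical names chosen for this example, and the mapping from exception type to decision is an assumption.

```python
from enum import Enum


class Decision(Enum):
    RETRY = "retry"
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    FAIL = "fail"


class RateLimitError(Exception):
    """Hypothetical error raised when a model provider throttles requests."""


class AuthError(Exception):
    """Hypothetical error raised when credentials are rejected."""


def classify(exc: BaseException) -> Decision:
    """Map an exception to a retry decision instead of counting attempts."""
    if isinstance(exc, (AuthError, PermissionError)):
        # Retrying cannot fix bad credentials, so fail immediately.
        return Decision.FAIL
    if isinstance(exc, RateLimitError):
        # The provider asked us to slow down, so wait before retrying.
        return Decision.RETRY_WITH_BACKOFF
    if isinstance(exc, (ConnectionError, TimeoutError)):
        # Transient network problems are usually worth a plain retry.
        return Decision.RETRY
    # Unknown failures are treated as non-retryable in this sketch.
    return Decision.FAIL
```

A policy like this treats a rate limit, an auth error, and a network blip differently, which a fixed retry count cannot do.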
Texas Ballroom 1-2-3
Airflow’s callback system has undergone significant architectural changes recently. Originally driven by the introduction of Deadline Alerts, these improvements have far broader implications for how callbacks are defined, where they run, and how reliable they are.
In this talk, I’ll cover the user-facing and provider-facing changes along with a brief look at the significant technical design decisions and internal refactoring behind them, such as a new workload type and unified type-agnostic database model for callbacks. In the long term, this work makes both callbacks and the Dag Processor more robust, and the improved isolation is a key stepping stone toward Airflow’s upcoming multi-team capabilities.
Texas Ballroom 5
How do you monitor Airflow across 50 teams in real-time? How do downstream systems react instantly to pipeline completions without polling APIs? How do you build custom dashboards without overloading Airflow’s database? This talk demonstrates how we use Change Data Capture to stream Airflow’s metadata to Kafka, making orchestration events consumable by any system in real-time. By capturing changes in Airflow’s Postgres database and publishing them to Kafka topics, we enable instant notifications, real-time dashboards, compliance audit trails, and cross-system orchestration without modifying Airflow code or impacting performance. You’ll learn how to set up Debezium CDC for Airflow’s metadata tables, design Kafka topics for task and DAG events, build real-time consumers for monitoring and alerting, handle schema evolution across Airflow upgrades, and implement cost attribution and SLA monitoring in real-time. Using production examples processing millions of events daily, I’ll share architecture decisions, performance optimizations, and lessons from running CDC at scale. You’ll leave with patterns for making Airflow observable to your entire organization.
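To make the consumer side of this concrete, here is a minimal sketch of turning a Debezium change event into a task state transition. It assumes the standard Debezium envelope (`before`, `after`, `op`, `source`) and that the captured table is `task_instance` with `dag_id`, `task_id`, `run_id`, and `state` columns. The function name `extract_state_change` is hypothetical, and this is not the speaker's implementation.

```python
import json
from typing import Optional


def extract_state_change(raw_message: str) -> Optional[dict]:
    """Return a small state-transition record, or None if the event is not one."""
    event = json.loads(raw_message)
    # With schemas enabled the envelope sits under "payload"; otherwise it is top-level.
    payload = event.get("payload", event)

    # "u" marks an update; creates ("c"), deletes ("d"), and snapshot reads ("r") are skipped.
    if payload.get("op") != "u":
        return None
    if payload.get("source", {}).get("table") != "task_instance":
        return None

    # "before" can be null unless the table publishes full row images.
    before = payload.get("before") or {}
    after = payload.get("after") or {}
    if before.get("state") == after.get("state"):
        return None

    return {
        "dag_id": after.get("dag_id"),
        "task_id": after.get("task_id"),
        "run_id": after.get("run_id"),
        "old_state": before.get("state"),
        "new_state": after.get("state"),
    }
```

A Kafka consumer loop would call this on each message value and forward the non-None results to alerting or dashboards.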
Texas Ballroom 6
Performance issues in Apache Airflow rarely appear as clear failures. Instead, they surface as subtle signals: longer task queue times, slower DAG parsing, scheduler lag, or workers hitting limits as workloads grow.
In this talk, we share lessons from profiling real production deployments across Airflow 2.x and 3.x. Combining frontline operational insights with focused technical investigation, we analysed task latency, DAG parsing time, worker behaviour, and metadata database performance under sustained load.
We show how configuration choices such as parallelism, max active runs, and worker resources can amplify or limit version-level improvements. We also discuss performance drift in long-running environments, where accumulated DAG runs expose slow queries or missing indexes that fresh deployments do not reveal.
Finally, we examine how dynamic DAG generation (e.g. with Cosmos dbt DAGs) and custom user code can unintentionally impact parsing and execution performance.
Attendees leave with a practical framework to profile existing deployments, isolate bottlenecks, optimise performance, reduce recurring issues, and approach upgrades with confidence.
Texas Ballroom 7
In this session I will provide a deep dive into a task instance’s lifetime, from the moment the scheduler decides to schedule it until it is marked as success or failed.
We will explore when in the process concepts like concurrency, pools and priority weights apply, what it means for a task to be “queued” and where things like cluster policies, operator links, callbacks and event listeners are evaluated.
The goal is to have a non-technical reference of the inner workings of Airflow applicable to the day-to-day of Dag and Plugin authors.
Texas Ballroom 5
Airflow’s legacy SLA (Service level agreement) feature let users set a maximum expected duration for a DAG run and receive an email when it was exceeded, but it was inflexible and hard to configure. Deadline Alerts replaced it in 3.1 with a general-purpose system for time-based alerting. Since then, two release cycles have reshaped the feature.
Callbacks now run in supervised subprocesses with access to Connections, Variables, and Assets, which means they can query your infrastructure and respond to problems, not just send a notification. Deadline status is visible in the UI Grid view and DAG run overview. Named deadlines let you attach multiple alerts to a single DAG for different stakeholders. OpenLineage captures deadline events. And fixes for duplicate callbacks under HA schedulers and migration performance have made the feature production-solid.
As one of the developers of Deadline Alerts, I’ll walk through these changes and show callback patterns that take advantage of the new execution model. I’ll close with where the feature is going: deadlines that attach to individual tasks and assets. The end goal is for Deadline Alerts to make time constraints something the scheduler understands and acts on, not just something you get alerted about after the fact.
Texas Ballroom 7
A drone doesn’t care what time it is. It takes off when the mission says so, lands when the battery says so, and uploads its logs whenever the LTE link or WiFi finally cooperates. Cron-based pipelines, by contrast, care deeply about the clock — and that mismatch is where most fleet telemetry stacks quietly bleed money, latency, and engineer sanity on empty polls, half-parsed flights, and workers pinned waiting on slow uploads.
This talk is about throwing the schedule away. In Airflow 3, every completed flight becomes a first-class Asset, and downstream DAGs — parsing, enrichment, anomaly detection, perception-model retraining, regulatory reporting — wake only when the flight they care about actually exists. We’ll cover a custom MAVLinkHook and TelemetryIngestOperator for .ulg and .tlog files, Dynamic Task Mapping across concurrent flights, deferrable sensors, Asset producer/consumer chains replacing ExternalTaskSensor tangles, and honest migration lessons from running old and new DAGs side-by-side on a live fleet.
Texas Ballroom 6
Asset partitions are a key building block in Expanded Data Awareness. This session explains the core semantics of partition definitions, partition mappings, and backfill behavior in AIP-76. I will show how these pieces fit together in the current design, then discuss where asset partitions can go next, including improvements in authoring ergonomics, observability, and partition-aware workflow capabilities. Attendees will leave with a clear mental model of today’s implementation and a practical view of future direction.
Texas Ballroom 7
Apache Airflow® has long been the control plane for data pipelines. As AI workflows move into production, teams are discovering the same challenges apply: LLM calls fail, embeddings need regenerating, and agent outputs need human review. The operational discipline that Airflow brings to data pipelines is exactly what AI workflows need too.
Rather than managing data pipelines in Airflow and AI workflows in a separate system, Airflow lets you build both in one observable, reliable control plane. You get scheduling, retries, lineage, versioning, and human-in-the-loop capabilities for your LLM tasks the same way you already have them for your SQL transformations.
In this hands-on workshop, you will build an end-to-end AI pipeline using Airflow’s LLM task decorators, all in your browser, no setup required. The scenario: processing customer reviews for AstroTrips, a fictional interplanetary travel company, with LLM analysis, intelligent routing, vector embeddings, and an AI agent that drafts responses, all with human-in-the-loop approval.
Hill Country CD
Modern pharmacy enterprise systems must process high volumes of complex prescriptions while maintaining strict safety, compliance, and operational efficiency. However, traditional rule-based platforms frequently generate low-specificity alerts that contribute to alert fatigue, workflow bottlenecks, and increased manual intervention. As clinical guidelines, payer requirements, and treatment protocols evolve, static rule engines struggle to keep pace with the dynamic nature of modern pharmacy operations.
This session presents a practical architecture for AI-enabled prescription workflow automation orchestrated through Apache Airflow, enabling scalable, transparent, and auditable clinical workflows. By combining rule-based safety checks with machine learning models for classification, anomaly detection, and intelligent workflow routing, the system significantly improves routing precision, reduces false positives, and accelerates prescription verification.
Texas Ballroom 1-2-3
Modern data platforms rely on real-time pipelines to process and analyze large volumes of streaming events. Apache Airflow is widely used for batch orchestration, but it can also coordinate complex streaming architectures. In this session, we explore how Airflow orchestrates scalable pipelines built with Apache Kafka and Apache Spark running on Kubernetes in cloud environments.
We walk through an architecture where Kafka handles high-throughput event ingestion, Spark processes streaming data for analytics and transformation, and Kubernetes provides scalable infrastructure for distributed workloads. Airflow acts as the orchestration layer, coordinating job scheduling, pipeline dependencies, and operational visibility.
Through practical examples and design patterns, attendees will learn how Airflow integrates with Kubernetes to manage Spark jobs, trigger processing pipelines, and coordinate streaming and batch workloads. We will also discuss monitoring strategies and best practices for operating production-grade streaming pipelines using Airflow, Kafka, Spark, and Kubernetes.
Texas Ballroom 6
Airflow 3 has officially arrived! If you’re considering an upgrade, this session will equip you with essential migration utilities that facilitate a smooth transition from Airflow 2.x. Attendees will learn to use the new CLI command, “airflow config lint”, to analyze configuration files for any removed, deprecated, or renamed elements. This command provides comprehensive feedback and allows for filtering specific sections and options.
During the session, attendees will learn to leverage a set of rigorous Ruff rules - AIR301, AIR302, and AIR303 - crafted to detect migration issues within your codebase automatically. Notably, rule AIR301 flags DAG definitions lacking an explicit schedule argument, a critical update in Airflow 3. Rule AIR302 identifies deprecated functions and removed configuration settings, offering recommended alternatives. Rule AIR303 highlights code that references components now shifted to provider packages, ensuring your integrations are up to date.
Join this session for live demos and practical examples that will empower you to confidently upgrade, minimize downtime, and achieve optimal performance in Airflow 3.
Texas Ballroom 5
Problem Statement: As our data platform scaled, our shared Airflow 2.9 deployment became a bottleneck with critical challenges: development friction from shared repositories, custom security workarounds, release coordination complexity, data isolation concerns, and cost attribution opacity. When Airflow 3.x launched with hybrid execution support, we restructured our architecture. Following a successful proof of value, we implemented remote execution - enabling teams to run workloads in isolated Kubernetes clusters while maintaining centralized orchestration. This session shares our journey, architectural decisions, and how we leveraged agentic AI to streamline migration and developer experience.
Presentation Details: Join us for a practitioner’s guide to transforming Airflow from a shared bottleneck into scalable execution with multiple Kubernetes clusters. The Journey: Why we moved from Airflow 2.9’s monolithic deployment to Airflow 3.x’s remote execution. The Architecture: Astronomer orchestrating workloads across team-owned Azure Kubernetes clusters. The Innovation: How agentic AI automates DAG development, from code generation to deployment.
Texas Ballroom 7
Apache Spark’s new Declarative Pipelines (SDP) let engineers define WHAT their data should look like, not HOW to build it. Apache Airflow 3 brings a declarative orchestration model. Together, they eliminate an entire category of boilerplate: the DAG that exists only to babysit a pipeline. This talk walks through building a production Spark SDP pipeline orchestrated by Airflow 3, showing how dependency graphs replace imperative task chains, how testing and recovery patterns change when your pipeline is declarative end-to-end, and what this means for the 80% of data engineering time currently spent on operational plumbing.
Texas Ballroom 6
In healthcare data, standards are often anything but standard. Every new partner arrives with its own requirements for data exchange spanning FHIR APIs, HL7 feeds, SFTP drops, and custom vendor extracts.
The result? Integration projects that stretch from weeks into months, custom pipelines that only one engineer understands, and implementation teams who are already counting down to your next missed deadline.
This session shows how Airflow can change your approach to managing data transfer for healthcare partners.
We’ll cover how to structure your pipelines with DAGFactory patterns so that you don’t need to treat partner onboarding as a custom engineering effort and where this approach breaks down. We will also cover how you can integrate with tools like OpenMetadata (using Airflow under the hood) to track data assets, so your implementation team knows what is happening without having to ask an engineer.
This session is for data engineers building or maintaining healthcare integration pipelines, and healthcare leaders who don’t want to keep hearing “the next partner integration is a few months away.”
Texas Ballroom 1-2-3
This talk covers migrating a production Airflow platform that orchestrates a large VM fleet: provisioning, OS patching, and decommissioning at high concurrency. This is not a data pipeline; it is infrastructure operations at fleet scale. We’ll share workflow patterns that make fleet-scale orchestration possible in Airflow, then cover how we moved from an Airflow 2 monolith (all components on every node with fixed worker counts) to Airflow 3 with independently scalable services, each with its own release cycle. We’ll dig into a silent breaking change in Airflow 3’s XCom behavior: xcom_pull(key=…) without task_ids no longer searches upstream tasks, returning None with no warning. We’ll present three iterations of solving this, from O(n) DAG traversal to a custom XCom backend that restores Airflow 2 semantics with zero DAG code changes, and the design tradeoffs at each stage. Attendees will learn how Airflow powers infrastructure operations beyond data pipelines, how Airflow 3’s XCom silently breaks Airflow 2 workflows, three approaches to the same migration problem, and lessons from running both versions in parallel.
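To show the general shape of an O(n) upstream traversal for a keyed value, here is a minimal sketch in plain Python. It does not use Airflow and is not the speakers' implementation: the `upstream` adjacency map, the `xcoms` store keyed by `(task_id, key)`, and the function name `find_upstream_xcom` are all hypothetical stand-ins.

```python
from collections import deque
from typing import Any, Dict, List, Optional, Tuple


def find_upstream_xcom(
    task_id: str,
    key: str,
    upstream: Dict[str, List[str]],
    xcoms: Dict[Tuple[str, str], Any],
) -> Optional[Tuple[str, Any]]:
    """Breadth-first search through upstream tasks for the nearest one that pushed `key`."""
    seen = {task_id}
    queue = deque(upstream.get(task_id, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        if (current, key) in xcoms:
            # Return the producing task too, so a caller can pull from it explicitly.
            return current, xcoms[(current, key)]
        queue.extend(upstream.get(current, []))
    return None
```

Each task is visited at most once, so the cost grows linearly with the number of upstream tasks and edges. In a real DAG the task id found this way could then be passed as task_ids to an explicit xcom_pull call.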
Texas Ballroom 5
AI agents break the traditional Airflow trust model. While standard tasks are deterministic, agents execute dynamic logic and invoke external tools, meaning untrusted code is suddenly running inside standard containers sharing your host kernel. This session demonstrates how to secure AI workloads in Airflow without rewriting the orchestrator or building custom executors. We will introduce a custom, policy-driven @agent TaskFlow abstraction that leverages Kubernetes executor_config overrides (like runtimeClassName) to isolate workloads on the fly.
Key Takeaways for Attendees:
- The Threat Model: Why containers are not a strong enough security boundary for AI agents.
- The Implementation: How to build an @agent decorator that routes tasks to sandboxed environments (gVisor, Kata, Peer Pods) while keeping the KubernetesExecutor unchanged.
- Kubernetes in Production: How to achieve a VM-per-pod pattern using open-source tools without requiring nested node virtualization.
- Operational Realities: A candid look at execution flow, pod spec mutation, and the latency/cost trade-offs of runtime isolation.
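The policy-driven routing described above can be sketched as a function that maps a trust level to a pod spec fragment. This is a plain-Python illustration under stated assumptions, not the session's @agent decorator. The trust levels, the `RUNTIME_CLASS_BY_TRUST` mapping, and `pod_spec_override` are hypothetical, and the RuntimeClass names ("gvisor", "kata") depend on what the cluster actually defines.

```python
from typing import Dict, Optional

# Hypothetical policy table: which RuntimeClass each trust level runs under.
RUNTIME_CLASS_BY_TRUST: Dict[str, Optional[str]] = {
    "trusted": None,        # default container runtime
    "untrusted": "gvisor",  # user-space kernel sandbox
    "hostile": "kata",      # lightweight VM per pod
}


def pod_spec_override(trust_level: str) -> Dict[str, dict]:
    """Build a pod manifest fragment that selects a sandboxed runtime."""
    if trust_level not in RUNTIME_CLASS_BY_TRUST:
        raise ValueError(f"unknown trust level: {trust_level!r}")

    # "base" is assumed here as the name of the main task container.
    spec: dict = {"containers": [{"name": "base"}]}
    runtime_class = RUNTIME_CLASS_BY_TRUST[trust_level]
    if runtime_class is not None:
        spec["runtimeClassName"] = runtime_class
    return {"spec": spec}
```

A decorator built on this idea would compute the fragment once per task and hand the equivalent pod override to the task's executor_config, leaving the executor itself unchanged.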
As analytics teams grow, monolithic dbt projects can become tightly coupled and difficult to scale. Cross-domain dependencies multiply, deployment cycles slow down, and ownership boundaries blur.
dbt Mesh proposes a domain-oriented approach with independently owned dbt projects, explicit cross-project contracts, and controlled exposure to dependencies. Applying Mesh principles is not just about splitting repositories; orchestration must also support these boundaries.
In this session, we explore how to design dbt projects according to Mesh principles and how Airflow orchestration can reinforce those architectural decisions. Using multi-project capabilities in Cosmos that leverage dbt Loom-style cross-project referencing, we demonstrate how Airflow can model domain separation while still enabling controlled cross-project dependencies.
We will discuss architectural trade-offs, dependency modelling strategies, and lessons learned while enabling multi-project orchestration. Attendees will leave with practical guidance on moving from monolithic transformation workflows to domain-oriented patterns and understanding what orchestration support is required to make them a success.
Texas Ballroom 1-2-3
Teams running Airflow on Kubernetes know the trade‑off all too well: Kubernetes scales beautifully in production, but makes local development slow, brittle, and unrealistic. Engineers struggle to replicate production environments locally, forcing them into inefficient “test-in-production” cycles that slow delivery velocity, increase deployment risk, and frustrate data teams.
In this talk, we’ll walk through the architectural patterns and platform engineering approach we used to give engineers on‑demand, isolated, production‑like Airflow environments, without sacrificing the benefits of shared Kubernetes infrastructure.
What You’ll Learn:
- An architectural pattern for provisioning on-demand, isolated Airflow environments on shared EKS infrastructure
- Real-world lessons from operating this solution in production: what worked, what didn’t, and what we’d do differently
- Measurable outcomes: how this approach reduced DAG development cycle time and improved engineer satisfaction
If you’re operating Airflow on Kubernetes, or designing internal platforms for data and ML teams, this session offers a concrete, battle‑tested blueprint to improve Airflow delivery for your stakeholders.
Texas Ballroom 7
Processing unstructured data in regulated industries (healthcare, finance, legal) is one of the hardest data engineering challenges: the data is messy, privacy constraints prevent sending it to external APIs, and scale makes manual processing impossible.
In this talk, I’ll walk through how to design and deploy an Apache Airflow–orchestrated LangChain pipeline powered by LLMs to digitize unstructured documents into a unified structured platform.
I’ll cover the full architecture: how Airflow DAGs coordinate multi-step LLM inference, validation, and ingestion stages; how LoRA/PEFT fine-tuning adapted open-source LLMs for domain-specific language without leaking sensitive data; and how failure handling, retries, and data quality checks were built natively into Airflow.
Attendees will leave with a reproducible blueprint applicable to any domain, practical patterns for integrating local LLMs into DAG-driven pipelines, and honest lessons from running this in production at scale.
Texas Ballroom 6
Many teams develop their own “Dag factory” to make Airflow easier to use in their organizations. This can help their users avoid Python and configure Dags more simply. However, the difficulty curve spikes sharply as soon as a Dag requires logic that does not fit within the confines of the Dag factory: you have to abandon the pre-made framework entirely and go back to writing a pure Airflow Dag. I will present a different perspective: instead of producing entire Dags, create pre-made task groups that can be dropped into a Dag to cover common steps, in a manner that maintains a smooth difficulty curve when you want to add custom elements.
Texas Ballroom 6
When Airflow 3 introduced JWT-based task authentication, it also introduced new attack surfaces, such as tokens that can’t be revoked, tasks that lose authentication while waiting in queues, and forked processes that inherit signing keys and can forge tokens for other tasks.
In this talk, I’ll walk through three security challenges at the task execution boundary and the code contributed to fix them:
Token revocation (merged, PR #61339): Airflow 3.x had no way to invalidate issued JWTs, which has implications for common compliance frameworks.
Scope separation (in progress, PR #60108): A two-token mechanism separating long-lived workload tokens from short-lived execution tokens, currently in review with the Airflow core team.
Task identity provenance (in active discussion): I’ll present a proposed defense, server-side JTI allowlisting, that could make forged tokens useless across all execution topologies.
This session is deeply technical and grounded in real contributed code, including the attack vectors that existed before each fix. The audience will leave understanding Airflow 3’s token security model and practical patterns for securing multi-team task execution.
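As a rough illustration of the server-side JTI allowlisting idea mentioned above, here is a minimal sketch. It is not the code from the referenced PRs: the `JtiAllowlist` class, its method names, and the in-memory dictionary are all hypothetical, and signature verification is deliberately left out. It only shows why a token whose JTI the server never issued would be rejected even if its signature were valid.

```python
import time
import uuid


class JtiAllowlist:
    """Hypothetical in-memory allowlist of token IDs (JTIs) kept by the server."""

    def __init__(self):
        self._expiry_by_jti = {}  # jti -> expiry timestamp in seconds

    def issue(self, ttl_seconds, now=None):
        """Register a fresh JTI and return it for embedding in a token's claims."""
        now = time.time() if now is None else now
        jti = uuid.uuid4().hex
        self._expiry_by_jti[jti] = now + ttl_seconds
        return jti

    def revoke(self, jti):
        """Drop a JTI so any token carrying it is rejected from now on."""
        self._expiry_by_jti.pop(jti, None)

    def is_allowed(self, jti, now=None):
        """Accept only JTIs the server itself issued and that have not expired."""
        now = time.time() if now is None else now
        expiry = self._expiry_by_jti.get(jti)
        return expiry is not None and now < expiry


allowlist = JtiAllowlist()
issued = allowlist.issue(ttl_seconds=60, now=1000.0)
forged = uuid.uuid4().hex  # a JTI minted outside the server, e.g. by a forked process
```

In this sketch, revocation and forgery resistance come from the same lookup: a JTI is accepted only while it is present in the server's own record.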
Texas Ballroom 5
What does solving a Rubik’s Cube have to do with Apache Airflow? More than you’d think.
In this talk, I’ll walk through a project where Airflow orchestrates the process of solving a Rubik’s Cube — not as a gimmick, but as a framework for exploring cyclic workflows, state management, and iterative computation in a system designed for DAGs. Cube-solving algorithms naturally require feedback loops, evolving state, and conditional branching — all things that challenge Airflow’s acyclic model.
We’ll explore how to model “cycles” without breaking DAG semantics, manage cube state across tasks, handle convergence and termination conditions, and avoid common anti-patterns. Along the way, I’ll share practical lessons about idempotency, XCom design, task explosion, and when to rethink orchestration boundaries.
If you’ve ever tried to push Airflow beyond straightforward ETL, this session will give you concrete patterns for safely orchestrating iterative, stateful workflows in production.
Texas Ballroom 7
We built a centralized Gateway that sits in front of our entire scheduling fleet and solves three problems no single-cluster Airflow setup ever faces.
Composite Routing — Workflows are bound to clusters via a tag or their workspace
Global Concurrency Control — Each cluster enforces its own Airflow pool locally, unaware of what the other five are running. Shared downstream systems — rate-limited APIs, licensed compute engines — can be overwhelmed even when every individual pool looks healthy. The Gateway acts as a platform-wide slot broker: operators acquire a slot before doing real work. A built-in heartbeat scheduler reconciles stale slots against each cluster’s REST API, handling crashes and OOM kills transparently.
Transparent Version Upgrade — Each cluster carries version tags. During an Airflow upgrade: re-tag routing rules to send new submissions to the high-version cluster; existing runs drain on the old cluster undisturbed. Once drained, upgrade the old cluster and rejoin it. No maintenance window.
Takeaway: a thin routing layer makes your Airflow fleet elastic and upgradable without touching the scheduler or any pipeline code.
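The slot-broker behaviour described above (acquire a slot before doing real work, then reconcile stale slots against what each cluster reports as running) might look roughly like the sketch below. Everything here is an assumption for illustration: the `SlotBroker` class, the task-key format, and the idea of passing in a list of running task keys stand in for whatever the Gateway and the cluster REST APIs actually exchange.

```python
class SlotBroker:
    """Hypothetical platform-wide slot broker shared by several clusters."""

    def __init__(self, limits):
        self._limits = dict(limits)  # resource name -> max concurrent slots
        self._holders = {name: set() for name in self._limits}

    def acquire(self, resource, task_key):
        """Grant a slot if the resource is below its global limit."""
        holders = self._holders[resource]
        if task_key in holders:
            return True  # idempotent: a retrying caller keeps its slot
        if len(holders) >= self._limits[resource]:
            return False
        holders.add(task_key)
        return True

    def release(self, resource, task_key):
        """Return a slot when the task finishes normally."""
        self._holders[resource].discard(task_key)

    def reconcile(self, running_task_keys):
        """Free slots whose holders are no longer reported as running."""
        running = set(running_task_keys)
        freed = []
        for resource, holders in self._holders.items():
            for key in sorted(holders - running):
                holders.discard(key)
                freed.append((resource, key))
        return freed


broker = SlotBroker({"rate_limited_api": 2})
```

A periodic job would call `reconcile` with the task keys collected from every cluster, so a slot held by a crashed or OOM-killed task is returned without the task having to release it.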
Texas Ballroom 1-2-3
Airflow 3 has been out for a year. In this keynote, we take stock of where the community stands, what we built together, and where we are headed.
We open with the data: adoption trends, community growth, and honest feedback from teams running Airflow 3 in production. What is working, what surprised us, and what the survey tells us about how the ecosystem is evolving.
The second section covers the year in Airflow. Provider discovery and distribution have been modernized. Airflow gained first-class support for AI and LLM workloads. And scheduling became more powerful, letting pipelines respond to data at a finer granularity.
We close with what is coming. Pipelines will soon persist state across retries, making long-running workloads more reliable. Resumable operators will eliminate the restart-from-scratch failure mode. And multi-language support will open Airflow beyond the Python ecosystem.
Twenty-five minutes. A lot of ground to cover.
Texas Ballroom 1-2-3
As Airflow deployments scale and the number of Dag authors increases, the question arises: how do we support many teams with different needs and requirements on a shared platform? Over the years we’ve observed many organizations building their own multi-tenant layers on top of Apache Airflow to solve this problem, and we’re now adding native support for this type of deployment. This talk explores building multi-team support in Airflow, working backwards from those real deployment challenges and community pain points we’ve observed.
Rather than strict multi-tenancy with complete isolation, we designed a flexible multi-team architecture allowing isolated task execution, UI experience, and connections/variables/secrets management, to name a few. This allows multiple teams of engineers to operate within a single Airflow environment, sharing the scheduler, API server, and database.
We’ll cover some of the technical details of the new architecture, how what we’ve built differs from multi-tenancy, and demo how to use multi-team.
Texas Ballroom 1-2-3
There’s a class of workload that doesn’t belong in your streaming stack. A team needs to react to data arriving in S3 or a message landing in Kafka. The SLA is minutes. Someone reaches for Flink because the orchestrator can’t trigger on events. Six months later, you’re running a streaming app for what is a bounded computation with a latency requirement.
This talk names that pattern, the “messy middle,” and argues that Airflow 3 eliminates the gap that pushed these workloads to streaming. Asset Watchers monitor external sources through async triggers, firing DAGs within minutes of event arrival. Assets turn data products into scheduling primitives. Partitions let Airflow reason about which slices of a dataset are ready.
Airflow is still a batch orchestrator. It won’t replace Flink for sub-minute guarantees, stateful processing, or windowed aggregations. I’ll be direct about the boundaries and present a framework for when streaming is actually the right call.
Attendees will leave with a precise definition of the messy middle, event-driven orchestration patterns for Airflow 3, and a way to evaluate orchestration vs. streaming on operational cost, not just latency.
Texas Ballroom 5
Many data teams can build machine learning models, but operationalizing them reliably remains a challenge.
At Astronomer, our data team recently moved from exploratory modeling to running multiple production ML models powering go-to-market analytics and workflows. Rather than introducing heavy MLOps infrastructure, we integrated the full ML lifecycle directly into our Airflow-based data platform.
In this talk, we’ll share how we use Airflow to orchestrate production ML end-to-end: from feature pipelines in Snowflake, to model training and artifact promotion, to batch scoring and prediction delivery.
We’ll also show how reusable Airflow task groups and a declarative DAG framework allowed us to standardize model deployment, monitoring, and retraining across models.
This approach enabled us to go from zero to several production ML models in just a few months while keeping the system simple, scalable, and fully integrated with our existing data workflows.
Attendees will leave with practical patterns for operationalizing ML using Airflow without building a complex MLOps stack.
Texas Ballroom 6
What if your pipeline could tell the difference between recoverable errors and real bugs and handle both without waking anyone up? At OnsiteIQ, we process millions of construction site images monthly through Airflow with mixed AWS Batch spot and on-demand tasks. We need to handle corrupt data, spot evictions, and real bugs. Before Airflow, every failure looked the same: something broke, an engineer investigated, and the same transient infrastructure issues kept masking real bugs underneath. In a 3-month solo migration, I built custom Airflow operators that automatically classified and handled every failure via Airflow’s callbacks. Actual code bugs surface through clean, noise-free alerts directly to actionable tickets. Every genuine bug got caught exactly once and permanently fixed. Engineering oversight dropped from 20% to zero within months. This talk covers the error classification architecture, automatic fallback patterns, and the framework for turning Airflow into a self-healing system.
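One way to picture the error classification described above is a small function that maps each exception to a handling decision, which a failure callback could then act on. This is only a sketch under stated assumptions: the exception classes, the `classify` function, and the action names are hypothetical and are not taken from the talk.

```python
class SpotEvictionError(Exception):
    """Hypothetical: the spot instance running the task was reclaimed."""


class CorruptInputError(Exception):
    """Hypothetical: the input data could not be decoded."""


def classify(exc):
    """Map an exception to a handling decision (illustrative names only)."""
    if isinstance(exc, SpotEvictionError):
        return "retry_on_demand"  # transient infrastructure issue
    if isinstance(exc, CorruptInputError):
        return "skip_and_record"  # bad data, not a code defect
    return "open_ticket"  # anything unrecognised is treated as a real bug
```

Defaulting unknown exceptions to `open_ticket` keeps genuine bugs visible instead of letting them be retried away.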
Texas Ballroom 7
At Stripe, we process petabytes of data daily across thousands of pipelines powering financial reporting, fraud detection, and merchant analytics. As our data estate grew, so did the complexity of authoring, scheduling, and operating these pipelines. Engineers spent more time wrangling Airflow DAG boilerplate and managing dependencies than writing transformation logic.
To address this, we built a declarative platform that generates Airflow DAGs from YAML and SQL definitions. Authors specify what they want — source tables, SQL transformations, incremental mode, output schema — and the platform handles the rest: generating Airflow tasks, wiring upstream sensors, registering Iceberg tables, and configuring scheduling parameters. A key piece is an in-house dataset-to-task mapping service that resolves upstream dataset dependencies to their producing Airflow tasks. When an author declares an input dataset, the platform automatically looks up which task produces it and generates the appropriate sensor — no manual DAG cross-referencing required. This eliminates an entire class of misconfigured dependency bugs common in hand-wired Airflow deployments.
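The dataset-to-task lookup described above can be sketched as a registry from dataset names to producing tasks, from which sensor definitions are derived. The registry contents, the `Producer` type, the `sensors_for` function, and the shape of the sensor spec are all hypothetical; the actual in-house service and its output format are not described in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Producer:
    dag_id: str
    task_id: str


# Hypothetical registry: dataset name -> the task that produces it.
PRODUCERS = {
    "payments.daily": Producer("payments_etl", "write_daily"),
    "fraud.scores": Producer("fraud_model", "publish_scores"),
}


def sensors_for(inputs, producers=PRODUCERS):
    """Return one sensor spec per declared input dataset."""
    specs = []
    for dataset in inputs:
        producer = producers.get(dataset)
        if producer is None:
            # Fail at generation time rather than at run time.
            raise KeyError(f"no producing task registered for dataset {dataset!r}")
        specs.append({
            "sensor_id": f"wait_for__{dataset.replace('.', '_')}",
            "external_dag_id": producer.dag_id,
            "external_task_id": producer.task_id,
        })
    return specs
```

Raising on an unknown dataset while generating the Dag is one way such a platform could surface a misconfigured dependency before anything is scheduled.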
Texas Ballroom 1-2-3
This session explores the next phase of Dag versioning in Airflow and the practical questions users face in real deployments. Dag versioning moved Airflow beyond a “latest only” model, but it also introduced confusion around why Dag versions keep increasing, what disabling Dag bundle versioning actually does, what creates a new version, and how users should think about clears, reruns, and backfills after a Dag changes. I will examine a common misconception: disabling bundle versioning does not stop Dag version changes. I will also connect Dag versioning to Dag delivery in Airflow 3, showing how Git-backed Dag bundles provide a more native alternative to git-sync in Helm-based deployments.
Texas Ballroom 5
AI coding assistants have transformed software development, moving from ad hoc “vibe coding” to rigorous spec-driven development (SDD). The Airflow ecosystem has fully embraced these advancements, but different use cases demand different SDD approaches. This talk compares ETL and ML pipeline patterns, showing how each leverages Airflow’s unique capabilities differently. I then present SDD strategies along a Spec Stability Spectrum. ETL specs are stable and external — schemas, dbt models — making deterministic, template-driven approaches like DAG Factory and the cosmos-dbt-core skill the right fit. ML specs are volatile and internal, as experiments evolve, so LLM-driven hybrid approaches like the Airflow AI SDK and the airflow-hitl skill are better suited. Both approaches are demonstrated live with Claude Code. Examples draw from my work at TXI Digital generating ETL and ML pipelines for heavy industry clients, with a focus on Rail and anecdotes from Renewable Energy.
Texas Ballroom 6
Migrating a production Airflow deployment from version 2 to 3 without disrupting hundreds of DAGs across multiple teams sounds scary (and it is). In this talk I will share how we migrated versions without a big-bang cutover, without weeks of cross-team change requests, and without leaving our pipelines in a broken state.
I’ll walk through how we built a compatibility layer to make sure our code runs on both versions during the migration, how we used AI tooling to orchestrate 400+ DAG changes, and how our on-demand ephemeral environments (full k8s deployments created for each pull request) helped us experiment and test all the required changes.
Most important of all, I will share what we learned, where we failed and what we would do better next time.
Texas Ballroom 7
At Meteosim, Airflow is the engine for our entire decision system. It runs daily weather and air quality forecasts on schedule, but it also enables OnaChem React, software that lets users manage chemical emergencies in real time, and helps us manage consultancy projects.
This talk covers how we set up Airflow 3 to handle five very different types of workloads:
1. Daily Forecasts: Running physics simulations for weather and air quality.
2. Sensor Validation: Ingest data from thousands of sensors and validate it.
3. Human-in-the-Loop: Managing long-running consultancy projects where Dags pause and wait for expert approval.
4. Emergency Response: Helping users manage chemical emergencies using multiple real-time toxic dispersion simulations with pre-defined workflows through our SaaS platform.
5. Training AI models: Tracking multiple experiments.
We will explain why Airflow 3 was necessary to make this work. You will see how we orchestrate physics, AI, and human decisions in a single environment.
Texas Ballroom 1-2-3
If you are migrating from self-hosted Airflow to any of the managed platforms, most migration guides you’ll find online assume one environment, one team, one version. Large organizations are never that simple.
This talk comes from four years of assisting customers with real migrations across some of the biggest Airflow deployments out there, from self-hosted open source to managed cloud platforms like MWAA, GCC, and Astro, and between major version upgrades.
The organizations we worked with had multiple teams, multiple Airflow versions running in parallel, and years of decisions baked into their infrastructure and Dags.
We’ll walk through what a migration actually looks like at that scale. The architecture, design, and planning work that has to happen before anyone touches a Dag, and the organizational coordination needed to make it work.
We’ll cover every topic so that you leave this session ready to confidently start your migration projects. You’ll leave with a framework for scoping a migration, a clearer picture of the work to come, and lessons we learned from doing this across organizations that couldn’t afford to get it wrong.
Texas Ballroom 5
Storage usage is a major driver of infrastructure cost for media collaboration platforms. Understanding how storage grows across accounts, assets, and workflows requires analytics pipelines that combine product data with infrastructure metrics.
In this talk, I’ll share how we built storage analytics pipelines that model storage usage across accounts and plan tiers to help leadership understand infrastructure cost drivers. Using warehouse data models orchestrated with Airflow, we developed pipelines that track storage usage over time, identify discrepancies in legacy storage calculations, and resolve edge cases.
These pipelines enabled deeper analysis of storage growth and informed changes to asset lifecycle policies that significantly reduced cloud storage costs.
What attendees will learn:
- How to design analytics pipelines for infrastructure usage data
- Modeling storage usage across accounts and assets
- Using Airflow to orchestrate infrastructure analytics workflows
- Turning analytics insights into infrastructure cost optimization
Apache Airflow is often perceived as a platform best suited for large organisations with significant infrastructure budgets and dedicated platform teams. In this talk, I want to share how we built and scaled a robust Airflow platform with tight cost constraints whilst still maintaining reliability, governance and developer productivity.
Starting from a small Airflow setup, we have evolved our architecture to support multiple teams and increasingly complex workflows. This includes standardising environments and making sure best practices are adopted around observability, resource management and version control.
I want to walk through the architectural decisions we made, the trade-offs we managed and open-source solutions we considered. I also want to outline the concrete steps we took to reduce operational overhead and get a hold of our cloud spend. I will share some practical examples of how to enforce consistency across DAGs, scale Airflow and build a data platform that grows with the business.
This session is aimed at engineers and platform teams who want to run Airflow efficiently and sustainably even with limited resources and budget constraints.
Texas Ballroom 7
What if your Airflow DAG could orchestrate robots, thermal chambers, and silicon tests, not just code?
Silicon validation labs rely on scarce, stateful physical resources: robotic handlers, DUT boards, thermal/power systems, instruments, and shared hardware queues. Teams often coordinate these via spreadsheets and ad hoc reservations, causing contention, idle gaps, conflicts, poor observability, and slow triage.
This talk presents a closed-loop orchestration model where Apache Airflow is the control plane for a software-defined validation lab. A central DAG coordinates robotic handling, thermal/power setup, stress and performance runs, and parametric characterization on hosts connected to silicon. It continuously ingests hardware health, measurements, and test outcomes, then feeds results into AI-assisted analysis to choose the next physical action: refine parameters, schedule follow-up experiments, or trigger mitigation.
Using Edge workers on dedicated lab machines, we replace manual coordination with reliable, auditable orchestration. The same pattern extends beyond silicon to robotics labs, device farms, and other cyber-physical environments.
Texas Ballroom 1-2-3
Debugging Airflow failures in production can be harder than building the pipelines themselves. Engineers often encounter issues such as disappearing DAGs, hanging tasks, missing logs, zombie tasks, or sudden performance degradation, often with little visibility into the root cause.
Over the past year, while supporting multiple Airflow deployments and integrations, we investigated several such incidents across different teams and environments. This session shares lessons from these real debugging cases and explains how the issues were diagnosed and resolved.
We will walk through incidents involving scheduler behaviour, concurrency limits, memory pressure, and process-level failures. For each case, we highlight the symptoms, the investigation approach, and the root cause.
Attendees will learn:
- How to systematically debug complex Airflow failures
- Which components commonly hide the root cause
- Practical signals to watch in logs and metrics
Thanks to AI, your data scientists can build models faster than ever. The new bottleneck? Their attention. When your team maintains a zoo of ML models (dbt/SQL scoring models, Python ML on Kubernetes, and point-and-click product UI models), every new species adds feeding schedules, health checks, and habitat needs. The real question becomes: which animals need the zookeeper right now?
At Pendo, we orchestrate 10+ ML models through Airflow, each with its own dbt Cloud feature prep, Kubernetes scoring pods, and downstream monitoring. This talk covers how we keep the zoo running: DAG dependencies across heterogeneous model types, conditional execution for models that only score on certain schedules, and model-specific sub-pipelines that keep each species healthy. Then we’ll demo DS ModelGuard, an agentic monitoring system we built internally that does the morning rounds, tracking API health, output volume, likelihood drift, and feature-level input drift, so your data scientists know which enclosure to check first.
You’ll leave knowing how to wire up a diverse model zoo in Airflow and how to build attention-routing so your team stops checking every cage and starts prioritizing.
Texas Ballroom 6
As Airflow becomes mission-critical, centralized data teams often become a bottleneck. This session provides a framework for building a Center of Excellence (CoE) that empowers autonomous domain teams while maintaining global standards.
We detail the shift toward “Data Platform Engineering,” treating orchestration as a product. Using case studies from large-scale organizations, we discuss a three-layer model: Strategic (governance), Tactical (platform development), and Operational (business unit execution).
Attendees will learn to design a self-service platform with guardrails that manages multiple teams without interference. We will explore using Airflow 3.0’s architecture for task isolation and conclude with a guide on aligning cross-functional teams and measuring value through consumption-based billing.
Texas Ballroom 7
Upgrading Apache Airflow in large, production-grade environments can be complex, especially in enterprise setups with hundreds of DAGs, custom plugins, and mission-critical pipelines. The challenge grows in decentralized setups, where platform teams are responsible for the system’s stability but the DAG code lives across multiple teams you don’t directly control.
You will have the chance to get a personalised review of your current organizational setup, assess testing coverage, and identify concrete ways to improve your upgrade process. This hands-on workshop will provide:
- Environment Health Check & Audits (dependency checks, resource sizing)
- DAG refactoring for deprecated features and optimizations
- Database migrations and backward-compatibility strategies
- Improving CI/CD validation using GenAI to increase reliability
- Self-managed and Astronomer upgrade (with no downtime)
The workshop is supported by a battle-tested approach and guided exercises. It is recommended for platform teams, data engineers, and architects managing production Airflow deployments. By the end of the workshop, participants will have actionable strategies tailored to their specific upgrade challenges.
Hill Country CD
Today’s pipeline authoring is synchronous: writing code, chasing errors, with every step blocking the engineer until it is resolved. You can’t step away or parallelize. Airflow Autopilot reimagines this to be AI-native and asynchronous. Describe your pipeline’s intent. The agent takes over, orchestrating two classes of purpose-built tools: tools that generate the DAG code and automate setup, and scorer tools that evaluate it across dimensions such as data discovery, auth, compliance, DAG validation, and even end-to-end execution. Every scorer returns a deterministic result and structured, prioritized hints. The agent runs the generate → verify → refine loop (calling scorers, reading hints, fixing code, re-scoring) until every dimension passes. You come back to a PR with DAGs that have been iteratively built and tested and are ready for review. For 10,000+ Airflow users, this shifts the engineer from executor to reviewer: you own the intent and final judgment, the agent owns the execution. Attendees leave with the architecture for an AI-native authoring experience, the principles behind decomposing work into scorer-sized verification units, and what it takes to scale this in production.
Texas Ballroom 1-2-3
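As a rough illustration of the generate → verify → refine loop described in this abstract, the following is a minimal sketch and not the speakers' implementation. Every name in it (`ScoreResult`, `refine_loop`, `has_schedule`, `fake_generate`) is hypothetical, and the generator is a stub that stands in for an LLM-backed tool.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ScoreResult:
    """Outcome of one scorer: pass/fail plus hints ordered by priority."""
    dimension: str
    passed: bool
    hints: list[str] = field(default_factory=list)


# A scorer is any deterministic function from DAG source code to a ScoreResult.
Scorer = Callable[[str], ScoreResult]


def refine_loop(
    generate: Callable[[str, list[str]], str],
    scorers: list[Scorer],
    intent: str,
    max_rounds: int = 5,
) -> tuple[str, list[ScoreResult]]:
    """Regenerate code until every scorer passes or max_rounds is reached."""
    hints: list[str] = []
    code = ""
    results: list[ScoreResult] = []
    for _ in range(max_rounds):
        code = generate(intent, hints)
        results = [scorer(code) for scorer in scorers]
        failed = [r for r in results if not r.passed]
        if not failed:
            break
        # Feed the hints from failing dimensions into the next generation.
        hints = [hint for r in failed for hint in r.hints]
    return code, results


# Hypothetical scorer: checks only that the generated code sets a schedule.
def has_schedule(code: str) -> ScoreResult:
    ok = "schedule=" in code
    return ScoreResult("dag_validation", ok, [] if ok else ["add a schedule argument"])


# Hypothetical generator stub: a real agent would call an LLM here.
def fake_generate(intent: str, hints: list[str]) -> str:
    schedule = ', schedule="@daily"' if "add a schedule argument" in hints else ""
    return f'DAG(dag_id="{intent}"{schedule})'


final_code, final_results = refine_loop(fake_generate, [has_schedule], "daily_sales")
print(final_code)
```

In this toy run the first round fails the single scorer, its hint is passed back to the generator, and the second round passes. The `max_rounds` cap is an added assumption so the loop always terminates.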
Most Airflow failures are still handled manually — retries, Slack alerts, and late-night debugging. This talk shows how to design Airflow as a self-healing platform that detects problems early, limits blast radius, and automatically recovers. We’ll cover practical patterns for DAG, schema, and dependency-drift detection; safe, selective backfills; predictive failure modeling using metadata; lineage-aware rollbacks; and canary deployment for DAGs. You’ll learn how to isolate unstable workloads before they impact others and how to turn Airflow into an intelligent control plane — not just a scheduler.
Texas Ballroom 5
Graph databases are increasingly used for relationship-heavy data such as fraud detection, knowledge graphs and CRM systems, yet integrating them into orchestration workflows has remained difficult. This session introduces the Apache TinkerPop Provider for Airflow, enabling graph databases to be orchestrated as first-class citizens. I will demonstrate how it works with both self-hosted and managed services such as AWS Neptune and Azure Cosmos DB.
Texas Ballroom 6
Airflow testing today is a patchwork: you can validate code and catch obvious breakage early, but many production failures live in the seams—runtime state, persistence, serialization boundaries, API behavior, and the way a real deployment executes work across components. The fast tools are valuable, yet they don’t fully model Airflow as a system. Meanwhile, the default development posture nudges you toward single-process behavior and away from realistic concurrency and state interactions. The result is a familiar trade: quick feedback vs. meaningful confidence. “Airflow in a Box” is a step toward collapsing that trade—making deeper, more production-relevant tests accessible without requiring a full, heavyweight instance for every iteration. In this talk, we’ll discuss methodology, quantify slickness, and share real code!
Texas Ballroom 7
The industry treats agents and pipelines as opposing paradigms. We think that framing is wrong. Most agentic problem-solving, when you look at what it actually does, has pipeline structure: gather data, process each dimension independently, synthesize, evaluate. The question is not “agents or pipelines?” but where the LLM fits inside the pipeline and what you gain by making each step explicit.
This talk makes that concrete. We start with AIP-99 and the operator library that gives Airflow first-class LLM support: inference, SQL generation, branching, schema validation, and embedding, all backed by PydanticAI with 20+ model providers out of the box. We walk through a real pipeline that analyzes 5,856 survey responses using four parallel LLM-generated queries, DataFusion execution, and a synthesis step, showing exactly where the LLM reasons and where the pipeline handles everything else.
Then we go deeper. Fault-tolerant agentic systems need more than retry counts. AIP-105 introduces pluggable retry policies that classify failures at the exception level, including an LLM-powered variant that distinguishes a rate limit from an auth error from a transient network blip. LLMSchemaCheckOperator validates upstream data before the LLM ever sees it. DAG Result API lets a pipeline expose a semantic output, turning a DAG into a callable function for downstream agents. These are not theoretical. We demo each one.
We close with what is next: persistent task state for agentic workflows that survive retries (AIP-103), and the path toward dynamic execution graphs that support feedback loops while preserving the auditability that makes pipelines worth building in the first place.
Texas Ballroom 1-2-3
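The idea of a retry policy that classifies failures at the exception level, rather than counting retries, can be sketched in plain Python. This is not the AIP-105 interface, which the abstract does not spell out. The exception classes, `classify`, and `run_with_policy` are hypothetical names chosen for illustration.

```python
from enum import Enum


class RetryDecision(Enum):
    RETRY = "retry"
    FAIL = "fail"


class RateLimitError(Exception):
    """Hypothetical: the model provider asked us to slow down."""


class AuthError(Exception):
    """Hypothetical: credentials were rejected."""


def classify(exc: Exception) -> RetryDecision:
    """Decide per exception type whether another attempt can help."""
    if isinstance(exc, (RateLimitError, ConnectionError, TimeoutError)):
        return RetryDecision.RETRY  # transient: waiting may fix it
    if isinstance(exc, AuthError):
        return RetryDecision.FAIL  # permanent until credentials are fixed
    return RetryDecision.FAIL  # unknown errors fail fast in this sketch


def run_with_policy(fn, max_attempts=3):
    """Call fn, retrying only when classify() says the failure is transient."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if classify(exc) is RetryDecision.FAIL or attempt == max_attempts:
                raise


calls = {"n": 0}


def flaky():
    # Fails twice with a rate limit, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429")
    return "ok"


print(run_with_policy(flaky))
```

A rate limit is retried until the attempt budget runs out, while an auth error is re-raised on the first attempt. A production policy would also add backoff between attempts, which this sketch leaves out.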
Airflow’s callback system has undergone significant architectural changes recently. Originally driven by the introduction of Deadline Alerts, these improvements have far broader implications for how callbacks are defined, where they run, and how reliable they are.
In this talk, I’ll cover the user-facing and provider-facing changes along with a brief look at the significant technical design decisions and internal refactoring behind them, such as a new workload type and unified type-agnostic database model for callbacks. In the long term, this work makes both callbacks and the Dag Processor more robust, and the improved isolation is a key stepping stone toward Airflow’s upcoming multi-team capabilities.
Texas Ballroom 5
How do you monitor Airflow across 50 teams in real-time? How do downstream systems react instantly to pipeline completions without polling APIs? How do you build custom dashboards without overloading Airflow’s database? This talk demonstrates how we use Change Data Capture to stream Airflow’s metadata to Kafka, making orchestration events consumable by any system in real-time. By capturing changes in Airflow’s Postgres database and publishing them to Kafka topics, we enable instant notifications, real-time dashboards, compliance audit trails, and cross-system orchestration without modifying Airflow code or impacting performance. You’ll learn how to set up Debezium CDC for Airflow’s metadata tables, design Kafka topics for task and DAG events, build real-time consumers for monitoring and alerting, handle schema evolution across Airflow upgrades, and implement cost attribution and SLA monitoring in real-time. Using production examples processing millions of events daily, I’ll share architecture decisions, performance optimizations, and lessons from running CDC at scale. You’ll leave with patterns for making Airflow observable to your entire organization.
Texas Ballroom 6
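To make the consumer side of this pattern concrete, here is a minimal sketch of turning one Debezium change event into a task state-change record. It assumes the standard Debezium envelope (`op`, `before`, `after`) and that the captured table exposes `dag_id`, `task_id`, `run_id`, and `state` columns. The function name and the sample event are hypothetical, and the Kafka consumer loop is omitted. Depending on the table's replica identity in Postgres, `before` may be missing or partial, in which case `old_state` comes back as `None`.

```python
import json
from typing import Optional


def task_state_change(raw_event: str) -> Optional[dict]:
    """Return a compact event if a Debezium message changed a task's state."""
    message = json.loads(raw_event)
    # With converter schemas enabled, the envelope sits under "payload".
    payload = message.get("payload", message)
    if payload.get("op") != "u":  # "u" marks an update in Debezium
        return None
    before = payload.get("before") or {}
    after = payload.get("after") or {}
    if before.get("state") == after.get("state"):
        return None  # some other column changed, not the state
    return {
        "dag_id": after.get("dag_id"),
        "task_id": after.get("task_id"),
        "run_id": after.get("run_id"),
        "old_state": before.get("state"),
        "new_state": after.get("state"),
    }


# Hypothetical sample event, hand-written for illustration.
sample = json.dumps({
    "op": "u",
    "before": {"dag_id": "etl", "task_id": "load", "run_id": "r1", "state": "running"},
    "after": {"dag_id": "etl", "task_id": "load", "run_id": "r1", "state": "success"},
})
print(task_state_change(sample))
```

Filtering down to state transitions at the consumer keeps downstream notifications small. Inserts, deletes, and updates that leave `state` unchanged are ignored here.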
Performance issues in Apache Airflow rarely appear as clear failures. Instead, they surface as subtle signals: longer task queue times, slower DAG parsing, scheduler lag, or workers hitting limits as workloads grow.
In this talk, we share lessons from profiling real production deployments across Airflow 2.x and 3.x. Combining frontline operational insights with focused technical investigation, we analysed task latency, DAG parsing time, worker behaviour, and metadata database performance under sustained load.
We show how configuration choices such as parallelism, max active runs, and worker resources can amplify or limit version-level improvements. We also discuss performance drift in long-running environments, where accumulated DAG runs expose slow queries or missing indexes that fresh deployments do not reveal.
Finally, we examine how dynamic DAG generation (e.g. with Cosmos dbt DAGs) and custom user code can unintentionally impact parsing and execution performance.
Attendees leave with a practical framework to profile existing deployments, isolate bottlenecks, optimise performance, reduce recurring issues, and approach upgrades with confidence.
Texas Ballroom 7
In this session I will provide a deep dive into a task instance’s lifetime, from when the scheduler decides to schedule it until it is marked as success or failed.
We will explore when in the process concepts like concurrency, pools and priority weights apply, what it means for a task to be “queued” and where things like cluster policies, operator links, callbacks and event listeners are evaluated.
The goal is to have a non-technical reference of the inner workings of Airflow applicable to the day-to-day of Dag and Plugin authors.
Texas Ballroom 5
A drone doesn’t care what time it is. It takes off when the mission says so, lands when the battery says so, and uploads its logs whenever the LTE link or WiFi finally cooperates. Cron-based pipelines, by contrast, care deeply about the clock — and that mismatch is where most fleet telemetry stacks quietly bleed money, latency, and engineer sanity on empty polls, half-parsed flights, and workers pinned waiting on slow uploads.
This talk is about throwing the schedule away. In Airflow 3, every completed flight becomes a first-class Asset, and downstream DAGs — parsing, enrichment, anomaly detection, perception-model retraining, regulatory reporting — wake only when the flight they care about actually exists. We’ll cover a custom MAVLinkHook and TelemetryIngestOperator for .ulg and .tlog files, Dynamic Task Mapping across concurrent flights, deferrable sensors, Asset producer/consumer chains replacing ExternalTaskSensor tangles, and honest migration lessons from running old and new DAGs side-by-side on a live fleet.
Texas Ballroom 6
Airflow’s legacy SLA (Service level agreement) feature let users set a maximum expected duration for a DAG run and receive an email when it was exceeded, but it was inflexible and hard to configure. Deadline Alerts replaced it in 3.1 with a general-purpose system for time-based alerting. Since then, two release cycles have reshaped the feature.
Callbacks now run in supervised subprocesses with access to Connections, Variables, and Assets, which means they can query your infrastructure and respond to problems, not just send a notification. Deadline status is visible in the UI Grid view and DAG run overview. Named deadlines let you attach multiple alerts to a single DAG for different stakeholders. OpenLineage captures deadline events. And fixes for duplicate callbacks under HA schedulers and migration performance have made the feature production-solid.
As one of the developers of Deadline Alerts, I’ll walk through these changes and show callback patterns that take advantage of the new execution model. I’ll close with where the feature is going: deadlines that attach to individual tasks and assets. The end goal is for Deadline Alerts to make time constraints something the scheduler understands and acts on, not just something you get alerted about after the fact.
Texas Ballroom 7
Modern pharmacy enterprise systems must process high volumes of complex prescriptions while maintaining strict safety, compliance, and operational efficiency. However, traditional rule-based platforms frequently generate low-specificity alerts that contribute to alert fatigue, workflow bottlenecks, and increased manual intervention. As clinical guidelines, payer requirements, and treatment protocols evolve, static rule engines struggle to keep pace with the dynamic nature of modern pharmacy operations.
This session presents a practical architecture for AI-enabled prescription workflow automation orchestrated through Apache Airflow, enabling scalable, transparent, and auditable clinical workflows. By combining rule-based safety checks with machine learning models for classification, anomaly detection, and intelligent workflow routing, the system significantly improves routing precision, reduces false positives, and accelerates prescription verification.
Texas Ballroom 1-2-3
Airflow 3 has officially arrived! If you’re considering an upgrade, this session will equip you with essential migration utilities that facilitate a smooth transition from Airflow 2.x. Attendees will learn to use the new CLI command, “airflow config lint”, to analyze configuration files for any removed, deprecated, or renamed elements. This command provides comprehensive feedback and allows for filtering specific sections and options.
During the session, attendees will learn to leverage a set of rigorous Ruff rules (AIR301, AIR302, and AIR303) crafted to detect migration issues within a codebase automatically. Notably, rule AIR301 flags DAG definitions lacking an explicit schedule argument, a critical update in Airflow 3. Rule AIR302 identifies deprecated functions and removed configuration settings, offering recommended alternatives. Rule AIR303 highlights code that references components now shifted to provider packages, ensuring your integrations are up to date.
Join this session for live demos and practical examples that will empower you to confidently upgrade, minimize downtime, and achieve optimal performance in Airflow 3.
Texas Ballroom 5
Modern data platforms rely on real-time pipelines to process and analyze large volumes of streaming events. Apache Airflow is widely used for batch orchestration, but it can also coordinate complex streaming architectures. In this session, we explore how Airflow orchestrates scalable pipelines built with Apache Kafka and Apache Spark running on Kubernetes in cloud environments.
We walk through an architecture where Kafka handles high-throughput event ingestion, Spark processes streaming data for analytics and transformation, and Kubernetes provides scalable infrastructure for distributed workloads. Airflow acts as the orchestration layer, coordinating job scheduling, pipeline dependencies, and operational visibility.
Through practical examples and design patterns, attendees will learn how Airflow integrates with Kubernetes to manage Spark jobs, trigger processing pipelines, and coordinate streaming and batch workloads. We will also discuss monitoring strategies and best practices for operating production-grade streaming pipelines using Airflow, Kafka, Spark, and Kubernetes.
Texas Ballroom 6
Asset partitions are a key building block in Expanded Data Awareness. This session explains the core semantics of partition definitions, partition mappings, and backfill behavior in AIP-76. I will show how these pieces fit together in the current design, then discuss where asset partitions can go next, including improvements in authoring ergonomics, observability, and partition-aware workflow capabilities. Attendees will leave with a clear mental model of today’s implementation and a practical view of future direction.
Texas Ballroom 7
Apache Airflow® has long been the control plane for data pipelines. As AI workflows move into production, teams are discovering the same challenges apply: LLM calls fail, embeddings need regenerating, and agent outputs need human review. The operational discipline that Airflow brings to data pipelines is exactly what AI workflows need too.
Rather than managing data pipelines in Airflow and AI workflows in a separate system, Airflow lets you build both in one observable, reliable control plane. You get scheduling, retries, lineage, versioning, and human-in-the-loop capabilities for your LLM tasks the same way you already have them for your SQL transformations.
In this hands-on workshop, you will build an end-to-end AI pipeline using Airflow’s LLM task decorators, all in your browser, no setup required. The scenario: processing customer reviews for AstroTrips, a fictional interplanetary travel company, with LLM analysis, intelligent routing, vector embeddings, and an AI agent that drafts responses, all with human-in-the-loop approval.
Hill Country CD
In healthcare data, standards are often anything but standard. Every new partner arrives with its own requirements for data exchange spanning FHIR APIs, HL7 feeds, SFTP drops, and custom vendor extracts.
The result? Integration projects that stretch from weeks into months, custom pipelines that only one engineer understands, and implementation teams who are already counting down to your next missed deadline.
This session shows how Airflow can change your approach to managing data transfer for healthcare partners.
We’ll cover how to structure your pipelines with DAGFactory patterns so that partner onboarding does not have to be a custom engineering effort, and where this approach breaks down. We will also cover how you can integrate with tools like OpenMetadata (which uses Airflow under the hood) to track data assets, so your implementation team knows what is happening without having to ask an engineer.
This session is for data engineers building or maintaining healthcare integration pipelines, and healthcare leaders who don’t want to keep hearing “the next partner integration is a few months away.”
Texas Ballroom 1-2-3
This talk covers migrating a production Airflow platform that orchestrates a large VM fleet: provisioning, OS patching, and decommissioning at high concurrency. This is not a data pipeline; it is infrastructure operations at fleet scale. We’ll share workflow patterns that make fleet-scale orchestration possible in Airflow, then cover how we moved from an Airflow 2 monolith (all components on every node with fixed worker counts) to Airflow 3 with independently scalable services, each with its own release cycle. We’ll dig into a silent breaking change in Airflow 3’s XCom behavior: xcom_pull(key=…) without task_ids no longer searches upstream tasks, returning None with no warning. We’ll present three iterations of solving this, from O(n) DAG traversal to a custom XCom backend that restores Airflow 2 semantics with zero DAG code changes, and the design tradeoffs at each stage. Attendees will learn how Airflow powers infrastructure operations beyond data pipelines, how Airflow 3’s XCom silently breaks Airflow 2 workflows, three approaches to the same migration problem, and lessons from running both versions in parallel.
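As a rough illustration of the first iteration mentioned above (an O(n) traversal over upstream tasks), here is a toy model in plain Python. It does not use Airflow's API: the `store` dict, the `upstream` adjacency map, and the function name `pull_from_upstream` are all hypothetical stand-ins, and the speakers' actual implementation may differ.

```python
from collections import deque


def pull_from_upstream(store, upstream, current_task, key):
    """Breadth-first search over upstream tasks for the nearest one that pushed `key`.

    store:    {(task_id, key): value}, a stand-in for the XCom table of one run
    upstream: {task_id: [direct upstream task_ids]}, a stand-in for the DAG structure
    """
    seen = {current_task}
    queue = deque(upstream.get(current_task, ()))
    while queue:
        task = queue.popleft()
        if task in seen:
            continue
        seen.add(task)
        if (task, key) in store:
            return store[(task, key)]
        queue.extend(upstream.get(task, ()))
    return None  # no upstream task pushed this key


# Task "a" pushed vm_id; "c" is two hops downstream of it.
upstream = {"c": ["b"], "b": ["a"]}
store = {("a", "vm_id"): "vm-42"}
value = pull_from_upstream(store, upstream, "c", "vm_id")
```

A lookup keyed only on the calling task, `store.get(("c", "vm_id"))`, finds nothing here, which is the silent `None` the abstract describes; the traversal recovers the value at the cost of visiting each upstream task once.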
Texas Ballroom 5
Apache Spark’s new Declarative Pipelines (SDP) let engineers define WHAT their data should look like, not HOW to build it. Apache Airflow 3 brings a declarative orchestration model. Together, they eliminate an entire category of boilerplate: the DAG that exists only to babysit a pipeline. This talk walks through building a production Spark SDP pipeline orchestrated by Airflow 3, showing how dependency graphs replace imperative task chains, how testing and recovery patterns change when your pipeline is declarative end-to-end, and what this means for the 80% of data engineering time currently spent on operational plumbing.
Texas Ballroom 6
Problem Statement: As our data platform scaled, our shared Airflow 2.9 deployment became a bottleneck with critical challenges: development friction from shared repositories, custom security workarounds, release coordination complexity, data isolation concerns, and cost attribution opacity. When Airflow 3.x launched with hybrid execution support, we restructured our architecture. Following a successful proof of value, we implemented remote execution - enabling teams to run workloads in isolated Kubernetes clusters while maintaining centralized orchestration. This session shares our journey, architectural decisions, and how we leveraged agentic AI to streamline migration and developer experience.
Presentation Details: Join us for a practitioner’s guide to transforming Airflow from a shared bottleneck into scalable execution with multiple Kubernetes clusters. The Journey: Why we moved from Airflow 2.9’s monolithic deployment to Airflow 3.x’s remote execution. The Architecture: Astronomer orchestrating workloads across team-owned Azure Kubernetes clusters. The Innovation: How agentic AI automates DAG development, from code generation to deployment.
Texas Ballroom 7
As analytics teams grow, monolithic dbt projects can become tightly coupled and difficult to scale. Cross-domain dependencies multiply, deployment cycles slow down, and ownership boundaries blur.
dbt Mesh proposes a domain-oriented approach with independently owned dbt projects, explicit cross-project contracts, and controlled exposure to dependencies. Applying Mesh principles is not just about splitting repositories; orchestration must also support these boundaries.
In this session, we explore how to design dbt projects according to Mesh principles and how Airflow orchestration can reinforce those architectural decisions. Using multi-project capabilities in Cosmos that leverage dbt Loom-style cross-project referencing, we demonstrate how Airflow can model domain separation while still enabling controlled cross-project dependencies.
We will discuss architectural trade-offs, dependency modelling strategies, and lessons learned while enabling multi-project orchestration. Attendees will leave with practical guidance on moving from monolithic transformation workflows to domain-oriented patterns and understanding what orchestration support is required to make them a success.
Texas Ballroom 1-2-3
AI agents break the traditional Airflow trust model. While standard tasks are deterministic, agents execute dynamic logic and invoke external tools, meaning untrusted code is suddenly running inside standard containers sharing your host kernel. This session demonstrates how to secure AI workloads in Airflow without rewriting the orchestrator or building custom executors. We will introduce a custom, policy-driven @agent TaskFlow abstraction that leverages Kubernetes executor_config overrides (like runtimeClassName) to isolate workloads on the fly.
Key Takeaways for Attendees:
- The Threat Model: Why containers are not a strong enough security boundary for AI agents.
- The Implementation: How to build an @agent decorator that routes tasks to sandboxed environments (gVisor, Kata, Peer Pods) while keeping the KubernetesExecutor unchanged.
- Kubernetes in Production: How to achieve a VM-per-pod pattern using open-source tools without requiring nested node virtualization.
- Operational Realities: A candid look at execution flow, pod spec mutation, and the latency/cost trade-offs of runtime isolation.
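To make the routing idea concrete, here is a minimal sketch of a policy-driven decorator that picks a Kubernetes RuntimeClass per task. It is plain Python with no Airflow or Kubernetes imports. The names `agent`, `pod_override_for`, and `RUNTIME_CLASS_BY_TRUST`, as well as the RuntimeClass names, are assumptions for illustration and not the speakers' implementation. In a real deployment the override would be expressed as a `kubernetes.client.models.V1Pod` under the `pod_override` key of `executor_config`, not as a dict.

```python
# Hypothetical policy table: trust level -> Kubernetes RuntimeClass name.
# The RuntimeClass names are assumptions and must match what the cluster defines.
RUNTIME_CLASS_BY_TRUST = {
    "trusted": None,        # default runtime, shares the host kernel
    "sandboxed": "gvisor",  # user-space kernel
    "isolated": "kata",     # lightweight VM per pod
}


def pod_override_for(trust_level):
    """Return a manifest-style pod fragment for the given trust level."""
    if trust_level not in RUNTIME_CLASS_BY_TRUST:
        raise ValueError(f"unknown trust level: {trust_level!r}")
    runtime_class = RUNTIME_CLASS_BY_TRUST[trust_level]
    # "base" is the name the KubernetesExecutor uses for the main task container.
    spec = {"containers": [{"name": "base"}]}
    if runtime_class is not None:
        spec["runtimeClassName"] = runtime_class
    return {"spec": spec}


def agent(trust_level="sandboxed"):
    """Hypothetical decorator: attach the isolation policy to a callable."""
    def wrap(func):
        func.executor_config = {"pod_override": pod_override_for(trust_level)}
        return func
    return wrap


@agent(trust_level="isolated")
def summarize_ticket(text):
    return text[:20]
```

Keeping the policy in one table means the executor itself stays untouched: only the per-task pod spec changes, which matches the "no custom executor" constraint in the abstract.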
Processing unstructured data in regulated industries (healthcare, finance, legal) is one of the hardest data engineering challenges: the data is messy, privacy constraints prevent sending it to external APIs, and scale makes manual processing impossible.
In this talk, I’ll walk through how to design and deploy an Apache Airflow–orchestrated LangChain pipeline powered by LLMs to digitize unstructured documents into a unified structured platform.
I’ll cover the full architecture: how Airflow DAGs coordinate multi-step LLM inference, validation, and ingestion stages; how LoRA/PEFT fine-tuning adapted open-source LLMs for domain-specific language without leaking sensitive data; and how failure handling, retries, and data quality checks were built natively into Airflow.
Attendees will leave with a reproducible blueprint applicable to any domain, practical patterns for integrating local LLMs into DAG-driven pipelines, and honest lessons from running this in production at scale.
Texas Ballroom 6
Teams running Airflow on Kubernetes know the trade‑off all too well: Kubernetes scales beautifully in production, but makes local development slow, brittle, and unrealistic. Engineers struggle to replicate production environments locally, forcing them into inefficient “test-in-production” cycles that slow delivery velocity, increase deployment risk, and frustrate data teams.
In this talk, we’ll walk through the architectural patterns and platform engineering approach we used to give engineers on‑demand, isolated, production‑like Airflow environments, without sacrificing the benefits of shared Kubernetes infrastructure.
What You’ll Learn:
- An architectural pattern for provisioning on-demand, isolated Airflow environments on shared EKS infrastructure
- Real-world lessons from operating this solution in production: what worked, what didn’t, and what we’d do differently
- Measurable outcomes: how this approach reduced DAG development cycle time and improved engineer satisfaction
If you’re operating Airflow on Kubernetes, or designing internal platforms for data and ML teams, this session offers a concrete, battle-tested blueprint to improve Airflow delivery for your stakeholders.
Texas Ballroom 7
We built a centralized Gateway that sits in front of our entire scheduling fleet and solves three problems no single-cluster Airflow setup ever faces.
Composite Routing: Workflows are bound to clusters via a tag or their workspace.
Global Concurrency Control: Each cluster enforces its own Airflow pool locally, unaware of what the other five are running. Shared downstream systems (rate-limited APIs, licensed compute engines) can be overwhelmed even when every individual pool looks healthy. The Gateway acts as a platform-wide slot broker: operators acquire a slot before doing real work. A built-in heartbeat scheduler reconciles stale slots against each cluster’s REST API, handling crashes and OOM kills transparently.
Transparent Version Upgrade: Each cluster carries version tags. During an Airflow upgrade, re-tag routing rules to send new submissions to the high-version cluster; existing runs drain on the old cluster undisturbed. Once drained, upgrade the old cluster and rejoin it. No maintenance window.
Takeaway: a thin routing layer makes your Airflow fleet elastic and upgradable without touching the scheduler or any pipeline code.
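The slot-broker idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the class `SlotBroker` and its methods are hypothetical, and stale slots are detected here by lease expiry on missed heartbeats, not by reconciling against each cluster's REST API as the abstract describes.

```python
import time


class SlotBroker:
    """Toy platform-wide slot broker with lease-based stale-slot cleanup."""

    def __init__(self, limits, lease_seconds=60.0, clock=time.monotonic):
        self._limits = dict(limits)  # resource name -> max concurrent slots
        self._lease = lease_seconds
        self._clock = clock
        self._held = {}              # (resource, holder) -> last heartbeat time

    def _in_use(self, resource):
        return sum(1 for (res, _) in self._held if res == resource)

    def acquire(self, resource, holder):
        """Grant a slot if the resource is below its limit; idempotent per holder."""
        key = (resource, holder)
        if key in self._held:
            self._held[key] = self._clock()
            return True
        if self._in_use(resource) >= self._limits[resource]:
            return False
        self._held[key] = self._clock()
        return True

    def heartbeat(self, resource, holder):
        """Refresh the lease of a slot that is still held."""
        if (resource, holder) in self._held:
            self._held[(resource, holder)] = self._clock()

    def release(self, resource, holder):
        self._held.pop((resource, holder), None)

    def reap_stale(self):
        """Free slots whose holders stopped heartbeating (crashed, OOM-killed)."""
        now = self._clock()
        stale = [k for k, seen in self._held.items() if now - seen > self._lease]
        for key in stale:
            del self._held[key]
        return stale
```

In use, an operator on any cluster would call `acquire("licensed-engine", run_id)` before doing real work and `release(...)` afterwards, while a periodic job calls `reap_stale()` so that a crashed task cannot hold a slot forever.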
Texas Ballroom 1-2-3
When Airflow 3 introduced JWT-based task authentication, it also introduced new attack surfaces, such as tokens that can’t be revoked, tasks that lose authentication while waiting in queues, and forked processes that inherit signing keys and can forge tokens for other tasks.
In this talk, I’ll walk through three security challenges at the task execution boundary and the code contributed to fix them:
Token revocation (merged, PR #61339): Airflow 3.x had no way to invalidate issued JWTs, with implications for common compliance frameworks.
Scope separation (in progress, PR #60108): A two-token mechanism separating long-lived workload tokens from short-lived execution tokens, currently in review with the Airflow core team.
Task identity provenance (in active discussion): I’ll present a proposed defense, server-side JTI allowlisting, that could make forged tokens useless across all execution topologies.
This session is deeply technical and grounded in real contributed code, including what attack vectors existed before each fix. The audience will leave understanding Airflow 3’s token security model and practical patterns for securing multi-team task execution.
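The allowlisting idea can be illustrated with a simplified, standard-library-only sketch. It is not Airflow's token code and not a real JWT implementation: `TokenServer`, `forge`, and the token layout (base64 payload plus an HMAC hex digest) are assumptions chosen to show why a valid signature alone is not enough once the server also tracks issued token IDs (JTIs).

```python
import base64
import hashlib
import hmac
import json
import secrets


def _encode(signing_key, claims):
    """Serialize claims and append an HMAC-SHA256 signature (simplified token)."""
    payload = json.dumps(claims, sort_keys=True).encode()
    signature = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + signature


def forge(signing_key, task_id):
    """Simulate a process that inherited the signing key and mints its own token."""
    return _encode(signing_key, {"sub": task_id, "jti": secrets.token_hex(16)})


class TokenServer:
    def __init__(self, signing_key):
        self._key = signing_key
        self._allowed_jtis = set()  # server-side record of every issued token ID

    def issue(self, task_id):
        jti = secrets.token_hex(16)
        self._allowed_jtis.add(jti)
        return _encode(self._key, {"sub": task_id, "jti": jti})

    def _claims(self, token):
        """Return the claims if the signature is valid, else None."""
        try:
            body, signature = token.split(".")
            payload = base64.urlsafe_b64decode(body.encode())
        except ValueError:
            return None
        expected = hmac.new(self._key, payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(signature.encode(), expected.encode()):
            return None
        return json.loads(payload)

    def revoke(self, token):
        claims = self._claims(token)
        if claims is not None:
            self._allowed_jtis.discard(claims["jti"])

    def verify(self, token):
        """Accept only tokens that are correctly signed AND were issued by this server."""
        claims = self._claims(token)
        if claims is None or claims["jti"] not in self._allowed_jtis:
            return None
        return claims["sub"]
```

With this check in place, `server.verify(forge(key, "other_task"))` returns `None` even though the signature is valid, and removing a JTI from the allowlist doubles as revocation.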
Texas Ballroom 5
What does solving a Rubik’s Cube have to do with Apache Airflow? More than you’d think.
In this talk, I’ll walk through a project where Airflow orchestrates the process of solving a Rubik’s Cube — not as a gimmick, but as a framework for exploring cyclic workflows, state management, and iterative computation in a system designed for DAGs. Cube-solving algorithms naturally require feedback loops, evolving state, and conditional branching — all things that challenge Airflow’s acyclic model.
We’ll explore how to model “cycles” without breaking DAG semantics, manage cube state across tasks, handle convergence and termination conditions, and avoid common anti-patterns. Along the way, I’ll share practical lessons about idempotency, XCom design, task explosion, and when to rethink orchestration boundaries.
If you’ve ever tried to push Airflow beyond straightforward ETL, this session will give you concrete patterns for safely orchestrating iterative, stateful workflows in production.
Texas Ballroom 7
Many teams develop their own “Dag factory” to make Airflow easier to use in their organizations. This can help their users avoid Python and configure Dags in a simpler manner. However, the difficulty curve spikes sharply when a Dag requires logic that does not fit within the confines of the Dag factory. To create such a Dag, you have to abandon the pre-made framework entirely and go back to writing a pure Airflow Dag. I will present a different perspective: instead of producing entire Dags, create pre-made task groups that can be dropped into a Dag to cover common steps, in a manner that maintains a smooth difficulty curve when you want to add custom elements.
Texas Ballroom 6
Airflow 3 has been out for a year. In this keynote, we take stock of where the community stands, what we built together, and where we are headed.
We open with the data: adoption trends, community growth, and honest feedback from teams running Airflow 3 in production. What is working, what surprised us, and what the survey tells us about how the ecosystem is evolving.
The second section covers the year in Airflow. Provider discovery and distribution have been modernized. Airflow gained first-class support for AI and LLM workloads. And scheduling became more powerful, letting pipelines respond to data at a finer granularity.
We close with what is coming. Pipelines will soon persist state across retries, making long-running workloads more reliable. Resumable operators will eliminate the restart-from-scratch failure mode. And multi-language support will open Airflow beyond the Python ecosystem.
Twenty-five minutes. A lot of ground to cover.
Texas Ballroom 1-2-3
Tuesday, September 1, 2026
At Equifax, Apache Airflow is used across many departments, helping Data Engineers, Data Scientists, and Business Analysts in their daily work.
This presentation is about how to use modern orchestration technology at the heart of data processing and business processes to support daily company operations.
Texas Ballroom 1-2-3
This talk is the story of getting a PR merged into Apache Airflow without writing a single line of code, using Apache Airflow itself as an agentic orchestration harness to replicate the functionality of Claude Code for any pluggable LLM.
We’ll walk through how Airflow’s AIP-99 Dag functionality maps naturally onto the tool-use loops, context management, and decision branching that power modern agentic coding workflows. The result is a model-agnostic harness that can read a codebase, reason about changes, write and test code, and deploy a commit to a git repository, all orchestrated as an Airflow Dag.
The proof of concept is a merged PR into Apache Airflow itself!
Attendees will leave with a mental model for thinking about Airflow as an agentic harness, practical patterns for wrapping LLM tool-use in Dag tasks, and a healthy appreciation for what happens when you point an autonomous coding agent at its own codebase.
Texas Ballroom 6
In many modern data platforms, orchestration tools are combined with transformation frameworks. A common pattern is orchestrating dbt (data build tool) transformations using Apache Airflow — something reported by roughly 44% of the community.
At first glance, the integration seems straightforward: simply run dbt run inside an Airflow task. Some teams go further and use libraries that convert dbt projects into native Airflow DAGs, such as Astronomer Cosmos.
In practice, however, teams quickly run into operational and architectural challenges. Slowness, out-of-memory errors, zombie tasks, and DAGs that take minutes to appear in the UI are just a few of the issues that can emerge as projects scale.
In this talk, I’ll walk through the most common problems I’ve encountered while supporting organisations running dbt with Airflow, explain why they occur, and share practical strategies to avoid or resolve them. The goal is to help teams build more reliable and scalable pipelines when combining Airflow orchestration with dbt transformations.
Texas Ballroom 1-2-3
As data platforms mature, organizations often experience “Airflow Sprawl”—the rapid, organic growth of isolated Airflow instances across different teams and projects. While this empowers localized control, it creates dangerous silos that hinder visibility, increase operational risk, and erode developer productivity. In this session, we will explore the critical challenges of managing a fragmented Airflow ecosystem and discuss strategies for regaining control. We will examine why centralizing execution history and establishing unified observability is essential for reducing Mean Time to Recovery (MTTR), mitigating hidden security risks, and transforming fragmented instances into a cohesive, reliable data service. Attendees will leave with a strategic framework for managing Airflow at scale.
Texas Ballroom 5
Modern data platforms generate overwhelming amounts of operational data across distributed systems. For teams running Apache Airflow at scale, incidents often mean high mean time to resolution (MTTR), constant context switching between observability tools, and a growing on-call burden.
What if your Airflow environment had an always-on, autonomous on-call engineer?
In this workshop, we’ll explore how an AI-powered DevOps agent can supercharge Airflow operations — from automated DAG failure diagnosis and intelligent log analysis to proactive prevention of recurring incidents. Whether you’re running Airflow on a managed cloud service or self-hosted, the patterns and practices covered here apply broadly to modern data pipeline operations.
Key topics covered:
- Autonomous incident detection & response — topology-aware analysis across DAG dependencies with actionable mitigation plans
- Real-time root cause analysis — correlating logs, metrics, and traces across your Airflow application stack
- Collaboration integrations — automatic messaging, ticketing, and incident management workflows
Join us to transform your Airflow DevOps experience.
Hill Country CD
Apache Airflow has long been the go-to orchestration platform for data engineering teams, but managing the underlying infrastructure remains a persistent challenge. Amazon MWAA Serverless eliminates that burden entirely — no environment sizing, no capacity planning, and no idle costs. In this hands-on workshop, attendees will get a practical introduction to MWAA Serverless and walk away having built and run a real end-to-end ML pipeline on AWS.
In this workshop, we’ll use an agent equipped with MWAA serverless knowledge and Airflow DAG tooling to build a pipeline that takes raw training data from S3, kicks off a SageMaker training job, evaluates the output using Claude on Bedrock, and deploys if evaluation passes. You’ll watch the agent reason about the task, generate a valid YAML workflow using the DAG factory pattern, and deploy it to MWAA Serverless — no hand-written code required. Once deployed, we trigger a run and provide full observability: task-level logs from the SageMaker job, the Bedrock evaluation output, and the deploy decision. If something fails, we show how to identify the broken step, read its isolated logs, and iterate — either asking the agent to fix the YAML or rolling back to a prior version. The goal is the full loop: prompt, deploy, run, observe, debug, iterate.
By the end of the session, attendees will have hands-on experience with the full MWAA Serverless workflow lifecycle. Whether you’re new to MWAA Serverless or looking to accelerate pipeline development with AI tooling, you’ll leave with a repeatable pattern you can apply immediately.
Hill Country AB
Generic AI coding assistants like Cursor and Claude Code are powerful, but they struggle with proprietary infrastructures. At Wix, managing 7,500 active DAGs across 120 Data Engineers, we found that standard AI tools lacked the context to be truly effective: they didn’t know our custom operators, DWH modeling patterns, or strict governance rules. In this session, we introduce our internal “Agentic IDE Configuration Manager” that bridges this gap. We will demonstrate how we leverage MCPs to inject deep Airflow context into our AI agents. You will learn how we enabled our coding agents to:
- Generate compliant code by utilizing custom Cursor rules to ensure every DAG meets production standards and naming conventions.
- Interact with Airflow by using our custom MCPs to run DAGs locally, parse error logs, and autonomously fix pipeline failures.
- Understand data by accessing our Data Catalog and Trino engine to validate schema logic in real-time.
Whether you are trying to optimize your team’s workflows or simply curious how far coding agents can go in the current age, join us in this exciting talk.
Texas Ballroom 1-2-3
During this session, I’ll deep dive into the implementation of an AI-powered endurance sports coach using Apache Airflow as the backbone for data ingestion and processing. Beyond data pipelines, I’ll explain what’s required to build a conversational AI system, from structured data modeling to orchestration and retrieval. We’ll explore how metrics are precomputed, how vector search enables contextual memory, and which front-end patterns work best for interacting with AI agents. The result is a reproducible architecture where Airflow powers the data layer and an LLM provides the reasoning on top to help athletes perform at their best in numbers-driven endurance sports.
Texas Ballroom 6
As Apache Airflow expands beyond batch into real-time, event-driven architectures, data teams face a new set of challenges: duplicated DAG patterns, fragile Kafka-triggered workflows, and debugging cycles that happen too late—often in production.
In this session, we introduce a shift-left approach to pipeline reliability for environments combining Airflow with streaming platforms like Confluent. We’ll explore how event-driven pipelines increase complexity—and why traditional debugging and validation approaches no longer scale. You’ll see how IBM Bob, an AI-powered assistant for data engineers, brings real-time code review, refactoring guidance, and debugging insights directly into developer workflows.
From catching DAG anti-patterns early to improving consistency across batch and streaming pipelines, we’ll demonstrate how teams can prevent issues before they reach production.
We’ll also share practical patterns to:
- Improve Airflow code quality across distributed teams
- Standardize DAG development for batch and streaming use cases
- Reduce MTTD (Mean Time to Detect) and MTTR (Mean Time to Resolution)
- Automate DAG tracking across your enterprise through lineage graphs
- Minimize technical debt as pipeline complexity grows
We’ll close with a preview of our hands-on workshop, where attendees can apply these concepts in a live lab—using AI to debug, optimize, and standardize Airflow pipelines in real-world scenarios.
Texas Ballroom 1-2-3
Airflow’s evolution toward a client-server architecture faced a fundamental challenge: splitting a monolithic codebase into independent distributions (airflow-core, task-sdk, providers) without triggering dependency hell. Traditional PyPI packaging and code duplication both fail at Airflow’s scale.
Airflow 3.2 solves this through modular isolation and shared libraries using in-repository symlinks. This approach ensures each distribution ships with the exact version of shared code it requires, eliminating runtime version conflicts and allowing for independent dependency management. We have already migrated 10+ critical components—including the config parser, observability, and secrets masking—into this shared model.
This architecture unlocks:
- Zero Version Conflicts: Mix and match airflow-core and task-sdk versions seamlessly.
- Streamlined Maintenance: Automatic security fixes across all components.
- AIP-72 Vision: Lightweight workers with API-first communication, removing the need for direct database access.
Join us to explore how shared libraries transform Airflow’s monorepo into a modular, scalable, and truly distributed orchestration platform!
Texas Ballroom 1-2-3
Apache Airflow is the de facto orchestrator for modern data platforms, while Apache Spark powers large-scale data processing. But when the two meet in production, teams quickly face architectural decisions that affect reliability, performance, and cloud cost.
In this talk we explore key design questions when orchestrating Spark with Airflow:
- Should you run a shared Spark cluster, a cluster per DAG run, or clusters per task?
- When should Spark workloads run in parallel vs sequentially within a workflow?
- How can teams benchmark pipeline performance in terms of both runtime and cost?
- How do emerging features like Spark Declarative Pipelines change how Spark integrates with orchestration systems?
Using real production scenarios, we’ll examine the tradeoffs between orchestration strategies and show how Spark observability with DataFlint OSS helps analyze execution plans, task behavior, and runtime metrics.
We’ll also demonstrate how propagating Airflow metadata into Spark jobs allows teams to attribute infrastructure usage and cost back to individual DAGs and benchmark workflow performance.
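One way to picture the metadata propagation described above is to attach the Airflow identifiers to the Spark job as configuration properties at submit time. The following is a minimal sketch, not the speakers' implementation: the helper name `spark_conf_args` and the `spark.airflow.*` property names are hypothetical, chosen only for illustration. `spark.app.name` is a standard Spark property.

```python
from typing import Dict, List


def spark_conf_args(dag_id: str, task_id: str, run_id: str) -> List[str]:
    """Build spark-submit --conf arguments that tag a Spark job with Airflow context."""
    tags: Dict[str, str] = {
        # Standard Spark property: makes the job recognisable in the Spark UI.
        "spark.app.name": f"{dag_id}.{task_id}",
        # Hypothetical custom property names, used here only as an example.
        "spark.airflow.dag_id": dag_id,
        "spark.airflow.task_id": task_id,
        "spark.airflow.run_id": run_id,
    }
    args: List[str] = []
    for key, value in tags.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Example values; in a real Dag these would come from the task context.
    print(spark_conf_args("daily_sales", "aggregate", "manual__2026-09-01"))
```

With tags like these on every job, usage and cost reports from the Spark side can be grouped by Dag or task, assuming the reporting tool exposes job-level configuration.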
Texas Ballroom 6
Airflow 3’s Deadline Alerts let you set “need-by” times on DAGs and fire callbacks when deadlines are missed. The built-in references cover common cases, but the real power is the feature’s extensibility. In this workshop, led by the feature’s author, we will go beyond the basics and explore these more advanced features.
We start with an overview of how DeadlineAlert, DeadlineReference, and Callback fit together, and how the scheduler detects misses. Then, a guided project: coding our own Callback implementation and building custom DeadlineReference classes using the @deadline_reference decorator, implementing _evaluate_with(), serialization, and required_kwargs. We wrap up with a hackathon-style “competition” to build the most creative WORKING DeadlineReference (business hours, the last time it didn’t rain in Vancouver, the moon phases, the last time the Leafs won the cup… anything goes, as long as it serializes and returns a valid datetime).
BONUS! Your hackathon “projects” would make a great lightning talk topic!
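As a warm-up for the "business hours" idea, here is a minimal sketch of the kind of datetime calculation a custom reference could wrap. It is plain Python and deliberately does not use Airflow's DeadlineReference API, whose signatures are not shown in this abstract. The function name `end_of_business_day` and the 17:00 closing time are assumptions made for illustration.

```python
from datetime import datetime, time, timedelta


def end_of_business_day(now: datetime, closing: time = time(17, 0)) -> datetime:
    """Return the next weekday closing time at or after `now`."""
    # Start with today's closing time, keeping the caller's timezone (if any).
    candidate = datetime.combine(now.date(), closing, tzinfo=now.tzinfo)
    # If today's closing time has already passed, move to tomorrow.
    if now >= candidate:
        candidate += timedelta(days=1)
    # Skip Saturday (5) and Sunday (6).
    while candidate.weekday() >= 5:
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    # Friday evening rolls over to Monday's closing time.
    print(end_of_business_day(datetime(2026, 9, 4, 18, 0)))
```

Whatever the reference logic is, the contract named in the abstract stays the same: it has to serialize and return a valid datetime.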
Hill Country AB
AI and ML pipelines built in Airflow often power critical business outcomes, but they rarely operate in isolation. In this hands-on workshop, learn how Control-M integrates with Airflow to orchestrate end-to-end workflows that connect data pipelines with upstream and downstream enterprise systems such as supply chain, billing, and other mission-critical applications. You’ll gain practical insight into how teams can improve visibility, reliability, and governance across Airflow-driven workflows, helping move AI and data initiatives from pipeline execution to enterprise-ready business impact.
Hill Country CD
As organizations scale their data platforms, managing access to Apache Airflow becomes increasingly complex. In this talk, we introduce the Keycloak Auth Manager — a pluggable authentication and authorization backend for Airflow that delegates identity management to Keycloak, a battle-tested open-source Identity and Access Management solution.
We’ll start with the big picture: what problem does the Keycloak Auth Manager solve, and why Keycloak? We’ll walk through the architecture — how Airflow’s auth manager interface works, how the Keycloak integration hooks into it, and how authentication flows (OIDC/OAuth2) and authorization (role mapping, resource-based permissions) are handled under the hood.
In the second part, we shift to the user perspective with a demo. You’ll see how to configure and deploy the integration, how end users experience SSO login, and how administrators manage roles and permissions in Keycloak that reflect directly in Airflow’s UI and API.
Finally, we’ll touch on how Keycloak naturally fits into multi-team scenarios, and what that unlocks for teams operating at scale.
Key takeaways:
- Understand how Airflow’s pluggable auth manager interface works
- Learn how Keycloak handles authentication (OIDC) and authorization (roles/permissions) for Airflow
- See a real-world deployment in action, from config to login to access control
- Walk away with practical tips for adopting this in your own stack
Last year, we showed how LinkedIn’s continuous deployment (LCD) runs on Apache Airflow to orchestrate safe, repeatable releases across thousands of services—powering everyday deployments for 10,000+ engineers.
This year, we’ll dive into the hard‑won patterns that keep those deployments stable at scale: preserving DAG consistency during live updates; routing seamlessly across multiple clusters for graceful failover; enforcing HA guardrails on the control plane; and using dynamic task mapping to deliver faster rollbacks and reduce deployment overhead. You’ll see how we abstract Airflow for a cleaner user experience, what really moved the needle on launching tasks faster, and portable observability practices that cut on‑call toil.
Key takeaways:
- Abstracting Airflow UX: define and update workflows without Airflow internals
- Multi‑cluster routing and failover: keep pipelines running during degradation
- DAG consistency without per‑run versioning: controlled ingestion and safe re‑ingestion
- Dynamic task mapping: faster rollbacks and practical rerun strategies
- Observability: health metrics, alerts, SLOs, and incident playbooks
With the rise of more complex asset- and agent-powered workflows, the Airflow UI needs to evolve beyond just a way to view failed logs and relationships between tasks.
Come see how we are leveraging the latest Airflow features to build new user experiences that can handle growing agentic workflows.
We’ll go through a few workflows to see how they can be solved through a traditional Dag-centric view or a new Asset-centric view. We will also showcase how both are becoming more realtime so you can always see what is happening.
Come with your own use cases for us to discuss what user experience fits each best.
Texas Ballroom 5
Airflow running slow? Memory is spiking. Tasks are queuing forever. Now what? Debugging performance issues in a distributed system like Airflow can feel overwhelming—is it the scheduler, the database, the DAG Processor, or your DAG code? This talk shares practical techniques for isolating and fixing performance problems, using real examples from the Airflow codebase.
We’ll cover:
- Understanding Airflow’s moving parts – Where bottlenecks typically hide (scheduler loop, DAG parsing, database queries).
- Profiling techniques – Memory profiling, query analysis, and metrics that actually matter.
- Case study: DAG Processor OOM – How a single SQLAlchemy query caused a memory explosion, and how we fixed it.
- Testing your fixes – Setting up reproducible performance tests before and after.
Whether you’re troubleshooting your own deployment or contributing fixes upstream, you’ll leave with a toolkit for tackling Airflow performance issues.
Texas Ballroom 6
This session details Idelic/Descartes’s critical journey to a robust, scaled Astronomer Airflow environment. We’ll share technical lessons from overcoming initial orchestration challenges and successfully scaling to over 1,000 active DAGs. The session will showcase our advanced, Jenkins-integrated testing deployment for managing this scale, and the development of a standardized framework that simplifies DAG creation, eliminates code repetition, and enables configuration changes without a full deployment. This is essential for any team managing complex data pipelines, offering a blueprint for standardized Airflow development, maximum data reliability, and future growth at a large scale.
Texas Ballroom 1-2-3
At Lyft, driver pay configs on GitHub must be validated through Airflow DAGs before merging. However, Scientists and Analysts who change configs are not familiar with Airflow. How do we make such validation self-service while meeting SOX compliance?
This talk presents a design pattern for bidirectional GitHub-Airflow integration: GitHub Actions trigger DAGs, and DAGs push results back as PR status checks via the GitHub Commit Status API. We cover event-driven push-style vs traditional polling style, and why an event-driven push-style works well with Dynamic Task Mapping. This pattern aligns with Airflow 3’s event-driven scheduling vision. We also discuss how SOX requirements shaped this design.
GitHub-Airflow integration is not just hitting APIs. We also address real-world challenges: choosing between GitHub API and GitPython for read/write patterns, preventing DAGs from overriding built-in PR checks, blocking merges during long-running execution, and handling PR rebases that orphan commit statuses. Our solution uses a persistent mapping table to propagate results across rebased commits.
Attendees will leave with a reusable blueprint for event-driven GitHub-Airflow integration.
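As a rough illustration of the rebase problem described above, the sketch below keeps the last validation result per pull request so it can be re-applied when a rebase produces a new commit SHA. The `StatusStore` class, its fields, and its method names are hypothetical; a real implementation would use a durable table and would post the state through the GitHub Commit Status API, and it would need its own rule for when carrying a result over is safe.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StatusStore:
    """Hypothetical in-memory stand-in for a persistent mapping table."""

    # (repo, pr_number) -> latest validation state, e.g. "success" or "failure"
    by_pr: dict = field(default_factory=dict)
    # commit SHA -> state that has been reported for that commit
    by_sha: dict = field(default_factory=dict)

    def record_result(self, repo: str, pr: int, sha: str, state: str) -> None:
        """Called when a validation run finishes for a commit."""
        self.by_pr[(repo, pr)] = state
        self.by_sha[sha] = state

    def on_rebase(self, repo: str, pr: int, new_sha: str) -> Optional[str]:
        """Carry the PR's last known result over to the rebased commit."""
        state = self.by_pr.get((repo, pr))
        if state is not None:
            self.by_sha[new_sha] = state
        return state
```

Keying the table by pull request rather than by commit is what lets a result outlive the SHA it was first reported against.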
Texas Ballroom 1-2-3
Orchestrating AI workloads introduces a two-front battle with infrastructure instability. First, the Airflow workers themselves (e.g., Kubernetes pod evictions, Celery node scaling) can restart and lose track of active tasks. Second, the external AI cluster running the heavy compute can experience temporary network blips, API timeouts or compute rescheduling. With standard Dag designs, these transient hiccups often cause Airflow to panic, fail the task, and tragically send a kill signal to an expensive, perfectly healthy AI job.
This Builder Track session explores a specialized Dag design pattern engineered to solve this dual-instability problem entirely at the code level. Rather than managing the underlying infrastructure, we will dive into how to write resilient Airflow tasks that act as fault-tolerant “watchers.” You will learn how to author Dags that survive worker evictions, patiently handle external AI cluster timeouts, and accurately reflect the true state of the workload, ensuring your pipelines remain bulletproof.
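One way to picture the "watcher" idea is a polling loop that treats transient errors differently from terminal job states. This is only a sketch under assumptions: `TransientError`, `watch_job`, and the status strings are hypothetical names, not part of Airflow or of the speaker's design, and the sketch covers polling only, not re-attaching after a worker restart.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for timeouts and brief network failures."""


def watch_job(get_status, max_transient=5, poll_seconds=0.0):
    """Poll an external job until it reaches a terminal state.

    get_status is a hypothetical callable returning "RUNNING", "SUCCEEDED",
    or "FAILED", and raising TransientError on a temporary failure.
    """
    consecutive_errors = 0
    while True:
        try:
            status = get_status()
        except TransientError:
            consecutive_errors += 1
            if consecutive_errors > max_transient:
                # Give up on watching, but never cancel the remote job here.
                raise
        else:
            consecutive_errors = 0  # any successful poll resets the budget
            if status in ("SUCCEEDED", "FAILED"):
                return status
        time.sleep(poll_seconds)
```

The loop only reports the remote job's state; it has no code path that sends a kill signal, so a blip on the watcher side cannot take down a healthy job.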
Texas Ballroom 6
With the growing recognition of the need for agentic orchestration, Airflow is evolving to support a growing set of agentic patterns. Dynamic Task Mapping provided a foundation for RAG workflows. Learn how to go beyond those and orchestrate reasoning patterns with dynamic graphs.
Texas Ballroom 1-2-3
If the idea of pushing to production on a Friday still makes your stomach drop, you’re in good company because most data professionals know that particular flavor of dread. But that fear says more about systemic fragility than the day of the week. This talk explores how unclear ownership, hidden dependencies, and late validation create production risk in data platforms. I’ll show how data contracts clarify expectations between producers and consumers, how Behavior‑Driven Development (BDD) provides a shared language for system behavior, and how Airflow can enforce guardrails that shift validation earlier and reduce blast radius. This session focuses on the organizational and architectural decisions that shape platform reliability. Because Airflow often becomes the visible surface of upstream uncertainty, its teams feel the impact of broader design and governance choices. Attendees will learn to interpret “Friday fear” as a strategic signal, how contracts and BDD strengthen alignment and predictability, and how Airflow can act as a platform‑level safety system that builds trust and supports confident deployments - Fridays included.
Texas Ballroom 5
Amazon Prime Video uses Airflow to forecast traffic for hundreds of micro-services to deliver the best customer experience for some of the world’s biggest live events across multiple global regions. The forecasting methodology involves complex job dependencies between customer interaction metrics and geographies - translating to ~50 production DAGs with cross-DAG dependencies that process terabytes of customer activity data daily across tens of thousands of compute cores. In this talk, we’ll cover how we manage dependency complexity at scale, coordinate data flows across geographical boundaries, and keep forecasts reliable as the system grows.
Texas Ballroom 6
Ever wondered what happens between typing SELECT ... GROUP BY and getting results back? Inside every SQL engine lives a scheduler that breaks your query into a DAG of tasks — shuffling, sorting, aggregating, and parallelizing work across partitions. Sound familiar?
In this talk, I’ll demystify SQL engine internals by building one on top of Apache Airflow. We’ll take a SQL query, parse it, optimize it, and transform it into a DAG of Airflow tasks that you can watch execute step by step in the Airflow UI.
You’ll walk away understanding:
- How SQL engines plan and schedule query execution
- What shuffle, partition, and pipeline-breaking actually mean
- How query parallelism works under the hood
No PhD in databases required — just curiosity and an Airflow UI to watch it all unfold.
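To make the shuffle and pipeline-breaking ideas concrete, here is a minimal plain-Python sketch of how a `SELECT key, SUM(value) ... GROUP BY key` could be split into stages. It is not the speaker's engine and uses no Airflow APIs; the function names and the byte-sum partitioning rule are assumptions chosen for illustration.

```python
from collections import defaultdict


def partial_aggregate(rows):
    """Map stage: sum values per key within one input partition."""
    sums = defaultdict(int)
    for key, value in rows:
        sums[key] += value
    return dict(sums)


def shuffle(partials, num_reducers):
    """Shuffle stage: route every key to exactly one reducer bucket."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partial in partials:
        for key, value in partial.items():
            # Stable partitioning rule (str hash is randomized per process).
            buckets[sum(key.encode()) % num_reducers][key].append(value)
    return buckets


def final_aggregate(bucket):
    """Reduce stage: combine the partial sums for each key in one bucket."""
    return {key: sum(values) for key, values in bucket.items()}


def run_query(partitions, num_reducers=2):
    partials = [partial_aggregate(p) for p in partitions]  # parallel per partition
    buckets = shuffle(partials, num_reducers)  # needs all partials: pipeline breaker
    result = {}
    for bucket in buckets:  # parallel per bucket
        result.update(final_aggregate(bucket))
    return result
```

Each list comprehension or loop over partitions or buckets corresponds to a group of tasks that could run in parallel, while the shuffle is the point where downstream work has to wait for every upstream partition.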
Texas Ballroom 5
Data Mesh decentralises data ownership across business domains. In regulated industries each domain operates in its own account where producers publish data products and consumers subscribe. This enforces governance, limits blast radius and preserves autonomy. When each domain runs its own Airflow, orchestrating across these boundaries is the central challenge. Airflow 2.4 introduced data-aware scheduling, which was designed for a single Airflow instance with no native cross-instance event propagation. In practice this meant building polling sensors that queried the producer REST API to check upstream completion, but this was unreliable: events were lost and ordering was not guaranteed. Airflow 3.0 resolves this with event-driven scheduling via AssetWatcher. The Triggerer monitors a message queue and triggers the consumer DAG when the producer publishes a completion event. This talk traces that journey through a regulated enterprise Data Mesh. We also share how we built an agentic AI skills framework that encodes operational Airflow knowledge into reusable skills, enabling an AI agent to autonomously deploy, validate and troubleshoot the cross-environment pattern end-to-end.
Texas Ballroom 1-2-3
During this workshop you are going to learn how to effectively set up a CI/CD pipeline that builds, tests, and even corrects (using Gemini) your DAGs BEFORE deploying them into your Composer environment. You will also learn how to deploy a single custom web app to centrally control all your DAGs across many Cloud Composer environments/projects.
Hill Country AB
This talk focuses on leveraging the Task State Management (AIP-103) and Enhanced Retry Policy work (AIP-105) being released in Airflow 3.3 to enable enhanced execution of long running tasks including checkpointing, sophisticated (and automated) retry policies, and intra-task observability.
Initially focused on Apache Spark, which is one of the most widely used workload frameworks for data engineers, this is extensible to long running tasks of any type including agentic workflows.
This solves a key pain point for Airflow users, without requiring custom async code development.
Texas Ballroom 6
As Apache Airflow environments scale, teams face duplicated DAG patterns, slow debugging cycles, and technical debt that often surfaces only in production. This workshop explores how AI can shift pipeline quality left by enabling earlier detection of issues and improving code reliability during development.
Using IBM Bob, we demonstrate real-time code review and refactoring guidance across IDE and terminal workflows, helping engineers identify complexity and performance risks before deployment. We also show how AI accelerates DAG debugging and improves consistency across pipelines in environments that span Airflow and streaming systems such as Confluent.
Attendees will learn practical patterns to improve Airflow reliability, reduce technical debt, and shift debugging earlier in the lifecycle.
Hill Country CD
Building an AI capability in Airflow is the easy part. The hard part is what comes next.
You want to swap a model, refactor a prompt, cut token costs, or try a local model instead of paying for cloud. How do you know it still works as expected? Without a fast feedback loop, every change is a gamble.
This talk shows practical patterns for building that feedback loop, with real examples using agent skills, MCPs, and local and cloud models. It covers the challenges too: sandboxing, observability, non-determinism, and keeping checks simple enough that people actually use them.
Texas Ballroom 1-2-3
In my almost 15 years as a data engineer, I’ve learned one universal truth: everyone needs orchestration. The marketing team needs daily attribution reports. The CRM team needs personalized newsletter triggers. The platform team needs cross-cloud data transfers. The analytics team needs third-party data imports. Data touches every corner of the business, and the orchestration layer is the one layer that connects it all.
This talk explores what becomes possible when we decouple pipeline logic (what happens) from definition (how it’s authored). With the right abstractions, the authoring interface can be anything: Python, declarative YAML, templates, spreadsheets, or even a video game.
We will explore different levels of orchestration abstraction: native Python, declarative config (using the open-source project DAG Factory), templates (using the open-source project Blueprint), and AI-assisted Authoring. To prove the power of this decoupled architecture, I will showcase a custom Minecraft plugin that lets you build Airflow Dags by placing blocks in Minecraft (yes, you heard that right).
Texas Ballroom 6
Apache Airflow’s Kubernetes integration enables flexible workload execution on Kubernetes but lacks advanced resource management features including application queueing, tenant isolation and gang scheduling. These features are increasingly critical for data engineering as well as AI/ML use cases, particularly GPU utilization optimization. For example, gang scheduling ensures all required resources for a job are allocated atomically, preventing partial allocations that waste resources. Apache YuniKorn, a Kubernetes-native scheduler, addresses these gaps by offering a high-performance alternative to the Kubernetes default scheduler. In this talk, we’ll demonstrate how to conveniently leverage YuniKorn’s power in Airflow, along with practical use cases and examples.
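The all-or-nothing property of gang scheduling can be sketched in a few lines of Python. This is an illustrative first-fit placement, not YuniKorn's actual algorithm; `gang_schedule` and the shapes of `free_gpus` and `job` are assumptions made for the example.

```python
def gang_schedule(free_gpus, job):
    """All-or-nothing placement: returns a list of (node, gpus), or None.

    free_gpus maps node name to free GPUs; job is a list of per-worker GPU
    requests. Both shapes are assumptions made for this sketch.
    """
    remaining = dict(free_gpus)  # work on a copy so failure has no side effects
    plan = []
    for need in sorted(job, reverse=True):  # place the largest workers first
        node = next(
            (n for n, free in sorted(remaining.items()) if free >= need), None
        )
        if node is None:
            return None  # one worker does not fit, so nothing is placed
        remaining[node] -= need
        plan.append((node, need))
    free_gpus.update(remaining)  # commit only after every worker fits
    return plan
```

Because capacity is only committed once every worker has a slot, a job that cannot be fully placed leaves the cluster untouched instead of holding GPUs it cannot use.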
Texas Ballroom 5
RAG pipelines fail silently. Bad retrievals, hallucinated answers, and stale vectors rarely trigger alerts; they quietly degrade your AI product. This session presents a reference DAG architecture for production-grade RAG ingestion built on Airflow 3, with inline quality gates that evaluate retrieval accuracy and LLM faithfulness before a single vector reaches production. We’ll walk through four common RAG failure modes and the specific Airflow pattern that stops each one using RAGAS as the evaluation framework, and Airflow 3’s TaskFlow API, Assets, and DAG Versioning to make pipelines reproducible and event-driven. You’ll leave with reusable quality gate patterns and a concrete architecture you can adapt — because in RAG systems, quality shouldn’t be an afterthought. It should be built into the pipeline from the start.
Texas Ballroom 1-2-3
Orchestrating Cross-Account ML & Data Pipelines with Apache Airflow
As organizations scale data and ML workloads across multiple AWS accounts and Regions, orchestration becomes the hardest engineering problem, not the models themselves. This session shows how Apache Airflow serves as a centralized orchestration hub for distributed data-processing and machine-learning pipelines that span account and regional boundaries.
We walk through a production-ready architecture where a single Airflow environment coordinates:
- Cross-account DAG patterns: using Airflow connections, IAM role assumption, and custom hooks to trigger AWS Glue, SageMaker, and Lambda in remote accounts
- Cross-Region data flow: leveraging S3 Cross-Region Replication with S3KeySensor operators to gate downstream tasks on data availability
- Custom operators for cross-account ML: extending SageMakerHook and SageMakerTrainingOperator to train models in a separate account while keeping orchestration centralized
- Sensor and operator design: choosing the right sensor modes, timeouts, and poke intervals for long-running training jobs and inference calls
Attendees leave with reusable DAG patterns, operator recipes, and an architecture blueprint for running multi-account, multi-Region data and ML pipelines — all orchestrated through Airflow.
Texas Ballroom 5
Meet airflowctl, the new default for API-driven remote operations. You will see how separating control from execution enhances security, enables isolation, and simplifies automation across different environments. I will discuss the development of airflowctl, demonstrate practical examples of secure remote execution, and provide a guide for transitioning from legacy workflows. You will learn how to easily migrate towards airflowctl and leverage the flexibility of an API-driven approach.
Texas Ballroom 6
Your data platform team didn’t sign up to be a Dag factory. But when Airflow expertise is concentrated in a small group of engineers, that’s exactly what happens. Analysts wait days for simple workflows, engineers burn cycles rebuilding the same patterns, and frustrated teams start building outside the stack entirely. The real fix isn’t a better onboarding guide or a friendlier UI. It’s rethinking the abstraction layer your team exposes to the rest of the business.
In this talk, we’ll introduce astronomer/blueprint, an open-source Python library built around the idea that platform teams should define the building blocks and everyone else should be able to assemble them. We’ll walk through how to design composable, type-safe templates using Pydantic validation and control exactly which parameters downstream users can configure. We’ll also show how we’ve layered a no-code visual interface on top in Astro, Astronomer’s unified orchestration platform for Apache Airflow®, giving non-technical users a path to self-service without sacrificing governance or control. You’ll leave with a concrete framework and open-source tooling you can put to work immediately.
Texas Ballroom 1-2-3
A live text adventure where Airflow is the game engine. Rooms are tasks. Choices are branches. Inventory lives in XComs. Monsters have Deadline Alerts. The audience votes at every fork, and the DAG decides what happens next. It’s silly, it’s live, and every concept maps to a real production pattern.
Texas Ballroom 5
At Equifax, Apache Airflow is used across many departments, helping Data Engineers, Data Scientists, and Business Analysts in their daily work.
This presentation is about how to use modern orchestration technology at the heart of data processing and business processes to support daily company operations.
Texas Ballroom 1-2-3
In many modern data platforms, orchestration tools are combined with transformation frameworks. A common pattern is orchestrating dbt (data build tool) transformations using Apache Airflow — something reported by roughly 44% of the community.
At first glance, the integration seems straightforward: simply run dbt run inside an Airflow task. Some teams go further and use libraries that convert dbt projects into native Airflow DAGs, such as Astronomer Cosmos.
In practice, however, teams quickly run into operational and architectural challenges. Slowness, out-of-memory errors, zombie tasks, and DAGs that take minutes to appear in the UI are just a few of the issues that can emerge as projects scale.
In this talk, I’ll walk through the most common problems I’ve encountered while supporting organisations running dbt with Airflow, explain why they occur, and share practical strategies to avoid or resolve them. The goal is to help teams build more reliable and scalable pipelines when combining Airflow orchestration with dbt transformations.
Texas Ballroom 1-2-3
As data platforms mature, organizations often experience “Airflow Sprawl”—the rapid, organic growth of isolated Airflow instances across different teams and projects. While this empowers localized control, it creates dangerous silos that hinder visibility, increase operational risk, and erode developer productivity. In this session, we will explore the critical challenges of managing a fragmented Airflow ecosystem and discuss strategies for regaining control. We will examine why centralizing execution history and establishing unified observability is essential for reducing Mean Time to Recovery (MTTR), mitigating hidden security risks, and transforming fragmented instances into a cohesive, reliable data service. Attendees will leave with a strategic framework for managing Airflow at scale.
Texas Ballroom 5
This talk is the story of getting a PR merged into Apache Airflow without writing a single line of code, using Apache Airflow itself as an agentic orchestration harness to replicate the functionality of Claude Code for any pluggable LLM.
We’ll walk through how Airflow’s AIP-99 Dag functionality maps naturally onto the tool-use loops, context management, and decision branching that power modern agentic coding workflows. The result is a model-agnostic harness that can read a codebase, reason about changes, write and test code, and deploy a commit to a git repository, all orchestrated as an Airflow Dag.
The proof of concept is a merged PR into Apache Airflow itself!
Attendees will leave with a mental model for thinking about Airflow as an agentic harness, practical patterns for wrapping LLM tool-use in Dag tasks, and a healthy appreciation for what happens when you point an autonomous coding agent at its own codebase.
Texas Ballroom 6
Modern data platforms generate overwhelming amounts of operational data across distributed systems. For teams running Apache Airflow at scale, incidents often mean high mean time to resolution (MTTR), constant context switching between observability tools, and a growing on-call burden.
What if your Airflow environment had an always-on, autonomous on-call engineer?
In this workshop, we’ll explore how an AI-powered DevOps agent can supercharge Airflow operations — from automated DAG failure diagnosis and intelligent log analysis to proactive prevention of recurring incidents. Whether you’re running Airflow on a managed cloud service or self-hosted, the patterns and practices covered here apply broadly to modern data pipeline operations.
Key topics covered:
- Autonomous incident detection & response — topology-aware analysis across DAG dependencies with actionable mitigation plans
- Real-time root cause analysis — correlating logs, metrics, and traces across your Airflow application stack
- Collaboration integrations — automatic messaging, ticketing, and incident management workflows
Join us to transform your Airflow DevOps experience.
Hill Country CD
Apache Airflow has long been the go-to orchestration platform for data engineering teams, but managing the underlying infrastructure remains a persistent challenge. Amazon MWAA Serverless eliminates that burden entirely — no environment sizing, no capacity planning, and no idle costs. In this hands-on workshop, attendees will get a practical introduction to MWAA Serverless and walk away having built and run a real end-to-end ML pipeline on AWS.
In this workshop, we’ll use an agent equipped with MWAA Serverless knowledge and Airflow DAG tooling to build a pipeline that takes raw training data from S3, kicks off a SageMaker training job, evaluates the output using Claude on Bedrock, and deploys the model if evaluation passes. You’ll watch the agent reason about the task, generate a valid YAML workflow using the DAG factory pattern, and deploy it to MWAA Serverless, with no hand-written code required. Once deployed, we trigger a run and provide full observability: task-level logs from the SageMaker job, the Bedrock evaluation output, and the deploy decision. If something fails, we show how to identify the broken step, read its isolated logs, and iterate, either by asking the agent to fix the YAML or by rolling back to a prior version. The goal is the full loop: prompt, deploy, run, observe, debug, iterate.
By the end of the session, attendees will have hands-on experience with the full MWAA Serverless workflow lifecycle. Whether you’re new to MWAA Serverless or looking to accelerate pipeline development with AI tooling, you’ll leave with a repeatable pattern you can apply immediately.
Hill Country AB
Last year, we showed how LinkedIn’s continuous deployment (LCD) runs on Apache Airflow to orchestrate safe, repeatable releases across thousands of services—powering everyday deployments for 10,000+ engineers.
This year, we’ll dive into the hard‑won patterns that keep those deployments stable at scale: preserving DAG consistency during live updates; routing seamlessly across multiple clusters for graceful failover; enforcing HA guardrails on the control plane; and using dynamic task mapping to deliver faster rollbacks and reduce deployment overhead. You’ll see how we abstract Airflow for a cleaner user experience, what really moved the needle on launching tasks faster, and portable observability practices that cut on‑call toil.
Key takeaways:
- Abstracting Airflow UX: define and update workflows without Airflow internals
- Multi‑cluster routing and failover: keep pipelines running during degradation
- DAG consistency without per‑run versioning: controlled ingestion and safe re‑ingestion
- Dynamic task mapping: faster rollbacks and practical rerun strategies
- Observability: health metrics, alerts, SLOs, and incident playbooks
As Apache Airflow expands beyond batch into real-time, event-driven architectures, data teams face a new set of challenges: duplicated DAG patterns, fragile Kafka-triggered workflows, and debugging cycles that happen too late—often in production.
In this session, we introduce a shift-left approach to pipeline reliability for environments combining Airflow with streaming platforms like Confluent. We’ll explore how event-driven pipelines increase complexity—and why traditional debugging and validation approaches no longer scale. You’ll see how IBM Bob, an AI-powered assistant for data engineers, brings real-time code review, refactoring guidance, and debugging insights directly into developer workflows.
From catching DAG anti-patterns early to improving consistency across batch and streaming pipelines, we’ll demonstrate how teams can prevent issues before they reach production.
We’ll also share practical patterns to:
- Improve Airflow code quality across distributed teams
- Standardize DAG development for batch and streaming use cases
- Reduce MTTD (Mean Time to Detection) and MTTR (Mean Time to Resolution)
- Automate DAG tracking across your enterprise through lineage graphs
- Minimize technical debt as pipeline complexity grows
We’ll close with a preview of our hands-on workshop, where attendees can apply these concepts in a live lab—using AI to debug, optimize, and standardize Airflow pipelines in real-world scenarios.
Texas Ballroom 1-2-3
During this session, I’ll deep dive into the implementation of an AI-powered endurance sports coach using Apache Airflow as the backbone for data ingestion and processing. Beyond data pipelines, I’ll explain what’s required to build a conversational AI system, from structured data modeling to orchestration and retrieval. We’ll explore how metrics are precomputed, how vector search enables contextual memory, and which front-end patterns work best for interacting with AI agents. The result is a reproducible architecture where Airflow powers the data layer and an LLM provides the reasoning on top to help athletes perform at their best in numbers-driven endurance sports.
Texas Ballroom 6
Airflow’s evolution toward a client-server architecture faced a fundamental challenge: splitting a monolithic codebase into independent distributions (airflow-core, task-sdk, providers) without triggering dependency hell. Traditional PyPI packaging and code duplication both fail at Airflow’s scale.
Airflow 3.2 solves this through modular isolation and shared libraries using in-repository symlinks. This approach ensures each distribution ships with the exact version of shared code it requires, eliminating runtime version conflicts and allowing for independent dependency management. We have already migrated 10+ critical components—including the config parser, observability, and secrets masking—into this shared model.
This architecture unlocks:
- Zero Version Conflicts: Mix and match airflow-core and task-sdk versions seamlessly.
- Streamlined Maintenance: Automatic security fixes across all components.
- AIP-72 Vision: Lightweight workers with API-first communication, removing the need for direct database access.
Join us to explore how shared libraries transform Airflow’s monorepo into a modular, scalable, and truly distributed orchestration platform!
Texas Ballroom 1-2-3
Apache Airflow is the de facto orchestrator for modern data platforms, while Apache Spark powers large-scale data processing. But when the two meet in production, teams quickly face architectural decisions that affect reliability, performance, and cloud cost.
In this talk we explore key design questions when orchestrating Spark with Airflow:
- Should you run a shared Spark cluster, a cluster per DAG run, or clusters per task?
- When should Spark workloads run in parallel vs sequentially within a workflow?
- How can teams benchmark pipeline performance in terms of both runtime and cost?
- How do emerging features like Spark Declarative Pipelines change how Spark integrates with orchestration systems?
Using real production scenarios, we’ll examine the tradeoffs between orchestration strategies and show how Spark observability with DataFlint OSS helps analyze execution plans, task behavior, and runtime metrics.
We’ll also demonstrate how propagating Airflow metadata into Spark jobs allows teams to attribute infrastructure usage and cost back to individual DAGs and benchmark workflow performance.
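One way to picture the metadata propagation described above is to pass Airflow identifiers to Spark as custom configuration properties at submit time. The sketch below is only an illustration, not the speakers' implementation: the function name and the `spark.airflow.*` property names are hypothetical, chosen here because `spark-submit` accepts arbitrary `spark.`-prefixed keys through `--conf`.

```python
def spark_submit_args(dag_id, task_id, run_id, application):
    """Build a spark-submit command line tagged with Airflow identifiers.

    The spark.airflow.* keys are hypothetical names for this sketch; any
    spark.-prefixed key can be set with --conf and read back from the
    application's Spark configuration for cost attribution.
    """
    tags = {
        "spark.airflow.dag_id": dag_id,
        "spark.airflow.task_id": task_id,
        "spark.airflow.run_id": run_id,
    }
    args = ["spark-submit"]
    for key, value in tags.items():
        # One --conf flag per tag, in a stable order.
        args += ["--conf", f"{key}={value}"]
    args.append(application)
    return args
```

With tags like these attached to every job, usage metrics collected on the Spark side can be grouped by DAG, task, or run.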
Texas Ballroom 6
Generic AI coding assistants like Cursor and Claude Code are powerful, but they struggle with proprietary infrastructure. At Wix, where 120 Data Engineers manage 7,500 active DAGs, we found that standard AI tools lacked the context to be truly effective: they didn’t know our custom operators, DWH modeling patterns, or strict governance rules. In this session, we introduce our internal “Agentic IDE Configuration Manager” that bridges this gap. We will demonstrate how we leverage MCPs to inject deep Airflow context into our AI agents. You will learn how we enabled our coding agents to:
- Generate compliant code by utilizing custom Cursor rules to ensure every DAG meets production standards and naming conventions.
- Interact with Airflow by using our custom MCPs to run DAGs locally, parse error logs, and autonomously fix pipeline failures.
- Understand data by accessing our Data Catalog and Trino engine to validate schema logic in real time.
Whether you are trying to optimize your team’s workflows or are simply curious how far coding agents can go today, join us for this talk.
Texas Ballroom 1-2-3
With the rise of more complex asset- and agent-powered workflows, the Airflow UI needs to evolve beyond just a way to view failure logs and relationships between tasks.
Come see how we are leveraging the latest Airflow features to build new user experiences that can handle growing agentic workflows.
We’ll go through a few workflows to see how they can be handled through a traditional Dag-centric view or a new Asset-centric view. We will also showcase how both are becoming more real-time so you can always see what is happening.
Bring your own use cases so we can discuss which user experience fits each one best.
Texas Ballroom 5
As organizations scale their data platforms, managing access to Apache Airflow becomes increasingly complex. In this talk, we introduce the Keycloak Auth Manager — a pluggable authentication and authorization backend for Airflow that delegates identity management to Keycloak, a battle-tested open-source Identity and Access Management solution.
We’ll start with the big picture: what problem does the Keycloak Auth Manager solve, and why Keycloak? We’ll walk through the architecture — how Airflow’s auth manager interface works, how the Keycloak integration hooks into it, and how authentication flows (OIDC/OAuth2) and authorization (role mapping, resource-based permissions) are handled under the hood.
In the second part, we shift to the user perspective with a demo. You’ll see how to configure and deploy the integration, how end users experience SSO login, and how administrators manage roles and permissions in Keycloak that reflect directly in Airflow’s UI and API.
Finally, we’ll touch on how Keycloak naturally fits into multi-team scenarios, and what that unlocks for teams operating at scale.
Key takeaways:
- Understand how Airflow’s pluggable auth manager interface works
- Learn how Keycloak handles authentication (OIDC) and authorization (roles/permissions) for Airflow
- See a real-world deployment in action, from config to login to access control
- Walk away with practical tips for adopting this in your own stack
AI and ML pipelines built in Airflow often power critical business outcomes, but they rarely operate in isolation. In this hands-on workshop, learn how Control-M integrates with Airflow to orchestrate end-to-end workflows that connect data pipelines with upstream and downstream enterprise systems such as supply chain, billing, and other mission-critical applications. You’ll gain practical insight into how teams can improve visibility, reliability, and governance across Airflow-driven workflows, helping move AI and data initiatives from pipeline execution to enterprise-ready business impact.
Hill Country CD
Airflow 3’s Deadline Alerts let you set “need-by” times on DAGs and fire callbacks when deadlines are missed. The built-in references cover common cases, but the real power is the feature’s extensibility. In this workshop, led by the feature’s author, we will go beyond the basics and explore these more advanced features.
We start with an overview of how DeadlineAlert, DeadlineReference, and Callback fit together, and how the scheduler detects misses. Then, a guided project: coding our own Callback implementation and building custom DeadlineReference classes using the @deadline_reference decorator, implementing _evaluate_with(), serialization, and required_kwargs. We wrap up with a hackathon-style “competition” to build the most creative WORKING DeadlineReference (business hours, the last time it didn’t rain in Vancouver, the moon phases, the last time the Leafs won the cup… anything goes, as long as it serializes and returns a valid datetime).
BONUS! Your hackathon “projects” would make a great lightning talk topic!
Hill Country AB
This session details Idelic/Descartes’s critical journey to a robust, scaled Astronomer Airflow environment. We’ll share technical lessons from overcoming initial orchestration challenges and successfully scaling to over 1,000 active DAGs. The session will showcase our advanced, Jenkins-integrated testing deployment for managing this scale, and the development of a standardized framework that simplifies DAG creation, eliminates code repetition, and enables configuration changes without a full deployment. This is essential for any team managing complex data pipelines, offering a blueprint for standardized Airflow development, maximum data reliability, and future growth at a large scale.
Texas Ballroom 1-2-3
Airflow running slow? Memory is spiking. Tasks are queuing forever. Now what? Debugging performance issues in a distributed system like Airflow can feel overwhelming—is it the scheduler, the database, the DAG Processor, or your DAG code? This talk shares practical techniques for isolating and fixing performance problems, using real examples from the Airflow codebase.
We’ll cover:
- Understanding Airflow’s moving parts – Where bottlenecks typically hide (scheduler loop, DAG parsing, database queries).
- Profiling techniques – Memory profiling, query analysis, and metrics that actually matter.
- Case study: DAG Processor OOM – How a single SQLAlchemy query caused a memory explosion, and how we fixed it.
- Testing your fixes – Setting up reproducible performance tests before and after.
Whether you’re troubleshooting your own deployment or contributing fixes upstream, you’ll leave with a toolkit for tackling Airflow performance issues.
Texas Ballroom 6
At Lyft, driver pay configs on GitHub must be validated through Airflow DAGs before merging. However, Scientists and Analysts who change configs are not familiar with Airflow. How do we make such validation self-service while meeting SOX compliance?
This talk presents a design pattern for bidirectional GitHub-Airflow integration: GitHub Actions trigger DAGs, and DAGs push results back as PR status checks via the GitHub Commit Status API. We compare an event-driven push style with traditional polling, and explain why the push style works well with Dynamic Task Mapping. This pattern aligns with Airflow 3’s event-driven scheduling vision. We also discuss how SOX requirements shaped this design.
GitHub-Airflow integration is not just hitting APIs. We also address real-world challenges: choosing between GitHub API and GitPython for read/write patterns, preventing DAGs from overriding built-in PR checks, blocking merges during long-running execution, and handling PR rebases that orphan commit statuses. Our solution uses a persistent mapping table to propagate results across rebased commits.
Attendees will leave with a reusable blueprint for event-driven GitHub-Airflow integration.
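As a rough illustration of the rebase problem mentioned above, the sketch below keys validation results by pull request number rather than by commit SHA, so a result can be posted again when a rebase produces a new head commit. This is an assumption about how such a mapping table might work, not the speakers' design. `StatusStore`, `status_url`, and the `airflow/validation` context string are hypothetical names, and a dictionary stands in for the persistent table. The URL shape and the `state`, `description`, and `context` fields follow GitHub's commit status endpoint.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StatusStore:
    """In-memory stand-in for a persistent mapping table (hypothetical)."""

    # PR number -> (state, description) of the most recent validation result.
    results: dict[int, tuple[str, str]] = field(default_factory=dict)

    def record(self, pr_number: int, state: str, description: str) -> None:
        """Remember the latest result for a PR, independent of any commit SHA."""
        self.results[pr_number] = (state, description)

    def payload_for(self, pr_number: int, context: str = "airflow/validation") -> dict | None:
        """Build the JSON body for POST /repos/{owner}/{repo}/statuses/{sha}."""
        if pr_number not in self.results:
            return None
        state, description = self.results[pr_number]
        return {"state": state, "description": description, "context": context}


def status_url(owner: str, repo: str, sha: str) -> str:
    """URL of the commit status endpoint for one specific commit."""
    return f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}"
```

After a rebase, a caller would combine `status_url(owner, repo, new_sha)` with `payload_for(pr_number)` to attach the stored result to the new head commit. The HTTP request itself is left out of the sketch.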
Texas Ballroom 1-2-3
Orchestrating AI workloads introduces a two-front battle with infrastructure instability. First, the Airflow workers themselves (e.g., Kubernetes pod evictions, Celery node scaling) can restart and lose track of active tasks. Second, the external AI cluster running the heavy compute can experience temporary network blips, API timeouts or compute rescheduling. With standard Dag designs, these transient hiccups often cause Airflow to panic, fail the task, and tragically send a kill signal to an expensive, perfectly healthy AI job.
This Builder Track session explores a specialized Dag design pattern engineered to solve this dual-instability problem entirely at the code level. Rather than managing the underlying infrastructure, we will dive into how to write resilient Airflow tasks that act as fault-tolerant “watchers.” You will learn how to author Dags that survive worker evictions, patiently handle external AI cluster timeouts, and accurately reflect the true state of the workload, ensuring your pipelines remain bulletproof.
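The “watcher” idea can be sketched in plain Python, independent of any Airflow API. The example below is a minimal illustration under stated assumptions, not the pattern presented in the session: `watch_job`, `TransientError`, and the status strings are hypothetical, and `get_status` stands in for whatever call reports the state of the external job.

```python
import time


class TransientError(Exception):
    """Raised by the status check for temporary blips such as timeouts (hypothetical)."""


def watch_job(get_status, max_consecutive_errors=5, poll_seconds=30.0, sleep=time.sleep):
    """Poll an external job until it reaches a terminal state.

    get_status is a hypothetical callable that returns "RUNNING",
    "SUCCEEDED", or "FAILED", and raises TransientError on a temporary
    failure to reach the external cluster.
    """
    consecutive_errors = 0
    while True:
        try:
            status = get_status()
        except TransientError:
            consecutive_errors += 1
            if consecutive_errors >= max_consecutive_errors:
                # Stop watching, but never cancel the remote job from here.
                raise
        else:
            # Any successful check resets the error budget.
            consecutive_errors = 0
            if status == "SUCCEEDED":
                return status
            if status == "FAILED":
                raise RuntimeError("external job reported failure")
        sleep(poll_seconds)
```

The point of the design is that only the job's own terminal state fails the task. A failed status check just consumes part of an error budget, and a single successful check restores it.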
Texas Ballroom 6
With the growing recognition of the need for agentic orchestration, Airflow is evolving to support a growing set of agentic patterns. Dynamic Task Mapping provided a foundation for RAG workflows. Learn how to go beyond those and orchestrate reasoning patterns with dynamic graphs.
Texas Ballroom 1-2-3
If the idea of pushing to production on a Friday still makes your stomach drop, you’re in good company because most data professionals know that particular flavor of dread. But that fear says more about systemic fragility than the day of the week. This talk explores how unclear ownership, hidden dependencies, and late validation create production risk in data platforms. I’ll show how data contracts clarify expectations between producers and consumers, how Behavior‑Driven Development (BDD) provides a shared language for system behavior, and how Airflow can enforce guardrails that shift validation earlier and reduce blast radius. This session focuses on the organizational and architectural decisions that shape platform reliability. Because Airflow often becomes the visible surface of upstream uncertainty, its teams feel the impact of broader design and governance choices. Attendees will learn to interpret “Friday fear” as a strategic signal, how contracts and BDD strengthen alignment and predictability, and how Airflow can act as a platform‑level safety system that builds trust and supports confident deployments - Fridays included.
Texas Ballroom 5
Amazon Prime Video uses Airflow to forecast traffic for hundreds of micro-services to deliver the best customer experience for some of the world’s biggest live events across multiple global regions. The forecasting methodology involves complex job dependencies between customer interaction metrics and geographies - translating to ~50 production DAGs with cross-DAG dependencies that process terabytes of customer activity data daily across tens of thousands of compute cores. In this talk, we’ll cover how we manage dependency complexity at scale, coordinate data flows across geographical boundaries, and keep forecasts reliable as the system grows.
Texas Ballroom 6
Data Mesh decentralises data ownership across business domains. In regulated industries each domain operates in its own account where producers publish data products and consumers subscribe. This enforces governance, limits blast radius and preserves autonomy. When each domain runs its own Airflow, orchestrating across these boundaries is the central challenge. Airflow 2.4 introduced data-aware scheduling, which was designed for a single Airflow instance with no native cross-instance event propagation. In practice this meant building polling sensors that queried the producer REST API to check upstream completion, but this approach was unreliable: events were lost and ordering was not guaranteed. Airflow 3.0 resolves this with event-driven scheduling via AssetWatcher. The Triggerer monitors a message queue and triggers the consumer DAG when the producer publishes a completion event. This talk traces that journey through a regulated enterprise Data Mesh. We also share how we built an agentic AI skills framework that encodes operational Airflow knowledge into reusable skills, enabling an AI agent to autonomously deploy, validate and troubleshoot the cross-environment pattern end-to-end.
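To make the event-driven pattern in this abstract concrete, here is a minimal sketch in plain Python. It does not use Airflow or the AssetWatcher API; the `watch` function, the event fields `asset` and `run_id`, and the consumer names are all hypothetical. The idea is that a watcher drains completion events from a queue in arrival order and decides which consumer pipelines to trigger, rather than polling the producer.

```python
import queue


def watch(events, consumers):
    """Drain completion events and list the consumer pipelines to trigger.

    events: queue.Queue of dicts with hypothetical keys "asset" and "run_id".
    consumers: dict mapping an asset name to the pipelines that subscribe to it.
    Returns (consumer, run_id) pairs in the order the events arrived.
    """
    triggered = []
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return triggered  # queue drained; nothing left to trigger
        for consumer in consumers.get(event["asset"], []):
            triggered.append((consumer, event["run_id"]))
```

Because the queue preserves arrival order and every event is consumed exactly once, this avoids the lost-event and ordering problems that the abstract attributes to polling.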
Texas Ballroom 1-2-3
Ever wondered what happens between typing SELECT ... GROUP BY and getting results back? Inside every SQL engine lives a scheduler that breaks your query into a DAG of tasks — shuffling, sorting, aggregating, and parallelizing work across partitions. Sound familiar?
In this talk, I’ll demystify SQL engine internals by building one on top of Apache Airflow. We’ll take a SQL query, parse it, optimize it, and transform it into a DAG of Airflow tasks that you can watch execute step by step in the Airflow UI.
You’ll walk away understanding:
- How SQL engines plan and schedule query execution
- What shuffle, partition, and pipeline-breaking actually mean
- How query parallelism works under the hood
No PhD in databases required — just curiosity and an Airflow UI to watch it all unfold.
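As a rough illustration of the planning steps listed above, the sketch below splits a `SELECT key, SUM(value) ... GROUP BY key` into three stages: partial aggregation per partition, a hash shuffle, and final aggregation. It is plain Python, not the speaker's engine, and it uses no Airflow APIs; the function names are hypothetical. In an Airflow-backed engine each stage call could plausibly become one task.

```python
from collections import defaultdict


def partial_aggregate(partition):
    """Stage 1: sum values per key within one input partition."""
    sums = defaultdict(int)
    for key, value in partition:
        sums[key] += value
    return dict(sums)


def shuffle(partials, num_reducers):
    """Stage 2: route every key to exactly one reducer by hash.

    This is a pipeline breaker: all partial results must exist
    before any reducer can start.
    """
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partial in partials:
        for key, subtotal in partial.items():
            buckets[hash(key) % num_reducers][key].append(subtotal)
    return buckets


def final_aggregate(bucket):
    """Stage 3: combine the subtotals that landed on one reducer."""
    return {key: sum(subtotals) for key, subtotals in bucket.items()}


def run_group_by_sum(partitions, num_reducers=2):
    # Each partial_aggregate call is independent, so it could run in parallel.
    partials = [partial_aggregate(p) for p in partitions]
    buckets = shuffle(partials, num_reducers)
    result = {}
    # Each reducer owns a disjoint set of keys, so results merge without conflicts.
    for bucket in buckets:
        result.update(final_aggregate(bucket))
    return result
```

For example, `run_group_by_sum([[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]])` sums the two `"a"` rows across partitions while `"b"` and `"c"` pass through unchanged.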
Texas Ballroom 5
This talk focuses on leveraging the Task State Management (AIP-103) and Enhanced Retry Policy work (AIP-105) being released in Airflow 3.3 to enable enhanced execution of long running tasks including checkpointing, sophisticated (and automated) retry policies, and intra-task observability.
Initially focused on Apache Spark, which is one of the most widely used workload frameworks for data engineers, this is extensible to long running tasks of any type including agentic workflows.
This solves a key pain point for Airflow users, without requiring custom async code development.
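The combination of checkpointing and automated retries can be sketched in plain Python. This is only an illustration of the general idea and not the AIP-103 or AIP-105 interface; `run_with_checkpoints`, the `checkpoint` dict, and its `next_step` key are hypothetical stand-ins for persisted task state.

```python
import time


def run_with_checkpoints(steps, checkpoint, max_retries=3, base_delay=0.0):
    """Run steps in order, resuming after the last completed step on retry.

    steps: list of zero-argument callables.
    checkpoint: dict holding "next_step"; stands in for persisted task state.
    Returns the number of retries that were needed.
    """
    attempt = 0
    while True:
        try:
            for index in range(checkpoint.get("next_step", 0), len(steps)):
                steps[index]()
                checkpoint["next_step"] = index + 1  # record progress after each step
            return attempt
        except Exception:
            attempt += 1
            if attempt > max_retries:
                raise  # retry budget exhausted; surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```

On a retry, only the steps at or after `next_step` run again, so work completed before the failure is not repeated.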
Texas Ballroom 6
As Apache Airflow environments scale, teams face duplicated DAG patterns, slow debugging cycles, and technical debt that often surfaces only in production. This workshop explores how AI can shift pipeline quality left by enabling earlier detection of issues and improving code reliability during development.
Using IBM Bob, we demonstrate real-time code review and refactoring guidance across IDE and terminal workflows, helping engineers identify complexity and performance risks before deployment. We also show how AI accelerates DAG debugging and improves consistency across pipelines in environments that span Airflow and streaming systems such as Confluent.
Attendees will learn practical patterns to improve Airflow reliability, reduce technical debt, and shift debugging earlier in the lifecycle.
Hill Country CD
During this workshop you will learn how to set up a CI/CD pipeline that builds, tests, and even corrects (using Gemini) your DAGs before deploying them into your Composer environment. You will also learn how to deploy a single custom web app to centrally control all your DAGs across many Cloud Composer environments/projects.
Hill Country AB
Building an AI capability in Airflow is the easy part. The hard part is what comes next.
You want to swap a model, refactor a prompt, cut token costs, or try a local model instead of paying for cloud. How do you know it still works as expected? Without a fast feedback loop, every change is a gamble.
This talk shows practical patterns for building that feedback loop, with real examples using agent skills, MCPs, and local and cloud models. It covers the challenges too: sandboxing, observability, non-determinism, and keeping checks simple enough that people actually use them.
Texas Ballroom 1-2-3
Apache Airflow’s Kubernetes integration enables flexible workload execution on Kubernetes but lacks advanced resource management features including application queueing, tenant isolation and gang scheduling. These features are increasingly critical for data engineering as well as AI/ML use cases, particularly GPU utilization optimization. For example, gang scheduling ensures all required resources for a job are allocated atomically, preventing partial allocations that waste resources. Apache YuniKorn, a Kubernetes-native scheduler, addresses these gaps by offering a high-performance alternative to the default Kubernetes scheduler. In this talk, we’ll demonstrate how to conveniently leverage YuniKorn’s power in Airflow, along with practical use cases and examples.
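The all-or-nothing property of gang scheduling mentioned above can be shown with a small sketch. This is plain Python and not YuniKorn's algorithm or API; `gang_allocate` and the node names are hypothetical, and the placement rule is a simple first-fit heuristic that may reject a gang that a smarter packer could place.

```python
def gang_allocate(free_gpus, job_requests):
    """Place every pod of a job, or none of them.

    free_gpus: dict of node name -> free GPU count (never mutated).
    job_requests: list of GPU counts, one per pod in the job.
    Returns a sorted list of (pod_index, node) placements, or None if the
    whole gang cannot be placed.
    """
    remaining = dict(free_gpus)  # work on a copy so a failure leaves no trace
    placements = []
    # Place the largest pods first (first-fit-decreasing heuristic).
    order = sorted(range(len(job_requests)), key=lambda i: -job_requests[i])
    for pod_index in order:
        need = job_requests[pod_index]
        node = next((n for n, free in remaining.items() if free >= need), None)
        if node is None:
            return None  # all-or-nothing: discard the partial placement
        remaining[node] -= need
        placements.append((pod_index, node))
    return sorted(placements)
```

A scheduler without this guarantee could start the first pod of a two-pod training job, hold its GPUs, and then wait indefinitely for capacity for the second pod.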
Texas Ballroom 5
In my almost 15 years as a data engineer, I’ve learned one universal truth: everyone needs orchestration. The marketing team needs daily attribution reports. The CRM team needs personalized newsletter triggers. The platform team needs cross-cloud data transfers. The analytics team needs third-party data imports. Data touches every corner of the business, and the orchestration layer is the one layer that connects it all.
This talk explores what becomes possible when we decouple pipeline logic (what happens) from definition (how it’s authored). With the right abstractions, the authoring interface can be anything: Python, declarative YAML, templates, spreadsheets, or even a video game.
We will explore different levels of orchestration abstraction: native Python, declarative config (using the open-source project DAG Factory), templates (using the open-source project Blueprint), and AI-assisted authoring. To prove the power of this decoupled architecture, I will showcase a custom Minecraft plugin that lets you build Airflow Dags by placing blocks in Minecraft (yes, you heard that right).
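One way to picture the split between definition and logic is a declarative description that a small piece of Python turns into an execution order. The sketch below uses only the standard library; the `PIPELINE` schema with `command` and `depends_on` keys is hypothetical and is not the DAG Factory or Blueprint format.

```python
from graphlib import TopologicalSorter

# Hypothetical declarative pipeline: task name -> command and upstream tasks.
# Any authoring front end (YAML, a spreadsheet, a game) could emit this shape.
PIPELINE = {
    "extract":   {"command": "echo extract",   "depends_on": []},
    "transform": {"command": "echo transform", "depends_on": ["extract"]},
    "load":      {"command": "echo load",      "depends_on": ["transform"]},
    "report":    {"command": "echo report",    "depends_on": ["load"]},
}


def execution_order(pipeline):
    """Return task names in an order that respects every depends_on entry."""
    graph = {name: set(spec["depends_on"]) for name, spec in pipeline.items()}
    return list(TopologicalSorter(graph).static_order())
```

The authoring interface only has to produce the dictionary; the code that orders and runs the tasks stays the same regardless of where the definition came from.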
Texas Ballroom 6
RAG pipelines fail silently. Bad retrievals, hallucinated answers, and stale vectors rarely trigger alerts; they quietly degrade your AI product. This session presents a reference DAG architecture for production-grade RAG ingestion built on Airflow 3, with inline quality gates that evaluate retrieval accuracy and LLM faithfulness before a single vector reaches production. We’ll walk through four common RAG failure modes and the specific Airflow pattern that stops each one using RAGAS as the evaluation framework, and Airflow 3’s TaskFlow API, Assets, and DAG Versioning to make pipelines reproducible and event-driven. You’ll leave with reusable quality gate patterns and a concrete architecture you can adapt — because in RAG systems, quality shouldn’t be an afterthought. It should be built into the pipeline from the start.
Texas Ballroom 1-2-3
Orchestrating Cross-Account ML & Data Pipelines with Apache Airflow
As organizations scale data and ML workloads across multiple AWS accounts and Regions, orchestration, not the models themselves, becomes the hardest engineering problem. This session shows how Apache Airflow serves as a centralized orchestration hub for distributed data-processing and machine-learning pipelines that span account and regional boundaries.
We walk through a production-ready architecture where a single Airflow environment coordinates:
- Cross-account DAG patterns: using Airflow connections, IAM role assumption, and custom hooks to trigger AWS Glue, SageMaker, and Lambda in remote accounts
- Cross-Region data flow: leveraging S3 Cross-Region Replication with S3KeySensor operators to gate downstream tasks on data availability
- Custom operators for cross-account ML: extending SageMakerHook and SageMakerTrainingOperator to train models in a separate account while keeping orchestration centralized
- Sensor and operator design: choosing the right sensor modes, timeouts, and poke intervals for long-running training jobs and inference calls
Attendees leave with reusable DAG patterns, operator recipes, and an architecture blueprint for running multi-account, multi-Region data and ML pipelines — all orchestrated through Airflow.
Texas Ballroom 5
Meet airflowctl, the new default for API-driven remote operations. You will see how separating control from execution enhances security, enables isolation, and simplifies automation across different environments. I will discuss the development of airflowctl, demonstrate practical examples of secure remote execution, and provide a guide for transitioning from legacy workflows. You will learn how to easily migrate towards airflowctl and leverage the flexibility of an API-driven approach.
Texas Ballroom 6
Your data platform team didn’t sign up to be a Dag factory. But when Airflow expertise is concentrated in a small group of engineers, that’s exactly what happens. Analysts wait days for simple workflows, engineers burn cycles rebuilding the same patterns, and frustrated teams start building outside the stack entirely. The real fix isn’t a better onboarding guide or a friendlier UI. It’s rethinking the abstraction layer your team exposes to the rest of the business.
In this talk, we’ll introduce astronomer/blueprint, an open-source Python library built around the idea that platform teams should define the building blocks and everyone else should be able to assemble them. We’ll walk through how to design composable, type-safe templates using Pydantic validation and control exactly which parameters downstream users can configure. We’ll also show how we’ve layered a no-code visual interface on top in Astro, Astronomer’s unified orchestration platform for Apache Airflow®, giving non-technical users a path to self-service without sacrificing governance or control. You’ll leave with a concrete framework and open-source tooling you can put to work immediately.
Texas Ballroom 1-2-3
A live text adventure where Airflow is the game engine. Rooms are tasks. Choices are branches. Inventory lives in XComs. Monsters have Deadline Alerts. The audience votes at every fork, and the DAG decides what happens next. It’s silly, it’s live, and every concept maps to a real production pattern.
Texas Ballroom 5
Wednesday, September 2, 2026
No two open source projects have shaped modern data engineering more than Apache Spark and Apache Airflow. But their partnership wasn’t designed; it was earned. From the early days of BashOperator wrapping spark-submit, through the SparkSubmitOperator, Livy, Kubernetes-native execution, and now Airflow 3’s asset-aware scheduling paired with Spark’s Declarative Pipelines, the integration story is a masterclass in how independent communities converge on shared problems without shared governance. This talk traces the full arc: how Spark’s compute model and Airflow’s orchestration model co-evolved, where they fought, where they complemented each other, and what the next chapter looks like as both projects ship their most ambitious releases simultaneously. Along the way, we’ll examine the contribution patterns, the cross-pollination of committers, and why this particular pairing outlasted every managed alternative that tried to replace it. This is not a vendor talk. This is a community talk about what happens when two ecosystems trust each other enough to stay independent.
Texas Ballroom 5
We’re excited to offer Airflow Summit 2026 attendees an exclusive opportunity to earn their DAG Authoring certification in person, now updated to include all the latest Airflow 3 features. This certification workshop comes at no additional cost to summit attendees.
The DAG Authoring for Apache Airflow certification validates your expertise in advanced Airflow concepts and demonstrates your ability to build production-grade data pipelines. It covers the TaskFlow API, dynamic task mapping, templating, asset-driven scheduling, best practices for production DAGs, and new Airflow 3.0 features and optimizations.
The certification session includes:
- 20-minute preparation period with expert guidance
- Live Q&A session with Marc Lamberti from Astronomer
- 60-minute examination period
- Real-time results and immediate feedback
To prepare for the Airflow Certification, visit the Astronomer Academy (https://academy.astronomer.io/page/astronomer-certification).
Hill Country CD
Building on Airflow 3’s new worker structure and the foundation laid by the Go SDK, we take a look at how Airflow can support a fully cross-language Dag-authoring experience.
We will discuss how a new language SDK is built, how a task talks to Airflow, and how multiple languages may be mixed inside a Dag. To support additional languages without logic duplication, a new middle layer is required between Airflow and the task. Additional topics, such as security, distributed workload, and user interface considerations, will also be touched on.
Texas Ballroom 1-2-3
Ready to contribute to Apache Airflow?
In this hands-on workshop, we’ll help you jump straight into the project with real, beginner-friendly issues matched to your skills and interests.
To make the most of our time together, come with a development environment set up in advance — installing Breeze is highly recommended, but GitHub Codespaces is a great alternative if Docker isn’t an option for you.
We’ll walk through the full contribution journey step by step: exploring the codebase, picking an issue, opening your first pull request, and engaging with the community for feedback and reviews. Whether you’re interested in writing code, improving documentation, writing tests, or sharing ideas, there’s a welcoming place for you in the Airflow community.
We’re excited to have you — let’s contribute together!
Hill Country AB
Agor is an open-source platform for orchestrating AI agents: built for teams, not just individuals. It provides a shared, real-time workspace where humans and agents collaborate on a spatial canvas. Multiple agents run in parallel across isolated git worktrees, with full visibility into sessions, conversations, and outputs. Teams can inspect, intervene, and steer work as it happens.
At the core are persistent assistants: long-lived agents with memory and tools that coordinate tasks, spawn sub-agents, and continuously advance workflows.
Agor brings structure to agentic work:
- Sessions for execution with observability
- A Figma-like spatial layout to organize parallel work visually
- Git worktrees for isolation and coordination
- Artifacts for durable outputs
Under the hood, it’s a full orchestration layer with APIs, WebSockets, and an MCP-based tool system that gives agents awareness of shared state and other agents.
For the Airflow audience, Agor acts as a control plane for agent workflows: handling parallelism, state, observability, and handoffs between autonomous units of work.
In this demo-driven talk, I’ll show assistants coordinating agents, teams collaborating live, and workflows progressing with minimal human intervention.
Texas Ballroom 6
Airflow runs the pipelines that matter. When a task breaks, the workflow is fragmented: you copy an error, ask your favorite LLM in another tab, and still end up back in the Grid, scrolling logs. AIP-101 proposes a better way: an opt-in AI assistant, natively integrated into Apache Airflow. Ask about your Dags, runs, and logs, and get grounded answers based on what you can already see. Built with a safety-first mindset, it respects your existing access, keeps sensitive details out of responses, and makes its help transparent. In this initial phase, the assistant explains, not executes. This talk highlights the user experience, the key design decisions, suggested high-level architecture, and what comes next for AI in Airflow.
Texas Ballroom 5
Datadog is a world-class data platform ingesting more than 100 trillion events a day, providing real-time insights. Since our internal adoption of Airflow following the release of 3.0.0, the number of teams relying on our internal Airflow platform has grown organically and quickly.
This internal Airflow adoption came with a number of platform challenges, requiring novel solutions that could support multi-tenancy, scalability, and bespoke runtime environments. In this talk, we will cover how we’ve expanded the functionality of Airflow triggers, via trigger queue assignment, to support multi-tenant deployments, while contributing those solutions upstream to the broader Airflow community. We’ll cover the conceptual design and motivations for trigger queues, and how the trigger queue pattern can benefit both multi-tenant and single-occupant Airflow systems alike.
Texas Ballroom 1-2-3
At last year’s Airflow Summit, we shared how we built a multi-cluster orchestration layer on top of Apache Airflow to run ML workloads across multiple Kubernetes GPU clusters.
Once hundreds of ML engineers started running GPU pipelines in production, we discovered that orchestration alone is not enough. Operating multi-cluster GPU infrastructure introduces new challenges: controlling GPU allocation across teams, observing pipelines across clusters, and helping users run workloads efficiently without wasting expensive GPU resources.
In this talk, we’ll show how our Airflow platform evolved from a workflow orchestrator into an operational control plane for GPU infrastructure. We’ll cover custom scheduling strategies that dynamically route workloads across clusters using Airflow policies and resource awareness, integration with HAMI to improve GPU utilization, and AIOps workflows with KeepHQ that detect underutilized CPU, RAM, and GPU resources. We’ll also present dashboards and AI-assisted tools that reduce time to market and simplify debugging while keeping infrastructure complexity hidden.
We’ll be happy to share how our platform continues to evolve with Apache Airflow.
Texas Ballroom 5
Airflow is Python-first — but production business logic is often Java-first. The Airflow Java SDK bridges that gap by letting you mix Java and Python tasks within the same DAG, without shell wrappers or separate services.
In this talk, we’ll walk through the full lifecycle of a Java task: how the Java SDK is set up, how tasks are defined and packaged as a JAR, how Airflow picks them up and runs them on any executor, and how results flow back into your DAG. We’ll also cover how core Airflow primitives — Variables, Connections, XCom, and logging — work natively in the Java SDK, enabling true cross-language, bidirectional communication within a single pipeline. You’ll see it all running end-to-end in a live demo alongside Python tasks.
If you work with Java services and have ever wished you could orchestrate them without translating everything into Python first, this talk is for you.
Texas Ballroom 1-2-3
Data incidents are often investigated through fragmented Slack threads and manual SQL queries, leaving data owners dependent on engineers. Qbiz introduces a more efficient alternative: the Agentic Incident DAG. This approach uses AI agents to lead investigations while Airflow orchestrates a systematic diagnostic workflow.
When a failure occurs, the system triggers a diagnostic DAG and creates a Data Incident Ticket. An Investigation Thread captures the analysis in real time as specialized agents evaluate potential causes and provide clear summaries for data owners.
The system relies on deterministic diagnosis, using automated hypothesis testing and confidence scoring to identify root causes. Airflow coordinates agents as they query platforms via MCP interfaces and document findings. These steps are then converted into versioned playbooks, building institutional memory and significantly reducing Mean Time to Diagnosis.
Texas Ballroom 6
Migrating from UC4 Automic to Apache Airflow is far from a lift-and-shift exercise. UC4 offers advanced scheduling primitives that data teams rely on daily — and Airflow doesn’t replicate them out of the box.
At eBay, we migrated thousands of business-critical UC4 workflows onto our Airflow 2.10 platform. Rather than forcing teams to change how they operate, we built the missing capabilities natively into Airflow:
- Breakpoints — pause a pipeline at a specific task for inspection without failing the run
- Skip logic — dynamically bypass tasks or task groups at runtime
- Calendar-aware scheduling — replicate UC4’s calendar model as custom Airflow timetables
- Pipeline pause/resume — operator-triggered suspension of in-flight DAG runs with state consistency
We’ll share the engineering trade-offs, architectural constraints we hit, and patterns reusable beyond eBay’s stack.
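As a rough illustration of the calendar-aware scheduling idea above, here is a minimal sketch of the date logic only. It is not eBay's implementation and does not use the Airflow timetable API; the `HOLIDAYS` set and the function names are hypothetical, and in a real deployment this kind of rule would presumably sit inside a custom Airflow timetable.

```python
from datetime import date, timedelta

# Hypothetical holiday calendar; a real one would come from the business calendar.
HOLIDAYS = {date(2026, 1, 1), date(2026, 12, 25)}


def is_run_day(day):
    """A day qualifies if it is Monday to Friday and not a listed holiday."""
    return day.weekday() < 5 and day not in HOLIDAYS


def next_run_day(after):
    """Return the first qualifying day strictly after the given date."""
    candidate = after + timedelta(days=1)
    while not is_run_day(candidate):
        candidate += timedelta(days=1)
    return candidate
```

Keeping the calendar rule in a plain function like this makes it easy to unit test before wiring it into a scheduler.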
Texas Ballroom 1-2-3
As Airflow adoption expands across large enterprises, a core challenge emerges: how to enable multiple teams to design and operate data pipelines without relying heavily on specialized engineering expertise. In this session, we will present a zero-code, metadata-driven Airflow framework built and deployed within a large financial services organization to accelerate pipeline development and onboarding at scale.
This framework allows users to define workflows using simple CSV or Excel inputs, which are automatically converted into YAML configurations and deployed as fully production‑ready Airflow DAGs using standardized templates on Astronomer. By leveraging a remote execution model and reusable DAG patterns, the solution supports orchestration across heterogeneous systems—including data warehouses, ingestion pipelines, and data quality frameworks—while maintaining enterprise‑grade governance, consistency, and observability.
The talk will walk through the high-level architecture, including Excel‑to‑YAML transformation logic, dynamic DAG generation patterns, and controls that enable non‑developer and cross‑functional teams to safely create and manage pipelines with minimal coding. We will also share lessons learned from taking this framework from initial design to enterprise‑wide production rollout, highlighting how it reduced onboarding time, enforced standardization, and scaled orchestration across teams.
Attendees will gain practical insights into implementing low‑code and no‑code orchestration on top of Apache Airflow, along with key architectural considerations for operating Airflow at enterprise scale.
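To make the metadata-driven idea concrete, here is a minimal sketch of turning tabular input into per-pipeline configuration. It is an illustration under stated assumptions, not the framework described in the talk: the column names (`pipeline`, `task`, `depends_on`, `schedule`), the `;` separator, and the function name are all hypothetical, and the YAML serialization and DAG templating steps are left out.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input layout: one row per task, grouped by pipeline name.
SAMPLE_CSV = """pipeline,task,depends_on,schedule
sales_daily,extract,,@daily
sales_daily,transform,extract,@daily
sales_daily,load,transform,@daily
"""


def rows_to_configs(csv_text):
    """Group task rows into one config dict per pipeline."""
    configs = defaultdict(lambda: {"schedule": None, "tasks": []})
    for row in csv.DictReader(io.StringIO(csv_text)):
        cfg = configs[row["pipeline"]]
        cfg["schedule"] = row["schedule"]
        # An empty depends_on cell means the task has no upstream tasks.
        upstream = [d for d in row["depends_on"].split(";") if d]
        cfg["tasks"].append({"id": row["task"], "upstream": upstream})
    return dict(configs)


configs = rows_to_configs(SAMPLE_CSV)
```

Each resulting dict could then be written out as YAML and fed to a standardized DAG template, which is where validation and governance checks would naturally go.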
Texas Ballroom 5
Productionizing ML workflows is complicated; scaling them is harder. At Ramp, we grew from zero to nearly 100 production ML models powering systems like credit risk assessment and sales lead valuation.
This talk covers how Airflow became the backbone of our ML platform, orchestrating ETL jobs, data quality checks, and model runs. We’ll discuss how we evolved it to meet the increasing complexity of our ML systems.
Every ML system consists of feature creation and large-batch inference. We started with a few dbt models and one cloud-hosted notebook, which evolved into thousands of upstream tables and hundreds of AWS batch inference jobs.
We’ll share practical examples of using Airflow to handle increasing complexity. As upstream ETL jobs grew more interdependent, we built custom Airflow sensors to detect Snowflake table changes, cutting prediction latency by hours. As downstream models scaled, we moved from simple daily schedules to dataset-aware scheduling and custom dynamic DAGs, with custom Slack notifications for one-click debugging.
We’ll share implementation patterns, code snippets, and lessons for performant Airflow code.
Texas Ballroom 1-2-3
At Together AI, AI agents have become the primary authors of our production data pipelines, and Airflow is what makes that safe to do. Agents do the building. Airflow gives us the surface to set the rules, see what’s happening, and step in when we need to. The interesting part is what each side has to look like for that to actually work in production. This talk is a field report on that relationship. We’ll walk through how we got from a world where humans wrote SQL by hand and dashboards refreshed nightly to one where agents make hundreds of queries per session, catalog thousands of tables across engines, and ship pipelines in hours instead of weeks. The platform now spans twelve dbt projects across billing, inference, and analytics, all of it agent-authored and all of it running through Airflow.
We’ll cover the conventions that make agent-authored pipelines legible, the guardrails that make them safe, and the runtime patterns that make them recoverable. And we’ll dig into why oversight built for human authors breaks down the moment your author is non-deterministic — and what it takes to rebuild that oversight as something agents and humans can share.
A short live demo closes the loop: a single ticket goes in, a production-ready pipeline comes out, and the platform stays in control on both sides.
Attendees will leave with adoption patterns for agent-authored pipelines, a model for oversight that scales with non-deterministic authors, and the conventions-as-code playbook we use to keep agents productive without losing control.
Texas Ballroom 1-2-3
Building on Airflow 3’s new worker structure and the foundation laid by the Go SDK, we take a look at how Airflow can support a fully cross-language Dag-authoring experience.
We will discuss how a new language SDK is built, how a task talks to Airflow, and how multiple languages may be mixed inside a Dag. To support additional languages without logic duplication, a new middle layer is required between Airflow and the task. Additional topics, such as security, distributed workload, and user interface considerations, will also be touched on.
Texas Ballroom 1-2-3
No two open source projects have shaped modern data engineering more than Apache Spark and Apache Airflow. But their partnership wasn’t designed; it was earned. From the early days of BashOperator wrapping spark-submit, through the SparkSubmitOperator, Livy, Kubernetes-native execution, and now Airflow 3’s asset-aware scheduling paired with Spark’s Declarative Pipelines, the integration story is a masterclass in how independent communities converge on shared problems without shared governance. This talk traces the full arc: how Spark’s compute model and Airflow’s orchestration model co-evolved, where they fought, where they complemented each other, and what the next chapter looks like as both projects ship their most ambitious releases simultaneously. Along the way, we’ll examine the contribution patterns, the cross-pollination of committers, and why this particular pairing outlasted every managed alternative that tried to replace it. This is not a vendor talk. This is a community talk about what happens when two ecosystems trust each other enough to stay independent.
Texas Ballroom 5
We’re excited to offer Airflow Summit 2026 attendees an exclusive opportunity to earn their DAG Authoring certification in person, now updated to include all the latest Airflow 3 features. This certification workshop comes at no additional cost to summit attendees.
The DAG Authoring for Apache Airflow certification validates your expertise in advanced Airflow concepts and demonstrates your ability to build production-grade data pipelines. It covers the TaskFlow API, dynamic task mapping, templating, asset-driven scheduling, best practices for production DAGs, and new Airflow 3.0 features and optimizations.
The certification session includes:
- 20-minute preparation period with expert guidance
- Live Q&A session with Marc Lamberti from Astronomer
- 60-minute examination period
- Real-time results and immediate feedback
To prepare for the Airflow Certification, visit the Astronomer Academy (https://academy.astronomer.io/page/astronomer-certification).
Hill Country CD
Ready to contribute to Apache Airflow?
In this hands-on workshop, we’ll help you jump straight into the project with real, beginner-friendly issues matched to your skills and interests.
To make the most of our time together, come with a development environment set up in advance — installing Breeze is highly recommended, but GitHub Codespaces is a great alternative if Docker isn’t an option for you.
We’ll walk through the full contribution journey step by step: exploring the codebase, picking an issue, opening your first pull request, and engaging with the community for feedback and reviews. Whether you’re interested in writing code, improving documentation, writing tests, or sharing ideas, there’s a welcoming place for you in the Airflow community.
We’re excited to have you — let’s contribute together!
Hill Country AB
Datadog is a world-class data platform ingesting more than 100 trillion events a day and providing real-time insights. Since our internal adoption of Airflow following the release of 3.0.0, the number of teams relying on our internal Airflow platform has grown organically and quickly.
This internal Airflow adoption came with a number of platform challenges, requiring novel solutions which could support multi-tenancy, scalability, and bespoke runtime environments. In this talk, we will cover how we’ve expanded the functionality of Airflow triggers – via trigger queue assignment – to support multi-tenancy deployments, while contributing those solutions upstream with the broader Airflow community. We’ll cover the conceptual design and motivations for Trigger queues, and how the trigger queue pattern can benefit both multi-tenant and single-occupant Airflow systems alike.
Texas Ballroom 1-2-3
Airflow runs the pipelines that matter. When a task breaks, the workflow is fragmented: you copy an error, ask your favorite LLM in another tab, and still end up back in the Grid, scrolling logs. AIP-101 proposes a better way: an opt-in AI assistant, natively integrated into Apache Airflow. Ask about your Dags, runs, and logs, and get grounded answers based on what you can already see. Built with a safety-first mindset, it respects your existing access, keeps sensitive details out of responses, and makes its help transparent. In this initial phase, the assistant explains, not executes. This talk highlights the user experience, the key design decisions, suggested high-level architecture, and what comes next for AI in Airflow.
Texas Ballroom 5
Agor is an open-source platform for orchestrating AI agents: built for teams, not just individuals. It provides a shared, real-time workspace where humans and agents collaborate on a spatial canvas. Multiple agents run in parallel across isolated git worktrees, with full visibility into sessions, conversations, and outputs. Teams can inspect, intervene, and steer work as it happens.
At the core are persistent assistants: long-lived agents with memory and tools that coordinate tasks, spawn sub-agents, and continuously advance workflows.
Agor brings structure to agentic work:
- Sessions for execution with observability
- A Figma-like spatial layout to organize parallel work visually
- Git worktrees for isolation and coordination
- Artifacts for durable outputs
Under the hood, it’s a full orchestration layer with APIs, WebSockets, and an MCP-based tool system that gives agents awareness of shared state and other agents.
For the Airflow audience, Agor acts as a control plane for agent workflows: handling parallelism, state, observability, and handoffs between autonomous units of work.
In this demo-driven talk, I’ll show assistants coordinating agents, teams collaborating live, and workflows progressing with minimal human intervention.
Texas Ballroom 6
As Airflow adoption expands across large enterprises, a core challenge emerges: how to enable multiple teams to design and operate data pipelines without relying heavily on specialized engineering expertise. In this session, we will present a zero-code, metadata-driven Airflow framework built and deployed within a large financial services organization to accelerate pipeline development and onboarding at scale.
This framework allows users to define workflows using simple CSV or Excel inputs, which are automatically converted into YAML configurations and deployed as fully production‑ready Airflow DAGs using standardized templates on Astronomer. By leveraging a remote execution model and reusable DAG patterns, the solution supports orchestration across heterogeneous systems—including data warehouses, ingestion pipelines, and data quality frameworks—while maintaining enterprise‑grade governance, consistency, and observability.
The talk will walk through the high-level architecture, including Excel‑to‑YAML transformation logic, dynamic DAG generation patterns, and controls that enable non‑developer and cross‑functional teams to safely create and manage pipelines with minimal coding. We will also share lessons learned from taking this framework from initial design to enterprise‑wide production rollout, highlighting how it reduced onboarding time, enforced standardization, and scaled orchestration across teams.
Attendees will gain practical insights into implementing low‑code and no‑code orchestration on top of Apache Airflow, along with key architectural considerations for operating Airflow at enterprise scale.
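To make the metadata-driven idea concrete, here is a minimal sketch of the first step: turning spreadsheet-style rows into a validated pipeline config. The column names (`task_id`, `template`, `upstream`), the sample rows, and the `csv_to_pipeline_config` function are assumptions made for illustration, not the framework presented in the talk. The sketch stops at a plain dictionary; serializing it to YAML (for example with a library such as PyYAML) and rendering DAGs from templates would be separate steps.

```python
import csv
import io
from graphlib import TopologicalSorter

# Hypothetical input layout: one row per task, with an optional
# semicolon-separated list of upstream task ids.
SAMPLE_CSV = """task_id,template,upstream
extract_orders,ingest,
check_orders,data_quality,extract_orders
load_warehouse,warehouse_load,check_orders
"""


def csv_to_pipeline_config(text: str) -> dict:
    """Parse CSV rows into a nested config and validate dependencies."""
    tasks = {}
    for row in csv.DictReader(io.StringIO(text)):
        upstream = [u.strip() for u in row["upstream"].split(";") if u.strip()]
        tasks[row["task_id"]] = {"template": row["template"], "upstream": upstream}

    # Reject references to tasks that are not defined in the sheet.
    for task_id, spec in tasks.items():
        for dep in spec["upstream"]:
            if dep not in tasks:
                raise ValueError(f"{task_id} depends on unknown task {dep}")

    # static_order() raises graphlib.CycleError if the dependencies loop.
    graph = {task_id: spec["upstream"] for task_id, spec in tasks.items()}
    order = list(TopologicalSorter(graph).static_order())
    return {"tasks": tasks, "execution_order": order}


if __name__ == "__main__":
    print(csv_to_pipeline_config(SAMPLE_CSV)["execution_order"])
```

Validating dependencies at conversion time means a malformed sheet fails before any DAG file is generated, which is one way to give non-developers a safe authoring path.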
Texas Ballroom 5
Data incidents are often investigated through fragmented Slack threads and manual SQL queries, leaving data owners dependent on engineers. Qbiz introduces a more efficient alternative: the Agentic Incident DAG. This approach uses AI agents to lead investigations while Airflow orchestrates a systematic diagnostic workflow.
When a failure occurs, the system triggers a diagnostic DAG and creates a Data Incident Ticket. An Investigation Thread captures the analysis in real time as specialized agents evaluate potential causes and provide clear summaries for data owners.
The system relies on deterministic diagnosis, using automated hypothesis testing and confidence scoring to identify root causes. Airflow coordinates agents as they query platforms via MCP interfaces and document findings. These steps are then converted into versioned playbooks, building institutional memory and significantly reducing Mean Time to Diagnosis.
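One simple way to picture hypothesis testing with confidence scoring is sketched below. This is an assumption-laden toy, not the Qbiz system: the `Hypothesis` class, the `score` and `rank` functions, the scoring rule (fraction of checks that pass), and the incident context fields are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    name: str
    checks: list[Callable[[dict], bool]]  # each check inspects the incident context


def score(hypothesis: Hypothesis, context: dict) -> float:
    """Confidence = fraction of the hypothesis's checks that pass."""
    if not hypothesis.checks:
        return 0.0
    passed = sum(1 for check in hypothesis.checks if check(context))
    return passed / len(hypothesis.checks)


def rank(hypotheses: list[Hypothesis], context: dict) -> list[tuple[str, float]]:
    """Return (name, confidence) pairs, most likely cause first."""
    scored = [(h.name, score(h, context)) for h in hypotheses]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical incident context, e.g. gathered by earlier diagnostic tasks.
context = {"row_count": 0, "upstream_state": "failed", "schema_changed": False}

hypotheses = [
    Hypothesis("upstream failure", [
        lambda c: c["upstream_state"] == "failed",
        lambda c: c["row_count"] == 0,
    ]),
    Hypothesis("schema drift", [
        lambda c: c["schema_changed"],
        lambda c: c["row_count"] > 0,
    ]),
]

if __name__ == "__main__":
    print(rank(hypotheses, context))
```

Because each check is a deterministic function of the collected context, the same incident data always yields the same ranking, which is what makes the resulting steps reusable as a playbook.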
Texas Ballroom 6