Welcome to the session program for Airflow Summit 2024.

If you prefer, you can also view this in the Sessionize layout or as a list of sessions.

Thursday, September 12, 2024

09:00
09:30
10:10 (Morning break)
10:30
11:00
11:30
12:00
12:30
13:15 (Lunch)
14:00
14:35
15:10
15:45
16:20 (Lightning talks & Wrap up)
09:00 - 09:25.
By Vikram Koka
Track: Keynote
Room: Grand Ballroom
The road ahead: What’s coming in Airflow 3 and beyond?

Apache Airflow has emerged as the de facto standard for data orchestration. Over the last couple of years, Airflow has also seen increasing adoption for ML and AI use cases. It has been almost four years since the release of Airflow 2, and as a community we have agreed that it’s time for a major foundational release in the form of Airflow 3.

This talk will introduce the vision behind Airflow 3, including the emerging technology trends in the industry and how Airflow will evolve in response. Specifically, this will include an overview of the architectural changes in Airflow to support emerging use cases and distributed data infrastructure models. This talk will also introduce the major features and the desired outcomes of the release. Airflow 3 will be a foundational release, so this talk will similarly introduce the new concepts arriving as part of Airflow 3, some of which may only be fully realized in follow-on 3.x releases.

The goal of this talk is to raise awareness about Airflow 3 and to get feedback from the Airflow community while the release is still in the development phase.

09:25 - 10:10.
By Madison Swain-Bowden, Kaxil Naik, Michał Modras, Constance Martineau & Shubham Mehta
Track: Keynote
Room: Grand Ballroom
Airflow 3 - Roadmap Discussion

Join us for this panel with key members of the community behind the development of Apache Airflow, where we will discuss the tentative scope of the next generation, Airflow 3.

10:30 - 10:55.
By Rafal Biegacz & Filip Knapik
Track: Sponsored
Room: Georgian
Airflow - Path to Industry Orchestration Standard

In data engineering, machine learning pipelines, and cloud and web services, there is huge demand for orchestration technologies.

Apache Airflow is among the most popular orchestration technologies, if not the most popular one.

In this presentation we will focus on the aspects of Airflow that make it so popular, and ask whether it has become the industry standard for orchestration.

10:30 - 11:15.
By Briana Okyere, Amogh Desai, Ryan Hatter & Srabasti Banerjee
Track: Community
Room: California West
Connecting the Dots in Airflow: From User to Contributor

“Connecting the Dots in Airflow: From User to Contributor” explores the journey of transitioning from an Airflow user to an active project contributor. This talk will cover essential steps, resources, and best practices to effectively engage with the Airflow community and make meaningful contributions. Attendees will gain insights into the collaborative nature of open-source projects and how their involvement can drive both personal growth and project innovation.

10:30 - 10:55.
By Bhavesh Jaisinghani
Track: Use cases
Room: Elizabethan A+B
Scale and Security: How Autodesk Securely Develops and Tests PII Pipelines with Airflow

In today’s data-driven era, ensuring data reliability and enhancing our testing and development capabilities are paramount. Local unit testing has its merits but falls short when dealing with the volume of big data. One major challenge is running Spark jobs pre-deployment to ensure they produce expected results and handle production-level data volumes.

In this talk, we will discuss how Autodesk leveraged Astronomer to improve pipeline development. We’ll explore how it addresses challenges with sensitive and large data sets that cannot be transferred to local machines or non-production environments. Additionally, we’ll cover how this approach supports over 10 engineers working simultaneously on different feature branches within the same repo.

We will highlight the benefits, such as conflict-free development and testing, and eliminating concerns about data corruption when running DAGs on production Airflow servers.

Join me to discover how solutions like Astronomer empower developers to work with increased efficiency and reliability. This talk is perfect for those interested in big data, cloud solutions, and innovative development practices.

10:30 - 10:55.
By Anant Agarwal
Track: Use cases
Room: California East
Scaling Airflow for Data Productivity at Instacart

In this talk, we’ll discuss how Instacart leverages Apache Airflow to orchestrate a vast network of data pipelines, powering both our core infrastructure and dbt deployments. As a data-driven company, we rely on Airflow to execute large and intricate pipelines securely, compliantly, and at scale.

We’ll delve into the following key areas:

a. High-Throughput Cluster Management: We’ll explore how we manage and maintain our Airflow cluster, ensuring the efficient execution of over 2,000 DAGs across diverse use cases.

b. Centralized Airflow Vision: We’ll outline our plans for establishing a company-wide, centralized Airflow cluster, consolidating all Airflow instances at Instacart.

c. Custom Airflow Tooling: We’ll showcase the custom tooling we’ve developed to manage YML-based DAGs, execute DAGs on external ECS workers, leverage Terraform for cluster deployment, and implement robust cluster monitoring at scale.

By sharing our extensive experience with Airflow, we aim to contribute valuable insights to the Airflow community.
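Instacart’s YML-based DAG tooling is internal and not publicly documented; as a rough, hedged sketch of the general pattern such tooling follows, the snippet below turns a declarative spec (shown here already parsed from YAML into a dict, with invented field names) into a task dependency graph. In a real implementation each entry would become an Airflow operator:

```python
# Hypothetical sketch of YAML-driven DAG generation. Field names
# ("tasks", "upstream", etc.) are invented for illustration; the real
# Instacart tooling is internal.
spec = {
    "dag_id": "example_etl",
    "tasks": [
        {"id": "extract", "command": "run_extract.sh"},
        {"id": "transform", "command": "run_transform.sh", "upstream": ["extract"]},
        {"id": "load", "command": "run_load.sh", "upstream": ["transform"]},
    ],
}

def build_edges(spec):
    """Return (task_ids, edges) where edges are (upstream, downstream) pairs."""
    ids = [t["id"] for t in spec["tasks"]]
    edges = [(up, t["id"]) for t in spec["tasks"] for up in t.get("upstream", [])]
    return ids, edges

ids, edges = build_edges(spec)
# ids   -> ['extract', 'transform', 'load']
# edges -> [('extract', 'transform'), ('transform', 'load')]
```

In a production generator, each task entry would be instantiated as an operator inside a `DAG` context and the edges wired with `>>`.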

11:00 - 11:25.
By Udit Saxena
Track: Use cases
Room: Georgian
Airflow, Spark, and LLMs: Turbocharging MLOps at ASAPP

This talk will explore ASAPP’s use of Apache Airflow to streamline and optimize our machine learning operations (MLOps). Key highlights include:

  • Integrating with our custom Spark solution to achieve speedup, efficiency, and cost gains for generative AI transcription, summarization, and intent categorization pipelines

  • Different design patterns for integrating with efficient LLM servers such as TGI, vLLM, and TensorRT for summarization pipelines, with and without Spark

  • An overview of batched LLM inference using Airflow, as opposed to real-time inference outside of it

  • [Tentative] Possible extension of this scaffolding to Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) for fine-tuning LLMs, using Airflow as the orchestrator

Additionally, the talk will cover ASAPP’s MLOps journey with Airflow over the past few years, including an overview of our cloud infrastructure, various data backends, and sources.

The primary focus will be on the machine learning workflows at ASAPP, rather than the data workflows, providing a detailed look at how Airflow enhances our MLOps processes.
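ASAPP’s actual pipelines are not public; as a neutral illustration of the batched-inference pattern the talk contrasts with real-time serving, the sketch below groups requests into fixed-size batches and processes each batch in one call. `run_model` is a stand-in for a real inference call (to TGI, vLLM, TensorRT, or similar):

```python
# Illustrative sketch of batched (offline) LLM inference orchestrated as a
# task, as opposed to real-time serving. All names here are invented.
def batches(items, size):
    """Group items into fixed-size batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def run_model(batch):
    # Placeholder: a real task would call an LLM server or local model here.
    return [f"summary-of-{text}" for text in batch]

transcripts = [f"call-{i}" for i in range(5)]
results = [out for b in batches(transcripts, size=2) for out in run_model(b)]
# 5 inputs -> 5 outputs, processed as batches of 2, 2, and 1.
```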

11:00 - 11:25.
By Bartosz Jankiewicz
Track: New features
Room: Elizabethan A+B
Empowering Airflow Users: A framework for performance testing and transparent resource optimization

Apache Airflow is the backbone of countless data pipelines, but optimizing performance and resource utilization can be a challenge. This talk introduces a novel performance testing framework designed to measure, monitor, and improve the efficiency of Airflow deployments. I’ll delve into the framework’s modular architecture, showcasing how it can be tailored to various Airflow setups (Docker, Kubernetes, cloud providers). By measuring key metrics across schedulers, workers, triggers, and databases, this framework provides actionable insights to identify bottlenecks and compare performance across different versions or configurations.

Attendees will learn:

The motivation behind developing a standardized performance testing approach.

Key design considerations and challenges in measuring performance across diverse Airflow environments.

How to leverage the framework to construct test suites for different use cases (e.g., version comparison).

Practical tips for interpreting performance test results and making informed decisions about resource allocation.

How this framework contributes to greater transparency in Airflow release notes, empowering users with performance data.
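The framework itself is the subject of the talk rather than a published API; as a generic sketch of the kind of measurement it automates, the snippet below times a workload over several runs and reports summary statistics, so two versions or configurations can be compared on the same numbers:

```python
import statistics
import time

# Generic benchmarking sketch (not the framework's actual API): run a
# workload repeatedly and summarize the timings for comparison.
def benchmark(workload, runs=5):
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - t0)
    return {"median": statistics.median(samples), "max": max(samples)}

result = benchmark(lambda: sum(range(10_000)), runs=3)
```

A real Airflow performance suite would additionally collect scheduler, worker, triggerer, and database metrics, but the measure-summarize-compare loop is the same.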

11:00 - 11:25.
By Jet Mariscal
Track: Use cases
Room: California East
Unlocking the Power of Airflow Beyond Data Engineering at Cloudflare

While Airflow is widely known for orchestrating and managing workflows, particularly in the context of data engineering, data science, ML (Machine Learning), and ETL (Extract, Transform, Load) processes, its flexibility and extensibility make it a highly versatile tool suitable for a variety of use cases beyond these domains. In fact, Cloudflare has publicly shared in the past an example on how Airflow was leveraged to build a system that automates datacenter expansions.

In this talk, I will share a few more of our use cases beyond traditional data engineering, demonstrating Airflow’s sophisticated capabilities for orchestrating a wide variety of complex workflows, and discussing how Airflow played a crucial role in building some of the highly successful autonomous systems at Cloudflare, from handling automated bare metal server diagnostics and recovery at scale, to Zero Touch Provisioning that is helping us accelerate the roll out of inference-optimized GPUs in 150+ cities in multiple countries globally.

11:30 - 11:55.
By Konrad Schieban
Track: Community
Room: California West
DAGify - Enterprise Scheduler Migration Accelerator for Airflow

DAGify is a highly extensible, template-driven enterprise scheduler migration accelerator that helps organizations speed up their migration to Apache Airflow. While DAGify does not claim to migrate 100% of existing scheduler functionality, it aims to heavily reduce the manual effort it takes for developers to convert their enterprise scheduler formats into Python-native Airflow DAGs.

DAGify is an open source tool under the Apache 2.0 license and is available on GitHub (https://github.com/GoogleCloudPlatform/dagify).

In this session we will introduce DAGify and its use cases, and demo its functionality by converting Control-M XML files to Airflow DAGs.

Additionally, we will highlight DAGify’s “no-code” extensibility by creating custom conversion templates that map Control-M functionality to Airflow operators.

11:30 - 12:15.
By Shobhit Shah & Sumit Maheshwari
Track: Use cases
Room: Georgian
Evolution of Airflow at Uber

Up until a few years ago, teams at Uber used multiple data workflow systems, with some based on open source projects such as Apache Oozie, Apache Airflow, and Jenkins while others were custom built solutions written in Python and Clojure.

Every user who needed to move data around had to learn about and choose from these systems, depending on the specific task they needed to accomplish. Each system required additional maintenance and operational burdens to keep it running, troubleshoot issues, fix bugs, and educate users.

After evaluating these systems, with the goal of converging on a single workflow system capable of supporting Uber’s scale, we settled on an Airflow-based system. The Airflow-based DSL provided the best trade-off of flexibility, expressiveness, and ease of use while being accessible to our broad range of users, which includes data scientists, developers, machine learning experts, and operations employees.

This talk will focus on scaling Airflow to Uber’s scale and providing a no-code, seamless user experience.

11:30 - 12:15.
By Kaxil Naik & Ash Berlin-Taylor
Track: New features
Room: Elizabethan A+B
Gen AI using Airflow 3: A vision for Airflow RAGs

Gen AI has taken the computing world by storm. As Enterprises and Startups have started to experiment with LLM applications, it has become clear that providing the right context to these LLM applications is critical.

This process, known as retrieval-augmented generation (RAG), relies on adding custom data to the large language model so that the efficacy of the response can be improved. Processing custom data and integrating with Enterprise applications is a strength of Apache Airflow.

This talk goes into details about a vision to enhance Apache Airflow to more intuitively support RAG, with additional capabilities and patterns. Specifically, these include the following:

  • Support for unstructured data sources such as Text, but also extending to Image, Audio, Video, and Custom sensor data

  • LLM model invocation, including both external model services through APIs and local models using container invocation.

  • Automatic Index Refreshing with a focus on unstructured data lifecycle management to avoid cumbersome and expensive index creation on Vector databases

  • Templates for hallucination reduction via testing and scoping strategies
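These patterns are a vision for Airflow 3 rather than shipped APIs. As a neutral illustration of the ingestion step such a pipeline would orchestrate before index refreshing, here is a minimal text-chunking function of the kind a RAG indexing task typically runs (chunk sizes and overlap are arbitrary):

```python
# Illustrative RAG ingestion step (not an Airflow API): split a document
# into overlapping chunks, which would then be embedded and written to a
# vector database by downstream tasks.
def chunk_text(text, size=200, overlap=50):
    """Split text into overlapping chunks for embedding/indexing."""
    step = size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks

chunks = chunk_text("x" * 500)
# -> 3 overlapping chunks of 200 characters each
```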

11:30 - 11:55.
By Madhav Khakhar, Alexander Shmidt & Mayank Chopra
Track: Use cases
Room: California East
How we use Airflow at Booking to Orchestrate Big Data Workflows

The talk will cover how we use Airflow at the heart of our Workflow Management Platform (WFM) at Booking.com, enabling our internal users to orchestrate big data workflows on the Booking Data Exchange (BDX).

High level overview of the talk:

  • Adapting the open source Airflow Helm chart to spin up an Airflow installation in Booking Kubernetes Service (BKS)
  • Coming up with a workflow definition format (YAML)
  • Conversion of workflow.yaml to workflow.py DAGs
  • Usage of deferrable operators to provide standard step templates to users
  • Workspaces (collections of workflows), used to ensure role-based access to DAG permissions for users
  • Using Okta for authentication
  • Alerting, monitoring, logging
  • Plans to shift to Astronomer
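Deferrable operators, mentioned above, free up worker slots by handing long waits to an async triggerer process. Conceptually, the trigger is just an async poll loop that yields control while waiting, so one process can watch many waits concurrently. The sketch below shows that idea in plain asyncio; it is an illustration of the mechanism, not Airflow’s actual trigger API:

```python
import asyncio

# Conceptual sketch of what a deferrable operator's trigger does: poll a
# condition inside an async loop, yielding control between checks so one
# triggerer can supervise many waits at once.
async def wait_for_condition(check, interval=0.01, timeout=1.0):
    waited = 0.0
    while waited < timeout:
        if check():
            return True
        await asyncio.sleep(interval)
        waited += interval
    return False

state = {"done": False}

async def main():
    async def external_event():
        # Simulate an external system finishing after a short delay.
        await asyncio.sleep(0.03)
        state["done"] = True

    task = asyncio.create_task(external_event())
    ok = await wait_for_condition(lambda: state["done"])
    await task
    return ok

result = asyncio.run(main())
```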
12:00 - 12:25.
By Jianlong Zhong
Track: Use cases
Room: California East
Airflow Unleashed: Making Hundreds of Deployments A Day at Coinbase

At Coinbase, Airflow is the backbone of ELT, supported by a vibrant community of over 500 developers. This vast engagement results in a continuous stream of enhancements, with hundreds of commits tested and released daily. However, this scale of development presents its own set of challenges, especially in deployment velocity.

Traditional deployment methodologies proved inadequate, significantly impeding the productivity of our developers. Recognizing the critical need for a solution that matches our pace of innovation, we developed AirAgent: a bespoke, fully autonomous deployer designed specifically for Airflow. Capable of deploying updates hundreds of times a day on both staging and production environments, AirAgent has transformed our development lifecycle, enabling immediate iteration and drastically improving developer velocity.

This talk aims to unveil the inner workings of AirAgent, highlighting its design principles, deployment strategies, and the challenges we overcame in its implementation. By sharing our journey, we hope to offer insights and strategies that can benefit others in the Airflow community, encouraging a shift towards a high-frequency deployment workflow.

12:00 - 12:25.
By Michael Robinson
Track: Community
Room: California West
Lessons from the Ecosystem: What can Airflow Learn from Other Open-source Communities?

The Apache Airflow community is so large and active that it’s tempting to take the view that “if it ain’t broke don’t fix it.”

In a community as in a codebase, however, improvement and attention are essential to sustaining growth. And bugs are just as inevitable in community management as they are in software development. If only the fixes were, too!

Airflow is large and growing because users love Airflow and our community. But what steps could be taken to enhance the typical user’s and developer’s experience of the community?

This talk will provide an overview of potential learnings for Airflow community management efforts, such as project governance and analytics, derived from the speaker’s experience managing the OpenLineage and Marquez open-source communities.

The talk will answer questions such as: What can we learn from other open-source communities when it comes to supporting users and developers and learning from them? For example, what options exist for getting historical data out of Slack despite the limitations of the free tier? What tools can be used to make adoption metrics more reliable? What are some effective supplements to asynchronous governance?

12:30 - 13:15.
By Cyrus Dukart, David Sacerdote & Jason Bridgemohansingh
Track: Use cases
Room: California East
Adaptive Memory Scaling for Robust Airflow Pipelines

At Vibrant Planet, we’re on a mission to make the world’s communities and ecosystems more resilient in the face of climate change. Our cloud-based platform is designed for collaborative scenario planning to tackle wildfires, climate threats, and ecosystem restoration on a massive scale.

In this talk we will dive into how we are using Airflow. Particularly we will focus on how we’re making Airflow pipelines smarter and more resilient, especially when dealing with the task of processing large satellite imagery and other geospatial data.

  1. Self-Healing Pipelines:

We’ll discuss our self-healing pipelines, which identify likely out-of-memory events and incrementally allocate more memory for task instance retries, ensuring robust and uninterrupted workflow execution.

  2. Initial Memory Recommendations:

We’ll discuss how we set intelligent initial memory allocations for each task instance, enhancing resource efficiency from the outset.
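Vibrant Planet’s implementation is not public; a minimal sketch of the retry-scaling idea might look like the function below, where each retry attempt requests more memory than the last, up to a cap. In Airflow the returned value would typically feed a Kubernetes resource request via `executor_config` (the factor and cap here are arbitrary):

```python
# Illustrative sketch (not Vibrant Planet's code): scale a task's memory
# request with each retry so a likely out-of-memory failure gets a larger
# allocation on the next attempt.
def memory_for_attempt(base_mb, try_number, factor=1.5, cap_mb=32768):
    """Memory request in MB for a given attempt (try_number starts at 1)."""
    mem = base_mb * (factor ** (try_number - 1))
    return int(min(mem, cap_mb))

# First attempt uses the baseline; each retry gets 1.5x more, capped.
# memory_for_attempt(2048, 1) -> 2048
# memory_for_attempt(2048, 2) -> 3072
# memory_for_attempt(2048, 3) -> 4608
```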

12:30 - 13:15.
By Julian LaNeve & Jason Ma
Track: Sponsored
Room: Georgian
Building Reliable Data Products

Data engineers have shifted from delivering data for internal analytics applications to customer-facing data products. And with that shift comes a whole new level of operational rigor necessary to instill trust and confidence in the data. How do you hold data pipelines to the same standards as traditional software applications? Can you apply principles learned from the field of SRE to the world of data?

In this talk, we’ll explore how we’ve seen this evolve in Astronomer’s customer base and highlight best practices learned from the most critical data product applications we’ve seen. We’ll hear from Astronomer’s own data team as they went through the transformation from analytics to data products. And we’ll showcase a new product we’re building to help data teams around the world solve exactly this problem!

12:30 - 13:15.
By Constance Martineau & Tzu-ping Chung
Track: New features
Room: Elizabethan A+B
Seeing Clearly with Airflow: The Shift to Data-Aware Orchestration

As Apache Airflow evolves, a key shift is emerging: the move from task-centric to data-aware orchestration. Traditionally, Airflow has focused on managing tasks efficiently, with limited visibility into the data those tasks manipulate. However, the rise of data-centric workflows demands a new approach—one that puts data at the forefront.

This talk will explore how embedding deeper data insights into Airflow can align with modern users’ needs, reducing complexity and enhancing workflow efficiency. We’ll discuss how this evolution can transform Airflow into a more intuitive and powerful tool, better suited to today’s data-driven environments.

12:30 - 13:15.
By Jarek Potiuk
Track: Community
Room: California West
The Silent Symphony: Keeping Airflow's CI/CD and Dev Tools in Tune

Apache Airflow relies on a silent symphony behind the scenes: its CI/CD (Continuous Integration/Continuous Delivery) and development tooling. This presentation explores the critical role these tools play in keeping Airflow efficient and innovative. We’ll delve into how robust CI/CD ensures bug fixes and improvements are seamlessly integrated, while well-maintained development tools empower developers to contribute effectively.

Airflow’s power comes from a well-oiled machine – its CI/CD and development tools. This presentation dives into the world of these often-overlooked heroes. We’ll explore how seamless CI/CD pipelines catch and fix issues early, while robust development tools empower efficient coding and collaboration. Discover how you can use and contribute to a thriving Airflow ecosystem by ensuring these crucial tools stay in top shape.

14:00 - 14:55.
By Jed Cunningham
Track: New features
Room: Elizabethan A+B
AIP-63: DAG Versioning - Where are we?

Join us as we check in on the current status of AIP-63: DAG Versioning. This session will explore the motivations behind AIP-63, the challenges faced by Airflow users in understanding and managing DAG history, and how it aims to address them. From tracking TaskInstance history to improving DAG representation in the UI, we’ll examine what we’ve already done and what’s next. We’ll also touch upon the potential future steps outlined in AIP-66 regarding the execution of specific DAG versions.

14:00 - 14:25.
By Daniil Dubin
Track: Use cases
Room: California East
09/12/2024 2:00 PM 09/12/2024 2:25 PM America/Los_Angeles AS24: Empowering Business Analysts with DAG Authoring IDE Running 8000 Workflows

At Wix, more often than not, business analysts build workflows themselves to avoid data engineers becoming a bottleneck. But how do you enable them to create SQL ETLs that start when dependencies are ready and send emails or refresh Tableau reports when the work is done? One simple answer may be to use Airflow. The problem is that not every BA can be expected to know Python and Git well enough to create thousands of DAGs easily.

To bridge this gap we have built a web-based IDE, called Quix, that allows simple notebook-like development of Trino SQL workflows and converts them to Airflow DAGs when a user hits the “schedule” button.

During the talk we will go through the problems of building a reliable and extendable DAG-generating tool, why we preferred Airflow over Apache Oozie, and the tricks (sharding, HA mode, etc.) that allow Airflow to run 8,000 active DAGs on a single Kubernetes cluster.
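The core of a notebook-to-DAG tool like the one described can be imagined, in very simplified form, as a function that turns an ordered list of SQL cells into a task graph. The function and field names below are illustrative, not Quix's actual API; this is a minimal sketch of the general pattern, left as plain Python so it runs without Airflow installed.

```python
# Hypothetical sketch of a notebook-to-DAG converter in the spirit of the
# tool described in the talk (names are illustrative, not Quix's API).

def notebook_to_dag_spec(dag_id, cells):
    """Turn an ordered list of SQL cells into a linear task-graph spec.

    Each cell is a (task_name, sql) pair; each task depends on the
    previous one, mirroring top-to-bottom notebook execution. In a real
    system, each spec entry would become an Airflow task when the user
    hits "schedule".
    """
    tasks = {}
    prev = None
    for name, sql in cells:
        tasks[name] = {"sql": sql, "upstream": [prev] if prev else []}
        prev = name
    return {"dag_id": dag_id, "tasks": tasks}


spec = notebook_to_dag_spec(
    "daily_revenue",
    [
        ("stage_orders", "SELECT * FROM raw.orders"),
        ("aggregate", "SELECT day, sum(total) FROM stage GROUP BY day"),
    ],
)
print(spec["tasks"]["aggregate"]["upstream"])  # ['stage_orders']
```

A real implementation would also handle fan-in/fan-out between cells and attach the post-ETL actions (emails, Tableau refreshes) as terminal tasks.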

14:00 - 16:40.
Track: Workshops
Room: Elizabethan C, Elizabethan D, Borgia
09/12/2024 2:00 PM 09/12/2024 4:40 PM America/Los_Angeles AS24: Featured Workshops

We will be offering hands-on workshops so you can get practical experience with Airflow tools and managed offerings.

  • Format & duration: Workshops are instructor led, 2-3 hours long, bring your own device.
  • Only available for participants with a Conference + Workshop pass.
  • Workshops have limited capacity. You can sign up in advance for 2 workshops (one per day) to get a confirmed spot.
  • Workshops will accept walk-ins (people who didn’t sign up in advance), but walk-in spots are limited and not guaranteed.
Elizabethan C, Elizabethan D, Borgia
We will have 3 simultaneous hands-on workshops. Follow the link for details and to sign up.
14:00 - 14:25.
By Pete Dejoy
Track: Community
Room: California West
09/12/2024 2:00 PM 09/12/2024 2:25 PM America/Los_Angeles AS24: How the Airflow Community Productionizes Generative AI

Every data team out there is being asked by their business stakeholders about Generative AI. Taking LLM-centric workloads to production is not a trivial task. At the foundational level, there are a set of challenges around data delivery, data quality, and data ingestion that mirror traditional data engineering problems. Once you’re past those, there’s a set of challenges related to the underlying use case you’re trying to solve. Thankfully, because of how Airflow was already being used at these companies for data engineering and MLOps use cases, it has become the de facto orchestration layer behind many GenAI use cases for startups and Fortune 500s.

This talk will be a tour of various methods, best practices, and considerations used in the Airflow community when taking GenAI use cases to production. We’ll focus on four primary use cases: RAG, fine-tuning, resource management, and batch inference, and walk through patterns different members of the community have used to productionize this new, exciting technology.

14:35 - 15:00.
By Ian Moritz
Track: Airflow & ...
Room: California West
09/12/2024 2:35 PM 09/12/2024 3:00 PM America/Los_Angeles AS24: Airflow-as-an-Engine: Lessons from Open-Source Applications Built On Top of Airflow

Airflow is often used for running data pipelines, which themselves connect with other services through the provider system. However, it is also increasingly used as an engine under-the-hood for other projects building on top of the DAG primitive. For example, Cosmos is a framework for automatically transforming dbt DAGs into Airflow DAGs, so that users can supplement the developer experience of dbt with the power of Airflow.

This session dives into how a select group of these frameworks (Cosmos, Meltano, Chronon) use Airflow as an engine for orchestrating complex workflows their systems depend on. In particular, we will discuss ways that we’ve increased Airflow performance to meet application-specific demands (high-task-count Cosmos DAGs, streaming jobs in Chronon), new Airflow features that will evolve how these frameworks use Airflow under the hood (DAG versioning, dataset integrations), and paths we see these projects taking over the next few years as Airflow grows. Airflow is not just a DAG platform, it’s an application platform!

14:35 - 15:00.
By Olivier Daneau
Track: Best practices
Room: California East
09/12/2024 2:35 PM 09/12/2024 3:00 PM America/Los_Angeles AS24: Using Airflow operational data to optimize Cloud services

Cost management is a continuous challenge for our data teams at Astronomer. Understanding the expenses associated with running our workflows is not always straightforward, and identifying which process ran a query causing unexpected usage on a given day can be time-consuming.

In this talk, we will showcase an Airflow Plugin and specific DAGs developed and used internally at Astronomer to track and optimize the costs of running DAGs. Our internal tool monitors Snowflake query costs, provides insights, and sends alerts for abnormal usage. With it, Astronomer identified and refactored its most costly DAGs, resulting in an almost 25% reduction in Snowflake spending.

We will demonstrate how to track Snowflake-related DAG costs and discuss how the tool can be adapted to any database supporting query tagging like BigQuery, Oracle, and more.

This talk will cover the implementation details and show how Airflow users can effectively adopt this tool to monitor and manage their DAG costs.
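The attribution mechanism the talk relies on, query tagging, can be sketched in a few lines: each query issued by a task carries a tag encoding its Airflow context (in Snowflake, via the session-level QUERY_TAG parameter), and spend is later grouped by that tag. This is a simplified, hypothetical illustration of the pattern, not Astronomer's actual plugin code.

```python
import json
from collections import defaultdict


def build_query_tag(dag_id, task_id, run_id):
    """Serialize Airflow context into a query-tag value, e.g. one that
    could be set with Snowflake's ALTER SESSION SET QUERY_TAG = '...'."""
    return json.dumps({"dag_id": dag_id, "task_id": task_id, "run_id": run_id})


def cost_by_dag(query_history):
    """Group per-query credit usage by the dag_id embedded in the tag.

    `query_history` mimics rows from a warehouse query-history view:
    (query_tag, credits_used) pairs.
    """
    totals = defaultdict(float)
    for tag, credits in query_history:
        totals[json.loads(tag)["dag_id"]] += credits
    return dict(totals)


history = [
    (build_query_tag("elt_orders", "load", "run1"), 1.5),
    (build_query_tag("elt_orders", "transform", "run1"), 0.5),
    (build_query_tag("ml_features", "train", "run1"), 3.0),
]
print(cost_by_dag(history))  # {'elt_orders': 2.0, 'ml_features': 3.0}
```

The same grouping works against any database that supports query tagging, which is what makes the approach portable to BigQuery, Oracle, and others.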

15:10 - 15:35.
By Vincent La, Jim Howard & Moulay Zaidane Al Bahi Draidia
Track: Use cases
Room: California East
09/12/2024 3:10 PM 09/12/2024 3:35 PM America/Los_Angeles AS24: Customizing LLMs: Leveraging Technology to tailor GenAI using Airflow

Laurel provides an AI-driven timekeeping solution tailored for accounting and legal firms, automating timesheet creation by capturing digital work activities. This session highlights two notable AI projects:

  1. UTBMS Code Prediction: Leveraging small language models, this system builds new embeddings to predict work codes for legal bills with high accuracy. More details are available in our case study: https://www.laurel.ai/resources-post/enhancing-legal-and-accounting-workflows-with-ai-insights-into-work-code-prediction.

  2. Bill Creation and Narrative Generation: Utilizing Retrieval-Augmented Generation (RAG), this approach transforms users’ digital activities into fully billable entries.

Additionally, we will discuss how we use Airflow for model management in these AI projects:

  • Daily Model Retraining: We retrain our models daily

  • Model (Re)deployment: Our Airflow DAG evaluates model performance, redeploying it if improvements are detected

  • Cost Management: To avoid high costs associated with querying large language models frequently, our DAG utilizes RAG to efficiently summarize daily activities into a billable timesheet at day’s end.
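The model (re)deployment step above amounts to a branching decision inside the DAG. A minimal sketch, with hypothetical task names and thresholds, is a callable in the style of an Airflow BranchPythonOperator that compares the freshly retrained model against the serving one:

```python
def choose_deploy_branch(candidate_score, production_score, min_gain=0.01):
    """Return the task_id to follow, BranchPythonOperator-style:
    redeploy only if the freshly retrained model beats the currently
    serving model by at least `min_gain` (names are illustrative)."""
    if candidate_score - production_score >= min_gain:
        return "redeploy_model"
    return "keep_current_model"


print(choose_deploy_branch(0.90, 0.85))   # redeploy_model
print(choose_deploy_branch(0.851, 0.85))  # keep_current_model
```

Gating redeployment on a minimum improvement avoids churning the serving model on noise from day-to-day retraining.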

15:10 - 16:10.
By Freddy Demiane, Rahul Vats & Dennis Ferruzzi
Track: Community
Room: California West
09/12/2024 3:10 PM 09/12/2024 4:10 PM America/Los_Angeles AS24: Hello Quality: Building CIs to run Providers Packages System Tests

Airflow operators are a core feature of Apache Airflow, so it’s extremely important that we maintain high operator quality and prevent regressions. At the same time, we want to give developers automated test results so they can double-check that their changes don’t introduce regressions or backward-incompatible changes, and to give Airflow release managers the information to decide whether a given version of a provider is ready to be released.

Recently a new approach to assuring production quality was implemented for AWS, Google and Astronomer-provided operators: standalone Continuous Integration processes were configured for them, and test results dashboards show the results of the last test runs. What has been working well for these providers may be a pattern for others to follow. During this presentation, AWS, Google and Astronomer engineers will share the internals of the test dashboards implemented for their operators. This approach might be a ‘blueprint’ for other providers to follow.

15:10 - 15:35.
By Venkata Jagannath & Marwan Sarieddine
Track: New features
Room: Elizabethan A+B
09/12/2024 3:10 PM 09/12/2024 3:35 PM America/Los_Angeles AS24: Using the power of Apache Airflow and Ray for Scalable AI deployments

Many organizations struggle to create a well-orchestrated AI infrastructure, using separate and disconnected platforms for data processing, model training, and inference, which slows down development and increases costs. There’s a clear need for a unified system that can handle all aspects of AI development and deployment, regardless of the size of data or models.

Join our breakout session to see how our comprehensive solution simplifies the development and deployment of large language models in production. Learn how to streamline your AI operations by implementing an end-to-end ML lifecycle on your custom data, including automated LLM fine-tuning, LLM evaluation, LLM serving, and LoRA deployments.

15:45 - 16:10.
By Sriram Vamsi Ilapakurthy
Track: Airflow intro talks
Room: Elizabethan A+B
09/12/2024 3:45 PM 09/12/2024 4:10 PM America/Los_Angeles AS24: Exploring DAG Design Patterns in Apache Airflow

This talk delves into advanced Directed Acyclic Graph (DAG) design patterns that are pivotal for optimizing data pipeline management and boosting efficiency. We’ll cover dynamic DAG generation, which allows for flexible, scalable workflow creation based on real-time data and configurations. Learn about task grouping and SubDAGs to enhance readability and maintainability of complex workflows. We’ll also explore parameterized DAGs for injecting runtime parameters into tasks, enabling versatile and adaptable pipeline configurations. Additionally, the session will address branching and conditional execution to manage workflow paths dynamically based on data conditions or external triggers. Lastly, understand how to leverage parallelism and concurrency to maximize resource utilization and reduce execution times. This session is designed for intermediate to advanced users who are familiar with the basics of Airflow and looking to deepen their understanding of its more sophisticated capabilities.

The session focuses on practical, high-impact design patterns that can significantly improve the performance and scalability of Airflow deployments.
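One of the patterns listed above, dynamic DAG generation, typically boils down to a loop over configuration that stamps out one DAG per entry. The sketch below keeps that shape in plain Python (real Airflow code would register DAG objects in `globals()` instead of returning dicts); all names are illustrative.

```python
def make_dag_specs(configs):
    """Config-driven (dynamic) DAG generation: one spec per config entry.

    In a real Airflow DAG file, each spec would become a DAG object
    assigned into globals() so the scheduler discovers it; plain dicts
    keep this sketch runnable without Airflow installed.
    """
    specs = {}
    for cfg in configs:
        dag_id = f"ingest_{cfg['source']}"
        specs[dag_id] = {
            "schedule": cfg.get("schedule", "@daily"),
            "tasks": ["extract", "transform", "load"],
        }
    return specs


specs = make_dag_specs(
    [{"source": "orders"}, {"source": "users", "schedule": "@hourly"}]
)
print(sorted(specs))  # ['ingest_orders', 'ingest_users']
```

Adding a new pipeline then means adding a config entry, not writing a new DAG file, which is exactly what makes the pattern scale.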

15:45 - 16:10.
By Gil Reich
Track: Airflow intro talks
Room: California East
09/12/2024 3:45 PM 09/12/2024 4:10 PM America/Los_Angeles AS24: Refactoring DAGs: From Duplication to Delightful Efficiency with a Centralized Library

Feeling trapped in a maze of duplicate Airflow DAG code? We were too! That’s why we embarked on a journey to build a centralized library, eliminating redundancy and unlocking delightful efficiency.

Join us as we share:

  • The struggles of managing repetitive code across DAGs
  • Our approach to a centralized library, revealing design and implementation strategies
  • The amazing results: reduced development time, clean code, effortless maintenance, and a framework that creates efficient and self-documenting DAGs

Let’s break free from complexity and duplication, and build a brighter Airflow future together!
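A centralized library like the one described usually centers on factory functions that stamp out the same task skeleton and naming convention for every team, so DAGs stay uniform and self-documenting. The sketch below is hypothetical (the names are not the speaker's actual library) and is kept Airflow-free so it runs as-is.

```python
def standard_etl_tasks(domain, sources):
    """Centralized-library-style factory: every DAG built through it
    gets the same skeleton and `domain.task` naming convention, instead
    of each team copy-pasting its own variant (names are illustrative)."""
    tasks = [f"{domain}.start"]
    for src in sources:
        tasks += [f"{domain}.extract_{src}", f"{domain}.load_{src}"]
    tasks.append(f"{domain}.quality_checks")
    return tasks


print(standard_etl_tasks("finance", ["ledger"]))
# ['finance.start', 'finance.extract_ledger', 'finance.load_ledger',
#  'finance.quality_checks']
```

Because the skeleton lives in one place, a fix or convention change propagates to every DAG at once rather than requiring edits across dozens of files.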

10:10 - 10:30
Morning break
13:15 - 14:00
Lunch
16:20 - 17:00
Lightning talks & Wrap up
09:00 - 09:25. Grand Ballroom
By Vikram Koka
Track: Keynote
Apache Airflow has emerged as the de facto standard for data orchestration. Over the last couple of years, Airflow has also seen increasing adoption for ML and AI use cases. It has been almost four years since the release of Airflow 2 and as a community we have agreed that it’s time for a major foundational release in the form of Airflow 3. This talk will introduce the vision behind Airflow 3, including the emerging technology trends in the industry and how Airflow will evolve in response.
09:25 - 10:10. Grand Ballroom
By Madison Swain-Bowden, Kaxil Naik, Michał Modras, Constance Martineau & Shubham Mehta
Track: Keynote
Join us in this panel with key members of the community behind the development of Apache Airflow where we will discuss the tentative scope for the next generation, i.e. Airflow 3.
10:30 - 10:55. California East
By Anant Agarwal
Track: Use cases
In this talk, we’ll discuss how Instacart leverages Apache Airflow to orchestrate a vast network of data pipelines, powering both our core infrastructure and dbt deployments. As a data-driven company, Airflow plays a critical role in enabling us to execute large and intricate pipelines securely, compliantly, and at scale. We’ll delve into the following key areas: a. High-Throughput Cluster Management: We’ll explore how we manage and maintain our Airflow cluster, ensuring the efficient execution of over 2,000 DAGs across diverse use cases.
10:30 - 11:15. California West
By Briana Okyere, Amogh Desai, Ryan Hatter & Srabasti Banerjee
Track: Community
“Connecting the Dots in Airflow: From User to Contributor” explores the journey of transitioning from an Airflow user to an active project contributor. This talk will cover essential steps, resources, and best practices to effectively engage with the Airflow community and make meaningful contributions. Attendees will gain insights into the collaborative nature of open-source projects and how their involvement can drive both personal growth and project innovation.
10:30 - 10:55. Elizabethan A+B
By Bhavesh Jaisinghani
Track: Use cases
In today’s data-driven era, ensuring data reliability and enhancing our testing and development capabilities are paramount. Local unit testing has its merits but falls short when dealing with the volume of big data. One major challenge is running Spark jobs pre-deployment to ensure they produce expected results and handle production-level data volumes. In this talk, we will discuss how Autodesk leveraged Astronomer to improve pipeline development. We’ll explore how it addresses challenges with sensitive and large data sets that cannot be transferred to local machines or non-production environments.
10:30 - 10:55. Georgian
By Rafal Biegacz & Filip Knapik
Track: Sponsored
In the realm of data engineering, machine learning pipelines, and the use of cloud and web services, there is a huge demand for orchestration technologies. Apache Airflow is among the most popular orchestration technologies, if not the most popular one. In this presentation we are going to focus on the aspects of Airflow that make it so popular, and ask whether it has become the orchestration industry standard.
11:00 - 11:25. California East
By Jet Mariscal
Track: Use cases
While Airflow is widely known for orchestrating and managing workflows, particularly in the context of data engineering, data science, ML (Machine Learning), and ETL (Extract, Transform, Load) processes, its flexibility and extensibility make it a highly versatile tool suitable for a variety of use cases beyond these domains. In fact, Cloudflare has publicly shared in the past an example on how Airflow was leveraged to build a system that automates datacenter expansions.
11:00 - 11:25. Elizabethan A+B
By Bartosz Jankiewicz
Track: New features
Apache Airflow is the backbone of countless data pipelines, but optimizing performance and resource utilization can be a challenge. This talk introduces a novel performance testing framework designed to measure, monitor, and improve the efficiency of Airflow deployments. I’ll delve into the framework’s modular architecture, showcasing how it can be tailored to various Airflow setups (Docker, Kubernetes, cloud providers). By measuring key metrics across schedulers, workers, triggers, and databases, this framework provides actionable insights to identify bottlenecks and compare performance across different versions or configurations.
11:00 - 11:25. Georgian
By Udit Saxena
Track: Use cases
This talk will explore ASAPP’s use of Apache Airflow to streamline and optimize our machine learning operations (MLOps). Key highlights include:
  • Integrating with our custom Spark solution to achieve speedup, efficiency, and cost gains for generative AI transcription, summarization, and intent categorization pipelines
  • Different design patterns for integrating with efficient LLM servers (TGI/vLLM/TensorRT) for summarization pipelines, with or without Spark
  • An overview of batched LLM inference using Airflow, as opposed to real-time inference outside of it
11:30 - 11:55. California East
By Madhav Khakhar, Alexander Shmidt & Mayank Chopra
Track: Use cases
The talk will cover how we use Airflow at the heart of our Workflow Management Platform (WFM) at Booking.com, enabling our internal users to orchestrate big data workflows on Booking Data Exchange (BDX). High-level overview of the talk:
  • Adapting the open source Airflow Helm chart to spin up an Airflow installation in Booking Kubernetes Service (BKS)
  • Coming up with a workflow definition format (YAML)
  • Conversion of workflow.yaml to workflow.py DAGs
  • Usage of deferrable operators to provide standard step templates to users
  • Workspaces (collections of workflows), used to ensure role-based access to DAG permissions for users
  • Using Okta for authentication
  • Alerting, monitoring, logging
  • Plans to shift to Astronomer
11:30 - 11:55. California West
By Konrad Schieban
Track: Community
DAGify is a highly extensible, template driven, enterprise scheduler migration accelerator that helps organizations speed up their migration to Apache Airflow. While DAGify does not claim to migrate 100% of existing scheduler functionality it aims to heavily reduce the manual effort it takes for developers to convert their enterprise scheduler formats into Python Native Airflow DAGs. DAGify is an open source tool under Apache 2.0 license and available on Github (https://github.
11:30 - 12:15. Georgian
By Shobhit Shah & Sumit Maheshwari
Track: Use cases
Up until a few years ago, teams at Uber used multiple data workflow systems, with some based on open source projects such as Apache Oozie, Apache Airflow, and Jenkins while others were custom built solutions written in Python and Clojure. Every user who needed to move data around had to learn about and choose from these systems, depending on the specific task they needed to accomplish. Each system required additional maintenance and operational burdens to keep it running, troubleshoot issues, fix bugs, and educate users.
11:30 - 12:15. Elizabethan A+B
By Kaxil Naik & Ash Berlin-Taylor
Track: New features
Gen AI has taken the computing world by storm. As Enterprises and Startups have started to experiment with LLM applications, it has become clear that providing the right context to these LLM applications is critical. This process known as Retrieval augmented generation (RAG) relies on adding custom data to the large language model, so that the efficacy of the response can be improved. Processing custom data and integrating with Enterprise applications is a strength of Apache Airflow.
12:00 - 12:25. California East
By Jianlong Zhong
Track: Use cases
At Coinbase, Airflow is the backbone of ELT, supported by a vibrant community of over 500 developers. This vast engagement results in a continuous stream of enhancements, with hundreds of commits tested and released daily. However, this scale of development presents its own set of challenges, especially in deployment velocity. Traditional deployment methodologies proved inadequate, significantly impeding the productivity of our developers. Recognizing the critical need for a solution that matches our pace of innovation, we developed AirAgent: a bespoke, fully autonomous deployer designed specifically for Airflow.
12:00 - 12:25. California West
By Michael Robinson
Track: Community
The Apache Airflow community is so large and active that it’s tempting to take the view that “if it ain’t broke don’t fix it.” In a community as in a codebase, however, improvement and attention are essential to sustaining growth. And bugs are just as inevitable in community management as they are in software development. If only the fixes were, too! Airflow is large and growing because users love Airflow and our community.
12:30 - 13:15. California East
By Cyrus Dukart, David Sacerdote & Jason Bridgemohansingh
Track: Use cases
At Vibrant Planet, we’re on a mission to make the world’s communities and ecosystems more resilient in the face of climate change. Our cloud-based platform is designed for collaborative scenario planning to tackle wildfires, climate threats, and ecosystem restoration on a massive scale. In this talk we will dive into how we are using Airflow. Particularly we will focus on how we’re making Airflow pipelines smarter and more resilient, especially when dealing with the task of processing large satellite imagery and other geospatial data.
12:30 - 13:15. California West
By Jarek Potiuk
Track: Community
Apache Airflow relies on a silent symphony behind the scenes: its CI/CD (Continuous Integration/Continuous Delivery) and development tooling. This presentation explores the critical role these tools play in keeping Airflow efficient and innovative. We’ll delve into how robust CI/CD ensures bug fixes and improvements are seamlessly integrated, while well-maintained development tools empower developers to contribute effectively. Airflow’s power comes from a well-oiled machine – its CI/CD and development tools. This presentation dives into the world of these often-overlooked heroes.
12:30 - 13:15. Elizabethan A+B
By Constance Martineau & Tzu-ping Chung
Track: New features
As Apache Airflow evolves, a key shift is emerging: the move from task-centric to data-aware orchestration. Traditionally, Airflow has focused on managing tasks efficiently, with limited visibility into the data those tasks manipulate. However, the rise of data-centric workflows demands a new approach—one that puts data at the forefront. This talk will explore how embedding deeper data insights into Airflow can align with modern users’ needs, reducing complexity and enhancing workflow efficiency.
12:30 - 13:15. Georgian
By Julian LaNeve & Jason Ma
Track: Sponsored
Data engineers have shifted from delivering data for internal analytics applications to customer-facing data products. And with that shift comes a whole new level of operational rigor necessary to instill trust and confidence in the data. How do you hold data pipelines to the same standards as traditional software applications? Can you apply principles learned from the field of SRE to the world of data? In this talk, we’ll explore how we’ve seen this evolve in Astronomer’s customer base and highlight best practices learned from the most critical data product applications we’ve seen.
14:00 - 16:40. Elizabethan C, Elizabethan D, Borgia
Track: Workshops
We will have 3 simultaneous hands-on workshops. Follow the link for details and to sign up.
14:00 - 14:25. California East
By Daniil Dubin
Track: Use cases
At Wix, more often than not, business analysts build workflows themselves to avoid data engineers becoming a bottleneck. But how do you enable them to create SQL ETLs that start when dependencies are ready and send emails or refresh Tableau reports when the work is done? One simple answer may be to use Airflow. The problem is that not every BA can be expected to know Python and Git well enough to create thousands of DAGs easily.
14:00 - 14:25. California West
By Pete Dejoy
Track: Community
Every data team out there is being asked by their business stakeholders about Generative AI. Taking LLM-centric workloads to production is not a trivial task. At the foundational level, there are a set of challenges around data delivery, data quality, and data ingestion that mirror traditional data engineering problems. Once you’re past those, there’s a set of challenges related to the underlying use case you’re trying to solve. Thankfully, because of how Airflow was already being used at these companies for data engineering and MLOps use cases, it has become the de facto orchestration layer behind many GenAI use cases for startups and Fortune 500s.
14:00 - 14:55. Elizabethan A+B
By Jed Cunningham
Track: New features
Join us as we check in on the current status of AIP-63: DAG Versioning. This session will explore the motivations behind AIP-63, the challenges faced by Airflow users in understanding and managing DAG history, and how it aims to address them. From tracking TaskInstance history to improving DAG representation in the UI, we’ll examine what we’ve already done and what’s next. We’ll also touch upon the potential future steps outlined in AIP-66 regarding the execution of specific DAG versions.
14:35 - 15:00. California East
By Olivier Daneau
Track: Best practices
Cost management is a continuous challenge for our data teams at Astronomer. Understanding the expenses associated with running our workflows is not always straightforward, and identifying which process ran a query causing unexpected usage on a given day can be time-consuming. In this talk, we will showcase an Airflow Plugin and specific DAGs developed and used internally at Astronomer to track and optimize the costs of running DAGs. Our internal tool monitors Snowflake query costs, provides insights, and sends alerts for abnormal usage.
14:35 - 15:00. California West
By Ian Moritz
Track: Airflow & ...
Airflow is often used for running data pipelines, which themselves connect with other services through the provider system. However, it is also increasingly used as an engine under-the-hood for other projects building on top of the DAG primitive. For example, Cosmos is a framework for automatically transforming dbt DAGs into Airflow DAGs, so that users can supplement the developer experience of dbt with the power of Airflow. This session dives into how a select group of these frameworks (Cosmos, Meltano, Chronon) use Airflow as an engine for orchestrating complex workflows their systems depend on.
15:10 - 15:35. California East
By Vincent La, Jim Howard & Moulay Zaidane Al Bahi Draidia
Track: Use cases
Laurel provides an AI-driven timekeeping solution tailored for accounting and legal firms, automating timesheet creation by capturing digital work activities. This session highlights two notable AI projects:
- UTBMS Code Prediction: leveraging small language models, this system builds new embeddings to predict work codes for legal bills with high accuracy. More details are available in our case study: https://www.laurel.ai/resources-post/enhancing-legal-and-accounting-workflows-with-ai-insights-into-work-code-prediction.
- Bill Creation and Narrative Generation: utilizing Retrieval-Augmented Generation (RAG), this approach transforms users’ digital activities into fully billable entries.
15:10 - 16:10. California West
By Freddy Demiane, Rahul Vats & Dennis Ferruzzi
Track: Community
Airflow operators are a core feature of Apache Airflow, and it’s extremely important that we maintain their quality: preventing regressions, giving developers automated test results to verify that their changes don’t introduce regressions or backward-incompatible behavior, and giving Airflow release managers the information to decide whether a given provider version is ready for release. Recently, a new approach to assuring production quality was implemented for AWS, Google, and Astronomer-provided operators: standalone continuous integration processes were configured for them, and test dashboards show the results of the latest runs.
15:10 - 15:35. Elizabethan A+B
By Venkata Jagannath & Marwan Sarieddine
Track: New features
Many organizations struggle to create a well-orchestrated AI infrastructure, using separate and disconnected platforms for data processing, model training, and inference, which slows down development and increases costs. There’s a clear need for a unified system that can handle all aspects of AI development and deployment, regardless of the size of data or models. Join our breakout session to see how our comprehensive solution simplifies the development and deployment of large language models in production.
15:45 - 16:10. California East
By Gil Reich
Track: Airflow intro talks
Feeling trapped in a maze of duplicate Airflow DAG code? We were too! That’s why we embarked on a journey to build a centralized library, eliminating redundancy and unlocking delightful efficiency. Join us as we share:
- The struggles of managing repetitive code across DAGs
- Our approach to a centralized library, revealing design and implementation strategies
- The amazing results: reduced development time, clean code, effortless maintenance, and a framework that creates efficient and self-documenting DAGs
Let’s break free from complexity and duplication, and build a brighter Airflow future together!
15:45 - 16:10. Elizabethan A+B
By Sriram Vamsi Ilapakurthy
Track: Airflow intro talks
This talk delves into advanced Directed Acyclic Graph (DAG) design patterns that are pivotal for optimizing data pipeline management and boosting efficiency. We’ll cover dynamic DAG generation, which allows for flexible, scalable workflow creation based on real-time data and configurations. Learn about task grouping and SubDAGs to enhance readability and maintainability of complex workflows. We’ll also explore parameterized DAGs for injecting runtime parameters into tasks, enabling versatile and adaptable pipeline configurations.
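The dynamic DAG generation pattern mentioned above boils down to a factory driven by configuration. The sketch below illustrates the shape of that pattern; to stay self-contained it uses a minimal stand-in `Dag` dataclass rather than `airflow.models.DAG`, and all config names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Dag:
    """Minimal stand-in for airflow.models.DAG, for illustration only."""
    dag_id: str
    schedule: str
    tasks: list = field(default_factory=list)

# Configuration drives generation: one entry per pipeline.
CONFIGS = {
    "sales": {"schedule": "@daily", "tables": ["orders", "refunds"]},
    "marketing": {"schedule": "@hourly", "tables": ["clicks"]},
}

def build_dag(name, cfg):
    # Factory: a DAG with one load task per configured table.
    dag = Dag(dag_id=f"ingest_{name}", schedule=cfg["schedule"])
    for table in cfg["tables"]:
        dag.tasks.append(f"load_{table}")
    return dag

# In a real DAG file these would be registered at module level
# so the Airflow scheduler picks them up.
dags = {name: build_dag(name, cfg) for name, cfg in CONFIGS.items()}
```

Changing the config (or loading it from YAML or a database at parse time) adds or removes DAGs without touching the factory code, which is what makes the pattern flexible and scalable.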