These are the sessions planned for Airflow Summit 2024.

10 years of Airflow: history, insights, and looking forward

by Kenten Danas, John Jackson, Marc Lamberti, Rafal Biegacz, Ash Berlin-Taylor & Elad Kalif
10 years after its creation, Airflow is stronger than ever: in last year’s Airflow survey, 81% of users said Airflow is important or very important to their business, 87% said their Airflow usage has grown over time, and 92% said they would recommend Airflow. In this panel discussion, we’ll celebrate a decade of Airflow and delve into how it became the highly recommended industry standard it is today, including history, pivotal moments, and the role of the community.

A deep dive into Airflow configuration options for scalability

by Ephraim Anierobi
Apache Airflow has a lot of configuration options. A change in some of these options can affect the performance of Airflow. If you are wondering why your Airflow instance is not running the number of tasks you expected it to run, after this talk, you will have a better understanding of the configuration options available for improving the number of tasks your Airflow instance can run. We will talk about the DAG parsing configuration options, options for scheduler scalability, etc.
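As a rough illustration of the kind of options this abstract refers to, the sketch below renders a few parsing and scheduler settings in the `AIRFLOW__{SECTION}__{KEY}` environment-variable form that Airflow reads. The option names are taken from the Airflow 2.x configuration reference. The values are placeholders picked for the example, not recommendations from the talk, and the `to_env` helper is hypothetical.

```python
# Illustrative only: the values below are placeholders, not tuning advice.
SETTINGS = {
    ("core", "parallelism"): 64,                     # task instances that may run at once per scheduler
    ("core", "max_active_tasks_per_dag"): 32,        # concurrent task instances within one DAG
    ("scheduler", "parsing_processes"): 4,           # processes used to parse DAG files
    ("scheduler", "min_file_process_interval"): 60,  # seconds between re-parses of a DAG file
}


def to_env(settings):
    """Render (section, key) -> value pairs as AIRFLOW__SECTION__KEY variables."""
    return {
        f"AIRFLOW__{section.upper()}__{key.upper()}": str(value)
        for (section, key), value in settings.items()
    }
```

Calling `to_env(SETTINGS)` gives a dictionary that could be merged into a process environment before starting Airflow components.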

A Game of Constant Learning & Adjustment: Orchestrating ML Pipelines at the Philadelphia Phillies

by Mike Hirsch & Sophie Keith
When developing Machine Learning (ML) models, the biggest challenges are often infrastructural. How do we deploy our model and expose an inference API? How can we retrain? Can we continuously evaluate performance and monitor model drift? In this talk, we will present how we are tackling these problems at the Philadelphia Phillies by developing a suite of tools that enable our software engineering and analytics teams to train, test, evaluate, and deploy ML models - that can be entirely orchestrated in Airflow.

A New DAG Paradigm: Less Airflow more DAGs

by Marion Azoulai & Maggie Stark
Astronomer’s data team recently underwent a major shift in how we work with Airflow. We’ll deep dive into the challenges which prompted that change, how we addressed them and where we are now. This re-architecture included: Switching to dataset scheduling and micro-pipelines to minimize failures and increase reliability. Implementing a Control DAG for complex dependency management and full end-to-end pipeline visibility. Standardized Task Groups for quick onboarding and scalability. With Airflow managing itself, we can once again focus on the data rather than the operational overhead.

Activating operational metadata with Airflow, Atlan and OpenLineage

by Kacper Muda & Eric Veleker
OpenLineage is an open standard for lineage data collection, integrated into the Airflow codebase, facilitating lineage collection across providers like Google, Amazon, and more. Atlan Data Catalog is a 3rd generation active metadata platform that is a single source of trust unifying cataloging, data discovery, lineage, and governance experience. We will demonstrate what OpenLineage is and how, with minimal and intuitive setup across Airflow and Atlan, it presents a unified workflows view and efficient cross-platform lineage collection, including column level, in various technologies (Python, Spark, dbt, SQL etc.

Adaptive Memory Scaling for Robust Airflow Pipelines

by Cyrus Dukart, David Sacerdote & Jason Bridgemohansingh
At Vibrant Planet, we’re on a mission to make the world’s communities and ecosystems more resilient in the face of climate change. Our cloud-based platform is designed for collaborative scenario planning to tackle wildfires, climate threats, and ecosystem restoration on a massive scale. In this talk we will dive into how we are using Airflow. Particularly we will focus on how we’re making Airflow pipelines smarter and more resilient, especially when dealing with the huge task of processing satellite imagery and other geospatial data.

AIP-63: DAG Versioning - Where are we?

by Jed Cunningham
Join us as we check in on the current status of AIP-63: DAG Versioning. This session will explore the motivations behind AIP-63, the challenges faced by Airflow users in understanding and managing DAG history, and how it aims to address them. From tracking TaskInstance history to improving DAG representation in the UI, we’ll examine what we’ve already done and what’s next. We’ll also touch upon the potential future steps outlined in AIP-66 regarding the execution of specific DAG versions.

Airflow 3 - Roadmap Discussion

by Constance Martineau, Kaxil Naik, Michał Modras & Shubham Mehta
This session will present the tentative scope for the next generation of Airflow, i.e. Airflow 3.

Airflow and Control-M: Where Data Pipelines Meet Business Applications in Production

by Joe Goldberg
With Airflow’s mainstream acceptance in the enterprise, the operational challenges of running with applications in production have emerged. At last year’s Airflow Summit in Toronto, three providers of Apache Airflow met to discuss “The Future of Airflow: What Users Want”. Among the user requirements in the session were: An improved security model allowing “Alice” and “Bob” to run their single DAGs without each requiring a separate Airflow cluster, while still adhering to their organization’s compliance requirements.

Airflow and multi-cluster Slurm working together

by Eloi Codina Torras
Meteosim provides environmental services, mainly based on weather and air quality intelligence, and helps customers make operational and tactical decisions and understand their companies’ environmental impact. We introduced Airflow a couple of years ago to replace a huge Crontab file and we currently have around 7000 DAG Runs per day. In this presentation we will introduce the hardest challenge we had to overcome: adapting Airflow to run on multiple Slurm-managed HPC clusters by using deferrable operators.

Airflow as a workflow for Self Service Based Ingestion

by Ramesh Babu K M
Our idea to platformize ingestion pipelines is driven by Airflow in the background and streamlines the entire ingestion process for self-service. With customer experience layered on top, and with the Analytics data team making data ingestion foolproof, Airflow complements our vision.

Airflow at Burns & McDonnell | Orchestration from zero to 100

by Bonnie Why
As the largest employee-owned engineering and construction firm in the United States, Burns & McDonnell has a massive amount of data. Not only that, it’s hard to pinpoint which source system has the data we need. Our solution to this challenge is to build a unified information platform — a single source of truth where all of our data is searchable, trustworthy, and accessible to our employee-owners and the projects that need it.

Airflow at Ford: A Job Router Training Advance Driver Assistance Systems

by Serjesh Sharma & Vasantha Kosuri-Marshall
Ford Motor Company operates extensively across various nations. The Data Operations (DataOps) team for Advanced Driver Assistance Systems (ADAS) at Ford is tasked with processing terabyte-scale daily data from lidar, radar, and video. To manage this, the DataOps team is challenged with orchestrating diverse, compute-intensive pipelines across both on-premises infrastructure and GCP, and with handling sensitive customer data across both environments. The team is also responsible for facilitating the execution of on-demand, compute-intensive algorithms at scale through.

Airflow at NCR Voyix: Streamlining ML workflows development with Airflow

by Shahar Epstein
NCR Voyix Retail Analytics AI team offers ML products for retailers while embracing Airflow as its MLOps Platform. As the team is small and there have been twice as many data scientists as engineers, we encountered challenges in making Airflow accessible to the scientists: As they come from diverse programming backgrounds, we needed an architecture enabling them to develop production-ready ML workflows without prior knowledge of Airflow. Due to dynamic product demands, we had to implement a mechanism to interchange Airflow operators effortlessly.

Airflow Datasets and Pub/Sub for Dynamic DAG Triggering

by Nawfel Bacha & Andrea Bombino
Looking for a way to streamline your data workflows and master the art of orchestration? As we navigate the complexities of modern data engineering, dynamic workflows and complex data pipeline dependencies in Airflow are becoming more and more common. To empower data engineers to use Airflow as the main orchestrator, Airflow Datasets can be easily integrated into your data journey. This session will showcase dynamic workflow orchestration in Airflow and how to manage multi-DAG dependencies with multi-Dataset listening.

Airflow UI Roadmap

by Brent Bovenzi
Soon we will finally switch to a 100% React UI with a full separation between the API and UI as well. While we are doing such a big change, let’s also take the opportunity to imagine whole new interfaces vs just simply modernizing the existing views. How can we use design to help you better understand what is going on with your DAG? Come listen to some of our proposed ideas and bring your own big ideas as the second half will be an open discussion.

Airflow Unleashed: Making Hundreds of Deployments A Day at Coinbase

by Jianlong Zhong
At Coinbase, Airflow is the backbone of ELT, supported by a vibrant community of over 500 developers. This vast engagement results in a continuous stream of enhancements, with hundreds of commits tested and released daily. However, this scale of development presents its own set of challenges, especially in deployment velocity. Traditional deployment methodologies proved inadequate, significantly impeding the productivity of our developers. Recognizing the critical need for a solution that matches our pace of innovation, we developed AirAgent: a bespoke, fully autonomous deployer designed specifically for Airflow.

Airflow-as-an-Engine: Lessons from Open-Source Applications Built On Top of Airflow

by Ian Moritz
Airflow is often used for running data pipelines, which themselves connect with other services through the provider system. However, it is also increasingly used as an engine under-the-hood for other projects building on top of the DAG primitive. For example, Cosmos is a framework for automatically transforming dbt DAGs into Airflow DAGs, so that users can supplement the developer experience of dbt with the power of Airflow. This session dives into how a select group of these frameworks (Cosmos, Meltano, Chronon) use Airflow as an engine for orchestrating complex workflows their systems depend on.

Apache Airflow Bad vs. Best Practices In Production

by Bhavani Ravi
Apache Airflow - The open-ended nature of this orchestration tool gives room for a variety of customization. While this is a good thing, there are no bounds in which the system can or cannot be used, resulting in wasting a lot of time in scaling, testing, and debugging when things aren’t set properly. In this talk, we will go through a series of factors that data teams need to keep a watch for while setting up an Airflow system.

Architecting Blockchain ETL Orchestration: Circle's Airflow Usecase

by Nathaniel Rose
This talk focuses on exploring the implementation of Apache Airflow for Blockchain ETL orchestration, indexing, and the adoption of GitOps at Circle. It will cover CI/CD tips, architectural choices for managing Blockchain data at scale, engineering practices to enable data scientists and some learnings from production.

Automated Testing and Deployment of DAGs

by Austin Bennett
DAG integrity is critical. So are coding conventions and consistent standards across the group. In this talk, we will share the lessons learned at ChartBoost from testing and verifying our DAGs as part of our GitHub workflows (for testing as part of the pull request process, and for automated deployment, eventually to production, once merged). We will dig into how we have unlocked additional efficiencies, how we catch errors before they get deployed, and generally how we are better off for having both Airflow and plenty of checks in our CI before we merge and deploy.
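One common integrity check is confirming that task dependencies contain no cycles. The sketch below is a stand-in that uses only Python's standard-library `graphlib`. It does not load real Airflow DAGs and is not the speaker's tooling, and the `find_cycle` helper and the task names are hypothetical. A real CI job would more likely import the DAG files with Airflow itself and check for import errors.

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(dependencies):
    """Return the cycle as a list of task names, or None if the graph is acyclic.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    try:
        # Consuming static_order() raises CycleError if no valid order exists.
        list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # The second exception argument holds the nodes that form the cycle.
        return err.args[1]
    return None


# Hypothetical task graphs: one valid pipeline and one with a circular dependency.
good = {"load": {"transform"}, "transform": {"extract"}, "extract": set()}
bad = {"a": {"b"}, "b": {"a"}}
```

In a pull request check, a non-`None` result from `find_cycle` would fail the build before the DAG reaches any deployment.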

Behaviour Driven Development in Airflow

by Ole Christian Langfjæran
Behaviour Driven Development can, in the simplest of terms, be described as Test Driven Development, only readable. It is of course more than that, but that is not the aim of this talk. This talk aims to show:
How to write tests before you write a single line of Airflow code
Create reusable and readable steps for setting up tests, in a given-when-then manner
Test rendering and execution of your DAG’s tasks
Real world examples from a monorepo containing multiple Airflow projects
Written only with pytest, and some code I stole from smart people on GitHub.

Boost Airflow Monitoring and Alerting with Automation Analytics & Intelligence by Broadcom

by Jennifer Chisik
Airflow’s “workflow as code” approach has many benefits, including enabling dynamic pipeline generation and flexibility and extensibility in a seamless development environment. However, what challenges do you face as you expand your Airflow footprint in your organization? What if you could enhance Airflow’s monitoring capabilities, forecast DAG and task executions, obtain predictive alerting, visualize trends, and get more robust logging? Broadcom’s Automation Analytics & Intelligence (AAI) offers advanced analytics for workload automation for cloud and on-premises automation.

Boosting Airflow Efficiency Through Airflow Configuration Tuning & Optimisation

by Nishchay Agrawal
In the world of managing data & ETL workflows, Apache Airflow is a crucial tool for automating tasks. However, getting the most out of it requires more than just a basic setup. In this article, we take a deep dive into the details of Airflow performance improvement, focusing on how it’s set up and configured. Through deep dive analysis, we uncover the details that can greatly improve how DAG tasks are carried out and how the Airflow scheduler performs.

Bronco: Managing Terraform at Scale with Airflow

by Jack Cusick
Airflow is not just purpose-built for data applications. It is a job scheduler on steroids. This is exactly what a cloud platform team needs: a configurable and scalable automation tool that can handle thousands of administrative tasks. Come learn how one enterprise platform team used Airflow to support cloud infrastructure at unprecedented scale.

Building in resource awareness and event dependency into Airflow

by Roberto Santamaria & Xiaodong Deng
In this talk, we will explore how adding custom dependency checks into Airflow’s scheduling system can elevate Airflow’s performance. We will specifically discuss how we added general upstream events dependency checking as well as how to make Airflow aware of used/available compute resources so that the system can better decide when and where to run a given task on Kubernetes infrastructure. We’ll cover why the existing dependency checking in Airflow is not sufficient in our use case, and why adding custom code to Airflow is needed.

Building on Cosmos: Making dbt on Airflow Easy

by Hugo Hobson
Balyasny Asset Management (BAM) is a diversified global investment firm founded in 2001 with over $20 billion in assets under management. As dbt took hold at BAM, we had multiple teams building dbt projects against Snowflake, Redshift, and SQL Server. The common question was: How can we quickly and easily productionise our projects? Airflow is the orchestrator of choice at BAM, but our dbt users ranged from Airflow power users to people who’d never heard of Airflow before.

Connecting the Dots in Airflow: From User to Contributor

by Briana Okyere, Amogh Desai, Ryan Hatter & Srabasti Banerjee
“Connecting the Dots in Airflow: From User to Contributor” explores the journey of transitioning from an Airflow user to an active project contributor. This talk will cover essential steps, resources, and best practices to effectively engage with the Airflow community and make meaningful contributions. Attendees will gain insights into the collaborative nature of open-source projects and how their involvement can drive both personal growth and project innovation.

Converting Legacy Schedulers to Airflow

by Fritz Davenport
Introducing a process and framework to convert legacy scheduler workloads such as Control-M to Airflow using automated transpilation techniques. We will discuss the process and demonstrate a python-based transpiler to automatically migrate legacy scheduler workflows with a standard set of patterns to Airflow DAGs. This framework is easily extended via configurable rulesets to encompass other schedulers such as Automic, Autosys, Oozie, and others.

Customizing LLMs: Leveraging Technology to tailor GenAI using Airflow

by Vincent La
Laurel provides an AI-driven timekeeping solution tailored for accounting and legal firms, automating timesheet creation by capturing digital work activities. This session highlights two notable AI projects: UTBMS Code Prediction: Leveraging small language models, this system builds new embeddings to predict work codes for legal bills with high accuracy. More details are available in our case study: https://www.laurel.ai/resources-post/enhancing-legal-and-accounting-workflows-with-ai-insights-into-work-code-prediction. Bill Creation and Narrative Generation: Utilizing Retrieval-Augmented Generation (RAG), this approach transforms users’ digital activities into fully billable entries.

DAGify - Enterprise Scheduler Migration Accelerator for Airflow

by Konrad Schieban & Tim Hiatt
DAGify is a highly extensible, template driven, enterprise scheduler migration accelerator that helps organizations speed up their migration to Apache Airflow. While DAGify does not claim to migrate 100% of existing scheduler functionality it aims to heavily reduce the manual effort it takes for developers to convert their enterprise scheduler formats into Python Native Airflow DAGs. DAGify is an open source tool under Apache 2.0 license and available on Github (https://github.

Data Orchestration for Emerging Technology Analysis

by Jennifer Melot
The Center for Security and Emerging Technology is a think tank at Georgetown University that studies security implications of emerging technologies, including data-driven analyses across bibliometric, patenting, and investment datasets. This talk will describe CSET’s data infrastructure which uses Airflow to orchestrate data ingestion, model deployment, webscraping, and manual data curation pipelines. We’ll also discuss how outputs from these pipelines are integrated into public-facing web applications and written reports, and some lessons learned from building and maintaining data pipelines on a data team with a diverse skill set.

Data-Centric Airflow

by Tzu-ping Chung
While Airflow has its roots in ETL workflows, it is now more and more common for people to use it in a variety of ways. A traditional ETL approach thinks mainly in functions, with data considered a side effect of those functions and generally having no presence at all in user-facing interfaces. In a data-centric mindset, however, users put data front-and-center instead, and think of the operations as a means to create new data from upstream data.

dbt-Core & Airflow 101: Building Data Pipelines Demystified

by Luan Moreno Medeiros Maciel
dbt became the de facto standard for data teams building reliable and trustworthy SQL code leveraging a modern data stack architecture. The dbt logic needs to be orchestrated, and jobs scheduled to meet business expectations. That’s where Airflow comes into play. In this quick introduction session, you’ll learn how to:
Leverage dbt-Core & Airflow to orchestrate pipelines
Write DAGs in a Pythonic way
Apply best practices on your jobs

Elevating Machine Learning Deployment: Unleashing the Power of Airflow in Wix's ML Platform

by Elad Yaniv
In his presentation, Elad will provide a novel take on Airflow, highlighting its versatility beyond conventional use for scheduled pipelines. He’ll discuss its application as an on-demand tool for initiating and halting jobs, mainly in the Data Science fields, like dataset enrichment and batch prediction via API calls, complete with real-time status tracking and alerts. The talk aims to encourage a fresh approach to Airflow utilization but will also delve into the technical aspects of implementing DAG triggering and cancellation logic.

Empowering Airflow Users: A Framework for Performance Testing and Transparent Resource Optimization

by Bartosz Jankiewicz
Apache Airflow is the backbone of countless data pipelines, but optimizing performance and resource utilization can be a challenge. This talk introduces a novel performance testing framework designed to measure, monitor, and improve the efficiency of Airflow deployments. I’ll delve into the framework’s modular architecture, showcasing how it can be tailored to various Airflow setups (Docker, Kubernetes, cloud providers). By measuring key metrics across schedulers, workers, triggers, and databases, this framework provides actionable insights to identify bottlenecks and compare performance across different versions or configurations.

Empowering business analysts with DAG authoring IDE running 8000 workflows

by Daniil Dubin
At Wix more often than not business analysts build workflows themselves to avoid data engineers being a bottleneck. But how do you enable them to create SQL ETLs starting when dependencies are ready and sending emails or refreshing Tableau reports when the work is done? One simple answer may be to use Airflow. The problem is every BA cannot be expected to know Python and Git so well that they will create thousands of DAGs easily.

Event-driven Data Pipelines with Apache Airflow

by John Jackson
Airflow is all about schedules…we use CRON strings and Timetable to define schedules, and there’s an Airflow Scheduler component that manages those timetables, and a lot more, to ensure that DAGs and tasks are addressed based on those schedules. But what do you do if your data isn’t available on a schedule? What if data is coming from many sources, at varying times, and your job is to make sure it’s all as up-to-date as possible?

Evolution of Orchestration at GoDaddy: A Journey from On-prem to Cloud-based Single Pane Model

by Ozcan Ilikhan & Amit Kumar
Explore the evolutionary journey of orchestration within GoDaddy, tracing its transformation from initial on-premise deployment to a robust cloud-based Apache Airflow orchestration model. This session will detail the pivotal shifts in design, organizational decisions, and governance that have streamlined GoDaddy’s Data Platform and enhanced overall governance. Attendees will gain insights valuable for optimizing Airflow deployments and simplifying complex orchestration processes. Recap of the transformation journey and its impact on GoDaddy’s data operations.

From Oops to Ops: Smart Task Failure Diagnosis with OpenAI

by Nathan Hadfield
This session reveals an experimental venture integrating OpenAI’s AI technologies with Airflow, aimed at advancing error diagnosis. Through the application of AI, our objective is to deepen the understanding of issues, provide comprehensive insights into task failures, and suggest actionable solutions, thereby augmenting the resolution process. This method seeks to not only enhance diagnostic efficiency but also to equip data engineers with AI-informed recommendations. Participants will be guided through the integration journey, illustrating how AI can refine error analysis and potentially simplify troubleshooting workflows.

From Tech Specs to Business Impact: How to Design A Truly End-to-End Airflow Project

by Taylor Facen
There are many Airflow tutorials. However, many don’t show the full process of sourcing, transforming, testing, alerting, documenting, and finally supplying data. This talk will go over how to piece together an end-to-end Airflow project that transforms raw data to be consumable by the business. It will include how various technologies can all be orchestrated by Airflow to satisfy the needs of analysts, engineers, and business stakeholders. Each step and technology mentioned will be something that we at AngelList use, and code snippets will be sprinkled throughout so that attendees can implement this project within their organizations.

Full asynchronous programming in Apache Airflow

by Hussein Awala
Airflow uses multithreading in different components to parallelize the processing but relies heavily on synchronous execution, even for I/O blocking statements, which can slow down processing and increase resource usage. In this session, I’ll demonstrate how migrating Airflow’s components—such as the web server, REST API, executors, and scheduler—to asynchronous programming can significantly reduce execution time, making Airflow faster and more efficient than ever.
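The general idea behind that claim, overlapping I/O waits instead of running them one after another, can be sketched in plain `asyncio`. This is not Airflow code and not the speaker's implementation. `fake_io` is a hypothetical stand-in for a network or database call, and the component names are made up for the example.

```python
import asyncio
import time


async def fake_io(name, delay):
    """Hypothetical stand-in for an I/O-bound call; it sleeps instead of doing real work."""
    await asyncio.sleep(delay)
    return name


async def run_concurrently(delay):
    # gather() schedules all three coroutines at once, so their waits overlap.
    return await asyncio.gather(
        fake_io("db", delay), fake_io("api", delay), fake_io("logs", delay)
    )


def timed(delay=0.1):
    """Run the three calls concurrently and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(run_concurrently(delay))
    return results, time.perf_counter() - start
```

Run back to back, the three calls would take about three times `delay`. With `gather` the total stays close to a single `delay`, because the event loop switches between coroutines while each one waits.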

Gen AI using Airflow 3: A vision for Airflow RAGs

by Vikram Koka, Kaxil Naik & Ash Berlin-Taylor
Gen AI has taken the computing world by storm. As Enterprises and Startups have started to experiment with LLM applications, it has become clear that providing the right context to these LLM applications is critical. This process, known as Retrieval Augmented Generation (RAG), relies on adding custom data to the large language model, so that the efficacy of the response can be improved. Processing custom data and integrating with Enterprise applications is a strength of Apache Airflow.

How NerdWallet Halved Snowflake Costs with Airflow 2 Upgrade

by Adam Bayrami
NerdWallet’s multi-tenant Airflow setup faced challenges such as slow DAG processing, which resulted in underutilized Snowflake warehouses and elevated costs. Although our transition to Airflow 2 may have been delayed, we recognize that many other teams are also working to get buy-in and resources for their migration efforts. We hope that our story on how unlocking the scheduling capabilities of Airflow 2 helped us reduce our Snowflake spend by half can get you the buy in you need.

How Panasonic Leverages Airflow

by Michael Atondo
Using various operators to perform daily routines. Integration with Technologies:
Redis: Acts as a caching mechanism to optimize data retrieval and processing speed, enhancing overall pipeline performance.
MySQL: Utilized for storing metadata and managing task state information within Airflow’s backend database.
Tableau: Integrates with Airflow to generate interactive visualizations and dashboards, providing valuable insights into the processed data.
Amazon Redshift: Panasonic leverages Redshift for scalable data warehousing, seamlessly integrating it with Airflow for data loading and analytics.

How the Airflow Community Productionizes Generative AI

by Pete Dejoy
Every data team out there is being asked by their business stakeholders about Generative AI. Taking LLM-centric workloads to production is not a trivial task. At the foundational level, there are a set of challenges around data delivery, data quality, and data ingestion that mirror traditional data engineering problems. Once you’re past those, there’s a set of challenges related to the underlying use case you’re trying to solve. Thankfully, because of how Airflow was already being used at these companies for data engineering and MLOps use cases, it has become the de facto orchestration layer behind many GenAI use cases for startups and Fortune 500s.

How we run 100 Airflow environments and millions of Tasks as a Part Time job using Kubernetes

by Michael Juster
Balyasny Asset Management (BAM) is a diversified global investment firm founded in 2001 with over $20 billion in assets under management. We have more than 100 teams who run a variety of workloads that benefit from Orchestration and parallelization. Platform Engineers working for companies with K8s ecosystems can use their Kubernetes knowledge and leverage their platform to run Airflow and troubleshoot problems successfully. BAM’s Kubernetes Platform provides production-ready Airflow environments that automatically get Logging, Metrics, Alerting, Scalability, Storage from a range of File Systems, Authentication, Dashboards, Secrets Management, and specialized compute including GPU, CPU Optimized, Memory Optimized and even Windows.

How we tuned our Airflow to make 1.2 million DAG runs - per day!

by Jens Scheffler
As we deployed Airflow in our enterprise, connected to various event sources to implement our data-driven pipelines, we were faced with event storms a couple of times. As such event storms often happened unplanned and with increasing waves of load, we tuned the setup over multiple iterations. We were in a panic and sometimes also needed to add quick workarounds. Starting from a peak of 1000 triggers in an hour, we were happy that the workload just queued.

How we use Airflow at Booking to orchestrate Big Data workflows

by Madhav Khakhar & Alexander Shmidt
The talk will cover how we use Airflow at the heart of our Workflow Management Platform (WFM) at Booking.com, enabling our internal users to orchestrate big data workflows on Booking Data Exchange (BDX). High level overview of the talk:
Adapting open source Airflow helm chart to spin up Airflow installation in Booking Kubernetes Service (BKS)
Coming up with Workflow definition format (yaml)
Conversion of workflow.yaml to workflow.py DAGs
Usage of Deferrable operators to provide standard step templates to users
Workspaces (collection of workflows), using it to ensure role based access to DAG permissions for users
Using okta for authentication
Alerting, monitoring, logging
Plans to shift to Astronomer

Hybrid Executors: Have Your Cake and Eat it Too

by Niko Oliveira
Executors are a core concept in Apache Airflow and they are an essential piece to the execution of DAGs. They continue to see investment and innovation including a new feature launching this year: Hybrid Execution. This talk will give a brief overview of executors, how they work and what they are responsible for, followed by a description of Hybrid Executors (AIP-61), a new feature to allow multiple executors to be used natively and seamlessly side by side within a single Airflow environment.

Investigating the Many Loops of the Airflow Scheduler

by Philippe Gagnon
The scheduler is unarguably the most important component of an Airflow cluster. It is also the most complex and misunderstood by practitioners and administrators alike. In this talk, we will follow the path that a task instance takes to progress from creation to execution, and discuss the various configuration settings allowing users to tune the scheduler and executor to suit their workload patterns. Finally, we will dive deep into critical sections of the Airflow codebase and explore opportunities for optimization.

Lessons from the Ecosystem: What can Airflow Learn from Other Open-source Communities?

by Michael Robinson
The Apache Airflow community is so large and active that it’s tempting to take the view that “if it ain’t broke don’t fix it.” In a community as in a codebase, however, improvement and attention are essential to sustaining growth. And bugs are just as inevitable in community management as they are in software development. If only the fixes were, too! Airflow is large and growing because users love Airflow and our community.

Linkedin's Continuous Deployment

by Rahul Gade & Keshav Tyagi
LinkedIn Continuous Deployment (LCD), started with the goal of improving the deployment experience and expanding its outreach to all LinkedIn systems. LCD delivers a modern deployment UX and easy-to-customize pipelines which enables all LinkedIn applications to declare their deployment pipelines. LCD’s vision is to automate cluster provisioning, deployments and enable touchless (continuous) deployments while reducing the manual toil involved in deployments. LCD is powered by Airflow to orchestrate its deployment pipelines and automate the validation steps.

Managing version upgrades without feelings of terror

by Daniel Standish
Airflow version upgrades can be challenging. Maybe you upgrade and your dags fail to parse (that’s an easy fix). Or maybe you upgrade and everything looks fine, but when your dag runs, you can no longer connect to MySQL because the TLS version changed. In this talk I will provide concrete strategies that users can put into practice to make version upgrades safer and less painful. Topics may include: What semver means and what it implies for the upgrade process; Using integration test dags, unit tests, and a test cluster to smoke out problems; Strategies around constraints files / pinning, and managing providers vs core versions; Using db clean prior to upgrade to reduce table size; Rollback strategies; What to do about warnings (e.

Mastering Advanced Dataset Scheduling in Apache Airflow

by Ankit Chaurasia
Are you looking to harness the full potential of data-driven pipelines with Apache Airflow? This session will dive into the newly introduced conditional expressions for advanced dataset scheduling in Airflow - a feature highly requested by the Airflow community. Attendees will learn how to effectively use logical operators to create complex dependencies that trigger DAGs based on the dataset updates in real-world scenarios. We’ll also explore the innovative DatasetOrTimeSchedule, which combines time-based and dataset-triggered scheduling for unparalleled flexibility.
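To make the idea of combining dataset conditions with logical operators concrete, here is a minimal pure-Python sketch. It deliberately does not use Airflow's own classes: `Condition`, `DatasetRef`, `AllOf`, `AnyOf`, and `is_satisfied` are hypothetical names invented for this illustration, and the URIs are placeholders. It only models how an AND/OR expression over datasets could be evaluated against the set of datasets that have been updated.

```python
from __future__ import annotations

from dataclasses import dataclass


class Condition:
    """Hypothetical base class: a boolean condition over dataset updates."""

    def is_satisfied(self, updated: set[str]) -> bool:
        raise NotImplementedError

    def __and__(self, other: Condition) -> Condition:
        # a & b is satisfied only when both sides are satisfied
        return AllOf(self, other)

    def __or__(self, other: Condition) -> Condition:
        # a | b is satisfied when either side is satisfied
        return AnyOf(self, other)


@dataclass(frozen=True)
class DatasetRef(Condition):
    """Hypothetical leaf condition: true once this URI has been updated."""

    uri: str

    def is_satisfied(self, updated: set[str]) -> bool:
        return self.uri in updated


class AllOf(Condition):
    def __init__(self, *conditions: Condition) -> None:
        self.conditions = conditions

    def is_satisfied(self, updated: set[str]) -> bool:
        return all(c.is_satisfied(updated) for c in self.conditions)


class AnyOf(Condition):
    def __init__(self, *conditions: Condition) -> None:
        self.conditions = conditions

    def is_satisfied(self, updated: set[str]) -> bool:
        return any(c.is_satisfied(updated) for c in self.conditions)


# Placeholder datasets for the example.
orders = DatasetRef("s3://example-bucket/orders")
customers = DatasetRef("s3://example-bucket/customers")
refunds = DatasetRef("s3://example-bucket/refunds")

# Trigger when both orders and customers are updated, or when refunds is.
trigger = (orders & customers) | refunds
```

With this sketch, `trigger.is_satisfied({orders.uri})` is false because `customers` is still missing, while an update to `refunds` alone satisfies the expression. The choice of `&` and `|` here belongs to the sketch; the session itself covers the operators Airflow actually provides.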

Mastering LLM Batch Pipelines: Handling Rate Limits, Asynchronous APIs, and Cloud Scalability

by Avichay Marciano
As large language models (LLMs) gain traction, companies encounter challenges in deploying them effectively. This session focuses on using Airflow to manage LLM batch pipelines, addressing rate limits and optimizing asynchronous batch APIs. We will discuss strategies for managing cloud provider rate limits efficiently to ensure uninterrupted, cost-effective LLM operations. This includes queuing and job prioritization techniques to optimize throughput. Additionally, we’ll explore asynchronous batch processing for tasks such as Retrieval Augmented Generation (RAG) and vector embedding, which enhance processing efficiency and reduce latency.
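As one possible reading of "queuing and job prioritization" under a rate limit, the sketch below packs prioritized jobs into rate-limit windows with a fixed token budget. It is an assumption-laden illustration, not the speaker's method: `plan_batches`, the job names, the priorities, and the token figures are all hypothetical.

```python
import heapq


def plan_batches(jobs, tokens_per_window):
    """Group jobs into consecutive rate-limit windows, highest priority first.

    jobs: list of (priority, name, tokens); a lower priority number runs earlier.
    tokens_per_window: hypothetical token budget the provider allows per window.
    """
    # The index breaks ties so jobs with equal priority keep submission order.
    heap = [
        (priority, index, name, tokens)
        for index, (priority, name, tokens) in enumerate(jobs)
    ]
    heapq.heapify(heap)

    windows, current, used = [], [], 0
    while heap:
        _, _, name, tokens = heapq.heappop(heap)
        if tokens > tokens_per_window:
            raise ValueError(f"{name} alone exceeds the per-window budget")
        if used + tokens > tokens_per_window:
            # Budget exhausted: close this window and start the next one.
            windows.append(current)
            current, used = [], 0
        current.append(name)
        used += tokens
    if current:
        windows.append(current)
    return windows


# Hypothetical workload: (priority, job name, estimated tokens).
example_jobs = [
    (2, "embed_docs", 600),
    (1, "rag_answers", 500),
    (3, "backfill", 400),
]
example_plan = plan_batches(example_jobs, tokens_per_window=1000)
```

Here `example_plan` places `rag_answers` in the first window and `embed_docs` plus `backfill` in the second. This greedy plan keeps strict priority order; a real pipeline might instead reorder jobs to fill each window more tightly.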

OpenLineage: From Operators to Hooks

by Maciej Obuchowski
“More data lineage” was the second most popular feature request in the Airflow Survey 2023. However, despite the integration of OpenLineage in Airflow 2.7 through AIP-53, the most popular Operator in Airflow - PythonOperator - isn’t covered by lineage support. With the addition of the TaskFlow API, Airflow Datasets, Airflow ObjectStore, and many other small changes, writing DAGs without using other operators is easier than ever. That’s why lineage collection in Airflow is moving beyond covering specific Operators to covering Hooks and Object Storage.

Optimize Your DAGs: Embrace Dag Params for Efficiency and Simplicity

by Sumit Maheshwari
In the realm of data engineering, there is a prevalent tendency for professionals to develop similar Directed Acyclic Graphs (DAGs) to manage analogous tasks. Leveraging Dag Params presents an effective strategy for mitigating redundancy within these DAGs. Moreover, the utilization of Dag Params facilitates seamless enforcement of user inputs, thereby streamlining the process of incorporating validations into the DAG codebase.

Optimizing Airflow Performance: Strategies, Techniques, and Best Practices

by Pankaj Singh & Pankaj Koti
Airflow, an open-source platform for orchestrating complex data workflows, is widely adopted for its flexibility and scalability. However, as workflows grow in complexity and scale, optimizing Airflow performance becomes crucial for efficient execution and resource utilization. This session delves into the importance of optimizing Airflow performance and provides strategies, techniques, and best practices to enhance workflow execution speed, reduce resource consumption, and improve system efficiency. Attendees will gain insights into identifying performance bottlenecks, fine-tuning workflow configurations, leveraging advanced features, and implementing optimization strategies to maximize pipeline throughput.

Orchestrating & Optimizing a Batch Ingestion Data Platform for America's #1 Sportsbook

by Gunnar Lykins
FanDuel Group, an industry leader in sports-tech entertainment, is proud to be recognized as the #1 sports betting company in the US as of 2023 with 53.4% market share. With a workforce exceeding 4,000 employees, including over 100 data engineers, FanDuel Group is at the forefront of innovation in batch processing orchestration platforms. Currently, our platform handles over 250,000 DAG runs & executes ~3 million tasks monthly across 17 deployments. It provides a standardized framework for pipeline development, structured observability, monitoring, & alerting.

Orchestration of ML workloads via Airflow & GKE Batch

by Rafal Biegacz
During this talk we are going to give an overview of different orchestration approaches (Kubeflow, Ray, Airflow, etc.) for running ML workloads on Kubernetes. Specifically, we will focus on how to use the Kubernetes Batch API and Kubernetes Operators to run complex ML workloads.

Overcoming Custom Python Package Hurdles in Airflow

by Amogh Desai & Shubham Raj
DAG authors, while constructing DAGs, generally use native libraries provided by Airflow in conjunction with Python libraries available from public PyPI repositories. But sometimes, DAG authors need to construct DAGs using libraries that are either in-house or not available from public PyPI repositories. This poses a serious challenge for users who want to run their custom code with Airflow DAGs, particularly when Airflow is deployed in a cloud-native fashion. Traditionally, these packages are baked into Airflow Docker images.

Overcoming performance hurdles in Integrating dbt with Airflow

by Tatiana Al-Chueyr Martins & Pankaj Koti
The integration between dbt and Airflow is a popular topic in the community, in previous editions of Airflow Summit, at Coalesce, and in the #airflow-dbt Slack channel. Astronomer Cosmos (https://github.com/astronomer/astronomer-cosmos/) stands out as one of the libraries that strives to enhance this integration, having over 300k downloads per month. During its development, we’ve encountered various performance challenges in terms of scheduling and task execution. While we’ve managed to address some, others remain to be resolved.

Profiling Airflow tasks with Memray

by Cedrik Neumann
Profiling Airflow tasks can be difficult, especially in remote environments. In this talk I will demonstrate how we can leverage Airflow’s plugin mechanism to selectively run Airflow tasks within the context of a profiler and, with the help of operator links and custom views, make the results available to the user. The content of this talk can provide inspiration on how Airflow may in the future allow the gathering of custom task metrics and make those metrics easily accessible.

Running Airflow tasks anywhere, in any language

by Ash Berlin-Taylor & Vikram Koka
Imagine a world where writing Airflow tasks in languages like Go, R, Julia, or maybe even Rust is not just a dream but a native capability. Say goodbye to BashOperators; welcome to the future of Airflow task execution. Here’s what you can expect to learn from this session: Multilingual Tasks: Explore how we empower DAG authors to write tasks in any language while retaining seamless access to Airflow Variables and Connections.

Scalable Development of Event Driven Airflow DAGs

by Subramanian Vellaiyan & Ipsa Trivedi
This use case shows how we deal with data of different varieties from different sources. Each source sends data in different layouts, timings, structures, location patterns, and sizes. The goal is to process the files within SLA and send them out. This is a complex multi-step processing pipeline that involves multiple Spark jobs, API-based integrations with microservices, resolving unique IDs, deduplication, and filtering. Note that this is an event-driven system, but not a streaming data system.

Scaling AI Workloads with Apache Airflow

by Shubham Mehta & Rajesh Bishundeo
AI workloads are becoming increasingly complex, with unique requirements around data management, compute scalability, and model lifecycle management. In this session, we will explore the real-world challenges users face when operating AI at scale. Through real-world examples, we will uncover common pitfalls in areas like data versioning, reproducibility, model deployment, and monitoring. Our practical guide will highlight strategies for building robust and scalable AI platforms leveraging Airflow as the orchestration layer and AWS for its extensive AI/ML capabilities.

Scaling Airflow for Data Productivity at Instacart

by Anant Agarwal
In this talk, we’ll discuss how Instacart leverages Apache Airflow to orchestrate a vast network of data pipelines, powering both our core infrastructure and dbt deployments. As a data-driven company, Airflow plays a critical role in enabling us to execute large and intricate pipelines securely, compliantly, and at scale. We’ll delve into the following key areas: a. High-Throughput Cluster Management: We’ll explore how we manage and maintain our Airflow cluster, ensuring the efficient execution of over 2,000 DAGs across diverse use cases.

Security United: collaborative effort on securing Airflow ecosystem with Alpha-Omega, PSF & ASF

by Jarek Potiuk, Michael Winser & Seth Michael Larson
Airflow’s power comes from its vast ecosystem, but securing this intricate web requires a united front. This talk unveils a groundbreaking collaborative effort between the Python Software Foundation (PSF), the Apache Software Foundation (ASF), the Airflow Project Management Committee (PMC), and Alpha-Omega Fund - aimed at securing not only Airflow, but the whole ecosystem. We’ll explore this new project dedicated to improving security across the Airflow landscape.

Seeing Clearly with Airflow: The Shift to Data-Aware Orchestration

by Constance Martineau
Join me at this year’s Airflow Summit as we delve into a pivotal evolution for Apache Airflow: The integration of data awareness. Airflow has long excelled as a workflow orchestration tool, managing complex workflows with ease and efficiency. However, it has operated with limited insight into the data it manipulates or the assets it produces. This talk will explore the implications and benefits of embedding deeper insights about these outputs directly into Airflow.

Simplified user management in Airflow

by Vincent Beck
Before Airflow 2.9, user management was part of core Airflow, so modifying or customizing it to fit user needs was not an easy process. Authentication and authorization managers (auth managers) are a new concept introduced in Airflow 2.9. They were introduced as extensible user management (AIP-56), giving Airflow users a flexible way to integrate with their organization’s identity services. Organizations want a single place to manage permissions, and FAB (Flask App Builder) made that difficult to achieve.

Streamline data science workflow development using Jupyter notebooks and Airflow

by Neha Singla & Sathish kumar Thangaraj
Jupyter Notebooks are widely used by data scientists and engineers to prototype and experiment with data. However, these engineers are often required to work with other data or platform engineers to productionize these experiments because of the complexity of navigating infrastructure and systems. In this talk, we will deep dive into this PR https://github.com/apache/airflow/pull/34840 and share how Airflow can be leveraged as a platform to execute notebook pipelines (Python, Scala or Spark) in dynamic environments like Kubernetes for various heterogeneous use cases.

Streamlining a Mortgage ETL Pipeline with Apache Airflow

by Zhang Zhang & Jenny Gao
At Bloomberg, it is our team’s responsibility to ensure the timely delivery to our clients worldwide of a vast dataset comprising approximately 5 billion data points on roughly 50 million loans and over 1.4 million securities, disclosed twice a month by three major government-sponsored mortgage entities. Ingesting this data so we can create and derive complex data structures to be consumed by our applications for our clients has been our biggest challenge.

The Essentials of Custom Executor Development

by Dennis Ferruzzi & Syed Hussain
Since version 2.7 and the advent of AIP-51, Airflow has started to fully support the creation of custom executors. Before we dive into the components of an executor and how they work, we will briefly discuss the Executor Decoupling initiative which allowed this new feature. Once we understand the parts required, we will explore the process of crafting our own executors, using real-world examples, and demonstrations of executors developed within the Amazon Provider Package as a guide.

The road ahead: What’s coming in Airflow 3 and beyond?

by Vikram Koka
Apache Airflow has emerged as the de facto standard for data orchestration. Over the last couple of years, Airflow has also seen increasing adoption for ML and AI use cases. It has been almost four years since the release of Airflow 2 and as a community we have agreed that it’s time for a major foundational release in the form of Airflow 3. This talk will introduce the vision behind Airflow 3, including the emerging technology trends in the industry and how Airflow will evolve in response.

The Silent Symphony: Keeping Airflow's CI/CD and Dev Tools in Tune

by Jarek Potiuk
Apache Airflow relies on a silent symphony behind the scenes: its CI/CD (Continuous Integration/Continuous Delivery) and development tooling. This presentation explores the critical role these tools play in keeping Airflow efficient and innovative. We’ll delve into how robust CI/CD ensures bug fixes and improvements are seamlessly integrated, while well-maintained development tools empower developers to contribute effectively. Airflow’s power comes from a well-oiled machine – its CI/CD and development tools. This presentation dives into the world of these often-overlooked heroes.

Unleash the Power of AI: Streamlining Airflow DAG Development with AI-Driven Automation

by Sriharsh Adari & Jeetendra Vaidya
Nowadays, conversational AI is no longer exclusive to large enterprises. It has become more accessible and affordable, opening up new possibilities and business opportunities. In this session, discover how you can leverage Amazon Bedrock as your AI pair programmer to suggest DAG code and recommend entire functions in real-time, directly from your editor. Visualize how to harness the power of ML, trained on billions of lines of code, to transform natural language prompts into coding suggestions.

Unlocking FMOps/LLMOps using Apache Airflow: A guide to operationalizing and managing Large Language

by Parnab Basak
In the last few years Large Language Models (LLMs) have risen to prominence as outstanding tools capable of transforming businesses. However, bringing such solutions and models to the business-as-usual operations is not an easy task. In this session, we delve into the operationalization of generative AI applications using MLOps principles, leading to the introduction of foundation model operations (FMOps) or LLM operations using Apache Airflow. We further zoom into aspects of expected people and process mindsets, new techniques for model selection and evaluation, data privacy, and model deployment.

Unlocking the Power of AI at Ford: A Behind-the-Scenes Look at Mach1ML and Airflow

by Elona Zharri, Nikhil Nandoskar & Prince Bose
Ford Motor Company is undergoing a significant transformation, embracing AI and Machine Learning to power its smart mobility strategy, enhance customer experiences, and drive innovation in the automotive industry. Mach1ML, Ford’s multi-million dollar ML platform, plays a crucial role in this journey by empowering data scientists and engineers to efficiently build, deploy, and manage ML models at scale. This presentation will delve into how Mach1ML leverages Apache Airflow as its orchestration layer to tackle the challenges of complex ML workflows that include disparate systems, manual processes, security concerns, and deployment complexities.

Unlocking the Power of Airflow Beyond Data Engineering at Cloudflare

by Jet Mariscal
While Airflow is widely known for orchestrating and managing workflows, particularly in the context of data engineering, data science, ML (Machine Learning), and ETL (Extract, Transform, Load) processes, its flexibility and extensibility make it a highly versatile tool suitable for a variety of use cases beyond these domains. In fact, Cloudflare has publicly shared in the past an example of how Airflow was leveraged to build a system that automates datacenter expansions.

Using Airflow for Social Impact by Provisioning Datasets According to Demand

by Albert Okiri
We use Airflow to provide datasets for analytics according to user demands and initiatives being undertaken at the time. The ease of ingesting data by dynamically generating and triggering workflows that correspond to configuration files enables efficient workflows and social impact. We use Airflow to empower decision makers by enabling timely provision of datasets that are tested on quality and other metrics using inbuilt Airflow features. These datasets are accessible at a Superset instance and create an ‘on-demand for data’ approach to data analysis that is optimized and leads to positive and effective outcomes.

Using Operational Data

by Olivier Daneau
Cost management is a continuous challenge for our data teams at Astronomer. Understanding the expenses associated with running our workflows is not always straightforward, and identifying which process ran a query causing unexpected usage on a given day can be time-consuming. In this talk, we will showcase an Airflow Plugin and specific DAGs developed and used internally at Astronomer to track and optimize the costs of running DAGs. Our internal tool monitors Snowflake query costs, provides insights, and sends alerts for abnormal usage.

Using the power of Apache Airflow and Ray for Scalable AI deployments

by Venkata Jagannath & Marwan Sarieddine
Many organizations struggle to create a well-orchestrated AI infrastructure, using separate and disconnected platforms for data processing, model training, and inference, which slows down development and increases costs. There’s a clear need for a unified system that can handle all aspects of AI development and deployment, regardless of the size of data or models. Join our breakout session to see how our comprehensive solution simplifies the development and deployment of large language models in production.

Weathering the Cloud Storms With Multi-Region Airflow Workflows

by Amit Chauhan
Cloud availability zones and regions are not immune to outages. These zones regularly go down, and regions become unavailable due to natural disasters or human-caused incidents. Thus, if an availability zone or region goes down, so do your Airflow workflows and applications… unless your Airflow workflows function across multiple geographic locations. This hands-on session introduces you to the design patterns of multi-region Airflow workflows in the cloud, which can tolerate zone and region-level incidents.

What If...? Running Airflow Tasks without the workers

by Wei Lee
Airflow executes all tasks on the workers, including deferrable operators that must run on the workers before deferring to the triggerer. However, running some tasks directly from the triggerer can be beneficial in certain situations. This presentation will explain how deferrable operators function and examine ways to modify the Airflow implementation to enable tasks to run directly from the triggerer.
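The worker, triggerer, worker hand-off described above can be modeled with a small `asyncio` sketch. This is a toy model under stated assumptions, not Airflow's implementation: `start_on_worker`, `wait_for_condition`, `triggerer`, and `resume_on_worker` are hypothetical names, and the delays stand in for whatever external condition a real trigger would wait on.

```python
import asyncio


async def wait_for_condition(name, delay):
    """Stand-in for a trigger: waits without blocking, then emits an event."""
    await asyncio.sleep(delay)
    return {"task": name, "status": "ready"}


def start_on_worker(name, delay, log):
    """Phase 1: the task occupies a worker briefly, then defers.

    It hands back a trigger (here, a coroutine) and frees the worker slot.
    """
    log.append(f"{name}: started on worker, deferring")
    return wait_for_condition(name, delay)


async def triggerer(triggers):
    """A single event loop waits on many triggers concurrently."""
    return await asyncio.gather(*triggers)


def resume_on_worker(event, log):
    """Phase 2: a worker picks the task up again with the trigger's event."""
    log.append(f"{event['task']}: resumed on worker with status={event['status']}")


def run_demo():
    log = []
    triggers = [
        start_on_worker("task_a", 0.02, log),
        start_on_worker("task_b", 0.01, log),
    ]
    # gather() returns events in the order the triggers were submitted.
    for event in asyncio.run(triggerer(triggers)):
        resume_on_worker(event, log)
    return log
```

In this model each task touches a worker twice, once to start and once to resume, while the waiting happens in the shared event loop. The question the session raises is whether some tasks could skip the worker phases and run from the triggerer directly.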

Why Do Airflow Tasks Fail? A Principal Component Analysis

by Julian LaNeve & David Xue
There are 3 certainties in life: death, taxes, and data pipelines failing. Pipelines may fail for a number of reasons: you may run out of memory, your credentials may expire, an upstream data source may not be reliable, etc. But there are patterns we can learn from! Join us as we walk through an analysis we’ve done on a massive dataset of Airflow failure logs. We’ll show how we used principal component analysis and other dimensionality reduction methods to explore the latent space of Airflow task failures in order to cluster, visualize, and understand failures.

Winning Strategies: Powering a World Series Victory with Airflow Orchestration

by Alexander Booth & Oliver Dykstra
Dive into the winning playbook of the 2023 World Series Champions Texas Rangers, and discover how they leverage Apache Airflow to streamline their data pipelines. In this session, we’ll explore how real-world data pipelines enable agile decision-making and drive competitive advantage in the high-stakes world of professional baseball, all by using Airflow as an orchestration platform. Whether you’re a seasoned data engineer or just starting out, this session promises actionable strategies to elevate your data orchestration game to championship levels.

Workshop: Leveraging Automation Analytics & Intelligence with Airflow for Enhanced Workflow Observability

by Chetan Kapoor
In this workshop we will discuss how easy it is to integrate Automation Analytics & Intelligence (AAI) and Airflow, including details on the architecture and implementation strategies. We will then analyze the benefits of this combined approach, showcasing how it can streamline workflow management, enhance operational efficiency, and bolster the overall resilience of automated processes. The integration of these tools can empower organizations with: Proactive management with AAI: AAI’s predictive analytics anticipate potential issues within Airflow workflows, enabling proactive intervention and preventing disruptions.