Welcome to the session program for Airflow Summit 2024.


Tuesday, September 10, 2024

09:00 - 09:10   Welcome
10:30 - 11:00   Morning break
13:00 - 14:00   Lunch
15:35 - 16:00   Afternoon break
17:35 - 19:40   Event reception

Session blocks begin at 09:10, 10:05, 11:00, 12:00, 14:00, 14:35, 15:10, 16:00, 16:35 and 17:10; details below.
09:10 - 10:05.
By Kenten Danas, John Jackson, Marc Lamberti, Rafal Biegacz, Ash Berlin-Taylor & Elad Kalif
Track: Keynote
10 years of Airflow: history, insights, and looking forward

10 years after its creation, Airflow is stronger than ever: in last year’s Airflow survey, 81% of users said Airflow is important or very important to their business, 87% said their Airflow usage has grown over time, and 92% said they would recommend Airflow. In this panel discussion, we’ll celebrate a decade of Airflow and delve into how it became the highly recommended industry standard it is today, including history, pivotal moments, and the role of the community. Our panel of seasoned experts will also talk about where Airflow is going next, including future use cases like generative AI and the highly anticipated Airflow 3.0. Don’t miss this insightful exploration into one of the most influential tools in the data landscape.

Grand Ballroom
10:05 - 10:30.
By Michael Winser & Jarek Potiuk
Track: Keynote
Security United: collaborative effort on securing Airflow ecosystem with Alpha-Omega, PSF & ASF

Airflow’s power comes from its vast ecosystem, but securing this intricate web requires a united front. This talk unveils a groundbreaking collaborative effort between the Python Software Foundation (PSF), the Apache Software Foundation (ASF), the Airflow Project Management Committee (PMC), and Alpha-Omega Fund - aimed at securing not only Airflow, but the whole ecosystem. We’ll explore this new project dedicated to improving security across the Airflow landscape.

Grand Ballroom
11:00 - 11:45.
By Kacper Muda & Eric Veleker
Track: Best practices
Activating operational metadata with Airflow, Atlan and OpenLineage

OpenLineage is an open standard for lineage data collection, integrated into the Airflow codebase, facilitating lineage collection across providers like Google, Amazon, and more.

Atlan Data Catalog is a third-generation active metadata platform: a single source of trust unifying the cataloging, data discovery, lineage, and governance experience.

We will demonstrate what OpenLineage is and how, with a minimal and intuitive setup across Airflow and Atlan, it provides a unified view of workflows and efficient cross-platform lineage collection, down to column level, across technologies (Python, Spark, dbt, SQL, etc.) and clouds (AWS, Azure, GCP, etc.) - all orchestrated by Airflow.

This integration unlocks further use cases in automated metadata management by making operational pipelines dataset-aware for self-service exploration. We will also demonstrate real-world challenges, and their resolutions, for lineage consumers improving audit and compliance accuracy through column-level lineage traceability across the data estate.

The talk will also briefly overview the most recent OpenLineage developments and planned future enhancements.

Georgian
11:00 - 11:45.
By Bonnie Why
Track: Use cases
Airflow at Burns & McDonnell | Orchestration from zero to 100

As the largest employee-owned engineering and construction firm in the United States, Burns & McDonnell has a massive amount of data. Not only that, it’s hard to pinpoint which source system has the data we need. Our solution to this challenge is to build a unified information platform — a single source of truth where all of our data is searchable, trustworthy, and accessible to our employee-owners and the projects that need it.

Everyone’s data is important and everyone’s use case is a priority, so how can we get this done quickly? In this session, I will tell you all about how we went from having zero knowledge in Airflow to ingesting many unique and disconnected data sources into our data lakehouse in less than a day.

Come hear the story about how our data team at Burns & McDonnell is using Airflow as an orchestrator to create a scalable, trustworthy data platform that will empower our system to evolve with the ever-changing technology landscape.

California East
11:00 - 11:45.
By Brent Bovenzi
Track: New features
Airflow UI Roadmap

Soon we will finally switch to a 100% React UI, with a full separation between the API and UI as well. While we are making such a big change, let’s also take the opportunity to imagine whole new interfaces rather than simply modernizing the existing views. How can we use design to help you better understand what is going on with your DAG?

Come listen to some of our proposed ideas and bring your own big ideas as the second half will be an open discussion.

Elizabethan A+B
11:00 - 11:45.
By Avichay Marciano
Track: Airflow & ...
Mastering LLM Batch Pipelines: Handling Rate Limits, Asynchronous APIs, and Cloud Scalability

As large language models (LLMs) gain traction, companies encounter challenges in deploying them effectively. This session focuses on using Airflow to manage LLM batch pipelines, addressing rate limits and optimizing asynchronous batch APIs. We will discuss strategies for managing cloud provider rate limits efficiently to ensure uninterrupted, cost-effective LLM operations. This includes queuing and job prioritization techniques to optimize throughput. Additionally, we’ll explore asynchronous batch processing for tasks such as Retrieval Augmented Generation (RAG) and vector embedding, which enhance processing efficiency and reduce latency. The session features a hands-on demonstration on AWS’s managed Airflow service, providing practical insights into configuring and scaling LLM workflows in the cloud.
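For a flavor of the rate-limit handling discussed here, a minimal sketch (not the speaker's code) of throttling LLM calls with an Airflow pool plus exponential-backoff retries; the pool name, model call, and document IDs are placeholders:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.decorators import task

    # Assumes a pool created beforehand, e.g.: airflow pools set llm_api 4 "LLM call slots"
    with DAG(
        dag_id="llm_batch_example",
        start_date=datetime(2024, 1, 1),
        schedule=None,
        catchup=False,
    ):

        @task(
            pool="llm_api",                  # caps how many calls run at once
            retries=5,
            retry_delay=timedelta(seconds=10),
            retry_exponential_backoff=True,  # back off when the provider throttles us
        )
        def summarize(doc_id: str) -> str:
            # A real task would call the LLM API here and raise on HTTP 429,
            # letting Airflow's retry machinery absorb rate-limit errors.
            return f"summary-for-{doc_id}"

        # Dynamic task mapping fans out one mapped task per document.
        summarize.expand(doc_id=["doc-1", "doc-2", "doc-3"])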

California West
12:00 - 12:45.
By John Jackson
Track: Best practices
Event-driven Data Pipelines with Apache Airflow

Airflow is all about schedules: we use cron strings and Timetables to define schedules, and there’s an Airflow Scheduler component that manages those timetables, and a lot more, to ensure that DAGs and tasks run according to those schedules.

But what do you do if your data isn’t available on a schedule? What if data is coming from many sources, at varying times, and your job is to make sure it’s all as up-to-date as possible? An event-driven data pipeline may be the answer.

An event-driven architecture (or EDA) is an architecture pattern that uses events to decouple an application’s components. It relies on external events, not an internal schedule, to create loosely coupled data pipelines that determine when to take action, and what actions to take. In this session, we will discuss the design considerations when using Airflow in an EDA and the tools Airflow has to make this happen, including Datasets, REST API, Dynamic Task Mapping, custom Timetables, Sensors, and queues.
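To make the Dataset mechanism above concrete, a minimal producer/consumer sketch (the dataset URI is a placeholder):

    from datetime import datetime
    from airflow import DAG, Dataset
    from airflow.operators.empty import EmptyOperator

    orders = Dataset("s3://example-bucket/orders.parquet")

    with DAG("producer", start_date=datetime(2024, 1, 1),
             schedule="@hourly", catchup=False):
        # Declaring the dataset as an outlet tells Airflow this task updates it.
        EmptyOperator(task_id="land_orders", outlets=[orders])

    with DAG("consumer", start_date=datetime(2024, 1, 1),
             schedule=[orders], catchup=False):
        # Runs whenever the producer reports fresh data, not on a clock.
        EmptyOperator(task_id="transform_orders")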

Georgian
12:00 - 12:45.
By Gunnar Lykins
Track: Use cases
Orchestrating & Optimizing a Batch Ingestion Data Platform for America's #1 Sportsbook

FanDuel Group, an industry leader in sports-tech entertainment, is proud to be recognized as the #1 sports betting company in the US as of 2023 with 53.4% market share. With a workforce exceeding 4,000 employees, including over 100 data engineers, FanDuel Group is at the forefront of innovation in batch processing orchestration platforms. Currently, our platform handles over 250,000 DAG runs & executes ~3 million tasks monthly across 17 deployments. It provides a standardized framework for pipeline development, structured observability, monitoring, & alerting. It also offers automated data processing managed by an in-house team, enabling stakeholders to concentrate on core business objectives. Our batch ingestion platform is the backbone of endless use cases, facilitating the landing of data into storage at scheduled intervals, real-time ingestion of micro batches triggered by events, standardization processes, & ensuring data availability for downstream applications. Our proposed session also delves into our forward-looking tech strategy as well as addressing the expansion of orchestration diversity by integrating scheduled jobs from various domains into our robust data platform.

California East
12:00 - 12:45.
By Ash Berlin-Taylor & Vikram Koka
Track: New features
Running Airflow tasks anywhere, in any language

Imagine a world where writing Airflow tasks in languages like Go, R, Julia, or maybe even Rust is not just a dream but a native capability. Say goodbye to BashOperators; welcome to the future of Airflow task execution.

Here’s what you can expect to learn from this session:

  • Multilingual Tasks: Explore how we empower DAG authors to write tasks in any language while retaining seamless access to Airflow Variables and Connections.

  • Simplified Development and Testing: Discover how a standardized interface for task execution promises to streamline development efforts and elevate code maintainability.

  • Enhanced Scalability and Remote Workers: Learn how enabling tasks to run on remote workers opens up possibilities for seamless deployment on diverse platforms, including Windows and remote Spark or Ray clusters. Experience the convenience of effortless deployments as we unlock new avenues for Airflow usage.

Join us as we embark on an exploratory journey to shape the future of Airflow task execution. Your insights and contributions are invaluable as we refine this vision together. Let’s chart a course towards a more versatile, efficient, and accessible Airflow ecosystem.

Elizabethan A+B
12:00 - 12:45.
By Sriharsh Adari, Jeetendra Vaidya & Joseph Morotti
Track: Airflow & ...
Unleash the Power of AI: Streamlining Airflow DAG Development with AI-Driven Automation

Nowadays, conversational AI is no longer exclusive to large enterprises. It has become more accessible and affordable, opening up new possibilities and business opportunities. In this session, discover how you can leverage Generative AI as your AI pair programmer to suggest DAG code and recommend entire functions in real-time, directly from your editor. Visualize how to harness the power of ML, trained on billions of lines of code, to transform natural language prompts into coding suggestions. Seamlessly cycle through lines of code, complete function suggestions, and choose to accept, reject, or edit them. Witness firsthand how Generative AI provides recommendations based on the project’s context and style conventions. The objective is to equip you with techniques that allow you to spend less time on boilerplate and repetitive code patterns, and more time on what truly matters: building exceptional orchestration software.

California West
14:00 - 14:25.
By Luan Moreno Medeiros Maciel
Track: Airflow Intro talks
dbt-Core & Airflow 101: Building Data Pipelines Demystified

dbt became the de facto standard for data teams building reliable and trustworthy SQL code leveraging a modern data stack architecture. The dbt logic needs to be orchestrated, and jobs scheduled to meet business expectations. That’s where Airflow comes into play.

In this quick introduction session, you will learn how to:

  • Leverage dbt-Core & Airflow to orchestrate pipelines (a minimal sketch follows this list)

  • Write DAGs in a Pythonic way

  • Apply best practices on your jobs
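A minimal sketch of the pattern the session introduces, assuming a dbt-Core project at a placeholder path:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG("dbt_daily", start_date=datetime(2024, 1, 1),
             schedule="@daily", catchup=False):
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command="dbt run --project-dir /opt/dbt/my_project",
        )
        dbt_test = BashOperator(
            task_id="dbt_test",
            bash_command="dbt test --project-dir /opt/dbt/my_project",
        )
        dbt_run >> dbt_test  # only test models after they build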

Elizabethan A+B
14:00 - 14:25.
By Sumit Maheshwari
Track: Best practices
Optimize Your DAGs: Embrace Dag Params for Efficiency and Simplicity

In the realm of data engineering, there is a prevalent tendency for professionals to develop similar Directed Acyclic Graphs (DAGs) to manage analogous tasks. Leveraging Dag Params presents an effective strategy for mitigating redundancy within these DAGs. Moreover, the utilization of Dag Params facilitates seamless enforcement of user inputs, thereby streamlining the process of incorporating validations into the DAG codebase.
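For illustration (not the speaker's code), a short sketch of Dag Params with trigger-time validation; the parameter names and bounds are invented:

    from datetime import datetime
    from airflow import DAG
    from airflow.decorators import task
    from airflow.models.param import Param

    with DAG(
        dag_id="param_example",
        start_date=datetime(2024, 1, 1),
        schedule=None,
        catchup=False,
        params={
            # JSON-schema-style rules are enforced when a run is triggered.
            "table": Param("events", type="string"),
            "days_back": Param(7, type="integer", minimum=1, maximum=90),
        },
    ):

        @task
        def extract(params=None):
            # Airflow injects the validated params into the task context.
            print(f"loading {params['table']} for the last {params['days_back']} days")

        extract()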

Georgian
14:00 - 14:25.
By Zhang Zhang & Jenny Gao
Track: Use cases
Streamlining a Mortgage ETL Pipeline with Apache Airflow

At Bloomberg, it is our team’s responsibility to ensure the timely delivery to our clients worldwide of a vast dataset comprising approximately 5 billion data points on roughly 50 million loans and over 1.4 million securities, disclosed twice a month by three major government-sponsored mortgage entities. Ingesting this data so we can create and derive complex data structures to be consumed by our applications for our clients has been our biggest challenge. In this talk, we will discuss our transition from a manually-managed spreadsheet-based system to an automated centralized orchestration tool, and how Apache Airflow has helped make the process more transparent, predictable, and visible.

California East
14:00 - 14:25.
By Parnab Basak
Track: Airflow & ...
Unlocking FMOps/LLMOps using Apache Airflow: A guide to operationalizing and managing Large Language Models

In the last few years Large Language Models (LLMs) have risen to prominence as outstanding tools capable of transforming businesses. However, bringing such solutions and models to business-as-usual operations is not an easy task. In this session, we delve into the operationalization of generative AI applications using MLOps principles, leading to the introduction of foundation model operations (FMOps) or LLM operations using Apache Airflow. We further zoom into aspects of expected people and process mindsets, new techniques for model selection and evaluation, data privacy, and model deployment. Additionally, learn how you can use the prescriptive features of Apache Airflow to aid your operational journey. Whether you are building with out-of-the-box models (open-source or proprietary), creating new foundation models from scratch, or fine-tuning an existing model, with the structured approaches described you can effectively integrate LLMs into your operations, enhancing efficiency and productivity without causing disruptions in the cloud or on-premises.

California West
14:35 - 15:00.
By Jennifer Melot
Track: Use cases
Data Orchestration for Emerging Technology Analysis

The Center for Security and Emerging Technology is a think tank at Georgetown University that studies security implications of emerging technologies, including data-driven analyses across bibliometric, patenting, and investment datasets. This talk will describe CSET’s data infrastructure which uses Airflow to orchestrate data ingestion, model deployment, webscraping, and manual data curation pipelines. We’ll also discuss how outputs from these pipelines are integrated into public-facing web applications and written reports, and some lessons learned from building and maintaining data pipelines on a data team with a diverse skill set.

California East
14:35 - 15:00.
By Nathan Hadfield
Track: Airflow & ...
From Oops to Ops: Smart Task Failure Diagnosis with OpenAI

This session reveals an experimental venture integrating OpenAI’s AI technologies with Airflow, aimed at advancing error diagnosis.

Through the application of AI, our objective is to deepen the understanding of issues, provide comprehensive insights into task failures, and suggest actionable solutions, thereby augmenting the resolution process. This method seeks to not only enhance diagnostic efficiency but also to equip data engineers with AI-informed recommendations.

Participants will be guided through the integration journey, illustrating how AI can refine error analysis and potentially simplify troubleshooting workflows.

California West
14:35 - 15:00.
By Taylor Facen
Track: Best practices
From Tech Specs to Business Impact: How to Design A Truly End-to-End Airflow Project

There are many Airflow tutorials. However, many don’t show the full process of sourcing, transforming, testing, alerting, documenting, and finally supplying data. This talk will go over how to piece together an end-to-end Airflow project that transforms raw data to be consumable by the business. It will include how various technologies can all be orchestrated by Airflow to satisfy the needs of analysts, engineers, and business stakeholders. Each step and technology mentioned will be something that we at AngelList use, and code snippets will be sprinkled throughout so that attendees can implement this project within their organizations.

The talk will be divided into the following sections:

  1. Introduction: Introducing the business problem and how we came up with the solution design

  2. Data sourcing: Fetching and storing API data using basic operators and hooks

  3. Transformation and Testing: How to use dbt to build and test models based on the raw data

  4. Alerting: Alerting the necessary parties when any part of this DAG fails using Elementary and Slack

  5. Documentation and Consumption: How to sync data documentation to Metabase and dbt docs for easy consumption and lineage

Georgian
14:35 - 15:00.
By Daniel Standish
Track: Airflow Intro talks
Managing version upgrades without feelings of terror

Airflow version upgrades can be challenging. Maybe you upgrade and your dags fail to parse (that’s an easy fix). Or maybe you upgrade and everything looks fine, but when your dag runs, you can no longer connect to MySQL because the TLS version changed. In this talk I will provide concrete strategies that users can put into practice to make version upgrades safer and less painful. Topics may include:

  • What semver means and what it implies for the upgrade process

  • Using integration test dags, unit tests, and a test cluster to smoke out problems

  • Strategies around constraints files / pinning, and managing providers vs core versions

  • Using db clean prior to upgrade to reduce table size

  • Rollback strategies

  • What to do about warnings (e.g. deprecation warnings)?

I’ll also focus on keeping it simple. Sometimes things like “integration tests” and “CI” can be scary for people. Even without having set up anything automated, there are still things you can do to make management of upgrades a little less painful and risky.
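One low-effort example of such a test, sketched under the assumption that your DAG files live in a dags/ folder: parse them on the candidate Airflow version and treat import errors (and, if you like, deprecation warnings) as failures.

    import warnings
    from airflow.models import DagBag

    def test_dags_parse_cleanly():
        # Escalating DeprecationWarning surfaces upgrade hazards as parse errors.
        with warnings.catch_warnings():
            warnings.simplefilter("error", DeprecationWarning)
            bag = DagBag(dag_folder="dags/", include_examples=False)
        assert not bag.import_errors, f"failed to parse: {bag.import_errors}"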

Elizabethan A+B
15:10 - 15:35.
By Ephraim Anierobi
Track: Airflow Intro talks
A deep dive into Airflow configuration options for scalability

Apache Airflow has a lot of configuration options. A change in some of these options can affect the performance of Airflow.

If you are wondering why your Airflow instance is not running the number of tasks you expected it to run, after this talk, you will have a better understanding of the configuration options available for improving the number of tasks your Airflow instance can run.

We will talk about the DAG parsing configuration options, options for scheduler scalability, etc., and the pros and cons of these options.
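For reference, one quick way to see what a live instance is actually running with; these are a few of the knobs in this area (the talk's exact list may differ):

    from airflow.configuration import conf

    # Option names follow the stock airflow.cfg sections.
    for section, key in [
        ("core", "parallelism"),                    # task instances running at once
        ("core", "max_active_tasks_per_dag"),       # per-DAG concurrency cap
        ("core", "max_active_runs_per_dag"),        # concurrent runs per DAG
        ("scheduler", "parsing_processes"),         # DAG file parser processes
        ("scheduler", "min_file_process_interval"), # how often files are re-parsed
    ]:
        print(f"[{section}] {key} = {conf.get(section, key)}")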

Elizabethan A+B
15:10 - 15:35.
By Nawfel Bacha & Andrea Bombino
Track: Best practices
Airflow Datasets and Pub/Sub for Dynamic DAG Triggering

Looking for a way to streamline your data workflows and master the art of orchestration? As we navigate the complexities of modern data engineering, dynamic workflows and complex data pipeline dependencies in Airflow are becoming more and more common. To empower data engineers to use Airflow as their main orchestrator, Airflow Datasets can be easily integrated into your data journey.

This session will showcase the Dynamic Workflow orchestration in Airflow and how to manage multi-DAGs dependencies with Multi-Dataset listening.

We’ll take you through a real-time data pipeline with Pub/Sub messaging integration and dbt in Google Cloud environment, to ensure data transformations are triggered only upon new data ingestion, moving away from rigid time-based scheduling or the use of sensors and other legacy ways to trigger a DAG.
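A minimal sketch of the multi-Dataset listening described above; the URIs are placeholders, and in the setup described the dataset updates originate from Pub/Sub messages rather than a timetable:

    from datetime import datetime
    from airflow import DAG, Dataset
    from airflow.operators.empty import EmptyOperator

    raw_orders = Dataset("bq://example-project/raw/orders")
    raw_users = Dataset("bq://example-project/raw/users")

    # The consumer fires only once BOTH upstream feeds have fresh data.
    with DAG("dbt_transform", start_date=datetime(2024, 1, 1),
             schedule=[raw_orders, raw_users], catchup=False):
        EmptyOperator(task_id="run_dbt_models")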

Georgian
15:10 - 15:35.
By Elad Yaniv
Track: Airflow & ...
Elevating Machine Learning Deployment: Unleashing the Power of Airflow in Wix's ML Platform

In his presentation, Elad will provide a novel take on Airflow, highlighting its versatility beyond conventional use for scheduled pipelines. He’ll discuss its application as an on-demand tool for initiating and halting jobs, mainly in the Data Science fields, like dataset enrichment and batch prediction via API calls, complete with real-time status tracking and alerts. The talk aims to encourage a fresh approach to Airflow utilization but will also delve into the technical aspects of implementing DAG triggering and cancellation logic.

What will the audience learn:

  1. Real-life use case of leveraging Airflow capabilities beyond traditional pipeline scheduling, with innovative integration as the infrastructure for ML Platform.

  2. Trigger on-demand DAGs through API.

  3. Cancel running DAGs.

  4. Demonstration of an end-to-end ML pipeline utilizing AWS Sagemaker for batch predictions.

  5. Some more Airflow best practices.

Join us to learn from Wix’s experience and best practices!
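As a hint of items 2 and 3 above, a hypothetical client-side sketch against Airflow's stable REST API; the base URL, credentials, and DAG ID are placeholders, and "cancelling" is modeled as patching the run's state:

    import requests

    base_url = "https://airflow.example.com/api/v1"
    auth = ("user", "password")  # placeholder basic-auth credentials

    # Trigger an on-demand run, passing job parameters via conf.
    run = requests.post(
        f"{base_url}/dags/batch_predict/dagRuns",
        json={"conf": {"dataset": "users_2024"}},
        auth=auth,
    ).json()

    # Stop the run by marking it failed, which halts further task scheduling.
    requests.patch(
        f"{base_url}/dags/batch_predict/dagRuns/{run['dag_run_id']}",
        json={"state": "failed"},
        auth=auth,
    )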

California West
15:10 - 15:35.
By Michael Atondo
Track: Use cases
How Panasonic Leverages Airflow

This session covers how Panasonic uses various Airflow operators to perform daily routines, and how Airflow integrates with the following technologies:

Redis: Acts as a caching mechanism to optimize data retrieval and processing speed, enhancing overall pipeline performance.

MySQL: Utilized for storing metadata and managing task state information within Airflow’s backend database.

Tableau: Integrates with Airflow to generate interactive visualizations and dashboards, providing valuable insights into the processed data.

Amazon Redshift: Panasonic leverages Redshift for scalable data warehousing, seamlessly integrating it with Airflow for data loading and analytics.

Foundry: Integrated with Airflow to access and process data stored within Foundry’s data platform, ensuring data consistency and reliability.

Plotly Dashboards: Employed for creating custom, interactive web-based dashboards to visualize and analyze data processed through Airflow pipelines.

GitLab CI/CD Pipelines: Utilized for version control and continuous integration/continuous deployment (CI/CD) of Airflow DAGs (Directed Acyclic Graphs), ensuring efficient development and deployment of workflows.

California East
16:00 - 16:25.
By Shahar Epstein
Track: Use cases
Airflow at NCR Voyix: Streamlining ML workflows development with Airflow

NCR Voyix Retail Analytics AI team offers ML products for retailers while embracing Airflow as its MLOps Platform. As the team is small and there have been twice as many data scientists as engineers, we encountered challenges in making Airflow accessible to the scientists:

  1. As they come from diverse programming backgrounds, we needed an architecture enabling them to develop production-ready ML workflows without prior knowledge of Airflow.

  2. Due to dynamic product demands, we had to implement a mechanism to interchange Airflow operators effortlessly.

  3. As workflows serve multiple customers, they should be easily configurable and simultaneously deployable.

We came up with the following architecture to deal with the above:

  1. Enabling our data scientists to formulate ML workflows as structured Python files.

  2. Seamlessly converting the workflows into Airflow DAGs while aggregating their steps to be executed on different Airflow operators.

  3. Deploying DAGs via CI/CD’s UI to the DAGs folder for all customers while considering definitions for each in their configuration files.

In this session, we will cover Airflow’s evolution in our team and review the concepts of our architecture.
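Not the team's actual converter, but a minimal sketch of the underlying pattern: generating one DAG per customer from a declarative workflow spec.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator

    # Both the spec and the customer list are illustrative stand-ins.
    WORKFLOW_STEPS = ["fetch_data", "train_model", "batch_predict"]
    CUSTOMERS = ["acme", "globex"]

    for customer in CUSTOMERS:
        with DAG(f"ml_workflow_{customer}", start_date=datetime(2024, 1, 1),
                 schedule="@daily", catchup=False) as dag:
            prev = None
            for step in WORKFLOW_STEPS:
                op = EmptyOperator(task_id=step)  # a real converter would pick operators per config
                if prev:
                    prev >> op
                prev = op
        # Registering each DAG in globals() lets the DAG processor discover it.
        globals()[dag.dag_id] = dag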

California East
16:00 - 16:25.
By Spencer Tollefson
Track: Airflow Intro talks
Empowering more teams in your organization to self-service their Airflow needs

Does your organization feel like the responsibility to write Airflow DAGs, handle the Airflow infrastructure administration, debug failing tasks, and keep up with new features and best practices is too much for too few people? Perhaps you only have one data team that owns all of that; or you have too many teams that have too many permissions into other teams’ DAGs.

The topic of this talk is how Rakuten Kobo enables self-service for various teams within its organization to build their own DAGs in Airflow. The talk will include how we delineate the Airflow responsibilities of various teams, build guard rails for new Airflow developers, how different teams automatically have permissions required for their “own” DAGs (but not others), the unique responsibilities of Operations and Data Engineering teams, and how it is done in a scalable manner.

Maybe you’ll be inspired to make changes in your own organization, or have some tips of your own to share! Depending on questions, we could discuss some of the technical details as well.

Elizabethan A+B
16:00 - 16:25.
By Pankaj Singh & Utkarsh Sharma
Track: Best practices
Optimizing Airflow Performance: Strategies, Techniques, and Best Practices

Airflow, an open-source platform for orchestrating complex data workflows, is widely adopted for its flexibility and scalability. However, as workflows grow in complexity and scale, optimizing Airflow performance becomes crucial for efficient execution and resource utilization. This session delves into the importance of optimizing Airflow performance and provides strategies, techniques, and best practices to enhance workflow execution speed, reduce resource consumption, and improve system efficiency. Attendees will gain insights into identifying performance bottlenecks, fine-tuning workflow configurations, leveraging advanced features, and implementing optimization strategies to maximize pipeline throughput. Whether you’re a seasoned Airflow user or just getting started, this session equips you with the knowledge and tools needed to optimize your Airflow deployments for optimal performance and scalability. We’ll also explore topics such as DAG writing best practices, monitoring and updating Airflow configurations, and database performance optimization, covering unused indexes, missing indexes, and minimizing table and index bloat.

Georgian
16:00 - 16:25.
By Neha Singla & Sathish kumar Thangaraj
Track: Airflow & ...
Streamline data science workflow development using Jupyter notebooks and Airflow

Jupyter Notebooks are widely used by data scientists and engineers to prototype and experiment with data. However, these engineers are often required to work with other data or platform engineers to productionize these experiments, due to the complexity of navigating infrastructure and systems. In this talk, we will deep dive into this PR https://github.com/apache/airflow/pull/34840 and share how Airflow can be leveraged as a platform to execute notebook pipelines (Python, Scala, or Spark) in dynamic environments like Kubernetes for various heterogeneous use cases. We will demonstrate how data scientists can use a Jupyter extension to easily build and manage such pipelines, which are executed using Airflow, streamlining data science workflow development and supercharging productivity.
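The PR linked above brings its own notebook tooling; as a related point of reference, parameterized notebooks can already be scheduled with the Papermill provider (paths and parameters below are placeholders):

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.papermill.operators.papermill import PapermillOperator

    with DAG("notebook_pipeline", start_date=datetime(2024, 1, 1),
             schedule="@daily", catchup=False):
        PapermillOperator(
            task_id="run_training_notebook",
            input_nb="/notebooks/train.ipynb",
            # Templated output path keeps one executed copy per run date.
            output_nb="/notebooks/out/train-{{ ds }}.ipynb",
            parameters={"run_date": "{{ ds }}"},
        )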

California West
16:35 - 17:00.
By Austin Bennett
Track: Best practices
Automated Testing and Deployment of DAGs

DAG integrity is critical. So are coding conventions, consistency in standards for the group. In this talk, we will share the various lessons learned for testing/verifying our DAGs as part of our GitHub workflows [ for testing as part of the pull request process, and for automated deployment - eventually to production - once merged ]. We will dig into how we have unlocked additional efficiencies, catch errors before they get deployed, and generally how we are better off for having both Airflow & plenty of checks in our CI, before we merge/deploy.
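A sketch of the kind of DAG-integrity check that runs in such a CI workflow, assuming DAG files live under dags/ (real suites layer team conventions on top):

    from airflow.models import DagBag

    # Parsing the folder also catches cycles and duplicate DAG IDs.
    bag = DagBag(dag_folder="dags/", include_examples=False)

    def test_no_import_errors():
        assert not bag.import_errors, f"broken DAG files: {bag.import_errors}"

    def test_every_dag_has_tags():
        # An example convention check; the actual rule set is team-specific.
        for dag_id, dag in bag.dags.items():
            assert dag.tags, f"{dag_id} is missing tags"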

Georgian
16:35 - 17:00.
By Ozcan Ilikhan & Amit Kumar
Track: Use cases
Evolution of Orchestration at GoDaddy: A Journey from On-prem to Cloud-based Single Pane Model

Explore the evolutionary journey of orchestration within GoDaddy, tracing its transformation from initial on-premise deployment to a robust cloud-based Apache Airflow orchestration model. This session will detail the pivotal shifts in design, organizational decisions, and governance that have streamlined GoDaddy’s Data Platform and enhanced overall governance.

Attendees will gain insights valuable for optimizing Airflow deployments and simplifying complex orchestration processes.

We will recap the transformation journey and its impact on GoDaddy’s data operations, and look at future directions and ongoing improvements in orchestration at GoDaddy.

This session will benefit attendees by providing a comprehensive case study on optimizing orchestration in a complex enterprise environment, emphasizing practical insights and scalable solutions.

California East
16:35 - 17:00.
By Rafal Biegacz & Piotr Leśniak
Track: Airflow & ...
Orchestration of ML workloads via Airflow & GKE Batch

During this talk we are going to give an overview of different orchestration approaches (Kubeflow, Ray, Airflow, etc.) for running ML workloads on Kubernetes, and specifically we will focus on how to use the Kubernetes Batch API and Kubernetes Operators to run complex ML workloads.

California West
16:35 - 17:00.
By Shubham Mehta & Rajesh Bishundeo
Track: Airflow Intro talks
Scaling AI Workloads with Apache Airflow

AI workloads are becoming increasingly complex, with unique requirements around data management, compute scalability, and model lifecycle management. In this session, we will explore the real-world challenges users face when operating AI at scale. Through real-world examples, we will uncover common pitfalls in areas like data versioning, reproducibility, model deployment, and monitoring. Our practical guide will highlight strategies for building robust and scalable AI platforms leveraging Airflow as the orchestration layer and AWS for its extensive AI/ML capabilities. We will showcase how users have tackled these challenges, streamlined their AI workflows, and unlocked new levels of productivity and innovation.

Elizabethan A+B
17:10 - 17:35.
By Maxime Beauchemin
Track: Community
AI Reality Checkpoint: The Good, the Bad, and the Overhyped

In the past 18 months, artificial intelligence has not just entered our workspaces – it has taken over. As we stand at the crossroads of innovation and automation, it’s time for a candid reflection on how AI has reshaped our professional lives, and to talk about where it’s been a game changer, where it’s falling short, and what’s about to shift dramatically in the short term.

Since the release of ChatGPT in December 2022, I’ve developed a “first-reflex” to augment and accelerate nearly every task with AI. As a founder and CEO, this spans a wide array of responsibilities from fundraising, internal communications, legal, operations, product marketing, finance, and beyond. In this keynote, I’ll cover diverse use cases across all areas of business, offering a comprehensive view of AI’s impact.

Join me as I sort through this new reality and try to forecast the future of AI in our work. It’s time for a radical checkpoint. Everything’s changing fast. In some areas, AI has been a slam dunk; in others, it’s been frustrating as hell. And once a few key challenges are tackled, we’re on the cusp of a tsunami of transformation.

3 major milestones are right around the corner: top-human-level reasoning, solid memory accumulation and recall, and proper executive skills. How is this going to affect all of us?

Elizabethan A+B
17:10 - 17:35.
By Arnab Kundu
Track: Use cases
DAG Dependency Management across Lowes

DAG dependency is already a solved use case within a single Airflow instance. But what happens when you have 50+ Airflow instances across teams and the workflow of one or many depends on others? By leveraging sensors and datasets we have created a custom operator that brings in the capability of cross-cluster dependencies. It works with our on-prem Kubernetes architecture, which is responsible for deploying the custom operators throughout the entire organization.
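A hypothetical sketch of the idea (not Lowes' actual operator): a sensor that polls a peer Airflow instance's REST API until the upstream DAG's latest run succeeds. The URL and credentials are placeholders.

    import requests
    from airflow.sensors.base import BaseSensorOperator

    class ExternalClusterDagSensor(BaseSensorOperator):
        """Waits for the newest run of a DAG on another Airflow cluster."""

        def __init__(self, *, base_url: str, upstream_dag_id: str, **kwargs):
            super().__init__(**kwargs)
            self.base_url = base_url
            self.upstream_dag_id = upstream_dag_id

        def poke(self, context) -> bool:
            resp = requests.get(
                f"{self.base_url}/api/v1/dags/{self.upstream_dag_id}/dagRuns",
                params={"order_by": "-execution_date", "limit": 1},
                auth=("user", "password"),  # placeholder credentials
            )
            runs = resp.json().get("dag_runs", [])
            return bool(runs) and runs[0]["state"] == "success"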

California East
17:10 - 17:35.
By Kevin Wang & Palanieppan Muthiah
Track: Use cases
Optimizing Critical Operations: Enhancing Robinhood's Workflow Journey with Airflow

Airflow is widely used within Robinhood. In addition to traditional offline analytics use cases (to schedule ingestion and analytics workloads that populate our data lake), we also use Airflow in our backend services to orchestrate various workflows that are highly critical for the business, e.g., compliance and regulatory reporting, user-facing reports, and more.

As part of this, we have evolved what we believe is a unique deployment architecture for Airflow. We have central schedulers that are responsible for workloads from multiple different teams, but the workflow tasks themselves run on workers owned by respective teams that are highly coupled with their backend services and codebase.

Furthermore, Robinhood augmented Airflow with a number of customizations: an Airflow worker template for Kubernetes, enhanced observability, enhanced SLA detection, and a collection of operators, sensors, and plugins to tailor Airflow to its exact needs.

This session is going to walk through how we grew our architecture and adapted Airflow to fit Robinhood’s variety of needs and use cases.

Georgian
17:10 - 17:35.
By Amogh Desai & Shubham Raj
Track: Airflow & ...
Overcoming Custom Python Package Hurdles in Airflow

DAG authors, while constructing DAGs, generally use native libraries provided by Airflow in conjunction with Python libraries available from public PyPI repositories.

But sometimes, DAG authors need to construct DAGs using libraries that are either in-house or not available from public PyPI repositories. This poses a serious challenge for users who want to run their custom code with Airflow DAGs, particularly when Airflow is deployed in a cloud-native fashion.

Traditionally, these packages are baked into Airflow Docker images. This won’t work post-deployment and is highly impractical if your library is under active development.

We propose a solution that creates a dedicated Airflow global python environment that dynamically generates the requirements, establishes a version-compatible pyenv adhering to Airflow’s policies, and manages custom pip repository authentication seamlessly. Importantly, the service executes these steps in a fail-safe manner, not compromising core components.

Join us as we discuss the solution to this common problem, touching upon the design, and seeing the solution in action. We also candidly discuss some challenges, and the shortcomings of the proposed solution.
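The proposed service itself is not shown here; for contrast, a common stopgap is a per-task virtual environment pointed at a private index. The package name and index URL are placeholders, and the --index-url line relies on pip reading it from the generated requirements file:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonVirtualenvOperator

    def use_inhouse_lib():
        # Import inside the callable: the package exists only in the task's venv.
        import inhouse_lib
        inhouse_lib.run()

    with DAG("custom_pkg_example", start_date=datetime(2024, 1, 1),
             schedule=None, catchup=False):
        PythonVirtualenvOperator(
            task_id="run_with_inhouse_lib",
            python_callable=use_inhouse_lib,
            requirements=[
                "--index-url https://pypi.internal.example.com/simple",
                "inhouse-lib==1.2.3",
            ],
            system_site_packages=False,
        )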

California West
9:00 - 9:10
Welcome
10:30 - 11:00
Morning break
13:00 - 14:00
Lunch
15:35 - 16:00
Afternoon break
17:35 - 19:40
Event reception
09:10 - 10:05. Grand Ballroom
By Kenten Danas, John Jackson, Marc Lamberti, Rafal Biegacz, Ash Berlin-Taylor & Elad Kalif
Track: Keynote
10 years after its creation, Airflow is stronger than ever: in last year’s Airflow survey, 81% of users said Airflow is important or very important to their business, 87% said their Airflow usage has grown over time, and 92% said they would recommend Airflow. In this panel discussion, we’ll celebrate a decade of Airflow and delve into how it became the highly recommended industry standard it is today, including history, pivotal moments, and the role of the community.
10:05 - 10:30. Grand Ballroom
By Michael Winser & Jarek Potiuk
Track: Keynote
Airflow’s power comes from its vast ecosystem, but securing this intricate web requires a united front. This talk unveils a groundbreaking collaborative effort between the Python Software Foundation (PSF), the Apache Software Foundation (ASF), the Airflow Project Management Committee (PMC), and Alpha-Omega Fund - aimed at securing not only Airflow, but the whole ecosystem. We’ll explore this new project dedicated to improving security across the Airflow landscape.
11:00 - 11:45. California East
By Bonnie Why
Track: Use cases
As the largest employee-owned engineering and construction firm in the United States, Burns & McDonnell has a massive amount of data. Not only that, it’s hard to pinpoint which source system has the data we need. Our solution to this challenge is to build a unified information platform — a single source of truth where all of our data is searchable, trustworthy, and accessible to our employee-owners and the projects that need it.
11:00 - 11:45. California West
By Avichay Marciano
Track: Airflow & ...
As large language models (LLMs) gain traction, companies encounter challenges in deploying them effectively. This session focuses on using Airflow to manage LLM batch pipelines, addressing rate limits and optimizing asynchronous batch APIs. We will discuss strategies for managing cloud provider rate limits efficiently to ensure uninterrupted, cost-effective LLM operations. This includes queuing and job prioritization techniques to optimize throughput. Additionally, we’ll explore asynchronous batch processing for tasks such as Retrieval Augmented Generation (RAG) and vector embedding, which enhance processing efficiency and reduce latency.
11:00 - 11:45. Elizabethan A+B
By Brent Bovenzi
Track: New features
Soon we will finally switch to a 100% React UI, with a full separation between the API and UI as well. While we are making such a big change, let’s also take the opportunity to imagine whole new interfaces instead of simply modernizing the existing views. How can we use design to help you better understand what is going on with your DAG? Come listen to some of our proposed ideas and bring your own big ideas, as the second half will be an open discussion.
11:00 - 11:45. Georgian
By Kacper Muda & Eric Veleker
Track: Best practices
OpenLineage is an open standard for lineage data collection, integrated into the Airflow codebase, facilitating lineage collection across providers like Google, Amazon, and more. Atlan Data Catalog is a 3rd generation active metadata platform that is a single source of trust unifying cataloging, data discovery, lineage, and governance experience. We will demonstrate what OpenLineage is and how, with minimal and intuitive setup across Airflow and Atlan, it presents a unified workflow view and efficient cross-platform lineage collection, including column level, in various technologies (Python, Spark, dbt, SQL etc.
12:00 - 12:45. California East
By Gunnar Lykins
Track: Use cases
FanDuel Group, an industry leader in sports-tech entertainment, is proud to be recognized as the #1 sports betting company in the US as of 2023 with 53.4% market share. With a workforce exceeding 4,000 employees, including over 100 data engineers, FanDuel Group is at the forefront of innovation in batch processing orchestration platforms. Currently, our platform handles over 250,000 DAG runs & executes ~3 million tasks monthly across 17 deployments. It provides a standardized framework for pipeline development, structured observability, monitoring, & alerting.
12:00 - 12:45. California West
By Sriharsh Adari, Jeetendra Vaidya & Joseph Morotti
Track: Airflow & ...
Nowadays, conversational AI is no longer exclusive to large enterprises. It has become more accessible and affordable, opening up new possibilities and business opportunities. In this session, discover how you can leverage Generative AI as your AI pair programmer to suggest DAG code and recommend entire functions in real-time, directly from your editor. Visualize how to harness the power of ML, trained on billions of lines of code, to transform natural language prompts into coding suggestions.
12:00 - 12:45. Elizabethan A+B
By Ash Berlin-Taylor & Vikram Koka
Track: New features
Imagine a world where writing Airflow tasks in languages like Go, R, Julia, or maybe even Rust is not just a dream but a native capability. Say goodbye to BashOperators; welcome to the future of Airflow task execution. Here’s what you can expect to learn from this session: Multilingual Tasks: Explore how we empower DAG authors to write tasks in any language while retaining seamless access to Airflow Variables and Connections.
12:00 - 12:45. Georgian
By John Jackson
Track: Best practices
Airflow is all about schedules…we use CRON strings and Timetable to define schedules, and there’s an Airflow Scheduler component that manages those timetables, and a lot more, to ensure that DAGs and tasks are addressed based on those schedules. But what do you do if your data isn’t available on a schedule? What if data is coming from many sources, at varying times, and your job is to make sure it’s all as up-to-date as possible?
14:00 - 14:25. California East
By Zhang Zhang & Jenny Gao
Track: Use cases
At Bloomberg, it is our team’s responsibility to ensure the timely delivery to our clients worldwide of a vast dataset comprising approximately 5 billion data points on roughly 50 million loans and over 1.4 million securities, disclosed twice a month by three major government-sponsored mortgage entities. Ingesting this data so we can create and derive complex data structures to be consumed by our applications for our clients has been our biggest challenge.
14:00 - 14:25. California West
By Parnab Basak
Track: Airflow & ...
In the last few years Large Language Models (LLMs) have risen to prominence as outstanding tools capable of transforming businesses. However, bringing such solutions and models to the business-as-usual operations is not an easy task. In this session, we delve into the operationalization of generative AI applications using MLOps principles, leading to the introduction of foundation model operations (FMOps) or LLM operations using Apache Airflow. We further zoom into aspects of expected people and process mindsets, new techniques for model selection and evaluation, data privacy, and model deployment.
14:00 - 14:25. Elizabethan A+B
By Luan Moreno Medeiros Maciel
Track: Airflow Intro talks
dbt has become the de facto standard for data teams building reliable and trustworthy SQL code on a modern data stack architecture. The dbt logic needs to be orchestrated, and jobs scheduled to meet business expectations. That’s where Airflow comes into play. In this quick introduction session, you’ll learn how to leverage dbt-Core & Airflow to orchestrate pipelines, write DAGs in a Pythonic way, and apply best practices to your jobs.
14:00 - 14:25. Georgian
By Sumit Maheshwari
Track: Best practices
In the realm of data engineering, there is a prevalent tendency for professionals to develop similar Directed Acyclic Graphs (DAGs) to manage analogous tasks. Leveraging Dag Params presents an effective strategy for mitigating redundancy within these DAGs. Moreover, the utilization of Dag Params facilitates seamless enforcement of user inputs, thereby streamlining the process of incorporating validations into the DAG codebase.
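As a minimal sketch of the idea (the DAG and param names are illustrative), one parameterized DAG with JSON-schema validation can replace several near-identical ones:

```python
# Hypothetical sketch: a single DAG reused via validated Params instead of
# maintaining many near-copies.
from datetime import datetime

from airflow import DAG
from airflow.models.param import Param
from airflow.operators.python import PythonOperator


def export(**context):
    # Values arrive through the task context after being validated at trigger time.
    table = context["params"]["table"]
    limit = context["params"]["row_limit"]
    print(f"exporting up to {limit} rows from {table}")


with DAG(
    "parameterized_export",
    start_date=datetime(2024, 9, 10),
    schedule=None,
    params={
        # JSON-schema keywords (type, minimum, maximum) enforce user input.
        "table": Param("events", type="string"),
        "row_limit": Param(1000, type="integer", minimum=1, maximum=100_000),
    },
):
    PythonOperator(task_id="export", python_callable=export)
```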
14:35 - 15:00. California East
By Jennifer Melot
Track: Use cases
The Center for Security and Emerging Technology is a think tank at Georgetown University that studies security implications of emerging technologies, including data-driven analyses across bibliometric, patenting, and investment datasets. This talk will describe CSET’s data infrastructure which uses Airflow to orchestrate data ingestion, model deployment, webscraping, and manual data curation pipelines. We’ll also discuss how outputs from these pipelines are integrated into public-facing web applications and written reports, and some lessons learned from building and maintaining data pipelines on a data team with a diverse skill set.
14:35 - 15:00. California West
By Nathan Hadfield
Track: Airflow & ...
This session reveals an experimental venture integrating OpenAI’s AI technologies with Airflow, aimed at advancing error diagnosis. Through the application of AI, our objective is to deepen the understanding of issues, provide comprehensive insights into task failures, and suggest actionable solutions, thereby augmenting the resolution process. This method seeks to not only enhance diagnostic efficiency but also to equip data engineers with AI-informed recommendations. Participants will be guided through the integration journey, illustrating how AI can refine error analysis and potentially simplify troubleshooting workflows.
14:35 - 15:00. Elizabethan A+B
By Daniel Standish
Track: Airflow Intro talks
Airflow version upgrades can be challenging. Maybe you upgrade and your dags fail to parse (that’s an easy fix). Or maybe you upgrade and everything looks fine, but when your dag runs, you can no longer connect to mysql because the TLS version changed. In this talk I will provide concrete strategies that users can put into practice to make version upgrades safer and less painful. Topics may include: What semver means and what it implies for the upgrade process
14:35 - 15:00. Georgian
By Taylor Facen
Track: Best practices
There are many Airflow tutorials. However, many don’t show the full process of sourcing, transforming, testing, alerting, documenting, and finally supplying data. This talk will go over how to piece together an end-to-end Airflow project that transforms raw data to be consumable by the business. It will include how various technologies can all be orchestrated by Airflow to satisfy the needs of analysts, engineers, and business stakeholders. Each step and technology mentioned will be something that we at AngelList use, and code snippets will be sprinkled throughout so that attendees can implement this project within their organizations.
15:10 - 15:35. California East
By Michael Atondo
Track: Use cases
Using various operators to perform daily routines, with integrations across several technologies: Redis acts as a caching mechanism to optimize data retrieval and processing speed, enhancing overall pipeline performance; MySQL is utilized for storing metadata and managing task state information within Airflow’s backend database; Tableau integrates with Airflow to generate interactive visualizations and dashboards, providing valuable insights into the processed data; and Amazon Redshift is leveraged by Panasonic for scalable data warehousing, seamlessly integrating with Airflow for data loading and analytics.
15:10 - 15:35. California West
By Elad Yaniv
Track: Airflow & ...
In his presentation, Elad will provide a novel take on Airflow, highlighting its versatility beyond conventional use for scheduled pipelines. He’ll discuss its application as an on-demand tool for initiating and halting jobs, mainly in the Data Science fields, like dataset enrichment and batch prediction via API calls, complete with real-time status tracking and alerts. The talk aims to encourage a fresh approach to Airflow utilization but will also delve into the technical aspects of implementing DAG triggering and cancellation logic.
15:10 - 15:35. Elizabethan A+B
By Ephraim Anierobi
Track: Airflow Intro talks
Apache Airflow has a lot of configuration options. A change in some of these options can affect the performance of Airflow. If you are wondering why your Airflow instance is not running the number of tasks you expected it to run, after this talk, you will have a better understanding of the configuration options available for improving the number of tasks your Airflow instance can run. We will talk about the DAG parsing configuration options, options for scheduler scalability, etc.
15:10 - 15:35. Georgian
By Nawfel Bacha & Andrea Bombino
Track: Best practices
Looking for a way to streamline your data workflows and master the art of orchestration? As we navigate the complexities of modern data engineering, Airflow’s dynamic workflow and complex data pipeline dependencies are starting to become more and more common nowadays. In order to empower data engineers to exploit Airflow as the main orchestrator, Airflow Datasets can be easily integrated in your data journey. This session will showcase the Dynamic Workflow orchestration in Airflow and how to manage multi-DAGs dependencies with Multi-Dataset listening.
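As a minimal sketch of multi-dataset listening (the URIs and DAG ids are illustrative), a consumer DAG scheduled on a list of Datasets runs only after every dataset in the list has been updated:

```python
# Hypothetical sketch: cross-DAG dependencies expressed as Datasets rather
# than sensors or fixed schedules.
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

orders = Dataset("s3://lake/orders.parquet")
customers = Dataset("s3://lake/customers.parquet")

# One producer shown; `customers` would be updated by another DAG elsewhere.
with DAG("load_orders", start_date=datetime(2024, 9, 10), schedule="@daily"):
    EmptyOperator(task_id="write_orders", outlets=[orders])

# Runs only once BOTH datasets have received an update since its last run.
with DAG("build_report", start_date=datetime(2024, 9, 10), schedule=[orders, customers]):
    EmptyOperator(task_id="join_and_publish")
```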
16:00 - 16:25. California East
By Shahar Epstein
Track: Use cases
NCR Voyix Retail Analytics AI team offers ML products for retailers while embracing Airflow as its MLOps Platform. As the team is small and there have been twice as many data scientists as engineers, we encountered challenges in making Airflow accessible to the scientists: As they come from diverse programming backgrounds, we needed an architecture enabling them to develop production-ready ML workflows without prior knowledge of Airflow. Due to dynamic product demands, we had to implement a mechanism to interchange Airflow operators effortlessly.
16:00 - 16:25. California West
By Neha Singla & Sathish kumar Thangaraj
Track: Airflow & ...
Jupyter Notebooks are widely used by data scientists and engineers to prototype and experiment with data. However, these engineers are often required to work with other data or platform engineers to productionize these experiments due to the complexity of navigating infrastructure and systems. In this talk, we will deep dive into this PR https://github.com/apache/airflow/pull/34840 and share how Airflow can be leveraged as a platform to execute notebook pipelines (Python, Scala, or Spark) in dynamic environments like Kubernetes for various heterogeneous use cases.
16:00 - 16:25. Elizabethan A+B
By Spencer Tollefson
Track: Airflow Intro talks
Does your organization feel like the responsibility to write Airflow DAGs, handle the Airflow infrastructure administration, debug failing tasks, and keep up with new features and best practices is too much for too few people? Perhaps you only have one data team that owns all of that; or you have too many teams that have too many permissions into other teams’ DAGs. The topic of this talk is how Rakuten Kobo enables self-service for various teams within its organization to build their own DAGs in Airflow.
16:00 - 16:25. Georgian
By Pankaj Singh & Utkarsh Sharma
Track: Best practices
Airflow, an open-source platform for orchestrating complex data workflows, is widely adopted for its flexibility and scalability. However, as workflows grow in complexity and scale, optimizing Airflow performance becomes crucial for efficient execution and resource utilization. This session delves into the importance of optimizing Airflow performance and provides strategies, techniques, and best practices to enhance workflow execution speed, reduce resource consumption, and improve system efficiency. Attendees will gain insights into identifying performance bottlenecks, fine-tuning workflow configurations, leveraging advanced features, and implementing optimization strategies to maximize pipeline throughput.
16:35 - 17:00. California East
By Ozcan Ilikhan & Amit Kumar
Track: Use cases
Explore the evolutionary journey of orchestration within GoDaddy, tracing its transformation from initial on-premise deployment to a robust cloud-based Apache Airflow orchestration model. This session will detail the pivotal shifts in design, organizational decisions, and governance that have streamlined GoDaddy’s Data Platform and enhanced overall governance. Attendees will gain insights valuable for optimizing Airflow deployments and simplifying complex orchestration processes. Recap of the transformation journey and its impact on GoDaddy’s data operations.
16:35 - 17:00. California West
By Rafal Biegacz & Piotr Leśniak
Track: Airflow & ...
During this talk we are going to give an overview of different orchestration approaches (Kubeflow, Ray, Airflow, etc.) to running ML workloads on Kubernetes, and specifically we will focus on how to use the Kubernetes Batch API and Kubernetes Operators to run complex ML workloads.
16:35 - 17:00. Elizabethan A+B
By Shubham Mehta & Rajesh Bishundeo
Track: Airflow Intro talks
AI workloads are becoming increasingly complex, with unique requirements around data management, compute scalability, and model lifecycle management. In this session, we will explore the real-world challenges users face when operating AI at scale. Through real-world examples, we will uncover common pitfalls in areas like data versioning, reproducibility, model deployment, and monitoring. Our practical guide will highlight strategies for building robust and scalable AI platforms leveraging Airflow as the orchestration layer and AWS for its extensive AI/ML capabilities.
16:35 - 17:00. Georgian
By Austin Bennett
Track: Best practices
DAG integrity is critical. So are coding conventions and consistency in standards across the group. In this talk, we will share the various lessons learned from testing/verifying our DAGs as part of our GitHub workflows [ for testing as part of the pull request process, and for automated deployment - eventually to production - once merged ]. We will dig into how we have unlocked additional efficiencies, how we catch errors before they get deployed, and generally how we are better off for having both Airflow & plenty of checks in our CI before we merge/deploy.
17:10 - 17:35. California East
By Arnab Kundu
Track: Use cases
DAG dependency is already a solved use case for the same Airflow instance. But what happens when you have 50+ Airflow instances across teams and the workflow of one or many depends on others? By leveraging sensors and datasets we have created a custom operator that brings in the capability of cross-cluster dependencies. It works with our OnPrem Kubernetes architecture which is responsible for deployment of the custom operators throughout the entire Organization.
17:10 - 17:35. California West
By Amogh Desai & Shubham Raj
Track: Airflow & ...
DAG authors, while constructing DAGs, generally use native libraries provided by Airflow in conjunction with Python libraries available from public PyPI repositories. But sometimes, DAG authors need to construct DAGs using libraries that are either in-house or not available over public PyPI repositories. This poses a serious challenge for users who want to run their custom code with Airflow DAGs, particularly when Airflow is deployed in a cloud-native fashion. Traditionally, these packages are baked into Airflow Docker images.
17:10 - 17:35. Elizabethan A+B
By Maxime Beauchemin
Track: Community
In the past 18 months, artificial intelligence has not just entered our workspaces – it has taken over. As we stand at the crossroads of innovation and automation, it’s time for a candid reflection on how AI has reshaped our professional lives, and to talk about where it’s been a game changer, where it’s falling short, and what’s about to shift dramatically in the short term. Since the release of ChatGPT in December 2022, I’ve developed a “first-reflex” to augment and accelerate nearly every task with AI.
17:10 - 17:35. Georgian
By Kevin Wang & Palanieppan Muthiah
Track: Use cases
Airflow is widely used within Robinhood. In addition to traditional offline analytics use cases (to schedule ingestion and analytics workloads that populate our data lake), we also use Airflow in our backend services to orchestrate various workflows that are highly critical for the business, e.g., compliance and regulatory reporting, user-facing reports, and more. As part of this, we have evolved what we believe is a unique deployment architecture for Airflow.

Wednesday, September 11, 2024

09:00
09:30
10:00
Morning break
10:30
11:00
11:30
12:00
12:30
13:00
13:30
Lunch
14:30
15:05
15:40
16:05
Afternoon break
16:30
17:05
17:40
09:00 - 09:25.
By
Track: Keynote
09/11/2024 9:00 AM 09/11/2024 9:25 AM America/Los_Angeles AS24: Keynote to be confirmed

Grand Ballroom
09:30 - 09:55.
By Alexander Booth & Oliver Dykstra
Track: Keynote
09/11/2024 9:30 AM 09/11/2024 9:55 AM America/Los_Angeles AS24: Winning Strategies: Powering a World Series Victory with Airflow Orchestration

Dive into the winning playbook of the 2023 World Series Champions Texas Rangers, and discover how they leverage Apache Airflow to streamline their data pipelines. In this session, we’ll explore how real-world data pipelines enable agile decision-making and drive competitive advantage in the high-stakes world of professional baseball, all by using Airflow as an orchestration platform. Whether you’re a seasoned data engineer or just starting out, this session promises actionable strategies to elevate your data orchestration game to championship levels.

Grand Ballroom
Dive into the winning playbook of the 2023 World Series Champions Texas Rangers, and discover how they leverage Apache Airflow to streamline their data pipelines. In this session, we’ll explore how real-world data pipelines enable agile decision-making and drive competitive advantage in the high-stakes world of professional baseball, all by using Airflow as an orchestration platform. Whether you’re a seasoned data engineer or just starting out, this session promises actionable strategies to elevate your data orchestration game to championship levels.
10:30 - 10:55.
By Ramesh Babu K M
Track: Best practices
09/11/2024 10:30 AM 09/11/2024 10:55 AM America/Los_Angeles AS24: Airflow as a workflow for Self Service Based Ingestion

Our idea is to platformize ingestion pipelines, driven by Airflow in the background, and to streamline the entire ingestion process for self-service.

With customer experience at the forefront, and with the goal of making data ingestion foolproof for the Analytics data team, Airflow complements our vision.

Georgian
Our idea is to platformize ingestion pipelines, driven by Airflow in the background, and to streamline the entire ingestion process for self-service. With customer experience at the forefront, and with the goal of making data ingestion foolproof for the Analytics data team, Airflow complements our vision.
10:30 - 11:15.
By Serjesh Sharma & Vasantha Kosuri-Marshall
Track: Use cases
09/11/2024 10:30 AM 09/11/2024 11:15 AM America/Los_Angeles AS24: Airflow at Ford: A Job Router for Training Advanced Driver Assistance Systems

Ford Motor Company operates extensively across various nations. The Data Operations (DataOps) team for Advanced Driver Assistance Systems (ADAS) at Ford is tasked with processing terabyte-scale daily data from lidar, radar, and video. To manage this, the DataOps team is challenged with orchestrating diverse, compute-intensive pipelines across both on-premises infrastructure and GCP, and with handling sensitive customer data across both environments. The team is also responsible for facilitating the execution of on-demand, compute-intensive algorithms at scale. To achieve these objectives, the team employs Astronomer/Airflow at the core of its strategic approach. This involves various deployments of Astronomer/Airflow that integrate seamlessly and securely (via Apigee) to initiate batch data processing and ML jobs on the cloud, as well as compute-intensive computer vision tasks on-premises, with essential alerting provided through the ELK stack. This presentation will delve into the architecture and strategic planning surrounding the hybrid batch router, highlighting its pivotal role in promoting rapid innovation and scalability in the development of ADAS features.

California East
Ford Motor Company operates extensively across various nations. The Data Operations (DataOps) team for Advanced Driver Assistance Systems (ADAS) at Ford is tasked with processing terabyte-scale daily data from lidar, radar, and video. To manage this, the DataOps team is challenged with orchestrating diverse, compute-intensive pipelines across both on-premises infrastructure and GCP, and with handling sensitive customer data across both environments. The team is also responsible for facilitating the execution of on-demand, compute-intensive algorithms at scale.
10:30 - 10:55.
By Rajesh Bishundeo
Track: Airflow & ...
09/11/2024 10:30 AM 09/11/2024 10:55 AM America/Los_Angeles AS24: Growing with Apache Airflow: A Providers Journey

It has been nearly 4 years since the launch of Managed Workflows for Apache Airflow (MWAA) by AWS. It has gone through the trials and tribulations as with any new idea, working with customers to better understand its shortcomings, building dedicated teams focused on scaling and growth, and at its core, preserving the integrity and functionality of Apache Airflow. Initially launched with Airflow 1.10, MWAA is now available globally in multiple AWS regions supporting the latest version of Airflow along with a multitude of features. In this talk, we will cover a bit of that history along with debunking a few myths surrounding the critical needs for users today. From compliance requirements, larger environments, observability, and pricing, we will discuss how MWAA has evolved and continues to grow through its focus on customer value and more importantly, its dedication to the Apache Airflow community.

Elizabethan A+B
It has been nearly 4 years since the launch of Managed Workflows for Apache Airflow (MWAA) by AWS. It has gone through the trials and tribulations as with any new idea, working with customers to better understand its shortcomings, building dedicated teams focused on scaling and growth, and at its core, preserving the integrity and functionality of Apache Airflow. Initially launched with Airflow 1.10, MWAA is now available globally in multiple AWS regions supporting the latest version of Airflow along with a multitude of features.
10:30 - 11:15.
By Tatiana Al-Chueyr Martins & Pankaj Koti
Track: Airflow & ...
09/11/2024 10:30 AM 09/11/2024 11:15 AM America/Los_Angeles AS24: Overcoming performance hurdles in Integrating dbt with Airflow

The integration between dbt and Airflow is a popular topic in the community: in previous editions of Airflow Summit, at Coalesce, and in the #airflow-dbt Slack channel.

Astronomer Cosmos (https://github.com/astronomer/astronomer-cosmos/) stands out as one of the libraries that strives to enhance this integration, having over 300k downloads per month.

During its development, we’ve encountered various performance challenges in terms of scheduling and task execution. While we’ve managed to address some, others remain to be resolved.

This talk describes how Cosmos works, the improvements made over the last 1.5 years, and the roadmap. It also aims to collect feedback from the community on how we can further improve the experience of running dbt in Airflow.
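As a rough illustration of what Cosmos does (the paths and profile names below are placeholders, not the speakers’ setup), a DbtDag renders each model in a dbt project as its own Airflow task, with dependencies taken from the dbt graph:

```python
# Hypothetical sketch of an Astronomer Cosmos DbtDag.
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig

jaffle_shop = DbtDag(
    dag_id="jaffle_shop",
    # Point Cosmos at the dbt project; it parses the graph and creates one
    # Airflow task per dbt node.
    project_config=ProjectConfig("/opt/dbt/jaffle_shop"),
    profile_config=ProfileConfig(
        profile_name="jaffle_shop",
        target_name="prod",
        profiles_yml_filepath="/opt/dbt/profiles.yml",
    ),
    schedule_interval="@daily",
    start_date=datetime(2024, 9, 10),
    catchup=False,
)
```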

California West
The integration between dbt and Airflow is a popular topic in the community: in previous editions of Airflow Summit, at Coalesce, and in the #airflow-dbt Slack channel. Astronomer Cosmos (https://github.com/astronomer/astronomer-cosmos/) stands out as one of the libraries that strives to enhance this integration, having over 300k downloads per month. During its development, we’ve encountered various performance challenges in terms of scheduling and task execution. While we’ve managed to address some, others remain to be resolved.
11:00 - 11:25.
By Ole Christian Langfjæran
Track: Best practices
09/11/2024 11:00 AM 09/11/2024 11:25 AM America/Los_Angeles AS24: Behaviour Driven Development in Airflow

Behaviour Driven Development can, in the simplest of terms, be described as Test Driven Development, only readable. It is of course more than that, but that is not the aim of this talk. This talk aims to show:

  • How to write tests before you write a single line of Airflow code

  • Create reusable and readable steps for setting up tests, in a given-when-then manner.

  • Test rendering and execution of your DAG’s tasks

  • Real world examples from a monorepo containing multiple Airflow projects

Written only with pytest, and some code I stole from smart people in github.com/apache/airflow/tests
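As a minimal given-when-then sketch with plain pytest (the dag_id and task id are illustrative), parsing and wiring can be asserted before any Airflow code ships:

```python
# Hypothetical sketch: BDD-style DAG tests using only pytest and Airflow's DagBag.
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dagbag():
    # GIVEN the repository's DAGs are importable
    return DagBag(include_examples=False)


def test_dags_parse(dagbag):
    # WHEN the DagBag loads THEN there are no import errors
    assert dagbag.import_errors == {}


def test_task_wiring(dagbag):
    # WHEN we fetch a DAG THEN its tasks are wired as expected
    dag = dagbag.get_dag("daily_export")  # illustrative dag_id
    assert dag is not None
    assert "extract" in dag.task_ids
```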

Georgian
Behaviour Driven Development can, in the simplest of terms, be described as Test Driven Development, only readable. It is of course more than that, but that is not the aim of this talk. This talk aims to show: How to write tests before you write a single line of Airflow code Create reusable and readable steps for setting up tests, in a given-when-then manner. Test rendering and execution of your DAG’s tasks
11:00 - 11:25.
By Jennifer Chisik
Track: Sponsored
09/11/2024 11:00 AM 09/11/2024 11:25 AM America/Los_Angeles AS24: Boost Airflow Monitoring and Alerting with Automation Analytics & Intelligence by Broadcom

This talk is presented by Broadcom.

Airflow’s “workflow as code” approach has many benefits, including enabling dynamic pipeline generation and flexibility and extensibility in a seamless development environment. However, what challenges do you face as you expand your Airflow footprint in your organization? What if you could enhance Airflow’s monitoring capabilities, forecast DAG and task executions, obtain predictive alerting, visualize trends, and get more robust logging?

Broadcom’s Automation Analytics & Intelligence (AAI) offers advanced analytics for workload automation across cloud and on-premises environments. It connects easily with Airflow to offer improved visibility into dependencies between tasks in Airflow DAGs, along with the workload’s critical path, dynamic SLA management, and more.

Join our presentation to hear more about how AAI can help you improve service delivery. We will also lead a workshop that will allow you to dive deeper into how easy it is to install our Airflow Connector and get started visualizing your Airflow DAGs to optimize your workload and identify issues before they impact your business.

Elizabethan A+B
This talk is presented by Broadcom. Airflow’s “workflow as code” approach has many benefits, including enabling dynamic pipeline generation and flexibility and extensibility in a seamless development environment. However, what challenges do you face as you expand your Airflow footprint in your organization? What if you could enhance Airflow’s monitoring capabilities, forecast DAG and task executions, obtain predictive alerting, visualize trends, and get more robust logging? Broadcom’s Automation Analytics & Intelligence (AAI) offers advanced analytics for workload automation across cloud and on-premises environments.
11:30 - 12:15.
By Marion Azoulai & Maggie Stark
Track: Best practices
09/11/2024 11:30 AM 09/11/2024 12:15 PM America/Los_Angeles AS24: A New DAG Paradigm: Less Airflow more DAGs

Astronomer’s data team recently underwent a major shift in how we work with Airflow. We’ll deep dive into the challenges which prompted that change, how we addressed them and where we are now.

This re-architecture included:

  • Switching to dataset scheduling and micro-pipelines to minimize failures and increase reliability.

  • Implementing a Control DAG for complex dependency management and full end-to-end pipeline visibility.

  • Standardized Task Groups for quick onboarding and scalability.

With Airflow managing itself, we can once again focus on the data rather than the operational overhead. As proof we’ll share our favorite statistics from the terabyte of data we process daily revealing insights into how the world’s data teams use Airflow.
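As one possible reading of the Control DAG idea (a generic sketch, not Astronomer’s internal code; the dag ids are illustrative), a parent DAG triggers each micro-pipeline and waits, so end-to-end status is visible in one place:

```python
# Hypothetical sketch of a Control DAG over micro-pipelines.
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG("control", start_date=datetime(2024, 9, 10), schedule="@daily"):
    ingest = TriggerDagRunOperator(
        task_id="run_ingest",
        trigger_dag_id="ingest",   # a micro-pipeline DAG
        wait_for_completion=True,  # surface its outcome in the parent
    )
    transform = TriggerDagRunOperator(
        task_id="run_transform",
        trigger_dag_id="transform",
        wait_for_completion=True,
    )
    ingest >> transform
```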

Georgian
Astronomer’s data team recently underwent a major shift in how we work with Airflow. We’ll deep dive into the challenges which prompted that change, how we addressed them and where we are now. This re-architecture included: Switching to dataset scheduling and micro-pipelines to minimize failures and increase reliability. Implementing a Control DAG for complex dependency management and full end-to-end pipeline visibility. Standardized Task Groups for quick onboarding and scalability. With Airflow managing itself, we can once again focus on the data rather than the operational overhead.
11:30 - 12:15.
By Hugo Hobson
Track: Airflow & ...
09/11/2024 11:30 AM 09/11/2024 12:15 PM America/Los_Angeles AS24: Building on Cosmos: Making dbt on Airflow Easy

Balyasny Asset Management (BAM) is a diversified global investment firm founded in 2001 with over $20 billion in assets under management.

As dbt took hold at BAM, we had multiple teams building dbt projects against Snowflake, Redshift, and SQL Server. The common question was: How can we quickly and easily productionise our projects?

Airflow is the orchestrator of choice at BAM, but our dbt users ranged from Airflow power users to people who’d never heard of Airflow before. We built a single solution on top of Cosmos that allowed us to:

  • Decouple the dbt project from the Airflow repository

  • Have each dbt node run as a separate Airflow task

  • Allow users to run dbt with little to no Airflow knowledge

  • Enable users to have fine-grained control over how dbt is run and to combine it with other Airflow tasks

  • Provide observability, monitoring, and alerting

California West
Balyasny Asset Management (BAM) is a diversified global investment firm founded in 2001 with over $20 billion in assets under management. As dbt took hold at BAM, we had multiple teams building dbt projects against Snowflake, Redshift, and SQL Server. The common question was: How can we quickly and easily productionise our projects? Airflow is the orchestrator of choice at BAM, but our dbt users ranged from Airflow power users to people who’d never heard of Airflow before.
11:30 - 11:55.
By Mikhail Epikhin & Deepan Ignaatious
Track: Sponsored
09/11/2024 11:30 AM 09/11/2024 11:55 AM America/Los_Angeles AS24: Comparing Airflow Executors and Custom Environments

With recent work in the direction of Executor Decoupling and interest in Hybrid Execution, we find it’s still quite common for Airflow users to rely on old-time rules of thumb like “Don’t use Airflow with LocalExecutor in production”, “If your scheduler lags, split your DAGs over two separate Airflow clusters”, and so on. In our talk, we will show a deep-dive comparison between the various execution models Airflow supports and hopefully update our understanding of their efficiency and limitations.

Elizabethan A+B
With recent work in the direction of Executor Decoupling and interest in Hybrid Execution, we find it’s still quite common for Airflow users to rely on old-time rules of thumb like “Don’t use Airflow with LocalExecutor in production”, “If your scheduler lags, split your DAGs over two separate Airflow clusters”, and so on. In our talk, we will show a deep-dive comparison between the various execution models Airflow supports and hopefully update our understanding of their efficiency and limitations.
11:30 - 12:15.
By Albert Okiri
Track: Use cases
09/11/2024 11:30 AM 09/11/2024 12:15 PM America/Los_Angeles AS24: Using Airflow for Social Impact by Provisioning Datasets According to Demand

We use Airflow to provide datasets for analytics according to user demands and the initiatives being undertaken at the time. The ease of ingesting data by dynamically generating and triggering workflows that correspond to configuration files enables efficient workflows and social impact. We use Airflow to empower decision makers by enabling the timely provision of datasets that are tested for quality and other metrics using inbuilt Airflow features. These datasets are accessible through a Superset instance, creating an ‘on-demand for data’ approach to data analysis that is optimized and leads to positive, effective outcomes.

California East
We use Airflow to provide datasets for analytics according to user demands and the initiatives being undertaken at the time. The ease of ingesting data by dynamically generating and triggering workflows that correspond to configuration files enables efficient workflows and social impact. We use Airflow to empower decision makers by enabling the timely provision of datasets that are tested for quality and other metrics using inbuilt Airflow features. These datasets are accessible through a Superset instance, creating an ‘on-demand for data’ approach to data analysis that is optimized and leads to positive, effective outcomes.
12:00 - 12:25.
By Joe Goldberg
Track: Sponsored
09/11/2024 12:00 PM 09/11/2024 12:25 PM America/Los_Angeles AS24: Airflow and Control-M: Where Data Pipelines Meet Business Applications in Production

This talk is presented by BMC.

With Airflow’s mainstream acceptance in the enterprise, the operational challenges of running with applications in production have emerged.

At last year’s Airflow Summit in Toronto, three providers of Apache Airflow met to discuss “The Future of Airflow: What Users Want”. Among the user requirements in the session were:

  • An improved security model allowing “Alice” and “Bob” to run their single DAGs without each requiring a separate Airflow cluster, while still adhering to their organization’s compliance requirements.

  • An “Orchestrator of Orchestrators” relationship in which Airflow oversees the myriad orchestrators embedded in many tools and provided by cloud vendors.

That panel discussion described what Airflow users now understand to be mandatory for their workloads in enterprise production, and defined the exact operational requirements our customers have successfully tackled for decades.

Join us in this session to learn how Control-M’s Airflow integration helps data engineers do what they need to do with Airflow and gives IT Ops the key to deliver enterprise business application results in production.

Elizabethan A+B
This talk is presented by BMC. With Airflow’s mainstream acceptance in the enterprise, the operational challenges of running with applications in production have emerged. At last year’s Airflow Summit in Toronto, three providers of Apache Airflow met to discuss “The Future of Airflow: What Users Want”. Among the user requirements in the session were: An improved security model allowing “Alice” and “Bob” to run their single DAGs without each requiring a separate Airflow cluster, while still adhering to their organization’s compliance requirements.
12:30 - 13:15.
By Rahul Gade & Keshav Tyagi
Track: Use cases
09/11/2024 12:30 PM 09/11/2024 1:15 PM America/Los_Angeles AS24: LinkedIn's Continuous Deployment

LinkedIn Continuous Deployment (LCD) started with the goal of improving the deployment experience and expanding its outreach to all LinkedIn systems. LCD delivers a modern deployment UX and easy-to-customize pipelines, which enables all LinkedIn applications to declare their deployment pipelines.

LCD’s vision is to automate cluster provisioning and deployments, and to enable touchless (continuous) deployments while reducing the manual toil involved.

LCD is powered by Airflow to orchestrate its deployment pipelines and automate the validation steps. For our customers, Airflow is an implementation detail that we have abstracted away behind our no-code/low-code pipelines. Users describe their pipeline intent (via CLI/UI) and LCD translates the pipeline intent into Airflow DAGs.

LCD pipelines are built of steps. In order to democratize the adoption of LCD, we have leveraged the K8sPodOperator to run steps inside the pipeline. LCD partner teams expose validation actions as containers, which the LCD pipeline runs as steps.

At full scale, LCD will have about 10K+ DAGs running in parallel.

California East
LinkedIn Continuous Deployment (LCD) started with the goal of improving the deployment experience and expanding its outreach to all LinkedIn systems. LCD delivers a modern deployment UX and easy-to-customize pipelines, which enables all LinkedIn applications to declare their deployment pipelines. LCD’s vision is to automate cluster provisioning and deployments, and to enable touchless (continuous) deployments while reducing the manual toil involved. LCD is powered by Airflow to orchestrate its deployment pipelines and automate the validation steps.
12:30 - 12:55.
By Nick Bilozerov
Track: Sponsored
09/11/2024 12:30 PM 09/11/2024 12:55 PM America/Los_Angeles AS24: Stress-Free Airflow development: From Dev to Prod at Stripe

At Stripe, compliance with regulations is of utmost importance, and ensuring the integrity of production data is crucial. To address this challenge, Stripe developed a powerful system called User Scope Mode (USM), which allows users to safely and efficiently test new or existing Airflow pipelines without the risk of corrupting production data.

USM takes care of automatically overwriting the necessary configurations for Airflow pipelines, enabling users to test their production-ready pipelines locally with ease. This approach empowers Stripe’s teams to iterate and refine their workflows without the burden of manual setup or the fear of disrupting live operations.

In this talk, we’ll dive into the inner workings of USM and explore how it has transformed Stripe’s development and testing processes. Discover how this system seamlessly integrates with Airflow, allowing users to validate their pipelines with confidence and agility, all while maintaining the highest standards of compliance and data integrity.

Elizabethan A+B
At Stripe, compliance with regulations is of utmost importance, and ensuring the integrity of production data is crucial. To address this challenge, Stripe developed a powerful system called User Scope Mode (USM), which allows users to safely and efficiently test new or existing Airflow pipelines without the risk of corrupting production data. USM takes care of automatically overwriting the necessary configurations for Airflow pipelines, enabling users to test their production-ready pipelines locally with ease.
12:30 - 13:15.
By Amit Chauhan
Track: Airflow & ...
09/11/2024 12:30 PM 09/11/2024 1:15 PM America/Los_Angeles AS24: Weathering the Cloud Storms With Multi-Region Airflow Workflows

Cloud availability zones and regions are not immune to outages. These zones regularly go down, and regions become unavailable due to natural disasters or human-caused incidents. Thus, if an availability zone or region goes down, so do your Airflow workflows and applications… unless your Airflow workflows function across multiple geographic locations.

This hands-on session introduces you to the design patterns of multi-region Airflow workflows in the cloud, which can tolerate zone and region-level incidents. We will start with a traditional single-region configuration and then switch to a multi-region setting. By the end, we’ll have a working prototype of a multi-region Airflow pipeline that recovers from region-level outages within a few seconds, with no data loss or disruption to the application layer.

California West
Cloud availability zones and regions are not immune to outages. These zones regularly go down, and regions become unavailable due to natural disasters or human-caused incidents. Thus, if an availability zone or region goes down, so do your Airflow workflows and applications… unless your Airflow workflows function across multiple geographic locations. This hands-on session introduces you to the design patterns of multi-region Airflow workflows in the cloud, which can tolerate zone and region-level incidents.
12:30 - 13:15.
By Julian LaNeve & David Xue
Track: Best practices
09/11/2024 12:30 PM 09/11/2024 1:15 PM America/Los_Angeles AS24: Why Do Airflow Tasks Fail? An Analysis through Machine Learning Techniques

There are 3 certainties in life: death, taxes, and data pipelines failing. Pipelines may fail for a number of reasons: you may run out of memory, your credentials may expire, an upstream data source may not be reliable, etc. But there are patterns we can learn from!

Join us as we walk through an analysis we’ve done on a massive dataset of Airflow failure logs. We’ll show how we used natural language processing and dimensionality reduction methods to explore the latent space of Airflow task failures in order to cluster, visualize, and understand failures.

We’ll conclude the talk by walking through mitigation methods for common task failure reasons, and walk through how we can use Airflow to build an MLOps platform to turn this one-time analysis into a reliable, recurring activity.
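As a toy sketch of the style of analysis described (the log lines below are made up and the real dataset is far larger), failure messages can be vectorized, reduced, and clustered with off-the-shelf tooling:

```python
# Hypothetical sketch: cluster task-failure messages with scikit-learn.
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

failure_logs = [
    "Task exited with return code Negsignal.SIGKILL",   # likely OOM
    "airflow.exceptions.AirflowTaskTimeout: Timeout",   # slow upstream
    "Connection refused while reaching upstream API",   # flaky source
]

vectors = TfidfVectorizer().fit_transform(failure_logs)
# Dimensionality reduction before clustering, as in the talk's approach.
reduced = TruncatedSVD(n_components=2).fit_transform(vectors)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(reduced)
print(labels)  # cluster id per failure message
```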

Georgian
There are 3 certainties in life: death, taxes, and data pipelines failing. Pipelines may fail for a number of reasons: you may run out of memory, your credentials may expire, an upstream data source may not be reliable, etc. But there are patterns we can learn from! Join us as we walk through an analysis we’ve done on a massive dataset of Airflow failure logs. We’ll show how we used natural language processing and dimensionality reduction methods to explore the latent space of Airflow task failures in order to cluster, visualize, and understand failures.
13:00 - 13:25.
By
Track: Sponsored
09/11/2024 1:00 PM 09/11/2024 1:25 PM America/Los_Angeles AS24: Session presented by Astronomer

Elizabethan A+B
14:30 - 14:55.
By Roberto Santamaria & Xiaodong Deng
Track: Airflow & ...
09/11/2024 2:30 PM 09/11/2024 2:55 PM America/Los_Angeles AS24: Building in resource awareness and event dependency into Airflow

In this talk, we will explore how adding custom dependency checks into Airflow’s scheduling system can elevate Airflow’s performance.

We will specifically discuss how we added general upstream events dependency checking as well as how to make Airflow aware of used/available compute resources so that the system can better decide when and where to run a given task on Kubernetes infrastructure.

We’ll cover why the existing dependency checking in Airflow is not sufficient for our use case, and why adding custom code to Airflow is needed. We’ll cover the pros and cons of this approach.

California West
In this talk, we will explore how adding custom dependency checks into Airflow’s scheduling system can elevate Airflow’s performance. We will specifically discuss how we added general upstream events dependency checking as well as how to make Airflow aware of used/available compute resources so that the system can better decide when and where to run a given task on Kubernetes infrastructure. We’ll cover why the existing dependency checking in Airflow is not sufficient for our use case, and why adding custom code to Airflow is needed.
14:30 - 17:20.
Track: Workshops
09/11/2024 2:30 PM 09/11/2024 5:20 PM America/Los_Angeles AS24: Featured Workshops

We will be offering hands-on workshops so you can get practical experience with Airflow tools and managed offerings.

  • Format & duration: Workshops are instructor led, 2-3 hours long, bring your own device.
  • Only available for participants with a Conference + Workshop pass.
  • Workshops have limited capacity. You can sign up in advance for 2 workshops (one per day) to get a confirmed spot.
  • Workshops can accept walk-ins (people who didn’t sign up in advance), but spots are limited and not all walk-ins are guaranteed a place.
Elizabethan C, Elizabethan D, Borgia
We will be offering hands-on workshops so you can get practical experience with Airflow tools and managed offerings. Format & duration: Workshops are instructor led, 2-3 hours long, bring your own device. Only available for participants with a Conference + Workshop pass. Workshops have limited capacity. You can sign up in advance for 2 workshops (one per day) to get a confirmed spot. Workshops can accept walk-ins (people who didn’t sign up in advance), but spots are limited and not all walk-ins are guaranteed a place.
14:30 - 14:55.
By Ipsa Trivedi & Subramanian Vellaiyan
Track: Use cases
09/11/2024 2:30 PM 09/11/2024 2:55 PM America/Los_Angeles AS24: Scalable Development of Event Driven Airflow DAGs

This use case shows how we deal with data of different varieties from different sources. Each source sends data with different layouts, timings, structures, location patterns, and sizes. The goal is to process the files within SLA and send them out. This is a complex multi-step processing pipeline that involves multiple Spark jobs, API-based integrations with microservices, resolving unique IDs, deduplication, and filtering. Note that this is an event-driven system, but not a streaming data system. The files are of gigabyte scale, and each day the data being processed is of terabyte scale.

We will be talking about how to make DAG creation and business logic building a “low-code no-code process” so that non-technical analysts can write business logic and light developers can deploy DAGs without much manual effort. Every aspect is configuration driven, whether source-specific or source-agnostic.

Airflow was chosen to enable easy DAG building, scaling, monitoring, troubleshooting and rerunning.

California East
This use case shows how we deal with data of different varieties from different sources. Each source sends data with different layouts, timings, structures, location patterns, and sizes. The goal is to process the files within SLA and send them out. This is a complex multi-step processing pipeline that involves multiple Spark jobs, API-based integrations with microservices, resolving unique IDs, deduplication, and filtering. Note that this is an event-driven system, but not a streaming data system.
14:30 - 14:55.
By Dennis Ferruzzi & Syed Hussain
Track: New features
09/11/2024 2:30 PM 09/11/2024 2:55 PM America/Los_Angeles AS24: The Essentials of Custom Executor Development

Since version 2.7 and the advent of AIP-51, Airflow has started to fully support the creation of custom executors. Before we dive into the components of an executor and how they work, we will briefly discuss the Executor Decoupling initiative which allowed this new feature. Once we understand the parts required, we will explore the process of crafting our own executors, using real-world examples, and demonstrations of executors developed within the Amazon Provider Package as a guide. By demystifying the process of executor creation and emphasizing the opportunities for contribution, we aim to empower Airflow users and providers to harness the full potential of custom executors, enriching the Airflow ecosystem as a whole!
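As a minimal sketch of the surface area involved (a toy backend, not the Amazon provider’s executors), a custom executor mainly implements execute_async() and sync():

```python
# Hypothetical sketch: the smallest possible custom executor shape under AIP-51.
from airflow.executors.base_executor import BaseExecutor
from airflow.utils.state import TaskInstanceState


class ToyExecutor(BaseExecutor):
    """Illustrative only: 'runs' every task by immediately reporting success."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._backend = []  # stand-in for a real queue, API, or pod backend

    def execute_async(self, key, command, queue=None, executor_config=None):
        # The scheduler hands over a task-instance key plus the command to run;
        # a real executor would submit `command` to its backend here.
        self._backend.append(key)

    def sync(self):
        # Called on every executor heartbeat: reconcile backend state back
        # into Airflow by reporting terminal task states.
        while self._backend:
            self.change_state(self._backend.pop(), TaskInstanceState.SUCCESS)
```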

Elizabethan A+B
Since version 2.7 and the advent of AIP-51, Airflow has started to fully support the creation of custom executors. Before we dive into the components of an executor and how they work, we will briefly discuss the Executor Decoupling initiative which allowed this new feature. Once we understand the parts required, we will explore the process of crafting our own executors, using real-world examples, and demonstrations of executors developed within the Amazon Provider Package as a guide.
15:05 - 15:30.
By Mike Hirsch & Sophie Keith
Track: Use cases
09/11/2024 3:05 PM 09/11/2024 3:30 PM America/Los_Angeles AS24: A Game of Constant Learning & Adjustment: Orchestrating ML Pipelines at the Philadelphia Phillies

When developing Machine Learning (ML) models, the biggest challenges are often infrastructural. How do we deploy our model and expose an inference API? How can we retrain? Can we continuously evaluate performance and monitor model drift?

In this talk, we will present how we are tackling these problems at the Philadelphia Phillies by developing a suite of tools that enable our software engineering and analytics teams to train, test, evaluate, and deploy ML models - that can be entirely orchestrated in Airflow. This framework abstracts away the infrastructural complexities that productionizing ML Pipelines presents and allows our analysts to focus on developing robust baseball research for baseball operations stakeholders across player evaluation, acquisition, and development.

We’ll also look at how we use Airflow, MLflow, MLServer, cloud services, and GitHub Actions to architect a platform that supports our framework for all points of the ML Lifecycle.

California East
When developing Machine Learning (ML) models, the biggest challenges are often infrastructural. How do we deploy our model and expose an inference API? How can we retrain? Can we continuously evaluate performance and monitor model drift? In this talk, we will present how we are tackling these problems at the Philadelphia Phillies by developing a suite of tools that enable our software engineering and analytics teams to train, test, evaluate, and deploy ML models - that can be entirely orchestrated in Airflow.
15:05 - 15:30.
By Niko Oliveira
Track: New features
09/11/2024 3:05 PM 09/11/2024 3:30 PM America/Los_Angeles AS24: Hybrid Executors: Have Your Cake and Eat it Too

Executors are a core concept in Apache Airflow and they are an essential piece to the execution of DAGs. They continue to see investment and innovation including a new feature launching this year: Hybrid Execution.

This talk will give a brief overview of executors, how they work, and what they are responsible for, followed by a description of Hybrid Executors (AIP-61), a new feature that allows multiple executors to be used natively and seamlessly side by side within a single Airflow environment. We’ll deep dive into how this feature works and how users can make use of it, compare it to what was available before, and finally run a demo to see it in action. Don’t miss this chance to learn about the cutting-edge capabilities of executors in Apache Airflow!
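As a minimal sketch of what hybrid execution looks like in practice (assuming Airflow 2.10+ with both executors declared in the configuration; the names are illustrative):

```python
# Hypothetical sketch: per-task executor selection under AIP-61, assuming
# airflow.cfg declares, e.g.:
#   [core]
#   executor = LocalExecutor,KubernetesExecutor
# The first entry is the environment default.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG("hybrid_demo", start_date=datetime(2024, 9, 11), schedule=None):
    PythonOperator(
        task_id="light",
        python_callable=lambda: print("runs on the default LocalExecutor"),
    )
    PythonOperator(
        task_id="heavy",
        python_callable=lambda: print("runs in its own pod"),
        executor="KubernetesExecutor",  # per-task override
    )
```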

Elizabethan A+B
Executors are a core concept in Apache Airflow and they are an essential piece to the execution of DAGs. They continue to see investment and innovation including a new feature launching this year: Hybrid Execution. This talk will give a brief overview of executors, how they work, and what they are responsible for, followed by a description of Hybrid Executors (AIP-61), a new feature that allows multiple executors to be used natively and seamlessly side by side within a single Airflow environment.
15:05 - 15:30.
By Philippe Gagnon
Track: Airflow & ...
09/11/2024 3:05 PM 09/11/2024 3:30 PM America/Los_Angeles AS24: Investigating the Many Loops of the Airflow Scheduler

The scheduler is unarguably the most important component of an Airflow cluster. It is also the most complex and misunderstood by practitioners and administrators alike.

In this talk, we will follow the path that a task instance takes to progress from creation to execution, and discuss the various configuration settings allowing users to tune the scheduler and executor to suit their workload patterns. Finally, we will dive deep into critical sections of the Airflow codebase and explore opportunities for optimization.

California West
The scheduler is unarguably the most important component of an Airflow cluster. It is also the most complex and misunderstood by practitioners and administrators alike. In this talk, we will follow the path that a task instance takes to progress from creation to execution, and discuss the various configuration settings allowing users to tune the scheduler and executor to suit their workload patterns. Finally, we will dive deep into critical sections of the Airflow codebase and explore opportunities for optimization.
15:40 - 16:05.
By Nathaniel Rose
Track: Use cases
09/11/2024 3:40 PM 09/11/2024 4:05 PM America/Los_Angeles AS24: Architecting Blockchain ETL Orchestration: Circle’s Airflow Use Case

This talk focuses on exploring the implementation of Apache Airflow for Blockchain ETL orchestration, indexing, and the adoption of GitOps at Circle. It will cover CI/CD tips, architectural choices for managing Blockchain data at scale, engineering practices to enable data scientists, and some learnings from production.

California East
This talk focuses on exploring the implementation of Apache Airflow for Blockchain ETL orchestration, indexing, and the adoption of GitOps at Circle. It will cover CI/CD tips, architectural choices for managing Blockchain data at scale, engineering practices to enable data scientists, and some learnings from production.
15:40 - 16:05.
By Fritz Davenport
Track: Airflow & ...
09/11/2024 3:40 PM 09/11/2024 4:05 PM America/Los_Angeles AS24: Converting Legacy Schedulers to Airflow

Introducing a process and framework to convert legacy scheduler workloads such as Control-M to Airflow using automated transpilation techniques. We will discuss the process and demonstrate a python-based transpiler to automatically migrate legacy scheduler workflows with a standard set of patterns to Airflow DAGs. This framework is easily extended via configurable rulesets to encompass other schedulers such as Automic, Autosys, Oozie, and others.

California West
Introducing a process and framework to convert legacy scheduler workloads such as Control-M to Airflow using automated transpilation techniques. We will discuss the process and demonstrate a python-based transpiler to automatically migrate legacy scheduler workflows with a standard set of patterns to Airflow DAGs. This framework is easily extended via configurable rulesets to encompass other schedulers such as Automic, Autosys, Oozie, and others.
15:40 - 16:05.
By Wei Lee
Track: New features
09/11/2024 3:40 PM 09/11/2024 4:05 PM America/Los_Angeles AS24: What If...? Running Airflow Tasks without the workers

Airflow executes all tasks on the workers, including deferrable operators that must run on the workers before deferring to the triggerer. However, running some tasks directly from the triggerer can be beneficial in certain situations. This presentation will explain how deferrable operators function and examine ways to modify the Airflow implementation to enable tasks to run directly from the triggerer.
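As a minimal sketch of the deferral handshake the talk builds on (a toy operator, not the proposed change itself): execute() starts on a worker, hands off to the triggerer, and a worker picks the task back up in execute_complete():

```python
# Hypothetical sketch: the standard deferrable-operator lifecycle.
from datetime import timedelta

from airflow.models.baseoperator import BaseOperator
from airflow.triggers.temporal import TimeDeltaTrigger


class WaitOneHour(BaseOperator):
    def execute(self, context):
        # Occupies a worker slot only long enough to defer.
        self.defer(
            trigger=TimeDeltaTrigger(timedelta(hours=1)),  # runs on the triggerer
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Resumes on a worker once the trigger fires; the talk explores
        # whether steps like this could run on the triggerer instead.
        self.log.info("Hour elapsed; resuming.")
```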

Elizabethan A+B
Airflow executes all tasks on the workers, including deferrable operators that must run on the workers before deferring to the triggerer. However, running some tasks directly from the triggerer can be beneficial in certain situations. This presentation will explain how deferrable operators function and examine ways to modify the Airflow implementation to enable tasks to run directly from the triggerer.
16:30 - 16:55.
By Maciej Obuchowski
Track: New features
09/11/2024 4:30 PM 09/11/2024 4:55 PM America/Los_Angeles AS24: OpenLineage: From Operators to Hooks

“More data lineage” was the second most popular feature request in the Airflow Survey 2023. However, despite the integration of OpenLineage in Airflow 2.7 through AIP-53, the most popular Operator in Airflow - PythonOperator - isn’t covered by lineage support.

With the addition of the TaskFlow API, Airflow Datasets, Airflow ObjectStore, and many other small changes, writing DAGs without using other operators is easier than ever. And that’s why lineage collection in Airflow is moving beyond specific Operators to cover Hooks and Object Storage.

In this session, you’ll learn how the newly added AIP-62 allows you to author DAGs the way you love, while keeping your data pipeline well covered by lineage.

Elizabethan A+B
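To make that authoring style concrete, here is a minimal sketch of a TaskFlow task doing its I/O through Airflow’s Object Storage layer, the level at which hook-based lineage (AIP-62) can observe it; the bucket and paths are illustrative and assume a configured object storage backend:

    import pendulum
    from airflow.decorators import dag, task
    from airflow.io.path import ObjectStoragePath

    @dag(schedule=None, start_date=pendulum.datetime(2024, 9, 1), catchup=False)
    def object_storage_lineage_demo():
        @task
        def clean_events():
            src = ObjectStoragePath("s3://example-bucket/raw/events.json")
            dst = ObjectStoragePath("s3://example-bucket/clean/events.json")
            # Reads and writes go through Airflow's storage abstraction,
            # so lineage can be collected without a dedicated operator.
            dst.write_bytes(src.read_bytes())

        clean_events()

    object_storage_lineage_demo()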
16:30 - 16:55.
By Cedrik Neumann
Track: Airflow & ...
09/11/2024 4:30 PM 09/11/2024 4:55 PM America/Los_Angeles AS24: Profiling Airflow tasks with Memray

Profiling Airflow tasks can be difficult, especially in remote environments. In this talk I will demonstrate how we can leverage Airflow’s plugin mechanism to selectively run Airflow tasks within the context of a profiler and, with the help of operator links and custom views, make the results available to the user.

The content of this talk can provide inspiration on how Airflow may in the future allow the gathering of custom task metrics and make those metrics easily accessible.

California West
16:30 - 16:55.
By Elona Zharri, Nikhil Nandoskar & Prince Bose
Track: Use cases
09/11/2024 4:30 PM 09/11/2024 4:55 PM America/Los_Angeles AS24: Unlocking the Power of AI at Ford: A Behind-the-Scenes Look at Mach1ML and Airflow

Ford Motor Company is undergoing a significant transformation, embracing AI and Machine Learning to power its smart mobility strategy, enhance customer experiences, and drive innovation in the automotive industry. Mach1ML, Ford’s multi-million dollar ML platform, plays a crucial role in this journey by empowering data scientists and engineers to efficiently build, deploy, and manage ML models at scale. This presentation will delve into how Mach1ML leverages Apache Airflow as its orchestration layer to tackle the challenges of complex ML workflows that include disparate systems, manual processes, security concerns, and deployment complexities. We will explore the benefits of using Airflow, such as increased efficiency, improved reliability, enhanced scalability, and faster time-to-value. Additionally, we will showcase how Mach1ML utilizes Airflow capabilities to generate reusable templates and streamline environment promotions to further empower Ford’s AI practitioners and accelerate the delivery of cutting-edge AI-powered solutions supporting the next generation of vehicles.

California East
17:05 - 17:30.
By Eloi Codina Torras
Track: Airflow & ...
09/11/2024 5:05 PM 09/11/2024 5:30 PM America/Los_Angeles AS24: Airflow and multi-cluster Slurm working together

Meteosim provides environmental services, mainly based on weather and air quality intelligence, and helps customers make operational and tactical decisions and understand their companies’ environmental impact. We introduced Airflow a couple of years ago to replace a huge Crontab file and we currently have around 7000 DAG Runs per day.

In this presentation we will introduce the hardest challenge we had to overcome: adapting Airflow to run on multiple Slurm-managed HPC clusters by using deferrable operators. Slurm is an open-source cluster manager used especially by science-focused companies and organizations, and by many supercomputers worldwide. By using Slurm, our simulations run on bare-metal nodes, eliminating overhead and speeding up the intensive calculations.

Moreover, we will present our use case: how we use Airflow to provide our services and how we streamlined the DAG creation process, so our Product Engineers only need to write a few lines of code and all DAGs are standardized and stored in a database.

California West
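For readers curious how the deferrable piece might look, here is a minimal sketch of a trigger that polls Slurm asynchronously from the triggerer instead of occupying a worker slot; the module path, sacct invocation, and state handling are illustrative, not Meteosim’s actual implementation:

    import asyncio

    from airflow.triggers.base import BaseTrigger, TriggerEvent

    class SlurmJobTrigger(BaseTrigger):
        """Polls a Slurm job's state without holding an Airflow worker slot."""

        def __init__(self, job_id: str, poll_seconds: int = 30):
            super().__init__()
            self.job_id = job_id
            self.poll_seconds = poll_seconds

        def serialize(self):
            return (
                "plugins.slurm_triggers.SlurmJobTrigger",  # illustrative path
                {"job_id": self.job_id, "poll_seconds": self.poll_seconds},
            )

        async def run(self):
            while True:
                # Ask Slurm's accounting for the job state (assumes sacct is
                # reachable from the triggerer, e.g. via a login node).
                proc = await asyncio.create_subprocess_exec(
                    "sacct", "-j", self.job_id, "-n", "-o", "State",
                    stdout=asyncio.subprocess.PIPE,
                )
                out, _ = await proc.communicate()
                states = out.decode().split()
                state = states[0] if states else "PENDING"
                if state in ("COMPLETED", "FAILED", "CANCELLED"):
                    yield TriggerEvent({"job_id": self.job_id, "state": state})
                    return
                await asyncio.sleep(self.poll_seconds)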
17:05 - 17:30.
By Jens Scheffler
Track: Use cases
09/11/2024 5:05 PM 09/11/2024 5:30 PM America/Los_Angeles AS24: How we tuned our Airflow to make 1.2 million DAG runs - per day!

As we deployed Airflow in our enterprise, connected to various event sources to implement our data-driven pipelines, we were faced with event storms a couple of times. Because such event storms often happened unplanned and with increasing load waves, we tuned the setup iteratively; at times we were in panic mode and had to add quick workarounds.

Starting from a peak of 1,000 triggers in an hour, we were happy that workload simply queued. But at a certain point we started tuning the setup. Over roughly 10-20 iterations, which we would like to share as best practice, we tuned standard parameters, increased resources, changed integration strategies, and developed patches to the core scheduler.

This talk is a retrospective of the steps we took, sharing tuning options and scaling strategies. From fearing a queue that degraded performance at 10,000 runs to handling a peak event reception of 400k runs in an hour, it was a long way. You will also hear about some anti-patterns we learned from.

California East
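For orientation, tuning of the standard parameters mentioned here usually starts with a handful of airflow.cfg knobs like the following; the values are illustrative, not the settings from the talk:

    [core]
    parallelism = 512                    # task instances running concurrently per scheduler
    max_active_tasks_per_dag = 64

    [scheduler]
    parsing_processes = 4                # parallel DAG file parsing
    max_tis_per_query = 512              # batch size for scheduler DB queries
    max_dagruns_to_create_per_loop = 100

    [celery]
    worker_concurrency = 32              # task slots per Celery worker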
17:05 - 17:30.
By Vincent Beck
Track: New features
09/11/2024 5:05 PM 09/11/2024 5:30 PM America/Los_Angeles AS24: Simplified user management in Airflow

Before Airflow 2.9, user management was part of core Airflow, so modifying or customizing it to fit user needs was not easy. Authentication and authorization managers (auth managers) are a new concept in Airflow 2.9, introduced as extensible user management (AIP-56) to give Airflow users a flexible way to integrate with an organization’s identity services. Organizations want a single place to manage permissions, and FAB (Flask App Builder) made that difficult to achieve. In this talk, after explaining the concept of auth managers and why we built them, we will show how you can leverage the new auth manager interface to build an authorization service for Airflow based on your existing identity provider. We will see that auth managers can be leveraged to considerably change how users and their permissions are managed in an Airflow environment.

Finally, we will dive deep into the AWS auth manager as an alternative auth manager and see some different usages as examples.

Elizabethan A+B
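As a pointer for readers: an auth manager is selected through a single configuration option. The class path below is where recent Amazon provider releases ship the AWS auth manager; verify it against your installed provider version:

    [core]
    auth_manager = airflow.providers.amazon.aws.auth_manager.aws_auth_manager.AwsAuthManager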
17:40 - 18:05.
By Jack Cusick
Track: Airflow & ...
09/11/2024 5:40 PM 09/11/2024 6:05 PM America/Los_Angeles AS24: Bronco: Managing Terraform at Scale with Airflow

Airflow is not just purpose-built for data applications. It is a job scheduler on steroids. This is exactly what a cloud platform team needs: a configurable and scalable automation tool that can handle thousands of administrative tasks.

Come learn how one enterprise platform team used Airflow to support cloud infrastructure at unprecedented scale.

California West
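As a flavor of the pattern, here is a minimal sketch of fanning administrative Terraform runs out of Airflow with dynamic task mapping; the workspace names and directory layout are illustrative, not Bronco’s actual code:

    import pendulum
    from airflow.decorators import dag
    from airflow.operators.bash import BashOperator

    @dag(schedule="@hourly", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
    def terraform_fanout():
        workspaces = ["team-a", "team-b", "team-c"]  # illustrative
        # One mapped task instance per workspace, each retried independently.
        BashOperator.partial(task_id="plan", retries=2).expand(
            bash_command=[f"terraform -chdir=infra/{ws} plan" for ws in workspaces]
        )

    terraform_fanout()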
17:40 - 18:05.
By Michael Juster
Track: Use cases
09/11/2024 5:40 PM 09/11/2024 6:05 PM America/Los_Angeles AS24: How we run 100 Airflow environments and millions of Tasks as a Part Time job using Kubernetes

Balyasny Asset Management (BAM) is a diversified global investment firm founded in 2001 with over $20 billion in assets under management. We have more than 100 teams who run a variety of workloads that benefit from Orchestration and parallelization.

Platform Engineers working for companies with K8s ecosystems can use their Kubernetes knowledge and leverage their platform to run Airflow and troubleshoot problems successfully. BAM’s Kubernetes Platform provides production-ready Airflow environments that automatically get Logging, Metrics, Alerting, Scalability, Storage from a range of File Systems, Authentication, Dashboards, Secrets Management, and specialized compute including GPU, CPU Optimized, Memory Optimized and even Windows. If you can run thousands of Pods on your Kubernetes Cluster then you can run thousands of Tasks without needing to do anything! The intention of this talk is to cover:

  • Why K8s and Airflow work so well together

  • How a team of Platform Engineers can leverage their Kubernetes Platform and knowledge to run millions of Tasks without Airflow being their primary focus

  • Examples of where this model can start to fall apart at extreme scale

California East
17:40 - 18:05.
By Ankit Chaurasia
Track: New features
09/11/2024 5:40 PM 09/11/2024 6:05 PM America/Los_Angeles AS24: Mastering Advanced Dataset Scheduling in Apache Airflow

Are you looking to harness the full potential of data-driven pipelines with Apache Airflow? This session will dive into the newly introduced conditional expressions for advanced dataset scheduling in Airflow - a feature highly requested by the Airflow community. Attendees will learn how to effectively use logical operators to create complex dependencies that trigger DAGs based on the dataset updates in real-world scenarios. We’ll also explore the innovative DatasetOrTimeSchedule, which combines time-based and dataset-triggered scheduling for unparalleled flexibility. Furthermore, attendees will discover the latest API endpoints that facilitate external updates and resets of dataset events, streamlining workflow management across different deployments.

This talk also aims to explain:

  • The basics of using conditional expressions for dataset scheduling.

  • How to integrate time-based schedules with dataset triggers.

  • Practical applications of the new API endpoints for enhanced dataset management.

  • Real-world examples of how these features can optimize your data workflows.

Elizabethan A+B
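For reference, the conditional expressions and the combined timetable look like this in DAG code (dataset URIs and the cron expression are illustrative; requires Airflow 2.9+):

    import pendulum
    from airflow.datasets import Dataset
    from airflow.models.dag import DAG
    from airflow.timetables.datasets import DatasetOrTimeSchedule
    from airflow.timetables.trigger import CronTriggerTimetable

    a, b, c = Dataset("s3://a"), Dataset("s3://b"), Dataset("s3://c")

    # Run when both a and b have been updated, or whenever c is updated.
    with DAG(dag_id="conditional_datasets",
             start_date=pendulum.datetime(2024, 9, 1),
             schedule=((a & b) | c), catchup=False):
        ...

    # Run on the dataset condition *or* on a cron schedule, whichever fires.
    with DAG(dag_id="dataset_or_time",
             start_date=pendulum.datetime(2024, 9, 1),
             schedule=DatasetOrTimeSchedule(
                 timetable=CronTriggerTimetable("0 6 * * *", timezone="UTC"),
                 datasets=(a & b)),
             catchup=False):
        ...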
10:00 - 10:30
Morning break
13:30 - 14:30
Lunch
16:05 - 16:30
Afternoon break
09:00 - 09:25. Grand Ballroom
Track: Keynote
09:30 - 09:55. Grand Ballroom
By Alexander Booth & Oliver Dykstra
Track: Keynote
Dive into the winning playbook of the 2023 World Series Champions Texas Rangers, and discover how they leverage Apache Airflow to streamline their data pipelines. In this session, we’ll explore how real-world data pipelines enable agile decision-making and drive competitive advantage in the high-stakes world of professional baseball, all by using Airflow as an orchestration platform. Whether you’re a seasoned data engineer or just starting out, this session promises actionable strategies to elevate your data orchestration game to championship levels.
10:30 - 11:15. California East
By Serjesh Sharma & Vasantha Kosuri-Marshall
Track: Use cases
Ford Motor Company operates extensively across various nations. The Data Operations (DataOps) team for Advanced Driver Assistance Systems (ADAS) at Ford is tasked with processing terabyte-scale daily data from lidar, radar, and video. To manage this, the DataOps team is challenged with orchestrating diverse, compute-intensive pipelines across both on-premises infrastructure and GCP, and with handling sensitive customer data across both environments. The team is also responsible for facilitating the execution of on-demand, compute-intensive algorithms at scale.
10:30 - 11:15. California West
By Tatiana Al-Chueyr Martins & Pankaj Koti
Track: Airflow & ...
The integration between dbt and Airflow is a popular topic in the community, discussed in previous editions of Airflow Summit, at Coalesce, and in the #airflow-dbt Slack channel. Astronomer Cosmos (https://github.com/astronomer/astronomer-cosmos/) stands out as one of the libraries that strives to enhance this integration, having over 300k downloads per month. During its development, we’ve encountered various performance challenges in terms of scheduling and task execution. While we’ve managed to address some, others remain to be resolved.
10:30 - 10:55. Elizabethan A+B
By Rajesh Bishundeo
Track: Airflow & ...
It has been nearly 4 years since the launch of Managed Workflows for Apache Airflow (MWAA) by AWS. It has gone through the trials and tribulations of any new idea: working with customers to better understand its shortcomings, building dedicated teams focused on scaling and growth, and, at its core, preserving the integrity and functionality of Apache Airflow. Initially launched with Airflow 1.10, MWAA is now available globally in multiple AWS regions supporting the latest version of Airflow along with a multitude of features.
10:30 - 10:55. Georgian
By Ramesh Babu K M
Track: Best practices
Our idea to platformize ingestion pipelines is driven by Airflow in the background, streamlining the entire ingestion process for self-service. With customer experience on top, and with the goal of making data ingestion foolproof as part of the Analytics data team, Airflow complements our vision.
11:00 - 11:25. Elizabethan A+B
By Jennifer Chisik
Track: Sponsored
This talk is presented by Broadcom. Airflow’s “workflow as code” approach has many benefits, including enabling dynamic pipeline generation and flexibility and extensibility in a seamless development environment. However, what challenges do you face as you expand your Airflow footprint in your organization? What if you could enhance Airflow’s monitoring capabilities, forecast DAG and task executions, obtain predictive alerting, visualize trends, and get more robust logging? Broadcom’s Automation Analytics & Intelligence (AAI) offers advanced analytics for workload automation across cloud and on-premises environments.
11:00 - 11:25. Georgian
By Ole Christian Langfjæran
Track: Best practices
Behaviour Driven Development can, in the simplest of terms, be described as Test Driven Development, only readable. It is of course more than that, but that is not the aim of this talk. This talk aims to show how to: write tests before you write a single line of Airflow code; create reusable and readable steps for setting up tests, in a given-when-then manner; and test the rendering and execution of your DAG’s tasks.
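As a flavor of the approach, here is a minimal given-when-then style test using plain pytest; the DAG folder, dag_id, and task ids are illustrative, and the talk’s own reusable steps are not shown:

    import pendulum
    import pytest
    from airflow.models import DagBag

    # given: the DAG under test, loaded once per session
    @pytest.fixture(scope="session")
    def dag():
        bag = DagBag(dag_folder="dags/", include_examples=False)
        assert not bag.import_errors
        return bag.get_dag("my_pipeline")

    def test_tasks_render_and_execute(dag):
        # when: run the whole DAG in-process (DAG.test, Airflow 2.5+,
        # requires an initialized metadata database)
        dag.test(execution_date=pendulum.datetime(2024, 9, 1, tz="UTC"))
        # then: the structure is what we expect
        assert {"extract", "load"} <= set(dag.task_ids)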
11:30 - 12:15. California East
By Albert Okiri
Track: Use cases
We use Airflow to provide datasets for analytics according to user demands and initiatives being undertaken at the time. The ease of ingesting data by dynamically generating and triggering workflows that correspond to configuration files enables efficient workflows and social impact. We use Airflow to empower decision makers by enabling timely provision of datasets that are tested on quality and other metrics using inbuilt Airflow features. These datasets are accessible in a Superset instance, creating an “on-demand” approach to data analysis that is optimized and leads to positive, effective outcomes.
11:30 - 12:15. California West
By Hugo Hobson
Track: Airflow & ...
Balyasny Asset Management (BAM) is a diversified global investment firm founded in 2001 with over $20 billion in assets under management. As dbt took hold at BAM, we had multiple teams building dbt projects against Snowflake, Redshift, and SQL Server. The common question was: How can we quickly and easily productionise our projects? Airflow is the orchestrator of choice at BAM, but our dbt users ranged from Airflow power users to people who’d never heard of Airflow before.
11:30 - 11:55. Elizabethan A+B
By Mikhail Epikhin & Deepan Ignaatious
Track: Sponsored
With recent work in the direction of executor decoupling and interest in hybrid execution, we find it’s still quite common for Airflow users to rely on old rules of thumb like “Don’t use Airflow with LocalExecutor in production” or “If your scheduler lags, split your DAGs over two separate Airflow clusters”. In our talk, we will present a deep-dive comparison of the various execution models Airflow supports and hopefully update the understanding of their efficiency and limitations.
11:30 - 12:15. Georgian
By Marion Azoulai & Maggie Stark
Track: Best practices
Astronomer’s data team recently underwent a major shift in how we work with Airflow. We’ll deep dive into the challenges which prompted that change, how we addressed them and where we are now. This re-architecture included: Switching to dataset scheduling and micro-pipelines to minimize failures and increase reliability. Implementing a Control DAG for complex dependency management and full end-to-end pipeline visibility. Standardized Task Groups for quick onboarding and scalability. With Airflow managing itself, we can once again focus on the data rather than the operational overhead.
12:00 - 12:25. Elizabethan A+B
By Joe Goldberg
Track: Sponsored
This talk is presented by BMC. With Airflow’s mainstream acceptance in the enterprise, the operational challenges of running applications in production have emerged. At last year’s Airflow Summit in Toronto, three providers of Apache Airflow met to discuss “The Future of Airflow: What Users Want”. Among the user requirements in the session were: an improved security model allowing “Alice” and “Bob” to run their single DAGs without each requiring a separate Airflow cluster, while still adhering to their organization’s compliance requirements.
12:30 - 13:15. California East
By Rahul Gade & Keshav Tyagi
Track: Use cases
LinkedIn Continuous Deployment (LCD) started with the goal of improving the deployment experience and expanding its outreach to all LinkedIn systems. LCD delivers a modern deployment UX and easy-to-customize pipelines, enabling all LinkedIn applications to declare their deployment pipelines. LCD’s vision is to automate cluster provisioning and deployments and enable touchless (continuous) deployments while reducing the manual toil involved. LCD is powered by Airflow to orchestrate its deployment pipelines and automate the validation steps.
12:30 - 13:15. California West
By Amit Chauhan
Track: Airflow & ...
Cloud availability zones and regions are not immune to outages. These zones regularly go down, and regions become unavailable due to natural disasters or human-caused incidents. Thus, if an availability zone or region goes down, so do your Airflow workflows and applications… unless your Airflow workflows function across multiple geographic locations. This hands-on session introduces you to the design patterns of multi-region Airflow workflows in the cloud, which can tolerate zone and region-level incidents.
12:30 - 12:55. Elizabethan A+B
By Nick Bilozerov
Track: Sponsored
At Stripe, compliance with regulations is of utmost importance, and ensuring the integrity of production data is crucial. To address this challenge, Stripe developed a powerful system called User Scope Mode (USM), which allows users to safely and efficiently test new or existing Airflow pipelines without the risk of corrupting production data. USM takes care of automatically overwriting the necessary configurations for Airflow pipelines, enabling users to test their production-ready pipelines locally with ease.
12:30 - 13:15. Georgian
By Julian LaNeve & David Xue
Track: Best practices
There are 3 certainties in life: death, taxes, and data pipelines failing. Pipelines may fail for a number of reasons: you may run out of memory, your credentials may expire, an upstream data source may not be reliable, etc. But there are patterns we can learn from! Join us as we walk through an analysis we’ve done on a massive dataset of Airflow failure logs. We’ll show how we used natural language processing and dimensionality reduction methods to explore the latent space of Airflow task failures in order to cluster, visualize, and understand failures.
13:00 - 13:25. Elizabethan A+B
Track: Sponsored
14:30 - 17:20. Elizabethan C, Elizabethan D, Borgia
Track: Workshops
We will be offering hands-on workshops so you can get practical experience with Airflow tools and managed offerings. Format & duration: workshops are instructor-led, 2-3 hours long, bring your own device. Only available for participants with a Conference + Workshop pass. Workshops have limited capacity. You can sign up in advance for 2 workshops (one per day) to get a confirmed spot. Workshops will also accept walk-ins (people who didn’t sign up in advance), but walk-in spots are limited and not guaranteed.
14:30 - 14:55. California East
By Ipsa Trivedi & Subramanian Vellaiyan
Track: Use cases
This use case shows how we deal with data of different varieties from different sources. Each source sends data in different layouts, timings, structures, location patterns, and sizes. The goal is to process the files within SLA and send them out. This is a complex multi-step processing pipeline that involves multiple Spark jobs, API-based integrations with microservices, resolving unique IDs, deduplication, and filtering. Note that this is an event-driven system, but not a streaming data system.
14:30 - 14:55. California West
By Roberto Santamaria & Xiaodong Deng
Track: Airflow & ...
In this talk, we will explore how adding custom dependency checks into Airflow’s scheduling system can elevate Airflow’s performance. We will specifically discuss how we added general upstream events dependency checking as well as how to make Airflow aware of used/available compute resources so that the system can better decide when and where to run a given task on Kubernetes infrastructure. We’ll cover why the existing dependency checking in Airflow is not sufficient in our use case, and why adding custom code to Airflow is needed.
14:30 - 14:55. Elizabethan A+B
By Dennis Ferruzzi & Syed Hussain
Track: New features
Since version 2.7 and the advent of AIP-51, Airflow has started to fully support the creation of custom executors. Before we dive into the components of an executor and how they work, we will briefly discuss the Executor Decoupling initiative which allowed this new feature. Once we understand the parts required, we will explore the process of crafting our own executors, using real-world examples, and demonstrations of executors developed within the Amazon Provider Package as a guide.
15:05 - 15:30. California East
By Mike Hirsch & Sophie Keith
Track: Use cases
When developing Machine Learning (ML) models, the biggest challenges are often infrastructural. How do we deploy our model and expose an inference API? How can we retrain? Can we continuously evaluate performance and monitor model drift? In this talk, we will present how we are tackling these problems at the Philadelphia Phillies by developing a suite of tools that enable our software engineering and analytics teams to train, test, evaluate, and deploy ML models - that can be entirely orchestrated in Airflow.
15:05 - 15:30. California West
By Philippe Gagnon
Track: Airflow & ...
The scheduler is unarguably the most important component of an Airflow cluster. It is also the most complex and misunderstood by practitioners and administrators alike. In this talk, we will follow the path that a task instance takes to progress from creation to execution, and discuss the various configuration settings allowing users to tune the scheduler and executor to suit their workload patterns. Finally, we will dive deep into critical sections of the Airflow codebase and explore opportunities for optimization.
15:05 - 15:30. Elizabethan A+B
By Niko Oliveira
Track: New features
Executors are a core concept in Apache Airflow and an essential piece of the execution of DAGs. They continue to see investment and innovation, including a new feature launching this year: Hybrid Execution. This talk will give a brief overview of executors, how they work, and what they are responsible for, followed by a description of Hybrid Executors (AIP-61), a new feature allowing multiple executors to be used natively and seamlessly side by side within a single Airflow environment.
15:40 - 16:05. California East
By Nathaniel Rose
Track: Use cases
This talk explores the implementation of Apache Airflow for blockchain ETL orchestration and indexing, and the adoption of GitOps at Circle. It will cover CI/CD tips, architectural choices for managing blockchain data at scale, engineering practices to enable data scientists, and some learnings from production.
15:40 - 16:05. California West
By Fritz Davenport
Track: Airflow & ...
Introducing a process and framework to convert legacy scheduler workloads such as Control-M to Airflow using automated transpilation techniques. We will discuss the process and demonstrate a Python-based transpiler to automatically migrate legacy scheduler workflows with a standard set of patterns to Airflow DAGs. This framework is easily extended via configurable rulesets to encompass other schedulers such as Automic, Autosys, Oozie, and others.
15:40 - 16:05. Elizabethan A+B
By Wei Lee
Track: New features
Airflow executes all tasks on the workers, including deferrable operators that must run on the workers before deferring to the triggerer. However, running some tasks directly from the triggerer can be beneficial in certain situations. This presentation will explain how deferrable operators function and examine ways to modify the Airflow implementation to enable tasks to run directly from the triggerer.
16:30 - 16:55. California East
By Elona Zharri, Nikhil Nandoskar & Prince Bose
Track: Use cases
Ford Motor Company is undergoing a significant transformation, embracing AI and Machine Learning to power its smart mobility strategy, enhance customer experiences, and drive innovation in the automotive industry. Mach1ML, Ford’s multi-million dollar ML platform, plays a crucial role in this journey by empowering data scientists and engineers to efficiently build, deploy, and manage ML models at scale. This presentation will delve into how Mach1ML leverages Apache Airflow as its orchestration layer to tackle the challenges of complex ML workflows that include disparate systems, manual processes, security concerns, and deployment complexities.
16:30 - 16:55. California West
By Cedrik Neumann
Track: Airflow & ...
Profiling Airflow tasks can be difficult, especially in remote environments. In this talk I will demonstrate how we can leverage Airflow’s plugin mechanism to selectively run Airflow tasks within the context of a profiler and, with the help of operator links and custom views, make the results available to the user. The content of this talk can provide inspiration on how Airflow may in the future allow the gathering of custom task metrics and make those metrics easily accessible.
16:30 - 16:55. Elizabethan A+B
By Maciej Obuchowski
Track: New features
“More data lineage” has been the second most popular feature request in the 2023 Airflow Survey. However, despite the integration of OpenLineage in Airflow 2.7 through AIP-53, the most popular Operator in Airflow - PythonOperator - isn’t covered by lineage support. With the addition of the TaskFlow API, Airflow Datasets, Airflow ObjectStore, and many other small changes, writing DAGs without dedicated operators is easier than ever. That is why lineage collection in Airflow is moving beyond specific Operators to cover Hooks and Object Storage.
17:05 - 17:30. California East
By Jens Scheffler
Track: Use cases
As we deployed Airflow in our enterprise, connected to various event sources to implement our data-driven pipelines, we were faced with event storms a couple of times. Because such event storms often happened unplanned and with increasing load waves, we tuned the setup iteratively; at times we were in panic mode and had to add quick workarounds. Starting from a peak of 1,000 triggers in an hour, we were happy that workload simply queued.
17:05 - 17:30. California West
By Eloi Codina Torras
Track: Airflow & ...
Meteosim provides environmental services, mainly based on weather and air quality intelligence, and helps customers make operational and tactical decisions and understand their companies’ environmental impact. We introduced Airflow a couple of years ago to replace a huge Crontab file and we currently have around 7000 DAG Runs per day. In this presentation we will introduce the hardest challenge we had to overcome: adapting Airflow to run on multiple Slurm-managed HPC clusters by using deferrable operators.
17:05 - 17:30. Elizabethan A+B
By Vincent Beck
Track: New features
Before Airflow 2.9, user management was part of core Airflow, so modifying or customizing it to fit user needs was not easy. Authentication and authorization managers (auth managers) are a new concept in Airflow 2.9, introduced as extensible user management (AIP-56) to give Airflow users a flexible way to integrate with an organization’s identity services. Organizations want a single place to manage permissions, and FAB (Flask App Builder) made that difficult to achieve.
17:40 - 18:05. California East
By Michael Juster
Track: Use cases
Balyasny Asset Management (BAM) is a diversified global investment firm founded in 2001 with over $20 billion in assets under management. We have more than 100 teams who run a variety of workloads that benefit from Orchestration and parallelization. Platform Engineers working for companies with K8s ecosystems can use their Kubernetes knowledge and leverage their platform to run Airflow and troubleshoot problems successfully. BAM’s Kubernetes Platform provides production-ready Airflow environments that automatically get Logging, Metrics, Alerting, Scalability, Storage from a range of File Systems, Authentication, Dashboards, Secrets Management, and specialized compute including GPU, CPU Optimized, Memory Optimized and even Windows.
17:40 - 18:05. California West
By Jack Cusick
Track: Airflow & ...
Airflow is not just purpose-built for data applications. It is a job scheduler on steroids. This is exactly what a cloud platform team needs: a configurable and scalable automation tool that can handle thousands of administrative tasks. Come learn how one enterprise platform team used Airflow to support cloud infrastructure at unprecedented scale.
17:40 - 18:05. Elizabethan A+B
By Ankit Chaurasia
Track: New features
Are you looking to harness the full potential of data-driven pipelines with Apache Airflow? This session will dive into the newly introduced conditional expressions for advanced dataset scheduling in Airflow - a feature highly requested by the Airflow community. Attendees will learn how to effectively use logical operators to create complex dependencies that trigger DAGs based on the dataset updates in real-world scenarios. We’ll also explore the innovative DatasetOrTimeSchedule, which combines time-based and dataset-triggered scheduling for unparalleled flexibility.

Thursday, September 12, 2024

09:00
09:30
10:10
Morning break
10:30
11:00
11:30
12:00
12:30
13:15
Lunch
14:00
14:35
15:10
15:45
16:20
17:00
Wrap up
09:00 - 09:25.
By Vikram Koka
Track: Keynote
09/12/2024 9:00 AM 09/12/2024 9:25 AM America/Los_Angeles AS24: The road ahead: What’s coming in Airflow 3 and beyond?

Apache Airflow has emerged as the de facto standard for data orchestration. Over the last couple of years, Airflow has also seen increasing adoption for ML and AI use cases. It has been almost four years since the release of Airflow 2, and as a community we have agreed that it’s time for a major foundational release in the form of Airflow 3.

This talk will introduce the vision behind Airflow 3, including the emerging technology trends in the industry and how Airflow will evolve in response. Specifically, this will include an overview of the architectural changes in Airflow to support emerging use cases and distributed data infrastructure models. The talk will also introduce the major features and desired outcomes of the release. Airflow 3 will be a foundational release, so it will likewise introduce new concepts that may only be fully realized in follow-on 3.x releases.

The goal of this talk is to raise awareness about Airflow 3 and to get feedback from the Airflow community while the release is still in the development phase.

Grand Ballroom
09:25 - 10:10.
By Madison Swain-Bowden, Kaxil Naik, Michał Modras, Constance Martineau & Shubham Mehta
Track: Keynote
09/12/2024 9:25 AM 09/12/2024 10:10 AM America/Los_Angeles AS24: Airflow 3 - Roadmap Discussion

This session presents the tentative scope for the next generation of Airflow, i.e., Airflow 3.

Grand Ballroom
10:30 - 11:15.
By Briana Okyere, Amogh Desai, Ryan Hatter & Srabasti Banerjee
Track: Community
09/12/2024 10:30 AM 09/12/2024 11:15 AM America/Los_Angeles AS24: Connecting the Dots in Airflow: From User to Contributor

“Connecting the Dots in Airflow: From User to Contributor” explores the journey of transitioning from an Airflow user to an active project contributor. This talk will cover essential steps, resources, and best practices to effectively engage with the Airflow community and make meaningful contributions. Attendees will gain insights into the collaborative nature of open-source projects and how their involvement can drive both personal growth and project innovation.

California West
10:30 - 10:55.
By Adam Bayrami
Track: Use cases
09/12/2024 10:30 AM 09/12/2024 10:55 AM America/Los_Angeles AS24: How NerdWallet Halved Snowflake Costs with Airflow 2 Upgrade

NerdWallet’s multi-tenant Airflow setup faced challenges such as slow DAG processing, which resulted in underutilized Snowflake warehouses and elevated costs. Although our transition to Airflow 2 may have been delayed, we recognize that many other teams are also working to get buy-in and resources for their migration efforts. We hope that our story of how unlocking the scheduling capabilities of Airflow 2 helped us cut our Snowflake spend in half can get you the buy-in you need.

The session will cover our comprehensive strategies, the technical challenges we overcame, and how Airflow 2’s features have enabled more efficient operations and cost savings at scale.

California East
10:30 - 10:55.
By Bhavesh Jaisinghani
Track: Use cases
09/12/2024 10:30 AM 09/12/2024 10:55 AM America/Los_Angeles AS24: Scale and Security: How Autodesk Securely Develops and Tests PII Pipelines with Airflow

In today’s data-driven era, ensuring data reliability and enhancing our testing and development capabilities are paramount. Local unit testing has its merits but falls short when dealing with the volume of big data. One major challenge is running Spark jobs pre-deployment to ensure they produce expected results and handle production-level data volumes.

In this talk, we will discuss how Autodesk leveraged Astronomer to improve pipeline development. We’ll explore how it addresses challenges with sensitive and large data sets that cannot be transferred to local machines or non-production environments. Additionally, we’ll cover how this approach supports over 10 engineers working simultaneously on different feature branches within the same repo.

We will highlight the benefits, such as conflict-free development and testing, and eliminating concerns about data corruption when running DAGs on production Airflow servers.

Join me to discover how solutions like Astronomer empower developers to work with increased efficiency and reliability. This talk is perfect for those interested in big data, cloud solutions, and innovative development practices.

Elizabethan A+B
10:30 - 10:55.
By Rafal Biegacz
Track: Airflow & ...
09/12/2024 10:30 AM 09/12/2024 10:55 AM America/Los_Angeles AS24: Session presented by Google Cloud

TBD

Georgian
11:00 - 11:25.
By Udit Saxena
Track: Use cases
09/12/2024 11:00 AM 09/12/2024 11:25 AM America/Los_Angeles AS24: Airflow, Spark, and LLMs: Turbocharging MLOps at ASAPP

This talk will explore ASAPP’s use of Apache Airflow to streamline and optimize our machine learning operations (MLOps). Key highlights include:

  • Integrating with our custom Spark solution for achieving speedup, efficiency, and cost gains for generative AI transcription, summarization and intent categorization pipelines

  • Different design patterns for integrating with efficient LLM servers - like TGI, vLLM, or TensorRT - for summarization pipelines with/without Spark.

  • An overview of batched LLM inference using Airflow as opposed to real time inference outside of it

  • [Tentative] Possible extension of this scaffolding to Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) for fine-tuning LLMs, using Airflow as the orchestrator.

Additionally, the talk will cover ASAPP’s MLOps journey with Airflow over the past few years, including an overview of our cloud infrastructure, various data backends, and sources.

The primary focus will be on the machine learning workflows at ASAPP, rather than the data workflows, providing a detailed look at how Airflow enhances our MLOps processes.

Georgian
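As one plausible shape for the batched (rather than real-time) inference described above, here is a sketch using dynamic task mapping; summarize_batch and the batching scheme are hypothetical stand-ins, not ASAPP’s API:

    import pendulum
    from airflow.decorators import dag, task

    @dag(schedule="@daily", start_date=pendulum.datetime(2024, 9, 1), catchup=False)
    def batch_summarization():
        @task
        def list_batches() -> list[list[str]]:
            # Hypothetical: fetch yesterday's transcripts, split into batches.
            transcripts = [f"transcript-{i}" for i in range(100)]
            return [transcripts[i:i + 10] for i in range(0, len(transcripts), 10)]

        @task
        def summarize_batch(batch: list[str]) -> list[str]:
            # Hypothetical call to an LLM server (e.g., TGI or vLLM).
            return [f"summary of {t}" for t in batch]

        # One mapped task instance per batch.
        summarize_batch.expand(batch=list_batches())

    batch_summarization()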
11:00 - 11:25.
By Bartosz Jankiewicz
Track: New features
09/12/2024 11:00 AM 09/12/2024 11:25 AM America/Los_Angeles AS24: Empowering Airflow Users: A Framework for Performance Testing and Transparent Resource Optimization

Apache Airflow is the backbone of countless data pipelines, but optimizing performance and resource utilization can be a challenge. This talk introduces a novel performance testing framework designed to measure, monitor, and improve the efficiency of Airflow deployments. I’ll delve into the framework’s modular architecture, showcasing how it can be tailored to various Airflow setups (Docker, Kubernetes, cloud providers). By measuring key metrics across schedulers, workers, triggers, and databases, this framework provides actionable insights to identify bottlenecks and compare performance across different versions or configurations.

Attendees will learn:

  • The motivation behind developing a standardized performance testing approach.

  • Key design considerations and challenges in measuring performance across diverse Airflow environments.

  • How to leverage the framework to construct test suites for different use cases (e.g., version comparison).

  • Practical tips for interpreting performance test results and making informed decisions about resource allocation.

  • How this framework contributes to greater transparency in Airflow release notes, empowering users with performance data.

Elizabethan A+B
11:00 - 11:25.
By Jet Mariscal
Track: Use cases
09/12/2024 11:00 AM 09/12/2024 11:25 AM America/Los_Angeles AS24: Unlocking the Power of Airflow Beyond Data Engineering at Cloudflare

While Airflow is widely known for orchestrating and managing workflows, particularly in the context of data engineering, data science, ML (Machine Learning), and ETL (Extract, Transform, Load) processes, its flexibility and extensibility make it a highly versatile tool suitable for a variety of use cases beyond these domains. In fact, Cloudflare has publicly shared in the past an example of how Airflow was leveraged to build a system that automates datacenter expansions.

In this talk, I will share a few more of our use cases beyond traditional data engineering, demonstrating Airflow’s sophisticated capabilities for orchestrating a wide variety of complex workflows, and discussing how Airflow played a crucial role in building some of the highly successful autonomous systems at Cloudflare, from handling automated bare metal server diagnostics and recovery at scale, to Zero Touch Provisioning that is helping us accelerate the roll out of inference-optimized GPUs in 150+ cities in multiple countries globally.

California East
11:30 - 12:15.
By Konrad Schieban & Tim Hiatt
Track: Community
09/12/2024 11:30 AM 09/12/2024 12:15 PM America/Los_Angeles AS24: DAGify - Enterprise Scheduler Migration Accelerator for Airflow

DAGify is a highly extensible, template-driven enterprise scheduler migration accelerator that helps organizations speed up their migration to Apache Airflow. While DAGify does not claim to migrate 100% of existing scheduler functionality, it aims to heavily reduce the manual effort it takes for developers to convert their enterprise scheduler formats into Python-native Airflow DAGs.

DAGify is an open-source tool under the Apache 2.0 license, available on GitHub (https://github.com/GoogleCloudPlatform/dagify).

In this session we will introduce DAGify and its use cases, and demo its functionality by converting Control-M XML files to Airflow DAGs.

Additionally, we will highlight DAGify’s “no-code” extensibility by creating custom conversion templates that map Control-M functionality to Airflow operators.

California West
11:30 - 12:15.
By Kaxil Naik & Ash Berlin-Taylor
Track: New features
09/12/2024 11:30 AM 09/12/2024 12:15 PM America/Los_Angeles AS24: Gen AI using Airflow 3: A vision for Airflow RAGs

Gen AI has taken the computing world by storm. As Enterprises and Startups have started to experiment with LLM applications, it has become clear that providing the right context to these LLM applications is critical.

This process, known as retrieval-augmented generation (RAG), relies on adding custom data to the large language model so that the efficacy of the response can be improved. Processing custom data and integrating with enterprise applications is a strength of Apache Airflow.

This talk goes into detail about a vision to enhance Apache Airflow to more intuitively support RAG, with additional capabilities and patterns. Specifically, these include the following:

  • Support for unstructured data sources such as text, but also extending to image, audio, video, and custom sensor data

  • LLM model invocation, including both external model services through APIs and local models using container invocation

  • Automatic index refreshing, with a focus on unstructured data lifecycle management to avoid cumbersome and expensive index creation on vector databases

  • Templates for hallucination reduction via testing and scoping strategies

Elizabethan A+B
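To ground the list above, here is a minimal sketch of the RAG-style index-refresh pipeline as a TaskFlow DAG; the extract source, chunking scheme, and embedding/upsert step are hypothetical stand-ins for your document source, splitter, embedding model, and vector database:

    import pendulum
    from airflow.decorators import dag, task

    @dag(schedule="@daily", start_date=pendulum.datetime(2024, 9, 1), catchup=False)
    def rag_index_refresh():
        @task
        def extract() -> list[str]:
            # Hypothetical: pull documents added or changed since the last run.
            return ["doc-1 text", "doc-2 text"]

        @task
        def chunk(docs: list[str]) -> list[str]:
            # Naive fixed-size chunking; real pipelines use smarter splitters.
            return [d[i:i + 512] for d in docs for i in range(0, len(d), 512)]

        @task
        def embed_and_upsert(chunks: list[str]) -> int:
            # Hypothetical: call an embedding model, then upsert the vectors
            # to refresh the index incrementally instead of rebuilding it.
            vectors = [[0.0, 0.0, 0.0] for _ in chunks]  # placeholder
            return len(vectors)

        embed_and_upsert(chunk(extract()))

    rag_index_refresh()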
11:30 - 11:55.
By Madhav Khakhar & Alexander Shmidt
Track: Use cases
09/12/2024 11:30 AM 09/12/2024 11:55 AM America/Los_Angeles AS24: How we use Airflow at Booking to orchestrate Big Data workflows

The talk will cover how we use Airflow at the heart of our Workflow Management Platform (WFM) at Booking.com, enabling our internal users to orchestrate big data workflows on the Booking Data Exchange (BDX).

High level overview of the talk:

  • Adapting open source Airflow helm chart to spin up Airflow installation in Booking Kubernetes Service (BKS)

  • Coming up with a workflow definition format (YAML)

  • Conversion of workflow.yaml to workflow.py DAGs

  • Usage of Deferrable operators to provide standard step templates to users

  • Workspaces (collections of workflows), used to ensure role-based access to DAG permissions for users

  • Using Okta for authentication

  • Alerting, monitoring, logging

  • Plans to shift to Astronomer

California East
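As a sketch of the YAML-to-DAG conversion idea outlined in the list above, here is a minimal factory; the schema and field names are illustrative, not Booking.com’s actual format:

    import pendulum
    import yaml
    from airflow.models.dag import DAG
    from airflow.operators.bash import BashOperator

    WORKFLOW_YAML = """
    name: daily_exchange_load
    schedule: "0 4 * * *"
    steps:
      - name: extract
        command: echo extract
      - name: load
        command: echo load
        after: [extract]
    """

    spec = yaml.safe_load(WORKFLOW_YAML)
    with DAG(dag_id=spec["name"], schedule=spec["schedule"],
             start_date=pendulum.datetime(2024, 1, 1), catchup=False) as dag:
        # One task per step; "after" entries become upstream dependencies.
        tasks = {s["name"]: BashOperator(task_id=s["name"], bash_command=s["command"])
                 for s in spec["steps"]}
        for s in spec["steps"]:
            for parent in s.get("after", []):
                tasks[parent] >> tasks[s["name"]]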
11:30 - 12:15.
Track: Sponsored
09/12/2024 11:30 AM 09/12/2024 12:15 PM America/Los_Angeles AS24: To be confirmed

Georgian
12:00 - 12:25.
By Jianlong Zhong
Track: Use cases
09/12/2024 12:00 PM 09/12/2024 12:25 PM America/Los_Angeles AS24: Airflow Unleashed: Making Hundreds of Deployments A Day at Coinbase

At Coinbase, Airflow is the backbone of ELT, supported by a vibrant community of over 500 developers. This vast engagement results in a continuous stream of enhancements, with hundreds of commits tested and released daily. However, this scale of development presents its own set of challenges, especially in deployment velocity.

Traditional deployment methodologies proved inadequate, significantly impeding the productivity of our developers. Recognizing the critical need for a solution that matches our pace of innovation, we developed AirAgent: a bespoke, fully autonomous deployer designed specifically for Airflow. Capable of deploying updates hundreds of times a day on both staging and production environments, AirAgent has transformed our development lifecycle, enabling immediate iteration and drastically improving developer velocity.

This talk aims to unveil the inner workings of AirAgent, highlighting its design principles, deployment strategies, and the challenges we overcame in its implementation. By sharing our journey, we hope to offer insights and strategies that can benefit others in the Airflow community, encouraging a shift towards a high-frequency deployment workflow.

California East
12:30 - 13:15.
By Cyrus Dukart, David Sacerdote & Jason Bridgemohansingh
Track: Use cases
09/12/2024 12:30 PM 09/12/2024 1:15 PM America/Los_Angeles AS24: Adaptive Memory Scaling for Robust Airflow Pipelines

At Vibrant Planet, we’re on a mission to make the world’s communities and ecosystems more resilient in the face of climate change. Our cloud-based platform is designed for collaborative scenario planning to tackle wildfires, climate threats, and ecosystem restoration on a massive scale.

In this talk we will dive into how we are using Airflow. Particularly we will focus on how we’re making Airflow pipelines smarter and more resilient, especially when dealing with the task of processing large satellite imagery and other geospatial data.

  1. Self-Healing Pipelines:

Discuss our self-healing pipelines, which identify likely out-of-memory events and incrementally allocate more memory for task instance retries, ensuring robust and uninterrupted workflow execution (a sketch of the idea follows this list).

  2. Initial Memory Recommendations:

We’ll discuss how we set intelligent initial memory allocations for each task instance, enhancing resource efficiency from the outset.
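As a rough illustration of the retry-escalation idea (not Vibrant Planet’s actual implementation), here is a minimal sketch assuming the KubernetesExecutor, which labels each worker pod with the task’s try_number; the base size, growth factor, and cap are illustrative.

```python
# airflow_local_settings.py -- Airflow invokes pod_mutation_hook on every
# worker pod before launch when running the KubernetesExecutor.
from kubernetes.client import models as k8s

BASE_MEMORY_GI = 4   # illustrative starting request
MAX_MEMORY_GI = 64   # illustrative ceiling


def pod_mutation_hook(pod: k8s.V1Pod) -> None:
    labels = pod.metadata.labels or {}
    attempt = int(labels.get("try_number", "1"))
    # Double the memory request on each retry (capped), so a task that was
    # OOM-killed on attempt N gets more headroom on attempt N + 1.
    memory = f"{min(BASE_MEMORY_GI * 2 ** (attempt - 1), MAX_MEMORY_GI)}Gi"
    pod.spec.containers[0].resources = k8s.V1ResourceRequirements(
        requests={"memory": memory},
        limits={"memory": memory},
    )
```

In a real deployment you would likely scope this to the DAGs that need it, for example by checking the pod’s dag_id label, rather than mutating every pod.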

California East
12:30 - 13:15.
By Shobhit Shah & Sumit Maheshwari
Track: Use cases
AS24: Evolution of Airflow at Uber

Up until a few years ago, teams at Uber used multiple data workflow systems, with some based on open source projects such as Apache Oozie, Apache Airflow, and Jenkins while others were custom built solutions written in Python and Clojure.

Every user who needed to move data around had to learn about and choose from these systems, depending on the specific task they needed to accomplish. Each system required additional maintenance and operational burdens to keep it running, troubleshoot issues, fix bugs, and educate users.

After evaluating these systems, and with the goal in mind of converging on a single workflow system capable of supporting Uber’s scale, we settled on an Airflow-based system. The Airflow-based DSL provided the best trade-off of flexibility, expressiveness, and ease of use while remaining accessible to our broad range of users, which includes data scientists, developers, machine learning experts, and operations employees.

This talk will focus on scaling Airflow to Uber’s needs and on providing a seamless, no-code user experience.

Georgian
12:30 - 13:15.
By Constance Martineau & Tzu-ping Chung
Track: New features
AS24: Seeing Clearly with Airflow: The Shift to Data-Aware Orchestration

Join me at this year’s Airflow Summit as we delve into a pivotal evolution for Apache Airflow: The integration of data awareness.

Airflow has long excelled as a workflow orchestration tool, managing complex workflows with ease and efficiency. However, it has operated with limited insight into the data it manipulates or the assets it produces. This talk will explore the implications and benefits of embedding deeper insights about these outputs directly into Airflow.

We’ll start with a retrospective on Airflow’s origins and its task-centric approach, discussing why Airflow has thrived even without a focus on data awareness. We’ll then examine how enhancing the connection between tasks and the assets they produce can significantly boost Airflow’s utility and value for its users. Finally, we’ll consider new features that can be developed with this enhanced level of understanding, empowering data engineers with tools for more efficient, reliable and insightful operations.

Elizabethan A+B
12:30 - 13:15.
By Jarek Potiuk
Track: Community
AS24: The Silent Symphony: Keeping Airflow's CI/CD and Dev Tools in Tune

Apache Airflow relies on a silent symphony behind the scenes: its CI/CD (Continuous Integration/Continuous Delivery) and development tooling. This presentation explores the critical role these tools play in keeping Airflow efficient and innovative. We’ll delve into how robust CI/CD ensures bug fixes and improvements are seamlessly integrated, while well-maintained development tools empower developers to contribute effectively.

Airflow’s power comes from a well-oiled machine – its CI/CD and development tools. This presentation dives into the world of these often-overlooked heroes. We’ll explore how seamless CI/CD pipelines catch and fix issues early, while robust development tools empower efficient coding and collaboration. Discover how you can use and contribute to a thriving Airflow ecosystem by ensuring these crucial tools stay in top shape.

California West
14:00 - 14:55.
By Jed Cunningham
Track: New features
AS24: AIP-63: DAG Versioning - Where are we?

Join us as we check in on the current status of AIP-63: DAG Versioning. This session will explore the motivations behind AIP-63, the challenges faced by Airflow users in understanding and managing DAG history, and how it aims to address them. From tracking TaskInstance history to improving DAG representation in the UI, we’ll examine what we’ve already done and what’s next. We’ll also touch upon the potential future steps outlined in AIP-66 regarding the execution of specific DAG versions.

Elizabethan A+B
14:00 - 14:25.
By Daniil Dubin
Track: Use cases
AS24: Empowering business analysts with DAG authoring IDE running 8000 workflows

At Wix, more often than not, business analysts build workflows themselves so that data engineers don’t become a bottleneck. But how do you enable them to create SQL ETLs that start when dependencies are ready and send emails or refresh Tableau reports when the work is done? One simple answer is to use Airflow. The problem is that business analysts can’t be expected to know Python and Git well enough to create thousands of DAGs with ease.

To bridge this gap we have built a web-based IDE, called Quix, that allows simple notebook-like development of Trino SQL workflows and converts them to Airflow DAGs when a user hits the “schedule” button.

During the talk we will go through the problems of building a reliable and extensible DAG-generating tool, why we preferred Airflow over Apache Oozie, and the tricks (sharding, HA mode, etc.) that allow Airflow to run 8,000 active DAGs on a single cluster in Kubernetes.
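The general notebook-to-DAG mechanism can be sketched with Airflow’s ordinary dynamic DAG generation. This is not Quix’s actual code; the NOTEBOOKS spec and the trino_default connection id are assumptions for illustration.

```python
import pendulum
from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

# Hypothetical stand-in for Quix's stored notebook definitions: an ordered
# list of Trino SQL cells per scheduled notebook.
NOTEBOOKS = {
    "daily_sessions": [
        "CREATE TABLE IF NOT EXISTS reports.daily_sessions AS SELECT 1 AS placeholder",
        "INSERT INTO reports.daily_sessions SELECT 1",
    ],
}

for name, cells in NOTEBOOKS.items():
    with DAG(
        dag_id=f"quix_{name}",
        schedule="@daily",
        start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
        catchup=False,
    ) as dag:
        previous = None
        for index, sql in enumerate(cells):
            step = SQLExecuteQueryOperator(
                task_id=f"cell_{index}",
                conn_id="trino_default",  # assumed Trino connection id
                sql=sql,
            )
            if previous is not None:
                previous >> step  # run cells in notebook order
            previous = step
    # Expose each generated DAG at module level so the DAG processor finds it.
    globals()[dag.dag_id] = dag
```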

California East
14:00 - 16:40.
Track: Workshops
AS24: Featured Workshops

We will be offering hands-on workshops so you can get practical experience with Airflow tools and managed offerings.

  • Format & duration: workshops are instructor-led, 2-3 hours long; bring your own device.
  • Only available to participants with a Conference + Workshop pass.
  • Workshops have limited capacity. You can sign up in advance for two workshops (one per day) to get a confirmed spot.
  • Workshops will accept walk-ins (people who didn’t sign up in advance), but spots are limited and not all walk-ins are guaranteed entry.
Elizabethan C, Elizabethan D, Borgia
14:00 - 14:25.
By Pete Dejoy
Track: Community
AS24: How the Airflow Community Productionizes Generative AI

Every data team out there is being asked by their business stakeholders about Generative AI. Taking LLM-centric workloads to production is not a trivial task. At the foundational level, there is a set of challenges around data delivery, data quality, and data ingestion that mirror traditional data engineering problems. Once you’re past those, there’s a set of challenges related to the underlying use case you’re trying to solve. Thankfully, because of how Airflow was already being used at these companies for data engineering and MLOps use cases, it has become the de facto orchestration layer behind many GenAI use cases for startups and Fortune 500s.

This talk will be a tour of various methods, best practices, and considerations used in the Airflow community when taking GenAI use cases to production. We’ll focus on four primary use cases (RAG, fine-tuning, resource management, and batch inference) and walk through patterns different members of the community have used to productionize this new, exciting technology.

California West
14:35 - 15:00.
By Ian Moritz
Track: Airflow & ...
AS24: Airflow-as-an-Engine: Lessons from Open-Source Applications Built On Top of Airflow

Airflow is often used for running data pipelines, which themselves connect with other services through the provider system. However, it is also increasingly used as an engine under-the-hood for other projects building on top of the DAG primitive. For example, Cosmos is a framework for automatically transforming dbt DAGs into Airflow DAGs, so that users can supplement the developer experience of dbt with the power of Airflow.

This session dives into how a select group of these frameworks (Cosmos, Meltano, Chronon) use Airflow as an engine for orchestrating complex workflows their systems depend on. In particular, we will discuss ways that we’ve increased Airflow performance to meet application-specific demands (high-task-count Cosmos DAGs, streaming jobs in Chronon), new Airflow features that will evolve how these frameworks use Airflow under the hood (DAG versioning, dataset integrations), and paths we see these projects taking over the next few years as Airflow grows. Airflow is not just a DAG platform, it’s an application platform!

California West
14:35 - 15:00.
By Olivier Daneau
Track: Best practices
AS24: Using Airflow operational data to optimize Cloud services

Cost management is a continuous challenge for our data teams at Astronomer. Understanding the expenses associated with running our workflows is not always straightforward, and identifying which process ran a query causing unexpected usage on a given day can be time-consuming.

In this talk, we will showcase an Airflow Plugin and specific DAGs developed and used internally at Astronomer to track and optimize the costs of running DAGs. Our internal tool monitors Snowflake query costs, provides insights, and sends alerts for abnormal usage. With it, Astronomer identified and refactored its most costly DAGs, resulting in an almost 25% reduction in Snowflake spending.

We will demonstrate how to track Snowflake-related DAG costs and discuss how the tool can be adapted to any database that supports query tagging, such as BigQuery and Oracle.

This talk will cover the implementation details and show how Airflow users can effectively adopt this tool to monitor and manage their DAG costs.
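As a hint of how this kind of attribution can work, here is a sketch of the general query-tagging technique (not Astronomer’s plugin itself), assuming the Snowflake provider’s support for session parameters; the tag format and connection id are illustrative.

```python
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator


def tagged_snowflake_task(dag_id: str, task_id: str, sql: str) -> SQLExecuteQueryOperator:
    """Run SQL on Snowflake with a QUERY_TAG naming the DAG and task, so
    spend can later be grouped by tag in ACCOUNT_USAGE views."""
    return SQLExecuteQueryOperator(
        task_id=task_id,
        conn_id="snowflake_default",  # assumed connection id
        sql=sql,
        hook_params={
            # The hook forwards session parameters to the Snowflake session.
            "session_parameters": {"QUERY_TAG": f"airflow/{dag_id}/{task_id}"},
        },
    )


# A monitoring DAG can then attribute usage per task, e.g.:
#   SELECT query_tag, SUM(total_elapsed_time)
#   FROM snowflake.account_usage.query_history
#   GROUP BY query_tag;
```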

California East
15:10 - 15:35.
By Vincent La, Jim Howard & Moulay Zaidane Al Bahi Draidia
Track: Use cases
AS24: Customizing LLMs: Leveraging Technology to tailor GenAI using Airflow

Laurel provides an AI-driven timekeeping solution tailored for accounting and legal firms, automating timesheet creation by capturing digital work activities. This session highlights two notable AI projects:

  1. UTBMS Code Prediction: Leveraging small language models, this system builds new embeddings to predict work codes for legal bills with high accuracy. More details are available in our case study: https://www.laurel.ai/resources-post/enhancing-legal-and-accounting-workflows-with-ai-insights-into-work-code-prediction.

  2. Bill Creation and Narrative Generation: Utilizing Retrieval-Augmented Generation (RAG), this approach transforms users’ digital activities into fully billable entries.

Additionally, we will discuss how we use Airflow for model management in these AI projects (a sketch follows the list):

  • Daily Model Retraining: We retrain our models daily

  • Model (Re)deployment: Our Airflow DAG evaluates model performance and redeploys the model if improvements are detected

  • Cost Management: To avoid the high costs associated with querying large language models frequently, our DAG utilizes RAG to efficiently summarize daily activities into a billable timesheet at day’s end.
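A minimal sketch of that retrain-evaluate-redeploy loop, using a short-circuit task as the deployment gate; the training, scoring, and deployment helpers are hypothetical stand-ins, not Laurel’s code.

```python
import pendulum
from airflow.decorators import dag, task


# Hypothetical stand-ins for the training, evaluation, and deployment
# internals -- only the orchestration below is meant literally.
def train_model() -> str:
    return "/models/candidate"


def score_on_holdout(model_path: str) -> float:
    return 0.0


def current_production_score() -> float:
    return 0.0


def promote_to_production(model_path: str) -> None:
    pass


@dag(
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
)
def daily_model_refresh():
    @task
    def retrain() -> str:
        return train_model()

    @task
    def evaluate(model_path: str) -> float:
        return score_on_holdout(model_path)

    @task.short_circuit
    def improved(new_score: float) -> bool:
        # Skip the deploy step unless the fresh model beats the live one.
        return new_score > current_production_score()

    @task
    def deploy(model_path: str) -> None:
        promote_to_production(model_path)

    model = retrain()
    gate = improved(evaluate(model))
    gate >> deploy(model)


daily_model_refresh()
```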

California East
15:10 - 15:35.
By Michael Robinson
Track: Community
AS24: Lessons from the Ecosystem: What can Airflow Learn from Other Open-source Communities?

The Apache Airflow community is so large and active that it’s tempting to take the view that “if it ain’t broke don’t fix it.”

In a community as in a codebase, however, improvement and attention are essential to sustaining growth. And bugs are just as inevitable in community management as they are in software development. If only the fixes were, too!

Airflow is large and growing because users love Airflow and our community. But what steps could be taken to enhance the typical user’s and developer’s experience of the community?

This talk will provide an overview of potential learnings for Airflow community management efforts, such as project governance and analytics, derived from the speaker’s experience managing the OpenLineage and Marquez open-source communities.

The talk will answer questions such as: What can we learn from other open-source communities when it comes to supporting users and developers and learning from them? For example, what options exist for getting historical data out of Slack despite the limitations of the free tier? What tools can be used to make adoption metrics more reliable? What are some effective supplements to asynchronous governance?

California West
15:10 - 15:35.
By Venkata Jagannath & Marwan Sarieddine
Track: New features
AS24: Using the power of Apache Airflow and Ray for Scalable AI deployments

Many organizations struggle to create a well-orchestrated AI infrastructure, using separate and disconnected platforms for data processing, model training, and inference, which slows down development and increases costs. There’s a clear need for a unified system that can handle all aspects of AI development and deployment, regardless of the size of data or models.

Join our breakout session to see how our comprehensive solution simplifies the development and deployment of large language models in production. Learn how to streamline your AI operations by implementing an end-to-end ML lifecycle on your custom data, including automated LLM fine-tuning, LLM evaluation, LLM serving, and LoRA deployments.

Elizabethan A+B
15:45 - 16:10.
By Sriram Vamsi Ilapakurthy
Track: Airflow intro talks
AS24: Exploring DAG Design Patterns in Apache Airflow

This talk delves into advanced Directed Acyclic Graph (DAG) design patterns that are pivotal for optimizing data pipeline management and boosting efficiency. We’ll cover dynamic DAG generation, which allows for flexible, scalable workflow creation based on real-time data and configurations. Learn about task grouping and SubDAGs to enhance readability and maintainability of complex workflows. We’ll also explore parameterized DAGs for injecting runtime parameters into tasks, enabling versatile and adaptable pipeline configurations. Additionally, the session will address branching and conditional execution to manage workflow paths dynamically based on data conditions or external triggers. Lastly, understand how to leverage parallelism and concurrency to maximize resource utilization and reduce execution times. This session is designed for intermediate to advanced users who are familiar with the basics of Airflow and looking to deepen their understanding of its more sophisticated capabilities.

The session focuses on practical, high-impact design patterns that can significantly improve the performance and scalability of Airflow deployments.
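As a taste of one of the patterns mentioned above, here is a compact, self-contained sketch of branching and conditional execution with @task.branch, plus a join whose trigger rule tolerates the skipped branch; the task ids and the trigger-time flag are illustrative.

```python
import pendulum
from airflow.decorators import dag, task
from airflow.operators.empty import EmptyOperator


@dag(
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
)
def branching_demo():
    @task.branch
    def choose_path(dag_run=None) -> str:
        # Pick a downstream task id from a runtime condition -- here, a
        # flag passed in the trigger's configuration.
        return "full_refresh" if dag_run.conf.get("full") else "incremental_load"

    full_refresh = EmptyOperator(task_id="full_refresh")
    incremental_load = EmptyOperator(task_id="incremental_load")
    # The join runs whichever branch was taken; the skipped branch doesn't block it.
    done = EmptyOperator(task_id="done", trigger_rule="none_failed_min_one_success")

    choose_path() >> [full_refresh, incremental_load] >> done


branching_demo()
```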

Elizabethan A+B
15:45 - 16:45.
By Freddy Demiane, Rahul Vats & Dennis Ferruzzi
Track: Community
AS24: Hello Quality: Building CIs to run Providers Packages System Tests

Operators are a core feature of Apache Airflow, so it’s extremely important that we maintain their quality and prevent regressions. Automated test results help developers verify that their changes don’t introduce regressions or backward-incompatible behavior, and they give Airflow release managers the information needed to decide whether a given provider version is ready to release.

Recently, a new approach to assuring production quality was implemented for the AWS, Google, and Astronomer providers: standalone Continuous Integration processes were configured for each, with dashboards showing the results of the latest test runs. What has worked well for these providers may be a pattern for others to follow. During this presentation, AWS, Google, and Astronomer engineers will share the internals of the test dashboards they built; this approach can serve as a ‘blueprint’ for other providers.

California West
15:45 - 16:10.
By Anant Agarwal
Track: Use cases
AS24: Scaling Airflow for Data Productivity at Instacart

In this talk, we’ll discuss how Instacart leverages Apache Airflow to orchestrate a vast network of data pipelines, powering both our core infrastructure and dbt deployments. As a data-driven company, Airflow plays a critical role in enabling us to execute large and intricate pipelines securely, compliantly, and at scale.

We’ll delve into the following key areas:

a. High-Throughput Cluster Management: We’ll explore how we manage and maintain our Airflow cluster, ensuring the efficient execution of over 2,000 DAGs across diverse use cases.

b. Centralized Airflow Vision: We’ll outline our plans for establishing a company-wide, centralized Airflow cluster, consolidating all Airflow instances at Instacart.

c. Custom Airflow Tooling: We’ll showcase the custom tooling we’ve developed to manage YML-based DAGs, execute DAGs on external ECS workers, leverage Terraform for cluster deployment, and implement robust cluster monitoring at scale.

By sharing our extensive experience with Airflow, we aim to contribute valuable insights to the Airflow community.

California East
16:20 - 16:45.
By Nishchay Agrawal
Track: Use cases
AS24: Mastering Data Pipelines: Integrating Apache Airflow with Key Tools for Advanced Analytics

Elevate your data engineering and analytics capabilities in our cutting-edge seminar focused on Apache Airflow. Dive into sophisticated techniques for building and managing data pipelines, emphasizing data ingestion, transformation, and validation. Discover how to enhance your data infrastructure by integrating Airflow with pivotal tools like Databricks, Snowflake, and dbt, alongside robust big data processing solutions like Apache Spark and Hadoop. Leverage dynamic stream processing systems like Apache Kafka, and amplify your analytics prowess by connecting Airflow with BI platforms like Tableau and Power BI.

Delve into comprehensive strategies for monitoring and maintaining Airflow environments, leveraging built-in tools for proactive oversight, and troubleshooting complex issues efficiently. Gain valuable insights from real-world case studies that demonstrate Airflow’s transformative impact on workflow efficiency and data processing capabilities.

California East
10:10 - 10:30
Morning break
13:15 - 14:00
Lunch
17:00 - 17:30
Wrap up
09:00 - 09:25. Grand Ballroom
By Vikram Koka
Track: Keynote
Apache Airflow has emerged as the defacto standard for data orchestration. Over the last couple of years, Airflow has also seen increasing adoption for ML and AI use cases. It has been almost four years since the release of Airflow 2 and as a community we have agreed that it’s time for a major foundational release in the form of Airflow 3. This talk will introduce the vision behind Airflow 3, including the emerging technology trends in the industry and how Airflow will evolve in response.
09:25 - 10:10. Grand Ballroom
By Madison Swain-Bowden, Kaxil Naik, Michał Modras, Constance Martineau & Shubham Mehta
Track: Keynote
This session presents the tentative scope for the next generation of Airflow, i.e. Airflow 3.
10:30 - 10:55. California East
By Adam Bayrami
Track: Use cases
NerdWallet’s multi-tenant Airflow setup faced challenges such as slow DAG processing, which resulted in underutilized Snowflake warehouses and elevated costs. Although our transition to Airflow 2 may have been delayed, we recognize that many other teams are also working to get buy-in and resources for their migration efforts. We hope that our story of how unlocking the scheduling capabilities of Airflow 2 helped us cut our Snowflake spend in half can get you the buy-in you need.
10:30 - 11:15. California West
By Briana Okyere, Amogh Desai, Ryan Hatter & Srabasti Banerjee
Track: Community
“Connecting the Dots in Airflow: From User to Contributor” explores the journey of transitioning from an Airflow user to an active project contributor. This talk will cover essential steps, resources, and best practices to effectively engage with the Airflow community and make meaningful contributions. Attendees will gain insights into the collaborative nature of open-source projects and how their involvement can drive both personal growth and project innovation.
10:30 - 10:55. Elizabethan A+B
By Bhavesh Jaisinghani
Track: Use cases
In today’s data-driven era, ensuring data reliability and enhancing our testing and development capabilities are paramount. Local unit testing has its merits but falls short when dealing with the volume of big data. One major challenge is running Spark jobs pre-deployment to ensure they produce expected results and handle production-level data volumes. In this talk, we will discuss how Autodesk leveraged Astronomer to improve pipeline development. We’ll explore how it addresses challenges with sensitive and large data sets that cannot be transferred to local machines or non-production environments.
10:30 - 10:55. Georgian
By Rafal Biegacz
Track: Airflow & ...
11:00 - 11:25. California East
By Jet Mariscal
Track: Use cases
While Airflow is widely known for orchestrating and managing workflows, particularly in the context of data engineering, data science, ML (Machine Learning), and ETL (Extract, Transform, Load) processes, its flexibility and extensibility make it a highly versatile tool suitable for a variety of use cases beyond these domains. In fact, Cloudflare has publicly shared in the past an example on how Airflow was leveraged to build a system that automates datacenter expansions.
11:00 - 11:25. Elizabethan A+B
By Bartosz Jankiewicz
Track: New features
Apache Airflow is the backbone of countless data pipelines, but optimizing performance and resource utilization can be a challenge. This talk introduces a novel performance testing framework designed to measure, monitor, and improve the efficiency of Airflow deployments. I’ll delve into the framework’s modular architecture, showcasing how it can be tailored to various Airflow setups (Docker, Kubernetes, cloud providers). By measuring key metrics across schedulers, workers, triggers, and databases, this framework provides actionable insights to identify bottlenecks and compare performance across different versions or configurations.
11:00 - 11:25. Georgian
By Udit Saxena
Track: Use cases
This talk will explore ASAPP’s use of Apache Airflow to streamline and optimize our machine learning operations (MLOps). Key highlights include:
  • Integrating with our custom Spark solution to achieve speedup, efficiency, and cost gains for generative AI transcription, summarization, and intent-categorization pipelines
  • Design patterns for integrating with efficient LLM servers (TGI, vLLM, TensorRT) for summarization pipelines, with or without Spark
  • An overview of batched LLM inference using Airflow, as opposed to real-time inference outside of it
11:30 - 11:55. California East
By Madhav Khakhar & Alexander Shmidt
Track: Use cases
The talk will cover how we use Airflow at the heart of our Workflow Management Platform (WFM) at Booking.com, enabling our internal users to orchestrate big data workflows on Booking Data Exchange (BDX). A high-level overview of the talk:
  • Adapting the open-source Airflow Helm chart to spin up Airflow installations on Booking Kubernetes Service (BKS)
  • Coming up with a workflow definition format (YAML)
  • Converting workflow.yaml into workflow.py DAGs
  • Using deferrable operators to provide standard step templates to users
11:30 - 12:15. California West
By Konrad Schieban & Tim Hiatt
Track: Community
DAGify is a highly extensible, template-driven, enterprise scheduler migration accelerator that helps organizations speed up their migration to Apache Airflow. While DAGify does not claim to migrate 100% of existing scheduler functionality, it aims to heavily reduce the manual effort it takes for developers to convert their enterprise scheduler formats into Python-native Airflow DAGs. DAGify is an open source tool under the Apache 2.0 license and available on Github (https://github.
11:30 - 12:15. Elizabethan A+B
By Kaxil Naik & Ash Berlin-Taylor
Track: New features
Gen AI has taken the computing world by storm. As Enterprises and Startups have started to experiment with LLM applications, it has become clear that providing the right context to these LLM applications is critical. This process known as Retrieval augmented generation (RAG) relies on adding custom data to the large language model, so that the efficacy of the response can be improved. Processing custom data and integrating with Enterprise applications is a strength of Apache Airflow.
12:00 - 12:25. California East
By Jianlong Zhong
Track: Use cases
At Coinbase, Airflow is the backbone of ELT, supported by a vibrant community of over 500 developers. This vast engagement results in a continuous stream of enhancements, with hundreds of commits tested and released daily. However, this scale of development presents its own set of challenges, especially in deployment velocity. Traditional deployment methodologies proved inadequate, significantly impeding the productivity of our developers. Recognizing the critical need for a solution that matches our pace of innovation, we developed AirAgent: a bespoke, fully autonomous deployer designed specifically for Airflow.
11:30 - 12:15. Georgian
Track: Sponsored
Gen AI has taken the computing world by storm. As Enterprises and Startups have started to experiment with LLM applications, it has become clear that providing the right context to these LLM applications is critical. This process known as Retrieval augmented generation (RAG) relies on adding custom data to the large language model, so that the efficacy of the response can be improved. Processing custom data and integrating with Enterprise applications is a strength of Apache Airflow.
12:30 - 13:15. California East
By Cyrus Dukart, David Sacerdote & Jason Bridgemohansingh
Track: Use cases
At Vibrant Planet, we’re on a mission to make the world’s communities and ecosystems more resilient in the face of climate change. Our cloud-based platform is designed for collaborative scenario planning to tackle wildfires, climate threats, and ecosystem restoration on a massive scale. In this talk we will dive into how we are using Airflow. Particularly we will focus on how we’re making Airflow pipelines smarter and more resilient, especially when dealing with the task of processing large satellite imagery and other geospatial data.
12:30 - 13:15. California West
By Jarek Potiuk
Track: Community
Apache Airflow relies on a silent symphony behind the scenes: its CI/CD (Continuous Integration/Continuous Delivery) and development tooling. This presentation explores the critical role these tools play in keeping Airflow efficient and innovative. We’ll delve into how robust CI/CD ensures bug fixes and improvements are seamlessly integrated, while well-maintained development tools empower developers to contribute effectively. Airflow’s power comes from a well-oiled machine – its CI/CD and development tools. This presentation dives into the world of these often-overlooked heroes.
12:30 - 13:15. Elizabethan A+B
By Constance Martineau & Tzu-ping Chung
Track: New features
Join me at this year’s Airflow Summit as we delve into a pivotal evolution for Apache Airflow: The integration of data awareness. Airflow has long excelled as a workflow orchestration tool, managing complex workflows with ease and efficiency. However, it has operated with limited insight into the data it manipulates or the assets it produces. This talk will explore the implications and benefits of embedding deeper insights about these outputs directly into Airflow.
12:30 - 13:15. Georgian
By Shobhit Shah & Sumit Maheshwari
Track: Use cases
Up until a few years ago, teams at Uber used multiple data workflow systems, with some based on open source projects such as Apache Oozie, Apache Airflow, and Jenkins while others were custom built solutions written in Python and Clojure. Every user who needed to move data around had to learn about and choose from these systems, depending on the specific task they needed to accomplish. Each system required additional maintenance and operational burdens to keep it running, troubleshoot issues, fix bugs, and educate users.
14:00 - 16:40. Elizabethan C, Elizabethan D, Borgia
Track: Workshops
We will be offering hands-on workshops so you can get practical experience with Airflow tools and managed offerings. Format & duration: workshops are instructor-led, 2-3 hours long; bring your own device. Only available to participants with a Conference + Workshop pass. Workshops have limited capacity. You can sign up in advance for two workshops (one per day) to get a confirmed spot. Workshops will accept walk-ins (people who didn’t sign up in advance), but spots are limited and not all walk-ins are guaranteed entry.
14:00 - 14:25. California East
By Daniil Dubin
Track: Use cases
At Wix, more often than not, business analysts build workflows themselves so that data engineers don’t become a bottleneck. But how do you enable them to create SQL ETLs that start when dependencies are ready and send emails or refresh Tableau reports when the work is done? One simple answer is to use Airflow. The problem is that business analysts can’t be expected to know Python and Git well enough to create thousands of DAGs with ease.
14:00 - 14:25. California West
By Pete Dejoy
Track: Community
Every data team out there is being asked by their business stakeholders about Generative AI. Taking LLM-centric workloads to production is not a trivial task. At the foundational level, there is a set of challenges around data delivery, data quality, and data ingestion that mirror traditional data engineering problems. Once you’re past those, there’s a set of challenges related to the underlying use case you’re trying to solve. Thankfully, because of how Airflow was already being used at these companies for data engineering and MLOps use cases, it has become the de facto orchestration layer behind many GenAI use cases for startups and Fortune 500s.
14:00 - 14:55. Elizabethan A+B
By Jed Cunningham
Track: New features
Join us as we check in on the current status of AIP-63: DAG Versioning. This session will explore the motivations behind AIP-63, the challenges faced by Airflow users in understanding and managing DAG history, and how it aims to address them. From tracking TaskInstance history to improving DAG representation in the UI, we’ll examine what we’ve already done and what’s next. We’ll also touch upon the potential future steps outlined in AIP-66 regarding the execution of specific DAG versions.
14:35 - 15:00. California East
By Olivier Daneau
Track: Best practices
Cost management is a continuous challenge for our data teams at Astronomer. Understanding the expenses associated with running our workflows is not always straightforward, and identifying which process ran a query causing unexpected usage on a given day can be time-consuming. In this talk, we will showcase an Airflow Plugin and specific DAGs developed and used internally at Astronomer to track and optimize the costs of running DAGs. Our internal tool monitors Snowflake query costs, provides insights, and sends alerts for abnormal usage.
14:35 - 15:00. California West
By Ian Moritz
Track: Airflow & ...
Airflow is often used for running data pipelines, which themselves connect with other services through the provider system. However, it is also increasingly used as an engine under-the-hood for other projects building on top of the DAG primitive. For example, Cosmos is a framework for automatically transforming dbt DAGs into Airflow DAGs, so that users can supplement the developer experience of dbt with the power of Airflow. This session dives into how a select group of these frameworks (Cosmos, Meltano, Chronon) use Airflow as an engine for orchestrating complex workflows their systems depend on.
15:10 - 15:35. California East
By Vincent La, Jim Howard & Moulay Zaidane Al Bahi Draidia
Track: Use cases
Laurel provides an AI-driven timekeeping solution tailored for accounting and legal firms, automating timesheet creation by capturing digital work activities. This session highlights two notable AI projects: UTBMS Code Prediction: Leveraging small language models, this system builds new embeddings to predict work codes for legal bills with high accuracy. More details are available in our case study: https://www.laurel.ai/resources-post/enhancing-legal-and-accounting-workflows-with-ai-insights-into-work-code-prediction. Bill Creation and Narrative Generation: Utilizing Retrieval-Augmented Generation (RAG), this approach transforms users’ digital activities into fully billable entries.
15:10 - 15:35. California West
By Michael Robinson
Track: Community
The Apache Airflow community is so large and active that it’s tempting to take the view that “if it ain’t broke don’t fix it.” In a community as in a codebase, however, improvement and attention are essential to sustaining growth. And bugs are just as inevitable in community management as they are in software development. If only the fixes were, too! Airflow is large and growing because users love Airflow and our community.
15:10 - 15:35. Elizabethan A+B
By Venkata Jagannath & Marwan Sarieddine
Track: New features
Many organizations struggle to create a well-orchestrated AI infrastructure, using separate and disconnected platforms for data processing, model training, and inference, which slows down development and increases costs. There’s a clear need for a unified system that can handle all aspects of AI development and deployment, regardless of the size of data or models. Join our breakout session to see how our comprehensive solution simplifies the development and deployment of large language models in production.
15:45 - 16:10. California East
By Anant Agarwal
Track: Use cases
In this talk, we’ll discuss how Instacart leverages Apache Airflow to orchestrate a vast network of data pipelines, powering both our core infrastructure and dbt deployments. As a data-driven company, Airflow plays a critical role in enabling us to execute large and intricate pipelines securely, compliantly, and at scale. We’ll delve into the following key areas: a. High-Throughput Cluster Management: We’ll explore how we manage and maintain our Airflow cluster, ensuring the efficient execution of over 2,000 DAGs across diverse use cases.
15:45 - 16:45. California West
By Freddy Demiane, Rahul Vats & Dennis Ferruzzi
Track: Community
Operators are a core feature of Apache Airflow, so it’s extremely important that we maintain their quality and prevent regressions. Automated test results help developers verify that their changes don’t introduce regressions or backward-incompatible behavior, and they give Airflow release managers the information needed to decide whether a given provider version is ready to release. Recently, a new approach to assuring production quality was implemented for the AWS, Google, and Astronomer providers: standalone Continuous Integration processes were configured for each, with dashboards showing the results of the latest test runs.
15:45 - 16:10. Elizabethan A+B
By Sriram Vamsi Ilapakurthy
Track: Airflow intro talks
This talk delves into advanced Directed Acyclic Graph (DAG) design patterns that are pivotal for optimizing data pipeline management and boosting efficiency. We’ll cover dynamic DAG generation, which allows for flexible, scalable workflow creation based on real-time data and configurations. Learn about task grouping and SubDAGs to enhance readability and maintainability of complex workflows. We’ll also explore parameterized DAGs for injecting runtime parameters into tasks, enabling versatile and adaptable pipeline configurations.
16:20 - 16:45. California East
By Nishchay Agrawal
Track: Use cases
Elevate your data engineering and analytics capabilities in our cutting-edge seminar focused on Apache Airflow. Dive into sophisticated techniques for building and managing data pipelines, emphasizing data ingestion, transformation, and validation. Discover how to enhance your data infrastructure by integrating Airflow with pivotal tools like Databricks, Snowflake, and dbt, alongside robust big data processing solutions like Apache Spark and Hadoop. Leverage dynamic stream processing systems like Apache Kafka, and amplify your analytics prowess by connecting Airflow with BI platforms like Tableau and Power BI.