Apache Airflow has emerged as the de facto standard for data orchestration. Over the last couple of years, Airflow has also seen increasing adoption for ML and AI use cases. It has been almost four years since the release of Airflow 2, and as a community we have agreed that it’s time for a major foundational release in the form of Airflow 3.
This talk will introduce the vision behind Airflow 3, including the emerging technology trends in the industry and how Airflow will evolve in response. Specifically, this will include an overview of the architectural changes in Airflow to support emerging use cases and distributed data infrastructure models. This talk will also introduce the major features and the desired outcomes of the release. Because Airflow 3 is a foundational release, the talk will likewise introduce the new concepts arriving with it, some of which may only be fully realized in follow-on 3.x releases.
The goal of this talk is to raise awareness about Airflow 3 and to get feedback from the Airflow community while the release is still in the development phase.
Join us in this panel with key members of the community behind the development of Apache Airflow, where we will discuss the tentative scope for the next generation, i.e., Airflow 3.
In the realm of data engineering, machine learning pipelines, and cloud and web services, there is huge demand for orchestration technologies.
Apache Airflow is among the most popular orchestration technologies, and arguably the most popular one.
In this presentation we are going to focus on the aspects of Airflow that make it so popular, and ask whether it has become the orchestration industry standard.
“Connecting the Dots in Airflow: From User to Contributor” explores the journey of transitioning from an Airflow user to an active project contributor. This talk will cover essential steps, resources, and best practices to effectively engage with the Airflow community and make meaningful contributions. Attendees will gain insights into the collaborative nature of open-source projects and how their involvement can drive both personal growth and project innovation.
In today’s data-driven era, ensuring data reliability and enhancing our testing and development capabilities are paramount. Local unit testing has its merits but falls short when dealing with the volume of big data. One major challenge is running Spark jobs pre-deployment to ensure they produce expected results and handle production-level data volumes.
In this talk, we will discuss how Autodesk leveraged Astronomer to improve pipeline development. We’ll explore how it addresses challenges with sensitive and large data sets that cannot be transferred to local machines or non-production environments. Additionally, we’ll cover how this approach supports over 10 engineers working simultaneously on different feature branches within the same repo.
We will highlight the benefits, such as conflict-free development and testing and the elimination of concerns about data corruption when running DAGs on production Airflow servers.
Join me to discover how solutions like Astronomer empower developers to work with increased efficiency and reliability. This talk is perfect for those interested in big data, cloud solutions, and innovative development practices.
In this talk, we’ll discuss how Instacart leverages Apache Airflow to orchestrate a vast network of data pipelines, powering both our core infrastructure and dbt deployments. As a data-driven company, we rely on Airflow to execute large and intricate pipelines securely, compliantly, and at scale.
We’ll delve into the following key areas:
a. High-Throughput Cluster Management: We’ll explore how we manage and maintain our Airflow cluster, ensuring the efficient execution of over 2,000 DAGs across diverse use cases.
b. Centralized Airflow Vision: We’ll outline our plans for establishing a company-wide, centralized Airflow cluster, consolidating all Airflow instances at Instacart.
c. Custom Airflow Tooling: We’ll showcase the custom tooling we’ve developed to manage YML-based DAGs, execute DAGs on external ECS workers, leverage Terraform for cluster deployment, and implement robust cluster monitoring at scale.
By sharing our extensive experience with Airflow, we aim to contribute valuable insights to the Airflow community.
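To make the YML-based approach concrete, here is a minimal sketch of the general pattern, assuming the TaskFlow API; it is illustrative only, not Instacart’s actual tooling, and the spec format is invented:

```python
# Illustrative only: expand a YAML spec into a linear Airflow DAG at parse time.
from datetime import datetime

import yaml
from airflow.decorators import dag, task

SPEC = yaml.safe_load("""
dag_id: orders_rollup
schedule: "@hourly"
tasks: [extract, transform, load]
""")


@dag(dag_id=SPEC["dag_id"], schedule=SPEC["schedule"],
     start_date=datetime(2024, 1, 1), catchup=False)
def from_yaml():
    previous = None
    for name in SPEC["tasks"]:
        @task(task_id=name)
        def step(step_name: str = name):  # default arg avoids late binding
            print(f"running {step_name}")

        current = step()
        if previous is not None:
            previous >> current  # chain tasks in spec order
        previous = current


from_yaml()
```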
This talk will explore ASAPP’s use of Apache Airflow to streamline and optimize our machine learning operations (MLOps). Key highlights include:
Integrating with our custom Spark solution to achieve speedups, efficiency, and cost gains for generative AI transcription, summarization, and intent categorization pipelines
Different design patterns for integrating with efficient LLM servers, such as TGI, vLLM, and TensorRT, for summarization pipelines with or without Spark
An overview of batched LLM inference using Airflow, as opposed to real-time inference outside of it
[Tentative] Possible extension of this scaffolding to Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) for fine-tuning LLMs, using Airflow as the orchestrator.
Additionally, the talk will cover ASAPP’s MLOps journey with Airflow over the past few years, including an overview of our cloud infrastructure, various data backends, and sources.
The primary focus will be on the machine learning workflows at ASAPP, rather than the data workflows, providing a detailed look at how Airflow enhances our MLOps processes.
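As a hedged sketch of the batched-inference pattern mentioned above (not ASAPP’s pipeline; the batch size and the endpoint call are stand-ins), dynamic task mapping can fan a day’s transcripts out across inference tasks:

```python
# Illustrative batched LLM inference: map an inference task over batches.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def batch_summarization():
    @task
    def make_batches() -> list[list[str]]:
        transcripts = [f"call-{i}" for i in range(100)]  # placeholder IDs
        return [transcripts[i:i + 25] for i in range(0, len(transcripts), 25)]

    @task
    def summarize(batch: list[str]) -> int:
        # Placeholder: a real task would POST the batch to an LLM server
        # such as TGI or vLLM and persist the summaries.
        print(f"summarizing {len(batch)} transcripts")
        return len(batch)

    summarize.expand(batch=make_batches())


batch_summarization()
```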
Apache Airflow is the backbone of countless data pipelines, but optimizing performance and resource utilization can be a challenge. This talk introduces a novel performance testing framework designed to measure, monitor, and improve the efficiency of Airflow deployments. I’ll delve into the framework’s modular architecture, showcasing how it can be tailored to various Airflow setups (Docker, Kubernetes, cloud providers). By measuring key metrics across schedulers, workers, triggers, and databases, this framework provides actionable insights to identify bottlenecks and compare performance across different versions or configurations.
Attendees will learn:
The motivation behind developing a standardized performance testing approach.
Key design considerations and challenges in measuring performance across diverse Airflow environments.
How to leverage the framework to construct test suites for different use cases (e.g., version comparison).
Practical tips for interpreting performance test results and making informed decisions about resource allocation.
How this framework contributes to greater transparency in Airflow release notes, empowering users with performance data.
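As one small, assumed example of the kind of metric such a framework can collect (this is not the framework’s own code), DAG-file parse time can be sampled with Airflow’s DagBag:

```python
# Measure median DAG-file parse time across a few rounds using DagBag.
import statistics
import time

from airflow.models import DagBag


def measure_parse_time(dag_folder: str, rounds: int = 5) -> float:
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        DagBag(dag_folder=dag_folder, include_examples=False)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    print(f"median parse time: {measure_parse_time('dags/'):.2f}s")
```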
While Airflow is widely known for orchestrating and managing workflows, particularly in the context of data engineering, data science, ML (Machine Learning), and ETL (Extract, Transform, Load) processes, its flexibility and extensibility make it a highly versatile tool suitable for a variety of use cases beyond these domains. In fact, Cloudflare has publicly shared an example of how Airflow was leveraged to build a system that automates datacenter expansions.
In this talk, I will share a few more of our use cases beyond traditional data engineering, demonstrating Airflow’s sophisticated capabilities for orchestrating a wide variety of complex workflows, and discussing how Airflow played a crucial role in building some of the highly successful autonomous systems at Cloudflare, from handling automated bare metal server diagnostics and recovery at scale, to Zero Touch Provisioning that is helping us accelerate the rollout of inference-optimized GPUs in 150+ cities in multiple countries globally.
DAGify is a highly extensible, template-driven, enterprise scheduler migration accelerator that helps organizations speed up their migration to Apache Airflow. While DAGify does not claim to migrate 100% of existing scheduler functionality, it aims to heavily reduce the manual effort it takes for developers to convert their enterprise scheduler formats into Python-native Airflow DAGs.
DAGify is an open source tool under the Apache 2.0 license and available on GitHub (https://github.com/GoogleCloudPlatform/dagify).
In this session we will introduce DAGify and its use cases, and demo its functionality by converting Control-M XML files to Airflow DAGs.
Additionally, we will highlight DAGify’s “no-code” extensibility by creating custom conversion templates that map Control-M functionality to Airflow operators.
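For intuition only, the core conversion idea looks something like the following sketch; the XML attributes are simplified stand-ins, and this is not DAGify’s actual template engine:

```python
# Illustrative sketch: map a Control-M-style XML job onto Airflow task source.
import xml.etree.ElementTree as ET

CONTROL_M_XML = '<JOB JOBNAME="load_sales" CMDLINE="python load_sales.py" />'


def job_to_task_source(xml_snippet: str) -> str:
    job = ET.fromstring(xml_snippet)
    # Render Python source for an equivalent Airflow task.
    return (
        f'BashOperator(task_id="{job.attrib["JOBNAME"]}", '
        f'bash_command="{job.attrib["CMDLINE"]}")'
    )


print(job_to_task_source(CONTROL_M_XML))
```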
Up until a few years ago, teams at Uber used multiple data workflow systems, with some based on open source projects such as Apache Oozie, Apache Airflow, and Jenkins, while others were custom-built solutions written in Python and Clojure.
Every user who needed to move data around had to learn about and choose from these systems, depending on the specific task they needed to accomplish. Each system required additional maintenance and operational burdens to keep it running, troubleshoot issues, fix bugs, and educate users.
After evaluating these systems, and with the goal of converging on a single workflow system capable of supporting Uber’s scale, we settled on an Airflow-based system. The Airflow-based DSL provided the best trade-off of flexibility, expressiveness, and ease of use while being accessible for our broad range of users, which includes data scientists, developers, machine learning experts, and operations employees.
This talk will focus on scaling Airflow to meet Uber’s needs and on providing a no-code, seamless user experience.
Gen AI has taken the computing world by storm. As enterprises and startups have started to experiment with LLM applications, it has become clear that providing the right context to these applications is critical.
This process, known as retrieval-augmented generation (RAG), relies on adding custom data to the large language model so that the efficacy of the response can be improved. Processing custom data and integrating with enterprise applications is a strength of Apache Airflow.
This talk details a vision for enhancing Apache Airflow to support RAG more intuitively, with additional capabilities and patterns. Specifically, these include the following:
Support for unstructured data sources such as text, extending to image, audio, video, and custom sensor data
LLM model invocation, including both external model services through APIs and local models using container invocation.
Automatic index refreshing, with a focus on unstructured data lifecycle management to avoid cumbersome and expensive index creation on vector databases
Templates for hallucination reduction via testing and scoping strategies
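A minimal sketch of what such a RAG ingestion DAG could look like follows; the chunking and embedding steps are stubbed placeholders, and a real pipeline would call an embedding service and a vector database:

```python
# Illustrative RAG index-refresh DAG using the TaskFlow API.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def rag_index_refresh():
    @task
    def extract() -> list[str]:
        # Placeholder: fetch new or changed documents from an unstructured source.
        return ["Airflow orchestrates data pipelines.", "RAG adds custom context."]

    @task
    def chunk(docs: list[str]) -> list[str]:
        # Naive fixed-size chunking; real pipelines use smarter splitters.
        return [d[i:i + 512] for d in docs for i in range(0, len(d), 512)]

    @task
    def embed_and_upsert(chunks: list[str]) -> int:
        # Placeholder: a real task would call an embedding API and upsert the
        # vectors into a vector database, replacing stale index entries.
        print(f"upserting {len(chunks)} vectors")
        return len(chunks)

    embed_and_upsert(chunk(extract()))


rag_index_refresh()
```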
The talk will cover how we use Airflow at the heart of our Workflow Management Platform (WFM) at Booking.com, enabling our internal users to orchestrate big data workflows on the Booking Data Exchange (BDX).
At Coinbase, Airflow is the backbone of ELT, supported by a vibrant community of over 500 developers. This vast engagement results in a continuous stream of enhancements, with hundreds of commits tested and released daily. However, this scale of development presents its own set of challenges, especially in deployment velocity.
Traditional deployment methodologies proved inadequate, significantly impeding the productivity of our developers. Recognizing the critical need for a solution that matches our pace of innovation, we developed AirAgent: a bespoke, fully autonomous deployer designed specifically for Airflow. Capable of deploying updates hundreds of times a day on both staging and production environments, AirAgent has transformed our development lifecycle, enabling immediate iteration and drastically improving developer velocity.
This talk aims to unveil the inner workings of AirAgent, highlighting its design principles, deployment strategies, and the challenges we overcame in its implementation. By sharing our journey, we hope to offer insights and strategies that can benefit others in the Airflow community, encouraging a shift towards a high-frequency deployment workflow.
The Apache Airflow community is so large and active that it’s tempting to take the view that “if it ain’t broke don’t fix it.”
In a community as in a codebase, however, improvement and attention are essential to sustaining growth. And bugs are just as inevitable in community management as they are in software development. If only the fixes were, too!
Airflow is large and growing because users love Airflow and our community. But what steps could be taken to enhance the typical user’s and developer’s experience of the community?
This talk will provide an overview of potential learnings for Airflow community management efforts, such as project governance and analytics, derived from the speaker’s experience managing the OpenLineage and Marquez open-source communities.
The talk will answer questions such as: What can we learn from other open-source communities when it comes to supporting users and developers and learning from them? For example, what options exist for getting historical data out of Slack despite the limitations of the free tier? What tools can be used to make adoption metrics more reliable? What are some effective supplements to asynchronous governance?
At Vibrant Planet, we’re on a mission to make the world’s communities and ecosystems more resilient in the face of climate change. Our cloud-based platform is designed for collaborative scenario planning to tackle wildfires, climate threats, and ecosystem restoration on a massive scale.
In this talk we will dive into how we are using Airflow, focusing in particular on how we’re making our pipelines smarter and more resilient when processing large satellite imagery and other geospatial data.
We’ll discuss our self-healing pipelines, which identify likely out-of-memory events and incrementally allocate more memory for task instance retries, ensuring robust and uninterrupted workflow execution.
We’ll also discuss how we set intelligent initial memory allocations for each task instance, enhancing resource efficiency from the outset.
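As a simplified, assumed illustration of the escalating-retry idea (not Vibrant Planet’s actual code), a task can read its own try_number and scale its memory budget accordingly; wiring that budget into Kubernetes pod overrides is executor-specific and elided here:

```python
# Illustrative self-healing task: grow the memory budget on each retry.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def self_healing_geo_pipeline():
    @task(retries=3)
    def process_tile(ti=None):  # Airflow injects the TaskInstance by name
        # Double the assumed memory budget on every attempt: 2Gi, 4Gi, 8Gi, ...
        memory_gi = 2 ** ti.try_number
        print(f"attempt {ti.try_number}: processing with a {memory_gi}Gi budget")
        # Real code would size window/chunk parameters (or a pod override)
        # from this budget before reading the satellite imagery.

    process_tile()


self_healing_geo_pipeline()
```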
Data engineers have shifted from delivering data for internal analytics applications to customer-facing data products. And with that shift comes a whole new level of operational rigor necessary to instill trust and confidence in the data. How do you hold data pipelines to the same standards as traditional software applications? Can you apply principles learned from the field of SRE to the world of data?
In this talk, we’ll explore how we’ve seen this evolve in Astronomer’s customer base and highlight best practices learned from the most critical data product applications we’ve seen. We’ll hear from Astronomer’s own data team as they went through the transformation from analytics to data products. And we’ll showcase a new product we’re building to help data teams around the world solve exactly this problem!
As Apache Airflow evolves, a key shift is emerging: the move from task-centric to data-aware orchestration. Traditionally, Airflow has focused on managing tasks efficiently, with limited visibility into the data those tasks manipulate. However, the rise of data-centric workflows demands a new approach—one that puts data at the forefront.
This talk will explore how embedding deeper data insights into Airflow can align with modern users’ needs, reducing complexity and enhancing workflow efficiency. We’ll discuss how this evolution can transform Airflow into a more intuitive and powerful tool, better suited to today’s data-driven environments.
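Airflow’s Datasets (available since 2.4) are a core building block of this shift. A minimal example of data-aware scheduling, with an assumed dataset URI:

```python
# Data-aware scheduling: the consumer DAG runs whenever the Dataset is updated.
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

orders = Dataset("s3://warehouse/orders.parquet")  # assumed dataset URI


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def produce_orders():
    @task(outlets=[orders])
    def write_orders():
        print("refreshing orders.parquet")

    write_orders()


@dag(schedule=[orders], start_date=datetime(2024, 1, 1), catchup=False)
def consume_orders():
    @task
    def build_report():
        print("rebuilding report from fresh orders data")

    build_report()


produce_orders()
consume_orders()
```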
Apache Airflow relies on a silent symphony behind the scenes: its CI/CD (Continuous Integration/Continuous Delivery) and development tooling. This presentation explores the critical role these tools play in keeping Airflow efficient and innovative. We’ll delve into how robust CI/CD ensures bug fixes and improvements are seamlessly integrated, while well-maintained development tools empower developers to contribute effectively.
Airflow’s power comes from a well-oiled machine – its CI/CD and development tools. This presentation dives into the world of these often-overlooked heroes. We’ll explore how seamless CI/CD pipelines catch and fix issues early, while robust development tools empower efficient coding and collaboration. Discover how you can use and contribute to a thriving Airflow ecosystem by ensuring these crucial tools stay in top shape.
Join us as we check in on the current status of AIP-63: DAG Versioning. This session will explore the motivations behind AIP-63, the challenges faced by Airflow users in understanding and managing DAG history, and how it aims to address them. From tracking TaskInstance history to improving DAG representation in the UI, we’ll examine what we’ve already done and what’s next. We’ll also touch upon the potential future steps outlined in AIP-66 regarding the execution of specific DAG versions.
At Wix, more often than not, business analysts build workflows themselves to avoid making data engineers a bottleneck. But how do you enable them to create SQL ETLs that start when dependencies are ready and send emails or refresh Tableau reports when the work is done? One simple answer may be to use Airflow. The problem is that not every BA can be expected to know Python and Git well enough to create thousands of DAGs easily.
To bridge this gap we have built a web-based IDE, called Quix, that allows simple notebook-like development of Trino SQL workflows and converts them to Airflow DAGs when a user hits the “schedule” button.
During the talk we will go through the problems of building a reliable and extendable DAG-generating tool, explain why we preferred Airflow over Apache Oozie, and share the tricks (sharding, HA mode, etc.) that allow Airflow to run 8,000 active DAGs on a single cluster in k8s.
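The general shape of such a DAG-generating tool can be sketched under assumptions (this is not Quix’s implementation, and the Trino call is stubbed):

```python
# Illustrative config-to-DAG factory: each entry becomes a scheduled SQL DAG.
from datetime import datetime

from airflow.decorators import dag, task

# Hypothetical definitions, e.g. produced when a user hits "schedule" in an IDE.
WORKFLOWS = [
    {"id": "daily_revenue", "schedule": "@daily", "sql": "SELECT 1"},
    {"id": "hourly_sessions", "schedule": "@hourly", "sql": "SELECT 2"},
]


def build_dag(wf: dict):
    @dag(dag_id=wf["id"], schedule=wf["schedule"],
         start_date=datetime(2024, 1, 1), catchup=False)
    def generated():
        @task
        def run_sql(statement: str = wf["sql"]):
            # Placeholder: a real task would submit the statement to Trino.
            print(f"running: {statement}")

        run_sql()

    return generated()


for wf in WORKFLOWS:
    globals()[wf["id"]] = build_dag(wf)  # expose each DAG for Airflow discovery
```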
Every data team out there is being asked by their business stakeholders about Generative AI. Taking LLM-centric workloads to production is not a trivial task. At the foundational level, there are a set of challenges around data delivery, data quality, and data ingestion that mirror traditional data engineering problems. Once you’re past those, there’s a set of challenges related to the underlying use case you’re trying to solve. Thankfully, because of how Airflow was already being used at these companies for data engineering and MLOps use cases, it has become the de facto orchestration layer behind many GenAI use cases for startups and Fortune 500s.
This talk will be a tour of various methods, best practices, and considerations used in the Airflow community when taking GenAI use cases to production. We’ll focus on 4 primary use cases: RAG, fine-tuning, resource management, and batch inference, and walk through patterns different members of the community have used to productionize this new, exciting technology.
Airflow is often used for running data pipelines, which themselves connect with other services through the provider system. However, it is also increasingly used as an engine under-the-hood for other projects building on top of the DAG primitive. For example, Cosmos is a framework for automatically transforming dbt DAGs into Airflow DAGs, so that users can supplement the developer experience of dbt with the power of Airflow.
This session dives into how a select group of these frameworks (Cosmos, Meltano, Chronon) use Airflow as an engine for orchestrating complex workflows their systems depend on. In particular, we will discuss ways that we’ve increased Airflow performance to meet application-specific demands (high-task-count Cosmos DAGs, streaming jobs in Chronon), new Airflow features that will evolve how these frameworks use Airflow under the hood (DAG versioning, dataset integrations), and paths we see these projects taking over the next few years as Airflow grows. Airflow is not just a DAG platform, it’s an application platform!
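For readers new to Cosmos, a minimal usage sketch follows; the paths, profile names, and schedule here are placeholder assumptions, so check the Cosmos documentation for the current API:

```python
# Render a dbt project as an Airflow DAG with Cosmos (astronomer-cosmos).
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig

jaffle_shop = DbtDag(
    dag_id="jaffle_shop",
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/jaffle_shop"),
    profile_config=ProfileConfig(
        profile_name="jaffle_shop",
        target_name="dev",
        profiles_yml_filepath="/usr/local/airflow/dags/dbt/profiles.yml",
    ),
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
```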
Cost management is a continuous challenge for our data teams at Astronomer. Understanding the expenses associated with running our workflows is not always straightforward, and identifying which process ran a query causing unexpected usage on a given day can be time-consuming.
In this talk, we will showcase an Airflow Plugin and specific DAGs developed and used internally at Astronomer to track and optimize the costs of running DAGs. Our internal tool monitors Snowflake query costs, provides insights, and sends alerts for abnormal usage. With it, Astronomer identified and refactored its most costly DAGs, resulting in an almost 25% reduction in Snowflake spending.
We will demonstrate how to track Snowflake-related DAG costs and discuss how the tool can be adapted to any database that supports query tagging, such as BigQuery, Oracle, and more.
This talk will cover the implementation details and show how Airflow users can effectively adopt this tool to monitor and manage their DAG costs.
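The underlying mechanism can be approximated with a short sketch (not Astronomer’s actual plugin): tag each session with DAG, task, and run identifiers so Snowflake’s QUERY_HISTORY can attribute spend. The connection ID and table names are assumptions:

```python
# Tag Snowflake queries with DAG/task/run metadata for later cost attribution.
from datetime import datetime

from airflow.decorators import dag
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def tagged_snowflake_job():
    SnowflakeOperator(
        task_id="load_orders",
        snowflake_conn_id="snowflake_default",  # assumed connection id
        sql=[
            # The Jinja-templated tag lets QUERY_HISTORY group spend by run.
            "ALTER SESSION SET QUERY_TAG = "
            "'{{ dag.dag_id }}.{{ task.task_id }}.{{ run_id }}'",
            "INSERT INTO analytics.orders SELECT * FROM staging.orders",
        ],
    )


tagged_snowflake_job()
```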
Laurel provides an AI-driven timekeeping solution tailored for accounting and legal firms, automating timesheet creation by capturing digital work activities. This session highlights two notable AI projects:
UTBMS Code Prediction: Leveraging small language models, this system builds new embeddings to predict work codes for legal bills with high accuracy. More details are available in our case study: https://www.laurel.ai/resources-post/enhancing-legal-and-accounting-workflows-with-ai-insights-into-work-code-prediction.
Bill Creation and Narrative Generation: Utilizing Retrieval-Augmented Generation (RAG), this approach transforms users’ digital activities into fully billable entries.
Additionally, we will discuss how we use Airflow for model management in these AI projects:
Daily Model Retraining: We retrain our models daily
Model (Re)deployment: Our Airflow DAG evaluates model performance, redeploying it if improvements are detected
Cost Management: To avoid high costs associated with querying large language models frequently, our DAG utilizes RAG to efficiently summarize daily activities into a billable timesheet at day’s end.
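A hedged sketch of the evaluate-then-redeploy pattern described above (not Laurel’s actual DAG; the scores and registry lookup are stubbed):

```python
# Daily retrain, then branch: redeploy only if the new model scores better.
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.empty import EmptyOperator


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_model_refresh():
    @task
    def retrain() -> float:
        # Placeholder: train on the day's activity data, return a metric.
        return 0.91

    @task.branch
    def evaluate(new_score: float) -> str:
        production_score = 0.89  # assumed: fetched from a model registry
        return "redeploy" if new_score > production_score else "keep_current"

    @task
    def redeploy():
        print("promoting the new model to production")

    keep_current = EmptyOperator(task_id="keep_current")

    evaluate(retrain()) >> [redeploy(), keep_current]


daily_model_refresh()
```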
Operators are a core feature of Apache Airflow, and it’s extremely important that we maintain their high quality and prevent regressions. Automated test results help developers double-check that introduced changes don’t cause regressions or backward-incompatible changes, and they provide Airflow release managers with information on whether a given version of a provider is ready to be released.
Recently a new approach to assuring production quality was implemented for the AWS, Google, and Astronomer-provided operators: standalone Continuous Integration processes were configured for them, and test results dashboards show the results of the last test runs. What has been working well for these operator providers might be a pattern to follow for others. During this presentation, AWS, Google, and Astronomer engineers are going to share the internals of the Test Dashboards implemented for their operators; this approach might be a ‘blueprint’ for other providers to follow.
Many organizations struggle to create a well-orchestrated AI infrastructure, using separate and disconnected platforms for data processing, model training, and inference, which slows down development and increases costs. There’s a clear need for a unified system that can handle all aspects of AI development and deployment, regardless of the size of data or models.
Join our breakout session to see how our comprehensive solution simplifies the development and deployment of large language models in production. Learn how to streamline your AI operations by implementing an end-to-end ML lifecycle on your custom data, including automated LLM fine-tuning, LLM evaluation, LLM serving, and LoRA deployments.
This talk delves into advanced Directed Acyclic Graph (DAG) design patterns that are pivotal for optimizing data pipeline management and boosting efficiency. We’ll cover dynamic DAG generation, which allows for flexible, scalable workflow creation based on real-time data and configurations. Learn about task grouping and SubDAGs to enhance the readability and maintainability of complex workflows. We’ll also explore parameterized DAGs for injecting runtime parameters into tasks, enabling versatile and adaptable pipeline configurations. Additionally, the session will address branching and conditional execution to manage workflow paths dynamically based on data conditions or external triggers. Lastly, understand how to leverage parallelism and concurrency to maximize resource utilization and reduce execution times. This session is designed for intermediate to advanced users who are familiar with the basics of Airflow and are looking to deepen their understanding of its more sophisticated capabilities.
The session focuses on practical, high-impact design patterns that can significantly improve the performance and scalability of Airflow deployments.
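As a brief, assumed illustration of two of the patterns named above, here is a TaskGroup combined with runtime branching on a data condition; all names and thresholds are invented:

```python
# Branching into a TaskGroup: pick a heavy or light path from the data volume.
from datetime import datetime

from airflow.decorators import dag, task, task_group


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def design_patterns_demo():
    @task
    def row_count() -> int:
        return 42  # placeholder: probe the day's input volume

    @task.branch
    def choose_path(rows: int) -> str:
        # Conditional execution: select a downstream path from a data condition.
        return "heavy.transform" if rows > 1000 else "light_transform"

    @task_group(group_id="heavy")
    def heavy():
        @task
        def transform():
            print("large-volume path")

        return transform()

    @task
    def light_transform():
        print("small-volume path")

    choose_path(row_count()) >> [heavy(), light_transform()]


design_patterns_demo()
```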
Feeling trapped in a maze of duplicate Airflow DAG code? We were too! That’s why we embarked on a journey to build a centralized library, eliminating redundancy and unlocking delightful efficiency.
Join us as we share our journey.
Let’s break free from complexity and duplication, and build a brighter Airflow future together!