These are the sessions that were presented at Airflow Summit 2022.

You can also check the archive of sessions for previous editions.


Dynamic Dags -- The New Horizon

In Airflow 2.3 the ability to change the number of tasks dynamically opens up some exciting new ways of building DAGs and lets us create new patterns that just weren’t possible before. In this session I will cover a little bit about AIP-42 and the interface for Dynamic Task Mapping, and cover some common use cases and patterns.
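For readers who have not tried the feature yet, here is a minimal sketch of the Dynamic Task Mapping interface introduced in Airflow 2.3; the DAG name, task names, and file list are illustrative only.

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(start_date=datetime(2022, 1, 1), schedule_interval=None, catchup=False)
def dynamic_mapping_example():
    @task
    def list_files():
        # In practice this might list objects in a bucket; the number of
        # downstream task instances is decided at runtime from this result.
        return ["a.csv", "b.csv", "c.csv"]

    @task
    def process(path: str):
        print(f"processing {path}")

    # expand() creates one mapped task instance per element returned above.
    process.expand(path=list_files())

dynamic_mapping_example()
```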
Ash Berlin-Taylor

How to eliminate Data Downtime & start trusting your data

Broken data is costly, time-consuming, and nowadays, an all-too-common reality for even the most advanced data teams. In this talk, I’ll introduce this problem, called “data downtime” — periods of time when data is partial, erroneous, missing or otherwise inaccurate — and discuss how to eliminate it in your data ecosystem with end-to-end data observability. Drawing corollaries to application observability in software engineering, data observability is a critical component of the modern DataOps workflow and the key to ensuring data trust at scale.
Barr Moses

Manage Dags at scale (Dags versioning & package management)

This talk is all about how we at Jagex manage DAGs at scale, focusing on the following challenges and how we resolved them: keeping track of Airflow state; keeping track of each DAG's state; DAGs as git submodules; updating Airflow with new DAGs; how to seamlessly automate Airflow deployment; and how to avoid package dependency conflicts.
Anum Sheraz

Using Apache Airflow to orchestrate workflows across hybrid environments

According to analysts, 87 percent of enterprises have already adopted hybrid cloud strategies (https://www.flexera.com/blog/industry-trends/trend-of-cloud-computing-2020/). Customers have many reasons why they need to support hybrid environments, from maximising the value from heritage systems, to meeting local compliance and data processing regulations. As they build their data pipelines, they increasingly need to be able to orchestrate those across on-premises and cloud environments. In this session, I will share how you can leverage Apache Airflow to orchestrate a workflow using data sources inside and outside the cloud.
Ricardo Sueiras

Choosing Apache Airflow over other Proprietary Tools for your Orchestration needs

Organizations need to effectively manage large volumes of complex, business-critical workloads across multiple applications and platforms. Choosing the right workflow orchestration tool is important, as it can help teams effectively automate the configuration, coordination, integration, and data management processes on several applications and systems. Currently there are a lot of tools (both open source and proprietary) available for orchestrating tasks and data workflows with automation features. Each of them claims to offer centralized, repeatable, reproducible, and efficient workflow coordination.
Parnab Basak

Airflow at Shopify: Keeping Users Happy while Running Airflow at Scale

Two years after starting our Airflow adoption, we’re running over 10,000 DAGs in production. On this journey we’ve learned a lot about Airflow management and stewardship and developed some unique tools to help us scale. We’re excited to share our experience and some of the lessons we’ve picked up along the way. In this talk we’ll cover: the history of Airflow at Shopify; our infrastructure and architecture; and the custom tools and procedures we’ve adopted to keep Airflow running smoothly and our users happy.
Sam Wheating & Megan Parker

What is data lineage and why should I care?

If a job fails, how can you learn about downstream datasets that have become out-of-date? Can you be confident that jobs are consuming fresh, high-quality data from their upstream sources? How might you predict the impact of a planned change on distant corners of the pipeline? These questions become easier once you have a complete understanding of data lineage, the complex set of relationships between all of your jobs and datasets.
Ross Turk

Vega: Unifying Machine Learning Workflows at Credit Karma using Apache Airflow

At Credit Karma, we enable financial progress for more than 100 million of our members by recommending them personalized financial products when they interact with our application. In this talk we introduce our machine learning platform for building interactive and production model-building workflows to serve relevant financial products to Credit Karma users. Vega, Credit Karma’s machine learning platform, has three major components: 1) QueryProcessor for feature and training data generation, backed by Google BigQuery; 2) PipelineProcessor for feature transformations, offline scoring and model analysis, backed by Apache Beam; and 3) ModelProcessor for running Tensorflow and Scikit models, backed by Google AI Platform, which gives data scientists the flexibility to explore different kinds of machine learning or deep learning models, ranging from gradient boosted trees to neural networks with complex structures.
Debasish Das, Raj Katakam & Nicholas Pataki

Preventative Metadata: Building for data reliability with DataHub, Airflow, & Great Expectations

Recently there has been much discussion around data monitoring, particularly with regard to reducing the time to mitigate data quality problems once they’ve been detected. The problem with reactive or periodic monitoring as the de facto standard for maintaining data quality is that it’s expensive. By the time a data problem has been identified, its effects may have been amplified across a myriad of downstream consumers, leaving you (a data engineer) with a big mess to clean up.
John Joyce & Tamás Németh

Airflow at high scale for Autonomous Driving

This talk highlights a large-scale use case of Airflow to orchestrate workflows for an Autonomous Driving project based in Germany. To support our customer in the aim of producing their first Level-3 Autonomous Driving vehicle in Germany, we are utilising Airflow as a state-of-the-art tool to orchestrate workloads running on a large-scale HPC platform. In this talk, we will describe our Airflow setup deployed on OpenShift that is capable of running thousands of tasks in parallel and contains various custom improvements optimised for our use case.
Philipp Lang & Anton Ivanov

Happy DAGs + Happy Teammates: How a little CI/CD can go a long way

With a small amount of Cloud Build automation and the use of GitHub version control, your Airflow DAGs will always be tested and in sync no matter who is working on them. Leah will walk you through a sample CI/CD workflow for keeping your Airflow DAGs tested and in sync between environments and teammates.
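Whatever CI system you use, a common building block for such a pipeline is a DAG integrity test that fails the build on import errors; a minimal sketch (the file name and test name are assumptions):

```python
# test_dag_integrity.py - run by the CI job before deploying DAGs
from airflow.models import DagBag

def test_dags_import_without_errors():
    dag_bag = DagBag(include_examples=False)
    # Any DAG file that raises on import shows up here and fails the build.
    assert dag_bag.import_errors == {}
```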
Leah Cole

A look under the hood of the Airflow logging subsystem

The task logging subsystem is one of the most flexible, yet complex and misunderstood, components of Airflow. In this talk, we will take a look at the various task log handlers that are part of the core Airflow distribution, dig a bit deeper into the interfaces they implement, and discuss how those can be used to roll your own logging implementation.
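As a taste of what rolling your own handler involves, here is a hedged sketch of the pattern commonly used by remote log handlers: extend FileTaskHandler, let it write locally, and ship the file elsewhere when the handler is closed. The class name and the upload function are placeholders, not part of Airflow.

```python
from airflow.utils.log.file_task_handler import FileTaskHandler

def upload_to_my_storage(path: str) -> None:
    # Placeholder for your own upload logic (S3, GCS, Elasticsearch, ...).
    print(f"would upload {path}")

class MyRemoteTaskHandler(FileTaskHandler):
    """Hypothetical handler: write task logs locally, then copy them to remote storage."""

    def close(self):
        # Let the base handler flush and close the local log file first.
        super().close()
        if self.handler is not None:
            upload_to_my_storage(self.handler.baseFilename)
```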
Philippe Gagnon

Keep Calm & Query On: Debugging Broken Data Pipelines with Airflow

“Why is my data missing?” “Why didn’t my Airflow job run?” “What happened to this report?” If you’ve been on the receiving end of any of these questions, you’re not alone. As data pipelines become increasingly complex and companies ingest more and more data, data engineers are on the hook for troubleshooting where, why, and how data quality issues occur, and most importantly, fixing them so systems can get up and running again.
Francisco Alberini

How to Achieve Reliable Data in your Airflow Pipelines with Databand

Have data quality issues? What about reliability problems? You may be hearing a lot of these terms, along with many others, that describe issues you face with your data. What’s the difference, which are you suffering from, and how do you tackle both? Knowing that your Airflow DAGs are green is not enough. It’s time to focus on data reliability and quality measurements to build trust in your data platform.
Josh Benamram

Future of the Airflow UI

Sneak peek at the future of the Airflow UI. In Airflow 2.3, with the Tree -> Grid view changes, we began to swap out parts of the Flask app with React. This was one step towards AIP-38, building a fully modern UI for Airflow. Come check out what is in store after the Grid view in the current UI, and discuss the possibilities of rethinking Airflow with a brand new UI down the line.
Brent Bovenzi

Leveraging Open Source Projects For Personal Development

Have you ever wondered what is next after learning the basics of software development, how you can improve your programming skills and gain more experience? These questions trouble a lot of people new to software development. They are not aware that they can leverage open-source projects to build their careers and land their dream job. In this session, I will share how you can leverage open-source projects to improve your skills, the challenges you would likely encounter, and how to overcome them and become a successful software engineer.
Ephraim Anierobi

All About Deferrables

Airflow 2.2 introduced Deferrable Tasks (sometimes called “async operators”), a new mechanism to efficiently run tasks that depend on external activity. But when should you use them, how do they work, what do you need to do to make Operators support them, and what else could we do in Airflow with this model?
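As a quick illustration of the mechanics, here is a minimal sketch of an operator deferring itself to the triggerer using one of Airflow's built-in triggers; the class name and wait period are illustrative.

```python
from datetime import timedelta
from airflow.models.baseoperator import BaseOperator
from airflow.triggers.temporal import TimeDeltaTrigger

class WaitThenContinue(BaseOperator):
    def execute(self, context):
        # Suspend the task and free its worker slot; the triggerer process
        # now waits asynchronously for the trigger to fire.
        self.defer(
            trigger=TimeDeltaTrigger(timedelta(hours=1)),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Resumed on a worker once the trigger fires; the task finishes here.
        return "done waiting"
```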
Andrew Godwin

git push your data stack with Airbyte, Airflow and dbt

The use of version control and continuous deployment in a data pipeline is one of the biggest features unlocked by the modern data stack. In this talk, I’ll demonstrate how to use Airbyte to pull data into your data warehouse, dbt to generate insights from your data, and Airflow to orchestrate every step of the pipeline. The complete project will be managed by version control and continuously deployed by GitHub. This talk will share how to achieve a more secure, scalable, and manageable workflow for your data projects.
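A simplified sketch of what the orchestration step can look like with the Airbyte provider and a dbt invocation; the DAG name, connection ids, Airbyte connection id, and dbt project path are all assumptions.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG("elt_example", start_date=datetime(2022, 1, 1), schedule_interval="@daily", catchup=False):
    extract = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_default",                      # assumed Airflow connection
        connection_id="11111111-2222-3333-4444-555555555555",   # hypothetical Airbyte connection id
    )
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt",          # assumed dbt project location
    )
    extract >> transform
```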
Evan Tahler & Marcos Marx

Managing Apache Airflow at Scale

In this session we’ll be discussing the considerations and challenges when running Apache Airflow at scale. We’ll start by defining what it means to run Airflow at scale. Then we’ll dive deep into understanding the limitations of the Airflow architecture, Scheduler processes, and configuration options. We’ll then cover scaling workloads via containers and leveraging pools and priority, followed by scaling DAGs via dynamic DAGs/DAG factories, CI/CD, and DAG access control. Finally we’ll get into managing multiple Airflow environments, how to split up workloads, and how to provide central governance for Airflow environment creation and monitoring, with an example of distributing workloads across environments.
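For context, "dynamic DAGs / DAG factories" here means generating many similar DAGs from one definition; a minimal sketch, with the team-to-schedule mapping standing in for whatever config source you actually use.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

TEAM_SCHEDULES = {"ads": "0 2 * * *", "billing": "0 4 * * *"}  # hypothetical config source

for team, schedule in TEAM_SCHEDULES.items():
    with DAG(
        dag_id=f"{team}_daily_load",
        start_date=datetime(2022, 1, 1),
        schedule_interval=schedule,
        catchup=False,
    ) as dag:
        BashOperator(task_id="load", bash_command=f"echo loading {team}")
    # Expose each generated DAG at module level so the scheduler discovers it.
    globals()[dag.dag_id] = dag
```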
John Jackson

Wisdoms learnt when contributing to Apache Airflow

In this talk, I am going to share things that I learned while contributing to Apache Airflow. I am an Outreachy intern for Apache Airflow, and I made my first contribution to open source in the Apache Airflow project. I will also share a little about myself, my experience working in software engineering, how I needed help contributing to open source, and how I ended up as an Outreachy intern.
Bowrna Prabhakaran

Love for writing deferrable operators. Why and how to defer?

Have you faced a scenario where 100 worker slots are available to run the Tasks, but you have 100 DAGs waiting on a Sensor that’s currently running but idle, waiting for something to happen? Ultimately, you got frustrated as you could not run anything else - even though your entire Airflow cluster was essentially idle. This is exactly where the concept of Deferrable Operators is very useful. This talk aims to give a brief introduction to solving this problem using Deferrable or Async Operators and how to implement it for your use case.
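In many cases adopting deferrables is simply a matter of swapping a sensor for its async counterpart; a small sketch (both tasks assume a surrounding DAG definition, and the target_time template is illustrative):

```python
from airflow.sensors.date_time import DateTimeSensor, DateTimeSensorAsync

# Inside a DAG definition:
# Classic sensor: occupies a worker slot for the whole wait.
wait = DateTimeSensor(task_id="wait", target_time="{{ data_interval_end }}")

# Deferrable counterpart: hands the wait to the triggerer and frees the slot.
wait_async = DateTimeSensorAsync(task_id="wait_async", target_time="{{ data_interval_end }}")
```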
Ankit Chaurasia

Automatic Speech Recognition at Scale Using Tensorflow, Kubernetes and Airflow

Automatic Speech Recognition is quite a compute intensive task, which depends on complex Deep Learning models. To do this at scale, we leveraged the power of Tensorflow, Kubernetes and Airflow. In this session, you will learn about our journey to tackle this problem, main challenges, and how Airflow made it possible to create a solution that is powerful, yet simple and flexible.
Rafael Pierre

Multitenancy is coming

This session is about the state and future plans of the multi-tenancy feature of Airflow. Airflow has traditionally been a single-tenant product. Multiple instances could be bound together to provide a multi-tenant implementation, and when using modern infrastructure - Kubernetes - you could even reuse resources between them, but it was not a true “multi-tenant” solution. But Airflow is now becoming more of a platform, and many users expect multi-tenancy as a feature of that platform.
Jarek Potiuk & Mateusz Henc

Airflow in the Cloud: Lessons from the Field

Airflow users love to run Airflow in public clouds and on distributed infrastructures like Kubernetes. Running Airflow environments is easier than ever: the community offers a Helm-based installation for self-managed Airflow, and there are many Airflow-based managed services. The commoditization of Airflow and a broader Airflow user base bring new challenges. This talk presents observations of an Airflow service provider delivering “Airflow as a Service” to cloud users (very technical, less technical, and not technical at all).
Rafal Biegacz & Filip Knapik

OpenLineage & Airflow - data lineage has never been easier

OpenLineage is an open standard for metadata and lineage collection designed to instrument jobs as they are running. The standard has become remarkably adept at understanding the lifecycle of data within an organization. Additionally, Airflow lets you make use of OpenLineage with a convenient integration. Gathering data lineage has never been easier. In this talk, we’ll provide an up-to-date report on OpenLineage features and the Airflow integration – essential information for data governance architects and engineers.
Maciej Obuchowski & Pawel Leszczynski

How DAG Became a Test - Airflow System Tests Redefined

Nothing is perfect, but it doesn’t mean we shouldn’t seek perfection. After some time spent with Airflow system tests, we recognized numerous places where we could make significant improvements, so we decided to redesign them. The new design started with the establishment of goals. Tests need to: be easy to write, read, run, and maintain; be as close as possible to how Airflow runs in practice; be fast, reliable, and verifiable; and assure the high quality of Airflow operators.
Bartłomiej Hirsz, Eugene Kosteev & Mateusz Nojek

Running 150+ production Airflow instances on Kubernetes: is that HARD?

This talk will cover the challenges we face managing a large number of Airflow instances in a private environment: monitoring and metrics layers for the production environment; collecting and customizing logs; resource consumption and green IT; providing support for users and shared responsibility; and pain points.
Alaeddine Maaoui & Prekshi Vyas

On-Demand DAG through the REST API

In this talk we want to present how Airbnb extends the REST API to support on-demand workloads. A DAG object is created from a local environment like a Jupyter notebook, serialized into binary, and transported to the API. The API persists the DAG object into the metadata DB, and the Airflow scheduler and worker are extended to process this new kind of DAG.
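The extension described here is Airbnb-internal; for reference, the standard stable REST API already supports creating runs of an existing DAG on demand. A minimal sketch, where the host, credentials, dag_id, and conf payload are all placeholders:

```python
import requests

response = requests.post(
    "http://localhost:8080/api/v1/dags/my_dag/dagRuns",   # placeholder host and dag_id
    auth=("admin", "admin"),                               # placeholder credentials
    json={"conf": {"run_date": "2022-05-23"}},
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```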
Mocheng Guo

Airflow at Pinterest

Pinterest has been part of the Airflow community for two years and has worked on many custom solutions to address usability, scalability, and efficiency constraints. In this session we discuss how Pinterest has further expanded on those solutions: reducing system latencies, improving the user development experience through added search features, supporting cross-cluster operations, improving debuggability tooling, and adding system-level efficiency improvements that automatically retry failed tasks that meet certain criteria.
Ace Haidrey, Yulei Li & Dinghang Yu

Using the Fivetran Airflow Provider

Fivetran’s Airflow provider allows Recharge to manage our connector syncs alongside our other DAGs orchestrating related components of our core data pipelines. The provider has enabled increased flexibility on sync schedules, custom alerting, and quicker response times to failures.
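A rough sketch of how the community Fivetran provider is typically wired into a DAG; import paths and parameters may differ between provider versions, and the connector id below is hypothetical.

```python
from fivetran_provider.operators.fivetran import FivetranOperator
from fivetran_provider.sensors.fivetran import FivetranSensor

# Inside a DAG definition:
start_sync = FivetranOperator(
    task_id="start_fivetran_sync",
    fivetran_conn_id="fivetran_default",
    connector_id="my_connector_id",       # hypothetical connector id
)
wait_for_sync = FivetranSensor(
    task_id="wait_for_fivetran_sync",
    fivetran_conn_id="fivetran_default",
    connector_id="my_connector_id",
    poke_interval=60,
)
start_sync >> wait_for_sync
```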
Annie Kaufman & Spencer Weeks

The tale of a startup's data journey and its growing need for orchestration

This talk tells the story of how we have approached data and analytics as a startup at Preset and how the need for a data orchestrator grew over time. Our stack is (loosely) Fivetran/Segment/dbt/BigQuery/Hightouch, and we finally got to a place where we suffer quite a bit from not having an orchestrator and are bringing in Airflow to address our orchestration needs. This talk is about how startups approach solving data challenges, the shifting role of the orchestrator in the modern data stack, and the growing need for an orchestrator as your data platform becomes more complex.
Maxime Beauchemin

Introducing Astro Python SDK: The next generation of DAG authoring

Imagine if you could chain together SQL models using nothing but Python, and write functions that treat Snowflake tables like dataframes and dataframes like SQL tables. Imagine if you could write a SQL Airflow DAG using only Python, or without using any Python at all. With the Astro SDK, we at Astronomer have gone back to the drawing board on fundamental questions of what DAG writing could look like. Our goal is to empower data engineers, data scientists, and even business analysts to write Airflow DAGs with code that reflects the data movement, instead of the system configuration.
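To give a flavor of the authoring style, here is a sketch loosely based on early Astro SDK examples; exact imports, decorators, and table names vary by SDK version, so treat this as illustrative only.

```python
from astro import sql as aql
from astro.sql.table import Table

@aql.transform
def top_customers(orders: Table):
    # The function body is SQL; the decorated task materializes the result as a table.
    return "SELECT customer_id, SUM(amount) AS total FROM {{ orders }} GROUP BY customer_id ORDER BY total DESC LIMIT 100"

@aql.dataframe
def summarize(df):
    # The same pipeline can hand a table to Python as a pandas DataFrame.
    return df.describe()
```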
Daniel Imberman

Automating Airflow Backfills with Marquez

As a data engineer, backfilling data is an important part of your day-to-day work. But, backfilling interdependent DAGs is time-consuming and often associated with an unpleasant experience. For example, let’s say you were tasked with backfilling a few months worth of data. You’re given the start and end date for the backfill that will be used to run an ad-hoc backfilling script that you have painstakingly crafted locally on your machine.
Willy Lulciuc

Airflow & Zeppelin: Better together

Airflow is almost the de facto standard job orchestration tool for the production stage. But moving a job from the development stage in other tools to the production stage in Airflow is usually a big pain for lots of users. A major reason is the inconsistency between the development environment and the production environment. Apache Zeppelin is a web-based notebook that integrates seamlessly with many popular big data engines, such as Spark, Flink, Hive, and Presto.
Jeff Zhang

Airflow / Kubernetes: Running on and using k8s

Apache Airflow and Kubernetes work well together. Not only does Airflow have native support for running tasks on Kubernetes, there is also an official Helm chart that makes it easy to run Airflow itself on Kubernetes! Confused about the differences between KubernetesExecutor and KubernetesPodOperator? What about CeleryKubernetesExecutor? Or the new LocalKubernetesExecutor? After this talk you will understand how they all fit into the ecosystem. We will talk about the ways you can run Airflow on Kubernetes, run tasks on Kubernetes, or do both.
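For orientation before the talk: the executors decide where Airflow's own task processes run, while the KubernetesPodOperator runs a single task in its own pod regardless of executor. A minimal sketch of the latter, with a placeholder namespace and image:

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

# Inside a DAG definition:
run_in_pod = KubernetesPodOperator(
    task_id="run_in_pod",
    name="hello-pod",
    namespace="default",                  # placeholder namespace
    image="python:3.9-slim",              # placeholder image
    cmds=["python", "-c", "print('hello from a pod')"],
)
```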
Jed Cunningham

Skip tasks to make your debugging easy

At Apple, we are building a self-serve data platform based on Airflow. Self-serve means users can create, deploy, and run their DAGs freely. With the provided logs and metrics, users are able to test or troubleshoot DAGs on their own. A common use case today is that users want to test just one or a few tasks in their DAG. However, when they trigger the DAG, all tasks run, not just the ones they are interested in.
Howie Wang

The SLAyer your Data Pipeline Needs

Airflow has a built-in SLA alert mechanism: when the scheduler sees an SLA miss for a task, it sends an alert by email. The problem is that, while this email is nice, we can’t really know when each task eventually succeeds. Moreover, even if there is such an email upon success following an SLA miss, it does not give us a good view of the current status at any given time.
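For context, the built-in mechanism amounts to setting an sla on a task and optionally an sla_miss_callback on the DAG; a minimal sketch, with the DAG name, schedule, and threshold chosen only for illustration.

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Hook point for routing misses somewhere other than email.
    print(f"SLA missed for: {task_list}")

with DAG(
    dag_id="sla_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    sla_miss_callback=notify_sla_miss,
) as dag:
    BashOperator(
        task_id="load",
        bash_command="sleep 10",
        sla=timedelta(minutes=30),  # alert if not done 30 minutes after the data interval ends
    )
```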
Eden Gluska

Managing Multiple ML Models For Multiple Clients : Steps For Scaling Up

For most ML-based SaaS companies, the need to fulfill each customer’s KPI will usually be addressed by matching a dedicated model. Along with the benefits of optimizing the model’s performance, a model per customer solution carries a heavy production complexity with it. In this manner, incorporating up-to-date data as well as new features and capabilities as part of a model’s retraining process can become a major production bottleneck. In this talk, we will see how Riskified scaled up modeling operations based on MLOps ideas, and focus on how we used Airflow as our ML pipeline orchestrator.
Ori Peri

Lets use Airflow differently: let's talk load tests

Numeric results with bulletproof confidence: this is what companies actually sell when promoting their machine learning product. Yet this seems out of reach when the product is both generic and complex, with much of the inner calculation hidden from the end user. So how can code improvements or changes in core component performance be tested at scale? Implementing API and load tests is time-consuming but thorough: defining parameters, building infrastructure, and debugging.
Doron Cohen

How to Deploy Airflow From Dev to Prod Like A BOSS

Managing Airflow in large-scale environments is tough. You know this, and I know this. But what if you had a guide to make development, testing, and production lifecycles more manageable? In this presentation, I will share how we manage Airflow for large-scale environments with friendly deployments at every step. After attending the session, Airflow engineers will: understand the advantages of each kind of deployment; know the differences between deployments and Airflow executors; and know how to incorporate all kinds of deployments into their day-to-day needs.
Evgeny Shulman

Well-Architected Workflows in Apache Airflow

Resilient systems have the capability to recover when stressed by load, bugs in the workflow, and failure of any task. Reliability of the infrastructure or platform is not sufficient to run workflows reliably. It is critical to bring in resiliency practices during the design and build phase of the workflow to improve its reliability, performance, and operational aspects. In this session, we will go through: the architecture of Airflow through the lens of reliability; idempotency; designing for failures; applying back pressure; and best practices. What we do not cover: infrastructure/platform/product reliability.
Uma Ramadoss

Modern Data Orchestration managed by Astronomer

At Astronomer we have been longtime supporters of and contributors to open source Apache Airflow. In this session we will present Astronomer’s latest journey, Astro, our cloud-native managed service that simplifies data orchestration and reduces operational overhead. We will also discuss the increasing importance of data orchestration in modern enterprise data platforms, industry trends, and practical problems that arise in ever-expanding heterogeneous environments.
Navid Aghdaie

Airflow extensions for governing a self-serviced data mesh

While many companies set up isolated data teams, Adyen is a strong believer in the data mesh approach, with all our data living in a central place. While our tooling teams provide and operate the on-premise cluster, the product teams are able to take full ownership of their data pipelines. Our 100+ users, spread across 10+ teams, own more than 200 DAGs and 4,000 tasks in total. We use a single Airflow instance with many cross-DAG and cross-stream dependencies within these 200 DAGs.
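One common way such cross-DAG dependencies are expressed in vanilla Airflow is with ExternalTaskSensor (whether this matches Adyen's extensions is not stated in the abstract; the DAG and task names below are hypothetical):

```python
from airflow.sensors.external_task import ExternalTaskSensor

# Inside a downstream DAG definition:
wait_for_upstream = ExternalTaskSensor(
    task_id="wait_for_payments_export",
    external_dag_id="payments_export",   # hypothetical upstream DAG
    external_task_id="publish",          # hypothetical upstream task
    mode="reschedule",                   # free the worker slot between checks
)
```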
Jorrick Sleijster

Data Science Platform at PlayStation and Apache Airflow

In this talk, we explain how Apache Airflow is at the center of our Kubernetes-based Data Science Platform at PlayStation. We talk about how we built a flexible development environment for data scientists to interact with Apache Airflow, and explain the tools and processes we built to help data scientists promote their DAGs from development to production. We will also talk about the impact of containerization, the usage of the KubernetesOperator and the new SparkKubernetesOperator, and the benefits of deploying Airflow on Kubernetes using the KubernetesExecutor across multiple environments.
Hamed Saljooghinejad & Siraj Malik

What's New with Amazon Managed Workflows for Apache Airflow (MWAA)

In this session we will discuss the latest features of Amazon Managed Workflows for Apache Airflow (MWAA) as well as some tips and tricks to get the most out of the service. We’ll also discuss the AWS commitment to the Apache Airflow project and what we’re doing to stay connected and contribute to the community.
John Jackson

Implementing Event-Based DAGs with Airflow

Needing to trigger DAGs based on external criteria is a common use case for data engineers, data scientists, and data analysts. Most Airflow users are probably aware of the concept of sensors and how they can be used to run your DAGs off of a standard schedule, but sensors are only one of multiple methods available to implement event-based DAGs. In this session, we’ll discuss different ways of implementing event-based DAGs using Airflow 2 features like the API and deferrable operators, with a focus on how to determine which method is the most efficient, scalable, and cost-friendly for your use case.
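As a baseline for the comparison made in the talk, the sensor-based approach looks like the sketch below; the file path and connection id are placeholders, and the alternatives discussed are deferrable operators and triggering runs externally through the REST API.

```python
from airflow.sensors.filesystem import FileSensor

# Inside a DAG definition:
wait_for_drop = FileSensor(
    task_id="wait_for_file_drop",
    filepath="/data/incoming/events.csv",   # placeholder path
    fs_conn_id="fs_default",                # placeholder connection
    poke_interval=300,
    mode="reschedule",                      # release the worker slot between pokes
)
```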
Kenten Danas

Kyte: Scalable and Isolated DAG Development Experience at Lyft

Developer velocity starts to become an issue as your user base grows and becomes more varied. This is compounded by the fact that it’s not easy to end-to-end test data pipelines as part of continuous integration. In this talk, we’ll go over what we’ve done at Lyft to build an effective development and testing environment, serving over 1,000 users who have made over 5,000 DAGs, at a rate of about 50 developers per week.
Max Payton & Paul Dittamo

Data Lineage with Apache Airflow and Apache Spark

Data within today’s organizations has become increasingly distributed and heterogeneous. It can’t be contained within a single brain, a single team, or a single platform…but it still needs to be comprehensible, especially when something unexpected happens. Data lineage can help by tracing the relationships between datasets and providing a cohesive graph that places them in context. OpenLineage provides a standard for lineage collection that spans multiple platforms, including Apache Airflow and Apache Spark.
Michael Collado

What's new in Airflow 2.3?

This session will talk about the awesome new features the community has built for Airflow 2.3. Highlights: Dynamic Task Mapping; DB downgrades; pruning old DB records; connections using JSON; and UI improvements.
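For example, the new JSON connection format can be supplied through an environment variable; the connection name and values below are made up.

```python
import json
import os

# Airflow 2.3 accepts connections defined as JSON, e.g. via AIRFLOW_CONN_* variables.
os.environ["AIRFLOW_CONN_WAREHOUSE"] = json.dumps({
    "conn_type": "postgres",
    "host": "warehouse.internal",   # hypothetical host
    "login": "analytics",
    "password": "change-me",
    "schema": "reporting",
    "port": 5432,
})
```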
Kaxil Naik

Workshop: Contributing to Apache Airflow

Learn how to set up a development environment, how to pick your first issue, how to communicate effectively within the community, and how to make your first PR.
Jarek Potiuk & Elad Kalif

Workshop: Running Airflow within Cloud Composer

A hands-on workshop showing how easy it is to deploy Airflow in a public cloud. This workshop is mostly targeted at Airflow newbies and users who would like to learn more about Cloud Composer.
Rafal Biegacz, Leah Cole, Bartosz Jankiewicz, Przemek Więch & Filip Knapik

Airflow and _____: A discussion around utilizing Airflow with other data tools

Come hang with Airflow practitioners from around the world using Airflow AND other data tools to power their data practice. From Databricks to Glue to Azure Data Factory, smart businesses make the right decision to standardize on Airflow for what it’s best at while using the other systems for what they are best at.
Brad Kirn, Sarah Johnson, Jitendra Shah & Alessandro Pregnolato

Beyond Testing: How to Build Circuit Breakers with Airflow

Testing is an important part of the DataOps life cycle, giving teams confidence in the integrity of their data as it moves downstream to production systems. But what happens when testing doesn’t catch all of your bad data and “unknown unknown” data quality issues fall through the cracks? Fortunately, data engineers can apply a thing or two from DevOps best practices to tackle data quality at scale with circuit breakers, a novel approach to stopping bad data from actually entering your pipelines in the first place.
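In Airflow terms, a simple circuit breaker can be a ShortCircuitOperator placed upstream of the publishing step; a sketch with a made-up volume check standing in for whatever quality signal you use.

```python
from airflow.operators.python import ShortCircuitOperator

def data_volume_ok() -> bool:
    # Placeholder check: in practice this would query the warehouse or a
    # data quality tool and return False to stop downstream tasks.
    row_count = 1_250_000
    return row_count > 1_000_000

# Inside a DAG definition, placed upstream of the publish tasks:
circuit_breaker = ShortCircuitOperator(
    task_id="circuit_breaker",
    python_callable=data_volume_ok,
)
```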
Prateek Chawla

TFX on Airflow with delegation of processing to third party services

Learn how to externalize any TFX heavyweight computing outside Airflow, while maintaining Airflow as the orchestrator for your machine learning pipelines.
Israel Herraiz & Paul Balm

Ingesting Game Telemetry in near Real-time dynamically into Redshift with Airflow (WB Games)

We, the Data Engineering team here at WB Games, implemented internal Redshift Loader DAGs on Airflow that allow us to ingest data into Redshift in near real-time at scale, taking into account variable load on the DB and allowing us to quickly catch up data loads in case of various DB outages or high-usage scenarios. Highlights: handling any type of Redshift outage and system delay dynamically between multiple sources (S3) and sinks (Redshift).
Karthik Kadiyam

An Introduction to Data Lineage with Airflow and Marquez

Learn how to collect and visualize lineage from a basic Airflow pipeline using Marquez. You will need to understand the basics of Airflow, but no experience with lineage is required.
Michael Robinson & Ross Turk

Hey maintainer (and user), exercise your empathy!

This talk is a walk through a number of ways maintainers of open-source projects (for example Airflow) can improve communication with their users by exercising empathy. This subject is often overlooked in the curriculum of the average developer and contributor, but it can make or break the product you develop, simply because empathy makes the product more approachable for users. Maintainers often forget, or simply do not realize, how many assumptions they carry in their heads.
Jarek Potiuk