These are the sessions that were presented at Airflow Summit 2021. For previous editions check the archive.


Contributing to Apache Airflow | Journey to becoming Airflow's leading contributor

From not knowing Python (let alone Airflow) and submitting a first PR that fixed a typo, to becoming an Airflow Committer, PMC Member, Release Manager, and the project’s #1 committer this year, this talk walks through Kaxil’s journey in the Airflow world. The second part of the talk explains how you can start your own OSS journey by contributing to Airflow, expand your familiarity with different parts of the Airflow codebase, and keep committing regularly and steadily to become an Airflow Committer.
Kaxil Naik

Contributing to Apache Airflow: First Steps

Learn to contribute to the Apache Airflow ecosystem both with and without code. Post an article to the Airflow blog, improve documentation, or dive head-first into Airflow’s free and open source software community.
Ryan Hatter

Discussion panel: Keep your Airflow secure

You might have heard recent news about ransomware attacks on many companies. Quite recently, the U.S. Department of Justice elevated the priority of ransomware investigations to the same level as terrorism. Security aspects of running software, and so-called “supply-chain attacks”, have certainly made the press lately. You might also have read about the security researcher who made USD 13,000 in bounties by finding and contacting companies that were running old, unpatched versions of Airflow, even though the ASF security process worked well and the Airflow PMC had fixed those issues long ago.
Tomasz Urbaszek, Ash Berlin-Taylor, Jarek Potiuk & Dolev Farhi

You don’t have to wait for someone to fix it for you

Rachael, a new Airflow contributor, and Leah, an experienced Airflow contributor, share the story of Rachael’s first contribution, highlighting the importance of contributions from new users and the positive impact that non-code contributions have in an open source community.
Rachael Deacon-Smith & Leah Cole

Workshop: Contributing to Apache Airflow

Learn how to become a code contributor to the Apache Airflow project.
Jarek Potiuk & Tomasz Urbaszek

Airflow as the Foundation of a Multi-Faceted Data Platform

A discussion with Jay Sen, Data Platform Architect at PayPal, and Ry Walker, Founder/CTO of Astronomer, about the central role Airflow plays within PayPal’s data platform and the opportunity to build stronger integrations between Airflow and the other tools that surround it.
Ry Walker & Jay Sen

Apache Airflow at Apple - Multi-tenant Airflow and Custom Operators

Running a platform where different business units at Apple can run their workloads in isolation and share operators.
Roberto Santamaria & Howie Wang

Lessons Learned while Migrating Data Pipelines from Enterprise Schedulers to Airflow

Digital transformation, application modernization, and data platform migration to the cloud are key initiatives in most enterprises today. These initiatives are stressing the scheduling and automation tools in these enterprises to the point that many users are looking for better solutions. A survey revealed that 88% of users believe that their business will benefit from an improved automation strategy across technology and business. Airflow has an excellent opportunity to capture mindshare and emerge as the leading solution here.
Shivnath Babu & Hari Nair

Airflow Journey @SG

This talk covers the adoption journey (technical challenges and team organization) of Apache Airflow (1.8 to 2.0) at Societe Generale. Timeline of events: a POC with v1.8 to convince our management; shared infrastructure with v1.10.2; multiple infrastructures with v1.10.12; an on-demand service offering with v2.0 (challenges & REX).
Ahmed Chakir & Alaeddine Maaoui

Building an Elastic Platform Using Airflow Uniquely as an Orchestrator

At QuintoAndar we seek automation and scalability in our data pipelines and believe that Airflow is the right tool for giving us exactly what we need. However, having all concerns mapped and tooling defined doesn’t necessarily mean success. For months we struggled with the misconception that Airflow should act as both orchestrator and executor within a monolithic strategy. That approach gave rise to scalability and performance issues, infrastructure and maintainability costs, and multi-directional impact across development teams.
Rafael Ribaldo & Lucas Fonseca

Data Pipeline HealthCheck for Correctness, Performance, and Cost Efficiency

We are witnessing rapid growth in the number of mission-critical data pipelines that leaders of data products are responsible for. “Are your data pipelines healthy?” This question was posed to more than 200 leaders of data products from various industries. The answers ranged from “unfortunately, no” to “they are mostly fine, but I am always afraid that something or other will cause a pipeline to break”. This talk presents the concept of Pipeline HealthCheck (PHC), which enables leaders of data products to have high confidence in the correctness, performance, and cost efficiency of their data pipelines.
Shivnath Babu

Looking ahead: What comes after Airflow 2.0?

Aizhamal Nurmamat kyzy & Ash Berlin-Taylor

Airflow: The Power of Stitching Services Together

Apache Airflow is known to be a great orchestration tool that enables use cases that would not be possible otherwise. One of Airflow’s great features is the ability to “glue” together totally separate services to build bigger functionalities. In this talk you will learn about various Airflow use cases that let users automate their critical company processes and even build businesses. The examples provided are based on Airflow used in the context of Cloud Composer, a managed service for provisioning and managing Airflow instances.
Rafal Biegacz & Filip Knapik

Pinterest’s Migration Journey

Last year, we shared why we selected Airflow as our next-generation workflow system. This year, we will dive into the journey of migrating 3,000+ workflows and 45,000+ tasks to Airflow. We will discuss the infrastructure additions needed to support such loads, the partitioning and prioritization of different workflow tiers defined in house, the migration tooling we built to onboard users, the translation layers between our old DSLs and the new one, our internal Kubernetes executor that leverages Pinterest’s Kubernetes fleet, and more.
Ace Haidrey, Euccas Chen, Yulei Li, Dinghang Yu & Ashim Shrestha

The Newcomer's Guide to Airflow's Architecture

Airflow has a lot of moving parts, and it can be a little overwhelming as a new user - as I was not too long ago. Join me as we go through Airflow’s architecture at a high level, explore how DAGs work and run, and look at some of the good, the bad, and the unexpected things lurking inside.
Andrew Godwin

Dataclasses as Pipeline Definitions in Airflow

We will describe how we were able to build a system in Airflow for MySQL to Redshift ETL pipelines defined in pure Python using dataclasses. These dataclasses are then used to dynamically generate DAGs depending on pipeline type. This setup allows us to implement robust testing, validation, alerts, and documentation for our pipelines. We will also describe the performance improvements we achieved by upgrading to Airflow 2.0.
Madison Swain-Bowden
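
A minimal sketch of the pattern the abstract describes, assuming hypothetical class, field, and table names rather than the speaker's actual code:

```python
from dataclasses import dataclass
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


@dataclass
class EtlPipeline:
    """Illustrative pipeline definition: plain data that is easy to test and validate."""
    source_table: str
    target_table: str
    schedule: str = "@daily"


PIPELINES = [
    EtlPipeline("mysql.users", "redshift.users"),
    EtlPipeline("mysql.orders", "redshift.orders", schedule="@hourly"),
]

for pipeline in PIPELINES:
    dag_id = f"etl_{pipeline.target_table.replace('.', '_')}"
    with DAG(dag_id, start_date=datetime(2021, 1, 1), schedule_interval=pipeline.schedule) as dag:
        PythonOperator(
            task_id="copy",
            python_callable=lambda p=pipeline: print(f"copy {p.source_table} -> {p.target_table}"),
        )
    # Expose each generated DAG at module level so the scheduler discovers it.
    globals()[dag_id] = dag
```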

Creating Data Pipelines with Elyra, a visual DAG composer and Apache Airflow

This presentation will detail how Elyra creates Jupyter Notebook, Python, and R script-based pipelines without having to leave your web browser. The goal of Elyra is to help construct data pipelines by surfacing concepts and patterns common in pipeline construction into a familiar, easy-to-navigate interface, so that Data Scientists and Engineers can create pipelines on their own. In Elyra’s Pipeline Editor UI, portions of Apache Airflow’s domain language are surfaced to the user and made transparent or understandable through tooltips and helpful notes in the proper context during pipeline construction.
Alan Chin

Apache Airflow and Ray: Orchestrating ML at Scale

As the Apache Airflow project grows, we seek both ways to incorporate rising technologies and novel ways to expose them to our users. Ray is one of the fastest-growing distributed computation systems on the market today. In this talk, we will introduce the Ray decorator and Ray backend. These features, built with the help of the Ray maintainers at Anyscale, will allow Data Scientists to natively integrate their distributed pandas, XGBoost, and TensorFlow jobs into their Airflow pipelines with a single decorator.
Daniel Imberman
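
A rough illustration of the decorator pattern described above; the import path, decorator name, and arguments below are assumptions, not the provider's actual API:

```python
from datetime import datetime

from airflow.decorators import dag

# Hypothetical import: the real airflow-provider-ray package may expose a
# different module path, decorator name, and signature.
from ray_provider.decorators import ray_task


@dag(schedule_interval="@daily", start_date=datetime(2021, 1, 1))
def ml_pipeline():
    @ray_task(ray_conn_id="ray_cluster")  # the body would run on the Ray cluster
    def train_model(data_path: str) -> str:
        # Distributed pandas/XGBoost/TensorFlow work would happen here.
        return f"model trained on {data_path}"

    train_model("s3://bucket/features.parquet")


ml_dag = ml_pipeline()
```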

Event-based Scheduling Based on Airflow

The Airflow scheduler uses DAG definitions to monitor the state of tasks in the metadata database and triggers the task instances whose dependencies have been met; in other words, scheduling is based on the state of dependencies. The idea of event-based scheduling is to let operators send events to the scheduler that trigger a scheduling action, such as starting, stopping, or restarting jobs. Event-based scheduling potentially allows richer scheduling semantics, such as periodic execution and manual triggering at per-operator granularity.
Wuchao Chen

Provision as a Service: Automating data center operations with Airflow at Cloudflare

Cloudflare’s network keeps growing, and that growth doesn’t just come from building new data centers in new cities. We’re also upgrading the capacity of existing data centers by adding newer generations of servers, a process that makes our network safer, faster, and more reliable for our users. In this talk, I’ll share how we’re leveraging Apache Airflow to build our own Provision-as-a-Service (PraaS) platform and cut the time our team spends on mundane operational tasks by 90%.
Jet Mariscal

Introducing Viewflow: a framework for writing data models without writing Airflow code

In this talk, we present Viewflow, an open-source Airflow-based framework that allows data scientists to create materialized views in SQL, R, and Python without writing Airflow code. We will start by explaining what problem Viewflow solves: writing and maintaining complex Airflow code instead of focusing on data science. Then we will see how Viewflow solves that problem. We will continue by showing how to use Viewflow with several real-world examples.
Gaëtan Podevijn

Create Your Custom Secrets Backend for Apache Airflow - A guided tour into Airflow codebase

This talk aims to share how Airflow’s secrets backend works, and how users can create their custom secret backends for their specific use cases & technology stack.
Xiaodong DENG
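
Concretely, a custom backend subclasses Airflow's BaseSecretsBackend and overrides its lookup methods. A minimal sketch, with the secret store stubbed out:

```python
from typing import Optional

from airflow.secrets import BaseSecretsBackend


def my_store_lookup(path: str) -> Optional[str]:
    """Stub standing in for your real secret-store client (hypothetical)."""
    return None


class MySecretsBackend(BaseSecretsBackend):
    # Enable via airflow.cfg:
    #   [secrets]
    #   backend = my_package.secrets.MySecretsBackend

    def get_conn_uri(self, conn_id: str) -> Optional[str]:
        # Returning None lets Airflow fall back to environment variables
        # and the metastore connection table.
        return my_store_lookup(f"connections/{conn_id}")

    def get_variable(self, key: str) -> Optional[str]:
        return my_store_lookup(f"variables/{key}")
```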

SciDAP: Airflow and CWL-powered bioinformatics platform

Reproducibility is a fundamental principle of scientific research. This also applies to the computational workflows used to process research data. The Common Workflow Language (CWL) is a highly formalized way to describe pipelines, developed to achieve reproducibility and portability of computational analysis. However, only a few workflow execution platforms can run CWL pipelines. Here, we present CWL-Airflow, an extension for Airflow to execute CWL pipelines.
Nick Luckey & Michael Kotliar

Robots are your friends - using automation to keep your Airflow operators up to date

A key part of my role at Google is maintaining samples for Cloud Composer, Google’s hosted, managed Airflow. It’s not feasible for me to try out every sample every day to check that it’s working. So, how do I do it? Automation! While I won’t let the robots touch everything, they let me know when it’s time to pay attention. Here’s how:
Step 0: An update for the operators is released.
Step 1: A GitHub bot called Renovate Bot opens a PR against a special requirements file to apply the update.
Step 2: Cloud Build runs unit tests to make sure none of my DAGs immediately break.
Step 3: The PR is approved and merged to main.
Step 4: Cloud Build updates my dev environment.
Step 5: I look at my DAGs in dev to make sure all is well.
Leah Cole
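
A minimal sketch of the kind of DAG sanity check Step 2 can run; the dag folder path is an assumption:

```python
from airflow.models import DagBag


def test_dags_import_without_errors():
    """Fail the build if a dependency bump breaks any DAG at parse time."""
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import failures: {dag_bag.import_errors}"
```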

Airflow and Analytics Engineering - Dos and don'ts

Since the role of Analytics Engineering has emerged within data and analytics teams only in the last few years, I want to highlight what an Analytics Engineer does and how, from my perspective, a set of dos and don’ts can help a team boost its day-to-day work with the help of Airflow.
Sergio Camilo Fandiño Hernández

Next-Gen Astronomer Cloud

Astronomer founders Ry Walker and Greg Neiheisel will preview the upcoming next-gen Astronomer Cloud product offering.
Ry Walker & Greg Neiheisel

Apache Airflow at Wise

Wise (previously TransferWise) is a London-based fintech company. We build a better way of sending money internationally. At Wise we make great use of Airflow: more than 100 data scientists, analysts and engineers use Airflow every day to generate reports, prepare data, (re)train machine learning models and monitor services. My name is Alexandra; I’m a Machine Learning Engineer at Wise. Our team is responsible for building and maintaining Wise’s Airflow instances.
Alexandra Abbas

Orchestrating ELT with Fivetran and Airflow

At Fivetran, we are seeing many organizations adopt the Modern Data Stack to suit the breadth of their data needs. However, as incoming data sources begin to scale, it can be hard to manage and maintain the environment, with more time spent repairing and re-engineering old data pipelines than building new ones. This talk will introduce a number of new Airflow providers, including airflow-provider-fivetran, and discuss some of the benefits and considerations we are seeing data engineers, data analysts, and data scientists experience in adopting them.
Nick Acosta
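
A hedged sketch of what orchestrating a Fivetran sync from a DAG can look like with this provider; the module paths and operator arguments should be checked against the installed provider version, and the connector id is a placeholder:

```python
from datetime import datetime

from airflow import DAG

# Module paths per the airflow-provider-fivetran package; verify against the
# version you install, as provider layouts change between releases.
from fivetran_provider.operators.fivetran import FivetranOperator
from fivetran_provider.sensors.fivetran import FivetranSensor

with DAG("fivetran_elt", start_date=datetime(2021, 1, 1), schedule_interval="@daily") as dag:
    # Kick off a Fivetran connector sync ("my_connector" is a placeholder id).
    sync = FivetranOperator(
        task_id="trigger_sync",
        fivetran_conn_id="fivetran_default",
        connector_id="my_connector",
    )
    # Wait for the sync to land before downstream transformations run.
    wait = FivetranSensor(
        task_id="wait_for_sync",
        fivetran_conn_id="fivetran_default",
        connector_id="my_connector",
        poke_interval=60,
    )
    sync >> wait
```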

Upgrading to Apache Airflow 2

Airflow 2.0 was a big milestone for the Airflow community. However, companies and enterprises still face difficulties upgrading to 2.0. In this talk I focus on the ideal upgrade path, covering: the upgrade_check CLI tool; the separation of providers; registering connection types; important Airflow 2.0 configs; DB migration; and deprecated features around Airflow plugins.
Kaxil Naik

Guaranteeing pipeline SLAs and data quality standards with Databand

We’ve all heard the phrase “data is the new oil.” But imagine a world where this analogy is more literal, where problems in the flow of data (delays, low quality, high volatility) could bring down whole economies. When data is the new oil, with people and businesses similarly reliant on it, how do you avoid the fires, spills, and crises? As data products become central to companies’ bottom lines, data engineering teams need to create higher standards for the availability, completeness, and fidelity of their data.
Josh Benamram & Vinoo Ganesh

Deep dive into the Airflow scheduler

The scheduler is the core of Airflow, and it’s a complex beast. In this session we will go through the scheduler in some detail: how it works, what the communication paths are, and what processing is done where.
Ash Berlin-Taylor

Running Big Data Applications in production with Airflow + Firebolt

In this talk we’ll see some real world examples from Firebolt customers demonstrating how Airflow is used to orchestrate operational data analytics applications with large data volumes, while keeping query latency low.
Boaz Farkash

Writing DRY Code in Airflow

Engineering teams leverage the factory coding pattern to write easy-to-read and repeatable code. In this talk, we’ll outline how data engineering teams can do the same with Airflow by separating DAG declarations from business logic, abstracting task declarations from task dependencies, and creating a code architecture that is simple to understand for new team members. This approach will set analytics teams up for success as team and Airflow DAG sizes grow exponentially.
Sarah Krasnik
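
A small sketch of the factory pattern the talk describes, keeping business logic in plain, unit-testable functions while a factory owns the DAG boilerplate (names are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(source: str) -> None:
    """Business logic lives apart from any DAG declaration."""
    print(f"extracting from {source}")


def build_dag(dag_id: str, source: str, schedule: str) -> DAG:
    """Factory owning the repeatable DAG/task declaration boilerplate."""
    with DAG(dag_id, start_date=datetime(2021, 1, 1), schedule_interval=schedule) as dag:
        PythonOperator(task_id="extract", python_callable=extract, op_args=[source])
    return dag


for name, source, schedule in [
    ("sales_daily", "salesforce", "@daily"),
    ("events_hourly", "segment", "@hourly"),
]:
    globals()[name] = build_dag(name, source, schedule)
```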

Building a robust data pipeline with the dAG stack: dbt, Airflow, Great Expectations

Data quality has become a much discussed topic in the fields of data engineering and data science, and it has become clear that data validation is absolutely crucial to ensuring the reliability of any data products and insights produced by an organization’s data pipelines. This session will outline patterns for combining three popular open source tools in the data ecosystem - dbt, Airflow, and Great Expectations - and use them to build a robust data pipeline with data validation at each critical step.
Sam Bail
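
A hedged sketch of such a pipeline, validating data before and after a dbt transformation by running Great Expectations checkpoints from the CLI; DAG, checkpoint, and project names are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("dag_stack", start_date=datetime(2021, 1, 1), schedule_interval="@daily") as dag:
    # Validate raw inputs before transforming anything.
    validate_source = BashOperator(
        task_id="validate_source",
        bash_command="great_expectations checkpoint run source_data",
    )
    # Run the dbt models once the inputs look sane.
    transform = BashOperator(task_id="dbt_run", bash_command="dbt run")
    # Validate the transformed tables before anyone consumes them.
    validate_output = BashOperator(
        task_id="validate_output",
        bash_command="great_expectations checkpoint run analytics_tables",
    )
    validate_source >> transform >> validate_output
```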

An On-Demand Airflow Service for Internet Scale Gameplay Pipelines

EA Games has very dynamic and federated needs for its data processing pipelines. Many individual studios within EA build and manage the data pipelines for their games, iterating rapidly through game development cycles. Developer productivity around orchestrating these pipelines is as critical as providing a robust, production-quality orchestration service. With this in mind, we re-engineered our Airflow service from the ground up to cater to our large internal user base (thousands of users) and internet-scale data processing systems (petabytes of data).
Nitish Victor & Yuanmeng Zeng

Airflow Extensions for Streamlined ETL Backfilling

Using Airflow as our scheduling framework, we ETL data generated by tens of millions of transactions every day to build the backbone for our reports, dashboards, and training data for our machine learning models. There are over 500 (and growing) such ingested and aggregated tables owned by multiple teams that contain intricate dependencies between one another. Given this level of complexity, it can become extremely cumbersome to coordinate backfills for any given table, when also taking into account all its downstream dependencies, aggregation intervals, and data availability.
Ravi Autar

The new modern data stack - Airbyte, Airflow, DBT

In this talk, I’ll describe how you can leverage three open-source standards - workflow management with Airflow, EL with Airbyte, and transformation with dbt - to build your next modern data stack. I’ll explain how to configure your Airflow DAG to trigger Airbyte’s data replication jobs and dbt’s transformations, with a concrete use case.
Michel Tricot
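
A minimal sketch of the pattern the talk describes, using the Airbyte provider's sync operator followed by a shell call to dbt; connection ids below are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG("airbyte_dbt", start_date=datetime(2021, 1, 1), schedule_interval="@daily") as dag:
    # "airbyte_conn" points at your Airbyte server; connection_id is the UUID
    # of the Airbyte sync to run (placeholder below).
    replicate = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_conn",
        connection_id="00000000-0000-0000-0000-000000000000",
    )
    transform = BashOperator(task_id="dbt_run", bash_command="dbt run")
    replicate >> transform
```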

Building the Data Science Platform with Airflow @Near

At Near we work on TBs of location data with close-to-real-time modelling to generate key consumer insights and estimates for our clients across the globe. We have hundreds of country-specific models deployed and managed through Airflow to achieve this goal. Some of the workflows we have deployed are schedule-based, some are dynamic, and some are trigger-based. In this session I will discuss some of the workflows that are scheduled and monitored using Airflow, the key benefits, and the challenges we have faced in our production systems.
Manmeet Kaur

Modernize a decade old pipeline with Airflow 2.0

As a follow-up to https://airflowsummit.org/sessions/teaching-old-dag-new-tricks/, in this talk we would like to share a happy-ending story of how Scribd fully migrated its data platform to the cloud and Airflow 2.0. We will talk about the data validation tools and task trigger customizations the team built to smooth the transition. We will share how we completed the Airflow 2.0 migration starting from an unsupported MySQL version, along with metrics to prove why everyone should perform the upgrade.
QP Hou, Kuntal Basu, Stas Bytsko & Dmitry Suvorov

Building the AirflowEventStream

Or: how to keep our traditional Java application up to date on everything big data. At Adyen we process tens of millions of transactions a day, a number that rises every day. This means that generating reports, training machine learning models, or any other operation that requires a bird’s-eye view over weeks or months of data requires big data technologies. We recently migrated to Airflow for scheduling all batch operations on our on-premise big data cluster.
Jelle Munk

Dynamic Security Roles in Airflow for Multi-Tenancy

Multi-tenant Airflow instances can help save costs for an organization. This talk will walk through how we dynamically assign roles to users based on groups in Active Directory, so that teams have access in the UI to the DAGs they created on our multi-tenant Airflow instance. To achieve this, we created our own custom AirflowSecurityManager class, which ultimately ties LDAP and RBAC together.
Mark Merling & Sean Lewis
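
A rough sketch of the approach in webserver_config.py; the group-lookup helper and the role-mapping method are illustrative, not the speakers' actual code, and the exact hooks to override depend on your Airflow/FAB versions:

```python
# webserver_config.py
from airflow.www.security import AirflowSecurityManager


def lookup_ad_groups(username: str) -> list:
    """Stub standing in for a real LDAP/Active Directory group query (hypothetical)."""
    return []


class ADGroupSecurityManager(AirflowSecurityManager):
    def roles_for_user(self, user):
        """Illustrative helper: map AD groups to same-named Airflow RBAC roles."""
        roles = [self.find_role(f"team_{group}") for group in lookup_ad_groups(user.username)]
        return [role for role in roles if role is not None]


# Airflow's webserver picks up the custom class from this module.
SECURITY_MANAGER_CLASS = ADGroupSecurityManager
```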

Apache Airflow 2.0 on Amazon MWAA

In this session we will discuss Amazon Managed Workflows for Apache Airflow (MWAA), how Apache Airflow (and specifically version 2.0) is implemented in the service, best practices for deployment and operations, and the Amazon MWAA team’s commitment to open source usage and contributions.
John Jackson & Sam Dengler

Clearing Airflow obstructions

Apache Airflow aims to speed up the development of workflows, but developers are always ready to add bugs here and there. This talk illustrates a few pitfalls faced while developing workflows at the BBC to build machine learning models. The objective is to share some lessons learned and, hopefully, save others time. Some of the topics covered, with code examples: tasks unsuitable to be run from within Airflow executors; plugin misuse; inconsistency while using an operator; (mis)configuration; what to avoid during a workflow deployment; and consequences of non-idempotent tasks.
Tatiana Al-Chueyr Martins

Building Providers & DAGs in the Airflow Ecosystem

Learn how to use Airflow’s robust ecosystem of providers to construct secure, high-quality DAGs.
Plinio Guzman

Reverse ETL on Airflow

At Snowflake, you can imagine we run a lot of data pipelines and tables curating metrics for all parts of the business. These are the lifeline of Snowflake’s business decisions. We also have a lot of source systems that display these metrics and make them accessible to end users. So what happens when your data model does not match your source system? For example, your bookings numbers in Salesforce do not match the data model that curates your bookings metrics.
Russell Dervay

Productionizing ML Pipelines with Airflow, Kedro, and Great Expectations

Machine Learning models can add value and insight to many projects, but they can be challenging to put into production due to problems like lack of reproducibility, difficulty maintaining integrations, and sneaky data quality issues. Kedro, a framework for creating reproducible, maintainable, and modular data science code, and Great Expectations, a framework for data validations, are two great open-source Python tools that can address some of these problems. Both integrate seamlessly with Airflow for flexible and powerful ML pipeline orchestration.
Kenten Danas

Data Lineage with Apache Airflow using OpenLineage

If you manage a lot of data, and you’re attending this summit, you likely rely on Apache Airflow to do a lot of the heavy lifting. Like any powerful tool, Apache Airflow allows you to accomplish what you couldn’t before… but also creates new challenges. As DAGs pile up, complexity layers on top of complexity and it becomes hard to grasp how a failed or delayed DAG will affect everything downstream.
Julien Le Dem & Willy Lulciuc

Drift Bio: The Future of Microbial Genomics with Apache Airflow

In recent years, the bioinformatics world has seen an explosion in genomic analysis as gene sequencing technologies have become exponentially cheaper. Tests that previously would have cost tens of thousands of dollars will soon run at pennies per sequence. This glut of data has exposed a notable bottleneck in the current suite of technologies available to bioinformaticians. At Drift Biotechnologies, we use Apache Airflow to transition traditionally on-premise large scale data and deep learning workflows for bioinformatics to the cloud, with an emphasis on workflows and data from next generation sequencing technologies.
Eli Scheele

Usability Improvements: Debugging & Inspection Tooling

The two most common user questions at Pinterest are: 1) why is my workflow running so long? 2) why did my workflow fail - is it my issue, or a platform issue? As with any big data organization, the workflow platform is just the orchestrator but the “real” work is done on another layer, managed by another platform. There can be plenty of these, and the challenges of figuring out the root cause of an issue can be mundane and time consuming.
Ace Haidrey, Euccas Chen, Yulei Li, Dinghang Yu & Ashim Shrestha

Autoscaling in Airflow - Lessons learned

Autoscaling in Airflow - what we learned from the Cloud Composer case. We would like to present how we approach the autoscaling problem for Airflow running on Kubernetes in Cloud Composer: how we calculate our autoscaling metric, what problems we had with scaling down, and how we solved them. We also share ideas on what we could improve in the current solution and how.
Mateusz Henc & Anita Fronczak

Building a Scalable & Isolated Architecture for Preprocessing Medical Records

After performing several experiments with Airflow, we reached the best architectural design for processing text medical records at scale. Our hybrid solution uses Kubernetes, Apache Airflow, Apache Livy, and Apache cTAKES. Using Kubernetes containers has the benefit of a consistent, portable, and isolated environment for each component of the pipeline. With Apache Livy, you can run tasks in a Spark cluster at scale. Additionally, Apache cTAKES helps extract information from the clinical free text of electronic medical records, using natural language processing techniques to identify codable entities, temporal events, properties, and relations.
Mikaela Pisani & Anthony Figueroa

Advanced Superset for Engineers (APIs, Version Controlled Dashboards, & more)

Apache Superset is a modern, open-source data exploration & visualization platform originally created by Maxime Beauchemin. In this talk, I will showcase advanced technical Superset features like the rich Superset API, how to version control dashboards using GitHub, embedding Superset charts in other applications, and more. This talk will be technical and hands-on, and I will share all the code examples I use so you can play with them yourself afterwards!
Srini Kadamati

Operating contexts: patterns around defining how a DAG should behave in dev, staging, prod & beyond

As people define and publish a DAG, it can be really useful to make clear how that DAG should behave under different “operating contexts”. Common operating contexts may match your different environments (dev / staging / prod) and/or your operating needs (quick run, full backfill, test run, …). Over the years, patterns have emerged among workflow authors, teams and organizations, yet little has been shared about how to approach this.
Maxime Beauchemin

Airflow loves Kubernetes

In this talk Jarek and Kaxil will discuss the official, community-built support for running Airflow in a Kubernetes environment. Full support for Kubernetes deployments was in development by the community for quite a while; in the past, Airflow users had to rely on third-party images and Helm charts to run Airflow on Kubernetes. Over the last year, community members made an enormous effort to provide robust, simple, and versatile support for those deployments, responding to the needs of all kinds of Airflow users.
Jarek Potiuk & Kaxil Naik

Customizing Xcom to enhance data sharing between tasks

In Apache Airflow, XCom is the default mechanism for passing data between tasks in a DAG. In practice, this has been restricted to small data elements, since XCom data is persisted in the Airflow metadata database and is constrained by database and performance limitations. With the new TaskFlow API introduced in Airflow 2.0, passing data between tasks is seamless and the use of XCom is invisible. However, the ability to pass data is restricted to a relatively small set of data types that can be natively converted to JSON.
Vikram Koka & Ephraim Anierobi
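
As a hedged sketch of the kind of custom XCom backend this implies, a backend can subclass BaseXCom and override its serialization hooks; the object-store helpers and size threshold below are assumptions:

```python
from typing import Any

from airflow.models.xcom import BaseXCom

# Enable via airflow.cfg:  [core] xcom_backend = my_package.S3XComBackend


def upload_to_s3(value: Any) -> str:
    """Stub: write the payload to object storage and return its URI (hypothetical)."""
    return "s3://my-bucket/xcom/placeholder"


def download_from_s3(uri: str) -> Any:
    """Stub: fetch and deserialize the payload behind the URI (hypothetical)."""
    return None


class S3XComBackend(BaseXCom):
    @staticmethod
    def serialize_value(value: Any):
        # Stash large payloads externally; store only a small reference in the DB.
        if isinstance(value, (bytes, bytearray)) and len(value) > 1_000_000:
            value = upload_to_s3(value)
        return BaseXCom.serialize_value(value)

    @staticmethod
    def deserialize_value(result) -> Any:
        value = BaseXCom.deserialize_value(result)
        if isinstance(value, str) and value.startswith("s3://"):
            value = download_from_s3(value)
        return value
```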

MWAA: Design Choices and Road Ahead

An informal and fun chat about the journey we took and the decisions we made in building Amazon Managed Workflows for Apache Airflow. We will talk about: our first tryst with understanding Airflow; talking to Amazon data engineers about how they ran workflows at scale; key design decisions and the reasons behind them; the road ahead, and what we dream about for the future of Apache Airflow; and the team’s open-source tenets and commitment. We will leave time at the end for a short AMA.
Subash Canapathy & John Jackson