Welcome to the session program for Airflow Summit 2023.

Tuesday, September 19, 2023

09:00
09:00 - 09:30.
by Rich Bowen
Room: Ballroom A-B
For those of us who already know how important open source is, it can be challenging to persuasively make the case to management, because we assume that everyone already knows the basics. This can work against us, confusing our audience and making us come across as condescending or concerned about irrelevant lofty philosophical points. In this talk, we take it back to the basics. What does management actually need to know about open source, why it matters, and how to make decisions about consuming open source, contributing to open source, and open sourcing company code?
09:30
09:30 - 10:00.
by Dustin Ingram
Room: Ballroom A-B
We’ve heard a lot in the last few years about insecurity in the open source software ecosystem, whether it be vulnerabilities, supply chain attacks or malware. Has open source become suddenly fraught with security problems? Or is it maybe, possibly… actually doing great? Let’s delve into the collaborative nature of our open-source ecosystems, and explore how transparency, peer review, and community have created a robust security posture. We’ll examine real-world examples, dispel myths, and reveal the inherent strengths of open source in fostering a secure and resilient software ecosystem.
10:00
10:00 - 11:00.
by Kaxil Naik, Pierre Jeambrun, Jarek Potiuk, Ash Berlin-Taylor & Marc Lamberti
Room: Ballroom A-B
Airflow is almost 10 years old! Since starting out at Airbnb, the project has taken all sorts of twists and turns before getting to where it is now. Through its lifecycle, Airflow has seen an explosion of contributors (over 2400 and counting), end users, use cases, and so much more. This panel, moderated by Marc Lamberti, will be about some of the faces that have helped make Airflow what it is.
11:00
11:00 - 11:30
Morning break
11:30
11:30 - 11:55.
by Marc Lamberti
Room: Ballroom A-B
Airflow is a powerful tool for orchestrating complex data workflows, and it has undergone significant changes over the past two years. Since the Airflow release cycle has accelerated, you may struggle to keep up with the continuous flow of new features and improvements, which can lead to missed opportunities for addressing new use cases or solving your existing ones more efficiently. This presentation is intended to give you a solid update on the possibilities of Airflow and to address misconceptions you may have heard, or still believe, that used to be valid but no longer are.
11:30 - 11:55.
by Jose Puertos
Room: Ballroom C-D
The purpose of this session is to show how we leverage Airflow in a federated way across all our business units to build a cost-effective platform that accommodates different patterns of data integration, replication, and ML tasks. The platform allows flexible DevOps tuning of DAGs across environments and integrates with our open-source observability strategy, giving our SREs consistent metrics, monitoring, and alerting for data tasks.
11:30 - 11:55.
by Kenten Danas
Room: Ballroom crush
Astronomer has hosted over 100 Airflow webinars designed to educate and inform the community on best practices, use cases, and new features. The goal of these events is to increase Airflow’s adoption and ensure everybody, from new users to experienced power users, can keep up with a project that is evolving faster than ever. When new releases come out every few months, it can be easy to get stuck in past versions of Airflow.
11:30 - 11:55.
by Utkarsh Sharma
Room: York
Making a contribution to or becoming a committer on Airflow can be a daunting task, even for experienced Python developers and Airflow users. The sheer size and complexity of the code base may discourage potential contributors from taking the first steps. To help alleviate this issue, this session is designed to provide a better understanding of how Airflow works and build confidence in getting started. During the session, we will introduce the main components of Airflow, including the Web Server, Scheduler, and Workers.
12:00
12:00 - 12:25.
by Niko Oliveira
Room: Ballroom A-B
Executors are a core concept in Apache Airflow and are an essential piece of the execution of DAGs. They have seen a lot of investment over the years and there are many exciting advancements that will benefit both users and contributors. This talk will briefly discuss executors, how they work and what they are responsible for. It will then describe Executor Decoupling (AIP-51) and how this has fully unlocked development of third-party executors.
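To make the decoupling concrete, here is a minimal, hypothetical sketch of what AIP-51 enables: a third-party executor subclasses BaseExecutor and implements a handful of hooks. The backing-service helpers here are invented for illustration and are not from the talk.

    from airflow.executors.base_executor import BaseExecutor

    class MyServiceExecutor(BaseExecutor):
        """Hypothetical executor that hands task commands to an external service."""

        def start(self):
            # Called once when the scheduler boots this executor.
            self.backend = connect_to_my_service()  # hypothetical helper

        def execute_async(self, key, command, queue=None, executor_config=None):
            # Called for each queued task instance; fire and forget.
            self.backend.submit(key, command)

        def sync(self):
            # Called on every heartbeat; report terminal states back to Airflow.
            for key, state in self.backend.poll():
                if state == "done":
                    self.success(key)
                else:
                    self.fail(key)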
12:00 - 12:25.
by Rajesh Gundugollu
Room: Ballroom C-D
Workload orchestration is at the heart of a successful data lakehouse implementation, especially for the “house” part: data warehouse workloads are often complex because of the very nature of warehouse data and its dependency orchestration problems. We at Asurion have spent years perfecting our Airflow solution to make it a superpower for our data engineers. We have innovated in key areas like a single operator for all use cases, automatic DAG code generation, custom UI components for data engineers, monitoring tools, and more.
12:00 - 12:25.
by Laura Zdanski
Room: Ballroom crush
Open source doc edits provide a low-stakes way for new users to make their first contribution. Ideally, new users find opportunities and feel welcome to fix docs as they learn, engaging with the community from the start. But I found that contributing docs to Airflow had some surprising obstacles. In this talk, I’ll share my first docs contribution journey, including problems and fixes. For example, you must understand how Airflow uses Sphinx and know when to edit in the GitHub UI versus locally.
12:00 - 12:25.
by Kaxil Naik
Room: York
New users starting with Airflow frequently encounter several challenges, ranging from the complexities of containers and virtual environments to Python dependency hell. Moreover, their familiarity with tools such as Docker, docker-compose, and Helm might be somewhat limited, and those tools can even be overkill. In contrast, seasoned Airflow users encounter their own problems, encompassing configuration conflicts with ongoing Airflow projects, intricacies stemming from Docker and docker-compose configurations, and a lack of visibility into all their projects.
12:30
12:30 - 12:55.
by Brent Bovenzi
Room: Ballroom A-B
We are continuing to modernize the Airflow UI to make it easier to manage all aspects of your DAGs. See a demo of the latest updates and improve your workflows with new tips and tricks. Then get a preview of what else will be coming soon, followed by a Q&A where people can raise their own use cases and explore new ideas on how to improve the user experience.
12:30 - 12:55.
by Stanislaw Smyl & Hoa Nguyen
Room: Ballroom C-D
Productive cross-team collaboration between data engineers and analysts is the goal of all data teams; however, delivering on that mission can be challenging given the diverse set of skills that each group brings. In this talk we present an example of how one team tackled this topic by creating a flexible, dynamic and extensible framework using Airflow and cloud services that allowed engineers and analysts to jointly create data-centric micro-services to serve up projections and other robust analysis for use in the organization.
12:30 - 12:55.
by Shubham Mehta & Uma Ramadoss
Room: York
Apache Airflow is a popular workflow platform, but it often faces critiques that may not paint the whole picture. In this talk, we will unpack the critiques of Apache Airflow and provide a balanced analysis. We will highlight the areas where these critiques correctly point out Airflow’s weaknesses, debunk common myths, and showcase where competitors like Dagster and Prefect are excelling. By understanding the pros and cons of Apache Airflow, attendees will be better equipped to make informed decisions about whether Airflow is the right choice for their use cases.
13:00
13:00 - 14:00
Lunch
14:00
14:00 - 14:25.
by Bolke de Bruin
Room: Ballroom A-B
Operators form the core of the language of Airflow. In this talk I will argue that while they have served their purpose, they are holding back the development of Airflow, and that if Airflow wants to stay relevant in the world of the ’new’ data stack (hint: it isn’t currently considered to be part of it) and the self-service data mesh, it needs to kill its darling.
14:00 - 14:25.
by Dave Milmont & Branden West
Room: Ballroom C-D
We would love to speak about our experience upgrading our old Airflow 1 infrastructure to Airflow 2 on Kubernetes, and how we orchestrated the migration of approximately 1,500 DAGs owned by multiple teams in our organization. We had some interesting challenges along the way and can speak about our solutions, including our old Airflow 1 infrastructure and why we decided to move to Kubernetes for Airflow 2.
14:00 - 14:25.
by John Jackson
Room: Ballroom crush
Amazon Managed Workflows for Apache Airflow (MWAA) was released in November 2020. Throughout MWAA’s design we held the tenets that this service would be open-source first, not forking or deviating from the project, and that the MWAA team would focus on improving Airflow for everyone—whether they run Airflow on MWAA, on AWS, or anywhere else. This talk will cover some of the design choices made to facilitate those tenets, how the organization was set up to contribute back to the community, what those contributions look like today, how we’re getting those contributions in the hands of users, and our vision for future engagement with the community.
14:00 - 14:25.
by Jed Cunningham
Room: York
New to Airflow or haven’t followed any of the recent DAG authoring enhancements? This talk is for you! We will go through various DAG authoring features like Setup/Teardown tasks (~2.7), Datasets (2.4), Dynamic Tasks (2.3) and Async tasks (2.2). You won’t be an expert after this short talk, but you’ll have a head start when you write your next DAG, no hacks required.
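As a taste of two of those features, here is a minimal sketch (not from the talk, and assuming Airflow 2.7) combining dynamic task mapping with the setup/teardown decorators:

    import pendulum
    from airflow.decorators import dag, task, setup, teardown

    @dag(start_date=pendulum.datetime(2023, 9, 19), schedule=None)
    def authoring_features():
        @setup
        def create_cluster():
            print("provisioning ephemeral resources")

        @task
        def process(item: int) -> int:
            return item * 2

        @teardown
        def delete_cluster():
            print("tearing down, even if processing failed")

        # Dynamic task mapping: one mapped task instance per list element.
        create_cluster() >> process.expand(item=[1, 2, 3]) >> delete_cluster()

    authoring_features()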
14:30
14:30 - 14:55.
by Bas Harenslak
Room: Ballroom A-B
Have you ever added a DAG file and had no clue what happened to it? You’re not alone! With default settings, Airflow can wait up to 5 minutes before processing new DAG files. In this talk, I’ll discuss the implementation of an event-based DAG parser that immediately processes changes in the DAGs folder. As a result, changes are reflected immediately in the Airflow UI. In this talk I will cover:
14:30 - 14:55.
by Ivan Sayapin & Gabby Clavell
Room: Ballroom C-D
Bloomberg’s Data Platform Engineering team powers some of the most valuable business and financial data on which Bloomberg clients rely. We recently built a configuration-driven system that allows non-engineers to onboard alternative datasets into the company’s ecosystem. This system uses Apache Airflow to orchestrate the data flow across different applications and Bloomberg Terminal functions. We are unique in that we have over 1500 dynamic DAGs tailored for each dataset’s needs (which very few Airflow users have).
14:30 - 14:55.
by Niko Oliveira
Room: Ballroom crush
Apache Airflow is one of the largest Apache projects by many metrics, and it ranks particularly high in the number of contributors involved in the project. This leads to hundreds of GitHub Issues, Pull Requests and Discussions being submitted to the project every month, so it is critical to have an ample number of committers to support the community. In this talk I will summarize my personal experience working towards, and ultimately achieving, committer status in Apache Airflow.
14:30 - 14:55.
by Rafal Biegacz & Filip Knapik
Room: York
DAG authoring: learn how to go beyond the basics and best practices when implementing Airflow DAGs. This session will be a survival guide for Airflow DAG developers who need to cope with hundreds of Airflow operators. It will go beyond a 101 or “for dummies” session and will be of interest both to those who are just starting to develop Airflow DAGs and to Airflow experts, as it will help them improve their productivity.
15:00
15:00 - 15:25.
by Dennis Ferruzzi & Howard Yoo
Room: Ballroom A-B
OpenTelemetry is a vendor-neutral open-source (CNCF) observability framework that is supported by many vendors industry-wide. It is used for the instrumentation, generation, collection, and export of data within systems, which is then ingested by analytics tools that provide tracing, metrics, and logs. It has long been the plan to adopt the OTel standard within Airflow, allowing builders and users to take advantage of valuable data that could help improve the efficiency, cost and performance of their systems.
15:00 - 15:25.
by Sung Yun
Room: Ballroom C-D
As a team that has built a Time-Series Data Lakehouse at Bloomberg, we looked for a workflow orchestration tool that could address our growing scheduling requirements. We needed a tool that was reliable and scalable, but also could alert on failures and delays to enable users to recover quickly from them. From using triggers over simple sensors to implementing custom SLA monitoring operators, we explore our choices in designing Airflow DAGs to create a reliable data delivery pipeline that is optimized for failure detection and remediation.
15:00 - 15:25.
by Madison Swain-Bowden
Room: Ballroom crush
As a data engineer, I’ve used Airflow extensively over the last 5 years: across 3 jobs, several different roles; for side projects, for critical infrastructure; for manually triggered jobs, for automated workflows; for IT (Ookla/Speedtest.net), for science (Allen Institute for Cell Science), for the commons (Openverse), for liberation (Orca Collective). Authoring a DAG has changed dramatically since 2018, thanks to improvements in both Airflow and the Python language. In this session, we’ll take a trip back in time to see how DAGs looked several years ago, and what the same DAGs might look like now.
15:00 - 15:25.
by Parnab Basak
Room: York
Today, all major cloud service providers and third-party providers include Apache Airflow as a managed service offering in their portfolios. While these cloud-based solutions help with the undifferentiated heavy lifting of environment management, some data teams are also looking to operate self-managed Airflow instances to satisfy specific differentiated capabilities. In this session, we will talk about why you might need to run self-managed Airflow and the available deployment options (with emphasis on Airflow on Kubernetes).
15:30
15:30 - 15:55.
by Jens Scheffler & Christian Schilling
Room: Ballroom A-B
As users of Airflow, we often use DagRun.conf attributes to control the content and flow of a DAG run. Previously, the Airflow UI only allowed launching a run by entering JSON in the UI. This was technically feasible but not user friendly: a user needed to model, check and understand the JSON and enter parameters manually, with no option to validate them before triggering. Similar to Jenkins or GitHub/Azure pipelines, we wanted a UI option to trigger a DAG run through a form with specified parameters.
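For context, the Params-based trigger form that grew out of this work (available since Airflow 2.6) can be sketched roughly as follows; the DAG id and field names are illustrative:

    import pendulum
    from airflow import DAG
    from airflow.models.param import Param
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="trigger_form_example",
        start_date=pendulum.datetime(2023, 1, 1),
        schedule=None,
        params={
            # Each Param renders as a validated field in the trigger form.
            "environment": Param("staging", type="string", enum=["staging", "prod"]),
            "batch_size": Param(100, type="integer", minimum=1),
        },
    ):
        BashOperator(task_id="run", bash_command="echo {{ params.environment }}")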
15:30 - 15:55.
by Varun Srinivas & Raj Ramalingam
Room: Ballroom C-D
In this presentation, we discuss how we built a fully managed workflow orchestration system at Salesforce using Apache Airflow to facilitate dependable data lake infrastructure on the public cloud. We touch upon how we utilized kubernetes for increased scalability and resilience, as well as the most effective approaches for managing and scaling data pipelines. We will also talk about how we addressed data security and privacy, multitenancy, and interoperability with other internal systems.
15:30 - 15:55.
by Viraj Parekh & Pete DeJoy
Room: Ballroom crush
Data platform teams often find themselves in a situation where they have to provide Airflow as a service to downstream teams, as more users and use cases in their organization require an orchestrator. In these situations, giving each team its own Airflow environment can unlock velocity and actually be lower overhead to maintain than a monolithic environment. This talk will be about things to keep in mind when building an Airflow service that supports several environments, personas of users, and use cases.
15:30 - 15:55.
by Jarek Potiuk
Room: York
Apache Airflow has over 650 Python dependencies. In case you did not know already, dependencies in Python are a difficult subject, and Airflow has its own, custom ways of managing them. Airflow has a rather complex system to manage dependencies in its CI system, but this talk is not about that. This talk is directed at users of Airflow who want to keep their dependencies updated, describing ways they can do it.
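One concrete, documented technique in this space is installing Airflow with the constraint files published for each release, which pin all transitive dependencies to a tested set; the version numbers below are illustrative:

    pip install "apache-airflow==2.7.1" \
      --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.1/constraints-3.10.txt"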
16:00
16:00 - 16:15
Afternoon break
16:15
16:15 - 16:40.
by Xiaodong Deng
Room: Ballroom A-B
Airflow’s KubernetesExecutor has supported multi_namespace_mode for a long time. This feature is great at allowing Airflow jobs to run in different namespaces on the same Kubernetes cluster for better isolation and easier management. However, this feature requires a cluster-role for the Airflow scheduler, which can create security problems or be a blocker for some users. PR https://github.com/apache/airflow/pull/28047, which will become available in Airflow 2.6.0, resolves this issue by allowing Airflow users to specify multi_namespace_mode_namespace_list when using multi_namespace_mode, so that no cluster-role is needed and users only need to ensure the scheduler has permissions on certain namespaces rather than on all namespaces in the Kubernetes cluster.
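In configuration terms, the change amounts to something like the following airflow.cfg sketch (the section name differs in older releases, and the namespace values here are illustrative):

    [kubernetes_executor]
    multi_namespace_mode = True
    # New with PR #28047: an explicit allow-list, so the scheduler needs Role
    # permissions only in these namespaces instead of a cluster-role.
    multi_namespace_mode_namespace_list = team-a,team-b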
16:15 - 16:40.
by Ramajayam Gopithirumal
Room: Ballroom C-D
How we migrated from Autosys, with thousands of jobs and 800+ dependencies and an SLA to be met every hour, at a prominent Canadian bank. Our case for migrating off an enterprise scheduler: the money spent on every license and renewal; SLA, monitoring, auditing and DevOps integration; vendor lock-in; and integration with multiple providers.
16:15 - 16:40.
by Diana Vazquez Romo
Room: Ballroom crush
To ensure a quality product, the Airflow community relies on bug reports from Airflow users. Often, bug reports are incomplete or fail to include the steps needed to re-create the observed bug. This workshop will present an example of the bug-to-issue process: how to rule out non-Airflow issues and, once an Airflow issue is suspected, how to submit an issue for the community to see and fix.
16:15 - 16:40.
by Viraj Parekh & Vikram Koka
Room: York
Over the last few years, we’ve spent countless hours talking to data engineers everywhere from Fortune 500s to seed-stage startups. In doing so, we’ve learned what it takes to deliver a world-class Airflow service. We’ve packaged all of that up into The Astro Hypervisor, a new part of our platform that gives users a whole new level of control in Airflow. We’ll talk through how we’ve built this hypervisor and how our customers will be able to use it for autoscaling, tracking the health of Airflow environments, and so much more.
16:45
16:45 - 17:10.
by Philippe Gagnon
Room: Ballroom A-B
Cluster Policies are an advanced Airflow feature composed of a set of hooks that allow cluster administrators to implement checks and mutations against certain core Airflow constructs (DAGs, Tasks, Task Instances, Pods). In this talk, we will discuss how cluster administrators can leverage these functions in order to better govern the workloads that are running in their environments.
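As a flavor of what such hooks look like, here is a minimal sketch of an airflow_local_settings.py with a DAG policy and a task policy; the specific rules are illustrative, not from the talk:

    # airflow_local_settings.py
    from airflow.exceptions import AirflowClusterPolicyViolation

    def dag_policy(dag):
        # Reject DAGs that do not declare any tags.
        if not dag.tags:
            raise AirflowClusterPolicyViolation(f"DAG {dag.dag_id} must declare tags")

    def task_policy(task):
        # Mutate tasks cluster-wide, e.g. cap retries at 3.
        task.retries = min(task.retries or 0, 3)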
16:45 - 17:10.
by Zhengyi Liu, Yuri Desyatnik, Han Gan & Nanxi Chen
Room: Ballroom C-D
We will cover how Snap (parent company of Snapchat) has been using Airflow since 2016; how we built a secure deployment on GCP that integrates with internal tools for workload authorization, RBAC and more; how we made DAG permissions easy for customers using k8s workload identity binding and tight UI integration; and how we are migrating 2,500+ DAGs from Airflow 1 on Python 2 to Airflow 2 on Python 3 using tools and automation.
16:45 - 17:10.
by Vincent Beck
Room: Ballroom crush
System tests are executable DAGs used for example and testing purposes. With a simple pytest command, you can run an entire DAG. From a provider point of view, they can be viewed as integration tests for all provider-related operators and sensors. Running these system tests frequently and monitoring the results allows us to enforce stability, among many other benefits. In this presentation we will explore how AWS built their system test environment, from the GitHub fork to the health dashboard that exists today… but more importantly, why you should do it as well!
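The pattern in the Airflow repository (AIP-47) looks roughly like the sketch below: the example DAG file ends with a small shim so that pytest can execute the whole DAG end to end. The file path, DAG id and task are illustrative, and the shim import only resolves inside the Airflow source tree:

    # tests/system/providers/amazon/aws/example_s3.py
    import pendulum
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_s3",
        start_date=pendulum.datetime(2023, 1, 1),
        schedule="@once",
    ) as dag:
        BashOperator(task_id="check", bash_command="echo checking bucket")

    # Shim used across Airflow system tests so pytest can run the DAG:
    from tests.system.utils import get_test_run  # noqa: E402

    test_run = get_test_run(dag)
    # Run with: pytest tests/system/providers/amazon/aws/example_s3.py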
16:45 - 17:10.
by Daniel Imberman
Room: York
As much as we love Airflow, local development has been a bit of a white whale through much of its history. Until recently, Airflow’s local development experience has been hindered by the need to spin up a scheduler and webserver. In this talk, we will explore the latest innovation in Airflow local development, namely the “dag.test()” functionality introduced in Airflow 2.5. We will delve into practical applications of “dag.test()”, which empowers users to locally run and debug Airflow DAGs in a single Python process.
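The documented usage is pleasantly small; a minimal sketch:

    import pendulum
    from airflow.decorators import dag, task

    @dag(start_date=pendulum.datetime(2023, 1, 1), schedule=None)
    def my_pipeline():
        @task
        def hello():
            print("hello from a single process")

        hello()

    pipeline = my_pipeline()

    if __name__ == "__main__":
        # Runs all tasks in one local Python process, with no scheduler or
        # webserver required; breakpoints and debuggers work as usual.
        pipeline.test()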
17:15
17:15 - 17:40.
by Raphaël Vandon
Room: Ballroom A-B
As big Airflow users grow their usage to hundreds of DAGs, parsing them can become a performance bottleneck in the scheduler. In this talk, we’ll explore how this situation was improved by using caching techniques and pre-processing of DAGs to minimize the overhead of parsing them at runtime. We’ll also touch on how the performance of the existing code was analyzed to find points of improvement. We may include a section on how to configure Airflow to benefit from those recent changes, and some tips on how to make DAGs that are quick to parse, but this will not be the core of the talk.
17:15 - 17:40.
by M Waqas Shahid
Room: Ballroom C-D
Ever wondered how Airflow could play a pivotal role in a data mesh architecture, hosting thousands of DAGs and hundreds of thousands of daily running tasks? Let’s find out! Delivery Hero delivers food in 70 countries across 12 different brands and platforms, with thousands of engineers, analysts and data scientists spread across many countries running analytics and ML services for all the orders delivered. Serving the workflow orchestration needs of such a massive group becomes a challenge.
17:15 - 17:40.
by Julien Le Dem
Room: Ballroom crush
Nurturing a healthy open source community is hard work and requires discipline. There’s a lot of inherent friction in a distributed community that makes it difficult for the many participants to collaborate. It is challenging to contribute something when you’re new to a community. It is overwhelming for maintainers to give enough attention to contributors when a project becomes popular. In this talk, I’ll go over the common pitfalls of open source communities and will go over practices that contribute to keeping your community healthy.
17:15 - 17:40.
by Victor Chiapaikeo & Aldo Orozco Gomez
Room: York
For the DAG owner, testing Airflow DAGs can be complicated and tedious. kubectl cp your DAG from local to a pod, exec into the pod, and run a command? Install Breeze? Why pull the Airflow image and start up the webserver / scheduler / triggerer if all we want is to test the addition of a new task? It doesn’t have to be this hard. At Etsy, we’ve simplified testing DAGs for the DAG owner with dagtest.
17:45
17:45 - 18:10.
by Jarek Potiuk, Mateusz Henc & Vincent Beck
Room: Ballroom A-B
This session is about the current state of implementation of Airflow’s multi-tenancy feature. This is a long-term effort that involves multiple changes and separate AIPs to implement, with the long-term vision of a single Airflow instance supporting multiple, independent teams, either from the same company or as part of an Airflow-as-a-Service implementation.
17:45 - 18:10.
by Kunal Haria & Ramya Pappu
Room: Ballroom C-D
We will share the case study of Airflow at StyleSeat, where within a year our data grew from 2 million data points per day to 200 million. Our original solution for orchestrating this data was not enough, so we migrated to an Airflow-based solution. In our previous implementation, tasks were orchestrated with hourly triggers on AWS CloudWatch rules in their own log groups; each task was an individually defined Lambda that executed Python code from a Docker image.
17:45 - 18:10.
by Jonathan Leek
Room: Ballroom crush
Volunteers in Saint Louis are using Airflow to build an open source data warehouse of real estate data (permits, assessments, violations, etc), with an eye towards creating a national open data standard. This talk will focus on the unique challenges of running an open source data warehouse, and what it looks like to work with volunteers to create data pipelines.
17:45 - 18:10.
by Zachary Bannor
Room: York
At Condé Nast, we have heavily leveraged async/deferrable operators to reduce our Airflow-associated costs. By implementing async/deferrable operators in all of our pipelines, we have been able to realize a cost reduction of 54% compared with our previous usage of non-async/deferrable operators.
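For readers new to the feature: deferrable operators hand their wait over to the triggerer process, freeing the worker slot instead of blocking it. A minimal sketch using a core async sensor (the DAG id and timestamp are illustrative):

    import pendulum
    from airflow import DAG
    from airflow.sensors.date_time import DateTimeSensorAsync

    with DAG(
        dag_id="deferrable_example",
        start_date=pendulum.datetime(2023, 1, 1),
        schedule=None,
    ):
        # While waiting, the task defers to the triggerer and frees its worker slot.
        DateTimeSensorAsync(
            task_id="wait_until",
            target_time=pendulum.datetime(2023, 9, 21, 9, 0, tz="UTC"),
        )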

Wednesday, September 20, 2023

09:00
09:00 - 09:30.
by Kaxil Naik & Julian LaNeve
Room: Ballroom A-B
Behind the growing interest in generative AI and LLM-based enterprise applications lies an expanded set of requirements for data integrations and ML orchestration. Enterprises want to use proprietary data to power LLM-based applications that create new business value, but they face challenges in moving beyond experimentation. The pipelines that power these models need to run reliably at scale, bringing together data from many sources and reacting continuously to changing conditions.
09:30
09:30 - 10:00.
by Clayton Coleman
Room: Ballroom A-B
It should be no surprise to the Airflow community that the hype around generative large language models (LLMs) and their wildly-inventive chat front ends have brought significant attention to growing these models and feeding them on a steady diet of data. For many communities in the infrastructure, orchestration, and data landscape this is an opportunity to think big, help our users scale, and make the right foundational investments to sustain that growth over the long term.
10:00
10:00 - 11:00.
by Jarek Potiuk, Viraj Parekh, Rafal Biegacz & John Jackson
Room: Ballroom A-B
There are so many ways to run Airflow, and as such, lots of folks are responsible for running Airflow for downstream users. While many orgs go with a managed service (Astronomer, AWS MWAA, Google Cloud Composer), many also prefer running an Airflow platform themselves. This panel will be about what users want from a managed Airflow service, from the perspective of those charged with providing one. We’ll talk about use cases, roadmaps, and best practices that have been accumulated along the way.
11:00
11:00 - 11:30
Morning break
11:30
11:30 - 11:55.
by Kunal Jain
Room: Ballroom A-B
In an environment with multiple Airflow instances, learn how we built custom operators and a framework to share events across the instances and trigger DAGs based on those events.
11:30 - 11:55.
by Ahuitz Rojas
Room: Ballroom C-D
In large organizations, data workflows can be complex and interconnected, with multiple dependencies and varied runtime requirements. To ensure efficient and timely execution of workflows, it is important to understand the factors that affect the performance of the system, such as network congestion, resource availability, and DAG structure. In this talk, we will explore how delay modeling and DAG connectivity analysis can be used to optimize Airflow performance in large organizations.
11:30 - 11:55.
by Luan Moreno Medeiros Maciel
Room: Ballroom crush
ETL data pipelines are the bread and butter of data teams that must design, develop, and author DAGs to accommodate the various business requirements. dbt is becoming one of the most used tools to perform SQL transformations on the Data Warehouse, allowing teams to harness the power of queries at scale. Airflow users are constantly finding new ways to integrate dbt with the Airflow ecosystem and build a single pane of glass where Data Engineers can manage and administer their pipelines.
11:30 - 11:55.
by Diederik van Liere
Room: York
This talk is speculative: orchestration tools like Airflow have made it very easy to pull and push data from anywhere to everywhere. But we don’t know what data we are pushing around. What if we had a schema language that we could use to describe this data? Not in terms of data types, but in terms of sensitivity and instructions on how to handle it? This talk is about the headaches companies are facing day to day, and how there may be an opportunity for the Airflow community to help solve this problem.
12:00
12:00 - 12:25.
by Anthony Kalsatos & Akshay Battaje
Room: Ballroom A-B
A steady rise in users and business critical workflows poses challenges to development and production workflows. The solution: enable multi-tenancy on our single Airflow instance. We needed to enable teams to manage their python requirements, and ensure DAGs were insulated from each other. To achieve this we divided our monolithic setup into three parts: Infrastructure (with common code packaging), Workspace Creation, and CI/CD to manage deployments. Backstage templates enable teams to create isolated development environments that resemble our production environment, ensuring consistency.
12:00 - 12:25.
by Rafay Aleem & Victoria Varney
Room: Ballroom C-D
Data science and machine learning are at the heart of Faire’s industry-celebrated marketplace (an a16z top-ranked marketplace) and drive powerful search, navigation, and risk functions, which are powered by ML models trained on 3,000+ features defined by our data scientists. Previously, defining, backfilling and maintaining the feature lifecycle was error-prone. Having a framework built on top of Airflow has empowered data scientists to maintain and deploy their changes independently. We will explore:
12:00 - 12:25.
by Savin Goyal & Ryan Delgado
Room: Ballroom crush
Airflow is a household brand in data engineering: It is readily familiar to most data engineers, quick to set up, and, as proven by millions of data pipelines powered by it since 2014, it can keep DAGs running. But with the increasing demands of ML, there is a pressing need for tools that meet data scientists where they are and address two pressing issues - improving the developer experience & minimizing operational overhead.
12:00 - 12:25.
by Shirshanka Das
Room: York
Data contracts have been much discussed in the community of late, with a lot of curiosity around how to approach this concept in practice. We believe data contracts need a harmonizing layer to manage data quality in a uniform manner across a fragmented stack. We are calling this harmonizing layer the Control Plane for Data, powered by the common thread across these systems: metadata. For teams already orchestrating pipelines with Airflow, data contracts can be an effective way to process data that meets preset quality standards.
12:30
12:30 - 12:55.
by Filip Kunčar & Stanislav Repka
Room: Ballroom A-B
Kiwi.com started using Airflow in June 2016 as an orchestrator for several people in the company. The need for the tool grew, and the monolithic instance came to be used by 30+ teams with 500+ active DAGs, resulting in 3.5 million tasks/month successfully finished. At first, this monolithic Airflow environment served us well, but our needs quickly changed as we wanted to support a data mesh architecture within Kiwi.com. By leveraging Astronomer on GCP, we were able to move from a monolithic Airflow environment to many smaller instances of Airflow.
12:30 - 12:55.
by Amit Kumar
Room: Ballroom C-D
Discover the transformation of Airflow at GoDaddy: from its initial deployment on-prem to its migration to the cloud, and finally to a Single Pane Orchestration Model. This evolution has streamlined our Data Platform and improved governance. Our experience will be beneficial for anyone seeking to optimize their Airflow implementation and simplify their orchestration processes. We will cover: history and use cases; design, organization decisions, and governance (examining the decision-making process and governance structure); and the migration to the cloud (the process of transitioning Airflow from on-premises to the cloud).
12:30 - 12:55.
by Maxime Beauchemin
Room: Ballroom crush
Change management in data teams can be challenging, to say the least. Not only do you have to evolve your data pipelines, data structures, and the datasets themselves across environments, you also have to keep data exploration and visualization tools in sync. In this talk, we’ll explore how best to do this across environments (i.e. dev, staging and prod), talking about how CI/CD can help, implementing good DataOps practices, and cranking up the level of rigor where it matters.
12:30 - 12:55.
by Michal Modras
Room: York
The session will cover the data lineage capabilities of Apache Airflow, how to use them, and the motivations for them. It will present the technical know-how of integrating data lineage solutions with Apache Airflow and of provisioning DAG metadata to fuel lineage functionality in a way that is transparent to the user, limiting setup friction. It will include Google’s Cloud Composer lineage integration, implemented through Airflow’s current data lineage architecture, and our approach to the lineage evolution strategy.
13:00
13:00 - 14:00
Lunch
14:00
14:00 - 14:25.
by Syed Hussain
Room: Ballroom A-B
Deep dive into how AWS is developing Deferrable Operators for the Amazon Provider Package to help users realize the potential cost-savings provided by Deferrable Operators, and promote their usage.
14:00 - 14:25.
by Alaeddine Maaoui & Jilan Kothakota
Room: Ballroom C-D
Apache Airflow is scalable, dynamic, extensible and elegant. Can it be a lot more? We have taken Airflow to the next level, using it as a hybrid cloud data service accelerating our transformation. During this talk we will present the implementation of Airflow as an orchestration solution between legacy, private and public cloud (AWS/Azure): a comparison between public and private offerings; harnessing the power of a hybrid cloud orchestrator to meet regulatory requirements (European financial institutions); and real production use cases.
14:00 - 14:25.
by Andrea Bombino
Room: Ballroom crush
This talk will give a high-level overview of the architecture of a data product DAG, its benefits in a data mesh world, and how to implement it easily. Airflow is the de-facto orchestrator we use at Astrafy for all our data engineering projects. Over the years we have developed deep expertise in orchestrating data jobs, and recently we have adopted the “data mesh” paradigm of having one Airflow DAG per data product.
14:00 - 14:25.
by Nathan Hadfield
Room: York
At King, data is fundamental in helping us deliver the best possible experiences for the players of our games while continually bringing them new, innovative and evolving gameplay features. Data has to be “always-on”, where downtime and accuracy are treated with the same level of diligence as any of our games and success is measured against internal SLAs. How is King using ‘data reliability engineering as code’ tools such as SodaCore within Airflow pipelines to detect, diagnose and inform about data issues to create coverage, improve quality & accuracy and help eliminate data downtime?
14:30
14:30 - 14:55.
by Dipankar Ghosal
Room: Ballroom A-B
Learn how to convert Oozie workflows into Airflow DAGs and run them on Amazon EMR. The utility supports Airflow 2.4.3 and is built on top of https://github.com/GoogleCloudPlatform/oozie-to-airflow
14:30 - 14:55.
by Jonathan Rainer & Ed Sparkes
Room: Ballroom C-D
As a bank, Monzo has seen exponential growth in active users, from 1.6 million in 2019 to 5.8 million in 2022. At the same time, the number of data users and analysts has expanded from an initial team of 4 to 132. Alongside this growth, our infrastructure and tooling have had to evolve to deliver the same value at a new scale. From an Airflow installation deployed on a single monolithic instance, we now deploy atop Kubernetes and have integrated our Airflow setup into the bank’s backend systems.
14:30 - 14:55.
by CJ Jameson & Mauricio De Diana
Room: Ballroom crush
You’ve got your pipelines flowing … how much do you know about the data inside? Most teams have some coverage with unit/contract/expectations tests, and you might have other quality checks. But it can be very ad-hoc and disorganized. You want to do more to beef up data quality and observability … does that mean you just need to write more tests and assertions? Come learn about the best way to see your data’s quality alongside DAGs in a familiar context.
14:30 - 14:55.
by Maciej Obuchowski
Room: York
With native support for OpenLineage in Airflow, users can now observe and manage their data pipelines with ease. This talk will cover the benefits of using OpenLineage, how it is implemented in Airflow, practical examples of how to take advantage of it, and what’s in our roadmap. Whether you’re an Airflow user or provider maintainer, this session will give you the knowledge to make the most of this tool.
15:00
15:00 - 15:25.
by Vikram Koka
Room: Ballroom A-B
Introduced in Airflow 2.4, Datasets are a foundational feature for authoring modular data pipelines. As DAGs grow to encompass a larger number of data sources and multiple data transformation steps, they typically become less predictable in the timeliness of execution and less efficient. This talk focuses on leveraging Datasets to enable predictable and more efficient DAGs, by applying patterns from microservice architectures. Just as large monolithic applications were decomposed into micro-services to deliver more efficient scalability and faster development cycles, micropipelines have the same potential to radically transform data pipeline efficiency and development velocity.
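The core mechanics being built on here are small; a sketch of a producer/consumer pair (the URI and DAG ids are illustrative):

    import pendulum
    from airflow import DAG, Dataset
    from airflow.operators.bash import BashOperator

    orders = Dataset("s3://warehouse/orders.parquet")

    with DAG(dag_id="produce_orders", start_date=pendulum.datetime(2023, 1, 1), schedule="@daily"):
        # Declaring the outlet marks the dataset as updated when this task succeeds.
        BashOperator(task_id="load", bash_command="echo load", outlets=[orders])

    with DAG(dag_id="transform_orders", start_date=pendulum.datetime(2023, 1, 1), schedule=[orders]):
        # This DAG runs whenever the upstream dataset is updated.
        BashOperator(task_id="transform", bash_command="echo transform")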
15:00 - 15:25.
by Jan Pawlowski & Jędrzej Matuszak
Room: Ballroom C-D
Representing the Murex Reporting team at UniCredit, we would like to present our journey with Airflow and how, over the past two years, it enabled us to automate and simplify our batch workflows. Compared to our previous rigid mainframe scheduling approach, we have created a robust and scalable framework complete with a CI/CD process, bringing our time to market for scheduling changes down from 3 days to 1. Basing our solution on DAG networks joined by ResumeDagRunOperators and an array of custom-built plugins (such as static time predecessors), we were able to replicate the scheduling of our overnight ETL processes (consisting of approx.
15:00 - 15:25.
by Jonathan Talmi
Room: Ballroom crush
Airflow is a popular choice for organizations looking to integrate open-source dbt within their existing data infrastructure. This talk will explore two primary methods of running dbt in Airflow: job-level and model-level. We’ll discuss the tradeoffs associated with each approach, highlighting the simplicity and efficiency of job-level orchestration, contrasted with the enhanced observability and control provided by model-level orchestration. We’ll also explain how the balance has shifted in recent years, with improvements to dbt core making model-level more efficient and innovative Airflow extensions like Cosmos making it easier to implement.
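Job-level orchestration in its simplest form is a single task wrapping the dbt CLI (paths below are illustrative); model-level approaches instead expand the dbt project into one Airflow task per model, which is what extensions like Cosmos automate:

    import pendulum
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(dag_id="dbt_job_level", start_date=pendulum.datetime(2023, 1, 1), schedule="@daily"):
        # One opaque task: simple and efficient, but Airflow sees no per-model status.
        BashOperator(
            task_id="dbt_build",
            bash_command="dbt build --project-dir /opt/dbt/my_project --profiles-dir /opt/dbt",
        )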
15:00 - 15:25.
by Michael Robinson
Room: York
Airflow uses SQLAlchemy under the hood but up to this point has not exploited the tool’s capacity to produce detailed metadata about queries, tables, columns, and more. In fact, SQLAlchemy ships with an event listener that, in conjunction with OpenLineage, offers tantalizing possibilities for enhancing the development process – specifically in the areas of monitoring and debugging. SQLAlchemy’s event system features a Session object and ORMExecuteState mapped class that can be used to intercept statement executions and emit OpenLineage RunEvents as executions occur.
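A minimal sketch of that event hook, with the OpenLineage emission elided:

    from sqlalchemy import event
    from sqlalchemy.orm import Session

    @event.listens_for(Session, "do_orm_execute")
    def intercept(orm_execute_state):
        # orm_execute_state is an ORMExecuteState describing the execution.
        if orm_execute_state.is_select:
            statement = orm_execute_state.statement
            # ...build and emit an OpenLineage RunEvent from the statement here...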
15:30
15:30 - 15:55.
by Bartosz Jankiewicz
Room: Ballroom A-B
Reliability is a complex and important topic. I will focus on both the definition of reliability and best practices. I will begin by reviewing the Apache Airflow components that impact reliability, and will subsequently examine those aspects, showing the single points of failure, mitigations, and tradeoffs. The journey starts with the scheduling process: I will focus on the aspects of Scheduler infrastructure and configuration that address reliability improvements. The Scheduler doesn’t run in a vacuum, so I’ll also share my observations on the reliability of its underlying infrastructure.
15:30 - 15:55.
by Jianlong Zhong
Room: Ballroom C-D
At Coinbase, Airflow is adopted by a wide range of applications, and used by nearly all the engineering and data science teams. In this session, we will share our journey in improving the productivity of Airflow users at Coinbase. The presentation will focus on three main topics: Monorepo based architecture: our approach of using a monorepo to simplify DAG development and enable developers from across the company to work more efficiently and collaboratively.
15:30 - 15:55.
by Zohar Donenhirsh & Alina Aven
Room: Ballroom crush
High-scale orchestration of genomic algorithms using Airflow workflows, AWS Elastic Container Service (ECS), and Docker. Genomic algorithms are highly demanding of CPU, RAM, and storage. Our data science team requires a platform to facilitate the development and validation of proprietary algorithms. The data engineering team develops a research data platform that enables data scientists to publish Docker images to AWS ECR and run them using Airflow DAGs that provision AWS ECS compute power on EC2 and Fargate.
15:30 - 15:55.
by Russell Lamb
Room: York
Discover PepsiCo’s dynamic data quality strategy in a multi-cloud landscape. Join me, the Director of Data Engineering, as I unveil our Airflow utilization, custom operator integration, and the power of Great Expectations. Learn how we’ve harmonized Data Mesh into our decentralized development for seamless data integration. Explore our journey to maintain quality and enhance data as a strategic asset at PepsiCo.
16:00
16:00 - 16:15
Afternoon break
16:15
16:15 - 16:40.
by Pádraic Slattery
Room: Ballroom A-B
Airflow, traditionally used by Data Engineers, is now popular among Analytics Engineers who aim to provide analysts with high-quality tooling while adhering to software engineering best practices. dbt, an open-source project that uses SQL to create data transformation pipelines, is one such tool. One approach to orchestrating dbt using Airflow is using dynamic task mapping to automatically create a task for each sub-directory inside dbt’s staging, intermediate, and marts directories. This enables analysts to write SQL code that is automatically added as a dedicated task in Airflow at runtime.
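A hedged sketch of that approach (the directory layout and selectors are illustrative): list the sub-directories at runtime, then map one dbt task per directory.

    import pendulum
    from pathlib import Path
    from airflow import DAG
    from airflow.decorators import task
    from airflow.operators.bash import BashOperator

    with DAG(dag_id="dbt_mapped", start_date=pendulum.datetime(2023, 1, 1), schedule="@daily"):

        @task
        def dbt_commands() -> list[str]:
            # One "dbt run" per staging sub-directory, discovered at runtime.
            dirs = [p.name for p in Path("/opt/dbt/models/staging").iterdir() if p.is_dir()]
            return [f"dbt run --select staging.{d}" for d in dirs]

        # Dynamic task mapping: one task instance per command in the list.
        BashOperator.partial(task_id="dbt_run").expand(bash_command=dbt_commands())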
16:15 - 16:40.
by Ritika Jain
Room: Ballroom C-D
Twitch, the world’s leading live streaming platform, has a massive user base of over 140 million active users and an incredibly complex recommendation system to deliver a personalized and engaging experience to its users. In this talk, we will dive into how Twitch leverages the power of Apache Airflow to manage and orchestrate the training and deployment of its recommendation models. You will learn about the scale of Twitch’s reach and the challenges we faced in building a scalable, reliable, and developer-friendly recommendation system.
16:15 - 16:40.
by Amogh Desai & Shubham Raj
Room: Ballroom crush
Cloudera Data Engineering (CDE) is a serverless service for Cloudera Data Platform that allows you to submit various Spark jobs and Airflow DAGs to an auto-scaling cluster. Writing your workloads as Python DAG files may be the usual way, but it is not the most convenient one for some users, as it requires background knowledge of the syntax, the programming language, the aesthetics of Airflow, etc. The DAG Authoring UI is a tool built on top of Airflow APIs that lets one use a graphical user interface to create, manage, and destroy complex DAGs.
16:15 - 16:40.
by Roy Noyman
Room: York
Are you tired of spending hours on Airflow migrations and wondering how to make them more accessible? Would you like to be able to test your code on different Airflow versions? Or are you struggling to set up a reliable local development environment? These are some of the top pain points for data engineers working with Airflow. But fear not: Wix Data Engineering has some best practices to share that will make your life easier.
16:45
16:45 - 17:10.
by John Jackson
Room: Ballroom A-B
Airflow DAGs are Python code (which can pretty much do anything you want) and Airflow has hundreds of configuration options (which can dramatically change Airflow’s behavior). Those two facts contribute to endless combinations that can run the same workloads, but only a precious few are efficient. The rest will result in failed tasks and excessive compute usage, costing time and money. This talk will demonstrate how small changes can yield big dividends, and will reveal some code improvements and Airflow configurations that can reduce costs and maximize performance.
16:45 - 17:10.
by Wanda Kinasih
Room: Ballroom C-D
With millions of orders per day, Gojek needs a data processing solution that can handle a high volume of data. Airflow is a scalable tool that can handle large volumes of data and complex workflows, making it an ideal solution for Gojek’s needs. With Airflow, we can create automated data pipelines to extract data from various sources, transform it, and load it into dashboards such as Tableau for analysis and visualization.
16:45 - 17:10.
by Iddo Avneri
Room: Ballroom crush
Are you tired of spending countless hours testing your data pipelines, only to find that they don’t work as expected? Do you wish there was a better way to manage your data versions and streamline your testing processes? If so, this presentation is for you! Join us as we explore the problem domain of testing environments for data pipelines and take a deep dive into the available tools currently in use.
16:45 - 17:10.
by Soren Archibald & Jay Thomas
Room: York
The ability to create DAGs programmatically opens up new possibilities for collaboration between Data Science and Data Engineering. Engineering and DevOps are typically incentivized by stability, whereas Data Science is typically incentivized by fast iteration and experimentation. With Airflow, it becomes possible for engineers to create tools that allow Data Scientists and Analysts to create robust no-code/low-code data pipelines for feature stores. We will discuss Airflow as a means of bridging the gap between data infrastructure and modeling iteration, and examine how a Qbiz customer did just this by creating a tool that allows Data Scientists to build features, train models and measure performance, using cloud services, in parallel.
17:15
17:15 - 17:40.
by Ryan Hatter
Room: Ballroom A-B
Much of the world sees Airflow as a hammer and ETL tasks as nails, but in reality, Airflow is much more of a sophisticated multitool, capable of orchestrating a wide variety of complex workflows. Astronomer’s Customer Reliability Engineering (CRE) team is leveraging this potential in its development of Airline, a tool powered by Airflow that monitors Airflow deployments and sends alerts proactively when issues arise. In this talk, Ryan Hatter from Astronomer will give an overview of Airline.
17:15 - 17:40.
by Vladi Nekolov & Zdravko Hvarlingov
Room: Ballroom C-D
Inside the Financial Times, we’ve been gradually moving our batch data processing from a custom solution to Airflow. To enable various teams within the company to use Airflow more effectively, we’ve been working on extending the system’s self-service capabilities. This includes giving teams ownership of their DAGs and separating resources such as connections. The batch data ingestion processes are the main ETL-like jobs that we run on Airflow.
17:15 - 17:40.
by Jack Lockyer-Stevens
Room: Ballroom crush
In 2022, cloud data centres accounted for up to 3.7% of global greenhouse gas emissions, exceeding those of aviation and shipping. Yet in the same year, Britain wasted 4 Terawatt hours of renewable energy because it couldn’t be transported from where it was generated to where it was needed. So why not move the cloud to the clean energy? VertFlow is an Airflow operator that deploys workloads to the greenest Google Cloud data centre, based on the realtime carbon intensity of electricity grids worldwide.
17:15 - 17:40.
by Ben Chen
Room: York
In this session, we’ll explore the inner workings of our warehouse allocation service and its many benefits. We’ll discuss how you can integrate these principles into your own workflow and provide real-world examples of how this technology has improved our operations. From reducing queue times to making smart decisions about warehouse costs, warehouse allocation has helped us streamline our operations and drive growth. With its seamless integration with Airflow, building an in-house warehouse allocation pipeline is simple and can easily fit into your existing workflow.
17:45
17:45 - 18:45.
Room: Ballroom A-B
We will close Airflow Summit with lightning talks (5 minutes each). You will be able to sign up during the event. We will only have space for 10 talks.

Thursday, September 21, 2023

09:00
09:00 - 11:30.
by Filip Knapik, Michal Modras, Rafal Biegacz, Bartosz Jankiewicz, Leah Cole, Victor Aoqui & Arun Joy Vattoly
Room: Trinity 1-2
Hands on workshop for medium/advanced Airflow users who would like to know more about Airflow and Composer and use features like data lineage to enhance observability and disaster recovery procedures.
09:00 - 11:30.
by Eric Jones
Room: Trinity 3-4
Hands on workshop showing how data observability can work within your Airflow and Modern Data Stack.
09:00 - 11:30.
by Rishi Kar & George Yates
Room: Trinity 5
In this workshop you will learn how to simplify your data pipelines in the Snowflake Data Cloud.
09:00 - 11:30.
by Aneel Murari & Parnab Basak
Room: York
Learn how to optimize your Apache Airflow environment. You will get hands-on experience implementing techniques and best practices and see how they improve the performance of the Airflow environment.
11:30
11:30 - 12:00
Lunch
12:00
12:00 - 14:30.
by M Waqas Shahid
Room: Trinity 1-2
In this workshop you will learn why and how to set up a data mesh architecture based on Apache Airflow.
12:00 - 14:30.
by Luan Moreno Medeiros Maciel & Tatiana Al-Chueyr Martins
Room: Trinity 3-4
The main objective of this workshop is to demonstrate how Apache Airflow, together with the Astro Python SDK, can be used to orchestrate data pipelines and perform ETL processes in a scalable and performant way. During the workshop, you will learn how to easily create pipelines in Apache Airflow with the Astro Python SDK, which is fully open source and backed by Astronomer. Acquiring the skills from this workshop will enable you to implement data pipelines with few lines of code, using the best practices and recommendations in the market.
12:00 - 14:30.
by Jarek Potiuk, Hussein Awala & Elad Kalif
Room: Trinity 5
Learn how you can become a contributor to Apache Airflow. From setting up an environment to making your first pull request.
12:00 - 14:30.
by Marc Lamberti
Room: York
During Airflow Summit you can take an Airflow Certification exam at no additional cost. We will have beginner and advanced level certifications available.