Times are shown in your local timezone

Color Codes:
Keynote
Technical
Use case / adoption journey
Community
Lightning talk
Workshop


2021-07-08T16:00:00.000Z

07/08/2021, 4:00 PM – 4:50 PM UTC

From not knowing Python (let alone Airflow), to submitting a first PR that fixed a typo, to becoming an Airflow Committer, PMC Member, Release Manager, and the #1 committer this year, this talk walks through Kaxil’s journey in the Airflow world.

The second part of this talk explains:

  • How you can start your own OSS journey by contributing to Airflow
  • Expanding your familiarity with different parts of the Airflow codebase
  • Committing regularly and steadily to become an Airflow Committer (including the current guidelines for becoming a Committer)
  • The different communication channels (dev list, users list, Slack, GitHub Discussions, etc.)
https://airflowsummit.org/live

Contributing to Apache Airflow | Journey to becoming Airflow's leading contributor

by Kaxil Naik

07/08/2021, 4:50 PM – 5:00 PM UTC

Rachael, a new Airflow contributor, and Leah, an experienced Airflow contributor, share the story of Rachael’s first contribution, highlighting the importance of contributions from new users and the positive impact that non-code contributions have in an open source community.

https://airflowsummit.org/live

You don’t have to wait for someone to fix it for you

by Rachael Deacon-Smith & Leah Cole

07/08/2021, 5:00 PM – 5:25 PM UTC

Learn to contribute to the Apache Airflow ecosystem both with and without code. Post an article to the Airflow blog, improve documentation, or dive head-first into Airflow’s free and open source software community.

https://airflowsummit.org/live

Contributing to Apache Airflow: First Steps

by Ryan Hatter


2021-07-09T16:00:00.000Z

07/09/2021, 4:00 PM – 4:50 PM UTC

A discussion with Jay Sen, Data Platform Architect at Paypal, and Ry Walker, Founder/CTO of Astronomer about the central role Airflow plays within Paypal’s data platform, and the opportunity to build stronger integrations between Airflow and other tools that surround it.

https://airflowsummit.org/live

Airflow as the Foundation of a Multi-Faceted Data Platform

by Ry Walker & Jay Sen

07/09/2021, 4:50 PM – 5:00 PM UTC

Running a platform where different business units at Apple can run their workloads in isolation and share operators.

https://airflowsummit.org/live

Apache Airflow at Apple - Multi-tenant Airflow and Custom Operators

by Roberto Santamaria & Howie Wang

07/09/2021, 5:00 PM – 5:25 PM UTC

Digital transformation, application modernization, and data platform migration to the cloud are key initiatives in most enterprises today. These initiatives are stressing the scheduling and automation tools in these enterprises to the point that many users are looking for better solutions. A survey revealed that 88% of users believe that their business will benefit from an improved automation strategy across technology and business. Airflow has an excellent opportunity to capture mindshare and emerge as the leading solution here. At Unravel, we are seeing the trend where many of our enterprise customers are at various stages of migrating to Airflow from their enterprise schedulers or ETL/ELT orchestration tools like Autosys, Informatica, Oozie, Pentaho, and Tidal.

In this talk, we will share lessons learned and best practices found in the entire pipeline migration life-cycle which includes:

(i) The evaluation process which led to picking Airflow, including certain aspects where Airflow can do better

(ii) The challenges in discovering and understanding all components and dependencies that need to be considered in the migration

(iii) The challenges arising during the pipeline code and data migration, especially, in getting a single-pane-of-glass and apples-to-apples views to track the progress of the migration

(iv) The challenges in ensuring that the pipelines that have been migrated to Airflow are able to perform and scale on par or better compared to what existed previously

https://airflowsummit.org/live

Lessons Learned while Migrating Data Pipelines from Enterprise Schedulers to Airflow

by Shivnath Babu

07/09/2021, 5:30 PM – 5:55 PM UTC

This talk will cover the adoption journey (Technical Challenges & Team Organization) of Apache Airflow (1.8 to 2.0) at Societe Generale.

Timeline of events:

  • POC with v1.8 to convince our management.
  • Shared infrastructure with v1.10.2.
  • Multiple Infrastructure with v1.10.12.
  • On-demand service offering with v2.0 (challenges & REX)
https://airflowsummit.org/live

Airflow Journey @SG

by Ahmed Chakir Alaoui & Alaeddine Maaoui

07/09/2021, 6:00 PM – 6:25 PM UTC

At QuintoAndar we seek automation and scalability in our data pipelines and believe that Airflow is the right tool for giving us exactly what we need. However, having all concerns mapped and tooling defined doesn’t necessarily mean success.

For months we had struggled with the misconception that Airflow should act as both orchestrator and executor within a monolithic strategy. That could not have been further from the truth: scalability and performance issues arose, along with infrastructure and maintainability costs and multi-directional impact across development teams.

Employing Airflow, though, as an orchestration-only solution may help teams deliver value to end users in a more efficient, reliable and performant manner, where data pipelines can be executed anywhere with proper resources and optimizations.

Those are the reasons we have shifted from an orchestrate-execute strategy to an orchestrate-only one, in order to leverage the full power of data pipeline management in Airflow. Straightaway the separation of data processing and pipeline coordination brought not only a finer resource tuning and better maintainability, but also a tremendous scalability on both ends.

https://airflowsummit.org/live

Building an Elastic Platform Using Airflow Uniquely as an Orchestrator

by Rafael Ribaldo & Lucas Fonseca

07/09/2021, 4:00 PM – 7:00 PM UTC

Participation in this workshop requires previous registration and has limited capacity. Get your ticket at https://ti.to/airflowsummit/2021-contributor

By attending this workshop, you will learn how you can become a contributor to the Apache Airflow project. You will learn how to set up a development environment, how to pick your first issue, how to communicate effectively within the community, and how to make your first PR. Experienced committers of the Apache Airflow project will give you step-by-step instructions and guide you through the process. When you finish the workshop, you will be equipped with everything needed to make further contributions to the Apache Airflow project.

Prerequisites:

  • You need to have Python experience.
  • Previous experience in Airflow is nice-to-have.
  • The session is geared towards Mac and Linux users. If you are a Windows user, it is best if you install Windows Subsystem for Linux (WSL).

In preparation for the class, please make sure you have set up the following prerequisites:

https://airflowsummit.org/live

Workshop: Contributing to Apache Airflow

by Jarek Potiuk & Tomasz Urbaszek

Learn how to become a code contributor to the Apache Airflow project.

2021-07-12T16:00:00.000Z

07/12/2021, 4:00 PM – 4:50 PM UTC

https://airflowsummit.org/live

Looking ahead: What comes after Airflow 2.0?

by Aizhamal Nurmamat kyzy & Ash Berlin-Taylor

07/12/2021, 5:00 PM – 5:50 PM UTC: Session presented by Google

https://airflowsummit.org/live
07/12/2021, 6:00 PM – 6:50 PM UTC

Last year, we shared why we selected Airflow as our next-generation workflow system. This year, we will dive into the journey of migrating more than 3,000 workflows and 45,000 tasks to Airflow. We will discuss the infrastructure additions to support such loads, the partitioning and prioritization of the different workflow tiers we defined in-house, the migration tooling we built to onboard users, the translation layers between our old DSLs and the new one, our internal k8s executor that leverages Pinterest’s Kubernetes fleet, and more. We want to share the technical and usability challenges of completing such a large migration over the course of a year, and how we overcame them to successfully migrate 100% of the workflows to our in-house workflow platform, branded Spinner.

https://airflowsummit.org/live

Pinterest’s Migration Journey

by Ace Haidrey

07/12/2021, 7:00 PM – 7:25 PM UTC

Airflow has a lot of moving parts, and it can be a little overwhelming as a new user - as I was not too long ago.

Join me as we go through Airflow’s architecture at a high level, explore how DAGs work and run, and look at some of the good, the bad, and the unexpected things lurking inside.

https://airflowsummit.org/live

The Newcomer's Guide to Airflow's Architecture

by Andrew Godwin


2021-07-13T04:00:00.000Z

07/13/2021, 4:00 AM – 4:25 AM UTC

We will describe how we were able to build a system in Airflow for MySQL to Redshift ETL pipelines defined in pure Python using dataclasses. These dataclasses are then used to dynamically generate DAGs depending on pipeline type. This setup allows us to implement robust testing, validation, alerts, and documentation for our pipelines. We will also describe the performance improvements we achieved by upgrading to Airflow 2.0.
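The pattern described above can be sketched with the standard library alone. The `PipelineSpec` dataclass and `generate_dag` helper below are illustrative stand-ins (not code from the talk, and no Airflow dependency): each spec declaratively describes one MySQL-to-Redshift pipeline, and a generator function turns every spec into a DAG-like structure that tests and validation can iterate over.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineSpec:
    """Declarative description of one MySQL -> Redshift pipeline."""
    source_table: str
    target_table: str
    schedule: str = "@daily"
    columns: tuple = ("*",)

def generate_dag(spec: PipelineSpec) -> dict:
    """Turn a spec into a DAG-like dict of named tasks and dependencies.
    In real Airflow this would instantiate a DAG and operators instead."""
    dag_id = f"mysql_to_redshift__{spec.source_table}"
    tasks = ["extract", "validate", "load", "notify"]
    return {
        "dag_id": dag_id,
        "schedule": spec.schedule,
        "tasks": tasks,
        # linear dependency chain: extract >> validate >> load >> notify
        "dependencies": list(zip(tasks, tasks[1:])),
    }

# One spec per pipeline; tests, validation, and docs can all read SPECS.
SPECS = [
    PipelineSpec(source_table="orders", target_table="analytics.orders"),
    PipelineSpec(source_table="users", target_table="analytics.users",
                 schedule="@hourly"),
]

dags = {d["dag_id"]: d for d in (generate_dag(s) for s in SPECS)}
```

Because the specs are plain dataclasses, properties such as unique DAG ids or valid schedules can be asserted in ordinary unit tests without running a scheduler.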

https://airflowsummit.org/live

Dataclasses as Pipeline Definitions in Airflow

by Madison Swain-Bowden

07/13/2021, 4:30 AM – 4:55 AM UTC

This presentation will detail how Elyra creates Jupyter Notebook, Python, and R script-based pipelines without having to leave your web browser. The goal of Elyra is to help construct data pipelines by surfacing concepts and patterns common in pipeline construction into a familiar, easy-to-navigate interface for data scientists and engineers so they can create pipelines on their own. In Elyra’s Pipeline Editor UI, portions of Apache Airflow’s domain language are surfaced to the user and made transparent or understandable through tooltips and helpful notes shown in the proper context during pipeline construction. With these features, Elyra can rapidly prototype data workflows without the need to know or write any pipeline code. Lastly, we will look at what features we have planned on our roadmap for Airflow, including more robust Kubernetes integration and support for runtime-specific components/operators.

Project home: https://github.com/elyra-ai/elyra

https://airflowsummit.org/live

Creating Data Pipelines with Elyra, a visual DAG composer and Apache Airflow

by Alan Chin

07/13/2021, 5:00 AM – 5:25 AM UTC

As the Apache Airflow project grows, we seek both ways to incorporate rising technologies and novel ways to expose them to our users. Ray is one of the fastest-growing distributed computation systems on the market today. In this talk, we will introduce the Ray decorator and Ray backend. These features, built with the help of the Ray maintainers at Anyscale, will allow data scientists to natively integrate their distributed pandas, XGBoost, and TensorFlow jobs into their Airflow pipelines with a single decorator. By merging the orchestration of Airflow and the distributed computation of Ray, this coordination of technologies opens Airflow users to a whole host of new possibilities when designing their pipelines.

https://airflowsummit.org/live

Apache Airflow and Ray: Orchestrating ML at Scale

by Daniel Imberman

07/13/2021, 5:30 AM – 5:55 AM UTC

The Airflow scheduler uses DAG definitions to monitor the state of tasks in the metadata database and triggers the task instances whose dependencies have been met; scheduling is based on the state of dependencies. The idea of event-based scheduling is to let operators send events to the scheduler to trigger a scheduling action, such as starting, stopping, or restarting jobs. Event-based scheduling allows potential support for richer scheduling semantics, such as periodic execution and manual triggering at per-operator granularity.
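As a rough illustration of that model, here is a minimal stdlib sketch (all names hypothetical, not the actual implementation): operators push events onto a queue, and a toy scheduler reacts to those events instead of polling dependency state.

```python
import queue
from dataclasses import dataclass

@dataclass
class Event:
    """An event an operator sends to the scheduler."""
    kind: str      # e.g. "START_JOB", "STOP_JOB", "RESTART_JOB"
    job_id: str

class EventScheduler:
    """Toy scheduler that reacts to operator events rather than
    polling the state of dependencies."""
    def __init__(self):
        self.events = queue.Queue()
        self.running = set()

    def send(self, event: Event):
        # Operators call this to trigger a scheduling action.
        self.events.put(event)

    def drain(self):
        # Process all pending events and update job state accordingly.
        while not self.events.empty():
            ev = self.events.get()
            if ev.kind == "START_JOB":
                self.running.add(ev.job_id)
            elif ev.kind == "STOP_JOB":
                self.running.discard(ev.job_id)
            elif ev.kind == "RESTART_JOB":
                self.running.add(ev.job_id)

sched = EventScheduler()
sched.send(Event("START_JOB", "ingest"))
sched.send(Event("START_JOB", "train"))
sched.send(Event("STOP_JOB", "ingest"))
sched.drain()
```

Because the scheduler only reacts to events, per-operator semantics such as periodic execution or a manual trigger reduce to sending the appropriate event at the right time.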

https://airflowsummit.org/live

Event-based Scheduling Based on Airflow

by Wuchao Chen

07/13/2021, 6:00 AM – 6:25 AM UTC

Cloudflare’s network keeps growing, and that growth doesn’t just come from building new data centers in new cities. We’re also upgrading the capacity of existing data centers by adding newer generations of servers — a process that makes our network safer, faster, and more reliable for our users.

In this talk, I’ll share how we’re leveraging Apache Airflow to build our own Provision-as-a-Service (PraaS) platform and cut by 90% the amount of time our team spent on mundane operational tasks.

https://airflowsummit.org/live

Provision as a Service: Automating data center operations with Airflow at Cloudflare

by Jet Mariscal

07/13/2021, 6:30 AM – 6:55 AM UTC

In this talk, we present Viewflow, an open-source Airflow-based framework that allows data scientists to create materialized views in SQL, R, and Python without writing Airflow code.

We will start by explaining what problem Viewflow solves: writing and maintaining complex Airflow code instead of focusing on data science. Then we will see how Viewflow solves that problem. We will continue by showing how to use Viewflow with several real-world examples. Finally, we will look at Viewflow’s upcoming features!

Resources:

https://airflowsummit.org/live

Introducing Viewflow: a framework for writing data models without writing Airflow code

by Gaëtan Podevijn

07/13/2021, 7:00 AM – 7:25 AM UTC

This talk aims to share how Airflow’s secrets backend works, and how users can create custom secrets backends for their specific use cases and technology stacks.
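For orientation, the shape of such a backend can be sketched without Airflow installed. The stand-in base class below mimics hooks along the lines of Airflow's `BaseSecretsBackend` (`get_conn_uri`, `get_variable`; check the exact interface of the Airflow version you run), and `EnvJsonSecretsBackend` is a hypothetical example that reads secrets from a JSON blob in a single environment variable:

```python
import json
import os

class BaseSecretsBackend:
    """Stand-in for airflow.secrets.BaseSecretsBackend so this sketch runs
    without Airflow installed; the real base class defines similar hooks."""
    def get_conn_uri(self, conn_id):
        return None

    def get_variable(self, key):
        return None

class EnvJsonSecretsBackend(BaseSecretsBackend):
    """Hypothetical backend reading all secrets from one JSON env var."""
    def __init__(self, env_var="AIRFLOW_SECRETS_JSON"):
        self.env_var = env_var

    def _store(self):
        # Re-read on every call so rotated secrets are picked up.
        return json.loads(os.environ.get(self.env_var, "{}"))

    def get_conn_uri(self, conn_id):
        return self._store().get("connections", {}).get(conn_id)

    def get_variable(self, key):
        return self._store().get("variables", {}).get(key)

# Demo data: in Airflow you would point the [secrets] backend config
# at your class instead of constructing it by hand.
os.environ["AIRFLOW_SECRETS_JSON"] = json.dumps({
    "connections": {"my_db": "postgresql://user:pass@host:5432/db"},
    "variables": {"env": "prod"},
})
backend = EnvJsonSecretsBackend()
```

Returning `None` for unknown keys matters: Airflow falls through its chain of configured secrets backends (and finally the metadata database) when a backend cannot resolve a name.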

https://airflowsummit.org/live

Create Your Custom Secrets Backend for Apache Airflow - A guided tour into Airflow codebase

by Xiaodong DENG


2021-07-13T16:00:00.000Z

07/13/2021, 4:00 PM – 4:30 PM UTC

Since the role of Analytics Engineer has emerged within data and analytics teams over the last few years, it is important to highlight what the role involves and how, from my perspective, a set of dos and don’ts can help a team and boost their day-to-day work with the help of Airflow.

https://airflowsummit.org/live

Airflow and Analytics Engineering - Dos and don'ts

by Sergio Camilo Fandiño Hernández

07/13/2021, 4:00 PM – 4:30 PM UTC

As part of my role at Google, maintaining samples for Cloud Composer, hosted managed Airflow, is crucial. It’s not feasible for me to try out every sample every day to check that it’s working. So, how do I do it?

Automation! While I won’t let the robots touch everything, they let me know when it’s time to pay attention. Here’s how:

Step 0: An update for the operators is released.
Step 1: A GitHub bot called Renovate Bot opens a PR against a special requirements file to make this update.
Step 2: Cloud Build runs unit tests to make sure none of my DAGs immediately break.
Step 3: The PR is approved and merged to main.
Step 4: Cloud Build updates my dev environment.
Step 5: I look at my DAGs in dev to make sure all is well. If there is a problem, I resolve it manually and revert my requirements file.
Step 6: I manually update my prod PyPI packages.

I’ll discuss what automation tools I choose to use and why, and the places where I intentionally leave manual steps to ensure proper oversight.

https://airflowsummit.org/live

Robots are your friends - using automation to keep your Airflow operators up to date

by Leah Cole

07/13/2021, 4:00 PM – 4:30 PM UTC

Reproducibility is the fundamental principle of scientific research. This also applies to the computational workflows used to process research data. Common Workflow Language (CWL) is a highly formalized way to describe pipelines, developed to achieve reproducibility and portability of computational analysis. However, only a few workflow execution platforms could run CWL pipelines. Here, we present CWL-Airflow, an extension for Airflow that executes CWL pipelines. CWL-Airflow serves as the processing engine for the Scientific Data Analysis Platform (SciDAP), a data analysis platform that makes complex computational workflows both user-friendly and reproducible. In our presentation, we will explain why we see Airflow as the perfect backend for running scientific workflows, what problems we encountered in extending Airflow to run CWL pipelines, and how we solved them. We will also discuss the pros and cons of limiting our platform to CWL pipelines and potential applications of CWL-Airflow outside the realm of biology.

https://airflowsummit.org/live

SciDAP: Airflow and CWL-powered bioinformatics platform

by Nick Luckey & Michael Kotliar

07/13/2021, 4:30 PM – 5:20 PM UTC

Astronomer founders Ry Walker and Greg Neiheisel will preview the upcoming next-gen Astronomer Cloud product offering.

https://airflowsummit.org/live

Next-Gen Astronomer Cloud

by Ry Walker & Greg Neiheisel

07/13/2021, 5:30 PM – 5:55 PM UTC

Wise (previously TransferWise) is a London-based fin-tech company. We build a better way of sending money internationally.

At Wise we make great use of Airflow. More than 100 data scientists, analysts and engineers use Airflow every day to generate reports, prepare data, (re)train machine learning models and monitor services.

My name is Alexandra, and I’m a Machine Learning Engineer at Wise. Our team is responsible for building and maintaining Wise’s Airflow instances. In this presentation I would like to cover three main things: our current setup, our challenges, and our future plans with Airflow. We are currently transitioning from a single centralised Airflow instance to many segregated instances to increase reliability and limit access. We’ve learned a lot throughout this journey and are looking to share these learnings with a wider audience.

https://airflowsummit.org/live

Apache Airflow at Wise

by Alexandra Abbas

07/13/2021, 6:00 PM – 6:25 PM UTC

At Fivetran, we are seeing many organizations adopt the Modern Data Stack to suit the breadth of their data needs. However, as incoming data sources begin to scale, it can be hard to manage and maintain the environment, with more time spent repairing and reengineering old data pipelines than building new ones. This talk will introduce a number of new Airflow Providers, including airflow-provider-fivetran, and discuss some of the benefits and considerations we are seeing data engineers, data analysts, and data scientists experience in adopting them.

https://airflowsummit.org/live

Orchestrating ELT with Fivetran and Airflow

by Nick Acosta

07/13/2021, 6:30 PM – 7:20 PM UTC

Airflow 2.0 was a big milestone for the Airflow community. However, companies and enterprises are still facing difficulties in upgrading to 2.0.

In this talk, I will focus on the ideal upgrade path, covering:

  • the upgrade_check CLI tool
  • separation of providers
  • registering connection types
  • important Airflow 2.0 configs
  • DB migration
  • deprecated features around Airflow Plugins
https://airflowsummit.org/live

Upgrading to Apache Airflow 2

by Kaxil Naik


2021-07-14T16:00:00.000Z

07/14/2021, 4:00 PM – 4:25 PM UTC

We’ve all heard the phrase “data is the new oil.” But really imagine a world where this analogy is more real, where problems in the flow of data - delays, low quality, high volatility - could bring down whole economies? When data is the new oil with people and businesses similarly reliant on it, how do you avoid the fires, spills, and crises?

As data products become central to companies’ bottom line, data engineering teams need to create higher standards for the availability, completeness, and fidelity of their data.

In this session we’ll demonstrate how Databand helps organizations guarantee the health of their Airflow pipelines. Databand is a data pipeline observability system that monitors SLAs and data quality issues, and proactively alerts users on problems to avoid data downtime.

The session will be led by Josh Benamram, CEO and Cofounder of Databand.ai. Josh will be joined by Vinoo Ganesh, an experienced software engineer, system architect, and current CTO of Veraset, a data-as-a-service startup focused on understanding the world from a geospatial perspective.

Join to see how Databand.ai can help you create stable, reliable pipelines that your business can depend on!

https://airflowsummit.org/live

Guaranteeing pipeline SLAs and data quality standards with Databand

by Josh Benamram & Vinoo Ganesh

Add to your calendar 07/14/2021 4:30 PM 07/14/2021 5:20 PM UTC Airflow Summit: Deep dive in to the Airflow scheduler

The scheduler is the core of Airflow, and it’s a complex beast.

In this session we will go through the scheduler in some detail; how it works; what the communication paths are and what processing is done where.

https://airflowsummit.org/live

Deep dive in to the Airflow scheduler

by Ash Berlin-Taylor

Add to your calendar 07/14/2021 5:30 PM 07/14/2021 5:55 PM UTC Airflow Summit: Running Big Data Applications in …

In this talk we’ll see some real world examples from Firebolt customers demonstrating how Airflow is used to orchestrate operational data analytics applications with large data volumes, while keeping query latency low.

https://airflowsummit.org/live

Running Big Data Applications in production with Airflow + Firebolt

by Cody Schwarz

Add to your calendar 07/14/2021 6:00 PM 07/14/2021 6:25 PM UTC Airflow Summit: Writing Dry Code in Airflow

Engineering teams leverage the factory coding pattern to write easy-to-read and repeatable code. In this talk, we’ll outline how data engineering teams can do the same with Airflow by separating DAG declarations from business logic, abstracting task declarations from task dependencies, and creating a code architecture that is simple to understand for new team members. This approach will set analytics teams up for success as team and Airflow DAG sizes grow exponentially.
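The factory pattern the abstract describes can be sketched without Airflow installed. The declaration and task names below are invented, and in a real DAG file the factory would emit Operator objects and set dependencies on them rather than returning a plain dict:

```python
# Separate the DAG "declaration" (plain data) from the factory that turns
# it into tasks and dependencies -- a minimal, Airflow-free sketch.
from dataclasses import dataclass, field

@dataclass
class DagSpec:
    dag_id: str
    schedule: str
    tasks: list = field(default_factory=list)  # task names, in run order

def build_linear_pipeline(spec: DagSpec) -> dict:
    """Turn a declaration into a task graph: each task depends on the previous."""
    edges = list(zip(spec.tasks, spec.tasks[1:]))
    return {"dag_id": spec.dag_id, "schedule": spec.schedule, "edges": edges}

spec = DagSpec("daily_sales", "@daily", ["extract", "transform", "load"])
graph = build_linear_pipeline(spec)
# graph["edges"] == [("extract", "transform"), ("transform", "load")]
```

New pipelines then become one-line declarations instead of copy-pasted DAG files, which is the repeatability the talk is after.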

https://airflowsummit.org/live

Writing Dry Code in Airflow

by Sarah Krasnik

Add to your calendar 07/14/2021 6:30 PM 07/14/2021 7:20 PM UTC Airflow Summit: Building a robust data pipeline with the …

Data quality has become a much discussed topic in the fields of data engineering and data science, and it has become clear that data validation is absolutely crucial to ensuring the reliability of any data products and insights produced by an organization’s data pipelines. This session will outline patterns for combining three popular open source tools in the data ecosystem - dbt, Airflow, and Great Expectations - and use them to build a robust data pipeline with data validation at each critical step.
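One way to picture "validation at each critical step", sketched in plain Python rather than with Great Expectations itself (the check names and columns are invented): a validation task that raises on failure, so the orchestrator would halt downstream tasks:

```python
# Between an ingest step and a dbt transform, run expectations over the
# data; a raised exception fails the task and stops the pipeline there.
def expect_not_null(rows, column):
    return all(r.get(column) is not None for r in rows)

def expect_between(rows, column, lo, hi):
    return all(lo <= r[column] <= hi for r in rows)

def validate(rows):
    checks = [
        expect_not_null(rows, "order_id"),
        expect_between(rows, "amount", 0, 10_000),
    ]
    if not all(checks):
        raise ValueError("validation failed; halting downstream tasks")
    return True
```

In the dAG stack, Great Expectations plays this role with declarative expectation suites, and Airflow wires the validation task between the ingestion and dbt steps.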

https://airflowsummit.org/live

Building a robust data pipeline with the dAG stack: dbt, Airflow, Great Expectations

by Sam Bail


2021-07-15T04:00:00.000Z

Add to your calendar 07/15/2021 4:00 AM 07/15/2021 4:25 AM UTC Airflow Summit: An On-Demand Airflow Service for …

EA Games have very dynamic and federated needs for their data processing pipelines. Many individual studios within EA build and manage the data pipelines for their games, iterating rapidly through game development cycles. Developer productivity around orchestrating these pipelines is as critical as providing a robust, production-quality orchestration service. With these in mind, we re-engineered our Airflow service from the ground up to cater to our large internal user base (1000s) and internet-scale data processing systems (petabytes of data). This session details the evolution of the use of Airflow at EA Digital Platform from a monolithic multi-tenant instance to an “On-Demand” system where teams and studios create their own dedicated Airflow instance, with all the necessary bells and whistles, at the click of a button - allowing them to immediately get their data pipelines running. We also elaborate on how Airflow is interwoven into a “Self Serve” model for ETL pipelines within our teams, with the objective of truly democratizing data across our games.

https://airflowsummit.org/live

An On-Demand Airflow Service for Internet Scale Gameplay Pipelines

by Nitish Victor & Yuanmeng Zeng

Add to your calendar 07/15/2021 4:30 AM 07/15/2021 4:55 AM UTC Airflow Summit: Airflow Extensions for Streamlined ETL …

Using Airflow as our scheduling framework, we ETL data generated by tens of millions of transactions every day to build the backbone for our reports, dashboards, and training data for our machine learning models. There are over 500 (and growing) such ingested and aggregated tables owned by multiple teams that contain intricate dependencies between one another. Given this level of complexity, it can become extremely cumbersome to coordinate backfills for any given table, when also taking into account all its downstream dependencies, aggregation intervals, and data availability. This talk will focus on how we customized and extended Airflow at Adyen to streamline our backfilling operations. This allows us to prevent mistakes and enable our product teams to keep launching fast and iterating.

https://airflowsummit.org/live

Airflow Extensions for Streamlined ETL Backfilling

by Ravi Autar

Add to your calendar 07/15/2021 5:00 AM 07/15/2021 5:25 AM UTC Airflow Summit: The new modern data stack - Airbyte, …

In this talk, I’ll describe how you can leverage 3 open-source standards - workflow management with Airflow, EL with Airbyte, transformation with DBT - to build your next modern data stack. I’ll explain how to configure your Airflow DAG to trigger Airbyte’s data replication jobs and DBT’s transformation one with a concrete use case.
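Under the hood, triggering Airbyte from Airflow boils down to calling Airbyte's HTTP API. A minimal sketch of building that request (host and connection id are placeholders; in a real DAG the AirbyteTriggerSyncOperator from the Airbyte provider package issues this call and polls the resulting job):

```python
# Airbyte exposes an HTTP API; starting a replication is a POST to the
# connection-sync endpoint with the connection id in the JSON body.
import json

def sync_request(airbyte_host: str, connection_id: str) -> tuple:
    """Build the URL and JSON body for an Airbyte connection sync."""
    url = f"{airbyte_host}/api/v1/connections/sync"
    body = json.dumps({"connectionId": connection_id})
    return url, body
```

The dbt transformation step would then be a downstream task, so it only runs once replication has landed fresh data.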

https://airflowsummit.org/live

The new modern data stack - Airbyte, Airflow, DBT

by Michel Tricot

Add to your calendar 07/15/2021 5:30 AM 07/15/2021 5:55 AM UTC Airflow Summit: Building ML pipelines with Airflow - …

At Near we work on TBs of location data with close to real-time modelling to generate key consumer insights and estimates for our clients across the globe. We have hundreds of country-specific models deployed and managed through Airflow to achieve this goal. Some of the workflows that we have deployed are schedule-based, some are dynamic, and some are trigger-based. In this session I will discuss some of the workflows that are scheduled and monitored using Airflow, the key benefits, and the challenges that we have faced in our production systems.

https://airflowsummit.org/live

Building ML pipelines with Airflow - Learning and Challenges

by Manmeet Kaur

Add to your calendar 07/15/2021 6:00 AM 07/15/2021 6:50 AM UTC Airflow Summit: Modernize a decade old pipeline with …

As a follow up for https://airflowsummit.org/sessions/teaching-old-dag-new-tricks/, in this talk, we would like to share a happy ending story on how Scribd fully migrated its data platform to the cloud and Airflow 2.0.

We will talk about the data validation tools and task trigger customizations the team built to smooth out the transition. We will share how we completed the Airflow 2.0 migration, which started from an unsupported MySQL version, and metrics to prove why everyone should perform the upgrade. Lastly, we will discuss how large-scale backfills (10 years' worth of runs) are managed and automated at Scribd.

https://airflowsummit.org/live

Modernize a decade old pipeline with Airflow 2.0

by QP Hou, Kuntal Basu, Stas Bytsko & Dmitry Suvorov

Add to your calendar 07/15/2021 7:00 AM 07/15/2021 7:25 AM UTC Airflow Summit: Building the AirflowEventStream

Or: how to keep our traditional Java application up to date on everything big data.

At Adyen we process tens of millions of transactions a day, a number that rises every day. This means that generating reports, training machine learning models or any other operation that requires a bird’s eye view on weeks or months of data requires the use of Big Data technologies.

We recently migrated to Airflow for scheduling all batch operations on our on-premise Big Data cluster. Some of these operations require input from our merchants or our support team. Merchants can for instance subscribe to reports, choose their preferred time zone, and even specify which columns they want included. After generating the reports, these reports then need to become available in our customer portal.

So how do we keep track in our Customer Area which reports have been generated in Airflow? How do we launch ad-hoc backfills when one of our merchants subscribes to a new report? How do we integrate all of this into our existing monitoring pipeline?

This talk will focus on how we have successfully integrated our big data platform with our existing Java web applications and how Airflow (with some simple add-ons) played a crucial role in achieving this.

https://airflowsummit.org/live

Building the AirflowEventStream

by Jelle Munk


2021-07-15T16:00:00.000Z

Add to your calendar 07/15/2021 4:00 PM 07/15/2021 4:25 PM UTC Airflow Summit: Dynamic Security Roles in Airflow for …

Multi-tenant Airflow instances can help save costs for an organization. This talk will walk through how we dynamically assigned roles to users based on groups in Active Directory so that teams would have access to the DAGs they created in the UI on our multi-tenant Airflow instance. To achieve this, we created our own custom AirflowSecurityManager class that ultimately ties LDAP and RBAC together.
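The core of such a setup is a mapping from directory groups to Airflow RBAC roles. A sketch of just that mapping logic, with invented group and role names (in Airflow this would live inside a custom AirflowSecurityManager subclass consulted when a user logs in):

```python
# Map AD/LDAP groups to per-team Airflow roles; users outside any mapped
# group fall back to read-only access. All names here are illustrative.
TEAM_ROLE_MAP = {
    "ad-data-eng": "team_data_eng_dags",
    "ad-analytics": "team_analytics_dags",
}

def roles_for_user(ldap_groups):
    """Resolve a user's LDAP groups to the Airflow roles they should hold."""
    roles = [TEAM_ROLE_MAP[g] for g in ldap_groups if g in TEAM_ROLE_MAP]
    return roles or ["Viewer"]  # default: read-only
```

Pairing this with per-team DAG-level permissions is what gives each tenant access to only its own DAGs.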

https://airflowsummit.org/live

Dynamic Security Roles in Airflow for Multi-Tenancy

by Mark Merling & Sean Lewis

Add to your calendar 07/15/2021 4:30 PM 07/15/2021 5:20 PM UTC Airflow Summit: Apache Airflow 2.0 on Amazon MWAA

In this session we will be presenting Apache Airflow 2.0 on Amazon Managed Workflows for Apache Airflow (MWAA).

https://airflowsummit.org/live

Apache Airflow 2.0 on Amazon MWAA

by John Jackson

Add to your calendar 07/15/2021 5:30 PM 07/15/2021 6:20 PM UTC Airflow Summit: Clearing Airflow obstructions

Apache Airflow aims to speed the development of workflows, but developers are always ready to add bugs here and there.

This talk illustrates a few pitfalls faced while developing workflows at the BBC to build machine learning models. The objective is to share some lessons learned and, hopefully, save others time.

Some of the topics covered, with code examples:

  • Tasks unsuitable to be run from within Airflow executors
  • Plugins misusage
  • Inconsistency while using an operator
  • (Mis)configuration
  • What to avoid during a workflow deployment
  • Consequences of non-idempotent tasks
https://airflowsummit.org/live

Clearing Airflow obstructions

by Tatiana Al-Chueyr Martins

Add to your calendar 07/15/2021 6:30 PM 07/15/2021 6:55 PM UTC Airflow Summit: Learn how to build your own Airflow …

Help your team standardize workflows and build DAGs quicker.

In this session, you’ll find a walkthrough and sample code to build and share an Airflow provider package. This will help you create repeatable patterns that interface your preferred 3rd party services with Airflow.
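Provider packages register themselves with Airflow through a `get_provider_info()` entry point that returns package metadata. A hedged sketch with a placeholder package and hook class (the exact set of recognized keys varies by Airflow version):

```python
# Minimal provider metadata: Airflow discovers installed providers via an
# entry point that calls a function shaped like this. The package name,
# version, and hook path below are placeholders for illustration.
def get_provider_info():
    return {
        "package-name": "airflow-provider-sample",
        "name": "Sample Airflow Provider",
        "versions": ["0.0.1"],
        "hook-class-names": ["sample_provider.hooks.sample.SampleHook"],
    }
```

The hooks and operators themselves are ordinary Python classes; the metadata is what lets Airflow surface things like custom connection types in the UI.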

https://airflowsummit.org/live

Learn how to build your own Airflow provider

by Plinio Guzman

Add to your calendar 07/15/2021 7:00 PM 07/15/2021 7:25 PM UTC Airflow Summit: Reverse ETL on Airflow

At Snowflake, as you can imagine, we run a lot of data pipelines and tables curating metrics for all parts of the business. These are the lifeline of Snowflake’s business decisions. We also have a lot of source systems that display and make these metrics accessible to end users. So what happens when your data model does not match your system? For example, your bookings numbers in Salesforce do not match the data model that curates bookings metrics. At Snowflake we continued to run into this problem over and over again.

Having this problem, we set out to build an infrastructure that would allow users to effortlessly sync the results of their data pipelines with any downstream/upstream system, giving us a central source of truth in our warehouse. This infrastructure was built on Snowflake using Airflow, and allows a user to begin syncing data by providing a few details such as the model and the system to update.

In this presentation we will show you how, using Airflow and Snowflake, we are able to use our data pipelines as the source of truth for all systems involved in the business. With this infrastructure we are able to use Snowflake models as a central source of truth for all applications used throughout the company. This ensures that any number synced in this way and seen by two users is always the same.

https://airflowsummit.org/live

Reverse ETL on Airflow

by Russell Dervay


2021-07-16T04:00:00.000Z

Add to your calendar 07/16/2021 4:00 AM 07/16/2021 4:25 AM UTC Airflow Summit: Productionizing ML Pipelines with …

Machine Learning models can add value and insight to many projects, but they can be challenging to put into production due to problems like lack of reproducibility, difficulty maintaining integrations, and sneaky data quality issues. Kedro, a framework for creating reproducible, maintainable, and modular data science code, and Great Expectations, a framework for data validations, are two great open-source Python tools that can address some of these problems. Both integrate seamlessly with Airflow for flexible and powerful ML pipeline orchestration. In this talk we’ll discuss how you can leverage existing Airflow provider packages to integrate these tools to create sustainable, production-ready ML models.

https://airflowsummit.org/live

Productionizing ML Pipelines with Airflow, Kedro, and Great Expectations

by Kenten Danas

Add to your calendar 07/16/2021 4:30 AM 07/16/2021 5:20 AM UTC Airflow Summit: Data Lineage with Apache Airflow using …

If you manage a lot of data, and you’re attending this summit, you likely rely on Apache Airflow to do a lot of the heavy lifting. Like any powerful tool, Apache Airflow allows you to accomplish what you couldn’t before… but also creates new challenges. As DAGs pile up, complexity layers on top of complexity and it becomes hard to grasp how a failed or delayed DAG will affect everything downstream. In this session we will provide a crash course on OpenLineage, an open platform for metadata management and data lineage analysis. We’ll show how capturing metadata with OpenLineage can help you maintain inter-DAG dependencies, capture data on historical runs, and minimize data quality issues.
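For a feel of what "capturing metadata" means concretely: OpenLineage models each run as a JSON event. A minimal hand-rolled sketch of such an event (the producer URL is a placeholder; real integrations emit these automatically around each task run):

```python
# An OpenLineage run event ties a run id to a named job in a namespace,
# stamped with an event time and type (START, COMPLETE, FAIL, ...).
import datetime
import uuid

def run_event(job_name: str, event_type: str = "COMPLETE") -> dict:
    return {
        "eventType": event_type,
        "eventTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "airflow", "name": job_name},
        "producer": "https://example.com/my-scheduler",  # placeholder
    }
```

Collecting these events across DAGs is what lets a lineage backend reconstruct inter-DAG dependencies and historical runs.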

https://airflowsummit.org/live

Data Lineage with Apache Airflow using OpenLineage

by Julien Le Dem & Willy Lulciuc

Add to your calendar 07/16/2021 5:30 AM 07/16/2021 5:55 AM UTC Airflow Summit: Drift Bio: The Future of Microbial …

In recent years, the bioinformatics world has seen an explosion in genomic analysis as gene sequencing technologies have become exponentially cheaper. Tests that previously would have cost tens of thousands of dollars will soon run at pennies per sequence. This glut of data has exposed a notable bottleneck in the current suite of technologies available to bioinformaticians. At Drift Biotechnologies, we use Apache Airflow to transition traditionally on-premise large scale data and deep learning workflows for bioinformatics to the cloud, with an emphasis on workflows and data from next generation sequencing technologies.

https://airflowsummit.org/live

Drift Bio: The Future of Microbial Genomics with Apache Airflow

by Eli Scheele

Add to your calendar 07/16/2021 6:00 AM 07/16/2021 6:50 AM UTC Airflow Summit: Usability Improvements: Debugging & …

The two most common user questions at Pinterest are: 1) why is my workflow running so long? 2) why did my workflow fail - is it my issue, or a platform issue? As with any big data organization, the workflow platform is just the orchestrator, but the “real” work is done on another layer, managed by another platform. There can be plenty of these, and the challenge of figuring out the root cause of an issue can be mundane and time-consuming. At Pinterest, we set out to provide additional tooling in our Airflow webserver to make inspection quicker and provide smart tips such as increased-runtime analysis, bottleneck identification, root-cause analysis, and an easy way to backfill. We explore the tooling provided to reduce the admin load and empower our users.

https://airflowsummit.org/live

Usability Improvements: Debugging & Inspection Tooling

by Ace Haidrey

Add to your calendar 07/16/2021 7:00 AM 07/16/2021 7:50 AM UTC Airflow Summit: Autoscaling in Airflow - Lessons learned

Autoscaling in Airflow: what we learned from the Cloud Composer case. We would like to present how we approach the autoscaling problem for Airflow running on Kubernetes in Cloud Composer: how we calculate our autoscaling metric, what problems we had with scaling down, and how we solved them. We also share ideas on how we could improve the current solution.

https://airflowsummit.org/live

Autoscaling in Airflow - Lessons learned

by Mateusz Henc & Anita Fronczak


2021-07-16T16:00:00.000Z

Add to your calendar 07/16/2021 4:00 PM 07/16/2021 4:25 PM UTC Airflow Summit: Building a Scalable & Isolated …

After performing several experiments with Airflow, we reached the best architectural design for processing text medical records at scale. Our hybrid solution uses Kubernetes, Apache Airflow, Apache Livy, and Apache cTAKES. Using Kubernetes containers has the benefit of having a consistent, portable, and isolated environment for each component of the pipeline. With Apache Livy, you can run tasks in a Spark cluster at scale. Additionally, Apache cTAKES helps extract information from the clinical free text of electronic medical records, using natural language processing techniques to identify codable entities, temporal events, properties, and relations.

https://airflowsummit.org/live

Building a Scalable & Isolated Architecture for Preprocessing Medical Records

by Mikaela Pisani & Anthony Figueroa

Add to your calendar 07/16/2021 4:30 PM 07/16/2021 4:55 PM UTC Airflow Summit: Advanced Superset for Engineers (API’s, …

Apache Superset is a modern, open-source data exploration & visualization platform originally created by Maxime Beauchemin. In this talk, I will showcase advanced technical Superset features like the rich Superset API, how to version control dashboards using Github, embedding Superset charts in other applications, and more. This talk will be technical and hands-on, and I will share all code examples I use so you can play with them yourself afterwards!

https://airflowsummit.org/live

Advanced Superset for Engineers (API’s, Version Controlled Dashboards, & more)

by Srini Kadamati

Add to your calendar 07/16/2021 5:00 PM 07/16/2021 5:25 PM UTC Airflow Summit: Operating contexts: patterns around …

As people define and publish a DAG, it can be really useful to make it clear how this DAG should behave under different “operating contexts”. Common operating contexts may match your different environments (dev / staging / prod) and/or match your operating needs (quick run, full backfill, test run, …).

Over the years, patterns have emerged around workflow authors, teams and organizations, and little has been shared as to how to approach this. In this talk, we’ll talk about what an “operating context” is, why it’s useful, and describe common patterns and best practices around this topic.
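One common shape such a pattern takes, sketched with invented context names and an invented environment variable: the DAG file looks up its operating context and derives its settings from it, so the same code behaves appropriately in each environment:

```python
# Per-context DAG settings: dev runs manually with no retries, prod runs
# frequently with retries and a deep backfill window. Names illustrative.
import os

CONTEXTS = {
    "dev":     {"schedule": None,      "retries": 0, "start_days_ago": 1},
    "staging": {"schedule": "@daily",  "retries": 1, "start_days_ago": 3},
    "prod":    {"schedule": "@hourly", "retries": 3, "start_days_ago": 30},
}

def dag_settings(context: str = "") -> dict:
    """Resolve settings from an explicit context or the environment."""
    context = context or os.environ.get("DEPLOYMENT_CONTEXT", "dev")
    return CONTEXTS[context]
```

Keeping this lookup in one place, rather than sprinkling `if env == "prod"` checks through DAG files, is the kind of practice the talk's patterns formalize.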

https://airflowsummit.org/live

Operating contexts: patterns around defining how a DAG should behave in dev, staging, prod & beyond

by Maxime Beauchemin

Add to your calendar 07/16/2021 5:30 PM 07/16/2021 5:55 PM UTC Airflow Summit: Airflow loves Kubernetes

In this talk Jarek and Kaxil will talk about official, community support for running Airflow in the Kubernetes environment.

Full support for Kubernetes deployments was developed by the community over quite a while; in the past, users of Airflow had to rely on 3rd-party images and Helm charts to run Airflow on Kubernetes. Over the last year, community members made an enormous effort to provide robust, simple and versatile support for those deployments that would serve all kinds of Airflow users: starting from the official container image, through the quick-start docker-compose configuration, culminating in April with the release of the official Helm Chart for Airflow.

This talk is aimed for Airflow users who would like to make use of all the effort. The users will learn how to:

  • Extend or customize Airflow Official Docker Image to adapt it to their needs
  • Run the quick-start docker-compose environment where they can quickly verify their images
  • Configure and deploy Airflow on Kubernetes using the Official Airflow Helm chart
https://airflowsummit.org/live

Airflow loves Kubernetes

by Jarek Potiuk & Kaxil Naik

07/16/2021 6:00 PM – 6:50 PM UTC | Airflow Summit: Customizing Xcom to enhance data sharing …

In Apache Airflow, XCom is the default mechanism for passing data between tasks in a DAG. In practice, it has been restricted to small data elements, since XCom data is persisted in the Airflow metadata database and is constrained by database size and performance limitations.

With the new TaskFlow API introduced in Airflow 2.0, passing data between tasks is seamless and the use of XCom is invisible. However, data passing is restricted to a relatively small set of data types that can be natively converted to JSON.

This tutorial describes how to go beyond these limitations by developing and deploying a custom XCom backend within Airflow, enabling the sharing of large and varied data elements, such as Pandas DataFrames, between tasks in a data pipeline, using cloud storage such as Google Cloud Storage or Amazon S3.
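The core idea of a custom XCom backend is to intercept serialization: upload the large payload to object storage and keep only a small reference string in the metadata database, then resolve that reference on read. In Airflow this is typically done by subclassing `BaseXCom` and overriding its serialize/deserialize hooks; the sketch below illustrates the pattern in a framework-free way, with an in-memory dict standing in for the cloud bucket (all names here are illustrative, not the real Airflow API surface).

```python
import json
import uuid

# Hypothetical stand-in for a cloud bucket (GCS / S3); a real backend
# would call the storage provider's client library instead.
FAKE_BUCKET = {}


class ObjectStorageXComBackend:
    """Sketch of a custom XCom backend: large values are written to
    object storage and only a short reference string is persisted
    where the metadata database would normally hold the value."""

    PREFIX = "xcom-ref://"

    @staticmethod
    def serialize_value(value):
        # Upload the payload and return only its key for the database.
        key = str(uuid.uuid4())
        FAKE_BUCKET[key] = json.dumps(value)
        return ObjectStorageXComBackend.PREFIX + key

    @staticmethod
    def deserialize_value(ref):
        # Resolve the stored reference back into the original payload.
        key = ref[len(ObjectStorageXComBackend.PREFIX):]
        return json.loads(FAKE_BUCKET[key])


# Round-trip: only the short reference would hit the database.
ref = ObjectStorageXComBackend.serialize_value({"rows": list(range(5))})
value = ObjectStorageXComBackend.deserialize_value(ref)
```

In a real deployment, Airflow is pointed at the backend class via configuration, and the serialization hooks are where one would swap JSON for a format suited to Pandas DataFrames, such as Parquet.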

https://airflowsummit.org/live

Customizing Xcom to enhance data sharing between tasks

by Vikram Koka & Ephraim Anierobi

07/16/2021 7:00 PM – 7:25 PM UTC | Airflow Summit: MWAA: Design Choices and Road Ahead

An informal and fun chat about the journey we took and the decisions we made in building Amazon Managed Workflows for Apache Airflow. We will talk about:

  • Our first tryst with understanding Airflow
  • Talking to Amazon Data Engineers and how they ran workflows at scale
  • Key design decisions and the reasons behind them
  • The road ahead, and what we dream about for the future of Apache Airflow
  • Open-source tenets and the team's commitment to them

We will leave time at the end for a short AMA/Questions.

https://airflowsummit.org/live

MWAA: Design Choices and Road Ahead

by Subash Canapathy & John Jackson
