Dates and times below are shown in UTC.

July 6, 2020, 16:00 UTC

Host: London

Keynote: Airflow then and now

by Bolke de Bruin and Maxime Beauchemin
Bolke and Maxime will talk about the past and present of Airflow.

Airflow at Société Générale : An open source orchestration solution in a banking environment

by Mohammed Marragh and Alaeddine Maaoui

This talk will give an overview of Airflow as well as lessons learned from implementing it in a banking production environment at Société Générale. It is the summary of a two-year experience, the story of an effort within Société Générale to offer an internal cloud solution based on Airflow (AirflowaaS).

Scheduler as a service - Apache Airflow at EA Digital Platform

by Nitish Victor, Preethi Ganeshan and Xiaoqin Zhu

In this talk, we share the lessons learned while building a scheduler-as-a-service leveraging Apache Airflow to achieve improved stability and security for one of the largest gaming companies. The platform integrates with different data sources and meets varied SLAs across workflows owned by multiple game studios. In particular, we present a comprehensive self-serve Airflow architecture with multi-tenancy, auto-DAG generation, SSO integration, and simplified deployment.

July 7, 2020, 16:00 UTC

Host: NYC

Keynote: How large companies use Airflow for ML and ETL pipelines

by Kevin Yang, Dan Davydov and Tao Feng
In this talk, colleagues from Airbnb, Twitter and Lyft will share details about how they are using Apache Airflow to power their data pipelines.

Data DAGs with lineage for fun and for profit

by Bolke de Bruin

Let’s be honest about it: many of us don’t consider data lineage to be cool. But what if lineage allowed you to write less boilerplate and less code, while at the same time making your data scientists, your auditors, your management, and, well, everyone happier? What if you could write DAGs that mix task-based and data-based definitions?

Airflow on Kubernetes: Containerizing your workflows

by Michael Hewitt
At Nielsen Digital we have been moving our ETLs to containerized environments managed by Kubernetes, and we have successfully moved some of our ETLs to this environment in production. To do this we used the following technologies: Helm to easily deploy Airflow onto Kubernetes; Airflow’s Kubernetes Executor to take full advantage of Kubernetes features; and Airflow’s Kubernetes Pod Operator to execute our containerized tasks within our DAGs. To automate much of the deployment process we also used Terraform.
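
To illustrate the pattern the talk describes, here is a minimal, hypothetical sketch of a DAG that runs a containerized task with the KubernetesPodOperator (Airflow 1.10.x import path); the image, namespace, and schedule are placeholders, not Nielsen's actual configuration:

```python
# Minimal sketch: run one containerized task as a Kubernetes pod from an Airflow DAG.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator  # Airflow 1.10.x path

with DAG(
    dag_id="containerized_etl_example",
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = KubernetesPodOperator(
        task_id="extract",
        name="extract-pod",
        namespace="airflow",                          # placeholder namespace
        image="example-registry/etl-extract:latest",  # hypothetical container image
        cmds=["python", "extract.py"],
        get_logs=True,                                # stream pod logs back into the task log
    )
```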

July 8, 2020, 04:00 UTC

Host: Bangalore

Data flow with Airflow @ PayPal

by Aishwarya Sankaravadivel
At PayPal we decided to move away from two of our enterprise schedulers, Control-M and UC4, to Airflow. As we started the journey, the most important first step was to build all the mandatory APIs on top of Airflow so that we could integrate with our self-service tools. In this talk we will share the challenges that we ran into while building APIs on top of Airflow and how we overcame them.

Democratised data workflows at scale

by Mihail Petkov and Emil Todorov
Financial Times is increasing its digital revenue by allowing business people to make data-driven decisions. Providing an Airflow-based platform where data engineers, data scientists, BI experts and others can run language-agnostic jobs was a big step. One of the most successful steps in the platform’s development was building our own execution environment, allowing stakeholders to self-deploy jobs without cross-team dependencies, on top of the virtually unlimited scale of Kubernetes.

Migrating Airflow-based Spark jobs to Kubernetes - the native way

by Roi Teveth and Itai Yaffe
At Nielsen Identity Engine, we use Spark to process tens of terabytes of data. Our ETLs, orchestrated by Airflow, spin up AWS EMR clusters with thousands of nodes per day. In this talk, we’ll guide you through migrating Spark workloads to Kubernetes with minimal changes to Airflow DAGs, using the open-sourced GCP Spark-on-K8s operator and the native integration we recently contributed to the Airflow project.
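
As a rough, hypothetical sketch of how that native integration is typically used (import paths as in the backport providers; the manifest file, namespace, and connection id are placeholders):

```python
# Sketch: submit a SparkApplication via the Spark-on-K8s operator integration,
# then wait for it to finish with the matching sensor.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import SparkKubernetesSensor

with DAG(
    dag_id="spark_on_k8s_example",
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit = SparkKubernetesOperator(
        task_id="submit_spark_app",
        namespace="spark-jobs",              # placeholder namespace
        application_file="spark_etl.yaml",   # SparkApplication CRD manifest (placeholder)
        kubernetes_conn_id="kubernetes_default",
        do_xcom_push=True,                   # expose the created resource to the sensor
    )

    monitor = SparkKubernetesSensor(
        task_id="monitor_spark_app",
        namespace="spark-jobs",
        application_name="{{ task_instance.xcom_pull(task_ids='submit_spark_app')['metadata']['name'] }}",
        kubernetes_conn_id="kubernetes_default",
    )

    submit >> monitor
```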

July 8, 2020, 16:00 UTC

Host: Bay Area

Keynote: Future of Airflow

by Jarek Potiuk, Kaxil Naik, Tomasz Urbaszek, Ash Berlin-Taylor, Kamil Bregula and Daniel Imberman
A team of core committers will explain what is coming in Airflow 2.0.

Run Airflow DAGs in a secure way

by Rafal Biegacz
In the contemporary world, security matters more than ever, and Airflow installations are no exception. Google Cloud Platform and Cloud Composer offer useful security options for running your DAGs and tasks so that you can effectively manage the risk of data exfiltration and limit access to the system. This is a sponsored talk, presented by Google Cloud.

Airflow as the next gen of workflow system at Pinterest

by Yulei Li, Dinghang Yu and Ace Haidrey
At Pinterest, our current workflow system, Pinball, has served our data pipeline orchestration demands well for years. However, with rapidly increasing execution demand, the system started to expose scalability and performance issues, so we decided to look for a new solution and chose Airflow as our next-generation workflow system. In this talk we discuss how we made the decision to onboard to Airflow and, beyond the out-of-the-box features, what improvements we made to better support business needs at Pinterest.

July 9, 2020, 16:00 UTC

Host: Seattle

Keynote: Making Airflow a sustainable project through D&I

by Aizhamal Nurmamat kyzy and Griselda Cuevas
In this talk, the VP of D&I at the ASF, Gris Cuevas, will share some statistics about the state of D&I at the foundation, as well as the initiatives the foundation is taking to make projects more diverse and inclusive. Then, Apache Airflow PMC member Aizhamal Nurmamat kyzy will share her own journey of becoming an open source contributor and dive into project-specific initiatives that help make Apache Airflow one of the most sustainable projects in open source.

Improving Airflow's user experience

by Ry Walker, Maxime Beauchemin and Viraj Parekh
We will walk you through some current UX challenges, give an overview of how the Astronomer platform addresses the major ones, and provide a sneak peek at the things we’re working on in the coming months to improve Airflow’s user experience.

Airflow CI/CD: Github to Cloud Composer (safely)

by Jacob Ferriero
Deploying bad DAGs to your Airflow environment can wreak havoc. This talk provides an opinionated take on a mono-repo structure for GCP data pipelines leveraging BigQuery and Dataflow, and a series of CI tests for validating your Airflow DAGs before deploying them to Cloud Composer. Composer makes deploying Airflow infrastructure easy and reduces deploying DAGs to “just dropping files in a GCS bucket”. However, this makes it easy for organizations to shoot themselves in the foot by not following a strong CI/CD process.
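
A common flavour of such a CI check, sketched here only as an assumption about the approach (the dags/ path is a placeholder), is to fail the build if any DAG file cannot be imported before the files are synced to the Composer bucket:

```python
# CI guard: refuse to deploy if any DAG in the repo fails to import.
from airflow.models import DagBag


def test_no_dag_import_errors():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)  # "dags/" is a placeholder path
    assert not dag_bag.import_errors, f"DAG import failures: {dag_bag.import_errors}"
```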

Workshop: Getting started with Apache Airflow

by Fokko Driesprong
Learn how to create your first DAG with an Airflow instance on the cloud via Cloud Composer.

July 10, 2020, 04:00 UTC

Host: Bangalore

Advanced Apache Superset for Data Engineers

by Maxime Beauchemin
Superset is the leading open source data exploration and visualization platform. In this talk, we’ll be presenting Superset with a focus on advanced topics that are most relevant to Data Engineers.

Teaching an old DAG new tricks

by QP Hou
In this talk, I would like to share a couple of best practices for setting up a cloud-native Airflow deployment on AWS. For those who are interested in migrating a non-trivial data pipeline to Airflow, I will also share how Scribd plans and executes the migration.

Ask me anything with Airflow members

We will host an ‘ask me anything’ with a group of Airflow committers & PMC members.

July 10, 2020, 16:00 UTC

Host: Warsaw

Demo: Reducing the lines, a visual DAG editor

by Traey Hatch
In this talk I will introduce a DAG authoring and editing tool for Airflow that we have built. Installed as a plugin, this tool allows users to author DAGs by composing existing operators and hooks with virtually no Python experience. We will walk you through a live demo of DAG authoring and deployment, and spend time reviewing the underlying open-source standards used and the general approach taken to develop the code.

AIP-31: Airflow functional DAG definition

by Gerard Casas Saez
Airflow does not currently have an explicit way to declare messages passed between tasks in a DAG. XComs are available but are hidden inside the operators’ execution functions. AIP-31 proposes a way to make this message passing explicit in the DAG file and make it easier to reason about your DAG’s behaviour. In this talk, we will explore what other DSLs are doing for message passing and how that has influenced AIP-31.
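
For context, a hedged sketch of the functional style AIP-31 proposes, roughly as it later shipped in Airflow 2.0's TaskFlow API (function and task names are illustrative):

```python
# Sketch of functional DAG definition: return values flow between tasks explicitly
# instead of hiding xcom_push/xcom_pull calls inside operator code.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval="@daily", start_date=datetime(2020, 7, 1), catchup=False)
def functional_example():
    @task
    def extract():
        return {"order_id": 42, "amount": 13.37}  # pushed to XCom automatically

    @task
    def load(order):
        print(f"loading order {order['order_id']} worth {order['amount']}")

    load(extract())  # the data dependency doubles as the task dependency


functional_example_dag = functional_example()
```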

Using Airflow to speed up development of data intensive tools

by Blaine Elliot
In this talk we will review how Airflow helped create a tool to detect data anomalies. Leveraging Airflow for process management, database interoperability, and authentication created an easy path to achieving scale, decreasing development time, and passing security audits. While Airflow is generally seen as a solution for managing data pipelines, integrating tools with Airflow can also speed up development of those tools. The Data Anomaly Detector was created at One Medical to scan thousands of metrics per day for data anomalies.

Workshop: Best Practices for running Airflow on Kubernetes

by Daniel Imberman and Greg Neiheisel
Learn all of the ins-and-outs of running Airflow on Kubernetes.

July 13, 2020, 16:00 UTC

Host: London

Autonomous driving with Airflow

by Amr Noureldin and Michal Dura
This talk describes how Airflow is used in an autonomous driving project originating from Munich, Germany. We describe the Airflow setup, what challenges we encountered, and how we maneuvered to achieve a distributed and highly scalable Airflow deployment. One of the biggest automotive manufacturers chose Airflow as its orchestration tool in the pursuit of producing its first Level-3 autonomous driving vehicle in Germany. In this talk, we will describe the journey of deploying Airflow on top of OpenShift using a PostgreSQL database and RabbitMQ.

From cron to Airflow on Kubernetes: A startup story

by Adam Boscarino
Learn how Devoted Health, a Medicare Advantage startup, went from cron jobs to Airflow on Kubernetes in a short period of time using a combination of open source and internal tooling. This journey is a common one, but it still has a steep learning curve for new Airflow users. This talk will give you a blueprint to follow by covering the tools we use, best practices, and lessons learned.

Airflow in Airbnb

by Kevin Yang, Ping Zhang, Yingbo Wang, Cong Zhu and Conor Camp
We will go over the yesterday, today, and tomorrow of Airflow at Airbnb, and share our learnings and vision for Airflow core and the wider ecosystem. Starting with the history of Airflow at Airbnb, we will briefly describe how Airflow is used and give a high-level overview of our setup. We will then cover our current setup, short-term plans, learnings, and best practices, and finally talk about our roadmap and vision for Airflow at Airbnb.

July 14, 2020, 04:00 UTC

Host: Tokyo

Pipelines on pipelines: Agile CI/CD workflows for Airflow DAGs

by Victor Shafran
How do you create fast and painless delivery of new DAGs into production? When running Airflow at scale, it becomes a big challenge to manage the full lifecycle around your pipelines; making sure that DAGs are easy to develop, test, and ship into prod. In this talk, we will cover our suggested approach to building a proper CI/CD cycle that ensures the quality and fast delivery of production pipelines. CI/CD is the practice of delivering software from dev to prod, optimized for fast iteration and quality control.

Production Docker image for Apache Airflow

by Jarek Potiuk
This talk will guide you through the internals of the official production Docker image of Airflow. It will show you the foreseen use cases for it and how to use it in conjunction with the official Helm chart to make your own deployments.

Airflow as an elastic ETL tool

by Heindrik Kleine and Vicente Ruben del Pino Ruiz
In search of a better, modern, simple method of managing ETL processes and merging them with various AI and ML tasks, we landed on Airflow. We envisioned a new user-friendly interface that can leverage dynamic DAGs and reusable components to build an ETL tool that requires virtually no training. We built several template DAGs and connectors from Airflow to typical data sources, like SQL Server, and then proceeded to build a modern interface on top that brings ETL build, scheduling, and execution capabilities.
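
One common way to implement such template-driven, dynamic DAGs, sketched here as an illustration rather than the presenters' actual code, is to stamp out one DAG per source from a config mapping:

```python
# Sketch: generate one DAG per configured source system from a single template.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x path

SOURCES = {  # hypothetical config; in practice this could come from a UI or YAML
    "sql_server_sales": {"schedule": "@hourly"},
    "sql_server_hr": {"schedule": "@daily"},
}


def build_dag(source_name, config):
    dag = DAG(
        dag_id=f"etl_{source_name}",
        start_date=datetime(2020, 7, 1),
        schedule_interval=config["schedule"],
        catchup=False,
    )
    with dag:
        PythonOperator(
            task_id="extract_and_load",
            python_callable=lambda: print(f"running ETL for {source_name}"),
        )
    return dag


# Register each generated DAG at module level so the scheduler picks it up.
for name, cfg in SOURCES.items():
    globals()[f"etl_{name}"] = build_dag(name, cfg)
```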

July 14, 2020, 16:00 UTC

Host: NYC

How do we reason about the reliability of our data pipeline in Wrike

by Alexander Eliseev
In this talk we will share some of the lessons we have learned after using Airflow for a couple of years and growing from 2 users to 8 teams. We will cover: establishing a reliable review process for Airflow, managing multiple Airflow configurations, and data versioning.

Achieving Airflow observability with Databand

by Josh Benamram
Most teams use Airflow in combination with other tools like Spark, Snowflake, and BigQuery. Join this session to learn how Databand’s observability system makes it easy to monitor your end-to-end pipeline health and quickly remediate issues.

From S3 to BigQuery - How a first-time Airflow user successfully implemented a data pipeline

by Leah Cole
BigQuery is GCP’s serverless, highly scalable and cost-effective cloud data warehouse that can analyze petabytes of data at super fast speeds. Amazon S3 is one of the oldest and most popular cloud storage offerings. Folks with data in S3 often want to use BigQuery to gain insights into their data. Using Apache Airflow, they can build pipelines to seamlessly orchestrate that connection. In this talk, Emily and Leah will walk through how they created an easily configurable pipeline to extract data.
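
A plausible shape for such a pipeline, sketched with provider transfer operators purely as an assumption (bucket names, dataset, and table are placeholders, and this is not necessarily the presenters' implementation):

```python
# Sketch: copy objects from S3 to GCS, then load them into BigQuery.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.s3_to_gcs import S3ToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="s3_to_bigquery_example",
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    s3_to_gcs = S3ToGCSOperator(
        task_id="s3_to_gcs",
        bucket="example-s3-bucket",               # placeholder S3 bucket
        prefix="exports/",
        dest_gcs="gs://example-gcs-bucket/exports/",
    )

    gcs_to_bq = GCSToBigQueryOperator(
        task_id="gcs_to_bq",
        bucket="example-gcs-bucket",              # placeholder GCS bucket
        source_objects=["exports/*.csv"],
        destination_project_dataset_table="example_dataset.orders",  # placeholder table
        source_format="CSV",
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
    )

    s3_to_gcs >> gcs_to_bq
```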

July 15, 2020, 16:00 UTC

Host: Amsterdam

Building reuseable and trustworthy ELT pipelines (A templated approach)

by Nehil Jain
To improve automation of data pipelines, I propose a universal approach to ELT pipelines that optimizes for data integrity, extensibility, and speed of delivery. The workflow is built using open source tools and standards like Apache Airflow, Singer, Great Expectations, and DBT. Templating ETLs is challenging! Creating and maintaining data pipelines in production requires hard work to manage bugs in code and bad data. I would like to propose a data pipeline pattern that can simplify building pipelines while optimizing for data integrity and observability.

Testing Airflow workflows - ensuring your DAGs work before going into production

by Bas Harenslak
How do you ensure your workflows work before deploying to production? In this talk I’ll go over various ways to assure your code works as intended, both at the task level and at the DAG level. I will cover: how to test and debug tasks locally; how to test with and without task instance context; how to test against external systems (e.g., how do you test a PostgresOperator?); and how to test the integration of multiple tasks to ensure they work nicely together.
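
As a hedged illustration of DAG-level testing (the DAG and task ids below are hypothetical), a pytest suite can load the DagBag and assert both that the DAG parses and that its structure is what you expect:

```python
# Sketch: structural tests for a DAG, run in CI before deployment.
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dag_bag():
    return DagBag(dag_folder="dags/", include_examples=False)  # placeholder path


def test_dag_loaded(dag_bag):
    # The DAG must parse without import errors and be present in the bag.
    assert not dag_bag.import_errors
    assert dag_bag.get_dag("example_dag") is not None  # hypothetical dag_id


def test_task_dependencies(dag_bag):
    dag = dag_bag.get_dag("example_dag")
    extract = dag.get_task("extract")                 # hypothetical task ids
    assert "load" in extract.downstream_task_ids       # "extract" must feed "load"
```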

Adding an executor to Airflow: A contributor overflow exception

by Vanessa Sochat
Engaging with a new community is a common experience in OSS development. There are usually expectations held by the project about the contributor’s exposure to the community, and by the contributor about interactions with the community. When these expectations are misaligned, the process is strained. In this talk, I’ll discuss a real life experience that required communication, persistence, and patience to ultimately lead to a positive outcome.

Workshop: Contributing to Apache Airflow

by Jarek Potiuk and Tomasz Urbaszek
Learn how to become a code contributor to the Apache Airflow project.

July 16, 2020, 04:00 UTC

Host: Melbourne

Migration to Airflow backport providers

by Anita Fronczak
In this talk I will showcase how to use the newly released Airflow backport providers. Some of the topics we will cover are: how to install them on Airflow 1.10.x, how to install them on Composer, how to migrate one or more DAGs from the legacy operators to the new providers, and known bugs and fixes.
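
For orientation, a minimal sketch of what the migration looks like in practice, assuming the google backport provider package (the operator chosen is just an example):

```python
# After installing a backport provider package on Airflow 1.10.x, e.g.
#
#   pip install apache-airflow-backport-providers-google
#
# DAGs can switch from the legacy contrib import to the new provider path.

# legacy (Airflow 1.10 contrib):
# from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

# backport provider (same functionality, new home and name):
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
```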

From Zero to Airflow: bootstrapping a ML platform

by Noam Elfanbaum
At Bluevine we use Airflow to drive our ML platform. In this talk, I’ll present the challenges and gains we had in transitioning from a single server running Python scripts with cron to a full-blown Airflow setup. Some of the points I’ll cover are: supporting multiple Python versions, event-driven DAGs, Airflow performance issues and how we circumvented them, building Airflow plugins to enhance observability, monitoring Airflow using Grafana, and CI for Airflow DAGs (super useful!).

Airflow the perfect match in our analytics pipeline

by Sergio Fandino
For three years we at LOVOO, a market-leading dating app, have been using the Google Cloud managed version of Airflow, a product we’ve been familiar with since its Alpha release. We took a calculated risk and integrated the Alpha into our product, and, luckily, it was a match. Since then, we have been leveraging this software to build out not only our data pipeline, but also boost the way we do analytics and BI.

July 16, 2020, 16:00 UTC

Host: Bay Area

Data engineering hierarchy of needs

by Angel Daz
Data infrastructures look different across small, mid-sized, and large companies. Yet, most content out there is about large and sophisticated systems, and almost none of it is about migrating legacy, on-prem databases over to the cloud. In order to better explain the evolving needs of data engineering organizations, we will review a data engineering hierarchy of needs.

What open source taught us about business

by Karolina Rosol and Maciej Oczko
We will share our journey from a mobile app studio to an OSS oriented partner, as well as the challenges and practical insights into managing open source projects in our company.

Effective Cross-DAG dependency

by Rafael Ribaldo and Lucas Mendes Mota da Fonseca
Cross-DAG dependencies can reduce cohesion in data pipelines and, without an explicit solution in Airflow or in a third-party plugin, those pipelines tend to become complex to handle. That is why we at QuintoAndar created an intermediate DAG, called Mediator, to handle relationships across data pipelines so that they remain scalable and maintainable by any team. At QuintoAndar we seek automation and modularization in our data pipelines and believe that breaking them into many single-responsibility modules (DAGs) enhances maintainability, reusability, and the understanding needed to move data from one point to another.
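
For readers unfamiliar with the underlying primitives, here is a generic, hypothetical sketch of cross-DAG dependencies using ExternalTaskSensor and TriggerDagRunOperator (Airflow 1.10.x import paths); it is not QuintoAndar's Mediator implementation, only the building blocks such a DAG typically composes:

```python
# Sketch: an intermediate DAG that waits on one pipeline and triggers another.
from datetime import datetime

from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator     # Airflow 1.10.x path
from airflow.sensors.external_task_sensor import ExternalTaskSensor     # Airflow 1.10.x path

with DAG(
    dag_id="mediator_example",
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Wait for an upstream DAG's task to finish for the same execution date.
    wait_for_raw_data = ExternalTaskSensor(
        task_id="wait_for_raw_data",
        external_dag_id="ingest_raw_data",     # hypothetical upstream DAG
        external_task_id="load_to_lake",       # hypothetical upstream task
    )

    # Then kick off the downstream pipeline.
    trigger_reporting = TriggerDagRunOperator(
        task_id="trigger_reporting",
        trigger_dag_id="build_reports",        # hypothetical downstream DAG
    )

    wait_for_raw_data >> trigger_reporting
```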

July 17, 2020, 16:00 UTC

Host: Warsaw

Airflow: A beast character in the gaming world

by Naresh Yegireddi and Patricio Garza
A pioneer for the past 25 years, Sony PlayStation has played a vital role in the interactive gaming industry. With over 100 million monthly active users, more than 100 million PS4 console sales, and thousands of game development partners across the globe, big-data problems are inevitable. This presentation talks about how we scaled Airflow horizontally, which has helped us build a stable, scalable, and optimal data processing infrastructure powered by Apache Spark, AWS ECS, EC2, and Docker.

Machine Learning with Apache Airflow

by Daniel Imberman
This talk will discuss how to build an Airflow-based data platform that can take advantage of popular ML tools (Jupyter, TensorFlow, Spark) while remaining easy to manage and monitor. As the field of data science grows in popularity, companies find themselves in need of a single common language that can connect their data science teams and data infrastructure teams. Data scientists want rapid iteration, infrastructure engineers want monitoring and security controls, and product owners want their solutions deployed in time for quarterly reports.

Achieving Airflow Observability

by Evgeny Shulman
Identify issues in a fraction of the time and streamline root cause analysis for your DAGs. Airflow is the leading orchestration platform for data engineers. But when running Airflow at production scale, many teams have bigger needs for monitoring jobs, creating the right level of alerting, tracking problems in data, and finding the root cause of errors. In this talk we will cover our suggested approach to gaining Airflow observability so that you have the visibility you need to be productive.

Sponsors: Platinum, Gold, Silver, Community