Teaching an old DAG new tricks

Presented at Airflow Summit 2020
By QP Hou

Scribd is migrating its data pipeline from an in house system to Airflow. It’s a one big giant data pipeline consisting of more than 1,500 tasks. In this talk, I would like to share couple best practices on setting up a cloud native Airflow deployment in AWS. For those who are interested in migrating a non-trivial data pipeline to Airflow, I will also share how Scribd plans and executes the migration.

Here are some of the topics that will be covered:

  • How to setup a highly available Airflow cluster in AWS using both ECS and EKS with Terraform.
  • How to manage Airflow DAGs across multiple git repositories.
  • How we manage Airflow variables using a custom Airflow Terraform provider.
  • Best practices on monitoring multiple Airflow clusters with Datadog and Pagerduty.
  • How to Airflow to make it feature parity with Scribd’s in house orchestration system.
  • How to plan and execute non-trivial data pipeline migrations. We transcompiled internal DSL to Airflow DAG to simulate what a real run will look like to surface performance issues early in the process.
  • How we fixed an Airflow performance bottleneck so our giant DAG can be properly rendered in Web UI.

For detailed deep dives on some of topics mentioned above, please check out our blog post series at https://tech.scribd.com/tag/airflow-series/

[Slides] (https://docs.google.com/presentation/d/e/2PACX-1vRb-iH5NX2d7m-rQ7WGc6XlRvRCADwXq2hdjRjRuJ5h7e9ybfoUA13ytxpHgx7JG815fIKEE-QKuRUV/pub?start=false&loop=false&delayms=3000)