From not knowing Python (let alone Airflow), and from submitting the first PR that fixes typo to becoming Airflow Committer, PMC Member, Release Manager, and #1 Committer this year, this talk walks through Kaxil’s journey in the Airflow World.
The second part of this talk explains:
by Kaxil Naik
Learn to contribute to the Apache Airflow ecosystem both with and without code. Post an article to the Airflow blog, improve documentation, or dive head-first into into Airflow’s free and open source software community.https://airflowsummit.org/live
You might have heard some recent news about ransomware attacks for many companies. Quite recently the U. S. Department of Justice has elevated the priority of investigations of ransomware attacks to the same level as terrorism. Certainly security aspects of running software and so called “supply-chain attacks” have made a press recently.
Also, you might have read recently about security researcher who made USD 13,000 via bounties by finding and contacting companies that had old, un-patched versions of Airflow - even if the ASF security process was great and PMC of Airflow has fixed those long time ago.
If any of this rings a bell, then this session is for you.
In this session Dolev (security expert and researchers who submitted security issues recently to Airflow), Ash and Jarek (Airflow PMC members) will discuss the state of security and best practices for keeping your Airflow secure and why it is important. The discussion will be moderated by Tomasz Urbaszek, Airflow PMC member.
You can get a glimpse of what they will talk about through this blog post.https://airflowsummit.org/live
by Tomasz Urbaszek, Ash Berlin-Taylor, Jarek Potiuk & Dolev Farhi
Rachael, a new Airflow contributor, and Leah, an experienced Airflow contributor, share the story of Rachael’s first contribution, highlighting the importance of contributions from new users and the positive impact that non-code contributions have in an open source community.https://airflowsummit.org/live
by Rachael Deacon-Smith & Leah Cole
A discussion with Jay Sen, Data Platform Architect at Paypal, and Ry Walker, Founder/CTO of Astronomer about the central role Airflow plays within Paypal’s data platform, and the opportunity to build stronger integrations between Airflow and other tools that surround it.https://airflowsummit.org/live
by Ry Walker & Jay Sen
Running a platform where different business units at Apple can run their workloads in isolation and share operators.https://airflowsummit.org/live
by Roberto Santamaria & Howie Wang
Digital transformation, application modernization, and data platform migration to the cloud are key initiatives in most enterprises today. These initiatives are stressing the scheduling and automation tools in these enterprises to the point that many users are looking for better solutions. A survey revealed that 88% of users believe that their business will benefit from an improved automation strategy across technology and business. Airflow has an excellent opportunity to capture mindshare and emerge as the leading solution here. At Unravel, we are seeing the trend where many of our enterprise customers are at various stages of migrating to Airflow from their enterprise schedulers or ETL/ELT orchestration tools like Autosys, Informatica, Oozie, Pentaho, and Tidal.
In this talk, we will share lessons learned and best practices found in the entire pipeline migration life-cycle which includes:
(i) The evaluation process which led to picking Airflow, including certain aspects where Airflow can do better
(ii) The challenges in discovering and understanding all components and dependencies that need to be considered in the migration
(iii) The challenges arising during the pipeline code and data migration, especially, in getting a single-pane-of-glass and apples-to-apples views to track the progress of the migration
(iv) The challenges in ensuring that the pipelines that have been migrated to Airflow are able to perform and scale on par or better compared to what existed previouslyhttps://airflowsummit.org/live
by Shivnath Babu & Hari Nair
This talk will cover the adoption journey (Technical Challenges & Team Organization) of Apache Airflow (1.8 to 2.0) at Societe Generale.
Time line of events:
by Ahmed Chakir Alaoui & Alaeddine Maaoui
At QuintoAndar we seek automation and scalability in our data pipelines and believe that Airflow is the right tool for giving us exactly what we need. However, having all concerns mapped and tooling defined doesn’t necessarily mean success.
For months we had struggled with a misconception that Airflow should act as an orchestrator and executor within a monolithic strategy. That could not be further from the truth because of the rise of scalability and performance issues, infrastructure and maintainability costs, and multi-directional impact throughout development teams.
Employing Airflow, though, as an orchestration-only solution may help teams deliver value to end users in a more efficient, reliable and performant manner, where data pipelines can be executed anywhere with proper resources and optimizations.
Those are the reasons we have shifted from an orchestrate-execute strategy to an orchestrate-only one, in order to leverage the full power of data pipeline management in Airflow. Straightaway the separation of data processing and pipeline coordination brought not only a finer resource tuning and better maintainability, but also a tremendous scalability on both ends.https://airflowsummit.org/live
by Rafael Ribaldo & Lucas Fonseca
We are witnessing a rapid growth in the number of mission-critical data pipelines that leaders of data products are responsible for. “Are your data pipelines healthy?” This question was posed to more than 200 leaders of data products from various industries. The answers ranged from “unfortunately, no” to “they are mostly fine, but I am always afraid that something or the other will cause a pipeline to break”.
This talk presents the concept of Pipeline HealthCheck (PHC) which enables leaders of data products to have high confidence in the correctness, performance, and cost efficiency of their data pipelines. More importantly, PHC enables leaders of data products as well as their development and operations teams to have high confidence in their ability to quickly detect, troubleshoot, and fix problems that make data pipelines unhealthy. The talk also includes a demo of how PHC helps handle common problems in data pipelines like incorrect results, missing SLAs, and overshooting cost budgets.https://airflowsummit.org/live
by Shivnath Babu
Participation in this workshop requires previous registration and has limited capacity. Get your ticket at https://ti.to/airflowsummit/2021-contributor
By attending this workshop, you will learn how you can become a contributor to the Apache Airflow project. You will learn how to setup a development environment, how to pick your first issue, how to communicate effectively within the community and how to make your first PR - experienced committers of Apache Airflow project will give you step-by-step instructions and will guide you in the process. When you finish the workshop you will be equipped with everything that is needed to make further contributions to the Apache Airflow project.
In preparation for the class, please make sure you have set up the following prerequisites:
./breeze --python 3.6
Apache Airflow is known to be a great orchestration tool that enables use cases that would not be possible otherwise. One of the great features that Airflow has is the possibility to “glue” together totally separate services to establish bigger functionalities. In this talk you will learn about various Airflow usages that let Airflow users to automate their critical company processes and even establish businesses. The examples provided will be based on Airflow used in the context of Cloud Composer which is a managed service to provision and manage Airflow instances.https://airflowsummit.org/live
by Rafal Biegacz & Filip Knapik
Last year, we were able to share why we have selected Airflow to be our next generation workflow system. This year, we will dive into the journey of migrating over 3000+ workflows and 45000+ tasks to Airflow. We will discuss the infrastructure additions to support such loads, the partitioning and prioritization of different workflow tiers defined in house, the migration tooling we built to get users to onboard, the translation layers between our old DSLs and the new, our internal k8s executor to leverage Pinterest’s kubernetes fleet, and more. We want to share the challenges both technically and usability wise to get such large migrations over the course of a year, and how we overcame it to successfully migrate 100% of the workflows to our inhouse workflow platform branded Spinner.https://airflowsummit.org/live
by Ace Haidrey, Euccas Chen, Yulei Li, Dinghang Yu & Ashim Shrestha
Airflow has a lot of moving parts, and it can be a little overwhelming as a new user - as I was not too long ago.
Join me as we go though Airflow’s architecture at a high level, explore how DAGs work and run, and look at some of the good, the bad, and the unexpected things lurking inside.https://airflowsummit.org/live
by Andrew Godwin
We will describe how we were able to build a system in Airflow for MySQL to Redshift ETL pipelines defined in pure Python using dataclasses. These dataclasses are then used to dynamically generate DAGs depending on pipeline type. This setup allows us to implement robust testing, validation, alerts, and documentation for our pipelines. We will also describe the performance improvements we achieved by upgrading to Airflow 2.0.https://airflowsummit.org/live
by Madison Swain-Bowden
This presentation will detail how Elyra creates Jupyter Notebook, Python and R script- based pipelines without having to leave your web browser. The goal of using Elyra is to help construct data pipelines by surfacing concepts and patterns common in pipeline construction into a familiar, easy to navigate interface for Data Scientists and Engineers so they can create pipelines on their own. In Elyra’s Pipeline Editor UI, portions of Apache Airflow’s domain language are surfaced to the user and either made transparent or understandable through the use of tooltips or helpful notes in the proper context during pipeline construction. With these features, Elyra can rapidly prototype data workflows without the need to know or write any pipeline code. Lastly, we will look at what features we have planned on our roadmap for Airflow, including more robust Kubernetes integration and support for runtime specific components/operators. Project Home: https://github.com/elyra-ai/elyrahttps://airflowsummit.org/live
by Alan Chin
As the Apache Airflow project grows, we seek both ways to incorporate rising technologies and novel ways to expose them to our users. Ray is one of the fastest-growing distributed computation systems on the market today. In this talk, we will introduce the Ray decorator and Ray backend. These features, built with the help of the Ray maintainers at Anyscale, will allow Data Scientists to natively integrate their distributed pandas, XGBoost, and TensorFlow jobs to their airflow pipelines with a single decorator. By merging the orchestration of Airflow and the distributed computation of Ray, this coordination of technologies opens Airflow users to a whole host of new possibilities when designing their pipelines.https://airflowsummit.org/live
by Daniel Imberman
Airflow scheduler uses DAG definitions to monitor the state of tasks in the metadata database, and triggers the task instances whose dependencies have been met. It is based on state of dependencies scheduling. The idea of event based scheduling is to let the operators send events to the scheduler to trigger a scheduling action, such as starting jobs, stopping jobs and restarting jobs. Event based scheduling allows potential support for richer scheduling semantics such as periodic execution and manual trigger at per operator granularity.https://airflowsummit.org/live
by Wuchao Chen
Cloudflare’s network keeps growing, and that growth doesn’t just come from building new data centers in new cities. We’re also upgrading the capacity of existing data centers by adding newer generations of servers — a process that makes our network safer, faster, and more reliable for our users.
In this talk, I’ll share how we’re leveraging Apache Airflow to build our own Provision-as-a-Service (PraaS) platform and cut by 90% the amount of time our team spent on mundane operational tasks.https://airflowsummit.org/live
by Jet Mariscal
In this talk, we present Viewflow, an open-source Airflow-based framework that allows data scientists to create materialized views in SQL, R, and Python without writing Airflow code.
We will start by explaining what problem does Viewflow solve: writing and maintaining complex Airflow code instead of focusing on data science. Then we will see how Viewflow solves that problem. We will continue by showing how to use VIewflow with several real-world examples. Finally, we will see what the upcoming features of Viewflow are!
by Gaëtan Podevijn
This talk aims to share how Airflow’s secrets backend works, and how users can create their custom secret backends for their specific use cases & technology stack.https://airflowsummit.org/live
by Xiaodong DENG
Considering that the role of Analytics Engineering has emerged in the last few years within data and analytics teams, it is important for me to highlight what role an Analytics engineer has and how the Dos and Don’ts from my perspective can contribute to a team and boost their day-to-day work with the help of Airflow.https://airflowsummit.org/live
by Sergio Camilo Fandiño Hernández
As part of my role at Google, maintaining samples for Cloud Composer, hosted managed Airflow, is crucial. It’s not feasible for me to try out every sample every day to check that it’s working. So, how do I do it?
Automation! While I won’t let the robots touch everything, they let me know when it’s time to pay attention. Here’s how:
Step 0: An update for the operators is released Step 1: A GitHub bot called Renovate Bot opens up a PR to a special requirements file to make this update Step 2: Cloud build runs unit tests to make sure none of my DAGs immediately break Step 3: PR is approved and merged to main Step 4: Cloud Build updates my dev environment Step 5: I look at my DAGs in dev to make sure all is well. If there is a problem, I need to resolve it manually and revert my requirements file. Step 6: I manually update my prod PyPI packages
I’ll discuss what automation tools I choose to use and why, and the places where I intentionally leave manual steps to ensure proper oversight.https://airflowsummit.org/live
by Leah Cole
Reproducibility is the fundamental principle of a scientific research. This also applies to the computational workflows that are used to process research data. Common Workflow Language (CWL) is a highly formalized way to describe pipelines that was developed to achieve reproducibility and portability of computational analysis. However, there were only few workflow execution platforms that could run CWL pipelines. Here, we present CWL-Airflow – an extension for Airflow to execute CWL pipelines. CWL-Airflow serves as a processing engine for Scientific Data Analysis Platform (SciDAP) – a data analysis platform that makes complex computational workflows both user-friendly and reproducible. In our presentation we are going to explain why we see Airflow as the perfect backend for running scientific workflows, what problems we encountered in extending Airflow to run CWL pipelines and how we solved them. We will also discuss what are the pros and cons of limiting our platform to CWL pipelines and potential applications of CWL-Airflow outside the realm of biology.https://airflowsummit.org/live
by Nick Luckey & Michael Kotliar
Astronomer founders Ry Walker and Greg Neiheisel will preview the upcoming next-gen Astronomer Cloud product offering.https://airflowsummit.org/live
Wise (previously TransferWise) is a London-based fin-tech company. We build a better way of sending money internationally.
At Wise we make great use of Airflow. More than 100 data scientists, analysts and engineers use Airflow every day to generate reports, prepare data, (re)train machine learning models and monitor services.
My name is Alexandra, I’m a Machine Learning Engineer at Wise. Our team is responsible for building and maintaining Wise’s Airflow instances. In this presentation I would like to talk about three main things, our current setup, our challenges and our future plans with Airflow. We are currently transitioning from a single centralised Airflow instance into many segregated instances to increase reliability and limit access. We’ve learned a lot throughout this journey and looking to share these learnings with a wider audience.https://airflowsummit.org/live
by Alexandra Abbas
At Fivetran, we are seeing many organizations adopt the Modern Data Stack to suit the breadth of their data needs. However, as incoming data sources begin to scale, it can be hard to manage and maintain the environment, with more time spent repairing and reengineering old data pipelines than building new ones. This talk will introduce a number of new Airflow Providers, including the airflow-provider-fivetran, and discuss some of the benefits and considerations we are seeing data engineers, data analysts, and data scientist experience in doing so.https://airflowsummit.org/live
by Nick Acosta
Airflow 2.0 was a big milestone for the Airflow community. However, companies and enterprises are still facing difficulties in upgrading to 2.0.
In this talk, I would like to focus and highlight the ideal upgrade path and talk about
by Kaxil Naik
We’ve all heard the phrase “data is the new oil.” But really imagine a world where this analogy is more real, where problems in the flow of data - delays, low quality, high volatility - could bring down whole economies? When data is the new oil with people and businesses similarly reliant on it, how do you avoid the fires, spills, and crises?
As data products become central to companies’ bottom line, data engineering teams need to create higher standards for the availability, completeness, and fidelity of their data.
In this session we’ll demonstrate how Databand helps organizations guarantee the health of their Airflow pipelines. Databand is a data pipeline observability system that monitors SLAs and data quality issues, and proactively alerts users on problems to avoid data downtime.
The session will be led by Josh Benamram, CEO and Cofounder of Databand.ai. Josh will be joined by Vinoo Ganesh, an experienced software engineer, system architect, and current CTO of Veraset, a data-as-a-service startup focused on understanding the world from a geospatial perspective.
Join to see how Databand.ai can help you create stable, reliable pipelines that your business can depend on!https://airflowsummit.org/live
by Josh Benamram & Vinoo Ganesh
The scheduler is the core of Airflow, and it’s a complex beast.
In this session we will go through the scheduler in some detail; how it works; what the communication paths are and what processing is done where.https://airflowsummit.org/live
In this talk we’ll see some real world examples from Firebolt customers demonstrating how Airflow is used to orchestrate operational data analytics applications with large data volumes, while keeping query latency low.https://airflowsummit.org/live
by Boaz Farkash
Engineering teams leverage the factory coding pattern to write easy-to-read and repeatable code. In this talk, we’ll outline how data engineering teams can do the same with Airflow by separating DAG declarations from business logic, abstracting task declarations from task dependencies, and creating a code architecture that is simple to understand for new team members. This approach will set analytics teams up for success as team and Airflow DAG sizes grow exponentially.https://airflowsummit.org/live
by Sarah Krasnik
Data quality has become a much discussed topic in the fields of data engineering and data science, and it has become clear that data validation is absolutely crucial to ensuring the reliability of any data products and insights produced by an organization’s data pipelines. This session will outline patterns for combining three popular open source tools in the data ecosystem - dbt, Airflow, and Great Expectations - and use them to build a robust data pipeline with data validation at each critical step.https://airflowsummit.org/live
by Sam Bail
EA Games have very dynamic and federated needs on their data processing pipelines. Many individual studios within EA build and manage the data pipelines for their games iterating rapidly through game development cycles. Developer productivity around orchestrating these pipelines is as critical as providing a robust production quality orchestration service. With these in mind, we re-engineered our Airflow service ground up to cater to our large internal user base (1000s) and internet scale data processing systems (Petabytes of data). This session details the evolution of the use of Airflow at EA Digital Platform from a monolithic multi-tenant instance to an “On-Demand” system where teams and studios create their own dedicated Airflow instance with all the necessary bells-and-whistles required at the click of a button - and allows them to immediately get their data pipelines running. We also elaborate how Airflow is interwoven into a “Self Serve” model for ETL pipelines within our teams with the objective of truely democratizing data across our games.https://airflowsummit.org/live
by Nitish Victor & Yuanmeng Zeng
Using Airflow as our scheduling framework, we ETL data generated by tens of millions of transactions every day to build the backbone for our reports, dashboards, and training data for our machine learning models. There are over 500 (and growing) such ingested and aggregated tables owned by multiple teams that contain intricate dependencies between one another. Given this level of complexity, it can become extremely cumbersome to coordinate backfills for any given table, when also taking into account all its downstream dependencies, aggregation intervals, and data availability. This talk will focus on how we customized and extended Airflow at Adyen to streamline our backfilling operations. This allows us to prevent mistakes and enable our product teams to keep launching fast and iterating.https://airflowsummit.org/live
by Ravi Autar
In this talk, I’ll describe how you can leverage 3 open-source standards - workflow management with Airflow, EL with Airbyte, transformation with DBT - to build your next modern data stack. I’ll explain how to configure your Airflow DAG to trigger Airbyte’s data replication jobs and DBT’s transformation one with a concrete use case.https://airflowsummit.org/live
by Michel Tricot
At Near we work on TBs of Location data with close to real time modelling to generate key consumer insights and estimates for our clients across the globe. We have hundreds of country specific models deployed and managed through airflow to achieve this goal. Some of the workflows that we have deployed our schedule based, some are dynamic and some are trigger based. In this session I would be discussing some of the workflows that are being scheduled and monitored using airflow and the key benefits and also the challenges that we have faced in our production systems.https://airflowsummit.org/live
by Manmeet Kaur
As a follow up for https://airflowsummit.org/sessions/teaching-old-dag-new-tricks/, in this talk, we would like to share a happy ending story on how Scribd fully migrated its data platform to the cloud and Airflow 2.0.
We will talk about data validation tools and task trigger customizations the team built to smooth out the transition. We will share how we completed the Airflow 2.0 migration started with an unsupported MySQL version and metrics to prove why everyone should perform the upgrade. Lastly, we will discuss how large scale backfills (10 years worth of run) are managed and automated at Scribd.https://airflowsummit.org/live
by QP Hou, Kuntal Basu, Stas Bytsko & Dmitry Suvorov
Or how to keep our traditional java application up-to-date on everything big data.
At Adyen we process tens of millions of transactions a day, a number that rises every day. This means that generating reports, training machine learning models or any other operation that requires a bird’s eye view on weeks or months of data requires the use of Big Data technologies.
We recently migrated to Airflow for scheduling all batch operations on our on-premise Big Data cluster. Some of these operations require input from our merchants or our support team. Merchants can for instance subscribe to reports, choose their preferred time zone, and even specify which columns they want included. After generating the reports, these reports then need to become available in our customer portal.
So how do we keep track in our Customer Area which reports have been generated in Airflow? How do we launch ad-hoc backfills when one of our merchants subscribes to a new report? How do we integrate all of this into our existing monitoring pipeline?
This talk will focus on how we have successfully integrated our big data platform with our existing Java web applications and how Airflow (with some simple add-ons) played a crucial role in achieving this.https://airflowsummit.org/live
by Jelle Munk
Multi-tenant Airflow instances can help save costs for an organization. This talk will walk through how we dynamically assigned roles to users based on groups in Active Directory so that teams would have access to DAGs they created in the UI on our multi-tenant Airflow instance. We created our own custom AirflowSecurityManager class in order to achieve this that ultimately ties LDAP and RBAC together.https://airflowsummit.org/live
by Mark Merling & Sean Lewis
In this session we will discuss Amazon Managed Workflows for Apache Airflow (MWAA), how Apache Airflow (and specifically version 2.0) is implemented in the service, best practices for deployment and operations, and the Amazon MWAA team’s commitment to open source usage and contributions.https://airflowsummit.org/live
by John Jackson & Sam Dengler
Apache Airflow aims to speed the development of workflows, but developers are always ready to add bugs here and there.
This talk illustrates a few pitfalls faced while developing workflows at the BBC to build machine learning models. The objective is to share some lessons learned and, hopefully, save others time.
Some of the topics covered, with code examples:
by Tatiana Al-Chueyr Martins
Learn how to use Airflow’s robust ecosystem of providers to construct secure, high-quality DAGs.https://airflowsummit.org/live
by Plinio Guzman
At Snowflake you can imagine we do a lot of data pipelines and tables curating metrics metrics for all parts of the business. These are the lifeline of Snowflake’s business decisions. We also have a lot of source systems that display and make these metrics accessible to end users. So what happens when your data model does not match your system? For example your bookings numbers in salesforce do not match your data model that curates bookings metrics. At snowflake we continued to run into this problem over and over again.
Having this problem we set out to build an infrastructure that would allow users to effortlessly sync the results of their data pipelines with any downstream / upstream system. Allowing us to have a central source of truth in our warehouse. This infrastructure was built on snowflake using airflow and allows a user to begin syncing data with a few details such as model and system to update.
In this presentation we will show you how using airflow and snowflake we are able to use our data pipelines as the source of truth for all systems involved in the business. With this infrastructure we are able to use snowflake models as a central source of truth for all applications used throughout the company. This ensures that any number synced in this way seen by two users is always the same.https://airflowsummit.org/live
by Russell Dervay
Machine Learning models can add value and insight to many projects, but they can be challenging to put into production due to problems like lack of reproducibility, difficulty maintaining integrations, and sneaky data quality issues. Kedro, a framework for creating reproducible, maintainable, and modular data science code, and Great Expectations, a framework for data validations, are two great open-source Python tools that can address some of these problems. Both integrate seamlessly with Airflow for flexible and powerful ML pipeline orchestration. In this talk we’ll discuss how you can leverage existing Airflow provider packages to integrate these tools to create sustainable, production-ready ML models.https://airflowsummit.org/live
by Kenten Danas
If you manage a lot of data, and you’re attending this summit, you likely rely on Apache Airflow to do a lot of the heavy lifting. Like any powerful tool, Apache Airflow allows you to accomplish what you couldn’t before… but also creates new challenges. As DAGs pile up, complexity layers on top of complexity and it becomes hard to grasp how a failed or delayed DAG will affect everything downstream. In this session we will provide a crash course on OpenLineage, an open platform for metadata management and data lineage analysis. We’ll show how capturing metadata with OpenLineage can help you maintain inter-DAG dependencies, capture data on historical runs, and minimize data quality issues.https://airflowsummit.org/live
by Julien Le Dem & Willy Lulciuc
In recent years, the bioinformatics world has seen an explosion in genomic analysis as gene sequencing technologies have become exponentially cheaper. Tests that previously would have cost tens of thousands of dollars will soon run at pennies per sequence. This glut of data has exposed a notable bottleneck in the current suite of technologies available to bioinformaticians. At Drift Biotechnologies, we use Apache Airflow to transition traditionally on-premise large scale data and deep learning workflows for bioinformatics to the cloud, with an emphasis on workflows and data from next generation sequencing technologies.https://airflowsummit.org/live
by Eli Scheele
The two most common user questions at Pinterest are: 1) why is my workflow running so long? 2) why did my workflow fail - is it my issue, or a platform issue? As with any big data organization, the workflow platform is just the orchestrator but the “real” work is done on another layer, managed by another platform. There can be plenty of these, and the challenges of figuring out the root cause of an issue can be mundane and time consuming. At Pinterest, we set out to provide additional tooling in our Airflow webserver to make it a quicker inspection process and provide smart tips such as increased runtime analysis, bottleneck identifying, rca, and an easy way for backfilling. We explore deeper the tooling provided to reduce the admin load, and empower our users.https://airflowsummit.org/live
by Ace Haidrey, Euccas Chen, Yulei Li, Dinghang Yu & Ashim Shrestha
Autoscaling in Airflow - what we learnt based on Cloud Composer case. We would like to present how we approach the autoscaling problem for Airflow running in Kubernetes in Cloud Composer: how we calculate our autoscaling metric, what problem we had for scaling down and how did we solve it. Also we share an ideas on what and how we could improve the current solutionhttps://airflowsummit.org/live
by Mateusz Henc & Anita Fronczak
After performing several experiments with Airflow, we reached the best architectural design for processing text medical records in scale. Our hybrid solution uses Kubernetes, Apache Airflow, Apache Livy, and Apache cTAKES. Using Kubernetes’ containers has the benefit of having a consistent, portable, and isolated environment for each component of the pipeline. With Apache Livy, you can run tasks in a Spark Cluster at scale. Additionally, Apache cTAKES helps with the extraction of information from electronic medical records clinical free-text by using natural language processing techniques to identify codable entities, temporal events, properties, and relations.https://airflowsummit.org/live
by Mikaela Pisani & Anthony Figueroa
Apache Superset is a modern, open-source data exploration & visualization platform originally created by Maxime Beauchemin. In this talk, I will showcase advanced technical Superset features like the rich Superset API, how to version control dashboards using Github, embedding Superset charts in other applications, and more. This talk will be technical and hands-on, and I will share all code examples I use so you can play with them yourself afterwards!https://airflowsummit.org/live
by Srini Kadamati
As people define and publish a DAG, it can be really useful to make it clear how this DAG should behave under different “operating contexts”. Common operating contexts may match your different environments (dev / staging / prod) and/or match your operating needs (quick run, full backfill, test run, …).
Over the years, patterns have emerged around workflow authors, teams and organizations, and little has been shared as to how to approach this. In this talk, we’ll talk about what an “operating context” is, why it’s useful, and describe common patterns and best practices around this topic.https://airflowsummit.org/live
by Maxime Beauchemin
In this talk Jarek and Kaxil will talk about official, community support for running Airflow in the Kubernetes environment.
The full support for Kubernetes deployments was developed by the community for quite a while and in the past users of Airflow had to rely on 3rd-party images and helm-charts to run Airflow on Kubernetes. Over the last year community members made an enormous effort to provide robust, simple and versatile support for those deployments that would respond to all kinds of Airflow users. Starting from official container image, through quick-start docker-compose configuration, culminating in April with release of the official Helm Chart for Airflow.
This talk is aimed for Airflow users who would like to make use of all the effort. The users will learn how to:
by Jarek Potiuk & Kaxil Naik
In Apache Airflow, Xcom is the default mechanism for passing data between tasks in a DAG. In practice, this has been restricted to small data elements, since the Xcom data is persisted in the Airflow metadatabase and is constrained by database and performance limitations.
With the new TaskFlow API introduced in Airflow 2.0, it is seamless to pass data between tasks and the use of Xcom is invisible. However, the ability to pass data is restricted to a relatively small set of data types which can be natively converted in JSON.
This tutorial describes how to go beyond these limitations by developing and deploying a Custom Xcom backend within Airflow to enable the sharing of large and varied data elements such as Pandas data frames between tasks in a data pipeline, using a cloud storage such as Google Storage or Amazon S3.https://airflowsummit.org/live
by Vikram Koka & Ephraim Anierobi
An informal and fun chat about the journey that we took and the decisions that we made in building Amazon Managed Workflows for Apache Airflow. We will talk about
We will leave time at the end for a short AMA/Questions.https://airflowsummit.org/live
by Subash Canapathy & John Jackson