Title |
---|
A New SQLAlchemyCollector and OpenLineageAdapter for Emitting Airflow Lineage Metadata as DAGs Run, by Michael Robinson
Airflow uses SQLAlchemy under the hood but up to this point has not exploited the tool’s capacity to produce detailed metadata about queries, tables, columns, and more. In fact, SQLAlchemy ships with an event listener that, in conjunction with OpenLineage, offers tantalizing possibilities for enhancing the development process – specifically in the areas of monitoring and debugging.
SQLAlchemy’s event system features a Session object and an ORMExecuteState class that can be used to intercept statement executions and emit OpenLineage RunEvents as executions occur.
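A minimal sketch of that mechanism, assuming the openlineage-python client with its transport configured via the environment (e.g. OPENLINEAGE_URL); the job naming and producer URI are illustrative, not the speaker's actual collector:

```python
import uuid
from datetime import datetime, timezone

from sqlalchemy import event
from sqlalchemy.orm import Session
from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient()  # transport resolved from env/config, e.g. OPENLINEAGE_URL

@event.listens_for(Session, "do_orm_execute")
def emit_lineage(orm_execute_state):
    # Called for every statement the Session executes; orm_execute_state is
    # an ORMExecuteState carrying the statement and its parameters.
    client.emit(
        RunEvent(
            eventType=RunState.START,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=Run(runId=str(uuid.uuid4())),
            job=Job(namespace="default", name="sqlalchemy.session.execute"),
            producer="https://example.com/hypothetical-collector",
        )
    )
```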
|
A Single Pane of Glass on Airflow using Astro Python SDK, Snowflake, dbt, and Cosmos, by Luan Moreno Medeiros Maciel & Mateus Oliveira
ETL data pipelines are the bread and butter of data teams that must design, develop, and author DAGs to accommodate the various business requirements.
dbt is becoming one of the most used tools to perform SQL transformations on the Data Warehouse, allowing teams to harness the power of queries at scale.
Airflow users are constantly finding new ways to integrate dbt with the Airflow ecosystem and build a single pane of glass where Data Engineers can manage and administer their pipelines.
|
Airflow - Under the hood, by Utkarsh Sharma
Making a contribution to or becoming a committer on Airflow can be a daunting task, even for experienced Python developers and Airflow users. The sheer size and complexity of the code base may discourage potential contributors from taking the first steps. To help alleviate this issue, this session is designed to provide a better understanding of how Airflow works and build confidence in getting started.
During the session, we will introduce the main components of Airflow, including the Web Server, Scheduler, and Workers.
|
Airflow and Data Mesh - Running ~500 Airflows crunching millions of meals delivered daily, by M. Waqas Shahid
Ever wondered how Airflow could play a pivotal role in a data mesh architecture, hosting thousands of DAGs and hundreds of thousands of daily running tasks? Let’s find out!
Delivery Hero delivers food in 70 countries across 12 different brands and platforms, with thousands of engineers, analysts, and data scientists spread across many countries running analytics and ML services for all these delivered orders. Serving the workflow orchestration needs of such a massive group is a challenge.
|
Airflow at Reddit - How we migrated from Airflow 1 to Airflow 2, by Dave Milmont & Branden West
We would love to speak about our experience upgrading our old Airflow 1 infrastructure to Airflow 2 on Kubernetes, and how we orchestrated the migration of approximately 1,000 DAGs owned by multiple teams in our organization. We had some interesting challenges along the way and can speak about our solutions.
Points we can talk about:
Old Airflow 1 infrastructure and why we decided to move to Kubernetes for Airflow 2.
|
Airflow at Snap, by Zhengyi Liu, Yuri Desyatnik, Han Gan & Nanxi Chen
We will cover how Snap (parent company of Snapchat) has been using Airflow since 2016.
How we built a secure deployment on GCP that integrates with internal tools for workload authorization, RBAC, and more. We made DAG permissions easy for customers to use via k8s workload identity binding and tight UI integration.
How we are migrating 2,500+ DAGs from Airflow 1 on Python 2 to Airflow 2 on Python 3 using tools and automation.
|
Airflow at The Home Depot Canada - Observable orchestration platform for data integration and ML, by Jose Puertos
The purpose of this session is to show how we leverage Airflow in a federated way across all our business units to run a cost-effective platform that accommodates different patterns of data integration, replication, and ML tasks in a flexible way. The platform provides DevOps tuning of DAGs across environments and integrates with our open-source observability strategy, allowing our SREs to have consistent metrics, monitoring, and alerting for data tasks.
|
Airflow Driven Data Lineage In Public Cloud, by Michał Modras
The session will cover the data lineage capabilities of Apache Airflow, how to use them, and the motivations behind them. It will present the technical know-how of integrating data lineage solutions with Apache Airflow, and of provisioning DAG metadata to fuel lineage functionality in a way that is transparent to the user and limits setup friction. It will include Google’s Cloud Composer lineage integration, implemented through Airflow’s current data lineage architecture, and our approach to the lineage evolution strategy.
|
Airflow Executors: Past, Present and Future, by Niko Oliveira
Executors are a core concept in Apache Airflow and an essential piece of the execution of DAGs. They have seen a lot of investment over the years, and there are many exciting advancements that will benefit both users and contributors.
This talk will briefly discuss executors, how they work and what they are responsible for. It will then describe Executor Decoupling (AIP-51) and how this has fully unlocked development of third-party executors.
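As a rough illustration of what AIP-51 unlocks, here is a bare-bones executor skeleton; the method names follow the BaseExecutor interface, while all internals are placeholders:

```python
from airflow.executors.base_executor import BaseExecutor

class MyCompanyExecutor(BaseExecutor):
    """Sketch of a third-party executor; backend wiring is hypothetical."""

    def start(self):
        # Set up connections to the execution backend here.
        pass

    def execute_async(self, key, command, queue=None, executor_config=None):
        # Hand the task command to the backend without blocking the scheduler.
        pass  # e.g. submit `command` to a hypothetical remote queue

    def sync(self):
        # Poll the backend and report state back via self.success()/self.fail().
        pass

    def end(self):
        # Drain in-flight work before shutting down.
        pass
```

Since Airflow 2.6, a full module path such as `my_company.executors.MyCompanyExecutor` can be set as the `executor` option in the `[core]` configuration section.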
|
Apache Airflow and OpenTelemetry, by Dennis Ferruzzi & Howard Yoo
OpenTelemetry is a vendor-neutral, open-source (CNCF) observability framework supported by many vendors industry-wide. It is used for the instrumentation, generation, collection, and export of data within systems, which is then ingested by analytics tools that provide tracing, metrics, and logs. It has long been the plan to adopt the OTel standard within Airflow, allowing builders and users to take advantage of valuable data that could help improve the efficiency, cost, and performance of their systems.
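For context, enabling the OTel metrics integration is, as of Airflow 2.7, a matter of a few settings; the values below are illustrative and option names may vary by version:

```ini
[metrics]
# Send Airflow metrics to an OpenTelemetry collector.
otel_on = True
otel_host = localhost
otel_port = 4318
otel_prefix = airflow
```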
|
Apache Airflow Bad vs. Best Practices In Production, by Bhavani Ravi
Apache Airflow - the open-ended nature of this orchestration tool leaves room for a wide variety of customization.
While this is a good thing, there are no bounds on how the system can or cannot be used, which results in a lot of time wasted on scaling, testing, and debugging when things aren’t set up properly.
In this talk, we will go through a series of factors that data teams need to keep a watch for while setting up an Airflow system.
|
Better Airflow with Metaflow: A Modern Human-centric ML Infrastructure Stack, by Savin Goyal & Ryan Delgado
Airflow is a household brand in data engineering: It is readily familiar to most data engineers, quick to set up, and, as proven by millions of data pipelines powered by it since 2014, it can keep DAGs running.
But with the increasing demands of ML, there is a pressing need for tools that meet data scientists where they are and address two issues: improving the developer experience and minimizing operational overhead.
|
Beyond XComs - A Deep Dive into Passing Data Between Tasks, by Jeff Fletcher
Passing large amounts of data between tasks in Airflow used to be considered an anti-pattern because of the limitations inherent in the way XComs was initially implemented. However, Airflow is both evolving and very customisable, and these limitations are now easily overcome, making the movement of large data between tasks not only possible but preferable.
In this talk I will explore different ways of passing data between tasks, and the pros and cons of each.
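One common way those limitations are overcome is a custom XCom backend that spills payloads to object storage. A minimal sketch, assuming boto3 and an illustrative bucket (exact serialize/deserialize signatures vary slightly across Airflow versions):

```python
import json
import uuid

import boto3
from airflow.models.xcom import BaseXCom

class S3XComBackend(BaseXCom):
    BUCKET = "my-xcom-bucket"  # hypothetical bucket

    @staticmethod
    def serialize_value(value, **kwargs):
        # Store the payload in S3; only the reference lands in the metadata DB.
        key = f"xcom/{uuid.uuid4()}.json"
        boto3.client("s3").put_object(
            Bucket=S3XComBackend.BUCKET, Key=key, Body=json.dumps(value)
        )
        return BaseXCom.serialize_value(f"s3://{S3XComBackend.BUCKET}/{key}")

    @staticmethod
    def deserialize_value(result):
        # Resolve the stored reference back into the real payload.
        ref = BaseXCom.deserialize_value(result)
        bucket, key = ref.removeprefix("s3://").split("/", 1)
        body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
        return json.loads(body.read())
```

Pointing Airflow at it is a single setting: `xcom_backend = path.to.S3XComBackend` under `[core]`.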
|
Change management done right across environments and tools - DAGs, datasets and visualizations, by Maxime Beauchemin
Maxime Beauchemin, creator of Apache Airflow, will share best practices for change management in data teams. Among the topics he will cover are how to do it across different environments, how CI/CD can help, and what the appropriate level of rigor is.
|
Chase The Sun: Build Greener DAGs with VertFlow, by Jack Lockyer-Stevens
In 2022, cloud data centres accounted for up to 3.7% of global greenhouse gas emissions, exceeding those of aviation and shipping.
Yet in the same year, Britain wasted 4 Terawatt hours of renewable energy because it couldn’t be transported from where it was generated to where it was needed.
So why not move the cloud to the clean energy?
VertFlow is an Airflow operator that deploys workloads to the greenest Google Cloud data centre, based on the real-time carbon intensity of electricity grids worldwide.
|
Circumventing Airflow’s Limitations around Multitenancy, by Anthony Kalsatos & Akshay Battaje
A steady rise in users and business-critical workflows poses challenges to development and production workflows. The solution: enable multi-tenancy on our single Airflow instance. We needed to enable teams to manage their Python requirements and to ensure DAGs were insulated from each other. To achieve this, we divided our monolithic setup into three parts: Infrastructure (with common code packaging), Workspace Creation, and CI/CD to manage deployments.
Backstage templates enable teams to create isolated development environments that resemble our production environment, ensuring consistency.
|
Cross Environment Event Based Triggers with Airflow, by Kunal Jain
In an environment with multiple Airflow instances, how we built custom operators and a framework to share events across the instances and trigger DAGs based on those events.
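As one hedged sketch of what such a framework can build on: Airflow's stable REST API exposes a dagRuns endpoint, so a custom operator in one instance can trigger a DAG in another (the URL and credentials below are placeholders):

```python
import requests

def trigger_remote_dag(dag_id: str, conf: dict) -> None:
    # POST to the remote instance's stable REST API to create a DAG run.
    resp = requests.post(
        f"https://airflow.other-env.example.com/api/v1/dags/{dag_id}/dagRuns",
        json={"conf": conf},
        auth=("svc_user", "<password>"),  # placeholder credentials
        timeout=30,
    )
    resp.raise_for_status()
```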
|
DAG parsing optimizations, by Raphaël Vandon
As big Airflow users grow their usage to hundreds of DAGs, parsing them can become a performance bottleneck in the scheduler.
In this talk, we’ll explore how this situation was improved by using caching techniques and pre-processing of DAGs to minimize the overhead of parsing them at runtime.
We’ll also touch on how the performance of the existing code was analyzed to find points of improvement.
We may include a section on how to configure Airflow to benefit from those recent changes, and some tips on how to make DAGs that are quick to parse, but this will not be the core of the talk.
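For reference, these are some of the existing knobs that influence parsing load (shown with illustrative values; defaults differ across Airflow versions):

```ini
[scheduler]
# Minimum seconds before a DAG file is re-parsed.
min_file_process_interval = 30
# Seconds between scans of the DAGs folder for new files.
dag_dir_list_interval = 300
# Number of processes used to parse DAG files.
parsing_processes = 2
```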
|
Data Product DAGs, by Charles Verleyen
This talk will give a high-level overview of the architecture of a data product DAG, its benefits in a data mesh world, and how to implement it easily.
Airflow is the de-facto orchestrator we use at Astrafy for all our data engineering projects. Over the years we have developed deep expertise in orchestrating data jobs, and recently we adopted the “data mesh” paradigm of having one Airflow DAG per data product.
|
Deferrable Operators, by Syed Hussain
A deep dive into how AWS is developing Deferrable Operators for the Amazon Provider Package to help users realize the potential cost savings provided by Deferrable Operators, and to promote their usage.
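For a flavor of the usage being promoted: many Amazon provider operators and sensors accept a deferrable flag, which hands the wait to the triggerer instead of occupying a worker slot. A small sketch with an illustrative bucket and key:

```python
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_file = S3KeySensor(
    task_id="wait_for_file",
    bucket_name="example-bucket",
    bucket_key="incoming/data.csv",
    deferrable=True,  # wait in the triggerer rather than a worker slot
)
```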
|
Delay Modeling and DAG Connectivity: Optimizing Airflow Performance in Large Organizations, by Ahuitz Rojas
In large organizations, data workflows can be complex and interconnected, with multiple dependencies and varied runtime requirements. To ensure efficient and timely execution of workflows, it is important to understand the factors that affect the performance of the system, such as network congestion, resource availability, and DAG structure. In this talk, we will explore how delay modeling and DAG connectivity analysis can be used to optimize Airflow performance in large organizations.
|
Democratizing ML feature store framework at scale with Airflow, by Rafay Aleem & Victoria Varney
Data science and machine learning are at the heart of Faire’s industry-celebrated marketplace (an a16z top-ranked marketplace) and drive powerful search, navigation, and risk functions, powered by ML models trained on 3,000+ features defined by our data scientists.
Previously, defining, backfilling, and maintaining the feature lifecycle was error-prone. A framework built on top of Airflow has empowered our data scientists to maintain and deploy their changes independently.
We will explore:
|
Demystifying Apache Airflow: Separating Facts from Fiction, by Shubham Mehta
Apache Airflow is a popular workflow platform, but it often faces critiques that may not paint the whole picture. In this talk, we will unpack the critiques of Apache Airflow and provide a balanced analysis. We will highlight the areas where these critiques correctly point out Airflow’s weaknesses, debunk common myths, and showcase where competitors like Dagster and Prefect are excelling.
By understanding the pros and cons of Apache Airflow, attendees will be better equipped to make informed decisions about whether Airflow is the right choice for their use cases.
|
Eat, Sleep, Test, Repeat: How King Ensures Always-On Data, by Nathan Hadfield
At King, data is fundamental in helping us deliver the best possible experiences for the players of our games while continually bringing them new, innovative, and evolving gameplay features. Data has to be “always-on”: downtime and accuracy are treated with the same level of diligence as any of our games, and success is measured against internal SLAs.
How is King using ‘data reliability engineering as code’ tools such as Soda Core within Airflow pipelines to detect, diagnose, and inform about data issues, creating coverage, improving quality and accuracy, and helping eliminate data downtime?
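As a hedged sketch of the pattern (not King's actual pipeline code), Soda Core can be invoked programmatically from an Airflow task; the data source name and YAML paths are placeholders:

```python
from airflow.decorators import task

@task
def run_quality_checks():
    from soda.scan import Scan

    scan = Scan()
    scan.set_data_source_name("warehouse")  # placeholder data source
    scan.add_configuration_yaml_file("configuration.yml")
    scan.add_sodacl_yaml_file("checks.yml")
    scan.execute()
    # Fail the task (and trigger alerting) if any check fails.
    scan.assert_no_checks_fail()
```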
|
Empowering Collaborative Data Workflows with Airflow and Cloud Services, by Stanisław Smyl & Hoa Nguyen
Productive cross-team collaboration between data engineers and analysts is the goal of all data teams; however, fulfilling that mission can be challenging given the diverse set of skills each group brings. In this talk we present an example of how one team tackled this by creating a flexible, dynamic, and extensible framework using Airflow and cloud services that allowed engineers and analysts to jointly create data-centric micro-services to serve up projections and other robust analysis for use across the organization.
|
Enabling Data Mesh by Moving from a Monolithic Airflow to Several Smaller Environments, by Filip Kunčar & Stanislav Repka
Kiwi.com started using Airflow in June 2016 as an orchestrator for several people in the company. The need for the tool grew, and the monolithic instance came to be used by 30+ teams with 500+ active DAGs, resulting in 3.5 million tasks per month successfully finished.
At first, the monolithic Airflow environment served us well, but our needs quickly changed as we wanted to support a data mesh architecture within Kiwi.com.
By leveraging Astronomer on GCP, we were able to move from a monolithic Airflow environment to many smaller instances of Airflow.
|
Event-based DAG parsing - no more F5ing in the UI, by Bas Harenslak
Have you ever added a DAG file and had no clue what happened to it? You’re not alone! With default settings, Airflow can wait up to 5 minutes before processing new DAG files.
In this talk, I’ll discuss the implementation of an event-based DAG parser that immediately processes changes in the DAGs folder. As a result, changes are reflected immediately in the Airflow UI. In this talk I will cover:
|
Flexible DAG Trigger Forms (AIP-50), by Jens Scheffler & Christian Schilling
As users of Airflow, we often use DagRun.conf attributes to control the content and flow of a DAG run. Previously, the Airflow UI only allowed launching a run by entering JSON. This was technically feasible but not user friendly: a user needed to model, check, and understand the JSON and enter parameters manually, with no option to validate them before triggering.
Much like Jenkins or GitHub/Azure pipelines, we wanted a UI option to trigger a DAG run via a form with specified parameters.
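A minimal sketch of what AIP-50 enables (Airflow 2.6+): declaring typed params renders a validated form in the Trigger DAG UI. The DAG id and params here are illustrative:

```python
from airflow import DAG
from airflow.models.param import Param

with DAG(
    dag_id="param_form_demo",
    params={
        "environment": Param("staging", enum=["staging", "production"]),
        "batch_size": Param(100, type="integer", minimum=1),
    },
) as dag:
    ...  # tasks read the validated values via {{ params.environment }} etc.
```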
|
Forging the Future: Five Years of Fabricating with Airflow, by Madison Swain-Bowden
As a data engineer, I’ve used Airflow extensively over the last 5 years: across 3 jobs, several different roles; for side projects, for critical infrastructure; for manually triggered jobs, for automated workflows; for IT (Ookla/Speedtest.net), for science (Allen Institute for Cell Science), for the commons (Openverse), for liberation (Orca Collective). Authoring a DAG has changed dramatically since 2018, thanks to improvements in both Airflow and the Python language. In this session, we’ll take a trip back in time to see how DAGs looked several years ago, and what the same DAGs might look like now.
|
From Pain Points to Best Practices: Enhancing Airflow Migrations and Local Development at Wix.com, by Roy Noyman
Are you tired of spending hours on Airflow migrations and wondering how to make them more accessible? Would you like to be able to test your code on different Airflow versions? Or are you struggling to set up a reliable local development environment?
These are some of the top pain points for data engineers working with Airflow. But fear not – Wix Data Engineering has some best practices to share that will make your life easier.
|
Future of the Airflow UI, by Brent Bovenzi
We are continuing to modernize the Airflow UI to make it easier to manage all aspects of your DAGs. See a demo of the latest updates and improve your workflows with new tips and tricks. Then get a preview of what else will be coming soon.
The demo will be followed by a Q&A where people can raise their own use cases and explore new ideas on how to improve the user experience.
|
Guided tour to DAG authoring, by Jed Cunningham
New to Airflow or haven’t followed any of the recent DAG authoring enhancements? This talk is for you!
We will go through various DAG authoring features like Setup/Teardown tasks (~2.7), Datasets (2.4), Dynamic Tasks (2.3) and Async tasks (2.2). You won’t be an expert after this short talk; however, you’ll have a head start when you write your next DAG, no hacks required.
|
How Asurion simplified Workload Orchestration using Airflow at petabyte scale, by Rajesh Gundugollu
Workload orchestration is at the heart of a successful data lakehouse implementation, especially for the “house” part: the data warehouse workloads, which are often complex because of the very nature of warehouse data and its dependency orchestration problems.
We at Asurion have spent years perfecting our Airflow solution to make it a superpower for our Data Engineers. We have innovated in key areas such as a single operator for all use cases, automatic DAG code generation, custom UI components for Data Engineers, and monitoring tools.
|
How to use Data Contracts for Data Quality in your Airflow Ecosystem, by Shirshanka Das
Data contracts have been much discussed in the community of late, with a lot of curiosity around how to approach this concept in practice.
We believe data contracts need a harmonizing layer to manage data quality in a uniform manner across a fragmented stack. We are calling this harmonizing layer the Control Plane for Data - powered by the common thread across these systems: metadata.
For teams already orchestrating pipelines with Airflow, data contracts can be an effective way to process data that meets preset quality standards.
|
Leveraging Dynamic DAGs for Data Ingestion at Bloomberg, by Ivan Sayapin & Yenny Su
Bloomberg’s Data Platform Engineering team powers some of the most valuable business and financial data on which Bloomberg clients rely. We recently built a configuration-driven system that allows non-engineers to onboard alternative datasets into the company’s ecosystem. This system uses Apache Airflow to orchestrate the data flow across different applications and Bloomberg Terminal functions. We are unique in that we have over 1500 dynamic DAGs tailored for each dataset’s needs (which very few Airflow users have).
|
Manifest destiny? Orchestrating dbt using Airflow, by Jonathan Talmi
Airflow is a popular choice for organizations looking to integrate open-source dbt within their existing data infrastructure. This talk will explore two primary methods of running dbt in Airflow: job-level and model-level. We’ll discuss the tradeoffs associated with each approach, highlighting the simplicity and efficiency of job-level orchestration, contrasted with the enhanced observability and control provided by model-level orchestration. We’ll also explain how the balance has shifted in recent years, with improvements to dbt core making model-level more efficient and innovative Airflow extensions like Cosmos making it easier to implement.
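To make the contrast concrete: job-level orchestration can be as simple as one task wrapping the whole project (a sketch, with an illustrative project path), whereas model-level orchestration creates a task per dbt model, for example via Cosmos:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("dbt_job_level", start_date=datetime(2023, 1, 1), schedule="@daily") as dag:
    # The entire dbt project runs (and fails or retries) as a single task.
    BashOperator(
        task_id="dbt_build",
        bash_command="dbt build --project-dir /opt/dbt_project",
    )
```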
|
Mastering Dependencies: The Airflow Way, by Jarek Potiuk
Apache Airflow has over 650 Python dependencies. In case you did not know already, dependencies in Python are a difficult subject. And Airflow has its own, custom ways of managing them.
Airflow has a rather complex system for managing dependencies in its CI system, but this talk is not about that. This talk is directed at users of Airflow who want to keep their dependencies updated, describing the ways they can do it.
|
Micropipelines: A Microservice Approach for DAG Authoring using Datasets, by Vikram Koka
Introduced in Airflow 2.4, Datasets are a foundational feature for authoring modular data pipelines. As DAGs grow to encompass more data sources and multiple data transformation steps, they typically become less predictable in the timeliness of their execution and less efficient.
This talk focuses on leveraging Datasets to enable predictable and more efficient DAGs, by leveraging patterns from microservice architectures. Just as large monolithic applications were decomposed into micro-services to deliver more efficient scalability and faster development cycles, micropipelines have the same potential to radically transform data pipeline efficiency and development velocity.
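A small sketch of the micropipeline pattern with Datasets (Airflow 2.4+); the URI and DAG ids are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

orders = Dataset("s3://example-bucket/orders.parquet")

# Producer micropipeline: marks the Dataset as updated when the task succeeds.
with DAG("produce_orders", start_date=datetime(2023, 1, 1), schedule="@daily") as producer:
    EmptyOperator(task_id="load_orders", outlets=[orders])

# Consumer micropipeline: scheduled by the Dataset, not by a time interval.
with DAG("consume_orders", start_date=datetime(2023, 1, 1), schedule=[orders]) as consumer:
    EmptyOperator(task_id="transform_orders")
```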
|
Multi-tenancy state of the union, by Jarek Potiuk, Mateusz Henc & Vincent Beck
This session is about the current state of implementation of Airflow’s multi-tenancy feature. This is a long-term feature that involves multiple changes and separate AIPs to implement, with the long-term vision of having a single Airflow instance support multiple independent teams - either from the same company or as part of an Airflow-as-a-Service implementation.
|
OpenLineage in Airflow: A Comprehensive Guide, by Maciej Obuchowski
With native support for OpenLineage in Airflow, users can now observe and manage their data pipelines with ease. This talk will cover the benefits of using OpenLineage, how it is implemented in Airflow, practical examples of how to take advantage of it, and what’s on our roadmap. Whether you’re an Airflow user or a provider maintainer, this session will give you the knowledge to make the most of this tool.
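For orientation, with the OpenLineage provider installed, a minimal setup is a transport and a namespace; the values below are illustrative and option names follow the provider's documentation, which may vary by version:

```ini
[openlineage]
namespace = my_pipelines
transport = {"type": "http", "url": "http://marquez:5000"}
```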
|
Opportunities to join the Airflow (docs) community, by Laura Zdanski
Open Source doc edits provide a low-stakes way for new users to make a first contribution. Ideally, new users find opportunities and feel welcome to fix docs as they learn, engaging with the community from the start. But I found that contributing docs to Airflow had some surprising obstacles.
In this talk, I’ll share my first docs contribution journey, including problems and fixes. For example, you must understand how Airflow uses Sphinx and know when to choose to edit in the GitHub UI or locally.
|
Platform for genomic processing using Airflow and ECS, by Zohar Donenhirsh & Alina Aven
High-scale orchestration of genomic algorithms using Airflow workflows, AWS Elastic Container Service (ECS), and Docker.
Genomic algorithms are highly demanding of CPU, RAM, and storage. Our data science team requires a platform that facilitates developing and improving proprietary algorithms. The data engineering team develops a research data platform that enables Data Scientists to publish Docker images to Amazon ECR and run them using Airflow DAGs that provision Amazon ECS compute power on EC2 and Fargate.
|
Reducing costs by maximizing Airflow and DAG performance, by John Jackson
Airflow DAGs are Python code (which can pretty much do anything you want), and Airflow has hundreds of configuration options (which can dramatically change Airflow’s behavior). Those two facts contribute to endless combinations that can run the same workloads, but only a precious few are efficient. The rest result in failed tasks and excessive compute usage, costing time and money.
This talk will demonstrate how small changes can yield big dividends, and reveal some code improvements and Airflow configurations that can reduce costs and maximize performance.
|
Reliable Airflow DAG Design When Building a Time-Series Data Lakehouse, by Sung Yun
As a team that has built a Time-Series Data Lakehouse at Bloomberg, we looked for a workflow orchestration tool that could address our growing scheduling requirements. We needed a tool that was reliable and scalable, but also could alert on failures and delays to enable users to recover quickly from them. From using triggers over simple sensors to implementing custom SLA monitoring operators, we explore our choices in designing Airflow DAGs to create a reliable data delivery pipeline that is optimized for failure detection and remediation.
|
Seamless Integration of Apache Airflow into Existing Task Management Workflows, by Serene Ghazi & Andy Weiss
This session will demonstrate how to seamlessly incorporate Apache Airflow into daily task management applications. We will delve into how to integrate Airflow with existing communication and coordination tools with minimal disruption. By doing so, we aim to showcase the practical benefits of adopting a “keep the lights on” approach to workflow management that ensures tasks are completed efficiently and without delays. Through the utilization of Airflow, we can promptly deliver solutions while maintaining a streamlined stack devoid of the need for supplementary monitoring tools.
|
Simplifying the Creation of Data Science Pipelines with Airflow, by Soren Archibald, Brian Donaghy & Jay Thomas
The ability to create DAGs programmatically opens up new possibilities for collaboration between Data Science and Data Engineering. Engineering and DevOps are typically incentivized by stability, whereas Data Science is typically incentivized by fast iteration and experimentation. With Airflow, it becomes possible for engineers to create tools that allow Data Scientists and Analysts to create robust no-code/low-code data pipelines for feature stores.
We will discuss Airflow as a means of bridging the gap between data infrastructure and modeling iteration, and examine how a Qbiz customer did just this by creating a tool that allows Data Scientists to build features, train models, and measure performance, using cloud services, in parallel.
|
Streamlining Data Processing for Tableau Dashboards with Airflow: Gojek Case Study, by Wanda Kinasih
With millions of orders per day, Gojek needs a data processing solution that can handle a high volume of data. Airflow is a scalable tool that can handle large volumes of data and complex workflows, making it an ideal solution for Gojek’s needs.
With Airflow, we can create automated data pipelines to extract data from various sources, transform it, and load it into dashboards such as Tableau for analysis and visualization.
|
Supercharge Productivity of Airflow Users at Coinbase, by Jianlong Zhong
At Coinbase, Airflow is adopted by a wide range of applications, and used by nearly all the engineering and data science teams. In this session, we will share our journey in improving the productivity of Airflow users at Coinbase. The presentation will focus on three main topics:
Monorepo-based architecture: our approach of using a monorepo to simplify DAG development and enable developers from across the company to work more efficiently and collaboratively.
|
Supporting the vast Airflow community: Lessons learned from over 100 Airflow webinars, by Kenten Danas
Kenten Danas will share some of the key learnings gathered from 2.5 years of conducting webinars aimed at supporting the community in growing their Airflow use, including how to best cater DevRel efforts to the many different types of Airflow users and how to effectively push for the adoption of new Airflow features.
|
Testing Airflow DAGs with Dagtest, by Victor Chiapaikeo
For the DAG owner, testing Airflow DAGs can be complicated and tedious. Kubectl cp your DAG from local to a pod, exec into the pod, and run a command? Install Breeze? Why pull the Airflow image and start up the webserver / scheduler / triggerer if all we want is to test the addition of a new task?
It doesn’t have to be this hard. At Etsy, we’ve simplified testing DAGs for the DAG owner with dagtest.
|
The Future of Airflow: A Panel Discussion, by Viraj Parekh, Kaxil Naik, Jarek Potiuk, Ash Berlin-Taylor & Marc Lamberti
Airflow is almost 10 years old! Since starting out at Airbnb, the project has taken all sorts of twists and turns before getting to where it is now. Throughout its lifecycle, Airflow has seen an explosion of contributors (over 2,400 and counting), end users, use cases, and so much more.
This panel will be about the future of Airflow and where it’s going. We’ll hear from some of the oldest, and newest, faces in the community about the project’s direction and vision as we turn towards the next decade of the project’s lifecycle.
|
The Why and How of running a self-managed Airflow on Kubernetes, by Parnab Basak
Today, all major cloud service providers and third-party providers include Apache Airflow as a managed service offering in their portfolios. While these cloud-based solutions help with the undifferentiated heavy lifting of environment management, some data teams are also looking to operate self-managed Airflow instances to satisfy specific differentiated capabilities. In this session, we will talk about:
1. Why you might need to run self-managed Airflow
2. The available deployment options (with emphasis on Airflow on Kubernetes)
3.
|
Things to Consider When Building an Airflow Service, by Viraj Parekh & Pete DeJoy
Data platform teams often find themselves in a situation where they have to provide Airflow as a service to downstream teams, as more users and use cases in their organization require an orchestrator. In these situations, giving each team its own Airflow environment can unlock velocity and actually be lower overhead to maintain than a monolithic environment.
This talk will be about things to keep in mind when building an Airflow service that supports several environments, personas of users, and use cases.
|
To Debug a DAG: The Airflow Local Dev Story, by Daniel Imberman
As much as we love Airflow, local development has been a bit of a white whale through much of its history. Until recently, Airflow’s local development experience was hindered by the need to spin up a scheduler and webserver. In this talk, we will explore the latest innovation in Airflow local development, namely the “dag.test()” functionality introduced in Airflow 2.5.
We will delve into practical applications of “dag.test()”, which empowers users to locally run and debug Airflow DAGs in a single Python process.
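A minimal sketch of the workflow: with dag.test(), running the DAG file itself executes the whole DAG in one local process, so ordinary debuggers and breakpoints work. The DAG below is illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.decorators import task

with DAG("debug_me", start_date=datetime(2023, 1, 1), schedule=None) as dag:

    @task
    def hello():
        print("hello from a locally debugged task")

    hello()

if __name__ == "__main__":
    # `python this_file.py` runs the DAG end to end in a single process.
    dag.test()
```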
|
Traps and misconceptions of running reliable workloads in Apache Airflow, by Bartosz Jankiewicz
Reliability is a complex and important topic. I will focus on both reliability definition and best practices.
I will begin by reviewing the Apache Airflow components that impact reliability. I will subsequently examine those aspects, showing the single points of failure, mitigations, and tradeoffs.
The journey starts with the scheduling process. I will focus on the aspects of Scheduler infrastructure and configuration that drive reliability improvements. The Scheduler doesn’t run in a vacuum, so I’ll also share my observations on the reliability of the infrastructure around it.
|
Twitch’s Recommendation System: Starring Airflow! by Ritika Jain
Twitch, the world’s leading live streaming platform, has a massive user base of over 140 million active users and an incredibly complex recommendation system to deliver a personalized and engaging experience to its users.
In this talk, we will dive into how Twitch leverages the power of Apache Airflow to manage and orchestrate the training and deployment of its recommendation models. You will learn about the scale of Twitch’s reach and the challenges we faced in building a scalable, reliable, and developer-friendly recommendation system.
|
Unlocking the Power of Warehouse Allocation: Optimizing Task Dispatching for Cost-Effective and Effi, by Ben Chen
In this session, we’ll explore the inner workings of our warehouse allocation service and its many benefits. We’ll discuss how you can integrate these principles into your own workflow and provide real-world examples of how this technology has improved our operations. From reducing queue times to making smart decisions about warehouse costs, warehouse allocation has helped us streamline our operations and drive growth.
With its seamless integration with Airflow, building an in-house warehouse allocation pipeline is simple and can easily fit into your existing workflow.
|
Using Dynamic Task Mapping to Orchestrate dbt, by Pádraic Slattery
Airflow, traditionally used by Data Engineers, is now popular among Analytics Engineers who aim to provide analysts with high-quality tooling while adhering to software engineering best practices. dbt, an open-source project that uses SQL to create data transformation pipelines, is one such tool. One approach to orchestrating dbt with Airflow is to use dynamic task mapping to automatically create a task for each sub-directory inside dbt’s staging, intermediate, and marts directories. This enables analysts to write SQL code that is automatically added as a dedicated task in Airflow at runtime.
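A hedged sketch of that approach using dynamic task mapping (Airflow 2.3+); the directory list and project path are illustrative and could instead be discovered at parse time:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIRS = ["staging", "intermediate", "marts"]

with DAG("dbt_by_directory", start_date=datetime(2023, 1, 1), schedule="@daily") as dag:
    # One mapped task instance per dbt sub-directory, created at runtime.
    BashOperator.partial(
        task_id="dbt_run",
        cwd="/opt/dbt_project",  # hypothetical dbt project location
    ).expand(bash_command=[f"dbt run --select models/{d}" for d in DBT_DIRS])
```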
|
What Everybody Ought to Know about Airflow, by Marc Lamberti
Airflow is a powerful tool for orchestrating complex data workflows, and it has undergone significant changes over the past two years.
Since the Airflow release cycle has accelerated, you may struggle to keep up with the continuous flow of new features and improvements, which can lead to missed opportunities for addressing new use cases or solving your existing ones more efficiently.
This presentation is intended to give you a solid update on the possibilities of Airflow and to address misconceptions you may have heard, or things you still believe that used to be valid but no longer are.
|