Modern data platforms rely on real-time pipelines to process and analyze large volumes of streaming events. Apache Airflow is widely used for batch orchestration, but it can also coordinate complex streaming architectures. In this session, we explore how Airflow orchestrates scalable pipelines built with Apache Kafka and Apache Spark running on Kubernetes in cloud environments.
We walk through an architecture where Kafka handles high-throughput event ingestion, Spark processes streaming data for analytics and transformation, and Kubernetes provides scalable infrastructure for distributed workloads. Airflow acts as the orchestration layer, coordinating job scheduling, pipeline dependencies, and operational visibility.
Through practical examples and design patterns, attendees will learn how Airflow integrates with Kubernetes to manage Spark jobs, trigger processing pipelines, and coordinate streaming and batch workloads. We will also discuss monitoring strategies and best practices for operating production-grade streaming pipelines using Airflow, Kafka, Spark, and Kubernetes.
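As a taste of the integration pattern discussed above, the following sketch builds the kind of Kubernetes SparkApplication manifest that an Airflow task can submit through the Spark operator (for example via the `SparkKubernetesOperator` from the `apache-airflow-providers-cncf-kubernetes` package). Airflow keeps ownership of scheduling and dependencies, while Kubernetes runs the Spark driver and executors. All specifics here (namespace, image, job path, resource sizes) are hypothetical placeholders, not values from the session.

```python
# Minimal sketch of a SparkApplication custom resource for the Kubernetes
# Spark operator. In a real pipeline, an Airflow task would submit this
# manifest to the cluster; Airflow tracks the job's lifecycle and wires it
# into the wider DAG of streaming and batch steps.
# All names below (namespace, image, file paths) are illustrative.

def spark_application(name: str, image: str, main_file: str,
                      namespace: str = "spark-jobs") -> dict:
    """Build a SparkApplication manifest for the Kubernetes Spark operator."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",                 # driver runs inside the cluster
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": "3.5.0",
            "driver": {"cores": 1, "memory": "1g"},
            "executor": {"cores": 2, "instances": 3, "memory": "2g"},
        },
    }

# Hypothetical streaming job that consumes Kafka events and writes analytics.
manifest = spark_application(
    name="kafka-events-aggregation",
    image="example.registry/spark-streaming:latest",    # placeholder image
    main_file="local:///opt/jobs/aggregate_events.py",  # placeholder job
)
```

In an Airflow DAG, a manifest like this is typically handed to the operator (or rendered to a YAML application file), so scaling decisions live in the Kubernetes spec while orchestration and retries stay in Airflow.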
Karan Alang | Thought Leader in AI, Cloud & Big Data