Apache Spark Declarative Pipelines in Action
Session Abstract
Learn Spark 4.1’s brand-new Declarative Pipelines, a paradigm shift replacing imperative code with simple declarations. We’ll build a real-time data pipeline together, processing streaming ADS-B flight data from tens of thousands of aircraft overhead.
Session Description
Spark Declarative Pipelines: Building Data Workflows with Spark 4.1’s Game-Changing Feature
Apache Spark 4.1 introduces Spark Declarative Pipelines (SDP), a fundamental shift in how data engineers design and maintain complex data workflows. This hands-on session provides a comprehensive introduction to SDP, demonstrating how declarative pipeline definitions can replace traditional imperative Spark code for common data pipeline patterns.
I will present a live example using an open-source PySpark data source I built with the OpenSky founders from Oxford and ETH Zurich. In just a few lines of code, you'll create a continuous data pipeline with streaming tables ingesting real ADS-B flight data from aircraft overhead, from tiny Cessnas to massive Airbus A380s. There is no complex "glue code" for incremental ingestion: you define what your pipeline should do, and Spark figures out how to do it.
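To make the "few lines of code" claim concrete, here is a rough sketch of what such a pipeline definition might look like. This is an illustration only: the `opensky` data source name, the column names, and the exact decorator names are assumptions based on the general shape of the Spark 4.x Declarative Pipelines Python API, not the session's actual code, and the file is meant to be executed by the Spark pipeline runner rather than as a standalone script.

```python
# Hypothetical SDP pipeline sketch -- names and schema are assumptions.
from pyspark import pipelines as dp
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.getActiveSession()

@dp.table
def flights_raw():
    # Streaming table: Spark manages incremental ingestion,
    # checkpointing, and recovery -- no hand-written glue code.
    return spark.readStream.format("opensky").load()

@dp.materialized_view
def flights_high_altitude():
    # Materialized view derived from the streaming table, kept
    # up to date by the pipeline. 30,000 ft is roughly 9,144 m.
    return (
        spark.read.table("flights_raw")
        .where(F.col("baro_altitude") > 9144)
    )
```

The key design point is that the functions declare *what* each dataset is; the pipeline engine works out execution order, incremental refresh, and state management from the dependency graph.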
Using streaming tables and materialized views, we'll layer AI-powered analytics on top, turning natural-language questions like "Show me flights above 30,000 feet over California" into instant SQL queries against live crowdsourced IoT data. I'll demonstrate in a forever-free cloud environment so that every attendee can replicate the example hands-on. Attendees will leave with the practical knowledge to start experimenting with SDP immediately, along with best practices for modernizing their pipeline development.
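For a sense of what the natural-language layer produces, a question like "Show me flights above 30,000 feet over California" might translate into a query along these lines. The table and column names, the metric altitude convention, and the bounding box are all assumptions for illustration, not the session's actual schema:

```sql
-- Hypothetical SQL for: "Show me flights above 30,000 feet over California"
SELECT callsign, baro_altitude, latitude, longitude
FROM flights_raw
WHERE baro_altitude > 9144                  -- ~30,000 ft; ADS-B altitudes in metres
  AND latitude  BETWEEN  32.5 AND  42.0    -- rough bounding box for California
  AND longitude BETWEEN -124.5 AND -114.1;
```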