
Streamling: Lightweight, Extensible Streaming on DataFusion

Session Abstract

Apache DataFusion is moving beyond batch into streaming. We built Streamling, a Rust streaming engine that uses DataFusion planning and Arrow RecordBatch streams for real-time SQL/WASM transforms. This talk covers how we built it, highlights key features (FFI plugins, WASM transforms, and dynamic tables), and shares production lessons.

Session Description

Stream processing systems are complex. Our previous platform was Flink-based. We learned a lot from it, but wanted a lighter approach for workloads that do not need distributed stateful processing. At the same time, a growing ecosystem was emerging around Apache DataFusion and Arrow. We built Streamling to explore a specific point in this design space: a production streaming engine that stays intentionally simple, with no distributed shuffle and no stateful joins, and focuses on operational clarity, extensibility, and cloud-native deployment.

Part 1: The Engine Internals

A deep dive into how we extended DataFusion for streaming:

  • Streaming SQL on DataFusion: We use DataFusion’s query planner together with custom TableProvider and ExecutionPlan implementations to process Kafka Avro data as continuous Arrow RecordBatch streams.
  • Checkpoint coordination: A lightweight Chandy–Lamport style protocol (Marker → Ack → Finalizer) that guarantees at-least-once delivery. State is persisted via a pluggable backend system (in-memory, SQLite, or PostgreSQL in production), keeping checkpoint storage decoupled from the engine itself.
  • Runtime extensibility: WebAssembly script transforms (JS/TS via Extism), HTTP handler transforms, and an abi_stable plugin system provide FFI-safe, language-agnostic extension points without requiring engine forks.
  • Dynamic tables: Stateful lookup tables can be populated from streams or updated externally (for example, in Postgres), enabling deduplication and enrichment in SQL through custom UDFs without pipeline restarts.
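To make the checkpoint flow above concrete, here is a minimal, self-contained sketch of a Marker → Ack → Finalizer round as a coordinator state machine. All names (CheckpointMsg, Coordinator, begin, on_ack) are illustrative assumptions, not Streamling’s actual types or protocol details:

```rust
use std::collections::HashSet;

// Hypothetical message types for one checkpoint round:
// a Marker injected at the sources, an Ack from each task once its
// state is persisted, and a Finalize emitted when the round completes.
#[derive(Debug, Clone, PartialEq)]
enum CheckpointMsg {
    Marker { checkpoint_id: u64 },
    Ack { checkpoint_id: u64, task: u32 },
    Finalize { checkpoint_id: u64 },
}

struct Coordinator {
    tasks: HashSet<u32>,   // task ids participating in the pipeline
    acked: HashSet<u32>,   // tasks that have acked the in-flight round
    current: Option<u64>,  // in-flight checkpoint id, if any
}

impl Coordinator {
    fn new(tasks: impl IntoIterator<Item = u32>) -> Self {
        Self {
            tasks: tasks.into_iter().collect(),
            acked: HashSet::new(),
            current: None,
        }
    }

    /// Start a round by broadcasting a Marker to all sources.
    fn begin(&mut self, checkpoint_id: u64) -> CheckpointMsg {
        self.current = Some(checkpoint_id);
        self.acked.clear();
        CheckpointMsg::Marker { checkpoint_id }
    }

    /// Record an Ack; once every task has acked, emit the Finalizer.
    fn on_ack(&mut self, checkpoint_id: u64, task: u32) -> Option<CheckpointMsg> {
        if self.current == Some(checkpoint_id) && self.tasks.contains(&task) {
            self.acked.insert(task);
            if self.acked == self.tasks {
                self.current = None;
                return Some(CheckpointMsg::Finalize { checkpoint_id });
            }
        }
        None
    }
}

fn main() {
    let mut coord = Coordinator::new([1, 2]);
    let _marker = coord.begin(42);
    assert_eq!(coord.on_ack(42, 1), None); // still waiting on task 2
    assert_eq!(
        coord.on_ack(42, 2),
        Some(CheckpointMsg::Finalize { checkpoint_id: 42 })
    );
    println!("checkpoint 42 finalized");
}
```

Because finalization only happens after every task confirms its state is durable, a crash mid-round simply replays from the last finalized checkpoint, which is where the at-least-once guarantee comes from. Where the persisted state actually lands (in-memory, SQLite, or PostgreSQL) is a backend concern the coordinator never sees.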

Part 2: From Engine to Platform

How we designed the system for production cloud deployment:

  • Control/data plane separation: The engine (data plane) is decoupled from orchestration (control plane), enabling both fully managed and BYOC (Bring Your Own Cloud) deployment models.
  • Kubernetes-native lifecycle: Pipeline management (create, pause, resume, restart), resource sizing, secret injection, and namespace isolation.
  • Clean separation: Why defining this boundary early keeps the engine portable and the platform flexible across deployment models.
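One way to picture the control-plane side of this boundary is as a pure state machine over pipeline lifecycle commands, which the control plane then reconciles onto Kubernetes. The sketch below is hypothetical (states, commands, and the transition function are illustrative, not Streamling’s real API):

```rust
// Illustrative pipeline lifecycle for the control plane.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PipelineState {
    Created,
    Running,
    Paused,
}

#[derive(Debug, Clone, Copy)]
enum Command {
    Start,
    Pause,
    Resume,
    Restart,
}

/// Pure transition function: the control plane decides the next state
/// here, and separately reconciles it onto the data plane (Kubernetes
/// resources, secrets, namespaces). Returns None for invalid commands.
fn transition(state: PipelineState, cmd: Command) -> Option<PipelineState> {
    use Command::*;
    use PipelineState::*;
    match (state, cmd) {
        (Created, Start) => Some(Running),
        (Running, Pause) => Some(Paused),
        (Paused, Resume) => Some(Running),
        (Running | Paused, Restart) => Some(Running),
        _ => None,
    }
}

fn main() {
    let s = transition(PipelineState::Created, Command::Start).unwrap();
    assert_eq!(s, PipelineState::Running);
    assert_eq!(transition(s, Command::Pause), Some(PipelineState::Paused));
    println!("lifecycle transitions ok");
}
```

Keeping the decision logic pure like this is one reason the same control plane can drive both fully managed and BYOC deployments: only the reconciliation layer needs to know which cloud it is talking to.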

Key takeaways for the audience:

1. DataFusion is proving to be a versatile foundation for streaming, not just batch. We’ll share a brief overview of the landscape and where different projects sit.
2. You don’t need distributed stateful processing for many streaming workloads. Deliberately scoping down unlocks operational simplicity.
3. Designing a clean control/data plane boundary from day one keeps your architecture flexible for different deployment models.

This talk is aimed at engineers building or evaluating streaming platforms, and anyone exploring DataFusion beyond batch analytics.