Why Choose One: Multi-Engine Analytics with Apache Wayang

Data Science

Scale

Store

Why Choose One: Multi-Engine Analytics with Apache Wayang

Session Abstract

Choosing the best engine for each data task sounds right, but in modern data stacks doing so requires expertise and effort. Apache Wayang, a recently graduated TLP, addresses this by decoupling logical dataflows from execution engines. From big data platforms to SQL and ML engines, Wayang enables cross-platform execution that maximizes performance.

Session Description

Modern analytics pipelines frequently span databases, big data engines, and machine learning frameworks. Connecting these systems manually leads to complex orchestration, high data movement cost, and platform-specific rewrites. This challenge also appears in agent driven workflows where different steps of a task naturally map to different engines.

Apache Wayang is a recently graduated Apache Top Level Project that provides a unified data analytics framework for cross-platform execution. Pipelines are expressed with platform independent operators using Java, Scala, Python, or SQL APIs. A cross-platform optimizer then maps operators to execution backends such as Spark, Flink, JDBC databases, and ML systems, and produces execution plans that may span multiple engines. It models operator and data movement cost and supports runtime re optimization when estimates are wrong. In practice, this lets developers write a pipeline once and run it efficiently across multiple engines without hard-wiring platform choices.

The talk is technical and system focused, aimed at practitioners working with heterogeneous data stacks. It has three parts:

1. Motivation (10-15min)
Why single engine execution is often not enough. Concrete ETL, ML, and agent-based workflows that require multiple systems and create optimization and integration challenges.

2. System architecture and optimizer (20-25min)
Wayang’s platform agnostic plans, operator mappings, cross-platform data movement handling, and stage-based execution model. How the cost-based optimizer inflates plans, evaluates alternatives, and selects mixed engine execution strategies. Brief coverage of SQL, ML, and multi-language UDF support.

3. Project history, status, and next steps (5-10min)
From multi year cross-platform analytics research to Apache and recent Top Level Project graduation. Extensibility for new platforms and current work on improved cost models and optimizer enhancements.

Attendees will gain a practical understanding of how cross-platform analytics can be executed efficiently and how to design pipelines that are not locked to a single processing engine.

Talk

Zoi Kaoudi

Haralampos Gavriilidis

All Speakers

All Sessions

PS DEV