Sessions

Sessions – PS DEV

Store

Stream

OTel + Apache Iceberg: The New Standard for Observability

Talk

Observability is moving from vendor stacks to open standards. This talk presents a design where OpenTelemetry provides collection and semantic context, and Apache Iceberg is the data layer for logs, metrics, and traces. We cover portability, governance, agent investigation, and write-path pitfalls: drift, small files, compaction.

Scale

Stories

Constant-Time Aggregations with Star-Tree in OpenSearch

Sandesh Kumar, Shailesh Kumar Singh

Talk

Discover how OpenSearch breaks linear scaling. Inspired by Apache Pinot, the Star-Tree index moves performance dependency from document count to field cardinality. Learn how we extended Lucene’s DocValues to build multi-dimensional materialized views that deliver sub-second analytics on billion-scale datasets for observability workloads.

Operations

Scale

GitOps for n8n: Treating Workflows as Code

Joao Gilberto Magalhaes

Talk

n8n-gitops is an open-source CLI that applies GitOps principles to n8n workflows. This talk shows how workflows can be exported, reviewed, versioned, and deployed from Git instead of manually promoted via the UI. Through a live demo, we explore safer deployments, rollbacks, and lessons learned operating automation as code.

Scale

Store

Stream

Turning the database inside out again

Tom Scott, Roman Kolesnev

Talk

We rethink data systems by putting streams at the center. Expanding on Martin Kleppmann's "Turning the Database Inside Out", this talk shows how Apache Kafka and Apache Iceberg together provide durable storage, indexing, and rich views that eliminate brittle ETL and unify real-time and historical analysis. A new way to see databases and streams.

Operations

Store

Stories

10x CouchDB Performance Gains for a AAA Game Launch

Jan Lehnardt

Talk

All software benchmarks and claims of performance are carefully crafted lies and this talk is no different. Instead of giving you a quick “do steps one, two, three for a magic speedup”, we aim to explain how we arrived at the changes we made and how we rigorously tested those changes to make sure we understand their impact.

Data Science

Operations

Scale

Sunset for the Wild West: Making ML disciplined by default

William Benton

Talk

Many novel machine learning techniques started as clever hacks that just happened to work, but the demands of building real systems can be at odds with this creative culture. Learn about our open-source stack to improve quality-of-life for ML researchers and infrastructure teams alike — and how their concerns aren't as different as you might think.

Data Science

Store

Stories

Building Schema-Free Applications with RDF

Gosia Zagajewska, Mateusz Charytoniuk

Talk

RDF was designed for the semantic web, but it turns out to be a perfect fit for systems where structure emerges from user interaction, not upfront design. This talk covers how to build applications entirely on RDF triples, translate natural language to SPARQL with small, open source language models, and discover implicit knowledge in user input.

Operations

Stories

Stream

Streamling: Lightweight, Extensible Streaming on DataFusion

Xiao Meng, Rafael Aguiar

Talk

Apache DataFusion is moving beyond batch into streaming. We built Streamling, a Rust streaming engine that uses DataFusion planning and Arrow RecordBatch streams for real-time SQL/WASM transforms. This talk covers how we built it, highlights key features (FFI plugins, WASM transforms, and dynamic tables), and shares production lessons.

Operations

Scale

Society, Ethics & Sustainability

Escaping the Cloud: High-Performance AI in your Browser

Johannes Kolbe

Talk

Server-side inference is the bottleneck of modern AI, creating costs and privacy hurdles. But what if the solution is scaling down to the browser? This session investigates Client-Side AI using WebGPU, ONNX Runtime, and Transformers.js. We’ll explore the reality of hardware access, model size, and the 2026 trade-offs of browser-based execution.

Operations

Scale

Store

Floe: Policy-Based Table Maintenance for Apache Iceberg

Neelesh Salian

Talk

Iceberg maintenance procedures work. Orchestrating them across hundreds of tables is the problem. Floe is an open-source system that treats maintenance as policy: glob patterns, schedules, and health-driven triggers that gate operations on real table metrics. Supports 7 catalogs, executes via Spark or Trino.

Operations

Scale

Stream

Beyond the Hype: When Apache Flink Solves Real Problems

Naci Simsek

Talk

When does Apache Flink solve real problems, and when does it just add complexity? Explore use cases where Flink becomes essential, such as fraud detection, CDC, and real-time analytics, versus cases where batch or Kafka Streams suffice. Compare stream engines (Flink, Spark) with platforms (Kafka, Pulsar) to confidently decide when streaming delivers value.

Apache Solr 10: What’s Coming up for Vector Search

Alessandro Benedetti, Ilaria Petreti, Anna Ruggero

Talk

With Apache Solr 10 out, there are plenty of goodies coming up for vector-search aficionados. From scalar and binary quantization to speed up your search and reduce the memory footprint, to early termination and hybrid approaches to navigate the HNSW graph. Join us if you want to learn about the big steps forward of Apache Solr vector search!

Operations

Scale

Society, Ethics & Sustainability

SPRUCE it up! Open Source GreenOps at scale

Julien Nioche

Talk

GreenOps adoption is stalled by missing data from cloud providers. SPRUCE (https://opensourcegreenops.cloud/) is an open-source, scalable platform built on Apache Spark that enriches cloud usage reports with open models to quantify carbon impact, build insights, and help teams reduce both emissions and cloud spend.

Data Science

Operations

The Failures That Don’t Crash: MLOps for AI Agents

Bartosz Mikulski

Talk

This talk takes four reliability patterns from distributed systems and shows what they look like inside an agent architecture. How to shadow-test an agent. Why your circuit breakers need confidence thresholds. What an eval harness looks like when your system is non-deterministic. And why human oversight degrades faster than anyone admits.

Data Science

Stories

When better retrieval makes agents worse

Lester Solbakken

Talk

Agentic systems can break not because information is missing, but because persuasively wrong context gets promoted into action. We examine a recurring pattern: retrieval metrics improve while agent behavior degrades as distractors enter multi-step loops. We show why relevance, reliability, and security are tightly connected in agentic retrieval.

Data Science

Operations

Observability’s Sixth Sense: Detecting Anomalies in Metrics

Diana Todea

Talk

In this talk, we look at anomaly detection as a complementary way of working with metrics. Instead of relying on predefined limits, anomaly detection focuses on identifying behavior that deviates from what is normally observed over time. The focus is on how developers can interpret these signals, where anomaly detection is useful, and where it is not.

Operations

Store

What you should know about constraints in PostgreSQL 18

Gülçin Yıldırım Jelinek

Talk

This talk explains how constraints work in Postgres by exploring the pg_constraint catalog and core concepts like table vs. column constraints, constraint triggers, domains and constraint deferrability through SQL queries. It then covers what’s new in Postgres 18 including temporal keys, NOT NULL as a first-class constraint, NOT ENFORCED and more.

Scale

Beyond Grep: Search for Reliable Coding Agents

Amine GANI, Roudy Khoury

Talk

Coding agents succeed in verifiable loops (compiler + tests), but large repos still expose retrieval weaknesses. This session explores how lexical, structural, and semantic search can provide cleaner context for LLMs. We compare tradeoffs and evaluation approaches to improve reliability without inflating token cost.

Data Science

Scale

Store

Writes, 3 ways: Postgres, Apache Kafka® and Apache Iceberg™

Celeste Horgan

Short Talk

Learning new things is hard, but a useful way to think about new things is by comparing them to things you already know. In this talk, we'll compare writes between 3 different popular data services: Postgres, Apache Kafka and Apache Iceberg. In doing so, we'll learn a bit about the evolution of how we've thought of data storage as developers.

Data Science

Scale

Store

Why Choose One: Multi-Engine Analytics with Apache Wayang

Zoi Kaoudi, Haralampos Gavriilidis

Talk

Choosing the best engine for each data task sounds right, but in modern data stacks doing so requires expertise and effort. Apache Wayang, a recently graduated TLP, addresses this by decoupling logical dataflows from execution engines. From big data platforms to SQL and ML engines, Wayang enables cross-platform execution that maximizes performance.

Operations

Scale

Stream

Event-driven Agents with Complex Event Processing in Flink

Steffen Hoellinger

Talk

Event-driven Agents calling LLMs can be combined with Pattern Recognition and Anomaly Detection in Apache Flink in smart ways to increase cost efficiency, avoid hallucinations and enforce predictable, deterministic behavior. Specifically in a business process context, this architecture provides opportunities for continuous real-time process mining.

Data Science

Stories

Let LLMs Wander: Engineering RL Environments

Stefano Fiorucci

Talk

What if, instead of learning only from examples, Language Models could explore crafted Environments, little worlds where they can act and improve autonomously? Join me to see how Reinforcement Learning Environments work, how to build them with open-source tools, and how to use them to evaluate and train LLMs/Agents.

Data Science

Society, Ethics & Sustainability

Stories

No 0-day required, just target the AI coding assistant!

Leo Visser

Talk

Discover how attackers can manipulate AI coding assistants through hidden text, typosquatting and code errors. Learn to detect concealed instructions and set up trusted dependencies to keep unsafe code out of your environment.

Store

Stream

The Agent Era: How AI Agents Are Reshaping Data Platforms

Monica Sarbu

Panel

AI agents have quietly become some of the most demanding users of modern data platforms, and most of those platforms weren't built with them in mind. In this panel, leaders from Snowflake, Elastic, ClickHouse, and Xata share what agentic workloads actually look like in production: what broke, what had to be rebuilt, and where the architecture is heading.

Data Science

Stories

Building a Local News RAG: The Quest for Trustworthiness

Marcel Dokters

Talk

We will show you how we built a local newspaper RAG and all the problems that came up along the way, from trustworthiness to customer wishes, search optimization, and generation problems. Local villages that LLMs know nothing about, content that is semantically the same, and outdated information are only a part of the journey we made.

Scale

Context-Aware Segments: Solving the “Scatter-Read” Problem

Rishav Sagar, Tejas Shah

Talk

Traditional OpenSearch segments are context-blind, scattering data across multiple segments. We introduce Context-Aware Segments (CAS), an architecture that brings "sharding" logic to the segment level. By enforcing document locality during indexing, we slashed query latency and minimized data footprint through superior pruning and compression.

Operations

Scale

Store

How Apache Iceberg Enables Multi-Engine Data Platforms

Geetha Anne

Talk

The session will cover operational best practices, including metadata management, file sizing, compaction strategies, and performance tuning at scale. Attendees will leave with practical guidance for designing and operating open, flexible, multi-engine data architectures built on Apache Iceberg, enabling faster analytics, lower operational flexibility

Scale

Store

From OLTP to OLAP: Is PostgreSQL Eating Analytics Too?

Daniel Seybold

Short Talk

Can PostgreSQL become a serious analytics engine? With emerging columnar extensions, PostgreSQL is pushing beyond OLTP into OLAP territory. This talk explores the current columnar landscape, architectural trade-offs, and how far PostgreSQL can go compared to analytical engines like ClickHouse.

Scale

Stories

One GPU, Four Retrieval Modes: Multi-Model Search Serving

Filip Makraduli

Talk

Competitive search now needs dense embeddings, sparse vectors, ColBERT, and cross-encoder reranking. Most teams run four separate containers. This talk shows how to serve all four from one process, walks through building a hybrid retrieval pipeline with real benchmark data, and covers where each retrieval mode wins and where it wastes compute.

Stories

Towards Chunk-less RAG

Carles Onielfa

Short Talk

Retrieval-Augmented Generation (RAG) systems rely on pre-chunked documents, tying retrieval to arbitrary boundaries. This talk explores an experimental approach that surfaces semantically relevant text spans, without chunking. We'll share surprising findings and examine whether this technique points toward a viable chunk-free retrieval paradigm.

Store

DuckDB beyond the notebook

Matthias Niehoff

Talk

Most people know DuckDB as a fast analytics tool for notebooks and scripts. But embedded OLAP enables much more: browser-based analytics via WebAssembly, serverless data processing, and lightweight data apps — without heavy infrastructure. This talk shows how DuckDB changes the way we build data-driven applications.

Reviving phonetic algorithms for better search relevance

Pietro Mele, Radu Pop

Short Talk

Fuzzy search is a double-edged sword: it fixes typos but drowns users in noise on large corpora. At INA, we revived ancient phonetic algorithms to improve relevance. This session compares fuzzy vs. phonetic search on a massive archive, showing how "sounding right" beats "spelling close."

People & Community

How to Survive the Vortex of LLM Change

Carmen Iniesta, Carles Onielfa

Short Talk

The LLM ecosystem changes faster than most teams can adapt. This talk shares our experience and the practical lessons we’ve learned while building an intelligent search product in a world where models, tools, and best practices constantly evolve.

People & Community

Society, Ethics & Sustainability

AI Can Contribute. It Can’t Lead.

Lahari Chowtoori

Short Talk

Today, AI writes code, reviews PRs, answers questions. Some communities ban it, others label it. Most will accept it eventually. But AI won't show up to community calls for two years. It won't mentor your next maintainer. We're losing maintainers faster than we're replacing them. Stop fighting it. Start investing in what it can't replace: people.

Scale

Store

C++ Search for Database Kernels: Built In, Not Bolted On

Andrey Abramov

Talk

IResearch is an Apache 2.0 C++ search engine built to live inside databases. We'll benchmark it against leading open-source search engines, show why vectorized scoring is the next frontier for information retrieval engines, share the mistakes we made over a decade of development and explore how database-native search fits modern query execution.

People & Community

Society, Ethics & Sustainability

OpenSearch Software Foundation: 1 Year of Open Governance

Kris Freedain

Short Talk

In this presentation, we will talk through moving a major open source project into a foundation, and the benefits that open governance and a vendor-neutral home have proven through sustained growth in community contributions.

Scale

Stream

What If We’ve Been Scaling Stream Processing Wrong All Along

Hartmut Armbruster

Talk

We’ve normalised extraordinary inefficiency in stream processing. Thousands of events/sec don't justify repartition storms, serialization overhead, and state migration. This talk explores a different path: the Kafka Streams DSL, Flink-like exactly-once semantics, Project Loom, and a challenge to the assumption that stream processing must be distributed.

Data Science

Store

Stream

Kafi Streams: Complex Stream Processing Made Simple

Ralph Matthias Debusmann

Talk

You can finally stop caring about co-partitioning, state stores and eventual consistency. Kafi Streams, built on (Py)DBSP, treats streaming like batch — strongly consistent, no special concepts. An Open Source Python library for the 80% of use cases that don't need extreme scale. Fully incremental stream processing for everyone, from day one.

Stream

Dynamic Broker-Side Filtering for Kafka

David Kjerrumgaard, Álvaro Rodríguez

Talk

KAFKA-6020 has been open for 7 years. This talk demos broker-side filtering for Kafka with sub-millisecond latency (p99 < 25ms). Live demo with working code shows how it reduces network costs, simplifies consumers, and enables new use cases. Real-world validation from financial services and logistics deployments.

Data Science

People & Community

Stories

AI in the physical world: from observation to discovery

Dmitriy Kostunin, Julian von Hoerschelmann-Schliwinski

Short Talk

In 2026, AI is moving beyond digital tasks into the physical world. It increasingly interacts with instruments, experiments, and real-world data. Physicists stand at this frontier, using deep learning, LLMs, and agents to analyze nature itself. What have we learned about AI when it meets reality?

Data Science

Store

Stories

Ultraviolet: Turn Hidden Document Data into an AI Advantage

Alessio Vertemati

Short Talk

Every PDF hides a world of structure, metadata and embedded signals that can silently influence AI-based processing. With ultraviolets, we reveal how those can be exploited for malicious purposes and even become powerful tools for smarter applications. Designing for both humans and machines becomes a vital aspect of AI experience design.

Data Science

Scale

From Legacy Search to Vespa: What a Real PoC Taught Us

André Charton, Valeriia Platonova

Short Talk

For years, Germany’s largest classifieds website relied on a search-first relevance approach because structured data was sparse. This talk shares how we introduced Vespa in the Motors category, enriched signals with embeddings and extracted attributes, and migrated step by step; what worked, what failed, and which lessons only a real PoC reveals.

Scale

Stream

From Inverted Index to Columnar Vectorized Execution Search

Saurabh Singh, Rishabh Kumar Maurya

Talk

Search engines are converging with analytical data systems. This talk explores how columnar data layouts, SIMD-accelerated execution, and bulk-oriented processing are reshaping search internals. We examine where traditional models fall short and how hardware-aware techniques from analytics engines are defining the next search infrastructure.

Data Science

Text-to-Struct: Fine-tuning SLMs for Query Intent

Hugo Jimenez, Sandra Bullón

Talk

Hybrid search fails on complex intent: vector search misses constraints, keywords miss nuance. This talk explores fine-tuning SLMs for 'Query Understanding'—transforming vague inputs into structured requests. Learn to extract metadata, expand terms, and route intent to build a search engine that does the hard work for your users.

Operations

Scale

Stories

Time-Traveling Agents: That Rewind, Retry, Recover

Danish Rehman

Short Talk

Enterprises need AI that won’t hallucinate, break rules, or cause revenue loss. This talk introduces Time-Traveling Agents: LLM systems built on event sourcing and replay, letting teams rewind decisions, inject fixes, and guarantee safe, compliant automation at scale.

Data Science

Operations

Stories

How to Tell If Your Agent Used the Right Stuff

Apurva Misra

Talk

Many so-called “agent failures” are actually context failures in disguise. In this session, we’ll explore how to tell whether your agent truly saw and used the right context, using techniques like tracing and attribution, golden datasets for context-aware evaluation, and targeted probes to test retrieval quality.

Data Science

Circular Dependency Fixes when Bootstrapping a Golden Set

Radu Gheorghe, Rafał Kuć

Short Talk

For a golden set, you need queries. Even if you have them, you can’t judge all docs for each query. Only the top N. How do we rank the top N? See the circular dependency? We’ll talk about ways to untangle it: lexical search, significant terms, training an embedder from scratch, etc. By iteratively refining data and queries, we'll get there.

Data Science

Society, Ethics & Sustainability

Stories

Low-Resource Languages as Stress Tests for NLP Data

Priscilla Lola Adenuga

Short Talk

Low-resource languages expose weaknesses in NLP systems that are often hidden by benchmark data. Drawing on experience annotating fieldwork data, this talk shows how ambiguity and annotation decisions reveal fundamental data quality issues relevant to real-world NLP pipelines.

Data Science

Operations

Stream

Apache Spark Declarative Pipelines in Action

Frank Munz

Talk

Learn Spark 4.1's brand-new Declarative Pipelines, a paradigm shift replacing imperative code with simple declarations. We'll build a real-time data pipeline together, processing streaming ADS-B flight data from tens of thousands of aircraft overhead.

People & Community

Society, Ethics & Sustainability

Mentoring In Open Source in the Age of AI

Tilda Udufo, Busayo Ojo

Talk

Open source mentorship changed overnight with AI tools. Contributors submitted polished code they couldn’t explain, making learning harder to assess. This talk shares what we learned mentoring Outreachy contributors—what failed, what worked, and what we’re still figuring out.

Operations

Store

Stream

Keeping data private in real-time pipelines

Olena Kutsenko

Talk

Real-time data is awesome… until you realize it’s leaking names, emails, and locations. In this talk, you’ll learn how to keep streaming data private, from simple masking to tricks that beat re-identification. All with live demos and some juicy real-world stories.

Operations

Correctness Too Cheap To Meter: Formal Verification and LLMs

Emilie Ma

Talk

Formal methods are powerful tools to verify software systems' correctness and reliability. However, manually writing system specs is time-consuming and hard to maintain. LLMs can help with this burden. We'll share new research into tools to automate formal methods workflows and learnings from how LLMs currently perform.

Scale

Agentic Retrieval: Building Self-Optimizing Search Systems

Skip Everling, Jo Kristian Bergum

Talk

Relevance feedback loops used to take months. AI agents can now compress the process to seconds. This talk explores agentic retrieval: systems where agents adjust scoring models, schema, and indexing in real time. Learn how to build retrieval infrastructure with verifiable APIs that enable agents to optimize their own search context.

Stream

The Three-Body Problem of Inverse Hybrid Search

Ravindra Harige

Talk

When users expect alerts for new products matching an uploaded image, the problem becomes inverse hybrid search. Unlike top-K search, alerting must guarantee fetch-all semantics: zero missed matches across all saved searches, combining vector similarity, boolean filters, and lexical signals. We show why this breaks traditional scaling intuition.

Stories

Zero downtime index upgrade in Apache Solr

Rahul Goswami

Talk

In this talk, we'll explore how Apache Solr introduced the capability to upgrade an index in place with zero downtime. This upgrade path helps prepare the index for a future Solr major version upgrade without needing to recreate the index from source, as is the case with Lucene-based search engines today.
