Scale
Search
Store

C++ Search for Database Kernels: Built In, Not Bolted On

Session Abstract

IResearch is an Apache 2.0 C++ search engine built to live inside databases. We’ll benchmark it against leading open-source search engines, show why vectorized scoring is the next frontier for information retrieval engines, share the mistakes we made over a decade of development, and explore how database-native search fits modern query execution.

Session Description

There’s a certain irony in building a search engine no one can find. IResearch is an open-source Apache 2.0 C++ search library that has been quietly powering search inside databases since 2015: first behind ArangoSearch, now as the foundation of SereneDB. Instead of becoming yet another standalone search server, it evolved into a library designed to be embedded directly into database kernels. That journey shaped most of the project’s architectural decisions. Some were good, some painful. This talk tells that story honestly.
We’ll start with how IResearch ended up inside databases and what that means in practice: WAL integration, transactional consistency of search indexes, and synchronization with the main storage.
From there, we’ll compare IResearch functionally and architecturally to Lucene and Tantivy. The three engines share Lucene’s lineage but diverge in functionality, index layout, and scoring. Using the search-benchmark-game suite, we’ll put them head to head – not to declare a winner, but to dissect why the numbers look the way they do and trace performance differences back to architectural roots.
Then we’ll turn to how different search engines handle scoring, and where IResearch diverges most sharply. When documents are scored one at a time, significant CPU throughput is left on the table. Recent Lucene versions have begun moving toward block-based evaluation, but scoring remains far from fully vectorized. Some newer approaches instead treat relevance computation as a SIMD-friendly evaluation pipeline, much like a vectorized query execution engine. We’ll walk through how this could work in practice and show the concrete throughput gains it delivers.
Along the way, we’ll be honest about the mistakes we made: architectural bets that didn’t pay off, abstractions that hurt performance in production, integration patterns we had to rip out and rebuild.
Finally, we’ll zoom out to the architectural questions that emerge when search lives inside a database. How do you scale a search index when compute and storage are separated? How does search fit into OLAP query execution: late materialization, joins between search indexes and analytical data, unified query planning? We’ll share what we’ve learned solving these in practice.