Scale
Search
Stories

One GPU, Four Retrieval Modes: Multi-Model Search Serving

Session Abstract

Competitive search now needs dense embeddings, sparse vectors, ColBERT, and cross-encoder reranking. Most teams run four separate containers. This talk shows how to serve all four from one process, walks through building a hybrid retrieval pipeline with real benchmark data, and covers where each retrieval mode wins and where it wastes compute.

Session Description

By 2026, nearly every competitive production search system runs multiple models. A dense embedder handles semantic search. A sparse model provides keyword recall. A multi-vector model like ColBERT enables token-level matching. A cross-encoder reranker improves final precision. These four stages have become table stakes for competitive retrieval quality.

The infrastructure story is less elegant. The industry default is one container per model, typically Hugging Face's TEI, NVIDIA Triton, or a custom Flask wrapper. Four models means four separate deployments, four sets of scaling rules, and four GPU allocations, each model using only a fraction of the memory it reserves.

When building SIE, an open-source search inference engine, we took a different approach: one server process that handles all four retrieval modes through a unified API with three primitives (encode, score, extract). Models like BGE-M3 return dense, sparse, and multi-vector outputs from a single encode call. Cross-encoder reranking uses the score primitive. Same server, same GPU, same API.
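To make the three-primitive idea concrete, here is a minimal sketch of what such a unified surface could look like. The `SearchEngine` and `EncodeResult` names, shapes, and stub scoring are hypothetical illustrations, not SIE's actual API; a real encode call would run model inference rather than these deterministic stand-ins.

```python
# Hypothetical sketch of a unified serving API with encode and score
# primitives. All names and outputs are illustrative stand-ins.
from dataclasses import dataclass


@dataclass
class EncodeResult:
    dense: list[float]              # one pooled vector per text
    sparse: dict[str, float]        # token -> weight
    multivector: list[list[float]]  # one vector per token (ColBERT-style)


class SearchEngine:
    """Toy stand-in for one server process hosting all retrieval modes."""

    def encode(self, text: str) -> EncodeResult:
        # A BGE-M3-style model returns all three representations from a
        # single forward pass; here we fake them deterministically.
        tokens = text.lower().split()
        dense = [len(t) / 10.0 for t in tokens[:4]]
        sparse = {t: 1.0 for t in tokens}
        multi = [[len(t) / 10.0] for t in tokens]
        return EncodeResult(dense, sparse, multi)

    def score(self, query: str, document: str) -> float:
        # Cross-encoder-style primitive: score one (query, document) pair.
        q, d = set(query.lower().split()), set(document.lower().split())
        return len(q & d) / max(len(q), 1)


engine = SearchEngine()
out = engine.encode("hybrid retrieval with one gpu")
print(sorted(out.sparse))
```

The point of the shape is that one response carries dense, sparse, and multi-vector outputs together, so callers pick what they need without a second round trip.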

The talk covers four areas.

First, why hybrid retrieval requires multiple model types. We will walk through a real retrieval pipeline: sparse for keyword recall, dense for semantic matching, ColBERT for token-level precision, and a cross-encoder for final reranking. For each stage we will show what it adds to retrieval quality using BEIR benchmark data, and when the added complexity is not worth it.
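The staged pipeline above can be sketched end to end with toy scorers. Every scoring function here is a deterministic stand-in (a real system would call BM25, an embedder, ColBERT max-sim, and a cross-encoder), but the control flow — recall, merge, rescore, rerank — mirrors the stages described.

```python
# Toy four-stage hybrid pipeline: sparse recall + dense recall -> merge
# -> ColBERT-style rescore -> cross-encoder rerank. Scorers are stand-ins.
from collections import Counter

DOCS = {
    "d1": "gpu inference for dense embeddings",
    "d2": "keyword search with sparse vectors",
    "d3": "colbert token level matching on one gpu",
}

def sparse_score(q, d):   # keyword overlap (BM25 stand-in)
    return sum((Counter(q.split()) & Counter(d.split())).values())

def dense_score(q, d):    # crude "semantic" proxy: character-set Jaccard
    return len(set(q) & set(d)) / len(set(q) | set(d))

def colbert_score(q, d):  # max-sim: best match per query token
    return sum(max((1.0 if qt == dt else 0.0) for dt in d.split())
               for qt in q.split())

def rerank_score(q, d):   # cross-encoder stand-in
    return sparse_score(q, d) + dense_score(q, d)

def search(query, k=2):
    # Stages 1-2: union the top-k candidates from sparse and dense recall.
    cands = set()
    for scorer in (sparse_score, dense_score):
        ranked = sorted(DOCS, key=lambda i: scorer(query, DOCS[i]), reverse=True)
        cands.update(ranked[:k])
    # Stage 3: ColBERT-style rescoring narrows the candidate pool.
    cands = sorted(cands, key=lambda i: colbert_score(query, DOCS[i]), reverse=True)[:k]
    # Stage 4: cross-encoder rerank for final precision.
    return sorted(cands, key=lambda i: rerank_score(query, DOCS[i]), reverse=True)

print(search("sparse keyword search"))
```

Dropping a stage is just deleting one step here, which is exactly the knob to turn when benchmark gains do not justify the extra latency.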

Second, the adapter architecture that makes multi-model serving possible. SIE wraps PyTorch, FlashAttention, SentenceTransformers, and SGLang behind a common interface. We will walk through the lifecycle of a request: API call, tokenization on CPU, batching, GPU inference, and postprocessing. Different model architectures need different compute backends, and we will explain why a single unified runtime was not the right choice.
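The adapter idea reduces to a small shared interface plus one request path that drives every backend. The class and method names below are illustrative, not SIE's actual interface, and the "GPU" step is stubbed with plain Python.

```python
# Sketch of an adapter layer: each compute backend implements the same
# small interface, so one request lifecycle serves every model type.
from abc import ABC, abstractmethod


class ModelAdapter(ABC):
    @abstractmethod
    def tokenize(self, texts):       # CPU-side preprocessing
        ...

    @abstractmethod
    def forward(self, batch):        # GPU inference (stubbed here)
        ...

    @abstractmethod
    def postprocess(self, outputs):  # e.g. pooling, sparsification
        ...


class DenseAdapter(ModelAdapter):
    def tokenize(self, texts):
        return [t.split() for t in texts]

    def forward(self, batch):
        # stand-in for a forward pass: one number per token
        return [[len(tok) for tok in toks] for toks in batch]

    def postprocess(self, outputs):
        # mean-pool token outputs into one value per text
        return [sum(o) / len(o) for o in outputs]


def handle_request(adapter: ModelAdapter, texts: list[str]):
    """Lifecycle: tokenize (CPU) -> batch -> infer (GPU) -> postprocess."""
    batch = adapter.tokenize(texts)  # batching is trivial here: one batch
    outputs = adapter.forward(batch)
    return adapter.postprocess(outputs)


print(handle_request(DenseAdapter(), ["one gpu", "four modes"]))
```

A sparse or ColBERT adapter would differ only in `forward` and `postprocess`, which is what lets different backends (PyTorch, SGLang, and so on) sit behind the same request path.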

Third, building the pipeline end to end. A practical walkthrough of dense + sparse + ColBERT + reranking from a single server instance, including how to combine scores from different retrieval modes and how to tune the balance between recall and precision.
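One widely used way to combine rankings from heterogeneous retrieval modes is reciprocal rank fusion (RRF), which needs no score normalization across modes. This is a generic technique offered as an example, not necessarily the fusion SIE uses.

```python
# Reciprocal rank fusion: combine ranked lists from different retrieval
# modes using ranks alone, so incomparable score scales don't matter.
def rrf(rankings, k=60):
    """rankings: list of ranked doc-id lists (best first) -> fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense_ranking = ["d1", "d2"]
sparse_ranking = ["d2", "d3"]
print(rrf([dense_ranking, sparse_ranking]))
```

The constant `k` (60 is the conventional default) damps the influence of top ranks; weighting one mode's contribution up or down is one simple way to trade recall against precision.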

Fourth, tradeoffs and lessons. When does multi-model serving on one GPU work well, and when should a model get its own dedicated container? What happens under concurrent load when multiple models compete for memory? We will share real data from running these workloads on L4 GPUs.
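As a purely illustrative sketch of one mitigation for that contention: a shared semaphore can cap how many forward passes touch the GPU at once, regardless of which model issued them. A real server would manage memory pools and batch queues rather than a single lock; this only shows the admission-control shape.

```python
# Admission control sketch: a shared semaphore caps concurrent "GPU"
# batches across all co-located models. Illustrative only.
import threading
import time

GPU_SLOTS = threading.Semaphore(2)  # at most 2 concurrent batches
peak = 0
active = 0
lock = threading.Lock()


def infer(model_name):
    global peak, active
    with GPU_SLOTS:                 # wait for a slot before touching the GPU
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)            # stand-in for a forward pass
        with lock:
            active -= 1


threads = [threading.Thread(target=infer, args=(f"model-{i}",))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("peak concurrency:", peak)    # bounded by the semaphore limit
```

The same gate applied per model, with different limits, is one crude way to keep a memory-hungry reranker from starving the embedders sharing its GPU.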