
Towards Chunk-less RAG

Session Abstract

Retrieval-Augmented Generation (RAG) systems rely on pre-chunked documents, tying retrieval to arbitrary boundaries. This talk explores an experimental approach that surfaces semantically relevant text spans, without chunking. We’ll share surprising findings and examine whether this technique points toward a viable chunk-free retrieval paradigm.

Session Description

RAG systems have become foundational for grounding LLM outputs in factual knowledge, but they share a common limitation: semantic search operates at the chunk level, not the token level.

This talk presents an experimental investigation into whether we can bypass chunking entirely by extracting token-level relevance directly from dense embedding models. The core insight is simple: by skipping the embedding pooling step and computing cosine similarity between every query token and every document token, we can generate relevance heatmaps that highlight exactly which spans matter for a given query, and from those heatmaps extract the relevant text spans.
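The heatmap idea can be sketched in a few lines. This is a minimal illustration with random placeholder vectors; in practice the token embeddings would come from the last hidden layer of a model such as Qwen3-Embedding-0.6B, and the max-over-query-tokens collapse is just one plausible choice:

```python
import numpy as np

# Placeholder token embeddings (random); a real pipeline would take these
# from the last hidden layer of a dense embedding model, *before* pooling.
rng = np.random.default_rng(0)
query_tokens = rng.standard_normal((5, 64))    # 5 query tokens, dim 64
doc_tokens = rng.standard_normal((200, 64))    # 200 document tokens

# L2-normalize each token vector so dot products become cosine similarities.
q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)

# Relevance heatmap: one similarity per (query token, document token) pair.
heatmap = q @ d.T                              # shape (5, 200)

# Collapse the multi-token query into one score per document token,
# here via a max over query tokens (an assumed, ColBERT-like reduction).
per_doc_token = heatmap.max(axis=0)            # shape (200,)
```

High values in `per_doc_token` mark the document positions a span extractor would then group into contiguous answers.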

The session will walk through the complete pipeline:

  • Extracting token-level embeddings from the last hidden layer of dense embedding models (specifically Qwen3-Embedding-0.6B)
  • Computing relevance matrices via normalized dot products between query and document token vectors
  • Collapsing multi-token query representations into per-document-token scores
  • Designing a clustering algorithm that identifies relevance peaks, groups nearby high-scoring tokens, and extends matches to semantic boundaries
  • Comparing results against purpose-built late-interaction models (ColBERT variants, Jina Embeddings v4)
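The clustering step above can be approximated with a simple greedy pass. The function below is a hypothetical sketch, not the talk's actual algorithm: it seeds clusters on relevance peaks, grows them while nearby tokens stay above a lower threshold, and tolerates small gaps; extension to semantic boundaries (e.g. sentence edges) is omitted:

```python
def extract_spans(scores, peak_thresh=0.6, keep_thresh=0.4, max_gap=3):
    """Greedy span extraction over per-token relevance scores (illustrative).

    A span is kept only if it contains at least one "peak" token
    (score >= peak_thresh); it is grown over tokens >= keep_thresh,
    allowing up to max_gap low-scoring tokens before closing it.
    Thresholds here are arbitrary illustrative values.
    """
    spans, current, gap = [], None, 0
    for i, s in enumerate(scores):
        if s >= keep_thresh:
            current = [i, i] if current is None else [current[0], i]
            gap = 0
        elif current is not None:
            gap += 1
            if gap > max_gap:
                if max(scores[current[0]:current[1] + 1]) >= peak_thresh:
                    spans.append(tuple(current))
                current, gap = None, 0
    if current is not None and max(scores[current[0]:current[1] + 1]) >= peak_thresh:
        spans.append(tuple(current))
    return spans
```

For example, `extract_spans([0.1, 0.5, 0.9, 0.5, 0.1, 0.1, 0.1, 0.1, 0.5, 0.5])` returns `[(1, 3)]`: the first cluster survives because it contains the 0.9 peak, while the trailing 0.5 run never reaches the peak threshold.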

The experimental results reveal that the extracted spans achieve strong F1 scores when evaluated against ground-truth answers in test documents. More surprisingly, the model comparison shows that, despite being trained for pooled sentence embeddings, Qwen3’s token-level representations outperform ColBERT-style models specifically designed for multi-vector matching.
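The talk does not spell out the exact evaluation protocol; one common choice for scoring extracted spans against ground-truth answers is SQuAD-style token-overlap F1, sketched here as an assumed metric:

```python
from collections import Counter

def span_f1(predicted_tokens, gold_tokens):
    """Token-overlap F1 between a predicted span and a gold answer span
    (SQuAD-style; assumed here, the talk's metric may differ)."""
    common = Counter(predicted_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For instance, predicting `["the", "cat", "sat"]` against gold `["the", "cat"]` gives precision 2/3 and recall 1, hence F1 = 0.8.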

However, the approach surfaces two major challenges: storage requirements balloon by roughly 900× compared to traditional chunking, and the model’s decoder-only (causal-attention) architecture creates attention patterns that bias relevance toward document endings.
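A back-of-envelope calculation shows where a blow-up of that order can come from: traditional chunking stores one pooled vector per chunk, while span-level retrieval stores one vector per token. The numbers below are illustrative assumptions, not figures from the talk:

```python
# Illustrative numbers only (not from the talk).
dim = 1024            # embedding dimension
chunk_tokens = 900    # assumed tokens per chunk under traditional chunking

pooled_floats_per_chunk = dim                  # one pooled vector per chunk
token_floats_per_chunk = chunk_tokens * dim    # one vector per token

# The blow-up factor equals the chunk length in tokens when dimensions match.
blowup = token_floats_per_chunk / pooled_floats_per_chunk
print(blowup)  # 900.0
```

Under these assumptions the storage multiplier is simply the number of tokens per chunk, which is why per-token storage scales so badly without compression or dimensionality reduction.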

This is explicitly experimental work shared in the spirit of exploring new directions, not presenting a production solution. The goal is to spark discussion about whether the chunking paradigm is a necessary constraint or an artifact of current tooling, what modifications to model training or inference could make span-level retrieval practical at scale, and the parallels between this approach and promising knowledge graph retrieval strategies.

Short Talk