Data Science
Search

Circular Dependency Fixes when Bootstrapping a Golden Set

Session Abstract

For a golden set, you need queries. Even if you have them, you can’t judge every document for each query, only the top N. But how do we rank that top N? See the circular dependency? We’ll talk about ways to untangle it: lexical search, significant terms, training an embedder from scratch, and more. By iteratively refining both data and queries, we’ll get there.

Session Description

If you’re not satisfied with your golden set, or don’t have one at all, this session is for you. You may already have queries (e.g., from query logs), or you may need to generate them. We’ll start by looking at how to create synthetic queries from individual documents, as well as from facets and facet combinations, so that each query matches a known set of N documents.
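As a taste of the facet-based approach, here is a minimal sketch in Python. The corpus, the facet fields (`brand`, `color`, `category`), and the `synthetic_queries` helper are all hypothetical illustrations, not a prescribed implementation; the idea is simply that every generated query ships with its known relevant set.

```python
from itertools import combinations

# Toy corpus (hypothetical schema): each document carries facet values.
docs = [
    {"id": 1, "brand": "Acme", "color": "red", "category": "shoes"},
    {"id": 2, "brand": "Acme", "color": "blue", "category": "shoes"},
    {"id": 3, "brand": "Zenith", "color": "red", "category": "bags"},
    {"id": 4, "brand": "Zenith", "color": "red", "category": "shoes"},
]

def synthetic_queries(docs, facets, min_hits=1, max_hits=3):
    """Turn facet value combinations into query strings, keeping only
    those that match between min_hits and max_hits documents, so each
    query comes with a known relevant set."""
    queries = {}
    for r in (1, 2):  # single facets and pairs of facets
        for facet_combo in combinations(facets, r):
            # Distinct value tuples that actually occur in the corpus.
            seen = {tuple(d[f] for f in facet_combo) for d in docs}
            for values in seen:
                hits = [d["id"] for d in docs
                        if all(d[f] == v
                               for f, v in zip(facet_combo, values))]
                if min_hits <= len(hits) <= max_hits:
                    queries[" ".join(values)] = sorted(hits)
    return queries

qs = synthetic_queries(docs, ["brand", "color", "category"])
# e.g. "Acme shoes" maps to documents [1, 2]
```

In a real setup, the facet values would come from the search engine’s aggregations rather than an in-memory list, but the bookkeeping (query string plus known relevant IDs) stays the same.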

We’ll move on to relevance judgements. Even with LLM-as-a-judge, it’s not feasible to, say, rate every document in a 1M-doc corpus for 1K queries. We need the top N. But how do we know the “correct” top N? We’ll need to explore the dataset for any query that is ambiguous (i.e., doesn’t clearly match a single doc). There are different methods for exploring data: visualizations, analysis tweaks (e.g., stemming, synonyms), and so on. Vector similarity also helps, but choosing an embedder is tricky: transfer learning can introduce bias that is misleading for our dataset.

We can’t get a perfect golden set on the first try, but we’ll explore techniques to iterate until we’re happy. That iteration matters for any new search application, whether it’s central to the business (i.e., larger teams, bigger budgets) or not.