Reviving phonetic algorithms for better search relevance
Session Abstract
Fuzzy search is a double-edged sword: it fixes typos but drowns users in noise on large corpora. At INA, we revived “ancient” phonetic algorithms to improve relevance. This session compares fuzzy vs. phonetic search on a massive archive, showing how “sounding right” beats “spelling close.”
Session Description
When users are unsure of a spelling, fuzzy search is the standard engineering solution. However, at the scale of the French National Audiovisual Institute (INA), we found that standard fuzziness hits a wall. On a massive corpus, “approximate” matching retrieves a paralyzing amount of noise, degrading the user experience.
To solve this, we looked back to move forward. We revived and re-implemented “ancient” phonetic algorithms, some dating back decades, to test if matching by sound could outperform matching by character distance.
In this talk, we share our journey tuning relevance for the French language, which is notoriously difficult to search due to its silent letters and homophones. We will cover:
- The Fuzziness Trap: Why increasing edit distance failed to solve our precision/recall trade-off (the first sketch after this list illustrates the problem).
- Algorithm Showdown: A comparative analysis of standard Fuzzy Querying vs. Phonetic Analysis (e.g., Soundex, Beider-Morse, Metaphone) within our search pipeline.
- Implementation: How we integrated these phonetic tokens into our indexing strategy to filter noise without losing relevant results (the second sketch after this list shows the general shape of the idea).
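To make the fuzziness trap concrete, here is a minimal, self-contained sketch in plain Python, not the pipeline discussed in the talk: a textbook Levenshtein distance run over an invented five-term vocabulary. With the usual maximum edit distance of 2, a query for the singer “Sardou” also matches “Bardot” and “gardon”, which sound nothing alike; the query, the vocabulary, and the distance-2 cut-off are illustrative assumptions only.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Toy vocabulary: what a user searching for the singer "Sardou" might hit
# when the engine allows an edit distance of 2.
vocabulary = ["sardou", "sardoux", "bardot", "gardon", "sardine"]
query = "sardou"

for term in vocabulary:
    d = levenshtein(query, term)
    verdict = "MATCH" if d <= 2 else "no match"
    print(f"{query!r} vs {term!r}: distance {d} -> {verdict}")

# distance 0 -> sardou (exact), 1 -> sardoux (genuine typo),
# 2 -> bardot and gardon (pure noise), 3 -> sardine (rejected).
```

On short terms like these, each extra point of allowed distance admits far more phonetically unrelated words than genuine misspellings, which is the precision collapse the first bullet refers to.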
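For the last two bullets, the following sketch shows the general shape of the idea rather than the actual implementation: `french_phonetic_key` is a deliberately tiny stand-in for a real encoder such as Soundex or Beider-Morse, and its rewrite rules and the three-document “archive” are invented for illustration. The point is the indexing strategy: each token is indexed under both its exact form and its phonetic key, and queries fall back to the phonetic bucket instead of fuzziness, so the phonetic spelling “Reno” still reaches “Renault” (edit distance 4, far beyond any sane fuzziness) without dragging in unrelated terms.

```python
import re
import unicodedata
from collections import defaultdict

def french_phonetic_key(word: str) -> str:
    """Tiny illustrative stand-in for a phonetic encoder (NOT Soundex/Beider-Morse).

    Collapses a few common French spelling patterns so that words that
    sound alike map to the same key.
    """
    # Lower-case and strip accents (é -> e, ç -> c, ...).
    w = unicodedata.normalize("NFKD", word.lower())
    w = "".join(c for c in w if not unicodedata.combining(c))
    rules = [
        (r"eau|au", "o"),            # Renault / Renaud / Reno share one vowel sound
        (r"ph", "f"),
        (r"qu", "k"),
        (r"(lt|ld|[tdsxzp])$", ""),  # frequent silent final consonants
        (r"(.)\1", r"\1"),           # collapse doubled letters
    ]
    for pattern, repl in rules:
        w = re.sub(pattern, repl, w)
    return w

# Index each token under its exact form AND its phonetic key.
documents = {
    1: "usine Renault de Billancourt",
    2: "interview de Renaud",
    3: "grand prix automobile",
}
exact_index, phonetic_index = defaultdict(set), defaultdict(set)
for doc_id, text in documents.items():
    for token in text.lower().split():
        exact_index[token].add(doc_id)
        phonetic_index[french_phonetic_key(token)].add(doc_id)

def search(query: str) -> set:
    """Exact matches first; fall back to the phonetic bucket, never to fuzziness."""
    return exact_index.get(query.lower()) or phonetic_index.get(french_phonetic_key(query), set())

print(search("Renault"))  # {1}    -- exact hit
print(search("Reno"))     # {1, 2} -- phonetic hit despite edit distance 4 from "Renault"
```

In a production engine this usually takes the form of an additional analyzed field or token filter carrying the phonetic form next to the original token; the talk compares which of the named encoders fills that role best for French.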
You will leave with a clear understanding of when to abandon standard fuzziness and how to leverage phonetic search to clean up your own noisy results.