ShopCo's B2B marketplace had a search engine that was essentially a keyword matcher built in 2017. As the catalogue grew to 4.2M SKUs, keyword search broke down in predictable ways: synonym blindness ("laptop stand" ≠ "notebook riser"), no understanding of intent ("cheap office chairs" returned results sorted by relevance score, not price), and no personalisation for repeat buyers. 18% of searches returned zero results. Buyers were switching to Google to find products on ShopCo — then coming back to complete the purchase. We were losing the discovery step.
Query log analysis revealed three dominant failure patterns: (1) synonym failures — 34% of zero-result queries had a successful equivalent query in our logs (e.g., "A4 copier paper" had inventory; "A4 printer paper" returned zero results); (2) intent mismatch — queries containing words like "cheap", "bulk", "sample" were not being interpreted as intent signals; (3) catalogue gaps — 28% of zero-result queries were genuine gaps in the catalogue, not a relevance problem. The buyer interviews added a fourth insight: B2B buyers are highly repeat purchasers — 70% had bought the same SKU more than once — but the search engine had no memory. A buyer who'd purchased "Leitz arch lever files A4 50mm" three times still had to type the full query each time.
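To make the first and third patterns concrete, here is a minimal sketch of how zero-result queries could be split into "has a successful equivalent" and "probable catalogue gap". It rests on assumptions: the log is a list of hypothetical `(query, result_count)` pairs, and token-set Jaccard overlap stands in for whatever equivalence test the analysis actually used, which the write-up does not specify. The function names and the 0.5 threshold are illustrative only.

```python
def tokens(query: str) -> frozenset:
    """Lowercased whitespace tokens of a query."""
    return frozenset(query.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Token-set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_zero_result_queries(log, threshold: float = 0.5) -> dict:
    """Map each zero-result query to its closest successful query, or None.

    `log` is an iterable of (query, result_count) pairs (assumed shape).
    A None value marks a candidate catalogue gap rather than a synonym failure.
    """
    normalised = [(" ".join(q.lower().split()), n) for q, n in log]
    successful = sorted({q for q, n in normalised if n > 0})
    zero = sorted({q for q, n in normalised if n == 0} - set(successful))

    matches = {}
    for q in zero:
        best, best_score = None, 0.0
        for s in successful:
            score = jaccard(tokens(q), tokens(s))
            if score > best_score:
                best, best_score = s, score
        # Only accept the closest successful query if it is similar enough.
        matches[q] = best if best_score >= threshold else None
    return matches


example_log = [
    ("A4 copier paper", 12),
    ("A4 printer paper", 0),
    ("titanium desk legs", 0),
]
result = split_zero_result_queries(example_log)
```

With this toy log, "a4 printer paper" pairs with "a4 copier paper" and "titanium desk legs" has no match. Token overlap would miss pairs such as "laptop stand" and "notebook riser", which share no tokens, so a real analysis would need a semantic similarity measure.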
We rebuilt relevance in three layers: (1) a synonym and query expansion layer using a sentence transformer model fine-tuned on our query logs; (2) an intent classifier that detected modifier intent (price, quantity, sample) and adjusted ranking accordingly; (3) a personalisation layer that boosted previously purchased and previously viewed SKUs for logged-in buyers. We also tackled catalogue gaps by building a signal pipeline that flagged high-volume zero-result queries for the catalogue team to act on.
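The sketch below shows how layers (2) and (3) might combine at ranking time. It is a simplification under stated assumptions: a keyword lookup stands in for the trained intent classifier, and the `Candidate` fields, the modifier-to-intent mapping, and the 0.2 boost are hypothetical values chosen for illustration, not the production configuration.

```python
from dataclasses import dataclass

# Hypothetical mapping from modifier words to intent labels.
INTENT_MODIFIERS = {"cheap": "price", "bulk": "quantity", "sample": "sample"}


@dataclass(frozen=True)
class Candidate:
    sku: str
    relevance: float   # base relevance score from the retrieval layer
    unit_price: float


def detect_intents(query: str) -> set:
    """Return the intent labels triggered by modifier words in the query."""
    return {INTENT_MODIFIERS[t] for t in query.lower().split() if t in INTENT_MODIFIERS}


def rerank(query: str, candidates, purchased_skus, boost: float = 0.2) -> list:
    """Re-order candidates using price intent and a repeat-purchase boost."""
    def boosted(c: Candidate) -> float:
        # Personalisation: add a fixed boost for SKUs the buyer bought before.
        return c.relevance + (boost if c.sku in purchased_skus else 0.0)

    if "price" in detect_intents(query):
        # Price intent: cheapest first, boosted relevance breaks ties.
        return sorted(candidates, key=lambda c: (c.unit_price, -boosted(c)))
    return sorted(candidates, key=lambda c: -boosted(c))


chairs = [
    Candidate("A", relevance=0.90, unit_price=120.0),
    Candidate("B", relevance=0.80, unit_price=45.0),
    Candidate("C", relevance=0.75, unit_price=80.0),
]
by_price = [c.sku for c in rerank("cheap office chairs", chairs, set())]
personalised = [c.sku for c in rerank("office chairs", chairs, {"C"})]
```

Sorting strictly by price when price intent is detected is the bluntest possible adjustment. A blended score would be a gentler alternative, and the write-up does not say which one was used.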
The biggest architectural decision was whether to use a general-purpose LLM for query understanding or a smaller, fine-tuned model. The LLM approach had better out-of-the-box coverage but was 40× more expensive per query at our volume and added 120ms to search latency. We chose the fine-tuned sentence transformer — smaller, faster, cheaper — and accepted that we'd need to maintain the training data pipeline. The second key decision was rollout strategy: we ran a 4-week shadow evaluation (new ranker running in parallel, not served to users) before any A/B test, which let us catch two edge-case regressions in electronics search before they affected buyers.
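A shadow evaluation of this kind can be sketched as follows: both rankers run on the same queries, only the production result would be served, and queries where the two top-k lists diverge sharply are flagged for review. The `Ranker` type, the overlap metric, and the thresholds are assumptions for illustration; the write-up does not describe the metrics used in the real evaluation.

```python
from typing import Callable, Sequence

# A ranker takes a query and returns an ordered list of SKU ids (assumed shape).
Ranker = Callable[[str], Sequence[str]]


def overlap_at_k(a: Sequence[str], b: Sequence[str], k: int) -> float:
    """Fraction of the top-k results shared by two rankings."""
    top_a, top_b = set(a[:k]), set(b[:k])
    if not top_a and not top_b:
        return 1.0  # both empty: treat as identical
    return len(top_a & top_b) / max(len(top_a), len(top_b))


def shadow_evaluate(queries, production: Ranker, candidate: Ranker,
                    k: int = 10, min_overlap: float = 0.5) -> list:
    """Return (query, overlap) pairs where the candidate diverges from production."""
    flagged = []
    for q in queries:
        served = production(q)   # what buyers would actually see
        shadow = candidate(q)    # computed and compared, never served
        score = overlap_at_k(served, shadow, k)
        if score < min_overlap:
            flagged.append((q, score))
    return flagged


# Toy rankers backed by dictionaries, standing in for the real services.
prod_results = {"usb hub": ["a", "b", "c"], "hdmi cable": ["x", "y", "z"]}
cand_results = {"usb hub": ["a", "b", "d"], "hdmi cable": ["p", "q", "x"]}
flagged = shadow_evaluate(
    ["usb hub", "hdmi cable"],
    production=lambda q: prod_results.get(q, []),
    candidate=lambda q: cand_results.get(q, []),
    k=3,
)
```

Low overlap does not by itself mean a regression, since the new ranker is meant to differ. It only narrows down which queries a human should inspect before the A/B test.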
The intent classifier had a persistent precision problem on ambiguous queries. "Office chairs" could signal budget intent or plain browsing, but the model over-indexed on price signals in the training data because cheap office chairs had historically generated more clicks. To correct the bias, we built a feedback loop from explicit user signals ("Sort by price" clicks after a search) rather than relying on click-through data alone. This added 6 weeks to the project.
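One way such a feedback loop could derive training labels is sketched below: a query is labelled as price intent only when buyers explicitly sorted by price after searching for it often enough. The session format, the function name, and both thresholds are hypothetical, and the real pipeline may well have weighted or combined signals differently.

```python
from collections import defaultdict


def price_intent_labels(sessions, min_sessions: int = 3, threshold: float = 0.5) -> dict:
    """Label queries as price intent from explicit "Sort by price" actions.

    `sessions` is an iterable of (query, sorted_by_price) pairs (assumed shape).
    Queries seen in fewer than `min_sessions` sessions are left unlabelled.
    """
    counts = defaultdict(lambda: [0, 0])  # query -> [total sessions, sort-by-price sessions]
    for query, sorted_by_price in sessions:
        key = " ".join(query.lower().split())
        counts[key][0] += 1
        counts[key][1] += int(sorted_by_price)

    return {
        q: (hits / total) >= threshold
        for q, (total, hits) in counts.items()
        if total >= min_sessions
    }


sessions = (
    [("cheap office chairs", True)] * 3
    + [("office chairs", False), ("office chairs", False), ("office chairs", True)]
    + [("desk lamp", True)]
)
labels = price_intent_labels(sessions)
```

In this toy data "office chairs" is labelled as not price intent because only one of three sessions sorted by price, and "desk lamp" stays unlabelled for lack of evidence.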
I should have scoped the catalogue gap work out of this project from the start. It was real and important, but it was a separate problem (catalogue ops) masquerading as a search problem. We spent 6 weeks building the gap signal pipeline, which delayed the relevance work. In hindsight, I'd have surfaced the catalogue gap data to the relevant team early, committed to a later integration date, and kept the search team focused on relevance. I also learned to always run a shadow evaluation before an A/B test for ML ranking changes — the regressions we caught in shadow would have been painful to discover in production.