Search satisfaction NPS from 52 to 79 — and CTR on top result up 34%

Key outcomes

Search satisfaction NPS improved from 52 to 79

Click-through rate on top search result up 34%

Zero-result searches reduced from 18% to 6% of queries

Search-initiated GMV up 22% quarter-on-quarter

The problem

ShopCo's B2B marketplace had a search engine that was essentially a keyword matcher built in 2017. As the catalogue grew to 4.2M SKUs, keyword search broke down in predictable ways: synonym blindness ("laptop stand" ≠ "notebook riser"), no understanding of intent ("cheap office chairs" returned results sorted by relevance score, not price), and no personalisation for repeat buyers. 18% of searches returned zero results. Buyers were switching to Google to find products on ShopCo — then coming back to complete the purchase. We were losing the discovery step.

Search was the #1 product discovery channel, accounting for 63% of all browse sessions. Buyers who used search converted at 2.8× the rate of browsing buyers, and search satisfaction was measurably correlated with repeat purchase rate. This wasn't a nice-to-have; it was a core retention lever.

Research & insights

Methods: Query log analysis (6 months, 40M queries), 22 user interviews with B2B buyers, 8 interviews with suppliers, 3 competitor benchmarks, monthly NPS survey segmented by search usage

Query log analysis revealed three dominant failure patterns: (1) synonym failures — 34% of zero-result queries had a successful equivalent query in our logs (e.g., "A4 copier paper" had inventory; "A4 printer paper" returned zero results); (2) intent mismatch — queries containing words like "cheap", "bulk", "sample" were not being interpreted as intent signals; (3) catalogue gaps — 28% of zero-result queries were genuine gaps in the catalogue, not a relevance problem. The buyer interviews added a fourth insight: B2B buyers are highly repeat purchasers — 70% had bought the same SKU more than once — but the search engine had no memory. A buyer who'd purchased "Leitz arch lever files A4 50mm" three times still had to type the full query each time.
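To make the first failure pattern concrete, here is a minimal sketch of how zero-result queries could be paired with a successful near-equivalent from the same log. It is illustrative only: the case study does not say how the matching was done, and the token-overlap (Jaccard) measure, the 0.5 threshold, and the function names are all assumptions.

```python
from collections import Counter


def tokens(query: str) -> frozenset:
    """Lower-case a query and split it into a set of words."""
    return frozenset(query.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Share of words two queries have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def synonym_candidates(log, threshold=0.5):
    """Pair each zero-result query with its closest successful query.

    `log` is an iterable of (query, result_count) rows. The threshold
    is a hypothetical value, not one taken from the project.
    Returns (failed_query, successful_query, failed_volume) tuples.
    """
    zero_volume = Counter()
    successful = set()
    for query, result_count in log:
        q = query.lower().strip()
        if result_count == 0:
            zero_volume[q] += 1
        else:
            successful.add(q)

    pairs = []
    candidates = sorted(successful)  # sorted for deterministic tie-breaking
    for failed, volume in zero_volume.most_common():
        if failed in successful or not candidates:
            continue
        best = max(candidates, key=lambda s: jaccard(tokens(failed), tokens(s)))
        if jaccard(tokens(failed), tokens(best)) >= threshold:
            pairs.append((failed, best, volume))
    return pairs


log = [
    ("A4 copier paper", 120),
    ("A4 printer paper", 0),
    ("A4 printer paper", 0),
    ("laptop stand", 35),
    ("notebook riser", 0),
]
print(synonym_candidates(log))
```

On this toy log the sketch pairs "a4 printer paper" with "a4 copier paper" but finds nothing for "notebook riser", because that query shares no words with "laptop stand". Pairs like that are where word overlap stops helping and a learned similarity model is needed.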

Solution

We rebuilt relevance in three layers: (1) a synonym and query expansion layer using a fine-tuned sentence transformer model trained on our query logs; (2) an intent classifier that detected modifier intent (price, quantity, sample) and adjusted ranking accordingly; (3) a personalisation layer that boosted previously purchased and previously viewed SKUs for logged-in buyers. We also tackled catalogue gaps by building a signal pipeline that flagged high-volume zero-result queries for the catalogue team to act on.
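As a rough illustration of how layers (2) and (3) could fit together at ranking time, the sketch below strips intent modifiers from the query and then re-ranks hits. It is not the production design: a keyword lookup stands in for the trained intent classifier, and the `Hit` fields, the `INTENT_TERMS` table, and the 0.2 boost are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical modifier vocabulary. The case study names "cheap", "bulk"
# and "sample" as examples; the real classifier was a trained model.
INTENT_TERMS = {"cheap": "price", "bulk": "quantity", "sample": "sample"}


@dataclass
class Hit:
    sku: str
    relevance: float   # base relevance score from the retrieval layer
    unit_price: float


def split_intent(query: str):
    """Separate intent modifiers from the product terms of a query."""
    intents, terms = set(), []
    for word in query.lower().split():
        if word in INTENT_TERMS:
            intents.add(INTENT_TERMS[word])
        else:
            terms.append(word)
    return " ".join(terms), intents


def rerank(hits, intents, purchased_skus, boost=0.2):
    """Apply a personalisation boost, then honour price intent if present."""
    def score(hit: Hit) -> float:
        return hit.relevance + (boost if hit.sku in purchased_skus else 0.0)

    if "price" in intents:
        # Cheapest first; the boosted score only breaks price ties.
        return sorted(hits, key=lambda h: (h.unit_price, -score(h)))
    return sorted(hits, key=score, reverse=True)


hits = [Hit("A", 0.9, 120.0), Hit("B", 0.8, 45.0), Hit("C", 0.85, 80.0)]
product_terms, intents = split_intent("cheap office chairs")
print(product_terms, intents)
print([h.sku for h in rerank(hits, intents, purchased_skus=set())])
print([h.sku for h in rerank(hits, set(), purchased_skus={"B"})])
```

With price intent detected, the cheapest SKU leads the list. Without it, the previously purchased SKU "B" is lifted above hits that have a higher base relevance.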

Key decisions & trade-offs

The biggest architectural decision was whether to use a general-purpose LLM for query understanding or a smaller, fine-tuned model. The LLM approach had better out-of-the-box coverage but was 40× more expensive per query at our volume and added 120ms to search latency. We chose the fine-tuned sentence transformer — smaller, faster, cheaper — and accepted that we'd need to maintain the training data pipeline. The second key decision was rollout strategy: we ran a 4-week shadow evaluation (new ranker running in parallel, not served to users) before any A/B test, which let us catch two edge-case regressions in electronics search before they affected buyers.
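The shadow evaluation can be pictured as a comparison job that runs both rankers on the same queries and flags large disagreements for review, while only the production results are ever served. The sketch below assumes each ranker is a callable returning an ordered list of SKU ids; the overlap metric, the 0.5 cut-off, and the function names are illustrative choices, not details from the project.

```python
def top_k_overlap(prod, cand, k=10):
    """Fraction of the top-k results that both rankers agree on."""
    a, b = prod[:k], cand[:k]
    if not a and not b:
        return 1.0  # both empty: nothing to disagree about
    return len(set(a) & set(b)) / max(len(a), len(b))


def shadow_report(queries, prod_ranker, cand_ranker, k=10, min_overlap=0.5):
    """Return (query, overlap) for queries where the rankers diverge.

    The candidate's output is only compared and logged here; it is
    never returned to a user. `min_overlap` is a hypothetical threshold.
    """
    flagged = []
    for query in queries:
        overlap = top_k_overlap(prod_ranker(query), cand_ranker(query), k)
        if overlap < min_overlap:
            flagged.append((query, overlap))
    return flagged


# Toy rankers backed by dictionaries, standing in for real services.
prod = {"usb hub": ["s1", "s2", "s3"], "hdmi cable": ["s4", "s5", "s6"]}
cand = {"usb hub": ["s2", "s1", "s3"], "hdmi cable": ["s9", "s8", "s4"]}

for query, overlap in shadow_report(list(prod), prod.get, cand.get, k=3):
    print(f"review: {query!r} (top-3 overlap {overlap:.2f})")
```

Low overlap is not a regression in itself, since the new ranker is meant to change results. It is a cheap way to shortlist queries for a human to inspect before any buyer sees them.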

Results

Metric | Before | After | Delta | Time
Search satisfaction NPS | 52 | 79 | +27 | 8 months
Top-result CTR | 31% | 42% | +34% | 8 months
Zero-result rate | 18% | 6% | −67% | 8 months
Search-initiated GMV | baseline | +22% | +22% | Q4 2024
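The Delta column mixes two conventions: NPS is reported as an absolute change in points, while the rates are reported as relative change against the starting value. A short worked calculation, using only the rounded figures above:

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change relative to the starting value."""
    return (after - before) / before * 100


nps_delta = 79 - 52                        # absolute points
zero_result_change = relative_change(18, 6)
ctr_change = relative_change(31, 42)

print(f"NPS: +{nps_delta} points")
print(f"Zero-result rate: {zero_result_change:.0f}%")
print(f"Top-result CTR: +{ctr_change:.1f}%")
```

Recomputing the CTR change from the rounded 31% and 42% gives roughly 35% rather than the reported 34%. The likely explanation, which is an assumption here, is that the reported figure was calculated from unrounded rates.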

Challenges & learnings

Challenges

The intent classifier had a persistent precision problem on ambiguous queries. "Office chairs" could mean budget intent or just browsing — the model was over-indexing on price signals from the training data because cheap office chairs generated more clicks historically. We had to build a feedback loop from explicit user signals ("Sort by price" clicks post-search) rather than just click-through data to correct the bias. This added 6 weeks to the project.
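One way to picture the corrected feedback loop is to derive price-intent labels from the explicit "Sort by price" action instead of from clicks. The sketch below is a simplified assumption of how such labels could be built: the event shape, the minimum support of 50 searches, and the 30% threshold are hypothetical, not values from the project.

```python
from collections import defaultdict


def price_intent_labels(events, min_searches=50, threshold=0.3):
    """Label queries as price-intent from explicit sort actions.

    `events` is an iterable of (query, sorted_by_price) pairs, one per
    search, where the flag records whether the buyer clicked
    "Sort by price" afterwards. Queries with too few searches are
    skipped rather than labelled from thin evidence.
    """
    searches = defaultdict(int)
    price_sorts = defaultdict(int)
    for query, sorted_by_price in events:
        q = query.lower().strip()
        searches[q] += 1
        if sorted_by_price:
            price_sorts[q] += 1

    return {
        q: price_sorts[q] / n >= threshold
        for q, n in searches.items()
        if n >= min_searches
    }


events = (
    [("office chairs", False)] * 90
    + [("office chairs", True)] * 10
    + [("cheap office chairs", True)] * 60
    + [("cheap office chairs", False)] * 40
    + [("rare query", True)] * 3
)
print(price_intent_labels(events))
```

In this toy data only 10% of "office chairs" searches are followed by a price sort, so the query is not labelled as price-intent, however many clicks the cheap chairs attracted historically.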

What I'd do differently

I should have scoped the catalogue gap work out of this project from the start. It was real and important, but it was a separate problem (catalogue ops) masquerading as a search problem. We spent 6 weeks building the gap signal pipeline, which delayed the relevance work. In hindsight, I'd have surfaced the catalogue gap data to the relevant team early, committed to a later integration date, and kept the search team focused on relevance. I also learned to always run a shadow evaluation before an A/B test for ML ranking changes — the regressions we caught in shadow would have been painful to discover in production.

Skills demonstrated

ML, Search, B2B, Platform, NLP
