When Approximate Became Good Enough: The ANN Revolution

Why 99% recall changed everything, and the tradeoff that enabled billion-scale search

~20 min read

There's a moment in every engineer's life when they realize that "perfect" is the enemy of "ships by Friday." The entire ANN revolution is basically that moment, crystallized into algorithms.

For decades, researchers chased exact nearest neighbor search. The goal: given a query point, find the actual closest point in your dataset. No compromises. Mathematically pure. And at scale, mathematically impossible within reasonable compute budgets.

Then someone asked the uncomfortable question: "What if we just... didn't?"

The Heresy That Worked: Approximate Is Fine

Graph showing the tradeoff between recall and query speed, with the 'good enough' zone highlighted

The magic zone: 95-99% recall at 100-1000x speedup. For most applications, users cannot tell the difference.

Here's the thing about "exact" nearest neighbor: in most applications, nobody actually needs it. When you're recommending products, finding similar images, or retrieving relevant documents, the difference between the 1st and 5th most similar item is imperceptible to users.

Let's make this concrete. Say you're building a photo similarity feature. User uploads a sunset photo, you return similar sunsets. The "true" most similar image might have a cosine similarity of 0.9523. The image you actually return (because your approximate algorithm missed the best one) has similarity 0.9518.

Does the user care? Does the user notice? No. But the exact search took 400ms and the approximate search took 2ms. That, the user notices.

"A quick answer that's 99% right beats a slow answer that's 100% right. In production, latency is a feature."

This isn't just hand-waving. The math checks out. For most real-world embedding spaces, the top-10 nearest neighbors form a tight cluster at nearly the same distance from the query. If you find 9 out of 10, you're still returning extremely relevant results. The one you missed was probably only marginally better than your 10th pick anyway.

Locality Sensitive Hashing: The Original Good-Enough Algorithm

Before HNSW conquered the world, before graph-based methods became standard, there was LSH. Locality Sensitive Hashing. It's the algorithm that proved approximate search could work at scale, even if it's not the algorithm most of us reach for today.

The core insight is beautifully simple: design hash functions where similar items hash to the same bucket with high probability.

LSH intuition: random hyperplanes partition space, nearby points tend to end up in the same buckets

Random hyperplanes cut through your vector space. Points on the same side of most hyperplanes are probably close together.

Normal hash functions do the opposite—they spread similar inputs across different buckets (that's the whole point of hashing). LSH deliberately breaks this. Similar vectors should collide. Dissimilar vectors should separate.

The classic LSH for cosine similarity uses random hyperplanes. Pick a random vector. For each data point, check which side of the hyperplane it's on. That gives you one bit. Do this with many random vectors, concatenate the bits, and you have a hash.

Why does this work? If two vectors point in similar directions, they'll be on the same side of most random hyperplanes. Their hashes will match on most bits. Concretely, a single random hyperplane separates two vectors with probability θ/π, where θ is the angle between them, so the expected fraction of matching bits is 1 − θ/π. The probability of collision is directly tied to their cosine similarity.
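To make that concrete, here's a minimal sketch of random-hyperplane LSH in NumPy. The helper names are mine, and a real system would maintain buckets across multiple tables; this only shows the hashing step:

```python
import numpy as np

def make_hyperplanes(dim, n_bits, seed=0):
    """One random hyperplane (its normal vector) per hash bit."""
    return np.random.default_rng(seed).standard_normal((n_bits, dim))

def lsh_hash(vectors, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its positive side.
    For two vectors at angle theta, each bit matches with probability 1 - theta/pi."""
    bits = (vectors @ hyperplanes.T) > 0              # (n_vectors, n_bits)
    powers = 1 << np.arange(hyperplanes.shape[0])     # 2^0, 2^1, ...
    return bits.astype(np.int64) @ powers             # pack bits into bucket ids

planes = make_hyperplanes(dim=128, n_bits=16)
rng = np.random.default_rng(1)
a = rng.standard_normal(128)
b = a + 0.05 * rng.standard_normal(128)   # tiny perturbation: nearly the same direction
c = rng.standard_normal(128)              # unrelated vector
print(lsh_hash(np.stack([a, b, c]), planes))  # a and b usually share a bucket; c usually doesn't
```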

LSH in Practice: The Multi-Probe Trick

Raw LSH has a problem: to get high recall, you need many hash tables. Each table is independent, so you're betting on at least one of them putting your query and its true neighbors in the same bucket. More tables = higher probability = more memory.

The multi-probe trick fixes this: instead of just checking the bucket your query hashes to, check nearby buckets too. "Nearby" means buckets that differ by one or two bits. You're probing the hash space around your query point.

This trades computation for memory. Fewer tables, more probes per query. In practice, it made LSH viable for production systems that couldn't afford to keep dozens of replicated hash tables in RAM.
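A sketch of the probing side, in the same spirit. Real multi-probe LSH orders probes by how likely each perturbed bucket is to contain neighbors; this simplified version just enumerates every bucket within a small Hamming distance of the query's signature:

```python
from itertools import combinations

def multiprobe_buckets(bucket, n_bits, max_flips=2):
    """Yield the query's own bucket, then every bucket whose signature differs
    from it in at most max_flips bit positions."""
    yield bucket
    for r in range(1, max_flips + 1):
        for positions in combinations(range(n_bits), r):
            probe = bucket
            for p in positions:
                probe ^= 1 << p   # flip one bit of the signature
            yield probe

# With 16-bit signatures and up to 2 flipped bits, each table gets
# 1 + 16 + 120 = 137 probes instead of 1, so you can keep far fewer tables.
probes = list(multiprobe_buckets(0b1010110010110101, n_bits=16))
print(len(probes))  # 137
```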

Why LSH Faded (And What Replaced It)

LSH was revolutionary, but it had limitations. The hash functions are data-independent—they don't adapt to your specific embedding distribution. If your data clusters weirdly, LSH doesn't know.

The recall/speed tradeoffs also hit a ceiling. Getting from 90% to 99% recall required exponentially more hash tables or probes. For applications demanding high recall (like search engines), this got expensive.

Enter the algorithms that did learn from data: tree-based methods, quantization-based methods, and eventually the method that would dominate: graph-based search.

HNSW: The Algorithm That Changed Everything

HNSW graph structure showing hierarchical layers with increasing connectivity at higher levels

HNSW builds a layered graph. Higher layers are sparser, enabling fast coarse navigation. Lower layers are denser, enabling precise search.

Hierarchical Navigable Small World graphs (HNSW) appeared in a 2016 paper by Malkov and Yashunin, and within a few years it had become the default choice for high-recall ANN search. There's a reason: it's really, really good.

The core idea combines two concepts: navigable small-world graphs and hierarchical structure. Let me unpack both.

Small-World Graphs: Six Degrees of Separation

You know the "six degrees of separation" thing? Any person on Earth is connected to any other through about six friend-of-friend hops. That's a small-world network: high clustering (your friends know each other) plus a few long-range connections (you know someone across the world).

Now imagine your vectors as nodes in a graph. Connect each vector to its k-nearest neighbors. You get high clustering—similar vectors are connected to each other. Add some longer-range edges, and you can navigate from any starting point to any target in logarithmic hops.

Search becomes graph traversal: start somewhere, greedily move to the neighbor closest to your query, repeat until you can't improve. The small-world property guarantees you'll reach the neighborhood of your target quickly.
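The traversal itself is almost embarrassingly simple. Here's a minimal sketch; a real implementation keeps a bounded candidate heap and returns the top-k, not a single point:

```python
import numpy as np

def greedy_search(query, vectors, neighbors, start):
    """Greedy walk on a k-NN graph: from the current node, move to whichever
    neighbor is closest to the query; stop when no neighbor improves.

    vectors:   (n, dim) array of all points
    neighbors: dict mapping node id -> list of neighbor ids (the graph edges)
    """
    current = start
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        best, best_dist = current, current_dist
        for nb in neighbors[current]:
            d = np.linalg.norm(vectors[nb] - query)
            if d < best_dist:
                best, best_dist = nb, d
        if best == current:          # local minimum: no neighbor is closer
            return current, current_dist
        current, current_dist = best, best_dist
```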

The Hierarchical Trick: Coarse-to-Fine Navigation

Plain small-world graphs have a problem: if you start far from your target, greedy search can get stuck in local minima. You need to jump across the graph, but all your edges are to nearby points.

HNSW fixes this with layers. The bottom layer (layer 0) contains all points. Higher layers contain progressively fewer points—each point's maximum layer is drawn from an exponentially decaying distribution, so every layer up is exponentially sparser.

Search starts at the top layer, which is sparse. Greedy search on sparse graphs takes big jumps across the vector space. When you can't improve anymore, drop down a layer. The next layer is denser, so your jumps get smaller. Keep descending until you hit layer 0, where you refine to the final answer.

It's like zooming in on a map. Start at continent level, quickly navigate to the right country, zoom in to find the city, zoom again for the neighborhood, and finally the exact address. Each level has the right resolution for its job.
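Here's roughly what both pieces look like in code, reusing greedy_search from the sketch above. The layer-assignment formula follows the HNSW paper (levels drawn from an exponentially decaying distribution, with mL = 1/ln(M)); the search loop is heavily simplified and skips the ef-sized candidate list a real implementation uses on layer 0:

```python
import math
import random

def random_level(m_l=1 / math.log(16)):
    """Layer assignment: draw a level from an exponentially decaying distribution,
    so each layer up holds a small fraction of the points below it."""
    return int(-math.log(random.random()) * m_l)

def hnsw_search(query, vectors, layers, entry_point):
    """Coarse-to-fine descent: run a greedy search on each layer, top (sparsest)
    layer first, and use the result as the entry point for the next, denser layer.
    `layers` is a list of adjacency dicts ordered from the top layer down to layer 0."""
    current = entry_point
    for graph in layers:
        current, _ = greedy_search(query, vectors, graph, start=current)
    return current
```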

Why HNSW Dominates

HNSW hit the sweet spot that production systems care about:

  • High recall at reasonable speed: 99%+ recall is achievable without heroic compute. The hierarchical structure makes search efficient even at very high recall targets.
  • Incremental updates: Unlike IVF indices that require periodic retraining, HNSW can add new vectors by connecting them to existing graph nodes. No full rebuild needed.
  • Good memory efficiency: The graph structure requires storing neighbor lists, but this scales linearly with data size. No exponential blowup.
  • Works well out of the box: The hyperparameters (M, ef_construction, ef_search) have intuitive effects and sensible defaults; see the sketch after this list. You don't need a PhD to tune it.
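For illustration, here's what those knobs look like with the open-source hnswlib package. The parameter values below are arbitrary placeholders, not recommendations:

```python
import numpy as np
import hnswlib

dim, n = 128, 100_000
data = np.float32(np.random.random((n, dim)))

# Build-time knobs: M = edges per node, ef_construction = candidate-list width
# while inserting. Bigger values mean a better graph, more memory, slower builds.
index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=200)
index.add_items(data, np.arange(n))

# Query-time knob: ef (ef_search) trades recall for latency and can be changed
# per workload without rebuilding the index.
index.set_ef(100)
labels, distances = index.knn_query(data[:5], k=10)
```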

Today, essentially every vector database—Pinecone, Weaviate, Qdrant, Milvus—offers HNSW as their primary or default index type. It's the algorithm that made billion-scale similarity search routine rather than heroic.

The "Good Enough" Philosophy

I want to zoom out and talk about what the ANN revolution really represents, because I think it's deeper than just "we found faster algorithms."

Diagram showing the conceptual shift from exact answers to good-enough answers enabling new applications

The ANN revolution wasn't about better algorithms. It was about accepting "good enough" as a design principle.

Computer science has a perfectionist streak. We love formal proofs, optimal algorithms, guaranteed bounds. ANN said: forget all that. Let's be empirically good enough for real-world data, and deal with edge cases if they actually happen.

This is philosophically uncomfortable! What if you miss an important result? What if your 1% error rate compounds across millions of queries? What about adversarial inputs designed to hit your algorithm's weak spots?

These are valid concerns, and for some applications (medical diagnosis, legal discovery), approximate search might genuinely be the wrong choice. But for the vast majority of similarity search use cases, the "good enough" philosophy is exactly right.

"Perfect is the enemy of shipped. And shipped is the enemy of perfectly implemented. But shipped and good enough? That's the sweet spot."

Think about it: embeddings themselves are approximations. The neural network that produced them doesn't give you "true" semantic similarity—it gives you a learned approximation trained on some proxy task. Your "exact" nearest neighbor in embedding space isn't the "true" most similar item; it's the item whose embedding happens to be closest according to your model's compression of meaning.

In that light, insisting on exact search seems... odd? You're demanding mathematical precision at one layer of the stack while accepting deep approximations at every other layer.

The Tradeoff Triangle: Recall, Speed, Memory

Every ANN algorithm navigates the same fundamental tradeoff. You can optimize for any two of three properties, but not all three:

The ANN tradeoff triangle: recall, speed, and memory - pick two

Pick two. This constraint is fundamental: it reflects how much structure about your data the index stores versus how much work each query does, not the details of any particular implementation.

High Recall + Fast Queries = More Memory

Want 99% recall with sub-millisecond latency? You need the index to "know" more about the data structure. For HNSW, that means more edges per node. For IVF, that means more partitions and residual data. For LSH, that means more hash tables.

All of this costs memory. At billion scale, memory costs real money—cloud RAM isn't cheap.

High Recall + Low Memory = Slower Queries

If you can't afford to store extra index structures, you compensate by doing more work at query time. Search more partitions. Traverse more graph edges. Probe more hash buckets. The recall goes up, but so does latency.

Fast Queries + Low Memory = Lower Recall

This is the aggressive compression regime. Quantize vectors heavily. Use fewer partitions. Keep the graph sparse. Queries fly, memory is tiny, but you'll miss more true neighbors.

In practice, you choose where on this triangle your application sits. E-commerce recommendations might tolerate 90% recall for ultra-low latency. A legal document search might demand 99.5% recall and accept higher latency and memory costs.

Lessons for Production Systems

Having worked with ANN systems at various scales, here are the practical lessons I've learned:

1. Measure Recall on YOUR Data

Benchmark numbers lie. An algorithm's recall on ann-benchmarks.com doesn't predict its recall on your specific embeddings and query distribution. Your data might cluster differently. Your queries might hit edge cases.

Always evaluate on a held-out set with ground truth. Compute brute-force neighbors for a sample of queries, then see how many your ANN system finds. That's your real recall.
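A sketch of that evaluation loop, assuming your embeddings are L2-normalized so a dot product equals cosine similarity; `ann_neighbors` stands in for whatever your index returned on the same sample of queries:

```python
import numpy as np

def brute_force_topk(queries, corpus, k=10):
    """Exact ground truth on a sample of queries: slow, but you only do it once.
    Assumes rows are L2-normalized, so dot product == cosine similarity."""
    sims = queries @ corpus.T
    return np.argsort(-sims, axis=1)[:, :k]

def recall_at_k(true_neighbors, ann_neighbors, k=10):
    """Per-query recall@k: what fraction of the exact top-k did the ANN index return?"""
    return np.array([
        len(set(t[:k]) & set(a[:k])) / k
        for t, a in zip(true_neighbors, ann_neighbors)
    ])
```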

2. Recall Varies Across Query Types

Average recall hides variation. Some queries are "easy"—their true neighbors are in dense, well-connected regions of your index. Other queries are "hard"—their neighbors are in sparse areas or near cluster boundaries.

Look at the recall distribution, not just the mean. A system with 95% average recall might have 80% recall for the hardest 10% of queries. That tail might matter to your application.
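Continuing the evaluation sketch above, looking at the tail is only a few extra lines:

```python
import numpy as np

# per-query recall from the recall_at_k sketch above
per_query = recall_at_k(true_neighbors, ann_neighbors, k=10)
print("mean recall:     ", per_query.mean())
print("10th percentile: ", np.percentile(per_query, 10))
print("1st percentile:  ", np.percentile(per_query, 1))
print("share below 0.8: ", (per_query < 0.8).mean())
```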

3. Index Construction Cost Is Real

HNSW gives you incremental inserts, but that doesn't mean they're free. Each insert requires finding the right place in the graph and connecting edges. At high insert rates, this becomes a bottleneck.

For batch workloads, consider building indices offline and swapping them in. For streaming workloads, budget for index update overhead in your capacity planning.

4. Hybrid Approaches Win

No single ANN algorithm is best for all scenarios. Production systems often combine methods: HNSW for the primary index, IVF for extremely large partitions, exact search for small filtered subsets.

Don't be dogmatic about algorithms. Use what works, where it works.

The Future: Where Do We Go From Here?

HNSW feels mature, but the field isn't done. Several directions look promising:

  • Learned indices: Why use fixed algorithms when we could learn index structures directly from data? Research on neural index structures is early but intriguing.
  • Hardware specialization: GPUs, TPUs, custom silicon designed for vector operations. As embedding-based systems become ubiquitous, specialized hardware becomes economical.
  • Filtered search optimization: The metadata filtering problem (find similar vectors where category=X) is still awkward. Better solutions are emerging.
  • Streaming and dynamic: Production data changes constantly. Indices that handle inserts, updates, and deletes efficiently—without periodic rebuilds—are the goal.

But the core insight of the ANN revolution will remain: for most applications, approximate really is good enough. That philosophical shift—from mathematical perfectionism to empirical pragmatism—opened the door to everything we're building today.

Want to go deeper?

The original HNSW paper by Malkov and Yashunin is more readable than most, and ann-benchmarks.com lets you compare algorithms on standard datasets before committing to one. Open-source libraries like FAISS and hnswlib are good places to experiment with the tradeoffs discussed above.
