When Approximate Became Good Enough: The ANN Revolution
Why 99% recall changed everything, and the tradeoff that enabled billion-scale search
There's a moment in every engineer's life when they realize that "perfect" is the enemy of "ships by Friday." The entire ANN revolution is basically that moment, crystallized into algorithms.
For decades, researchers chased exact nearest neighbor search. The goal: given a query point, find the actual closest point in your dataset. No compromises. Mathematically pure. And, at scale, practically unattainable within any reasonable compute budget.
Then someone asked the uncomfortable question: "What if we just... didn't?"
The Heresy That Worked: Approximate Is Fine

The magic zone: 95-99% recall at 100-1000x speedup. For most applications, users cannot tell the difference.
Here's the thing about "exact" nearest neighbor: in most applications, nobody actually needs it. When you're recommending products, finding similar images, or retrieving relevant documents, the difference between the 1st and 5th most similar item is imperceptible to users.
Let's make this concrete. Say you're building a photo similarity feature. User uploads a sunset photo, you return similar sunsets. The "true" most similar image might have a cosine similarity of 0.9523. The image you actually return (because your approximate algorithm missed the best one) has similarity 0.9518.
Does the user care? Does the user notice? No. But the exact search took 400ms and the approximate search took 2ms. That difference, the user notices.
"A quick answer that's 99% right beats a slow answer that's 100% right. In production, latency is a feature."
This isn't just hand-waving. The math checks out. For most real-world embedding spaces, the top-10 nearest neighbors form a cluster. If you find 9 out of 10, you're still returning extremely relevant results. The one you missed was probably only marginally better than your 10th pick anyway.
Locality Sensitive Hashing: The Original Good-Enough Algorithm
Before HNSW conquered the world, before graph-based methods became standard, there was LSH. Locality Sensitive Hashing. It's the algorithm that proved approximate search could work at scale, even if it's rarely the one we reach for today.
The core insight is beautifully simple: design hash functions where similar items hash to the same bucket with high probability.

Random hyperplanes cut through your vector space. Points on the same side of most hyperplanes are probably close together.
Normal hash functions do the opposite—they spread similar inputs across different buckets (that's the whole point of hashing). LSH deliberately breaks this. Similar vectors should collide. Dissimilar vectors should separate.
The classic LSH for cosine similarity uses random hyperplanes. Pick a random vector. For each data point, check which side of the hyperplane it's on. That gives you one bit. Do this with many random vectors, concatenate the bits, and you have a hash.
Why does this work? If two vectors point in similar directions, they'll be on the same side of most random hyperplanes. Their hashes will match on most bits. The probability of collision is directly related to their cosine similarity.
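To make this concrete, here's a minimal NumPy sketch of random-hyperplane signatures. The bit width, random seed, and toy vectors are arbitrary choices for illustration, not anything from a particular library:

```python
import numpy as np

def lsh_signatures(vectors: np.ndarray, n_bits: int = 32, seed: int = 0) -> np.ndarray:
    """Random-hyperplane LSH: one bit per hyperplane, 1 if the vector lies on
    the positive side of that hyperplane, 0 otherwise."""
    rng = np.random.default_rng(seed)
    hyperplanes = rng.standard_normal((n_bits, vectors.shape[1]))  # one random normal per bit
    return (vectors @ hyperplanes.T > 0).astype(np.uint8)          # shape: (n_vectors, n_bits)

def bit_agreement(sig_a: np.ndarray, sig_b: np.ndarray) -> float:
    """Fraction of matching bits; rises with the cosine similarity of the originals."""
    return float(np.mean(sig_a == sig_b))

# Vectors pointing in similar directions agree on most bits; a dissimilar one doesn't.
points = np.array([[1.0, 0.2, 0.1],     # a
                   [0.9, 0.3, 0.1],     # b, close to a
                   [-1.0, 0.5, -0.2]])  # c, roughly opposite
sigs = lsh_signatures(points)
print(bit_agreement(sigs[0], sigs[1]))  # high, since cos(a, b) is near 1
print(bit_agreement(sigs[0], sigs[2]))  # low, since cos(a, c) is negative
```

Bucketing then falls out for free: treat each signature (or a slice of it) as a hash key and group the vectors that share it.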
LSH in Practice: The Multi-Probe Trick
Raw LSH has a problem: to get high recall, you need many hash tables. Each table is independent, so you're betting on at least one of them putting your query and its true neighbors in the same bucket. More tables = higher probability = more memory.
The multi-probe trick fixes this: instead of just checking the bucket your query hashes to, check nearby buckets too. "Nearby" means buckets that differ by one or two bits. You're probing the hash space around your query point.
This trades computation for memory. Fewer tables, more probes per query. In practice, it made LSH viable for production systems that couldn't afford to keep dozens of replicated hash tables in RAM.
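A bare-bones way to generate those extra probes is to flip one or two bits of the query's signature and check every bucket within that small Hamming radius. The actual multi-probe paper orders probes by their estimated success probability; this sketch skips that and simply enumerates them:

```python
from itertools import combinations

def probe_buckets(signature: tuple, max_flips: int = 2):
    """Yield the query's own bucket first, then every bucket whose signature
    differs from it by at most `max_flips` bits."""
    yield signature
    for n_flips in range(1, max_flips + 1):
        for positions in combinations(range(len(signature)), n_flips):
            probed = list(signature)
            for p in positions:
                probed[p] ^= 1   # flip this bit
            yield tuple(probed)

# An 8-bit signature with up to 2 flips yields 1 + 8 + 28 = 37 buckets to check.
print(len(list(probe_buckets((1, 0, 1, 1, 0, 0, 1, 0)))))  # 37
```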
Why LSH Faded (And What Replaced It)
LSH was revolutionary, but it had limitations. The hash functions are data-independent—they don't adapt to your specific embedding distribution. If your data clusters weirdly, LSH doesn't know.
The recall/speed tradeoffs also hit a ceiling. Getting from 90% to 99% recall required exponentially more hash tables or probes. For applications demanding high recall (like search engines), this got expensive.
Enter the algorithms that did learn from data: tree-based methods, quantization-based methods, and eventually the method that would dominate: graph-based search.
HNSW: The Algorithm That Changed Everything

HNSW builds a layered graph. Higher layers are sparser, enabling fast coarse navigation. Lower layers are denser, enabling precise search.
Hierarchical Navigable Small World graphs (HNSW) appeared in a 2016 paper by Malkov and Yashunin, and within a few years it had become the default choice for high-recall ANN search. There's a reason: it's really, really good.
The core idea combines two concepts: navigable small-world graphs and hierarchical structure. Let me unpack both.
Small-World Graphs: Six Degrees of Separation
You know the "six degrees of separation" thing? Any person on Earth is connected to any other through about six friend-of-friend hops. That's a small-world network: high clustering (your friends know each other) plus a few long-range connections (you know someone across the world).
Now imagine your vectors as nodes in a graph. Connect each vector to its k-nearest neighbors. You get high clustering—similar vectors are connected to each other. Add some longer-range edges, and you can navigate from any starting point to any target in logarithmic hops.
Search becomes graph traversal: start somewhere, greedily move to the neighbor closest to your query, repeat until you can't improve. The small-world property guarantees you'll reach the neighborhood of your target quickly.
The Hierarchical Trick: Coarse-to-Fine Navigation
Plain small-world graphs have a problem: if you start far from your target, greedy search can get stuck in local minima. You need to jump across the graph, but all your edges are to nearby points.
HNSW fixes this with layers. The bottom layer (layer 0) contains all points. Higher layers contain progressively fewer points—each point is assigned a maximum layer with exponentially decaying probability, so the layers thin out geometrically as you go up.
Search starts at the top layer, which is sparse. Greedy search on sparse graphs takes big jumps across the vector space. When you can't improve anymore, drop down a layer. The next layer is denser, so your jumps get smaller. Keep descending until you hit layer 0, where you refine to the final answer.
It's like zooming in on a map. Start at continent level, quickly navigate to the right country, zoom in to find the city, zoom again for the neighborhood, and finally the exact address. Each level has the right resolution for its job.
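In code, the two ingredients are a layer-assignment rule with exponentially decaying probability and a descent loop that reruns greedy routing layer by layer. This is a sketch of the idea only, not the full HNSW insert/search logic (no beam search, no graph construction):

```python
import math
import random
import numpy as np

def random_level(m_l: float = 1.0 / math.log(16)) -> int:
    """Exponentially decaying layer assignment: level = floor(-ln(U) * mL).
    Most points land only in layer 0; a few reach higher layers. The HNSW
    paper suggests mL = 1/ln(M); M = 16 is assumed here."""
    return int(-math.log(random.random()) * m_l)

def coarse_to_fine_search(query: np.ndarray, vectors: np.ndarray,
                          layers: list, entry_point: int) -> int:
    """Run greedy routing on each layer, top (sparsest) to bottom (layer 0),
    feeding the local minimum of one layer in as the entry point of the next."""
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    current = entry_point
    for adjacency in layers:              # layers ordered top -> layer 0
        improved = True
        while improved:
            improved = False
            for candidate in adjacency.get(current, []):
                if dist(candidate) < dist(current):
                    current, improved = candidate, True
    return current
```

Because levels are nested (a point assigned to layer 3 also appears in layers 2, 1, and 0), the node you stop at on one layer is guaranteed to exist on the next layer down.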
Why HNSW Dominates
HNSW hit the sweet spot that production systems care about:
- High recall at reasonable speed: 99%+ recall is achievable without heroic compute. The hierarchical structure makes search efficient even at very high recall targets.
- Incremental updates: Unlike IVF indices that require periodic retraining, HNSW can add new vectors by connecting them to existing graph nodes. No full rebuild needed.
- Good memory efficiency: The graph structure requires storing neighbor lists, but this scales linearly with data size. No exponential blowup.
- Works well out of the box: The hyperparameters (M, ef_construction, ef_search) have intuitive effects and sensible defaults. You don't need a PhD to tune it (see the short example after this list).
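For a feel of what this looks like in practice, here's roughly how you'd build and query an index with the hnswlib library. The parameter values are placeholders for illustration, not tuned recommendations:

```python
import numpy as np
import hnswlib

dim = 128
data = np.random.random((10_000, dim)).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(data), M=16, ef_construction=200)  # build-time knobs
index.add_items(data, np.arange(len(data)))                          # incremental inserts
index.set_ef(64)                                                     # query-time candidate list size

labels, distances = index.knn_query(data[:5], k=10)
print(labels.shape)  # (5, 10)
```

Roughly: M sets how many edges each node keeps, ef_construction sets how hard the builder works to find good neighbors, and ef trades recall against latency at query time.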
Today, essentially every vector database—Pinecone, Weaviate, Qdrant, Milvus—offers HNSW as their primary or default index type. It's the algorithm that made billion-scale similarity search routine rather than heroic.
The "Good Enough" Philosophy
I want to zoom out and talk about what the ANN revolution really represents, because I think it's deeper than just "we found faster algorithms."

The ANN revolution wasn't about better algorithms. It was about accepting "good enough" as a design principle.
Computer science has a perfectionist streak. We love formal proofs, optimal algorithms, guaranteed bounds. ANN said: forget all that. Let's be empirically good enough for real-world data, and deal with edge cases if they actually happen.
This is philosophically uncomfortable! What if you miss an important result? What if your 1% error rate compounds across millions of queries? What about adversarial inputs designed to hit your algorithm's weak spots?
These are valid concerns, and for some applications (medical diagnosis, legal discovery), approximate search might genuinely be the wrong choice. But for the vast majority of similarity search use cases, the "good enough" philosophy is exactly right.
"Perfect is the enemy of shipped. And shipped is the enemy of perfectly implemented. But shipped and good enough? That's the sweet spot."
Think about it: embeddings themselves are approximations. The neural network that produced them doesn't give you "true" semantic similarity—it gives you a learned approximation trained on some proxy task. Your "exact" nearest neighbor in embedding space isn't the "true" most similar item; it's the item whose embedding happens to be closest according to your model's compression of meaning.
In that light, insisting on exact search seems... odd? You're demanding mathematical precision at one layer of the stack while accepting deep approximations at every other layer.
The Tradeoff Triangle: Recall, Speed, Memory
Every ANN algorithm navigates the same fundamental tradeoff. You can optimize for any two of three properties, but not all three:

Pick two. This constraint is inherent to the problem, not an artifact of any particular implementation.
High Recall + Fast Queries = More Memory
Want 99% recall with sub-millisecond latency? You need the index to "know" more about the data structure. For HNSW, that means more edges per node. For IVF, that means more partitions and residual data. For LSH, that means more hash tables.
All of this costs memory. At billion scale, memory costs real money—cloud RAM isn't cheap.
High Recall + Low Memory = Slower Queries
If you can't afford to store extra index structures, you compensate by doing more work at query time. Search more partitions. Traverse more graph edges. Probe more hash buckets. The recall goes up, but so does latency.
Fast Queries + Low Memory = Lower Recall
This is the aggressive compression regime. Quantize vectors heavily. Use fewer partitions. Keep the graph sparse. Queries fly, memory is tiny, but you'll miss more true neighbors.
In practice, you choose where on this triangle your application sits. E-commerce recommendations might tolerate 90% recall for ultra-low latency. A legal document search might demand 99.5% recall and accept higher latency and memory costs.
Lessons for Production Systems
Having worked with ANN systems at various scales, here are the practical lessons I've learned:
1. Measure Recall on YOUR Data
Benchmark numbers lie. An algorithm's recall on ann-benchmarks.com doesn't predict its recall on your specific embeddings and query distribution. Your data might cluster differently. Your queries might hit edge cases.
Always evaluate on a held-out set with ground truth. Compute brute-force neighbors for a sample of queries, then see how many your ANN system finds. That's your real recall.
2. Recall Varies Across Query Types
Average recall hides variation. Some queries are "easy"—their true neighbors are in dense, well-connected regions of your index. Other queries are "hard"—their neighbors are in sparse areas or near cluster boundaries.
Look at the recall distribution, not just the mean. A system with 95% average recall might have 80% recall for the hardest 10% of queries. That tail might matter to your application.
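Continuing the sketch above (with `my_ann_search` standing in for your actual index wrapper), a few percentiles of the per-query recall array tell you far more than the mean does:

```python
per_query = recall_at_k(queries, corpus, my_ann_search, k=10)  # from the previous sketch
print("mean recall:", per_query.mean())
print("p10 / p5 / p1:", np.percentile(per_query, [10, 5, 1]))
print("worst queries:", np.argsort(per_query)[:10])  # worth inspecting by hand
```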
3. Index Construction Cost Is Real
HNSW gives you incremental inserts, but that doesn't mean they're free. Each insert requires finding the right place in the graph and wiring up edges. At high insert rates, this becomes a bottleneck.
For batch workloads, consider building indices offline and swapping them in. For streaming workloads, budget for index update overhead in your capacity planning.
4. Hybrid Approaches Win
No single ANN algorithm is best for all scenarios. Production systems often combine methods: HNSW for the primary index, IVF for extremely large partitions, exact search for small filtered subsets.
Don't be dogmatic about algorithms. Use what works, where it works.
The Future: Where Do We Go From Here?
HNSW feels mature, but the field isn't done. Several directions look promising:
- Learned indices: Why use fixed algorithms when we could learn index structures directly from data? Research on neural index structures is early but intriguing.
- Hardware specialization: GPUs, TPUs, custom silicon designed for vector operations. As embedding-based systems become ubiquitous, specialized hardware becomes economical.
- Filtered search optimization: The metadata filtering problem (find similar vectors where category=X) is still awkward. Better solutions are emerging.
- Streaming and dynamic: Production data changes constantly. Indices that handle inserts, updates, and deletes efficiently—without periodic rebuilds—are the goal.
But the core insight of the ANN revolution will remain: for most applications, approximate really is good enough. That philosophical shift—from mathematical perfectionism to empirical pragmatism—opened the door to everything we're building today.
References & Further Reading
- Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs (Malkov & Yashunin, 2016 - The HNSW paper)
- Billion-scale similarity search with GPUs (Johnson et al., 2017 - FAISS paper)
- Multi-probe LSH: Efficient Indexing for High-Dimensional Similarity Search (Lv et al., 2007 - The multi-probe paper)
- ANN Benchmarks (Standardized benchmarks for comparing ANN algorithms)
- HNSW Explained (Pinecone - Excellent visual explanation)