From 1M to 1B Vectors: The Scaling Crisis That Created MLGraph
Why existing solutions couldn't handle our scale, what we learned from their limitations, and the path to building something new
There's a moment in every scaling story where the solutions that got you this far suddenly don't work anymore. For us, that moment came in early 2024. We were running FAISS Extended for a client with 450 million vectors. Things were working. Then they asked us to merge three more datasets. The combined size: 1.2 billion vectors. And they wanted sub-100ms query latency. And real-time updates. And multi-tenant isolation.
I remember sitting in front of our architecture whiteboard, trying to make the numbers work. They didn't. No matter how I arranged the boxes, something broke.
This is the story of that crisis, what we learned, and how it eventually led us to build MLGraph. I want to be honest about the failures along the way, because I think there's more to learn from what didn't work than from the polished version we eventually shipped.
The Numbers That Broke Everything
Let me lay out what we were dealing with. The client was a code intelligence platform - think "GitHub code search but for enterprise monorepos." Their requirements, which seemed reasonable when we first heard them:
- 1.2 billion code embeddings (function-level chunks)
- 1536 dimensions (text-embedding-3-large)
- Query latency: p99 under 100ms
- Updates: 50K vector insertions/deletions per hour
- Multi-tenancy: 200+ organizations, strict isolation
- Filtering: restrict search to org, repo, language, date range
Do the basic math. 1.2 billion vectors at 1536 dimensions, 4 bytes per float. That's roughly 7.4 terabytes of raw vector data. Add your IVF index structures, your metadata, your filter indices, and you're looking at 12-15 TB of indexed data that needs to be searchable in under 100ms.
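If you want to check the arithmetic yourself, it fits in a few lines (the numbers are the requirements from the list above):

```python
# Raw vector storage, before any index structures.
num_vectors = 1_200_000_000      # 1.2B function-level embeddings
dims = 1536                      # text-embedding-3-large
bytes_per_float = 4              # float32, uncompressed

raw_tb = num_vectors * dims * bytes_per_float / 1e12
print(f"{raw_tb:.2f} TB of raw vectors")   # ~7.37 TB

# IVF lists, metadata, and filter indices add a healthy multiple on top,
# which is how you land in the 12-15 TB range of searchable data.
```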
Okay, you say, just shard it. Split across machines. Standard horizontal scaling. Here's where it gets interesting.
Why Simple Sharding Didn't Work
The Query Fan-Out Problem
When you shard a vector index by some key (document ID, time range, tenant ID), you need to understand how queries will hit those shards. For vector search, the answer is usually "all of them."
Why? Because you're looking for the k nearest neighbors in the entire dataset. Unless you know a priori that the relevant vectors are all in one shard, you need to query every shard, get their top-k results, and merge. This is the fan-out problem.
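Mechanically, fan-out is just scatter-gather plus a top-k merge. A minimal sketch (the per-shard `search` client is a stand-in for whatever your shards actually expose):

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def fan_out_search(query_vec, shards, k=10):
    """Send the query to every shard, then merge the local top-k lists.

    Each shard is assumed to expose search(vec, k) -> [(distance, id), ...].
    """
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = pool.map(lambda shard: shard.search(query_vec, k), shards)
    all_hits = (hit for hits in per_shard for hit in hits)
    return heapq.nsmallest(k, all_hits, key=lambda hit: hit[0])
```

The merge is cheap. The problem is that `pool.map` doesn't return until the slowest shard does.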
With 100 shards (a reasonable number for 1.2B vectors), every query becomes 100 sub-queries. If each shard query takes 20ms, you're at 20ms total with perfect parallelization. But you're not going to get perfect parallelization. Network variance, GC pauses, hot shards, cold caches. In practice, your p99 is dominated by the slowest shard, and that's a lot higher than 20ms.
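The tail math is brutal even under the generous assumption that shard latencies are independent (they aren't quite): the query is only as fast as its slowest shard, so per-shard tail probabilities compound.

```python
# If each of 100 shards beats some latency bound 99% of the time,
# the chance that every shard beats it on a single fan-out query is:
p_all_fast = 0.99 ** 100
print(f"{p_all_fast:.2f}")   # ~0.37 -- roughly 2 out of 3 queries hit at
                             # least one shard's tail, and the query inherits it.
```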
We measured this in our test cluster. p50 was fine: 35ms. p99? 280ms. Completely unacceptable.
The Shard Imbalance Problem
Okay, so what if we shard smarter? Use locality-sensitive hashing to put similar vectors on the same shard? Then queries can be routed to fewer shards.
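Concretely, LSH-based routing looks something like this: hash each vector against a handful of random hyperplanes and map the resulting bucket to a shard (a sketch; names and parameters are illustrative):

```python
import numpy as np

class LSHShardRouter:
    """Assign vectors to shards by their random-hyperplane sign pattern.

    Similar vectors tend to share a signature and therefore a shard --
    which is exactly what later concentrated popular content on a few shards.
    """
    def __init__(self, dims, num_shards, num_planes=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((num_planes, dims))
        self.num_shards = num_shards

    def shard_for(self, vec):
        bits = (self.planes @ vec) > 0                      # sign pattern
        bucket = int("".join("1" if b else "0" for b in bits), 2)
        return bucket % self.num_shards
```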
We tried this. The problem is that "similar" in vector space doesn't correlate with usage patterns. If a thousand developers are all searching for authentication code, those authentication vectors are probably clustered in the same LSH bucket. One shard gets hammered. Others sit idle.
We watched our "hot" shards hit 10x the query rate of cold shards. Latency variance exploded. The p99 was now being dominated by a single overwhelmed machine.
The Filter Join Problem
Here's one that really surprised us. Most queries had filters: "search code in the payments repo, Python files only, modified in the last 30 days." Straightforward, right?
Except now you need to coordinate between your vector index and your filter index. Pre-filter (apply the filters first, then run the vector search)? Works great until your filter is highly selective, because then you're searching a small, arbitrary subset that your vector index was never built for. Post-filter (vector search first, then filter)? You may have to scan far more candidates than necessary to find enough matches, which kills latency.
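To make the trade-off concrete, here are both strategies as brute-force NumPy (a real deployment swaps in an ANN index, but the failure modes are the same):

```python
import numpy as np

def pre_filter_search(query, vectors, mask, k=10):
    """Filter first, then search only the survivors. Fast when the filter is
    selective -- but now you're brute-forcing an arbitrary subset that the
    prebuilt ANN index can't help with."""
    idx = np.flatnonzero(mask)
    dists = np.linalg.norm(vectors[idx] - query, axis=1)
    return idx[np.argsort(dists)[:k]]

def post_filter_search(query, vectors, mask, k=10, overfetch=10):
    """Search first, then drop hits that fail the filter. You have to
    over-fetch, and a selective filter can still leave you short of k."""
    dists = np.linalg.norm(vectors - query, axis=1)
    candidates = np.argsort(dists)[: k * overfetch]
    return candidates[mask[candidates]][:k]
```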
The academic literature calls this the "hybrid query" problem. The industry solution is usually "try both strategies and hope one works." That's not exactly inspiring.
The Update Saga
I haven't even mentioned updates yet. 50K vector updates per hour sounds manageable until you think about what it means for index consistency.
FAISS's IVF indices are optimized for static workloads. Adding vectors is fine-ish - they get assigned to the nearest cluster. Deleting vectors requires marking them as deleted and occasionally compacting. Update means delete-then-add.
At 50K updates per hour, across 100 shards, you're modifying 500 vectors per shard per hour. That's about one modification every 7 seconds, per shard. Each modification potentially triggers cluster rebalancing. Each rebalancing needs to not interfere with concurrent queries.
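For reference, the delete-then-add dance against a FAISS IVF index looks roughly like this (synthetic data; parameters are illustrative, not what we ran in production):

```python
import faiss                    # assumes faiss-cpu (or faiss-gpu) is installed
import numpy as np

d, nlist = 1536, 256
corpus = np.random.rand(50_000, d).astype("float32")   # stand-in for real embeddings
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(corpus)
index.add_with_ids(corpus, np.arange(len(corpus), dtype="int64"))

def apply_update(index, ids, new_vectors):
    """An 'update' is really delete-then-add: remove_ids rewrites the affected
    inverted lists, and sustained churn drifts the data away from the
    centroids the index was trained on."""
    ids = np.asarray(ids, dtype="int64")
    index.remove_ids(ids)
    index.add_with_ids(new_vectors.astype("float32"), ids)
```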
We tried multiple approaches:
Approach 1: Write-Ahead Log + Periodic Merge
Buffer updates in a WAL, periodically merge into the main index. Classic database pattern. Problem: during the merge window, you're either blocking queries or serving stale results. Neither was acceptable.
Approach 2: Dual Index Swap
Maintain two copies of the index. Apply updates to the offline copy, swap when ready. Problem: 2x memory requirement. At 15 TB of indexed data, that's 30 TB of RAM across your cluster. Cost prohibitive.
Approach 3: Segment-Based Updates
Split each shard into time-based segments. New segments are immutable. Updates go to the newest segment. Searches fan out across all segments.
This actually worked, with one catch: segment count grows over time, and so does query latency. You need a background compaction process that merges segments without blocking queries. That's its own engineering project.
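Stripped to its essentials, the segment structure looks like this (a sketch; the real version adds tombstones for deletes, background compaction, and concurrency control):

```python
import heapq

class SegmentedShard:
    """Immutable sealed segments plus one mutable head segment.

    Segments are assumed to expose add(id, vec), search(query, k), and len().
    Queries fan out over every segment; compaction (not shown) is what keeps
    the segment count -- and therefore query latency -- from growing forever.
    """
    def __init__(self, segment_factory, max_head_size=100_000):
        self.segment_factory = segment_factory
        self.sealed = []                          # immutable, search-only
        self.head = segment_factory()             # takes all new writes
        self.max_head_size = max_head_size

    def insert(self, vec_id, vec):
        self.head.add(vec_id, vec)
        if len(self.head) >= self.max_head_size:
            self.sealed.append(self.head)         # seal it, start a fresh head
            self.head = self.segment_factory()

    def search(self, query, k=10):
        hits = []
        for segment in [*self.sealed, self.head]:
            hits.extend(segment.search(query, k))   # [(distance, id), ...]
        return heapq.nsmallest(k, hits, key=lambda hit: hit[0])
```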
What We Learned From the Failures
Somewhere around month three of this project, I stepped back and tried to articulate what we were actually fighting. Not the symptoms, but the root causes. Here's what I came up with:
1. ANN Indices Are Fundamentally Static
The algorithms powering approximate nearest neighbor search (HNSW, IVF, etc.) were designed for batch workloads. Build an index, query it many times, occasionally rebuild. The "occasional rebuild" part is usually measured in hours or days, not seconds.
Trying to bolt dynamic updates onto fundamentally static data structures is fighting the physics. You can make it work, but you're constantly paying a performance tax.
2. Separation of Concerns Is Expensive
We had separate systems for vector indexing, metadata filtering, and query coordination. Every query required orchestrating across these systems. That orchestration added latency, complexity, and failure modes.
The insight: what if the vector index knew about metadata? What if filters were part of the index structure, not a separate bolt-on?
3. Sharding Strategy Is Query-Dependent
There's no universally good way to shard vectors. LSH-based sharding helps some queries and hurts others. Time-based sharding helps recency queries but not semantic ones. Random sharding gives even load but maximum fan-out.
The insight: maybe the sharding needs to be dynamic. Maybe different queries should hit different shard subsets based on what they're asking for.
The MLGraph Hypothesis
By mid-2024, we had a hypothesis. What if we built a system from scratch with these constraints as first-class concerns? Not as afterthoughts bolted onto an existing index library, but as fundamental design principles?
We started sketching what became MLGraph. The core ideas:
Principle 1: Native Multi-Tenancy
Instead of one big index that you filter by tenant, maintain per-tenant index structures. This sounds expensive until you realize that tenant-specific indices are much smaller, fit in cache better, and don't require cross-tenant coordination.
For our client's 200 organizations, average tenant size was 6 million vectors. That's manageable on a single machine. No fan-out needed for single-tenant queries. Only cross-tenant admin queries need coordination.
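The routing layer gets almost boring once tenancy is native. Roughly (names hypothetical, not MLGraph's actual API):

```python
class TenantRouter:
    """One index per tenant: single-tenant queries never fan out, and only
    admin-style cross-tenant queries touch more than one index."""
    def __init__(self, index_factory):
        self.index_factory = index_factory
        self.indices = {}                         # tenant_id -> per-tenant index

    def index_for(self, tenant_id):
        # ~1.2B vectors / 200 orgs ~= 6M vectors per tenant, each of which
        # fits comfortably on a single machine.
        if tenant_id not in self.indices:
            self.indices[tenant_id] = self.index_factory(tenant_id)
        return self.indices[tenant_id]

    def search(self, tenant_id, query_vec, k=10):
        return self.index_for(tenant_id).search(query_vec, k)
```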
Principle 2: Update-First Architecture
We designed the index structure around updates, not around static bulk loads. Incoming vectors go into a write-optimized buffer. Background compaction merges buffers into the main index. Queries always check both buffer and main index.
The trick was making buffer queries fast enough. We use a small but precise graph index for the buffer, which handles the recency bias (recent code is searched more often) naturally.
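The read path, assuming the buffer and main index expose the same search interface (the buffer-wins deduplication is our assumption about how updated vectors should shadow their old copies, sketched rather than quoted from the real code):

```python
import heapq

def search(buffer_index, main_index, query, k=10):
    """Query the write-optimized buffer and the compacted main index, then
    merge. A recently updated vector may exist in both; the buffer's copy wins."""
    buffer_hits = buffer_index.search(query, k)          # [(distance, id), ...]
    shadowed = {vec_id for _, vec_id in buffer_hits}
    main_hits = [hit for hit in main_index.search(query, k) if hit[1] not in shadowed]
    return heapq.nsmallest(k, buffer_hits + main_hits, key=lambda hit: hit[0])
```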
Principle 3: Integrated Filtering
Metadata isn't stored separately - it's part of the vector index structure. When you traverse the HNSW graph, you're simultaneously checking filter predicates. No post-filter scan. No pre-filter index lookup.
This required building custom index structures. We couldn't use off-the-shelf HNSW implementations because they don't support predicate pushdown. But the latency improvement was worth the engineering.
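To show what "checking predicates during traversal" means, here's a simplified single-layer beam search with the filter pushed into the walk (real HNSW has multiple layers; `predicate` is any function over a node's metadata, and the whole thing is a sketch rather than MLGraph's implementation):

```python
import heapq
import numpy as np

def filtered_greedy_search(graph, vectors, entry, query, predicate, k=10, ef=64):
    """Beam search over a proximity graph, evaluating the metadata predicate
    as nodes are visited. Non-matching nodes are still traversed (they keep
    the graph connected) but never enter the result set."""
    def dist(node):
        return float(np.linalg.norm(vectors[node] - query))

    visited = {entry}
    frontier = [(dist(entry), entry)]            # min-heap of (distance, node)
    results = []                                 # max-heap via (-distance, node)

    while frontier:
        d, node = heapq.heappop(frontier)
        if len(results) >= ef and d > -results[0][0]:
            break                                # nothing closer left to explore
        if predicate(node):                      # filter checked during the walk
            heapq.heappush(results, (-d, node))
            if len(results) > ef:
                heapq.heappop(results)           # evict the current worst match
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                heapq.heappush(frontier, (dist(neighbor), neighbor))

    return [node for _, node in heapq.nlargest(k, results)]
```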
Principle 4: Adaptive Query Routing
Not all queries should hit all shards. A query for "Python authentication code" should probably start with Python-heavy shards. If it finds good enough results, stop. If not, expand.
We built a query router that learns from past queries. It maintains statistics about which shards contain which types of content. It's not perfect, but it reduces average fan-out from 100% to about 30% for typical workloads.
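A toy version of the router makes the idea concrete; the shard statistics and the "good enough" cutoff below are stand-ins for the learned model, not its actual implementation:

```python
class AdaptiveRouter:
    """Send the query to the shards most likely to hold relevant content,
    expanding to the rest only if the first wave comes back weak."""
    def __init__(self, shards, shard_stats):
        self.shards = shards
        self.shard_stats = shard_stats       # e.g. {shard_id: {"python": 0.61, ...}}

    def search(self, query_vec, query_tags, k=10, first_wave=0.3, max_distance=0.8):
        # Rank shards by how much of the query's tagged content they hold.
        ranked = sorted(
            self.shards,
            key=lambda s: -sum(self.shard_stats[s.id].get(t, 0.0) for t in query_tags),
        )
        cutoff = max(1, int(len(ranked) * first_wave))
        hits = [h for s in ranked[:cutoff] for h in s.search(query_vec, k)]
        best = sorted(hits, key=lambda h: h[0])[:k]
        # Stand-in for the real confidence check: if even the closest hit is
        # far away, fall back to querying the remaining shards.
        if not best or best[0][0] > max_distance:
            hits += [h for s in ranked[cutoff:] for h in s.search(query_vec, k)]
            best = sorted(hits, key=lambda h: h[0])[:k]
        return best
```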
The Honest Assessment
I want to be clear about what MLGraph is and isn't. It's not a magic solution that makes billion-scale vector search trivial. It's a set of architectural decisions that trade off different things than existing solutions.
What We Got Right
Query latency at scale. For our benchmark workload (1B vectors, realistic filter patterns), MLGraph achieves p99 latency of 45ms. That's about 6x better than our best FAISS Extended deployment at the same scale.
Update throughput. We can sustain 100K updates per minute across the cluster without degrading query latency. The segment-based architecture just... works.
Operational simplicity. Despite being distributed, the system has fewer moving parts than our previous architecture. Less orchestration means fewer failure modes.
What We're Still Figuring Out
Cross-tenant queries. When you need to search across all tenants (admin use case, global code search), you're back to full fan-out. We have ideas for caching and pre-computation but haven't shipped them yet.
Index build time. Initial bulk load for a new tenant is slower than FAISS. The update-optimized structures have overhead that hurts batch performance. We're working on a hybrid approach that uses fast bulk loading and converts to an update-friendly format afterward.
Memory efficiency. The integrated filter indices use more memory than we'd like. We're experimenting with compressed filter representations but haven't found the right tradeoff yet.
The Bigger Picture
Looking back at the scaling crisis of early 2024, I'm struck by how much of it was predictable. The trajectory from million to billion was visible for anyone watching the industry. The limitations of static ANN indices were well-documented in academic papers. The challenges of distributed vector search were known.
We just weren't paying attention. We were too busy keeping existing systems running to invest in what would be needed next.
The lesson I take from this: when you see a 10x scale jump on the horizon, start building for it before you need it. The architecture that works at current scale probably doesn't work at 10x. And retrofitting is way more painful than designing correctly from the start.
Right now, I'm watching embedding dimensions grow (4096 is becoming common), multimodal vectors emerge (images + text + code), and real-time collaborative use cases that demand even lower latency. The next scaling crisis is already taking shape. We're trying not to get surprised this time.
"The best time to redesign your architecture was before you hit scale limits. The second best time is now. The worst time is when everything is on fire."
MLGraph isn't the final answer. It's our current best attempt at solving the problems we understand today. Two years from now, we'll probably be writing another post about what we got wrong and what we learned.
That's how it works. You build the best thing you can with what you know. You ship it. You learn from how it breaks. You build something better.
The scaling crisis that created MLGraph was painful while we lived through it. But I'm grateful for it, in a way. It forced us to question assumptions we'd been living with for years. It pushed us to build something we're genuinely proud of.
And hey, the next time a client calls asking about a few billion vectors, we actually have an answer.
Ready to Scale Beyond FAISS?
MLGraph was built for the challenges described in this post: billion-scale vector search with real-time updates, low latency, and native multi-tenancy. If this sounds like your use case, we should talk.