The Great Embedding Explosion of 2022: When Everything Became a Vector

How GPT-3 embeddings changed vector database requirements overnight, and what we learned scrambling to keep up

14 min read

I still remember the week OpenAI released text-embedding-ada-002. It was December 2022, and within 72 hours, three different clients called us asking if our FAISS infrastructure could handle "a few billion vectors." Just like that. As if the previous world where a million vectors was a big deployment had simply evaporated.

The thing about paradigm shifts is that they don't announce themselves. Nobody sent out a memo saying "Hey, the way we build search is about to fundamentally change." GPT-3's embeddings just... appeared. And suddenly everyone wanted to vectorize everything.

The Before Times: When a Million Vectors Was a Lot

Let me paint you a picture of 2021. If you needed semantic search, you were probably using Elasticsearch with some flavor of word2vec or sentence-transformers. A million vectors was a serious deployment. Ten million was "call in the consultants" territory. Vector databases existed, sure, but they were niche tools for computer vision researchers and recommendation systems.

The embeddings themselves were... fine. BERT-based models gave you 768-dimensional vectors. They worked okay for semantic similarity, struggled with anything requiring actual reasoning, and, especially in the older word2vec-style setups, had this annoying tendency to treat "bank" the financial institution and "bank" the river feature as basically the same thing. We made it work. We always make it work.

Then text-embedding-ada-002 dropped.

What Made Ada-002 Different

Here's the thing most people missed in the announcement: the model itself wasn't the revolutionary part. Yes, 1536 dimensions. Yes, better benchmarks. Yes, built on top of code-davinci-002's architecture. All nice. But the real revolution was in three boring details that nobody put in the blog posts:

1. The API Made It Trivial

Before ada-002, embedding text at scale required maintaining your own inference infrastructure. GPUs, model serving, batching, the whole operational nightmare. OpenAI said: here's an API endpoint, just POST your text, get vectors back. Pay per token.

That sounds small. It wasn't. Suddenly, any developer could embed a million documents over a weekend for a few hundred bucks. No ML expertise required. No infrastructure. Just curl and a credit card.
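
That really was the whole workflow. Here's a minimal sketch using plain requests against the public embeddings endpoint; the embed helper name is mine, and a real pipeline would add retries and batching:

```python
import os
import requests

def embed(texts):
    """Embed a batch of strings with the OpenAI embeddings endpoint.

    Minimal sketch: no retries, no rate-limit handling, and it assumes
    OPENAI_API_KEY is set in the environment.
    """
    resp = requests.post(
        "https://api.openai.com/v1/embeddings",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"model": "text-embedding-ada-002", "input": texts},
        timeout=30,
    )
    resp.raise_for_status()
    # One vector per input string, 1536 dimensions each.
    return [item["embedding"] for item in resp.json()["data"]]

vectors = embed(["how do we handle race conditions in payment processing?"])
print(len(vectors[0]))  # 1536
```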

2. The Quality Jump Was Real

Previous embedding models were trained on... limited data. Ada-002 inherited GPT-3's training, which meant it had seen the internet. Code, prose, technical documentation, Reddit arguments, academic papers. The embeddings understood context in ways previous models couldn't.

We ran comparisons with our clients' existing search. The difference wasn't marginal. Queries that returned garbage before suddenly returned gold. "Find code that handles race conditions in payment processing" actually found race condition handling. Not just files that mentioned the words.

3. The Price Made Scale Economical

$0.0001 per 1K tokens. Think about that. At roughly a thousand tokens per document, a million documents is about a billion tokens, which comes out to around $100. The economics that previously made "embed everything" crazy suddenly made it... obvious?

And this is where things got interesting. Because when "embed everything" became economical, everyone started embedding everything. Documents, code, messages, support tickets, logs. If it was text, someone wanted to vectorize it.

The RAG Revolution Lands

Then came the magic trick: Retrieval Augmented Generation. The idea was elegant. Don't try to cram everything into the LLM's context window. Instead, embed your documents, embed the user's query, find the most similar documents, and feed those to the LLM as context.
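
In code, the skeleton really is that short. Here's a minimal sketch, assuming the embed helper from above, a list of chunk strings in doc_texts, and whatever completion function you have on hand standing in as llm_call; real systems add chunking, caching, reranking, and metadata on top:

```python
import numpy as np
import faiss  # pip install faiss-cpu

# Assume `doc_texts` is a list of document chunks and `embed` is the
# helper from the earlier sketch (any embedding function works here).
doc_vectors = np.array(embed(doc_texts), dtype="float32")
faiss.normalize_L2(doc_vectors)                   # cosine similarity via inner product
index = faiss.IndexFlatIP(doc_vectors.shape[1])
index.add(doc_vectors)

def retrieve(query, k=5):
    q = np.array(embed([query]), dtype="float32")
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return [doc_texts[i] for i in ids[0]]

def answer(query, llm_call):
    """`llm_call` is a placeholder for whatever completion API you use."""
    context = "\n\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm_call(prompt)
```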

RAG took ChatGPT from "impressive party trick" to "actually useful for work." Suddenly your AI assistant could answer questions about your codebase, your documentation, your customer support history. It wasn't hallucinating (as much) because it had real documents to ground its answers.

"We went from 'cool demo' to 'production requirement' in about six months. Every enterprise client wanted RAG. And RAG wanted vector search."

The infrastructure implications were immediate. If you're doing RAG, you need:

  • Fast vector search at query time (users expect millisecond responses)
  • Ability to update vectors as documents change
  • Scale to your entire document corpus (often tens of millions of chunks)
  • Filtering on metadata (search code, but only in the auth module)

That last requirement killed a lot of early vector database deployments. Pure ANN search is one thing. ANN search with metadata filtering is a different beast entirely.
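
Here's why. The naive approach is to over-fetch from the ANN index and filter afterwards, roughly like the sketch below (the helper and its parameters are mine, purely illustrative). The trouble is that when the filter is selective, no fixed over-fetch factor guarantees you get k results back, and pushing the predicate down into the index traversal is where the real engineering lives.

```python
def filtered_search(index, query_vec, metadata, predicate, k=10, overfetch=10):
    """Naive post-filtering: over-fetch from the ANN index, then drop
    results whose metadata fails the predicate.

    `query_vec` is a (1, d) float32 array; `metadata` maps vector id -> dict,
    e.g. {"module": "auth", "path": "..."}.
    """
    scores, ids = index.search(query_vec, k * overfetch)
    hits = [(s, i) for s, i in zip(scores[0], ids[0])
            if i != -1 and predicate(metadata[i])]
    return hits[:k]

# "search code, but only in the auth module":
# filtered_search(index, q, metadata, lambda m: m["module"] == "auth")
```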

The Scale Requirements Nobody Expected

Here's where I need to get honest about what happened to us. In early 2023, we were running FAISS deployments for several large enterprises. Typical scale: 5-20 million vectors. Comfortable. Manageable. We knew how to shard, how to replicate, how to handle updates.

Then a financial services client called. They wanted to embed every document they'd ever created. Every email. Every report. Every transaction note. The number they threw out: 2.3 billion documents.

I'll be honest: my first reaction was "that's insane." My second reaction was to start running the math. 2.3 billion vectors at 1536 dimensions, 4 bytes per float. That's roughly 14 terabytes just for the vectors. Add the IVF index structure, the metadata, and you're looking at 20-25TB of searchable data. Response time requirement: under 100ms for top-100 retrieval.

Standard FAISS couldn't do it. Not without some serious rearchitecting.

What Broke (And How We Fixed It)

Problem 1: The Index Building Time

FAISS's IVF index training assumes all your vectors fit in memory. At 2.3 billion vectors, they decidedly do not. Even on our beefiest machines. The standard approach is to train on a sample, but sampling introduces accuracy loss. And when your RAG system's quality depends on retrieval precision, accuracy loss means unhappy users.

We developed a streaming training approach. Instead of loading all vectors, we train on successive batches, updating centroids incrementally. It's slower, but it scales arbitrarily. And critically, it can be parallelized across multiple machines.
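
A sketch of the idea, written as plain mini-batch k-means rather than FAISS's own trainer (the function name and its assumptions, float32 batches and a first batch large enough to seed the centroids, are mine):

```python
import numpy as np
from itertools import chain

def train_centroids_streaming(batches, n_centroids, seed=0):
    """Streaming centroid training: update centroids one batch at a time,
    never holding the full vector set in memory.  A sketch of the idea.

    `batches` is any iterable yielding float32 arrays of shape (n, dim);
    the first batch must contain at least n_centroids vectors.
    """
    rng = np.random.default_rng(seed)
    it = iter(batches)
    first = next(it)
    centroids = first[rng.choice(len(first), n_centroids, replace=False)].copy()
    counts = np.zeros(n_centroids, dtype=np.int64)

    for batch in chain([first], it):
        # Squared distances via the ||x||^2 - 2 x.c + ||c||^2 expansion.
        d2 = ((batch ** 2).sum(1, keepdims=True)
              - 2.0 * batch @ centroids.T
              + (centroids ** 2).sum(1))
        assign = d2.argmin(axis=1)
        for c in np.unique(assign):
            members = batch[assign == c]
            counts[c] += len(members)
            # Nudge the centroid toward the batch mean, with a step size
            # that shrinks as the centroid absorbs more vectors.
            step = len(members) / counts[c]
            centroids[c] += step * (members.mean(axis=0) - centroids[c])
    return centroids
```

The assignment step is embarrassingly parallel across batches, which is what makes the multi-machine version practical; the resulting centroids then seed the IVF coarse quantizer (the exact hand-off depends on your FAISS version).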

Problem 2: The Storage Architecture

At billion-scale, you can't keep everything in memory. But disk-based vector search has traditionally been slow. Really slow. The random access patterns of ANN search are the worst possible workload for spinning disks, and even NVMe SSDs struggle with the latency requirements.

We built a three-tier storage system. Hot data (frequently accessed vectors) stays in memory. Warm data lives on local NVMe. Cold data can be pushed to S3 or similar object storage. The system automatically promotes and demotes based on access patterns. For RAG workloads where the same documents get queried repeatedly, this works surprisingly well.
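
Conceptually it looks like the sketch below, where plain dicts stand in for the mmap'd NVMe files and the S3 client, and promotion is simple LRU rather than the access-pattern heuristics a real deployment needs:

```python
from collections import OrderedDict

class TieredVectorStore:
    """Illustrative three-tier store: hot (RAM), warm (local NVMe),
    cold (object storage).  The warm/cold backends here are just dicts
    standing in for real mmap'd files and an S3 client.
    """
    def __init__(self, hot_capacity, warm, cold):
        self.hot = OrderedDict()       # id -> vector, in LRU order
        self.hot_capacity = hot_capacity
        self.warm = warm               # e.g. mmap-backed file store
        self.cold = cold               # e.g. S3-backed store

    def get(self, vec_id):
        if vec_id in self.hot:                     # hit: refresh LRU position
            self.hot.move_to_end(vec_id)
            return self.hot[vec_id]
        vec = self.warm.pop(vec_id, None)
        if vec is None:
            vec = self.cold[vec_id]                # slowest path
        self._promote(vec_id, vec)
        return vec

    def _promote(self, vec_id, vec):
        self.hot[vec_id] = vec
        self.hot.move_to_end(vec_id)
        if len(self.hot) > self.hot_capacity:      # demote the coldest hot entry
            old_id, old_vec = self.hot.popitem(last=False)
            self.warm[old_id] = old_vec
```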

Problem 3: The Update Story

IVF indexes are notoriously bad at updates. Add a vector? It goes into the closest cluster. Delete a vector? Now you have gaps. Update a vector? Delete and re-add. Do this enough times and your index degrades, search quality drops, and you need to rebuild.

Enterprise RAG systems can't afford multi-hour rebuilds. Documents change constantly. New files get added. Old ones get archived. The index needs to stay fresh without requiring downtime.

Our solution: sorted inverted lists with incremental merging. Instead of the traditional unsorted inverted lists, we maintain vectors sorted by distance to centroid within each cluster. This enables efficient updates (insert in sorted order) and better search (early termination when we've seen enough good candidates). The merge operations can happen in the background, continuously, without blocking queries.
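
Here's a toy version of one such cluster, to show the two properties that matter, sorted inserts and a triangle-inequality early stop, without any of the merging, quantization, or concurrency of the real thing:

```python
import bisect
import heapq
import numpy as np

class SortedInvertedList:
    """One IVF cluster kept sorted by distance-to-centroid.
    A toy sketch of the idea, not the FAISS Extended data structure.
    """
    def __init__(self, centroid):
        self.centroid = centroid
        self.order = []            # (dist_to_centroid, id), kept sorted
        self.vectors = {}          # id -> vector

    def insert(self, vec_id, vec):
        r = float(np.linalg.norm(vec - self.centroid))
        bisect.insort(self.order, (r, vec_id))     # stays sorted, no rebuild
        self.vectors[vec_id] = vec

    def search(self, query, k):
        dq = float(np.linalg.norm(query - self.centroid))
        best = []                  # max-heap via negated distances, size <= k
        for r, vec_id in self.order:
            # Triangle inequality: d(query, x) >= r - dq.  The list is sorted
            # by r, so once this lower bound exceeds our current k-th best
            # distance, no later entry can improve the result: stop early.
            if len(best) == k and r - dq > -best[0][0]:
                break
            d = float(np.linalg.norm(query - self.vectors[vec_id]))
            if len(best) < k:
                heapq.heappush(best, (-d, vec_id))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, vec_id))
        return sorted((-neg_d, vec_id) for neg_d, vec_id in best)
```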

The Lessons That Shaped Our Architecture

Looking back at the embedding explosion of 2022, a few things stand out. These aren't just "lessons learned" platitudes. They fundamentally shaped how we built FAISS Extended and eventually MLGraph.

1. The Bottleneck Moves

When embeddings were expensive, the bottleneck was the embedding model. Everyone optimized inference. When embeddings became cheap, the bottleneck moved to storage and retrieval. Now, as scale keeps growing, the bottleneck is increasingly at the filtering and ranking stage.

Every architecture decision should ask: "Where's the bottleneck at 10x current scale? 100x?"

2. Embeddings Aren't Enough

This one took us longer to internalize. Pure vector similarity is a powerful primitive, but it's not sufficient for production search. You need filtering (search within this date range, this user's documents, this codebase). You need exact match capabilities (when the user types a function name, match it exactly). You need ranking signals beyond cosine similarity.

The best systems we've built combine vector search with traditional techniques. BM25 for lexical matching. AST parsing for structural queries. Graph traversal for relationship queries. Each has failure modes. Together, they cover each other's gaps.
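
The simplest way to combine them is rank-level fusion. Here's a sketch using reciprocal rank fusion, which only needs ranked id lists from each retriever; nothing in it is specific to FAISS or to our stack:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids (e.g. one from BM25, one from
    the vector index).  k=60 is the usual constant from the RRF literature.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# lexical_ids = ids ranked by your BM25 engine
# vector_ids  = ids ranked by the ANN index
# fused       = reciprocal_rank_fusion([lexical_ids, vector_ids])
```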

3. The Easy Part Isn't the Hard Part

Getting embeddings into a vector database? Easy. Building an ANN index? Well-documented. Running a basic similarity search? Tutorial material.
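
For reference, this is what the tutorial-material version looks like: a stock FAISS IVF index over random stand-in vectors. Everything the rest of this post worries about starts after this point.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, nlist = 1536, 256                                   # ada-002 dims, IVF cluster count
xb = np.random.random((20_000, d)).astype("float32")   # stand-in corpus vectors

quantizer = faiss.IndexFlatL2(d)                       # coarse quantizer over centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                        # everything fits in memory here
index.add(xb)

index.nprobe = 8                                       # clusters scanned per query
xq = np.random.random((1, d)).astype("float32")
distances, ids = index.search(xq, 10)                  # top-10 nearest neighbors
```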

The hard part is everything else. Handling updates without degradation. Scaling beyond what fits in memory. Filtering efficiently. Maintaining performance as usage patterns shift. Debugging why search quality dropped on Tuesday.

Where This All Led

The embedding explosion of 2022 was a wake-up call. The industry had been building vector databases for computer vision and recommendation workloads. Those are batch systems. Index once, query many times. Relatively static.

RAG changed the requirements. Documents update constantly. Queries come in real-time. Latency matters. Accuracy is table stakes. And scale? The numbers people were throwing around went from "millions" to "billions" in about six months.

FAISS Extended came out of that moment. We needed sorted inverted lists to handle updates efficiently. We needed TBB parallelism to saturate modern hardware. We needed tiered storage to handle scales that don't fit in memory. We needed quantization schemes that preserved quality for embedding dimensions that kept growing.

And when even Extended FAISS wasn't enough, when we needed distributed indexing across dozens of machines, when we needed real-time updates with strong consistency guarantees... that's when we started building MLGraph.

But that's a story for another post.

What's Next

It's now late 2025, and the explosion hasn't stopped. New embedding models keep arriving. OpenAI released text-embedding-3 with variable dimensions. Anthropic points its customers at partner models like Voyage AI for embeddings. Open-source models like E5 and GTE are competitive with commercial offerings. Dimensions keep growing. Context windows are expanding. Multimodal embeddings mean we're not just vectorizing text anymore.

The lesson from 2022 still applies: when the cost of a capability drops dramatically, demand explodes in ways nobody predicts. The infrastructure that seemed wildly over-engineered yesterday becomes table stakes tomorrow.

We're still figuring out what the next explosion looks like. But we're building to handle it. Because the next time someone calls asking if we can handle "a few billion more vectors," we'd like the answer to be something other than nervous laughter.

Building for Billion-Scale Vector Search?

FAISS Extended brings sorted inverted lists, TBB parallelism, and tiered storage to Facebook's FAISS library. Open-source and ready for production scale.