FAISS Extended
We didn't just fork FAISS—we taught it new tricks. TBB replaces OpenMP for proper parallelism that composes with your application. Thread-safe on-disk I/O with pread/pwrite. Concurrent caching with invalidation. Multiple merges that don't core dump. Pluggable storage backends for S3, Azure, or your NVMe. It's still Facebook's baby, just production-ready.
Why We Extended FAISS
Here's the thing about vanilla FAISS: it's brilliant, but it's also showing its age. The inverted lists are unsorted—meaning every search requires checking every vector in the selected clusters. Got a million vectors spread across 1,000 clusters? That's 1,000 vectors to check per cluster, every single query. No early termination. No mercy.
And let's talk about storage. FAISS assumes your vectors live in memory or on local disk. Want them on S3 for cost efficiency? On Azure for your enterprise stack? You're on your own. The storage layer is hardcoded, and swapping it out means surgery on code that was never designed for it.
We fixed both problems. Our sorted inverted lists maintain distance order, enabling binary search and early termination. Our pluggable storage backends let you mix and match—hot data on NVMe, cold data on object storage, all behind a unified interface. Same FAISS API, better internals.
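To make that concrete, here's roughly what a scan over a distance-sorted list looks like. This is an illustrative sketch, not our actual SortedOnDiskInvertedLists code: the names (ListEntry, scan_sorted_list) are made up, and it assumes each entry stores its precomputed distance to the list centroid so the triangle inequality can prune the scan.

// Sketch: early termination on a distance-sorted inverted list (illustrative names).
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

struct ListEntry {
    float centroid_dist;      // d(x, c), precomputed when the vector was added
    int64_t id;
    const uint8_t* code;      // encoded vector
};

// entries are kept sorted by centroid_dist; q_to_centroid = d(q, c);
// threshold = current k-th best distance for this query.
void scan_sorted_list(const std::vector<ListEntry>& entries,
                      float q_to_centroid, float threshold,
                      const std::function<void(const ListEntry&)>& visit) {
    // Triangle inequality: d(q, x) >= |d(q, c) - d(x, c)|, so only entries with
    // centroid_dist inside (d(q,c) - T, d(q,c) + T) can possibly beat the threshold.
    float lo = q_to_centroid - threshold;
    float hi = q_to_centroid + threshold;

    // Binary search to the first candidate instead of scanning from the start.
    auto first = std::lower_bound(
        entries.begin(), entries.end(), lo,
        [](const ListEntry& e, float v) { return e.centroid_dist < v; });

    for (auto it = first; it != entries.end(); ++it) {
        if (it->centroid_dist > hi) break;   // early termination: nothing further can qualify
        visit(*it);                          // full distance computation only for survivors
    }
}

Binary search finds the start of the candidate window in O(log n); the early-exit check bounds how far past it we ever scan.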
What We Built
Six major innovations that make FAISS faster, more flexible, and ready for production.
TBB-Parallel On-Disk Inverted Lists
Intel Threading Building Blocks replaces OpenMP for proper parallelism that composes with your application. Work-stealing scheduler, lock-free containers, and pipeline stages. Multiple cores actually cooperate instead of fighting over locks.
Thread-Safe pread/pwrite I/O
POSIX positional I/O eliminates file descriptor contention. Each thread reads from its own position without coordination. No locks, no seeks, just parallel I/O. The kernel handles the rest.
Pluggable Storage Backends
File, mmap, RocksDB, S3, Azure, GCS—pick your poison. Want vectors on local NVMe for speed? Done. Need cold storage on S3 for cost? Also done. The same index, different backends.
TBBCacheWithInvalidation
Concurrent caching with explicit invalidation. Thread A reads from cache while Thread B writes to disk—no stale data, no race conditions. Read-copy-update (RCU) pattern with shared_ptr for safe concurrent access.
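In sketch form, the pattern looks roughly like this (hypothetical names; the real TBBCacheWithInvalidation also handles capacity limits, eviction, and stats):

// Sketch of the read-copy-update idea behind the cache (illustrative names).
#include <tbb/concurrent_hash_map.h>
#include <cstdint>
#include <memory>
#include <vector>

using Block = std::vector<uint8_t>;

class RcuCacheSketch {
    using Map = tbb::concurrent_hash_map<size_t, std::shared_ptr<const Block>>;
    Map map_;

public:
    // Readers get a shared_ptr copy; even if the entry is invalidated a moment
    // later, the data they hold stays valid until they drop the pointer.
    std::shared_ptr<const Block> get(size_t list_no) const {
        Map::const_accessor acc;
        if (map_.find(acc, list_no)) return acc->second;
        return nullptr;   // miss: caller falls back to the storage backend
    }

    // Writers publish a brand-new immutable block; in-flight readers keep
    // seeing the old one (the "copy" in read-copy-update).
    void put(size_t list_no, std::shared_ptr<const Block> block) {
        Map::accessor acc;
        map_.insert(acc, list_no);
        acc->second = std::move(block);
    }

    // Called after the on-disk list changes, so the next get() re-reads from disk.
    void invalidate(size_t list_no) { map_.erase(list_no); }
};

The key property: invalidation never yanks memory out from under a reader; it only guarantees the next lookup misses and re-reads from disk.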
Multiple Parallel Merges
Merge multiple on-disk inverted lists simultaneously—something vanilla FAISS can't do. Batch your inserts across indices, then merge in bulk. Original FAISS core dumps; we handle errors gracefully.
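Conceptually, the merge parallelizes over list numbers, so no two tasks ever write the same destination list. A rough sketch, assuming a recent FAISS InvertedLists interface and a destination that tolerates concurrent writes to different lists (which is exactly what our per-list locking provides); error handling is elided:

// Sketch: merging several source inverted lists into one destination in parallel.
#include <faiss/invlists/InvertedLists.h>
#include <tbb/parallel_for.h>
#include <vector>

void parallel_merge(faiss::InvertedLists* dest,
                    const std::vector<const faiss::InvertedLists*>& sources) {
    tbb::parallel_for(size_t(0), dest->nlist, [&](size_t list_no) {
        for (const auto* src : sources) {
            size_t n = src->list_size(list_no);
            if (n == 0) continue;
            const uint8_t* codes = src->get_codes(list_no);
            const faiss::idx_t* ids = src->get_ids(list_no);
            dest->add_entries(list_no, n, ids, codes);  // appends to this list only
            src->release_codes(list_no, codes);
            src->release_ids(list_no, ids);
        }
    });
}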
Free Slot Reuse
Deleted vectors leave holes. We track them with block-based bitmasks and reuse them for new inserts. Less fragmentation, smaller files, happier SSDs.
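The tracking itself is simple. A stripped-down sketch with illustrative names (the real tracker also persists the bitmask alongside the list data):

// Sketch of block-based free-slot tracking (C++20 for std::countr_zero).
// One 64-bit word covers a block of 64 slots; a set bit means "free, reusable".
#include <bit>
#include <cstdint>
#include <vector>

class FreeSlotMap {
    std::vector<uint64_t> blocks_;   // bit i of blocks_[b] == slot b*64 + i is free

public:
    explicit FreeSlotMap(size_t n_slots) : blocks_((n_slots + 63) / 64, 0) {}

    // Deleting a vector just flips its bit; the bytes on disk stay where they are.
    void mark_free(size_t slot) {
        blocks_[slot / 64] |= (uint64_t(1) << (slot % 64));
    }

    // New inserts grab a hole before growing the file, keeping fragmentation down.
    bool acquire_free(size_t& slot_out) {
        for (size_t b = 0; b < blocks_.size(); ++b) {
            if (blocks_[b] == 0) continue;                 // whole block occupied
            int bit = std::countr_zero(blocks_[b]);        // lowest free slot in block
            blocks_[b] &= ~(uint64_t(1) << bit);           // claim it
            slot_out = b * 64 + bit;
            return true;
        }
        return false;   // no holes: caller appends at the end of the file
    }
};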
TBB On-Disk Inverted Lists: The Architecture
Vanilla FAISS uses OpenMP for parallelism and seek+read for disk I/O. That's fine for simple use cases, but it falls apart when you need high throughput: OpenMP doesn't compose with other threading libraries, and seek+read serializes on the file descriptor. Two threads reading different lists? They fight.
Our TBB-based implementation fixes both problems. Here's what we changed:
- POSIX pread/pwrite: Each thread reads from its own position without moving the file pointer. No locks, no contention, the kernel handles concurrent access to the same file descriptor.
- Per-list reader-writer locks: Multiple threads can read the same list simultaneously; writes are exclusive. Fine-grained locking means no global bottleneck.
- TBB parallel_pipeline: Three-stage pipeline—serial input, parallel processing, serial output. Natural prefetching without manual management.
// Thread-safe read without moving file pointer
size_t bytes_read = data_stream_->pread(
    codes->data(),  // buffer
    1,              // element size
    codes_size,     // count
    offset          // absolute position - no seek needed!
);
// Per-list reader-writer locks
tbb::spin_rw_mutex::scoped_lock lock(list_locks_[list_no], false);
// false = read lock (shared), true = write lock (exclusive)
// TBB parallel search with pipeline
tbb::parallel_pipeline(
    max_tokens,
    tbb::make_filter<void, QueryBatch>(
        tbb::filter_mode::serial_in_order,
        InputFilter(queries, batch_size)
    ) &
    tbb::make_filter<QueryBatch, ResultBatch>(
        tbb::filter_mode::parallel,
        ProcessFilter(invlists, k)  // Parallel distance computation
    ) &
    tbb::make_filter<ResultBatch, void>(
        tbb::filter_mode::serial_in_order,
        OutputFilter(results)
    )
);

The pipeline pattern is the key: while stage 2 processes query N, stage 1 is preparing query N+1 and stage 3 is writing results for query N-1. Natural prefetching without explicit management. Combined with pread for I/O, we get 2.3x higher throughput than vanilla FAISS.
Your Vectors, Your Storage
One interface, many backends. Hot data on NVMe, warm data on SSD, cold data on S3.
| Backend | Throughput | Latency | Best For |
|---|---|---|---|
| File I/O | 200 MB/s | ~0.1ms | Development |
| Memory-Mapped | 500 MB/s | ~0.05ms | Production (reads) |
| RocksDB | 180 MB/s | ~0.2ms | Updates/deletes |
| S3/Azure/GCS | 100-250 MB/s | 50-200ms* | Cloud-native |
* Cloud storage throughput is excellent (100-250 MB/s on sequential operations), but latency is the bottleneck for vector search. Each query requires multiple round-trips. That's why we cache aggressively.
// Automatic backend selection from URI
auto backend = StorageFactory::instance()
    .create_from_uri("s3://my-bucket/vectors/index.ivf");

// Or explicit configuration
StorageConfig config;
config.endpoint = "s3.amazonaws.com";
config.bucket = "my-vectors";
config.credentials = get_aws_credentials();
auto s3_backend = StorageFactory::instance()
    .create_by_type("S3", &config);

// Use with any inverted list implementation
auto invlists = std::make_unique<SortedOnDiskInvertedLists>(
    nlist, code_size, s3_backend.get()
);

The Numbers Don't Lie
5GB dataset (5 batches × 1GB), 128-dimensional vectors, 8 threads.
| Implementation | Add Time | Search QPS | Complexity |
|---|---|---|---|
| Regular FAISS (OpenMP) | 8.2s | 175,000 | O(n) |
| FAISS Extended (TBB) | 7.1s | 398,000 | O(n) |
| TBB + Cache (1GB) | 6.8s | 425,000 | O(n) |
| TBB + Pipeline | 6.5s | 450,000 | O(n) |
TBB: Because OpenMP Has Its Limits
OpenMP is great for simple parallel loops. But when you need per-list locking, buffer pools, and pipeline stages that don't step on each other's memory? That's where Intel TBB shines.
Our OnDiskInvertedListsTBB implementation uses:
- Reader-writer locks per list — Multiple threads can read the same list; writers get exclusive access.
- Buffer pools — 16 pre-allocated buffers with RAII guards. No malloc in the hot path.
- pread/pwrite — Positional I/O that doesn't fight over file position state.
- Pipeline stages — Input → Process → Output, each stage running in parallel.
The result? 2.7x faster search on small datasets. Large datasets are still being optimized—turns out, there's always another cache miss to hunt down.
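The buffer pool plus RAII guard combination is what keeps malloc out of the hot path, so it's worth sketching. Illustrative names only (BufferPool, Guard, pread_into are stand-ins, not the actual classes):

// Sketch of a fixed-size buffer pool with an RAII guard for the read path.
#include <tbb/concurrent_queue.h>
#include <cstdint>
#include <vector>

class BufferPool {
    tbb::concurrent_bounded_queue<std::vector<uint8_t>*> free_;
    std::vector<std::vector<uint8_t>> storage_;

public:
    BufferPool(size_t n_buffers, size_t buffer_bytes) : storage_(n_buffers) {
        for (auto& buf : storage_) {
            buf.resize(buffer_bytes);     // allocate once, up front
            free_.push(&buf);
        }
    }

    // Guard pops a buffer on construction and returns it on destruction,
    // so the hot path never calls malloc and never leaks a buffer.
    class Guard {
        BufferPool& pool_;
        std::vector<uint8_t>* buf_;
    public:
        explicit Guard(BufferPool& pool) : pool_(pool) {
            pool_.free_.pop(buf_);        // blocks if every buffer is in flight
        }
        ~Guard() { pool_.free_.push(buf_); }
        std::vector<uint8_t>& buffer() { return *buf_; }
    };
};

// Usage on a read path (pread_into is a hypothetical helper):
//   BufferPool pool(16, 1 << 20);
//   { BufferPool::Guard g(pool); pread_into(g.buffer()); }  // buffer returned automatically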
Ready to Make FAISS Faster?
Whether you're building a recommendation engine, semantic search, or just trying to find similar images without waiting all day—we've got you covered.