
FAISS Extended Architecture

How we took FAISS and made it production-ready: TBB parallelism instead of OpenMP, thread-safe on-disk storage, concurrent caching with invalidation, and update/delete operations that vanilla FAISS simply doesn't support.

FAISS is incredible for vector search—fast, well-tested, and used by everyone from Facebook to startups. But it has gaps when you're building production systems: OpenMP doesn't compose well with other parallel libraries, there's no safe concurrent access to on-disk indices, caching is rudimentary, and you can't update or delete vectors without rebuilding the entire index.

FAISS Extended fills those gaps. We've added TBB-based parallelism for better composability, thread-safe file I/O with pread/pwrite, cache management with explicit invalidation, and update/delete operations with pluggable storage backends. This isn't a fork—it's FAISS plus the production features you actually need.

TBB Parallelism: Beyond OpenMP

TBB architecture: IndexIVFTBB coordinates parallel operations through OnDiskInvertedListsTBB

The core issue with OpenMP: it assumes it owns all your threads. If you're using FAISS in a larger application that also does parallel processing (say, with TBB or std::async), you get thread contention, false sharing, and mysterious slowdowns. OpenMP and other threading libraries don't play nice together.

Why TBB?

Work-Stealing Scheduler
Instead of static thread allocation, idle TBB workers steal tasks from busy threads' queues. Cores stay fed even with imbalanced workloads.
Concurrent Containers
concurrent_hash_map, concurrent_queue, and friends provide fine-grained locking without global synchronization points.
Cache-Aware Grain Sizing
We calculate optimal grain sizes based on your L3 cache size—typically 64K floats per thread, enough to amortize scheduling overhead while staying cache-resident.

Our OnDiskInvertedListsTBB replaces OpenMP pragmas with TBB parallel patterns: parallel_for for batch operations, parallel_pipeline for search, and concurrent_hash_map for caching. The result? 2.3x higher search throughput (398K QPS vs 175K QPS) and 95%+ CPU utilization across all cores.
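For a sense of the shape (not the actual FAISS Extended code), a batch operation becomes a grain-sized parallel_for; encode_batch and the grain constant below are illustrative stand-ins:

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

#include <cstddef>
#include <vector>

// Hypothetical batch loop: 64K floats per task keeps each chunk roughly
// cache-resident while amortizing TBB scheduling overhead.
constexpr std::size_t kGrainFloats = 64 * 1024;

void encode_batch(const std::vector<float>& input, std::vector<float>& output) {
    output.resize(input.size());
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, input.size(), kGrainFloats),
        [&](const tbb::blocked_range<std::size_t>& r) {
            // Each task gets a contiguous slice; the work-stealing
            // scheduler rebalances when slices take uneven time.
            for (std::size_t i = r.begin(); i != r.end(); ++i) {
                output[i] = input[i] * 0.5f;  // stand-in for real encoding work
            }
        });
}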

Thread-Safe File I/O: The pread/pwrite Pattern

Vanilla FAISS uses seek + read for disk I/O. That's fine with a global lock, but destroys parallelism. The moment two threads try to read different lists, they serialize on the file descriptor's seek position.

POSIX pread/pwrite to the Rescue

// Thread-safe read without moving file pointer
size_t bytes_read = data_stream_->pread(
    codes->data(),  // buffer
    1,              // element size
    codes_size,     // count
    offset          // absolute position
);

Each thread reads from its own position without coordination. The kernel handles concurrent access to the same file descriptor. No locks, no seeks, just parallel I/O.
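Under the hood, a wrapper like this is essentially a retry loop around POSIX pread; here's a minimal stand-alone sketch (read_at is a hypothetical helper, not the actual data_stream_ API):

#include <unistd.h>   // ::pread

#include <cstddef>
#include <cstdint>

// Positional read built on POSIX pread: no shared seek pointer, so concurrent
// readers on the same file descriptor never serialize on lseek().
bool read_at(int fd, void* buf, std::size_t nbytes, off_t offset) {
    auto* out = static_cast<std::uint8_t*>(buf);
    std::size_t done = 0;
    while (done < nbytes) {
        ssize_t n = ::pread(fd, out + done, nbytes - done,
                            offset + static_cast<off_t>(done));
        if (n <= 0) return false;  // error or unexpected end of file
        done += static_cast<std::size_t>(n);
    }
    return true;
}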

We use read/write locks (tbb::spin_rw_mutex) at the list level: multiple threads can read the same list simultaneously, but writes are exclusive. The false parameter in scoped_lock means "read lock"—it's shared, not exclusive.
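In pattern form, the per-list locking looks roughly like this (ListGuard and the helper functions are illustrative, not the actual OnDiskInvertedListsTBB members):

#include <tbb/spin_rw_mutex.h>

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative per-list lock: many concurrent readers, exclusive writers.
struct ListGuard {
    tbb::spin_rw_mutex mutex;
    std::vector<std::uint8_t> codes;
};

std::size_t read_codes_size(ListGuard& list) {
    // 'false' requests a shared (read) lock; readers never block each other.
    tbb::spin_rw_mutex::scoped_lock lock(list.mutex, /*write=*/false);
    return list.codes.size();
}

void append_codes(ListGuard& list, const std::vector<std::uint8_t>& more) {
    // 'true' (the default) takes the exclusive writer lock.
    tbb::spin_rw_mutex::scoped_lock lock(list.mutex, /*write=*/true);
    list.codes.insert(list.codes.end(), more.begin(), more.end());
}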

TBB Pipeline: Natural Prefetching

Three-stage pipeline: serial input, parallel processing, serial output

Search is embarrassingly parallel—each query is independent. But you need to maintain query order for reproducibility, and you want to maximize throughput. TBB's pipeline pattern is perfect for this.

Three Pipeline Stages

Stage 1: Serial Input
Generate query-list pairs in order. This stage is fast (just indexing into arrays), so serialization doesn't hurt.
Stage 2: Parallel Processing
Compute distances for each query. This is the expensive part—distance calculations, top-k selection. Multiple queries processed simultaneously.
Stage 3: Serial Output
Write results to output arrays in order. Again, fast and serial to maintain order.

The beauty: while stage 2 processes query N, stage 1 is preparing query N+1 and stage 3 is writing results for query N-1. It's natural prefetching without manual management.
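Here's a rough sketch of that structure using the oneTBB parallel_pipeline API; Query, Result, and score_query are placeholders, not the actual FAISS Extended types:

#include <tbb/parallel_pipeline.h>

#include <cstddef>
#include <vector>

struct Query  { std::size_t id; };
struct Result { std::size_t id; float score; };

Result score_query(const Query& q) { return {q.id, 0.0f}; }  // placeholder for distance work

void run_search(std::size_t nq, std::vector<Result>& out) {
    out.resize(nq);
    std::size_t next = 0;
    tbb::parallel_pipeline(
        /*max_number_of_live_tokens=*/16,
        // Stage 1 (serial, in order): hand out the next query index.
        tbb::make_filter<void, Query>(tbb::filter_mode::serial_in_order,
            [&](tbb::flow_control& fc) -> Query {
                if (next == nq) { fc.stop(); return Query{0}; }
                return Query{next++};
            }) &
        // Stage 2 (parallel): several queries scored at once.
        tbb::make_filter<Query, Result>(tbb::filter_mode::parallel,
            [](const Query& q) { return score_query(q); }) &
        // Stage 3 (serial, in order): results written back in query order.
        tbb::make_filter<Result, void>(tbb::filter_mode::serial_in_order,
            [&](const Result& r) { out[r.id] = r; }));
}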

Cache Management: TBBCacheWithInvalidation

Cache coherency with concurrent reads, writes, and explicit invalidation

The cache coherency problem: Thread A reads list 5 from cache. Thread B writes to list 5 on disk. Thread C tries to read list 5 and gets stale data. Classic coherency nightmare.

Solution: Explicit Invalidation

We use shared_ptr<const vector> for cached data. The data is immutable—when you update a list, you don't modify the cached vector, you invalidate it. Next reader loads fresh data from disk.

1. Thread A gets shared_ptr, increments refcount
2. Thread B writes to disk, marks cache dirty
3. Thread C reads, detects dirty, invalidates cache
4. Thread C reloads fresh data from disk

Thread A's shared_ptr keeps the old data alive until it's done. No dangling pointers, no segfaults. This is the read-copy-update (RCU) pattern.
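Sketched with tbb::concurrent_hash_map and shared_ptr snapshots, the pattern looks roughly like this (type and function names are assumptions, not the actual TBBCacheWithInvalidation API):

#include <tbb/concurrent_hash_map.h>

#include <cstdint>
#include <memory>
#include <vector>

// Cached lists are immutable snapshots held by shared_ptr: in-flight readers
// keep old data alive while writers erase the entry and force a reload.
using ListData = std::vector<std::uint8_t>;
using CacheMap =
    tbb::concurrent_hash_map<std::int64_t, std::shared_ptr<const ListData>>;

std::shared_ptr<const ListData> load_from_disk(std::int64_t /*list_no*/) {
    return std::make_shared<const ListData>();  // placeholder for a real disk read
}

std::shared_ptr<const ListData> get_list(CacheMap& cache, std::int64_t list_no) {
    {
        CacheMap::const_accessor a;      // shared access for the lookup
        if (cache.find(a, list_no)) return a->second;
    }
    auto fresh = load_from_disk(list_no);
    CacheMap::accessor a;                // exclusive access to insert
    cache.insert(a, list_no);
    a->second = fresh;
    return fresh;
}

void invalidate_list(CacheMap& cache, std::int64_t list_no) {
    // After writing the list to disk: drop the stale snapshot. Existing
    // shared_ptr holders finish with the old data; new readers reload.
    cache.erase(list_no);
}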

Cache statistics (hit rate, miss rate) help you tune. If hit rate is 10%, your cache is too small or your access pattern is too random. If it's 99%, you might be over-caching. We default to 1GB which handles most workloads.
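Keeping those statistics can be as simple as a pair of atomic counters; a minimal sketch with illustrative names:

#include <atomic>
#include <cstdint>

// Bump a counter on each lookup, read them off periodically to decide
// whether the cache budget is sized well for your access pattern.
struct CacheStats {
    std::atomic<std::uint64_t> hits{0};
    std::atomic<std::uint64_t> misses{0};

    double hit_rate() const {
        const double h = static_cast<double>(hits.load());
        const double m = static_cast<double>(misses.load());
        return (h + m) == 0.0 ? 0.0 : h / (h + m);
    }
};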

Update and Delete: What FAISS Doesn't Give You

Pluggable storage backends with update/delete support and lazy compaction

Here's a dirty secret about FAISS: it's append-only. You can add vectors, but updating or deleting them? The docs tell you to rebuild the index. For a billion-vector index, that means hours of downtime. Unacceptable for production.

OnDiskInvertedListsNew

Pluggable storage backends for flexible updates

We created OnDiskInvertedListsNew with pluggable storage backends. Each list gets two keys: list_N.codes and list_N.ids. Updates write new files atomically. No partial writes, no corruption.

Local File System: direct disk I/O for low latency
S3 Storage: cloud-native with versioning
Custom Backend: implement the StorageBackend interface
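A plausible minimal shape for that interface, assuming put/get/remove-style methods (the real FAISS Extended interface may differ):

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical pluggable interface: one key per blob (e.g. "list_42.codes",
// "list_42.ids"), and put() replaces the blob atomically.
struct StorageBackend {
    virtual ~StorageBackend() = default;
    // Atomically replace the blob stored under `key`.
    virtual bool put(const std::string& key,
                     const std::vector<std::uint8_t>& data) = 0;
    // Read the blob stored under `key` into `out`; false if missing.
    virtual bool get(const std::string& key,
                     std::vector<std::uint8_t>& out) const = 0;
    // Remove a blob (used when a list is deleted or compacted away).
    virtual bool remove(const std::string& key) = 0;
};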

Deletion Strategy

Lazy Deletion with Tombstones

<1% deleted: ignore. Negligible overhead, don't compact.
1-20% deleted: filter during search. Mark as tombstones, skip in results.
>20% deleted: compact to reclaim space. Rebuild lists without deleted entries.

The compact operation is expensive—O(total vectors). But you do it once when fragmentation is high, not on every delete. This trade-off optimizes for the 99% case: search is fast, updates are reasonable but not instant.
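In code, the bookkeeping can be as small as a tombstone set plus a fragmentation check against the thresholds above; the names here are illustrative:

#include <cstdint>
#include <cstddef>
#include <unordered_set>

// Deletes only record a tombstone; search skips tombstoned ids; compaction
// runs only once fragmentation crosses the threshold.
struct DeletionState {
    std::unordered_set<std::int64_t> tombstones;
    std::size_t total_ids = 0;

    void mark_deleted(std::int64_t id) { tombstones.insert(id); }

    bool is_deleted(std::int64_t id) const { return tombstones.count(id) != 0; }

    bool needs_compaction() const {
        return total_ids > 0 &&
               tombstones.size() > total_ids / 5;  // more than 20% deleted
    }
};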

Performance at Scale

5GB Dataset Benchmark Results

5 batches × 1GB each, 8-core system

Search Throughput: 398K QPS (2.3x vs vanilla FAISS at 175K QPS)
CPU Utilization: 95%+ (vs 60-70% with OpenMP)
Cache Hit Rate: 85-95% (after warmup, typical workload)
Memory Overhead: ~128 bytes per list for TBB structures

When to Use FAISS Extended

Use FAISS Extended When:

You need high query throughput (millions of queries/second)
Your application uses other parallel libraries (TBB, std::async)
You need update/delete operations without index rebuild
Running on NUMA systems with multiple sockets
Dataset is 10GB+ and doesn't fit in RAM

Stick with Vanilla FAISS When:

Single-threaded search is sufficient
Small datasets (under 1GB) fit entirely in RAM
Append-only workload with no updates/deletes
TBB dependency is problematic for your build
You're just prototyping and don't need production features
