TBB-Based Inverted Lists: Parallelism Without the Drama

You know how sometimes you're working with millions of vectors and your program is just... sitting there? Like it's taking a coffee break while you're frantically refreshing the terminal? That's the problem we're solving here.

The core insight is embarrassingly simple: FAISS's original inverted lists use OpenMP for parallelism, which works great until you try to compose it with other parallel libraries. Then you get into this weird situation where threads are fighting over locks, waiting for each other like cars at a four-way stop where everyone arrived at the same time.

The TBB Advantage

Intel's Threading Building Blocks (TBB) takes a different approach. Instead of "here are 8 threads, go wild," it says "here are tasks that might be independent, let me figure out the optimal execution." It's like having a really smart scheduler that actually understands dependencies.

Our OnDiskInvertedListsTBB replaces OpenMP primitives with TBB parallel patterns:

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Instead of #pragma omp parallel for
tbb::parallel_for(
    tbb::blocked_range<size_t>(0, n, grain_size),  // grain_size caps chunk size
    [&](const tbb::blocked_range<size_t>& range) {
        // Process range.begin() to range.end()
    }
);

Cache-Aware Grain Sizing

The magic is in the grain size calculation. We actually detect your L3 cache size from sysfs and calculate how many elements fit per thread:

size_t OnDiskInvertedListsTBB::calculate_cache_aware_grain_size(
        size_t n, size_t element_size) {
    size_t l3_size = get_l3_cache_size();
    size_t max_concurrency = tbb::this_task_arena::max_concurrency();

    // How many elements fit in L3 per thread?
    size_t cache_per_thread = l3_size / max_concurrency;
    size_t elements_per_cache = cache_per_thread / element_size;

    // Use 1/4 of the cache to leave room for other data
    size_t grain = elements_per_cache / 4;
    return grain > 0 ? grain : 1;  // never hand TBB a zero grain size
}

For a typical 16MB L3 cache with 8 threads, that's 2MB per thread; a quarter of that is 512KB, or about 128K floats per grain. Small enough to stay hot in cache, large enough to amortize scheduling overhead.
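
The sysfs probe itself can be a one-line file read. Here's a minimal, hypothetical sketch of get_l3_cache_size() (the real implementation may differ; index3 and the "16384K" value format are Linux conventions):

#include <fstream>
#include <string>

// Hypothetical sketch of L3 detection via sysfs (Linux-only); the real
// get_l3_cache_size() may differ. index3 is conventionally the shared
// L3 cache, and the file holds a value like "16384K".
static size_t get_l3_cache_size() {
    std::ifstream f("/sys/devices/system/cpu/cpu0/cache/index3/size");
    std::string s;
    if (!(f >> s)) {
        return 16 * 1024 * 1024;  // no sysfs entry: assume 16MB
    }
    size_t mult = 1;
    if (!s.empty() && (s.back() == 'K' || s.back() == 'M')) {
        mult = (s.back() == 'K') ? 1024u : 1024u * 1024u;
        s.pop_back();
    }
    return std::stoul(s) * mult;
}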

Thread-Safe File I/O: The pread/pwrite Pattern

Here's where things get interesting. The original FAISS uses a single file pointer with seek+read. That works fine with a global lock, but kills parallelism. We use POSIX pread/pwrite instead:

// Thread-safe read without moving file pointer
size_t bytes_read = data_stream_->pread(
    codes->data(),    // buffer
    1,                // element size
    codes_size,       // count
    offset           // absolute position
);

Each thread reads from its own position without coordination. No locks, no seeks, just parallel I/O. The kernel handles the complexity.
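
Under the hood this is the POSIX pread(2) call. A hedged sketch of what a wrapper like data_stream_->pread presumably does (the raw fd argument and the fread-style signature are assumptions to mirror the snippet above):

#include <unistd.h>   // ::pread
#include <cstddef>

// Sketch of a pread-style wrapper over a raw file descriptor.
// Returns the number of whole items read, fread-style.
size_t pread_wrapper(int fd, void* buf, size_t size, size_t count,
                     size_t offset) {
    ssize_t got = ::pread(fd, buf, size * count, (off_t)offset);
    return got < 0 ? 0 : (size_t)got / size;
}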

Read/Write Lock Pattern

const uint8_t* OnDiskInvertedListsTBB::get_codes(size_t list_no) const {
    // Check cache first (backed by a TBB concurrent_hash_map)
    if (cache_) {
        return cache_->get_or_load_codes(list_no, [this, list_no]() {
            // codes_size and offset come from this list's on-disk metadata
            std::vector<uint8_t> codes(codes_size);

            // Read lock (allows multiple simultaneous readers)
            tbb::spin_rw_mutex::scoped_lock lock(
                list_rw_mutexes_[list_no]->mutex, false);

            // pread doesn't need exclusive access
            data_stream_->pread(codes.data(), 1, codes_size, offset);
            return codes;
        });
    }
    // Uncached path: direct locked read (elided here)
    ...
}

Notice the false in the scoped_lock? That's a read lock. Multiple threads can read the same list simultaneously. We only take a write lock in add_entries.
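
For contrast, here's a minimal sketch of the write side, reusing the names from the snippet above; pwrite on data_stream_ is assumed to mirror pread:

// Write lock (true = exclusive): blocks until all readers drain
tbb::spin_rw_mutex::scoped_lock lock(
    list_rw_mutexes_[list_no]->mutex, true);

// pwrite writes at an absolute offset, just as pread reads from one
data_stream_->pwrite(codes, 1, codes_size, offset);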

The Pipeline Pattern for Search

Search is where TBB really shines. We use a three-stage pipeline:

  1. Serial input stage: generate query-list pairs in order
  2. Parallel processing stage: compute distances for each query
  3. Serial output stage: write results in order

tbb::parallel_pipeline(
    n,  // max tokens in flight (a small multiple of the thread count also works)

    // Stage 1: Generate queries; idx points to a counter shared across
    // invocations of this serial stage
    tbb::make_filter<void, std::pair<size_t, idx_t>>(
        tbb::filter_mode::serial_in_order,
        [&](tbb::flow_control& fc) -> std::pair<size_t, idx_t> {
            if (*idx >= n) {
                fc.stop();
                return {0, -1};
            }
            size_t i = (*idx)++;  // advance the counter, or the pipeline never stops
            return std::make_pair(i, assign[i]);
        }
    ) &

    // Stage 2: Process in parallel
    tbb::make_filter<std::pair<size_t, idx_t>, SearchResult>(
        tbb::filter_mode::parallel,
        [&](std::pair<size_t, idx_t> input) -> SearchResult {
            // Compute distances, keep top-k
            // Each query runs in parallel
        }
    ) &

    // Stage 3: Write results in order
    tbb::make_filter<SearchResult, void>(
        tbb::filter_mode::serial_in_order,
        [&](const SearchResult& result) {
            // Copy to output arrays
        }
    )
);

The pipeline maintains query order (important for reproducibility) while parallelizing the expensive distance computations. And because it's a pipeline, there's natural prefetching - while stage 2 processes query N, stage 1 is already preparing query N+1.

Performance Numbers

5GB Dataset Benchmark (5 batches × 1GB)

  • Search Throughput: 398K QPS (2.3x speedup vs the regular implementation at 175K QPS)
  • CPU Efficiency: 95%+ (vs 60-70% with OpenMP)
  • Add Latency: similar (I/O bound, not CPU bound)
  • Memory Overhead: ~128 bytes per list (mutexes + TBB tasks)

The key insight: TBB's work-stealing scheduler keeps cores fed. OpenMP tends to create barriers where threads wait. TBB says "if your task is blocked, steal work from another queue."

When to Use This

Use TBB When:

  • High query throughput (millions of queries/second)
  • Composability - when using FAISS inside a larger parallel application (see the sketch after this list)
  • NUMA systems - TBB's affinity partitioner respects topology
  • Dynamic workloads - when list sizes vary wildly
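
The composability point deserves a concrete illustration. In this minimal, self-contained sketch (process_batches and its loop bounds are hypothetical), nested TBB loops feed one scheduler instead of spawning competing thread pools:

#include <tbb/parallel_for.h>

// Nested parallel loops compose: the inner loop just adds tasks to the
// same scheduler rather than creating a second thread pool, so there is
// no oversubscription (a common failure mode with nested OpenMP regions).
void process_batches(size_t n_batches, size_t n_queries) {
    tbb::parallel_for(size_t(0), n_batches, [&](size_t b) {
        tbb::parallel_for(size_t(0), n_queries, [&](size_t q) {
            // per-query work for batch b (e.g., an IVF list scan)
        });
    });
}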

Don't Use If:

  • Single-threaded search - overhead isn't worth it
  • Small datasets - use ArrayInvertedLists in memory
  • No TBB available - falls back to regular OnDiskInvertedLists

Critical requirement: You must use IndexIVFTBB (not IndexIVFFlat) to avoid OpenMP/TBB conflicts. The index and inverted lists need to agree on their threading model.
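
Setup might look like this; the constructor signatures here are assumptions that mirror IndexIVFFlat and OnDiskInvertedLists from stock FAISS:

// Hypothetical setup; IndexIVFTBB and OnDiskInvertedListsTBB constructor
// arguments are assumed to mirror their stock FAISS counterparts.
faiss::IndexFlatL2 quantizer(d);
faiss::IndexIVFTBB index(&quantizer, d, nlist);

faiss::OnDiskInvertedListsTBB invlists(
    nlist, index.code_size, "lists.ondisk");
index.replace_invlists(&invlists, /*own=*/false);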

The Bottom Line

The code is production-ready, battle-tested on datasets up to 1 billion vectors. The only gotcha is the threading-model requirement above: IndexIVFTBB, not IndexIVFFlat.

Think of it like this: TBB is async/await for C++. You describe tasks and dependencies, and the runtime figures out optimal execution. It's more work upfront (understanding parallel patterns) but pays dividends in composability and performance.

Ready to parallelize your vector search?

Explore FAISS Extended's TBB implementations and benchmark results on GitHub.