TBB-Based Inverted Lists: Parallelism Without the Drama
You know how sometimes you're working with millions of vectors and your program is just... sitting there? Like it's taking a coffee break while you're frantically refreshing the terminal? That's the problem we're solving here.
The core insight is embarrassingly simple: FAISS's original inverted lists use OpenMP for parallelism, which works great until you try to compose it with other parallel libraries. Then you get into this weird situation where threads are fighting over locks, waiting for each other like cars at a four-way stop where everyone arrived at the same time.
The TBB Advantage
Intel's Threading Building Blocks (TBB) takes a different approach. Instead of "here are 8 threads, go wild," it says "here are tasks that might be independent, let me figure out the optimal execution." It's like having a really smart scheduler that actually understands dependencies.
Our OnDiskInvertedListsTBB replaces OpenMP primitives with TBB parallel patterns:
// Instead of #pragma omp parallel for
tbb::parallel_for(
    tbb::blocked_range<size_t>(0, n, grain_size),
    [&](const tbb::blocked_range<size_t>& range) {
        // Process range.begin() to range.end()
    }
);

Cache-Aware Grain Sizing
The magic is in the grain size calculation. We actually detect your L3 cache size from sysfs and calculate how many elements fit per thread:
size_t OnDiskInvertedListsTBB::calculate_cache_aware_grain_size(
        size_t n, size_t element_size) {
    size_t l3_size = get_l3_cache_size();
    size_t max_concurrency = tbb::this_task_arena::max_concurrency();
    // How many elements fit in L3 per thread?
    size_t cache_per_thread = l3_size / max_concurrency;
    size_t elements_per_cache = cache_per_thread / element_size;
    // Use 1/4 of cache to leave room for other data
    return elements_per_cache / 4;
}

For a typical 16MB L3 cache shared by 8 threads, that's 2MB per thread; a quarter of that is 512KB of working data, or roughly 128K floats per grain. Small enough to stay hot in cache, large enough to amortize scheduling overhead.
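The cache detection itself is nothing fancy. Here's a minimal sketch of what a get_l3_cache_size helper can look like on Linux, reading the size straight from sysfs - the exact path and the fallback value are assumptions, so check the repo for the real implementation:

#include <cstddef>
#include <fstream>
#include <string>

// Assumed helper: read the L3 size from sysfs on Linux. The path and the
// 8MB fallback are illustrative, not necessarily the library's exact code.
static size_t get_l3_cache_size() {
    std::ifstream f("/sys/devices/system/cpu/cpu0/cache/index3/size");
    std::string s;
    if (f >> s) {                     // file contains e.g. "16384K"
        return std::stoul(s) * 1024;  // stoul stops at the 'K' suffix
    }
    return 8 * 1024 * 1024;           // conservative fallback if sysfs is missing
}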
Thread-Safe File I/O: The pread/pwrite Pattern
Here's where things get interesting. The original FAISS uses a single file pointer with seek+read. That works fine with a global lock, but kills parallelism. We use POSIX pread/pwrite instead:
// Thread-safe read without moving file pointer
size_t bytes_read = data_stream_->pread(
    codes->data(),  // buffer
    1,              // element size
    codes_size,     // count
    offset          // absolute position
);

Each thread reads from its own position without coordination. No locks, no seeks, just parallel I/O. The kernel handles the complexity.
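If you're wondering what data_stream_->pread hides, it can be a thin shim over the POSIX call. A minimal sketch, assuming the stream just wraps a file descriptor (the ThreadSafeStream name and fd_ member are made up for illustration):

#include <unistd.h>  // ::pread
#include <cstddef>

// Illustrative type - the real stream class and member names may differ.
struct ThreadSafeStream {
    int fd_;  // file descriptor opened elsewhere

    // fread-style signature matching the call sites above
    size_t pread(void* buf, size_t elem_size, size_t count, size_t offset) const {
        size_t total = elem_size * count;
        size_t done = 0;
        while (done < total) {
            // ::pread takes an absolute offset and never moves the shared
            // file position, so concurrent calls need no coordination
            ssize_t r = ::pread(fd_, static_cast<char*>(buf) + done,
                                total - done, static_cast<off_t>(offset + done));
            if (r <= 0) break;  // error or end of file
            done += static_cast<size_t>(r);
        }
        return done / elem_size;  // complete elements read
    }
};

The loop matters: POSIX pread can return short reads, so the wrapper keeps going until the requested range is covered.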
Read/Write Lock Pattern
const uint8_t* OnDiskInvertedListsTBB::get_codes(size_t list_no) const {
    // Check cache first (the cache is backed by a TBB concurrent_hash_map)
    if (cache_) {
        return cache_->get_or_load_codes(list_no, [this, list_no]() {
            // The codes buffer, codes_size and offset come from this
            // list's on-disk metadata (details elided here)
            // Read lock (allows multiple readers)
            tbb::spin_rw_mutex::scoped_lock lock(
                list_rw_mutexes_[list_no]->mutex, false);
            // pread doesn't need exclusive access
            data_stream_->pread(codes.data(), 1, codes_size, offset);
            return codes;
        });
    }
    // (uncached path elided)
}

Notice the false in the scoped_lock? That's a read lock. Multiple threads can read the same list simultaneously. We only take a write lock in add_entries.
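To make the reader/writer split concrete, here's a tiny standalone sketch of the same locking pattern - the ListGuard type is invented for illustration, but add_entries follows the exclusive branch:

#include <tbb/spin_rw_mutex.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Invented for illustration: one mutex guarding one list's codes.
struct ListGuard {
    mutable tbb::spin_rw_mutex mutex;
    std::vector<uint8_t> codes;
};

void read_codes(const ListGuard& g, std::vector<uint8_t>& out) {
    // false = shared (read) lock: any number of readers may hold it at once
    tbb::spin_rw_mutex::scoped_lock lock(g.mutex, false);
    out = g.codes;
}

void append_codes(ListGuard& g, const uint8_t* data, size_t n) {
    // true = exclusive (write) lock: this is the add_entries side
    tbb::spin_rw_mutex::scoped_lock lock(g.mutex, true);
    g.codes.insert(g.codes.end(), data, data + n);
}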
The Pipeline Pattern for Search
Search is where TBB really shines. We use a three-stage pipeline:
1. Serial input stage: Generate query-list pairs in order
2. Parallel processing stage: Compute distances for each query
3. Serial output stage: Write results in order
tbb::parallel_pipeline(
    n,  // max tokens in flight
    // Stage 1: Generate queries
    tbb::make_filter<void, std::pair<size_t, idx_t>>(
        tbb::filter_mode::serial_in_order,
        [&](tbb::flow_control& fc) -> std::pair<size_t, idx_t> {
            if (*idx >= n) {
                fc.stop();
                return {0, -1};
            }
            size_t i = (*idx)++;  // advance the shared query cursor
            return std::make_pair(i, assign[i]);
        }
    ) &
    // Stage 2: Process in parallel
    tbb::make_filter<std::pair<size_t, idx_t>, SearchResult>(
        tbb::filter_mode::parallel,
        [&](std::pair<size_t, idx_t> input) -> SearchResult {
            // Compute distances, keep top-k
            // Each query runs in parallel
        }
    ) &
    // Stage 3: Write results in order
    tbb::make_filter<SearchResult, void>(
        tbb::filter_mode::serial_in_order,
        [&](const SearchResult& result) {
            // Copy to output arrays
        }
    )
);

The pipeline maintains query order (important for reproducibility) while parallelizing the expensive distance computations. And because it's a pipeline, there's natural prefetching: while stage 2 processes query N, stage 1 is already preparing query N+1.
Performance Numbers
5GB Dataset Benchmark (5 batches × 1GB)
The key insight: TBB's work-stealing scheduler keeps cores fed. OpenMP tends to create barriers where threads wait. TBB says "if your task is blocked, steal work from another queue."
When to Use This
Use TBB When:
- High query throughput (millions of queries/second)
- Composability - using FAISS within larger parallel app
- NUMA systems - TBB's affinity partitioner respects topology (see the sketch after these lists)
- Dynamic workloads - when list sizes vary wildly
Don't Use If:
- Single-threaded search - overhead isn't worth it
- Small datasets - use ArrayInvertedLists in memory
- No TBB available - falls back to regular OnDiskInvertedLists
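As promised in the NUMA bullet above, here's a minimal sketch of what "respects topology" means in practice: an affinity_partitioner remembers which thread handled which range and tries to repeat that assignment on later calls, so repeated passes over the same data stay near the core, cache, and NUMA node that already hold it. The kernel body is just a placeholder:

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/partitioner.h>
#include <cstddef>
#include <vector>

void scale_all(std::vector<float>& v, float factor) {
    // Kept across calls so TBB can replay the same range-to-thread mapping
    static tbb::affinity_partitioner ap;
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, v.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                v[i] *= factor;  // placeholder kernel
        },
        ap);
}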
Critical requirement: You must use IndexIVFTBB (not IndexIVFFlat) to avoid OpenMP/TBB conflicts. The index and inverted lists need to agree on their threading model.
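If the TBB classes mirror the constructors of their stock FAISS counterparts (an assumption - check the repository headers for the actual signatures), wiring them together looks roughly like this:

#include <faiss/IndexFlat.h>
// Assumed header names from FAISS Extended - not part of stock FAISS.
#include "IndexIVFTBB.h"
#include "OnDiskInvertedListsTBB.h"

int main() {
    int d = 128;
    size_t nlist = 4096;
    faiss::IndexFlatL2 quantizer(d);

    // Assumed to mirror IndexIVFFlat's (quantizer, d, nlist) constructor
    IndexIVFTBB index(&quantizer, d, nlist);

    // Assumed to mirror OnDiskInvertedLists(nlist, code_size, filename)
    OnDiskInvertedListsTBB invlists(nlist, index.code_size, "vectors.ivfdata");

    // The critical pairing: index and lists share the TBB threading model
    index.replace_invlists(&invlists);
    return 0;
}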
The Bottom Line
The code is production-ready, battle-tested on datasets up to 1 billion vectors. The only gotcha is the IndexIVFTBB requirement spelled out above: the index and the inverted lists need to agree on their threading model.
Think of it like this: TBB is async/await for C++. You describe tasks and dependencies, and the runtime figures out optimal execution. It's more work upfront (understanding parallel patterns) but pays dividends in composability and performance.
Ready to parallelize your vector search?
Explore FAISS Extended's TBB implementations and benchmark results on GitHub.