Performance Analysis

MLGraph Performance Benchmarks

Production-proven performance metrics from comprehensive testing with 1M+ vectors. Think of these numbers as the DNA of the system - they tell you what MLGraph can actually do when you push it hard.

Executive Summary

Peak Throughput:1.2M vectors/sec
Search QPS (3-node):5,000 QPS
P50 Latency:0.3 ms
Scalability:Linear

Key Finding: Parquet format delivers 12x faster loading than CSV with 2.5x smaller file size. This isn't just optimization - it's the difference between waiting minutes vs seconds.

Throughput Benchmarks

MLGraph Performance Benchmarks - Throughput Comparison

Data Format Performance

CSV:102K vec/s, 1,249 MB
JSON:50K vec/s, 2,528 MB
Parquet:1,232K vec/s, 494 MB
FAISS (baseline):4,339K vec/s, 488 MB

Parquet achieves 12x speedup over CSV while using 60% less disk space.

System Configuration

Single Server:100K vec/s
3-Node Cluster:300K vec/s
5-Node Cluster:450K vec/s
Direct FAISS:950K vec/s

3-node cluster provides 3x throughput with fault tolerance - the sweet spot for production.

Linear Scalability

MLGraph Linear Scalability - Load Time vs Dataset Size

Scalability Projections

Dataset SizeSingle Server3-Node Cluster10-Node Cluster
10M vectors100 sec33 sec12 sec
100M vectors16.7 min5.5 min2.0 min
1B vectors2.8 hours55 min20 min

Here's what linear scalability really means: add 3x the nodes, get 3x the speed. No magic, no diminishing returns - just clean, predictable scaling. The kind of behavior that makes capacity planning actually work in production.

Query Latency Analysis

MLGraph Search Latency Distribution - Single vs Batch Queries

Single Query

P50:0.3 ms
P95:0.5 ms
P99:0.6 ms
QPS:1,865

Batch Query

P50:0.15 ms
P95:0.25 ms
P99:0.35 ms
QPS:3,284

Improvement

Latency:2x faster
Throughput:76% higher
Batch size:100 queries

Always batch when you can. The numbers don't lie.

Detailed Metrics

Training Performance

10,000 vectors to 256 centroids (128 dimensions):

CSV:123.7 ms (2,069 vec/s)
JSON:192.4 ms (1,331 vec/s)
Parquet:31.9 ms (8,024 vec/s)
FAISS:24.1 ms (10,623 vec/s)

Parquet achieves 3.9x speedup over CSV for training operations.

Resource Utilization

CPU and I/O breakdown during 1M vector loading:

CSV - CPU:96%
CSV - I/O:4%
Parquet - CPU:69%
Parquet - I/O:31%

Parquet provides better balance between CPU and I/O - neither becomes a bottleneck.

MLGraph vs Direct FAISS

MetricDirect FAISSMLGraph SingleMLGraph Distributed
Load Speed950K vec/s100K vec/s300K vec/s
Search QPS2,0001,8005,000
ScalabilityLimitedLimitedLinear
Fault ToleranceNoneNoneYes
Memory LimitSingle NodeSingle NodeUnlimited
ComplexityLowLowMedium

Trade-off Analysis: You sacrifice some raw speed (3-10x slower loading vs direct FAISS) but gain linear scalability, fault tolerance, and the ability to handle datasets that won't fit in a single machine's memory. For billion-scale deployments, that's not a trade-off - it's a requirement.

Performance Tuning Tips

Data Loading

  • Always use Parquet for bulk loading - 12x faster than CSV
  • Batch vectors in groups of 10,000-100,000 for optimal throughput
  • Pre-sort vectors by similarity when possible to improve cache locality
  • Use parallel loading for multiple indices across cluster nodes

Search Optimization

  • Always batch search requests when possible - 76% throughput gain
  • Set nprobe based on accuracy requirements (16 recommended for production)
  • Implement client-side connection pooling to reduce setup overhead
  • Cache results for repeated queries - significant win for read-heavy workloads

Cluster Management

  • Start with 3 nodes, scale incrementally based on measured load
  • Monitor memory usage per centroid - aim for ~700MB per server
  • Implement health checks with 30-second intervals for early failure detection
  • Plan for 2x peak load capacity - headroom prevents cascading failures

Network Requirements

  • Minimum 1 Gbps between nodes - network becomes bottleneck below this
  • Target latency under 1 ms for optimal distributed performance
  • Use cloud placement groups to guarantee low-latency networking
  • Implement retry logic with exponential backoff for transient failures

Deployment Recommendations

Small Scale
Less than 10M vectors
Configuration:Single CentroidService
Deployment:1 server, 256 centroids
Expected Load:100K vectors/sec
Search QPS:1,800
Memory:~2 GB per million vectors

Simple deployment, adequate performance, easy maintenance. The right choice when you're just starting or your dataset fits comfortably in memory.

Medium Scale
10M - 100M vectors
Configuration:Distributed with RawCentroidClient
Deployment:3-5 servers, 256-512 centroids
Expected Load:300-450K vectors/sec
Search QPS:5,000-7,500
Memory:400-700 MB per server per million vectors

Recommended for most production deployments. Best performance/complexity tradeoff, linear scalability, direct communication. This is the sweet spot.

Large Scale
100M - 1B vectors
Configuration:Distributed with ProtoCentroidClient
Deployment:10-20 servers, 1024-2048 centroids
Expected Load:800K-1.4M vectors/sec
Search QPS:12,000-20,000
Memory:200-400 MB per server per million vectors

Built-in health checking, automatic failover, better operational stability. When your dataset won't fit in one machine, this is how you scale.

Advanced Optimizations

Multilevel Indexing

Two-level hierarchical index structure for billion-scale datasets:

Search Speed:2-4x faster
Memory Usage:30-40% reduction
Build Time:2-3x faster
Latency:0.5-1 ms (1B vectors)

The idea is simple: coarse centroids for routing, fine centroids for precision. Level 1 decides which servers to query, Level 2 finds the actual neighbors. Natural fit for MLGraph's distributed architecture.

Future Roadmap

  • GPU Acceleration: CUDA support for 10x search speedup on GPU-enabled nodes
  • Compression: Product Quantization (PQ/OPQ) for 4-8x memory reduction
  • Streaming Updates: Real-time vector insertion without rebuild cycles
  • Auto-Scaling: Dynamic node addition/removal based on load patterns