Test Data Generation API

Generate synthetic vectors, training data, and ground-truth queries via API for automated testing and benchmarking.

Figure: Test Data Generation Pipeline (test data generation workflow with distribution options)

API Endpoints

POST /api/test-data/vectors

Generate random vectors drawn from a configurable distribution.

{
  "count": 100000,
  "dimensions": 128,
  "distribution": "gaussian",
  "normalize": true,
  "seed": 42
}
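
To make the roles of distribution, normalize, and seed concrete, here is a minimal local sketch of what a request like the one above might produce. It is not the service's implementation: the PRNG (mulberry32), the Box-Muller transform, and the function names are assumptions chosen for illustration. The point it shows is that a fixed seed yields the same vectors on every run, and that normalize scales each vector to unit length.

```javascript
// Small seeded PRNG (mulberry32); returns floats in [0, 1).
// Assumption: the real service may use a different generator.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical local equivalent of a "gaussian" request.
function gaussianVectors({ count, dimensions, normalize = false, seed = 0 }) {
  const rand = mulberry32(seed);
  const vectors = [];
  for (let i = 0; i < count; i++) {
    const v = new Float32Array(dimensions);
    for (let d = 0; d < dimensions; d++) {
      // Box-Muller: turn two uniform samples into one standard normal sample.
      const u1 = 1 - rand(); // in (0, 1], avoids log(0)
      const u2 = rand();
      v[d] = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
    }
    if (normalize) {
      // Scale to unit L2 norm.
      let norm = 0;
      for (let d = 0; d < dimensions; d++) norm += v[d] * v[d];
      norm = Math.sqrt(norm) || 1;
      for (let d = 0; d < dimensions; d++) v[d] /= norm;
    }
    vectors.push(v);
  }
  return vectors;
}
```

Calling gaussianVectors({ count: 3, dimensions: 8, normalize: true, seed: 42 }) twice returns identical arrays, which is the property that makes seeded test data useful for regression benchmarks.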

POST /api/test-data/clustered

Generate vectors clustered around centroids.

{
  "count": 100000,
  "dimensions": 128,
  "numClusters": 64,
  "clusterSpread": 0.1,
  "clusterSizeVariance": 0.3
}
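
As a rough illustration of how numClusters and clusterSpread could shape the output, the sketch below picks centroids at random and scatters points around them with Gaussian noise scaled by clusterSpread. The centroid range, the uniform cluster choice, and the function names are assumptions; clusterSizeVariance is deliberately not modeled here, so every cluster is equally likely.

```javascript
// Standard normal sample via Box-Muller, using the supplied uniform source.
function gaussian(rand) {
  const u1 = 1 - rand(); // in (0, 1], avoids log(0)
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * rand());
}

// Hypothetical local equivalent of a "clustered" request.
function clusteredVectors(
  { count, dimensions, numClusters, clusterSpread },
  rand = Math.random
) {
  // Assumption: centroids are drawn uniformly from [-1, 1) per dimension.
  const centroids = Array.from({ length: numClusters }, () =>
    Float32Array.from({ length: dimensions }, () => rand() * 2 - 1)
  );
  const vectors = [];
  const labels = [];
  for (let i = 0; i < count; i++) {
    // Assumption: each point picks its cluster uniformly at random.
    const c = Math.floor(rand() * numClusters);
    const v = new Float32Array(dimensions);
    for (let d = 0; d < dimensions; d++) {
      v[d] = centroids[c][d] + clusterSpread * gaussian(rand);
    }
    vectors.push(v);
    labels.push(c);
  }
  return { centroids, vectors, labels };
}
```

Smaller clusterSpread values give tighter clusters; with a spread of 0 every vector coincides with its centroid.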

POST /api/test-data/ground-truth

Generate queries with known nearest neighbors for evaluation.

{
  "indexName": "benchmark-index",
  "queryCount": 1000,
  "k": 100,
  "sampleFrom": "index"  // or "new"
}
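
Ground truth here means the exact k nearest neighbors of each query, which can always be obtained by exhaustive search. The sketch below shows that computation on in-memory arrays. It assumes squared Euclidean distance; the metric the endpoint actually uses is not stated in this section, and the function names are illustrative only.

```javascript
// Squared Euclidean distance between two equal-length vectors.
function squaredDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return sum;
}

// Exact k nearest neighbors of one query, by scanning every vector.
// Returns vector indices ordered from nearest to farthest.
function exactNeighbors(query, vectors, k) {
  return vectors
    .map((v, id) => ({ id, dist: squaredDistance(query, v) }))
    .sort((x, y) => x.dist - y.dist || x.id - y.id) // ties broken by id
    .slice(0, k)
    .map((n) => n.id);
}

// Ground truth for a whole query set: one neighbor list per query.
function buildGroundTruth(queries, vectors, k) {
  return queries.map((q) => exactNeighbors(q, vectors, k));
}
```

This costs one distance computation per query-vector pair, so it is practical only for modest sizes; that cost is the reason to precompute ground truth once and reuse it across benchmark runs.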

POST /api/test-data/benchmark-suite

Generate a complete benchmark suite with vectors, queries, and ground truth.

{
  "name": "recall-benchmark",
  "vectorCount": 1000000,
  "queryCount": 10000,
  "dimensions": 256,
  "distribution": "clustered",
  "k": [1, 10, 100]
}
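
A suite that carries several k values is typically scored as recall@k for each one. The following sketch shows one common way to compute that from ground-truth neighbor lists and the lists returned by the index under test. The data shapes (arrays of id arrays, one per query) and the function names are assumptions, not the suite's documented output format.

```javascript
// Fraction of true top-k neighbors that appear in the returned top-k,
// averaged over all queries.
function recallAtK(groundTruth, results, k) {
  let hits = 0;
  let total = 0;
  for (let q = 0; q < groundTruth.length; q++) {
    const truth = new Set(groundTruth[q].slice(0, k));
    for (const id of results[q].slice(0, k)) {
      if (truth.has(id)) hits++;
    }
    total += truth.size;
  }
  return total === 0 ? 0 : hits / total;
}

// One recall figure per requested k, e.g. for k = [1, 10, 100].
function recallReport(groundTruth, results, ks) {
  return Object.fromEntries(
    ks.map((k) => [`recall@${k}`, recallAtK(groundTruth, results, k)])
  );
}
```

For example, recallReport(groundTruth, results, [1, 10, 100]) returns an object with the keys recall@1, recall@10, and recall@100.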

Full Example

// Generate test vectors
const vectorsResponse = await fetch('/api/test-data/vectors', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    count: 100000,
    dimensions: 128,
    distribution: 'clustered',
    numClusters: 256,
    clusterSpread: 0.05,
    normalize: true,
    seed: 12345,
    format: 'stream'  // Stream to avoid memory issues
  })
});

// Response is a ReadableStream for large datasets
const reader = vectorsResponse.body.getReader();

// Alternatively, request a download URL instead of a streamed body
const { downloadUrl } = await fetch('/api/test-data/vectors', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    count: 1000000,
    dimensions: 256,
    format: 'parquet',
    output: 'url'
  })
}).then(r => r.json());

// Download via URL
const file = await fetch(downloadUrl);
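
The reader obtained from the streamed response above still has to be drained chunk by chunk. The wire format of the stream is not specified in this section, so the sketch below assumes newline-delimited JSON with one vector array per line; adjust the parsing if the service emits something else. The helper names are hypothetical.

```javascript
// Incremental parser for newline-delimited JSON (assumed stream format).
// Chunks may end mid-line, so the unfinished tail is kept in a buffer.
function createLineParser(onVector) {
  const decoder = new TextDecoder();
  let buffer = '';
  return {
    push(chunk) {
      buffer += decoder.decode(chunk, { stream: true });
      const lines = buffer.split('\n');
      buffer = lines.pop(); // last piece may be an incomplete line
      for (const line of lines) {
        if (line.trim()) onVector(JSON.parse(line));
      }
    },
    end() {
      buffer += decoder.decode(); // flush any buffered bytes
      if (buffer.trim()) onVector(JSON.parse(buffer));
      buffer = '';
    },
  };
}

// Drain a ReadableStream reader, invoking onVector once per parsed line.
async function consumeStream(reader, onVector) {
  const parser = createLineParser(onVector);
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    parser.push(value);
  }
  parser.end();
}
```

With the reader from the example, await consumeStream(reader, (v) => handle(v)) processes vectors as they arrive without holding the whole dataset in memory; handle stands in for whatever the caller does with each vector.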

Distribution Parameters

Distribution   Parameters            Use Case
uniform        min, max              Baseline testing
gaussian       mean, stddev          Embedding simulation
clustered      numClusters, spread   IVF benchmarking
zipf           alpha                 Realistic access patterns
adversarial    pattern               Edge case testing

Streaming Ingest

For large datasets, generate and ingest directly without intermediate storage:

// Generate and stream directly to index
POST /api/test-data/generate-and-ingest
{
  "indexName": "benchmark-vectors",
  "count": 10000000,
  "dimensions": 128,
  "distribution": "clustered",
  "batchSize": 50000,
  "progressWebhook": "https://example.com/progress"
}

// Response
{
  "jobId": "gen-123",
  "status": "running",
  "progress": {
    "generated": 0,
    "ingested": 0,
    "total": 10000000
  }
}

// Poll for status or use webhook
GET /api/jobs/gen-123
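
Polling the job endpoint can be wrapped in a small helper. This is a sketch under stated assumptions: only the "running" status appears in the response above, so any other status is treated as terminal, and the interval, attempt limit, and injectable fetchImpl parameter are illustrative choices rather than part of the API.

```javascript
// Poll GET /api/jobs/:jobId until the job is no longer "running".
// Assumption: any status other than "running" means the job has finished
// (successfully or not); the set of terminal statuses is not documented here.
async function waitForJob(
  jobId,
  { fetchImpl = fetch, intervalMs = 2000, maxAttempts = 300 } = {}
) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetchImpl(`/api/jobs/${jobId}`);
    if (!res.ok) {
      throw new Error(`Status check failed: HTTP ${res.status}`);
    }
    const job = await res.json();
    if (job.status !== 'running') return job;
    // Wait before the next check to avoid hammering the server.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} still running after ${maxAttempts} checks`);
}
```

A call such as await waitForJob('gen-123') resolves with the last job document returned by the server. For very long jobs the progressWebhook option avoids polling altogether.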