Building Production SLMs with NVFP4: A Practical Guide
From research papers to running code: how we quantized 8 specialist models and deployed them on a single GPU
"The first time you load 8 models in the memory budget of 2, you'll feel like you're breaking physics. You're not—you're just getting NVFP4 right."
From Theory to Reality
Research papers make NVFP4 sound magical. "88% error reduction!" "4x throughput gains!" "Near-lossless quantization!" All true—but they don't tell you about the 3 hours I spent debugging why my calibration data was causing 15% accuracy drops, or why vLLM kept falling back to FP16 on my "supported" GPU.
This is the guide I wish existed when we started quantizing our SLM ensemble. No hand-waving. No "it's straightforward" when it's not. Just the actual workflow we used to take 7 specialist models (C++, CMake, Debug, Shell, Orchestration, Design, Review) from 94 GB of FP16 weights down to 27 GB of NVFP4, deployed on a single GB200.
You'll learn: Post-Training Quantization (the fast path), Quantization-Aware Training (when PTQ isn't enough), deployment strategies, and how to benchmark properly. Let's build something real.

Part 1: Post-Training Quantization (PTQ)
PTQ is your starting point. It's fast (minutes to hours, not days), requires no training data beyond calibration samples, and for models 7B+, it usually just works. Here's the workflow we used for our C++ SLM (8B parameters).
Step 1: Environment Setup
# Install TensorRT Model Optimizer
pip install nvidia-modelopt

# Or, for the vLLM ecosystem
pip install llmcompressor

# Verify you have a Blackwell GPU
nvidia-smi  # Should show GB200/B200/RTX 5090
⚠️ Common Gotcha #1: GPU Generation
NVFP4 works on Hopper (H100), but only in W4A16 mode (weights quantized, activations stay FP16). You lose most of the throughput benefits. For full W4A4 acceleration, you need Blackwell (SM100+).
Check your compute capability: nvidia-smi --query-gpu=compute_cap --format=csv. Need 10.0 or higher for native FP4.
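If you'd rather check from Python, for example inside a deployment script, torch exposes the same information:

# Quick capability check before attempting W4A4 (sketch; adjust the message to your setup)
import torch

major, minor = torch.cuda.get_device_capability(0)
if major >= 10:
    print(f"SM{major}{minor}: native FP4 (W4A4) supported")
else:
    print(f"SM{major}{minor}: expect W4A16 fallback (weights FP4, activations FP16)")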
Step 2: Calibration Data
Calibration is like the training montage in a movie—small effort, massive payoff. You need ~512 samples that are representative of your inference workload. Not your entire training set. Not random Wikipedia. Representative.
# Prepare a calibration dataset for the C++ SLM
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("slm-cpp-8b")

# Load your domain-specific data
# For our C++ SLM, we used a mix of:
# - Function generation prompts
# - Bug fixing scenarios
# - Refactoring requests
calib_data = load_dataset(
    "codeparrot/apps",
    split="train[:512]"  # 512 samples are sufficient
)

# Convert to model input format
def prepare_batch(examples):
    return tokenizer(
        examples["prompt"],
        return_tensors="pt",
        max_length=512,
        truncation=True
    )

calib_loader = torch.utils.data.DataLoader(
    calib_data.map(prepare_batch),
    batch_size=8
)

⚠️ Common Gotcha #2: Dataset Mismatch
I spent 3 hours debugging a 15% accuracy drop before realizing my calibration data was all short prompts (<50 tokens) but my production workload was long-context (500+ tokens). The quantization scales optimized for the wrong distribution.
Match your calibration data to your inference distribution: prompt length, complexity, domain.
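A cheap sanity check that would have saved me those 3 hours: compare the token-length distribution of your calibration set against a sample of real production prompts before you quantize. A minimal sketch, assuming the tokenizer and calib_data from above plus a production_prompts list you pull from your own logs:

# Compare calibration vs. production prompt lengths before quantizing
import numpy as np

def length_stats(texts, tokenizer):
    lengths = [len(tokenizer(t)["input_ids"]) for t in texts]
    return np.percentile(lengths, [50, 95]), np.mean(lengths)

calib_p, calib_mean = length_stats([ex["prompt"] for ex in calib_data], tokenizer)
prod_p, prod_mean = length_stats(production_prompts, tokenizer)  # sampled from your logs

print(f"Calibration: p50={calib_p[0]:.0f}, p95={calib_p[1]:.0f}, mean={calib_mean:.0f} tokens")
print(f"Production:  p50={prod_p[0]:.0f}, p95={prod_p[1]:.0f}, mean={prod_mean:.0f} tokens")
# If these differ wildly (e.g. 50 vs 500 tokens), rebuild the calibration set.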
Step 3: Quantize with TensorRT Model Optimizer
# Full PTQ workflow
import torch
from transformers import AutoModelForCausalLM
import modelopt.torch.quantization as mtq
import modelopt.torch.export as mte

# 1. Load your pretrained model
model = AutoModelForCausalLM.from_pretrained(
    "slm-cpp-8b",
    device_map="auto",
    torch_dtype=torch.float16
)

# 2. Define the calibration forward loop
def forward_loop(model):
    """Run calibration data through the model to collect activation stats"""
    model.eval()
    with torch.no_grad():
        for batch in calib_loader:
            model(**batch)  # Just a forward pass, no gradients

# 3. Quantize to NVFP4
print("Quantizing to NVFP4...")
quantized_model = mtq.quantize(
    model,
    mtq.NVFP4_DEFAULT_CFG,  # Uses dual-level scaling
    forward_loop
)

# 4. Export for deployment
mte.export_hf_checkpoint(
    quantized_model,
    export_dir="./slm-cpp-8b-nvfp4"
)
print("Quantization complete!")

That's it. Seriously. The NVFP4_DEFAULT_CFG handles the dual-level scaling (16-element blocks with E4M3 scales plus a tensor-wide FP32 scale). You don't need to tune hyperparameters unless you're chasing that last 0.1% of accuracy.
Step 4: Alternative - LLM Compressor (vLLM Ecosystem)
# One-shot quantization with LLM Compressor
# (follows the llm-compressor NVFP4 recipe pattern; check the docs for your version)
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("slm-cpp-8b", torch_dtype="auto")

# NVFP4 recipe: W4A4 on Blackwell, W4A16 fallback on Hopper
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    recipe=recipe,
    dataset=calib_data,  # same calibration set as above
    max_seq_length=512,
    num_calibration_samples=512
)
model.save_pretrained("./slm-cpp-8b-nvfp4", save_compressed=True)

# Deploy directly with vLLM
from vllm import LLM
llm = LLM(model="./slm-cpp-8b-nvfp4")
outputs = llm.generate(prompts)

LLM Compressor is faster to get started (a single oneshot() call), but TensorRT Model Optimizer gives you more control over the quantization config. We use TensorRT for critical models, LLM Compressor for rapid experimentation.
Step 5: Validate Accuracy
# Benchmark FP16 baseline vs NVFP4
from datasets import load_dataset

# For code models, use HumanEval
humaneval = load_dataset("openai_humaneval")["test"]

def evaluate_pass_at_1(model, dataset):
    correct = 0
    for problem in dataset:
        solution = model.generate(
            problem["prompt"],
            max_tokens=512
        )
        # solution_passes_tests: your sandboxed execution harness
        if solution_passes_tests(solution, problem["test"]):
            correct += 1
    return correct / len(dataset)

# Compare
fp16_acc = evaluate_pass_at_1(model_fp16, humaneval)
nvfp4_acc = evaluate_pass_at_1(model_nvfp4, humaneval)

print(f"FP16: {fp16_acc:.1%}")
print(f"NVFP4: {nvfp4_acc:.1%}")
print(f"Delta: {nvfp4_acc - fp16_acc:+.1%}")
# Expected: <1% degradation for 7B+ models

What to Expect: PTQ Accuracy
If you see >5% degradation on any model, your calibration data is probably wrong. Don't rush to QAT—fix your calibration first.

Part 2: Quantization-Aware Training (QAT)
QAT is like practicing free throws while wearing weights—it feels harder, but you're stronger when it counts. You start with a quantized model and fine-tune it, allowing the weights to adapt to the discretization error.
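Under the hood, the training loop sees fake-quantized weights in the forward pass while gradients flow through the rounding as if it were the identity (a straight-through estimator). A minimal sketch of that idea; modelopt handles this for you, so this is illustration only:

# Straight-through estimator: round in the forward pass, pass gradients through in the backward pass
import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, scale):
        # Quantize-dequantize: the model "feels" the discretization error
        return torch.round(w / scale) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Pretend the rounding was the identity function
        return grad_output, None

w = torch.randn(4, 4, requires_grad=True)
loss = FakeQuant.apply(w, 0.1).sum()
loss.backward()
print(w.grad)  # gradients flow despite the non-differentiable rounding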
When to Use QAT
✅ Use QAT When:
- • Model <7B parameters (PTQ struggles)
- • PTQ degradation >2% (accuracy-critical)
- • You have compute budget + training data
- • Deploying to production (worth the effort)
❌ Skip QAT When:
- • PTQ already gives <1% degradation
- • Model >13B parameters (PTQ sufficient)
- • No access to training data
- • Rapid prototyping (PTQ faster)
QAT Workflow
# Fine-tune the quantized model to recover accuracy
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM
import modelopt.torch.quantization as mtq
import modelopt.torch.export as mte

# 1. Start from a quantized model (same mtq.quantize() flow as Part 1,
#    this time calibrated on CMake-domain data)
model = AutoModelForCausalLM.from_pretrained(
    "slm-cmake-6b",
    device_map="auto",
    torch_dtype=torch.float16
)
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# 2. Prepare training data (much smaller than the pretraining corpus)
train_data = load_dataset("cmake-corpus", split="train[:10000]")
train_dataloader = torch.utils.data.DataLoader(train_data, batch_size=8)  # tokenized as in Part 1

# 3. Fine-tune with a LOW learning rate
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,  # Much lower than initial training (was 1e-4)
    weight_decay=0.01
)

# 4. Short training run (1-2 epochs is typically enough)
for epoch in range(2):
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step % 100 == 0:
            print(f"Epoch {epoch}, Step {step}, Loss: {loss.item():.4f}")

# 5. Export the QAT model
mte.export_hf_checkpoint(model, export_dir="./slm-cmake-6b-nvfp4-qat")

The key is the learning rate: 1e-5 to 1e-6, about 10x lower than initial training. You're not retraining the model; you're nudging weights to minimize the quantization error. Think gentle adjustment, not aggressive optimization.
Real Example: Our CMake SLM (6B params)
PTQ gave us 2.3% accuracy drop on CMake generation benchmarks. After 2 epochs of QAT (10K samples, 6 hours on GB200), accuracy fully recovered—0.1% better than FP16 baseline.
QAT Best Practices
- •Monitor validation loss closely: If it plateaus after epoch 1, stop early. No need to waste compute.
- •Use smaller batch sizes: QAT benefits from noisy gradients. We use batch size 4-8 instead of 32.
- •Freeze scales initially: First epoch, only train weights. Second epoch, unfreeze scales if needed (see the sketch after this list).
- •Don't overtrain: More epochs doesn't mean better. We've seen degradation after epoch 3.
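For the scale-freezing trick, the exact parameter names depend on the quantizer implementation. The sketch below assumes the quantizer scales appear as trainable parameters whose names contain "scale"; verify against model.named_parameters() before relying on it.

# Epoch 1: train weights only; keep quantizer scales frozen
# (assumes scale parameters are named "*scale*"; check model.named_parameters())
def set_scales_trainable(model, trainable: bool):
    for name, param in model.named_parameters():
        if "scale" in name.lower():
            param.requires_grad = trainable

set_scales_trainable(model, False)   # epoch 1: weights only
# ... train one epoch ...
set_scales_trainable(model, True)    # epoch 2, only if accuracy hasn't recovered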
Part 3: Deployment Strategies
You've quantized your models. Now what? Three deployment options, each with different tradeoffs.
Option 1: TensorRT-LLM (Maximum Performance)
# Build TensorRT engine for maximum throughput
# 1. Build engine from NVFP4 checkpoint
trtllm-build \
--checkpoint_dir ./slm-cpp-8b-nvfp4 \
--output_dir ./engines/slm-cpp-8b \
--gemm_plugin nvfp4 \
--max_batch_size 32 \
--max_input_len 2048 \
--max_output_len 512
# 2. Run inference
from tensorrt_llm import LLM
llm = LLM(engine_dir="./engines/slm-cpp-8b")
outputs = llm.generate(
    prompts=["Write a C++ function to..."],
    max_tokens=512,
    temperature=0.7
)
# Expect: 4x throughput vs FP16 on Blackwell

✅ TensorRT-LLM Pros
- • Maximum throughput (fully optimized for Blackwell)
- • Best latency (native FP4 kernels)
- • Production-grade stability
❌ TensorRT-LLM Cons
- • Longer build times (engine compilation)
- • Less ecosystem flexibility (NVIDIA-specific)
- • Harder to debug
Option 2: vLLM (Ecosystem Compatibility)
# Deploy with vLLM for easy scaling
from vllm import LLM, SamplingParams
# Load NVFP4 model
llm = LLM(
    model="./slm-cpp-8b-nvfp4",
    quantization="nvfp4",
    tensor_parallel_size=1,  # Single GPU
    gpu_memory_utilization=0.9
)

# Generate
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)
outputs = llm.generate(prompts, sampling_params)
# Note: W4A4 on SM100+, falls back to W4A16 on older GPUs

✅ vLLM Pros
- • Easy to deploy (one command)
- • Great ecosystem (integrates with everything)
- • Active development (new features constantly)
❌ vLLM Cons
- • Slightly lower throughput than TensorRT-LLM
- • NVFP4 support still maturing (as of Dec 2025)
- • Occasional multi-GPU OOM issues
Option 3: Hugging Face Transformers (Development)
# Quick prototyping with Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load NVFP4 model (may dequantize on non-Blackwell)
model = AutoModelForCausalLM.from_pretrained(
    "./slm-cpp-8b-nvfp4",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./slm-cpp-8b-nvfp4")

# Generate
prompt = "Write a C++ function to..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Good for development, not production deployment

We use TensorRT-LLM for production, vLLM for development and testing, and Transformers for quick experiments. Pick the right tool for the job.

Part 4: Our 7-Model SLM Ensemble
This is where NVFP4 goes from "cool tech" to "enables our entire architecture." We run 7 specialist models simultaneously on a single GB200. Here's the breakdown:
Memory Budget Breakdown
| Model | Parameters | FP16 Memory | NVFP4 Memory |
|---|---|---|---|
| C++ SLM | 8B | 16 GB | 4.6 GB |
| CMake SLM | 6B | 12 GB | 3.4 GB |
| Debug SLM | 7B | 14 GB | 4.0 GB |
| Shell SLM | 5B | 10 GB | 2.9 GB |
| Orchestration SLM | 8B | 16 GB | 4.6 GB |
| Design SLM | 7B | 14 GB | 4.0 GB |
| Review SLM | 6B | 12 GB | 3.4 GB |
| TOTAL | 47B | 94 GB | 26.9 GB |
GB200 has 192 GB HBM3e. We use ~27 GB for models, leaving ~165 GB for KV cache, activations, and batch processing.
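The NVFP4 column is easy to sanity-check by hand: roughly half a byte per weight, plus one E4M3 scale byte per 16-element block (about 6% overhead). A quick back-of-envelope script:

# Back-of-envelope NVFP4 memory estimate (weights only)
def nvfp4_gb(params_billion):
    weights = params_billion * 1e9 * 0.5          # 4 bits per weight
    block_scales = params_billion * 1e9 / 16 * 1  # one E4M3 byte per 16 weights
    return (weights + block_scales) / 1e9

for name, b in [("C++", 8), ("CMake", 6), ("Debug", 7), ("Shell", 5)]:
    print(f"{name} SLM ({b}B): ~{nvfp4_gb(b):.1f} GB")
# 8B -> ~4.5 GB, close to the 4.6 GB in the table (the remainder is
# higher-precision layers such as embeddings and layer norms)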
Orchestration Logic
# Load all 7 models into memory
from vllm import LLM, SamplingParams
models = {
"cpp": "./slm-cpp-8b-nvfp4",
"cmake": "./slm-cmake-6b-nvfp4-qat", # Used QAT
"debug": "./slm-debug-7b-nvfp4",
"shell": "./slm-shell-5b-nvfp4",
"orchestration": "./slm-orch-8b-nvfp4",
"design": "./slm-design-7b-nvfp4",
"review": "./slm-review-6b-nvfp4"
}
# Load all models (fits in 27 GB!)
llms = {
name: LLM(model=path, quantization="nvfp4")
for name, path in models.items()
}
# Routing function
def route_query(query, llms):
    # First, classify the task (uses the orchestration SLM)
    task_prompt = f"Classify this query: {query}\nTask type:"
    result = llms["orchestration"].generate(
        [task_prompt],
        SamplingParams(max_tokens=10, temperature=0.0)
    )
    task_type = result[0].outputs[0].text.strip()
    # Route to the specialist
    specialist = llms.get(task_type, llms["cpp"])  # Default to C++
    return specialist.generate([query], SamplingParams(max_tokens=512))

# Usage
response = route_query(
    "Debug this segfault in my FAISS index",
    llms
)
# Routes to the Debug SLM

The orchestration SLM classifies the query (C++ code generation? CMake build fix? Shell script?) and routes to the appropriate specialist. Because all models are in memory, switching is <10ms. Without NVFP4, we'd need model swapping (seconds) or 4+ GPUs.
The Mixed Tokenizer Strategy
Each of our 7 SLMs uses a tokenizer partially specialized for its domain (C++ has more symbol tokens, CMake has build-specific tokens, etc.). We train "space converters" that translate between tokenizer spaces without going through text.
NVFP4's fast inference makes real-time conversion viable. At FP16 latencies, the overhead would be unacceptable.
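The article doesn't show the converter internals, so the sketch below is deliberately generic: a small learned projection that maps hidden states from one specialist's representation space into another's. Every name, dimension, and architectural choice here is hypothetical, purely to make the idea concrete.

# Hypothetical "space converter": map hidden states from model A's space to model B's
import torch
import torch.nn as nn

class SpaceConverter(nn.Module):
    def __init__(self, dim_a: int, dim_b: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_a, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim_b),
        )

    def forward(self, h_a: torch.Tensor) -> torch.Tensor:
        return self.net(h_a)  # (batch, seq, dim_a) -> (batch, seq, dim_b)

converter = SpaceConverter(dim_a=4096, dim_b=3584)
h_a = torch.randn(1, 128, 4096)  # hidden states from specialist A
h_b = converter(h_a)             # fed into specialist B, no detour through text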
Part 5: Benchmarking & Validation
You quantized your models. You deployed them. Now: does it actually work? Here's how we benchmark.
Latency Testing
# Measure p50/p95/p99 latency
import time
import numpy as np
def benchmark_latency(model, prompts, num_runs=100):
    """Measure the latency distribution"""
    latencies = []
    for _ in range(num_runs):
        start = time.time()
        model.generate(prompts, max_tokens=128)
        latencies.append(time.time() - start)
    return {
        "p50": np.percentile(latencies, 50),
        "p95": np.percentile(latencies, 95),
        "p99": np.percentile(latencies, 99),
        "mean": np.mean(latencies)
    }

# Compare FP16 vs NVFP4
test_prompts = ["Generate C++ code for..."] * 8  # Batch of 8
fp16_lat = benchmark_latency(model_fp16, test_prompts)
nvfp4_lat = benchmark_latency(model_nvfp4, test_prompts)

print(f"FP16 p50: {fp16_lat['p50']*1000:.1f}ms")
print(f"NVFP4 p50: {nvfp4_lat['p50']*1000:.1f}ms")
print(f"Speedup: {fp16_lat['p50'] / nvfp4_lat['p50']:.2f}x")

Our C++ SLM Results (8B params, GB200): FP16 baseline vs NVFP4.
Throughput Testing
# Measure tokens/second at different batch sizes
def benchmark_throughput(model, batch_sizes=[1, 4, 8, 16, 32]):
    """Test throughput scaling with batch size"""
    results = {}
    for bs in batch_sizes:
        prompts = ["Generate C++ code..."] * bs
        start = time.time()
        outputs = model.generate(prompts, max_tokens=128)
        elapsed = time.time() - start
        # Count the total tokens generated
        tokens = sum(len(o.token_ids) for o in outputs)
        results[bs] = {
            "tokens_per_sec": tokens / elapsed,
            "latency_per_batch": elapsed
        }
    return results

# Run the benchmark
throughput_fp16 = benchmark_throughput(model_fp16)
throughput_nvfp4 = benchmark_throughput(model_nvfp4)

# Plot or print results
for bs in [1, 8, 32]:
    fp16_tps = throughput_fp16[bs]["tokens_per_sec"]
    nvfp4_tps = throughput_nvfp4[bs]["tokens_per_sec"]
    speedup = nvfp4_tps / fp16_tps
    print(f"Batch {bs}: {nvfp4_tps:.0f} tok/s (FP16: {fp16_tps:.0f}, {speedup:.2f}x)")

Accuracy Testing
# HumanEval for code generation models
from datasets import load_dataset
humaneval = load_dataset("openai_humaneval")["test"]

def evaluate_pass_at_k(model, dataset, k=1):
    """Standard HumanEval evaluation"""
    correct = 0
    total = len(dataset)
    for problem in dataset:
        # Generate k solutions
        solutions = model.generate(
            problem["prompt"],
            num_return_sequences=k,
            max_tokens=512,
            temperature=0.8  # Higher temperature for diversity
        )
        # Check if any solution passes the tests
        # (execute_and_test: your sandboxed execution harness)
        passed = any(
            execute_and_test(sol, problem["test"])
            for sol in solutions
        )
        if passed:
            correct += 1
    return correct / total

# Compare
fp16_acc = evaluate_pass_at_k(model_fp16, humaneval, k=1)
nvfp4_acc = evaluate_pass_at_k(model_nvfp4, humaneval, k=1)

print(f"FP16: {fp16_acc:.1%} pass@1")
print(f"NVFP4: {nvfp4_acc:.1%} pass@1")
print(f"Delta: {nvfp4_acc - fp16_acc:+.1%}")

Expected Results Summary
- Latency: 3-4x improvement (Blackwell), 1x (Hopper W4A16)
- Throughput: 2-4x tokens/sec depending on batch size
- Accuracy: <1% degradation for 7B+ models (PTQ)
- Memory: 3.5x reduction vs FP16, 1.8x vs FP8
Part 6: Troubleshooting Guide
Issue 1: Accuracy Degradation >5%
If your accuracy drops more than 5% after PTQ, don't panic. Your model isn't broken—your calibration data is probably wrong.
Solutions:
- • Check calibration distribution: Does it match inference workload?
- • Increase sample count: 512 → 1024 samples
- • Try different calibration methods: SmoothQuant, AWQ
- • Last resort: Use QAT (2 epochs usually recovers accuracy)
Issue 2: OOM (Out of Memory) on Multi-GPU
vLLM sometimes struggles with NVFP4 tensor parallelism. We've hit this with our ensemble.
Solutions:
- • Use smaller batch sizes: NVFP4 enables larger batches, but not infinite
- • Profile memory: torch.cuda.memory_summary() shows what's actually allocated (see the snippet after this list)
- • Try TensorRT-LLM: Better multi-GPU support for NVFP4
- • Reduce KV cache size: Lower max_seq_len if you don't need it
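Here's the memory snapshot referenced in the list above, expanded into something you can drop into a script:

# Where did the memory go? Per-GPU allocation snapshot
import torch

for i in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(i) / 1e9
    reserved = torch.cuda.memory_reserved(i) / 1e9
    total = torch.cuda.get_device_properties(i).total_memory / 1e9
    print(f"GPU {i}: {alloc:.1f} GB allocated, {reserved:.1f} GB reserved, {total:.1f} GB total")

# For the full breakdown (cache fragmentation, peak usage):
# print(torch.cuda.memory_summary())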
Issue 3: Slower Than Expected
You quantized to NVFP4 but aren't seeing the promised 4x speedup. Three common causes:
Solutions:
- • Check GPU generation: Need Blackwell (SM100) for W4A4. Hopper only does W4A16.
- • Verify quantization mode: check model.config.quantization_config (or the quantization_config block in config.json)
- • Profile bottlenecks: Is it memory-bound or compute-bound? Use nsys
- • Check batch size: NVFP4 shines at larger batches (8+)
Issue 4: Model Not Loading in vLLM
"Quantization config not found" or vLLM crashes on startup.
Solutions:
- • Check config.json: Should have a "quantization_config" field (see the snippet after this list)
- • Verify vLLM version: Need 0.8.0+ for NVFP4 support
- • Re-export with TensorRT Model Optimizer: Sometimes quantization metadata gets lost
- • Try TensorRT-LLM instead: More mature NVFP4 support
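A quick script for the first two checks (the key names inside quantization_config vary by exporter, so treat this as a sketch):

# Verify the checkpoint carries quantization metadata before blaming vLLM
import json
import vllm

with open("./slm-cpp-8b-nvfp4/config.json") as f:
    config = json.load(f)

qcfg = config.get("quantization_config")
print("quantization_config present:", qcfg is not None)
if qcfg:
    print("quant method:", qcfg.get("quant_method"))
print("vLLM version:", vllm.__version__)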
Conclusion: NVFP4 in Production
We've gone from "NVFP4 looks cool in papers" to running 7 specialized models on a single GPU in production. The journey had bumps (calibration data mishaps, vLLM OOM errors, mysterious accuracy drops), but the destination was worth it.
The NVFP4 Decision Tree
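Condensed into code, the decision logic from this guide looks roughly like this (the thresholds are the ones we used above, not universal constants):

# The NVFP4 decision tree from this guide, as a function
def choose_strategy(compute_cap: float, params_b: float, ptq_degradation_pct: float) -> str:
    if compute_cap < 10.0:
        return "Hopper or older: W4A16 only (memory savings, limited speedup)"
    if ptq_degradation_pct < 1.0:
        return "Ship the PTQ model (W4A4 on Blackwell)"
    if ptq_degradation_pct > 5.0:
        return "Suspect the calibration data; fix it and re-run PTQ before trying QAT"
    if params_b < 7 or ptq_degradation_pct > 2.0:
        return "Run QAT: 1-2 epochs, lr around 1e-5, then re-benchmark"
    return "Borderline: improve calibration first, QAT if accuracy still matters"

print(choose_strategy(compute_cap=10.0, params_b=6, ptq_degradation_pct=2.3))
# -> "Run QAT: ..."  (the CMake SLM case from Part 2)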
The honest assessment: NVFP4 requires Blackwell for full benefits. On Hopper, you get memory savings but limited speed gains (W4A16 only). But if you have Blackwell? It's transformative. Our architecture literally couldn't exist without it.
"There's something profound about the fact that discrete optimization (finding the right scales) enables continuous compression (accurate quantization). We're not just compressing models—we're learning which dimensions of the weight space actually matter for computation."
Next up: How we combined regular layers, Transformer Engine layers, and Mamba 3 TE layers into a hybrid architecture. (Spoiler: The tokenizer situation gets weird.)
Code Repository
All code examples, benchmarking scripts, and deployment configs are available in the companion GitHub repository.
References
- Introducing NVFP4 for Efficient and Accurate Low-Precision Inference (NVIDIA)
- NVFP4 Trains with Precision of 16-Bit and Speed of 4-Bit (NVIDIA)
- Pretraining Large Language Models with NVFP4 (NVIDIA Research)
- TensorRT Model Optimizer Documentation (GitHub)
- LLM Compressor: FP4 Quantization Guide (vLLM Docs)