SLM Ensemble Architecture

Eight specialized language models, each trained for a specific C++ engineering task. Together, they outperform larger generalist models while using a fraction of the compute.

4B-8B
Parameters per Model
0.8-1.6B
Active Parameters (MoE)
100-200B
Training Tokens Each
NVFP4
Inference Precision

Architecture

Hybrid Layer Stack (Nemotron Nano 3 Inspired)

Each SLM uses a hybrid architecture inspired by NVIDIA Nemotron Nano 3, combining three computational paradigms (sketched in code below):

Regular Layers

Standard transformer attention for high-fidelity reasoning

Transformer Engine Layers

NVIDIA TE for efficient mixed-precision computation

Mamba 3 TE Layers

Mamba 3 state-space models for O(n) long-context processing
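
Below is a minimal PyTorch sketch of such a hybrid stack, assuming an illustrative layer pattern. The block internals are placeholders: a plain attention block stands in for both the regular and TE layers, and a toy linear-time mixer stands in for Mamba 3. None of this is the production configuration.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Standard pre-norm self-attention block (stands in for regular and TE layers)."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class SSMBlock(nn.Module):
    """Toy stand-in for a Mamba 3 TE layer: gated mixing that is O(n) in sequence length."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        h, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        # Cumulative mean over the sequence axis as a toy linear-time "state".
        denom = torch.arange(1, h.size(1) + 1, device=h.device).view(1, -1, 1)
        state = torch.cumsum(h, dim=1) / denom
        return x + self.out_proj(state * torch.sigmoid(gate))

def build_hybrid_stack(d_model: int, pattern: str = "M M A M M T " * 4) -> nn.Sequential:
    # "A" = regular attention, "T" = Transformer Engine attention (same placeholder here;
    # a real build would use transformer_engine kernels), "M" = Mamba-style SSM block.
    blocks = {"A": AttentionBlock, "T": AttentionBlock, "M": SSMBlock}
    return nn.Sequential(*(blocks[tok](d_model) for tok in pattern.split()))

x = torch.randn(2, 128, 256)              # (batch, seq, d_model)
print(build_hybrid_stack(256)(x).shape)   # torch.Size([2, 128, 256])
```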

Mixture of Experts (MoE)

Each model uses an MoE architecture with selective expert activation, as sketched below:

  • 128 routed experts per MoE layer
  • 1 shared expert for general knowledge
  • 6 experts activated per token
  • ~20% of parameters active per token (4B-8B total, 0.8B-1.6B active)
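
A minimal sketch of this routing scheme, assuming standard top-k gating with softmax weights over the selected experts and small MLP experts. The expert width (`d_ff`) and the naive per-token loop are illustrative; a real deployment uses a fused, expert-batched kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """128 routed experts, 1 always-on shared expert, top-6 selection per token."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 128, top_k: int = 6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the 6 selected experts
        routed = torch.zeros_like(x)
        for t in range(x.size(0)):               # naive loop for clarity only
            for k in range(self.top_k):
                routed[t] += weights[t, k] * self.experts[int(idx[t, k])](x[t])
        return self.shared_expert(x) + routed    # shared expert sees every token

tokens = torch.randn(4, 256)
print(MoELayer(d_model=256, d_ff=512)(tokens).shape)  # torch.Size([4, 256])
```

Only the six selected experts (plus the shared expert) run per token, which is what keeps the active parameter count at a small fraction of the stored total.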

Mixed Tokenizers with Space Converters

Each model has a partially compatible tokenizer optimized for its domain (see the converter sketch after this list):

  • Domain-specific vocabulary per model
  • Space converters between SLMs for inter-model communication
  • Direct space-to-space conversion training (bypasses tokenizers)
  • Reduces routing overhead and improves efficiency
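
A sketch of what one such space converter could look like: a small learned projection from one specialist's hidden-state space to another's, trained on paired representations of the same text so handoffs skip the detokenize/retokenize round trip. The dimensions, depth, and MSE objective here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpaceConverter(nn.Module):
    """Maps hidden states from a source SLM's representation space to a target SLM's."""
    def __init__(self, d_src: int, d_dst: int, d_hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_src),
            nn.Linear(d_src, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_dst),
        )

    def forward(self, h_src: torch.Tensor) -> torch.Tensor:
        return self.net(h_src)

# Toy training step on paired hidden states (same text encoded by both models).
converter = SpaceConverter(d_src=2048, d_dst=1536)   # hidden sizes are hypothetical
opt = torch.optim.AdamW(converter.parameters(), lr=1e-4)
h_src = torch.randn(32, 2048)    # e.g. states from the C++ SLM
h_dst = torch.randn(32, 1536)    # matching states from e.g. the CMake SLM
opt.zero_grad()
loss = nn.functional.mse_loss(converter(h_src), h_dst)
loss.backward()
opt.step()
print(float(loss))
```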

The Eight Specialists

C++ SLM

Core C++ code generation, templates, concepts, and modern C++20/23 features

Params: 8B
Active: 1.6B

CMake SLM

Build system expertise: CMakeLists.txt generation, target management, cross-compilation

Params: 6B
Active: 1.2B

Debug SLM

Error analysis, stack traces, memory debugging, UBSan/ASan output interpretation

Params: 7B
Active: 1.4B

Shell SLM

Shell scripting, command-line tools, CI/CD pipelines, Makefile expertise

Params: 5B
Active: 1.0B

Orchestration SLM

Multi-agent coordination, task routing, context management between specialists

Params: 8B
Active: 1.6B

Design SLM

Architecture patterns, SOLID principles, API design, refactoring strategies

Params: 7B
Active: 1.4B

Review SLM

Code review, best practices, security analysis, performance optimization suggestions

Params: 6B
Active: 1.2B

Algorithm SLM

Pseudocode generation, algorithm design, complexity analysis, data structure selection

Params: 7B
Active: 1.4B
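
Collected from the cards above, the per-model counts sum to roughly 54B stored and ~10.8B active parameters across the ensemble. The snippet below just tallies them; the dict is illustrative, not a real config file.

```python
# Parameter counts (in billions) taken from the specialist cards above.
SPECIALISTS = {
    "cpp":           {"params_b": 8, "active_b": 1.6},
    "cmake":         {"params_b": 6, "active_b": 1.2},
    "debug":         {"params_b": 7, "active_b": 1.4},
    "shell":         {"params_b": 5, "active_b": 1.0},
    "orchestration": {"params_b": 8, "active_b": 1.6},
    "design":        {"params_b": 7, "active_b": 1.4},
    "review":        {"params_b": 6, "active_b": 1.2},
    "algorithm":     {"params_b": 7, "active_b": 1.4},
}

total = sum(m["params_b"] for m in SPECIALISTS.values())    # 54B stored
active = sum(m["active_b"] for m in SPECIALISTS.values())   # 10.8B active
print(f"total: {total}B, active per token across all 8 models: {active:.1f}B")
```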

NVFP4 Quantization

How We Fit 8 Models on One GPU

Using NVIDIA's NVFP4 quantization (4-bit floating point) for inference, our entire 8-model ensemble fits in ~32 GB of VRAM—less than a single 70B generalist model in FP16.

Training reality: We use BF16 on our local GB10 cluster (SM 121 doesn't support NVFP4 training yet) and NVFP4 training on rented B200s (SM 100). Training uses Muon optimizer (not AdamW).
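
A back-of-the-envelope check of the ~32 GB figure, assuming NVFP4 stores 4-bit values with one FP8 scale per 16-value block; second-level per-tensor scales, any layers kept at higher precision, and runtime buffers such as the KV cache are folded into the remaining margin.

```python
# Rough VRAM estimate for the 8-model ensemble quantized to NVFP4.
TOTAL_PARAMS = 54e9                 # 8+6+7+5+8+7+6+7 billion, from the specialist cards
bits_per_param = 4 + 8 / 16         # 4-bit value + amortized 8-bit block scale (assumed layout)
weights_gb = TOTAL_PARAMS * bits_per_param / 8 / 1e9
print(f"{weights_gb:.1f} GB of weights")                           # ~30.4 GB -> ~32 GB with overhead
print(f"70B generalist model in FP16: {70e9 * 2 / 1e9:.0f} GB")    # ~140 GB for comparison
```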

3.5x
Memory Reduction vs FP16
<1%
Accuracy Loss
4x
Throughput Improvement
