SLM Ensemble Architecture
8 specialized language models, each trained for specific C++ engineering tasks. Together, they outperform larger generalist models while using a fraction of the compute.
Architecture
Hybrid Layer Stack (Nemotron Nano 3 Inspired)
Each SLM uses a hybrid architecture inspired by NVIDIA Nemotron Nano 3, combining three computational paradigms (a code sketch of the layer stack follows below):
Regular Layers
Standard transformer attention for high-fidelity reasoning
Transformer Engine Layers
NVIDIA TE for efficient mixed-precision computation
Mamba 3 TE Layers
Mamba 3 state-space models for O(n) long-context processing
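The sketch below shows how such a stack can interleave attention and linear-time SSM blocks. It is illustrative only: the class names, dimensions, and the "MMAMMAMM" layer pattern are assumptions, and a plain GRU stands in for Mamba 3 / Transformer Engine kernels so the example stays self-contained.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Stand-in for a regular / Transformer Engine attention layer."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out                      # residual connection

class SSMBlock(nn.Module):
    """Stand-in for a Mamba-style state-space layer (linear in sequence length).
    A real build would use Mamba 3 kernels; a GRU keeps the sketch runnable."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, x):
        out, _ = self.rnn(self.norm(x))
        return x + out

class HybridStack(nn.Module):
    """Interleaves the two block types: 'A' = attention, 'M' = Mamba-style SSM."""
    def __init__(self, d_model=512, n_heads=8, pattern="MMAMMAMM"):
        super().__init__()
        self.layers = nn.ModuleList(
            AttentionBlock(d_model, n_heads) if c == "A" else SSMBlock(d_model)
            for c in pattern
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

tokens = torch.randn(2, 1024, 512)          # (batch, seq_len, d_model)
print(HybridStack()(tokens).shape)          # torch.Size([2, 1024, 512])
```

The design intent is that most layers run in linear time over long contexts while a few attention layers preserve the global, high-fidelity token interactions described above.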
Mixture of Experts (MoE)
Each model uses an MoE architecture with selective expert activation (see the expert-routing sketch after this list):
- 128 routed experts per MoE layer
- 1 shared expert for general knowledge
- 6 experts activated per token
- ~20% of parameters active per token (4B-8B total, 0.8B-1.6B active)
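The expert-routing sketch: 128 routed experts, one always-on shared expert, and the top 6 experts combined per token. The dimensions, the softmax-then-top-k gate, and the per-token Python loop are illustrative simplifications; production kernels batch the dispatch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNExpert(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    def __init__(self, d_model=256, d_hidden=512, n_experts=128, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)            # gating network
        self.experts = nn.ModuleList(FFNExpert(d_model, d_hidden)
                                     for _ in range(n_experts))  # routed experts
        self.shared = FFNExpert(d_model, d_hidden)             # always-active shared expert

    def forward(self, x):                              # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)       # routing probabilities
        weights, idx = gate.topk(self.top_k, dim=-1)   # keep the 6 best experts per token
        weights = weights / weights.sum(-1, keepdim=True)
        rows = []
        for t in range(x.size(0)):                     # naive per-token dispatch
            routed = sum(w * self.experts[int(e)](x[t])
                         for w, e in zip(weights[t], idx[t]))
            rows.append(routed)
        return self.shared(x) + torch.stack(rows)      # shared + routed paths

tokens = torch.randn(4, 256)
print(MoELayer()(tokens).shape)                        # torch.Size([4, 256])
```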
Mixed Tokenizers with Space Converters
Each model has its own tokenizer, optimized for its domain and partially compatible with the others (see the converter sketch after this list):
- Domain-specific vocabulary per model
- Space converters between SLMs for inter-model communication
- Direct space-to-space conversion training (bypasses tokenizers)
- Reduces routing overhead and improves efficiency
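The converter sketch: a small learned projection that maps one SLM's hidden states directly into another's representation space, so context is handed over without a detokenize/retokenize round trip. The two-layer MLP, the dimensions, and the MSE alignment objective are assumptions for illustration, not the project's actual converter.

```python
import torch
import torch.nn as nn

class SpaceConverter(nn.Module):
    """Maps hidden states from a source SLM's space into a target SLM's space."""
    def __init__(self, src_dim: int, dst_dim: int, hidden: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(src_dim),
            nn.Linear(src_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dst_dim),
        )

    def forward(self, src_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq, src_dim) -> (batch, seq, dst_dim)
        return self.proj(src_states)

# Hypothetical training step: align converted states with the states the
# target model produces for the same text (pairs collected offline).
converter = SpaceConverter(src_dim=2048, dst_dim=1536)
src = torch.randn(2, 128, 2048)      # e.g. C++ SLM hidden states
target = torch.randn(2, 128, 1536)   # e.g. CMake SLM states for the same text
loss = nn.functional.mse_loss(converter(src), target)
loss.backward()
```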
The Eight Specialists
C++ SLM
Core C++ code generation, templates, concepts, and modern C++20/23 features
CMake SLM
Build system expertise: CMakeLists.txt generation, target management, cross-compilation
Debug SLM
Error analysis, stack traces, memory debugging, UBSan/ASan output interpretation
Shell SLM
Shell scripting, command-line tools, CI/CD pipelines, Makefile expertise
Orchestration SLM
Multi-agent coordination, task routing, and context management between specialists (see the dispatch sketch after this list)
Design SLM
Architecture patterns, SOLID principles, API design, refactoring strategies
Review SLM
Code review, best practices, security analysis, performance optimization suggestions
Algorithm SLM
Pseudocode generation, algorithm design, complexity analysis, data structure selection
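The dispatch sketch referenced in the Orchestration SLM entry: the orchestrator classifies an incoming task and hands it to one specialist. The category labels and the fallback choice are assumptions; in the real system the classification is produced by the Orchestration SLM itself.

```python
from dataclasses import dataclass

# Specialist names come from this section; the category labels are illustrative.
SPECIALISTS = {
    "codegen":   "C++ SLM",
    "build":     "CMake SLM",
    "debug":     "Debug SLM",
    "shell":     "Shell SLM",
    "design":    "Design SLM",
    "review":    "Review SLM",
    "algorithm": "Algorithm SLM",
}

@dataclass
class Task:
    prompt: str
    category: str        # label produced by the Orchestration SLM

def route(task: Task) -> str:
    """Pick the specialist for a task; fall back to the core C++ SLM."""
    return SPECIALISTS.get(task.category, "C++ SLM")

print(route(Task("Why does ASan report heap-use-after-free here?", "debug")))  # Debug SLM
```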
NVFP4 Quantization
How We Fit 8 Models on One GPU
Using NVIDIA's NVFP4 quantization (4-bit floating point) for inference, our entire 8-model ensemble fits in ~32 GB of VRAM—less than a single 70B generalist model in FP16.
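A back-of-the-envelope check of the ~32 GB figure, using the 4B-8B per-model range from the MoE section. The per-model split and the 10% overhead for NVFP4 block scales are assumptions; KV cache and activations sit on top of the weight footprint.

```python
GB = 1e9

def weight_bytes(params: float, bits_per_param: float) -> float:
    """Raw weight storage for a model at a given precision."""
    return params * bits_per_param / 8

# Hypothetical per-model sizes within the quoted 4B-8B range.
ensemble = sum(weight_bytes(p * 1e9, 4) for p in (4, 4, 4, 6, 6, 8, 8, 8))
ensemble *= 1.10                                  # assumed overhead: NVFP4 block scales, buffers

baseline = weight_bytes(70e9, 16)                 # 70B generalist in FP16

print(f"8-model NVFP4 ensemble weights: {ensemble / GB:.1f} GB")  # ~26 GB
print(f"70B FP16 baseline weights:      {baseline / GB:.1f} GB")  # 140.0 GB
```

Weights alone land in the mid-20s of GB, leaving headroom inside the ~32 GB budget for KV cache and runtime state.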
Training reality: We train in BF16 on our local GB10 cluster (SM 121 doesn't support NVFP4 training yet) and in NVFP4 on rented B200s (SM 100). Training uses the Muon optimizer, not AdamW.