Circuit Breaker Pattern
Prevent cascade failures with automatic circuit breaking that isolates failing services and enables graceful recovery.

State transitions: CLOSED → OPEN → HALF_OPEN → CLOSED (a failed probe in HALF_OPEN returns the circuit to OPEN)
State Machine
CLOSED
Normal operation. All requests pass through.
- Tracks the failure count
- Resets the count on success
- Opens when the failure threshold is reached
OPEN
Fail-fast mode. Requests rejected immediately.
- No requests are sent to the failing service
- Timer counts down the open duration
- Transitions to HALF_OPEN when the timer expires
HALF_OPEN
Probe mode. Limited requests allowed.
- Allows test requests
- Success → CLOSED
- Failure → OPEN
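The transitions above can be sketched as a small state machine. This is only an illustration, not the CircuitBreaker.h implementation: the class name SketchBreaker and the explicit nowMs parameter (used instead of a real clock so the behavior is deterministic) are assumptions, and the sketch leaves out the failure time window, the probe count limit, and slow-call tracking.

```cpp
#include <cstdint>

enum class State { CLOSED, OPEN, HALF_OPEN };

// Hypothetical sketch of the CLOSED -> OPEN -> HALF_OPEN -> CLOSED cycle.
class SketchBreaker {
public:
    SketchBreaker(int failureThreshold, int openDurationMs, int successThreshold)
        : failureThreshold_(failureThreshold),
          openDurationMs_(openDurationMs),
          successThreshold_(successThreshold) {}

    // Returns true if a request may be sent at time nowMs.
    bool allowRequest(std::int64_t nowMs) {
        if (state_ == State::OPEN && nowMs - openedAtMs_ >= openDurationMs_) {
            // Open timer expired: start probing.
            state_ = State::HALF_OPEN;
            successes_ = 0;
        }
        return state_ != State::OPEN;
    }

    void recordSuccess() {
        if (state_ == State::HALF_OPEN) {
            // Enough successful probes close the circuit again.
            if (++successes_ >= successThreshold_) {
                state_ = State::CLOSED;
                failures_ = 0;
            }
        } else if (state_ == State::CLOSED) {
            failures_ = 0;  // a success resets the failure count
        }
    }

    void recordFailure(std::int64_t nowMs) {
        if (state_ == State::HALF_OPEN) {
            trip(nowMs);  // a failed probe reopens the circuit
        } else if (state_ == State::CLOSED && ++failures_ >= failureThreshold_) {
            trip(nowMs);  // threshold reached: open the circuit
        }
    }

    State state() const { return state_; }

private:
    void trip(std::int64_t nowMs) {
        state_ = State::OPEN;
        openedAtMs_ = nowMs;
        failures_ = 0;
    }

    int failureThreshold_;
    int openDurationMs_;
    int successThreshold_;
    int failures_ = 0;
    int successes_ = 0;
    std::int64_t openedAtMs_ = 0;
    State state_ = State::CLOSED;
};
```

Passing the time in explicitly keeps the sketch testable; a production breaker would more likely read a monotonic clock such as std::chrono::steady_clock and guard its state with a mutex or atomics.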
Configuration
// CircuitBreaker.h configuration
struct CircuitBreakerConfig {
    // Failure threshold to open circuit
    int failureThreshold = 5;
    // Time window for counting failures (ms)
    int failureWindowMs = 60000;
    // Time to stay open before half-open (ms)
    int openDurationMs = 30000;
    // Number of probe requests in half-open
    int halfOpenProbeCount = 3;
    // Success threshold to close from half-open
    int successThreshold = 2;
    // Optional: slow call threshold (ms)
    int slowCallThresholdMs = 5000;
    float slowCallRateThreshold = 0.5;
};
Usage Example
#include "CircuitBreaker.h"
#include <exception>

// Create circuit breaker for remote node
// (designated initializers require C++20)
CircuitBreaker nodeBreaker(CircuitBreakerConfig{
    .failureThreshold = 5,
    .openDurationMs = 30000
});

// Wrap calls with circuit breaker
Result search(const Query& query, const Node& node) {
    // Check if circuit allows request
    if (!nodeBreaker.allowRequest()) {
        return Result::circuitOpen();
    }
    try {
        auto result = node.search(query);
        nodeBreaker.recordSuccess();
        return result;
    } catch (const std::exception&) {
        // Count the failure, then let the caller handle the exception
        nodeBreaker.recordFailure();
        throw;
    }
}

// Check circuit state
auto state = nodeBreaker.getState();
// State::CLOSED, State::OPEN, or State::HALF_OPEN
MLGraph Implementation
MLGraph uses circuit breakers at multiple levels:
- Node-level: Each remote node has its own circuit breaker. If a node fails repeatedly, it's temporarily removed from routing.
- Service-level: External dependencies (storage, auth) have breakers. Protects against third-party outages.
- Query-level: Expensive queries can trigger the slow-call circuit. Prevents resource exhaustion from pathological queries.
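As a rough illustration of the node-level case, the sketch below keeps one breaker per node and filters routing candidates by whether their breaker currently allows requests. BreakerRegistry, forNode, routable, and StubBreaker are hypothetical names invented for this example; the only thing borrowed from the usage example is the allowRequest() call, and MLGraph's actual routing code may look quite different.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for the real breaker; only allowRequest() is assumed.
struct StubBreaker {
    bool open = false;
    bool allowRequest() const { return !open; }
};

// Hypothetical registry holding one breaker per node name.
template <typename Breaker>
class BreakerRegistry {
public:
    // Returns the breaker for a node, creating it on first use.
    Breaker& forNode(const std::string& name) { return breakers_[name]; }

    // Keeps only the candidates whose breaker currently allows requests.
    std::vector<std::string> routable(const std::vector<std::string>& candidates) {
        std::vector<std::string> out;
        for (const auto& name : candidates) {
            if (forNode(name).allowRequest()) {
                out.push_back(name);
            }
        }
        return out;
    }

private:
    std::unordered_map<std::string, Breaker> breakers_;
};
```

One design point to watch: if allowRequest() consumes a probe slot in HALF_OPEN, calling it during routing and again before the request would count the probe twice, so a real router would probably need a side-effect-free state check for the filtering step.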
Monitoring
Metrics Exposed
- circuit_breaker_state{name="node-1"} = 0|1|2
- circuit_breaker_failure_count{name="node-1"}
- circuit_breaker_success_count{name="node-1"}
- circuit_breaker_rejected_count{name="node-1"}
- circuit_breaker_state_transitions_total{name="node-1"}
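For illustration, the metrics listed above could be rendered in the Prometheus text exposition format roughly as follows. The BreakerSnapshot struct and renderMetrics function are hypothetical, and the numbering CLOSED = 0, OPEN = 1, HALF_OPEN = 2 is an assumption: the source only says the state gauge takes the values 0, 1, or 2. The sketch also assumes the breaker name contains no characters that need escaping in a label value.

```cpp
#include <sstream>
#include <string>

// Assumed numeric encoding of the state gauge (not confirmed by the source).
enum class State { CLOSED = 0, OPEN = 1, HALF_OPEN = 2 };

// Hypothetical point-in-time copy of one breaker's counters.
struct BreakerSnapshot {
    std::string name;
    State state;
    long failures;
    long successes;
    long rejected;
    long transitions;
};

// Renders one line per metric: metric_name{name="<breaker>"} <value>
std::string renderMetrics(const BreakerSnapshot& s) {
    const std::string label = "{name=\"" + s.name + "\"} ";
    std::ostringstream out;
    out << "circuit_breaker_state" << label << static_cast<int>(s.state) << '\n'
        << "circuit_breaker_failure_count" << label << s.failures << '\n'
        << "circuit_breaker_success_count" << label << s.successes << '\n'
        << "circuit_breaker_rejected_count" << label << s.rejected << '\n'
        << "circuit_breaker_state_transitions_total" << label << s.transitions << '\n';
    return out.str();
}
```

A sustained nonzero circuit_breaker_state or a rising circuit_breaker_rejected_count for a given name would be natural candidates for alerting.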