Circuit Breaker Pattern

Prevent cascading failures with automatic circuit breaking, which isolates failing services and lets them recover gracefully.

Circuit Breaker State Machine

State transitions: CLOSED → OPEN → HALF_OPEN → CLOSED. A failed probe in HALF_OPEN returns the circuit to OPEN.

CLOSED

Normal operation. All requests pass through.

  • Tracks the failure count
  • Resets the count on success
  • Opens when the count reaches failureThreshold

OPEN

Fail-fast mode. Requests rejected immediately.

  • No requests are sent to the failing service
  • A timer counts down openDurationMs
  • Transitions to HALF_OPEN when the timer expires

HALF_OPEN

Probe mode. Limited requests allowed.

  • Allows a limited number of test requests (halfOpenProbeCount)
  • Success → CLOSED (once successThreshold probes succeed)
  • Failure → OPEN
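
The three states above can be sketched as a small class. This is an illustrative sketch, not the CircuitBreaker.h implementation: the name SketchBreaker is hypothetical, time is passed in explicitly to keep the transitions easy to follow, and the sketch omits thread safety, the failure window, and the cap on probe requests.

```cpp
#include <chrono>

// Hypothetical sketch of the CLOSED / OPEN / HALF_OPEN transitions.
enum class State { CLOSED, OPEN, HALF_OPEN };

class SketchBreaker {
public:
  using Clock = std::chrono::steady_clock;

  SketchBreaker(int failureThreshold, int successThreshold,
                std::chrono::milliseconds openDuration)
      : failureThreshold_(failureThreshold),
        successThreshold_(successThreshold),
        openDuration_(openDuration) {}

  // OPEN rejects requests until the open timer expires, then starts probing.
  bool allowRequest(Clock::time_point now) {
    if (state_ == State::OPEN && now - openedAt_ >= openDuration_) {
      state_ = State::HALF_OPEN;
      successes_ = 0;
    }
    return state_ != State::OPEN;
  }

  void recordSuccess() {
    if (state_ == State::HALF_OPEN) {
      // Enough successful probes close the circuit again.
      if (++successes_ >= successThreshold_) {
        state_ = State::CLOSED;
        failures_ = 0;
      }
    } else if (state_ == State::CLOSED) {
      failures_ = 0;  // a success resets the failure count
    }
  }

  void recordFailure(Clock::time_point now) {
    // Any failed probe reopens; in CLOSED the threshold must be reached.
    if (state_ == State::HALF_OPEN ||
        (state_ == State::CLOSED && ++failures_ >= failureThreshold_)) {
      state_ = State::OPEN;
      openedAt_ = now;
    }
  }

  State state() const { return state_; }

private:
  int failureThreshold_;
  int successThreshold_;
  std::chrono::milliseconds openDuration_;
  State state_ = State::CLOSED;
  int failures_ = 0;
  int successes_ = 0;
  Clock::time_point openedAt_{};
};
```

Passing the clock reading as an argument is a design choice made for this sketch only; a production breaker would more likely read the clock internally.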

Configuration

// CircuitBreaker.h configuration
struct CircuitBreakerConfig {
  // Failure threshold to open circuit
  int failureThreshold = 5;

  // Time window for counting failures (ms)
  int failureWindowMs = 60000;

  // Time to stay open before half-open (ms)
  int openDurationMs = 30000;

  // Number of probe requests in half-open
  int halfOpenProbeCount = 3;

  // Success threshold to close from half-open
  int successThreshold = 2;

  // Optional: calls slower than this count as slow (ms)
  int slowCallThresholdMs = 5000;
  // Optional: slow-call rate threshold (fraction of calls, 0.0 to 1.0)
  float slowCallRateThreshold = 0.5f;
};
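
To illustrate how failureThreshold and failureWindowMs might work together, here is a minimal sliding-window counter. It is a sketch under assumptions: FailureWindow is a hypothetical helper, timestamps are plain millisecond values supplied by the caller, and the source does not say how the window interacts with the reset-on-success rule.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical helper: counts failures that fall inside a sliding window.
class FailureWindow {
public:
  FailureWindow(int threshold, std::int64_t windowMs)
      : threshold_(threshold), windowMs_(windowMs) {}

  // Records a failure at nowMs; returns true once the threshold is reached.
  bool recordFailure(std::int64_t nowMs) {
    evictExpired(nowMs);
    timestamps_.push_back(nowMs);
    return static_cast<int>(timestamps_.size()) >= threshold_;
  }

  // Number of failures still inside the window at nowMs.
  int count(std::int64_t nowMs) {
    evictExpired(nowMs);
    return static_cast<int>(timestamps_.size());
  }

private:
  // Drops failures that are at least windowMs old.
  void evictExpired(std::int64_t nowMs) {
    while (!timestamps_.empty() &&
           nowMs - timestamps_.front() >= windowMs_) {
      timestamps_.pop_front();
    }
  }

  int threshold_;
  std::int64_t windowMs_;
  std::deque<std::int64_t> timestamps_;
};
```

With the defaults shown above, such a counter would report true on the fifth failure recorded within any 60-second span.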

Usage Example

#include <exception>
#include "CircuitBreaker.h"

// Create circuit breaker for remote node
// (designated initializers require C++20)
CircuitBreaker nodeBreaker(CircuitBreakerConfig{
  .failureThreshold = 5,
  .openDurationMs = 30000
});

// Wrap calls with circuit breaker
Result search(const Query& query, const Node& node) {
  // Check if circuit allows request
  if (!nodeBreaker.allowRequest()) {
    return Result::circuitOpen();
  }

  try {
    auto result = node.search(query);
    nodeBreaker.recordSuccess();
    return result;
  } catch (const std::exception&) {
    // Count the failure, then rethrow so the caller still sees the error
    nodeBreaker.recordFailure();
    throw;
  }
}

// Check circuit state
auto state = nodeBreaker.getState();
// State::CLOSED, State::OPEN, or State::HALF_OPEN
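
The allow/record pattern in search() can be factored into a reusable wrapper. The following is only a sketch: callGuarded is a hypothetical helper, not part of CircuitBreaker.h, and it assumes a breaker type that exposes allowRequest(), recordSuccess(), and recordFailure() as in the example above. It does not handle callables that return void.

```cpp
#include <optional>
#include <utility>

// Hypothetical helper: runs fn only if the breaker admits the request.
// Returns std::nullopt when the circuit rejects the call.
template <typename Breaker, typename Fn>
auto callGuarded(Breaker& breaker, Fn&& fn)
    -> std::optional<decltype(fn())> {
  if (!breaker.allowRequest()) {
    return std::nullopt;  // fail fast, nothing is sent
  }
  try {
    auto result = std::forward<Fn>(fn)();
    breaker.recordSuccess();
    return result;
  } catch (...) {
    // Record every failure type, then let the caller see the error.
    breaker.recordFailure();
    throw;
  }
}
```

Unlike the example above, this sketch uses catch (...) so that exceptions not derived from std::exception also count as failures; whether that is desirable depends on the service.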

MLGraph Implementation

MLGraph uses circuit breakers at multiple levels:

  • Node-level: Each remote node has its own circuit breaker. If a node fails repeatedly, it is temporarily removed from routing.
  • Service-level: External dependencies (storage, auth) have their own breakers, which protects against third-party outages.
  • Query-level: Expensive queries can trip the slow-call circuit, which prevents resource exhaustion from pathological queries.
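
As an illustration of the node-level case, the sketch below keeps one breaker per node and filters routing candidates by breaker state. It is not MLGraph's actual routing code: NodeRouter and NodeBreakerView are hypothetical names, and the stand-in breaker only counts consecutive failures, leaving out the timer-based recovery described earlier.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a per-node breaker (no timer, no half-open).
struct NodeBreakerView {
  int consecutiveFailures = 0;
  int failureThreshold = 5;
  bool allowRequest() const {
    return consecutiveFailures < failureThreshold;
  }
};

// Hypothetical router that owns one breaker per node id.
class NodeRouter {
public:
  void recordFailure(const std::string& nodeId) {
    ++breakers_[nodeId].consecutiveFailures;
  }

  void recordSuccess(const std::string& nodeId) {
    breakers_[nodeId].consecutiveFailures = 0;
  }

  // Returns only the candidates whose breaker currently admits requests.
  std::vector<std::string> routable(
      const std::vector<std::string>& candidates) {
    std::vector<std::string> out;
    for (const auto& id : candidates) {
      if (breakers_[id].allowRequest()) {
        out.push_back(id);
      }
    }
    return out;
  }

private:
  std::unordered_map<std::string, NodeBreakerView> breakers_;
};
```

Keeping the breakers in a map keyed by node id means one failing node is dropped from routing without affecting the others.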

Monitoring

Metrics Exposed

  • circuit_breaker_state{name="node-1"} = 0|1|2
  • circuit_breaker_failure_count{name="node-1"}
  • circuit_breaker_success_count{name="node-1"}
  • circuit_breaker_rejected_count{name="node-1"}
  • circuit_breaker_state_transitions_total{name="node-1"}
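
The metrics listed above could be rendered as text along these lines. This is a sketch under assumptions: it uses a Prometheus-style line format, BreakerStats and renderMetrics are hypothetical names, and the encoding 0 = CLOSED, 1 = OPEN, 2 = HALF_OPEN is assumed because the source does not define which number maps to which state.

```cpp
#include <sstream>
#include <string>

// Assumed numeric encoding; the source only states the values 0|1|2.
enum class BreakerState { CLOSED = 0, OPEN = 1, HALF_OPEN = 2 };

// Hypothetical snapshot of one breaker's counters.
struct BreakerStats {
  BreakerState state = BreakerState::CLOSED;
  long failures = 0;
  long successes = 0;
  long rejected = 0;
  long transitions = 0;
};

// Renders one line per metric, labelled with the breaker name.
std::string renderMetrics(const std::string& name, const BreakerStats& s) {
  const std::string label = "{name=\"" + name + "\"} ";
  std::ostringstream out;
  out << "circuit_breaker_state" << label
      << static_cast<int>(s.state) << '\n'
      << "circuit_breaker_failure_count" << label << s.failures << '\n'
      << "circuit_breaker_success_count" << label << s.successes << '\n'
      << "circuit_breaker_rejected_count" << label << s.rejected << '\n'
      << "circuit_breaker_state_transitions_total" << label
      << s.transitions << '\n';
  return out.str();
}
```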