ADR-012: Error Handling and Resilience Strategy

Date: 2026-06-24
Status: Accepted

Context

The ProviderFailover module (src/llm/ProviderFailover.ts) implements a multi-layered resilience strategy, but the design decisions were never formally documented. The v2.0.0 Foundation Audit identified that ADR-011 and ADR-012 were stubs referenced in academy articles but never created. This ADR captures the error handling taxonomy, circuit breaker state machine, retry strategies, and the relationships between resilience components.

The strategy spans three layers:

  1. Rate limiter (proactive — prevents errors before they happen)
  2. Circuit breaker (reactive — handles errors when they occur)
  3. Provider chain (orchestration — rotates through providers)

Decision

Failure classification taxonomy

All provider errors are classified into one of 7 categories by classifyError() in ProviderFailover.ts:

| Category | HTTP Codes | Retryable | Permanent | Default Cooldown | Behavior |
|---|---|---|---|---|---|
| auth | 401 | No | Yes | 0ms | Skip provider permanently, throw immediately |
| payment | 402 | No | Yes | 0ms | Skip provider permanently |
| rate_limited | 429 | Yes | No | Retry-After header × 1000 (default 60s) | Trip breaker, probe before cooldown expiry |
| timeout | 408 | Yes | No | 30s | Trip breaker, retry with backoff |
| unavailable | 503, 500+ | Yes | No | 60s | Trip breaker, probe before cooldown expiry |
| model_not_found | 404 | No | Yes | 0ms | Skip provider permanently |
| unknown | Other | No | No | 0ms | Propagate to caller; no circuit state change |

Classification cascades: instanceof checks first (via typed error classes), then regex match for embedded HTTP status codes in error messages, then falls through to unknown.
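
The cascade described above can be sketched roughly as follows. This is a minimal illustration, not the code in ProviderFailover.ts: the ProviderHttpError class, the categoryForStatus helper, and the exact regex are assumptions made for the example, since the ADR does not name the typed error classes or the pattern used.

```typescript
type FailureCategory =
  | "auth"
  | "payment"
  | "rate_limited"
  | "timeout"
  | "unavailable"
  | "model_not_found"
  | "unknown";

// Hypothetical typed error class; the real class names are not given in this ADR.
class ProviderHttpError extends Error {
  readonly status: number;
  constructor(status: number, message: string) {
    super(message);
    this.name = "ProviderHttpError";
    this.status = status;
  }
}

// Maps an HTTP status code to a category, following the taxonomy table.
function categoryForStatus(status: number): FailureCategory {
  if (status === 401) return "auth";
  if (status === 402) return "payment";
  if (status === 404) return "model_not_found";
  if (status === 408) return "timeout";
  if (status === 429) return "rate_limited";
  if (status >= 500) return "unavailable";
  return "unknown";
}

function classifyError(err: unknown): FailureCategory {
  // Step 1: typed error classes, detected with instanceof.
  if (err instanceof ProviderHttpError) {
    return categoryForStatus(err.status);
  }
  // Step 2: look for a 4xx/5xx status code embedded in the message.
  // The pattern is an assumption and can misfire on unrelated numbers.
  const message = err instanceof Error ? err.message : String(err);
  const match = /\b([45]\d{2})\b/.exec(message);
  if (match) {
    return categoryForStatus(Number(match[1]));
  }
  // Step 3: nothing matched.
  return "unknown";
}
```

Putting the instanceof check first means a typed error is never reclassified because of a stray number in its message; the regex step only sees errors that carry no structured status.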

Circuit breaker state machine

Three states, stored per-provider in Map<string, CircuitBreakerState>:

    ┌──────────┐   tripBreaker(): threshold reached,   ┌──────────┐
    │  CLOSED  │   or permanent error (cooldown=inf)   │   OPEN   │
    │ (normal) │ ────────────────────────────────────▶ │          │
    └──────────┘                                       └──────────┘
         ▲                                               │      ▲
         │                              probe due, or    │      │  probe fails
         │  probe succeeds              cooldown expired │      │  (tripBreaker())
         │  (onSuccess())                                ▼      │
         │                                             ┌───────────┐
         └──────────────────────────────────────────── │ HALF_OPEN │
                                                       │ (probing) │
                                                       └───────────┘

Transitions:

  • CLOSED → OPEN: tripBreaker() called when failure count reaches failureThreshold (default: 1). Permanent errors (auth, payment) always open immediately regardless of threshold. Sets cooldownUntil timestamp.
  • OPEN → HALF_OPEN: activeProviders() runs before each request. If Date.now() >= probeScheduledAt, transitions to probing state.
  • HALF_OPEN → CLOSED: onSuccess() resets failure count and cooldown. The probe request succeeded.
  • HALF_OPEN → OPEN: If probe request fails, tripBreaker() is called again, resetting cooldown.
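
The four transitions can be sketched as a small per-provider registry. This is an illustrative sketch under assumptions: the BreakerRegistry class, the field layout of CircuitBreakerState, the stateOf helper, and the explicit `now` parameters (used here in place of Date.now() so the behavior is easy to test) are not taken from ProviderFailover.ts.

```typescript
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

// Assumed shape; the ADR names the type but does not list its fields.
interface CircuitBreakerState {
  state: CircuitState;
  failures: number;
  cooldownUntil: number; // epoch ms
  probeScheduledAt: number; // epoch ms
}

const PROBE_LEAD_MS = 30_000;

class BreakerRegistry {
  private readonly breakers = new Map<string, CircuitBreakerState>();
  private readonly failureThreshold: number;

  constructor(failureThreshold = 1) {
    this.failureThreshold = failureThreshold;
  }

  private entry(provider: string): CircuitBreakerState {
    let b = this.breakers.get(provider);
    if (!b) {
      b = { state: "CLOSED", failures: 0, cooldownUntil: 0, probeScheduledAt: 0 };
      this.breakers.set(provider, b);
    }
    return b;
  }

  // CLOSED -> OPEN and HALF_OPEN -> OPEN.
  tripBreaker(provider: string, cooldownMs: number, permanent: boolean, now: number): void {
    const b = this.entry(provider);
    b.failures += 1;
    const failedProbe = b.state === "HALF_OPEN";
    if (permanent || failedProbe || b.failures >= this.failureThreshold) {
      b.state = "OPEN";
      // Permanent errors never recover: an infinite cooldown is never reached.
      b.cooldownUntil = permanent ? Infinity : now + cooldownMs;
      b.probeScheduledAt = b.cooldownUntil - PROBE_LEAD_MS;
    }
  }

  // HALF_OPEN -> CLOSED: the probe request succeeded.
  onSuccess(provider: string): void {
    const b = this.entry(provider);
    b.state = "CLOSED";
    b.failures = 0;
    b.cooldownUntil = 0;
    b.probeScheduledAt = 0;
  }

  // OPEN -> HALF_OPEN, evaluated before each request.
  activeProviders(chain: string[], now: number): string[] {
    return chain.filter((provider) => {
      const b = this.entry(provider);
      if (b.state === "OPEN" && (now >= b.probeScheduledAt || now >= b.cooldownUntil)) {
        b.state = "HALF_OPEN";
      }
      return b.state !== "OPEN";
    });
  }

  stateOf(provider: string): CircuitState {
    return this.entry(provider).state;
  }
}
```

For example, a provider tripped at t=0 with a 60s cooldown is filtered out at t=10s and readmitted as HALF_OPEN at t=30s; the next request routed to it then acts as the probe.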

Retry/backoff strategies (5 strategies)

Defined in executeWithStrategy() and the provider loop in complete():

1-2. Immediate retry + exponential backoff: up to 3 attempts on the same provider, with a backoff of 1000 × 2^(retry-1+index) ms. Permanent errors (except model_not_found) throw immediately without retry.
3. Provider rotation: if all strategies on provider A fail, move to provider B in the chain.
4. Model downgrade: when the primary model fails on a provider, retry with a cheaper or faster model on the same provider, using a per-provider modelMap. Implemented in executeWithStrategy() for complete(), and in-stream for stream() and embed(). Falls through to model_not_found when all models are exhausted.
5. Human escalation: NOT YET IMPLEMENTED. When all providers and models fail, surface the error chain to the user.
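
As a worked example of the backoff formula, the sketch below computes 1000 × 2^(retry-1+index) for a few inputs. The ADR does not define `index` or say whether `retry` counts from 1; here `retry` is assumed to be 1-based and `index` is treated as an opaque non-negative offset supplied by the caller. The function names are hypothetical.

```typescript
const BASE_DELAY_MS = 1000;
const MAX_ATTEMPTS = 3;

// Delay in ms for a given retry number and offset, per the formula above.
function backoffDelayMs(retry: number, index: number): number {
  return BASE_DELAY_MS * 2 ** (retry - 1 + index);
}

// Delays for retry = 1..MAX_ATTEMPTS at a fixed offset.
function backoffSchedule(index: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry <= MAX_ATTEMPTS; retry++) {
    delays.push(backoffDelayMs(retry, index));
  }
  return delays;
}

console.log(backoffSchedule(0)); // [ 1000, 2000, 4000 ]
console.log(backoffSchedule(1)); // [ 2000, 4000, 8000 ]
```

Because `index` is added to the exponent, each increment of it doubles every delay in the schedule.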

Probe recovery

A provider in the OPEN state becomes eligible for a probe 30 seconds before its cooldown expires (probeScheduledAt = cooldownUntil - 30_000). This preemptive approach:

  • Reduces user-visible latency (no need to wait for full cooldown)
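  • Works for cooldowns > 30s (rate_limited: 60s, unavailable: 60s). The 30s timeout cooldown is borderline: probeScheduledAt equals the time the breaker tripped, so the provider is eligible for a probe immediately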
  • Uses activeProviders() filter, called at the start of every complete() and stream() invocation
  • No separate probe request — the next user request to that provider serves as the probe
  • Also transitions OPEN → HALF_OPEN when cooldown expires naturally (Date.now() >= cooldownUntil), providing a second recovery path in activeProviders()
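
The timing arithmetic can be checked with a short sketch. The probeSchedule function and the effectiveWaitMs field are hypothetical names introduced for illustration; the cooldown values are the defaults from the taxonomy table.

```typescript
const PROBE_LEAD_MS = 30_000;

// Default cooldowns for the retryable categories, from the taxonomy table.
const DEFAULT_COOLDOWN_MS = {
  rate_limited: 60_000,
  timeout: 30_000,
  unavailable: 60_000,
};

interface ProbeSchedule {
  cooldownUntil: number;
  probeScheduledAt: number;
  effectiveWaitMs: number; // time from trip until the first probe is allowed
}

function probeSchedule(trippedAt: number, cooldownMs: number): ProbeSchedule {
  const cooldownUntil = trippedAt + cooldownMs;
  const probeScheduledAt = cooldownUntil - PROBE_LEAD_MS;
  // A cooldown of 30s or less leaves no waiting period at all.
  const effectiveWaitMs = Math.max(0, probeScheduledAt - trippedAt);
  return { cooldownUntil, probeScheduledAt, effectiveWaitMs };
}

console.log(probeSchedule(0, DEFAULT_COOLDOWN_MS.rate_limited).effectiveWaitMs); // 30000
console.log(probeSchedule(0, DEFAULT_COOLDOWN_MS.timeout).effectiveWaitMs); // 0
```

With the defaults, rate_limited and unavailable providers wait 30s before the first probe, while a timeout provider can be probed by the very next request.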

Error propagation across the provider chain

Provider A (preferred)
  ├── Strategy 1: immediate retry
  ├── Strategy 2: exponential backoff
  ├── Strategy 3: provider rotation (to B)
  └── Strategy 4: model downgrade (on A, fallback models via modelMap)

Provider B (fallback 1)  ← same strategy set repeats
Provider C (fallback 2)  ← same strategy set repeats

All failed → AllProvidersFailedError with concatenated error messages

AllProvidersFailedError carries a category field reflecting the most recent failure category, enabling callers to make informed decisions about what to report to the user.
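
A minimal sketch of how the chain might aggregate failures into AllProvidersFailedError follows. It makes several assumptions: the ProviderFailure and Attempt types, the runChain function, the failures field, and the "; " separator are all hypothetical, and the attempt callback is synchronous only to keep the example short (real provider calls would be asynchronous).

```typescript
type FailureCategory =
  | "auth"
  | "payment"
  | "rate_limited"
  | "timeout"
  | "unavailable"
  | "model_not_found"
  | "unknown";

interface ProviderFailure {
  provider: string;
  category: FailureCategory;
  message: string;
}

class AllProvidersFailedError extends Error {
  readonly category: FailureCategory;
  readonly failures: ProviderFailure[];

  constructor(failures: ProviderFailure[]) {
    // Concatenate every provider's message into one error message.
    super(failures.map((f) => `${f.provider}: ${f.message}`).join("; "));
    this.name = "AllProvidersFailedError";
    this.failures = failures;
    // category reflects the most recent failure in the chain.
    const last = failures[failures.length - 1];
    this.category = last ? last.category : "unknown";
  }
}

type Attempt =
  | { ok: true; value: string }
  | { ok: false; category: FailureCategory; message: string };

// Walks the chain in order; `attempt` stands in for a provider's full strategy set.
function runChain(chain: string[], attempt: (provider: string) => Attempt): string {
  const failures: ProviderFailure[] = [];
  for (const provider of chain) {
    const result = attempt(provider);
    if (result.ok) return result.value;
    failures.push({ provider, category: result.category, message: result.message });
  }
  throw new AllProvidersFailedError(failures);
}
```

A caller can then branch on the category, for instance suggesting a later retry when it is rate_limited and prompting for credentials when it is auth.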

Relationship to RiskManager and SafetyGuard

| Component | Concern | Interaction with ProviderFailover |
|---|---|---|
| ProviderFailover | Provider reliability | Circuit breaker, failover, retry |
| RiskManager | Tool-level safety (paths, hosts, file size) | Separate: validates inputs before any provider call |
| SafetyGuard | Agent loop safety (caps, profiles, loop detection) | Separate: monitors agent behavior, not provider calls |
| RateLimiter | Request throttling (proposed in ADR-011) | Prevents 429s; circuit breaker handles them when they occur |

Alternatives Considered

| Strategy | Pros | Cons | Why Rejected |
|---|---|---|---|
| Exponential backoff only (no circuit breaker) | Simpler implementation | No cooldown: hammers the provider on every retry | Without circuit breaker, transient errors cause sustained load |
| Circuit breaker without probe recovery | Simpler state machine | Must wait full cooldown before retrying; poor UX | Preemptive probes reduce visible latency |
| All errors are permanent | Simple, predictable | Poor resilience: any transient failure kills the agent | Unacceptable for production use |
| Single provider, no failover | Simplest possible | Single point of failure | Incompatible with M4 requirements |

Consequences

Easier:

  • Clear taxonomy for classifying any provider error
  • State machine is simple enough to reason about and test
  • Probe recovery reduces user-visible latency
  • Provider chain is extensible (add a provider → more resilience)

Harder:

  • 5 strategies are split across two code paths (executeWithStrategy + provider loop) — non-obvious
  • No Retry-After header parsing (hardcoded defaults)
  • Probe recovery at cooldown - 30s is a heuristic — may not match all provider rate limit windows

Deferred to #217:

  • Strategy 5 (human escalation) — surface the error chain to the user when all providers and models fail

References