
Spec: Provider Failover with Circuit Breaker

Status: Implemented (v2.0.0)

Issue: #161

Architecture Doc: architecture/provider-failover.md


Problem

The Agenthood runtime currently depends on a single LLM provider per execution. When that provider fails—due to rate limits, temporary outages, or authentication issues—the entire member execution fails and the task is interrupted.

This creates poor production reliability:

  • Groq rate limits (30 req/min free tier) kill sessions during burst activity
  • Provider outages halt all member work, even though alternative providers are available
  • Intermittent failures force humans to retry manually instead of being recovered automatically
  • No retry logic means transient network blips cause permanent task failure

The user expects the Society to handle provider fragility transparently. If Claude is down, use OpenAI. If Groq hits rate limits, switch to Ollama. The member should continue working—the human should never notice the plumbing.


Proposed Solution

Implement a three-phase progressive enhancement to provider resilience:

Phase 1: Basic Failover (MVP)

Try each configured provider in sequence until one succeeds or all fail.

Behavior:

  • Load provider chain from .agenthood/config.json (llm.providers[])
  • Attempt providers in order: [primary, fallback1, fallback2, ...]
  • On failure, immediately try the next provider
  • If all providers fail, throw AllProvidersFailedError with structured failure details

No circuit breaker yet—this is a simple try/catch loop with failover.
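The loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual ProviderChain code: the Provider and ProviderFailure shapes and the completeWithFailover function are hypothetical names, and the error class is a simplified stand-in for AllProvidersFailedError.

```typescript
// Hypothetical minimal provider shape; the real ILLMProvider interface may differ.
interface Provider {
  name: string;
  complete(prompt: string): Promise<string>;
}

// One entry per failed provider, kept so the final error can report every reason.
interface ProviderFailure {
  provider: string;
  reason: string;
}

// Simplified stand-in for the spec's AllProvidersFailedError.
class AllProvidersFailedError extends Error {
  readonly failures: ProviderFailure[];

  constructor(failures: ProviderFailure[]) {
    super(
      "All providers failed: " +
        failures.map((f) => f.provider + ": " + f.reason).join("; ")
    );
    this.name = "AllProvidersFailedError";
    this.failures = failures;
  }
}

// Try each provider in order; return the first success, or throw once all have failed.
async function completeWithFailover(chain: Provider[], prompt: string): Promise<string> {
  const failures: ProviderFailure[] = [];
  for (const provider of chain) {
    try {
      return await provider.complete(prompt);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push({ provider: provider.name, reason });
    }
  }
  throw new AllProvidersFailedError(failures);
}
```

With an empty chain the loop body never runs, so the function throws immediately with an empty failure list.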

Acceptance Criteria:

  • Provider chain loaded from config
  • Each provider attempted in sequence
  • First success stops the chain and returns result
  • All failures collected and reported in AllProvidersFailedError
  • Existing member execution flow unchanged (transparent drop-in)

Phase 2: Circuit Breaker

Track provider health and skip known-bad providers.

Behavior:

  • Each provider has a circuit state: CLOSED | OPEN | HALF_OPEN
  • CLOSED: Normal operation—requests flow through
  • OPEN: Provider bypassed (failures exceeded threshold)—skip to next provider
  • HALF_OPEN: Cooldown expired—one probe request allowed to test recovery

State Transitions:

CLOSED → (N failures within window) → OPEN
OPEN → (cooldown expires) → HALF_OPEN
HALF_OPEN → (probe succeeds) → CLOSED
HALF_OPEN → (probe fails) → OPEN (extend cooldown)

Configuration:

interface CircuitBreakerConfig {
  failureThreshold: number;      // default: 3 failures
  failureWindow: number;          // default: 60 seconds
  cooldownDuration: number;       // default: 120 seconds
  probeBeforeCooldown: number;    // default: 30 seconds before cooldown expires
}
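As an illustration of the state transitions, here is a minimal sketch of a per-provider breaker. It assumes millisecond durations and a caller-supplied clock value (now) to keep it testable. The class, the BreakerOptions field names, and the 1.5x cooldown extension on a failed probe are assumptions for this example, not the shipped implementation.

```typescript
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

// Hypothetical options in milliseconds (spec defaults: 3 failures, 60 s window, 120 s cooldown).
interface BreakerOptions {
  failureThreshold: number;
  failureWindowMs: number;
  cooldownMs: number;
}

class CircuitBreaker {
  private options: BreakerOptions;
  private state: CircuitState = "CLOSED";
  private failureTimes: number[] = [];
  private openedAt = 0;
  private currentCooldownMs: number;

  constructor(options: BreakerOptions) {
    this.options = options;
    this.currentCooldownMs = options.cooldownMs;
  }

  // State as of `now`; moves OPEN -> HALF_OPEN once the cooldown has expired.
  getState(now: number): CircuitState {
    if (this.state === "OPEN" && now - this.openedAt >= this.currentCooldownMs) {
      this.state = "HALF_OPEN";
    }
    return this.state;
  }

  // A successful call or probe closes the circuit and resets all history.
  recordSuccess(): void {
    this.state = "CLOSED";
    this.failureTimes = [];
    this.currentCooldownMs = this.options.cooldownMs;
  }

  recordFailure(now: number): void {
    const state = this.getState(now);
    if (state === "OPEN") return; // already bypassed; nothing to count
    if (state === "HALF_OPEN") {
      // Failed probe: re-open with a longer cooldown (the 1.5x factor is an assumption).
      this.currentCooldownMs = this.currentCooldownMs * 1.5;
      this.open(now);
      return;
    }
    // CLOSED: count only failures inside the sliding window.
    const windowStart = now - this.options.failureWindowMs;
    this.failureTimes = this.failureTimes.filter((t) => t > windowStart);
    this.failureTimes.push(now);
    if (this.failureTimes.length >= this.options.failureThreshold) {
      this.open(now);
    }
  }

  private open(now: number): void {
    this.state = "OPEN";
    this.openedAt = now;
    this.failureTimes = [];
  }
}
```

A failover loop would call getState(Date.now()) for each provider, skip those that report OPEN, and let a single request through when the state is HALF_OPEN.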

Acceptance Criteria:

  • Circuit state tracked per provider
  • Providers in OPEN state are skipped during failover
  • Circuit opens after N failures within time window
  • Circuit transitions to HALF_OPEN after cooldown
  • Probe request succeeds → circuit closes → provider restored
  • Probe request fails → circuit re-opens with extended cooldown
  • Circuit state logged (debug level) on every transition
  • Thread execution continues across provider switches

Phase 3: Advanced Recovery

Intelligent failure handling with classification, cooldown strategies, and preemptive probes.

Behavior:

3.1 Failure Classification

Not all failures are equal. Classify errors and apply appropriate cooldown:

HTTP Status     Classification        Cooldown    Behavior
401             Auth failure          Permanent   Skip provider, log warning once
402             Payment required      Permanent   Skip provider, log warning once
429             Rate limited          60–300s     Backoff based on Retry-After header
408             Request timeout       30s         Retry once, then failover
503             Service unavailable   60s         Immediate failover
404             Model not found       Permanent   Skip provider, log error
Network error   Transient             30s         Retry with exponential backoff
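The classification table could be expressed as a lookup function along these lines. This is a sketch only: classifyFailure and FailureClassification are hypothetical names (the real classifyError in ProviderFailover.ts may have a different shape), the 60 s default for 429 is the lower bound of the table's range, and treating unlisted statuses as transient is an assumption.

```typescript
interface FailureClassification {
  label: string;
  cooldownMs: number | null; // null marks a permanent failure (no automatic recovery)
}

// `status` is the HTTP status code, or undefined for a network error with no response.
function classifyFailure(status: number | undefined): FailureClassification {
  switch (status) {
    case 401:
      return { label: "Auth failure", cooldownMs: null };
    case 402:
      return { label: "Payment required", cooldownMs: null };
    case 404:
      return { label: "Model not found", cooldownMs: null };
    case 429:
      // Lower bound of the 60-300 s range; a Retry-After header may raise it.
      return { label: "Rate limited", cooldownMs: 60000 };
    case 408:
      return { label: "Request timeout", cooldownMs: 30000 };
    case 503:
      return { label: "Service unavailable", cooldownMs: 60000 };
    default:
      // Network errors, plus (as an assumption) any status the table does not list.
      return { label: "Transient", cooldownMs: 30000 };
  }
}
```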

3.2 Retry-After Header Parsing

If a provider returns 429 with a Retry-After header:

  • Parse the header (seconds or HTTP date)
  • Set cooldown to Retry-After value (capped at 5 minutes)
  • Log the retry time for the user
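The parsing steps above might look like the following sketch. The function name parseRetryAfter and its fallbackMs parameter are illustrative assumptions. The header value is either a whole number of seconds or an HTTP date, and the result is capped at 5 minutes.

```typescript
const MAX_COOLDOWN_MS = 5 * 60 * 1000; // 5-minute cap from the spec

// Returns the cooldown in milliseconds. `now` is the current epoch time in ms;
// `fallbackMs` is used when the header is missing or cannot be parsed.
function parseRetryAfter(header: string | undefined, now: number, fallbackMs: number): number {
  if (header === undefined) return fallbackMs;
  const value = header.trim();
  let delayMs: number;
  if (/^\d+$/.test(value)) {
    // Delay-seconds form, e.g. "120"
    delayMs = parseInt(value, 10) * 1000;
  } else {
    // HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
    const retryAt = Date.parse(value);
    if (isNaN(retryAt)) return fallbackMs;
    delayMs = Math.max(0, retryAt - now);
  }
  return Math.min(delayMs, MAX_COOLDOWN_MS);
}
```

A caller would typically pass Date.now() for now and the classification's default cooldown for fallbackMs.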

3.3 Preemptive Probe Recovery

Send a lightweight probe request 30 seconds before cooldown expires:

  • Use a minimal prompt (e.g., "Hello")
  • On success → provider restored early
  • On failure → cooldown extended by 50%

This avoids hammering a recovering provider with full-sized requests.
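The probe timing and the 50% extension can be illustrated with two small pure functions. The Cooldown shape and the function names are hypothetical, and clamping the probe time to the start of the cooldown (for cooldowns shorter than 30 seconds) is an assumption the spec does not address.

```typescript
const PROBE_LEAD_MS = 30000; // probe 30 s before the cooldown expires

// Hypothetical record of an active cooldown (times in epoch milliseconds).
interface Cooldown {
  startedAt: number;
  durationMs: number;
}

// When the lightweight probe should be sent. For cooldowns shorter than the
// lead time, this sketch probes at the start of the cooldown (an assumption).
function probeTime(cooldown: Cooldown): number {
  return Math.max(cooldown.startedAt, cooldown.startedAt + cooldown.durationMs - PROBE_LEAD_MS);
}

// Outcome of a probe: null means the provider is restored early;
// otherwise the cooldown is extended by 50%.
function afterProbe(cooldown: Cooldown, probeSucceeded: boolean): Cooldown | null {
  if (probeSucceeded) return null;
  return { startedAt: cooldown.startedAt, durationMs: cooldown.durationMs * 1.5 };
}
```

For a 120-second cooldown starting at time 0, probeTime gives 90000 ms, and a failed probe stretches the cooldown to 180000 ms.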

3.4 Exponential Backoff for Transient Failures

For network errors and timeouts:

  • First retry: immediate
  • Second retry: 2 seconds delay
  • Third retry: 4 seconds delay
  • After 3 retries: failover to next provider
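A sketch of this retry schedule, assuming one initial attempt followed by up to three retries with delays of 0, 2, and 4 seconds. The retryWithBackoff name and the injectable sleep parameter are illustrative, the latter added so the schedule can be tested without real waiting.

```typescript
// Delay before each retry, in milliseconds: immediate, 2 s, 4 s.
const RETRY_DELAYS_MS = [0, 2000, 4000];

// Runs `operation` once, then retries per the schedule above. If every attempt
// fails, the last error is rethrown so the caller can fail over to the next provider.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  let lastError: unknown;
  try {
    return await operation();
  } catch (err) {
    lastError = err;
  }
  for (const delayMs of RETRY_DELAYS_MS) {
    if (delayMs > 0) await sleep(delayMs);
    try {
      return await operation();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

In a failover loop, each provider call would be wrapped in retryWithBackoff only when the failure is classified as transient. Permanent and rate-limit failures would skip straight to the next provider.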

Acceptance Criteria:

  • HTTP status codes classified correctly
  • Permanent failures skip provider without cooldown
  • Rate limit failures parse Retry-After header
  • Cooldown durations set per failure classification
  • Probe request sent 30s before cooldown expiry
  • Probe success restores provider early
  • Exponential backoff applies to transient network failures
  • All failure details logged (error level) with classification
  • Metrics collected: provider health, failure counts, recovery times

Out of Scope

The following are explicitly NOT included in this spec:

  • Credential proxy — API key injection via localhost proxy (separate feature)
  • Thread checkpoint — Persisting conversation state for cross-provider continuity (separate feature, may not be needed)
  • Provider cost tracking — Monitoring token usage and costs per provider (separate feature)
  • Human escalation UI — Interactive prompt to choose fallback provider when all fail (future enhancement)

These may be addressed in future specs but are not part of this implementation.


Testing Strategy

Unit Tests

Location: src/llm/providerFailover.test.ts

Test cases:

  • ✅ Single provider success (no failover)
  • ✅ First provider fails → second succeeds
  • ✅ All providers fail → AllProvidersFailedError thrown
  • ✅ Circuit opens after N failures
  • ✅ Circuit transitions to HALF_OPEN after cooldown
  • ✅ Probe success closes circuit
  • ✅ Probe failure re-opens circuit
  • ✅ Permanent failure (401) skips provider without cooldown
  • ✅ Rate limit (429) applies Retry-After cooldown
  • ✅ Transient failure applies exponential backoff
  • ✅ Probe request sent 30s before cooldown expiry
  • ✅ Empty provider chain throws immediately

Coverage target: 95%+ (critical path for production reliability)

Integration Tests

Location: tests/integration/providerFailover.test.ts

Test scenarios:

  • ✅ Real member execution with mocked provider responses
  • ✅ Groq rate limit → Ollama fallback → task completes
  • ✅ Claude auth failure → OpenAI fallback → task completes
  • ✅ All providers fail → error includes all failure reasons
  • ✅ Provider recovers mid-execution → circuit closes
  • ✅ Config validation: invalid provider name rejected
  • ✅ Thread continuity across provider switches

Mock strategy: Use nock to simulate HTTP responses from LLM APIs

E2E Tests

Location: tests/e2e/resilience.test.ts

Test scenarios:

  • agenthood run the-scribe with Groq rate limit → completes via fallback
  • agenthood run the-architect with network timeout → retries and completes
  • ✅ Multi-member execution (The Architect → The Scribe) with provider failure mid-chain
  • ✅ Provider recovery: Groq fails, recovers after cooldown, reused in next execution

Environment: Use Docker Compose to spin up Ollama for local fallback testing


Acceptance Criteria

Phase 1: Basic Failover

  • ProviderChain class exists in src/llm/ProviderFailover.ts
  • Loads provider chain from .agenthood/config.json (via LLMRouter.fromConfig())
  • Attempts each provider in sequence until one succeeds
  • Collects all failure reasons and throws AllProvidersFailedError if all fail
  • Integrated via LLMRouter into run.ts
  • Existing member executions work without changes (transparent integration)
  • Unit tests pass for basic failover logic
  • Integration test demonstrates Groq → Ollama fallback (not yet implemented)

Phase 2: Circuit Breaker

  • Circuit state tracked per provider (in-memory Map)
  • Circuit opens after configurable failureThreshold (default: 1)
  • Open circuit skips provider during failover
  • Circuit transitions to HALF_OPEN after configurable cooldownMs
  • Probe request tests provider recovery in HALF_OPEN state
  • Probe success → CLOSED, probe failure → OPEN with extended cooldown
  • Circuit state accessible via getBreakerState()
  • Unit tests cover all state transitions
  • Integration test demonstrates circuit behavior across multiple executions (not yet implemented)

Phase 3: Advanced Recovery

  • HTTP status codes classified (401, 402, 429, 408, 503, 404, network)
  • Permanent failures (401, 402) skip provider without cooldown
  • ModelNotFoundError (404) skips to fallback model on same provider before tripping
  • Rate limit failures (429) apply cooldown; Retry-After header parsing deferred (hardcoded defaults)
  • Transient failures (408, 503) use exponential backoff (3 attempts, with delays of 1000ms and 2000ms between them)
  • Probe request sent 30 seconds before cooldown expires
  • Probe success restores provider early
  • All failures logged with classification and cooldown duration
  • Unit tests cover all failure classifications
  • E2E test demonstrates rate limit → cooldown → probe recovery (not yet implemented)

Open Questions

Q1: Should circuit state persist across runtime restarts?

Context: Currently, circuit state is in-memory. If the runtime crashes or is restarted, all providers reset to CLOSED.

Options:

  • A: In-memory only (current)
  • B: Persist to .agenthood/cache/circuit-state.json
  • C: Use SQLite for circuit state storage

Deferred because: This is an optimization for long-running production deployments. The MVP can ship without persistence—failures will be re-learned after restart.

Decision by: End of Phase 2 implementation


Q2: Should we emit provider failover events for observability?

Context: When a provider fails and failover occurs, the user has no visibility unless they enable debug logging.

Options:

  • A: Emit events to EventEmitter for runtime subscribers
  • B: Write structured logs to .agenthood/logs/failover.json
  • C: Do nothing—debug logs are sufficient

Deferred because: This depends on the broader observability strategy (metrics, telemetry, UI). Can be added later without changing failover logic.

Decision by: When observability/telemetry system is designed


Q3: How should member-specific provider preferences interact with failover?

Context: The Doorman wants fast/cheap providers (Groq, Ollama). The Architect wants deep reasoning (Claude, OpenAI). Should the failover chain be member-aware?

Options:

  • A: Global failover chain (all members use same chain)
  • B: Per-member provider preferences override global chain
  • C: Per-member primary provider, global chain for fallback

Deferred because: This is explicitly out of scope for this spec. Member preferences are a separate feature that can layer on top of failover later.

Decision by: When member preferences spec is written (issue TBD)


Implementation Notes

Directory Structure

src/llm/
├── ProviderFailover.ts          # ProviderChain + classifyError + circuit breaker
├── ILLMProvider.ts              # Unified provider interface
├── LLMRouter.ts                 # Router: builds chains from config
├── types.ts                     # ProviderEntry, LLMConfig, LLMRequest/Response
├── errors.ts                    # Error classes: AuthError, RateLimitedError, etc.
├── AnthropicProvider.ts
├── GroqProvider.ts
├── OpenAIProvider.ts
└── OllamaProvider.ts

tests/unit/llm/
└── ProviderFailover.test.ts     # 36 tests covering all failover scenarios

Configuration Schema

Add to .agenthood/config.json:

{
  "llm": {
    "providers": [
      {
        "name": "anthropic",
        "model": "claude-sonnet-4-20250514",
        "apiKey": "...",
        "models": ["claude-sonnet-4-20250514", "claude-haiku-3-20250301"]
      },
      {
        "name": "groq",
        "model": "llama-3.1-70b-versatile",
        "apiKey": "..."
      }
    ],
    "failover": {
      "failureThreshold": 1,
      "cooldownMs": 30000,
      "probeEnabled": true
    }
  }
}

The models array on a provider entry defines the model downgrade chain for
Strategy 4. The first entry is the primary model; subsequent entries are
fallbacks tried before failing over to the next provider.
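To make the resulting attempt order concrete, the following sketch expands a provider list into (provider, model) pairs. The ProviderEntry shape mirrors the config above but is a hypothetical stand-in for the type in types.ts, and expandAttempts is an illustrative name. It assumes that model is the only candidate when models is absent.

```typescript
// Hypothetical shape mirroring the config entries; the real ProviderEntry may differ.
interface ProviderEntry {
  name: string;
  model: string;
  apiKey?: string;
  models?: string[];
}

interface Attempt {
  provider: string;
  model: string;
}

// Expand the chain into the ordered list of attempts: every model of a
// provider is tried before moving on to the next provider.
function expandAttempts(providers: ProviderEntry[]): Attempt[] {
  const attempts: Attempt[] = [];
  for (const entry of providers) {
    // Assumption: when `models` is absent or empty, `model` is the only candidate.
    const models = entry.models && entry.models.length > 0 ? entry.models : [entry.model];
    for (const model of models) {
      attempts.push({ provider: entry.name, model: model });
    }
  }
  return attempts;
}
```

For the example config, the order would be anthropic with claude-sonnet-4-20250514, then anthropic with claude-haiku-3-20250301, then groq with llama-3.1-70b-versatile.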

Error Handling

class AllProvidersFailedError extends Error {
  readonly category: string

  constructor(errors: string[], category: string = 'unknown') {
    super(`All providers failed: ${errors.join('; ')}`)
    this.name = 'AllProvidersFailedError'
    this.category = category
  }
}

Thread Continuity

Assumption: Threads are stateless—each executeMember() call is independent. No checkpoint/restore needed for MVP.

If thread continuity is required later:

  • Persist conversation history to .agenthood/cache/threads/{threadId}.json
  • Reload history on provider switch
  • This is a separate feature—not blocking this spec

Success Metrics

How we know this is working in production:

  • Failover success rate: % of executions that succeed after failover (target: >95%)
  • Mean time to recovery: Average time from failure to successful provider switch (target: <5s)
  • Provider health score: % uptime per provider over 24h window
  • Circuit breaker effectiveness: % of avoided requests to known-bad providers
  • Probe recovery rate: % of providers restored via probe vs. timeout

These metrics can be logged to .agenthood/logs/failover-metrics.json or emitted via telemetry (out of scope for this spec, but design should allow future instrumentation).


References

  • Architecture Doc: architecture/provider-failover.md
  • Issue: #161 — Implement ProviderFailover for resilience
  • Pattern: Circuit Breaker Pattern (Martin Fowler)
  • Related ADRs:
    • ADR-008: TypeScript Runtime over Python
    • ADR-009: Groq as Default LLM Provider
    • ADR-011: Rate Limiter and Shared State Store
    • ADR-012: Error Handling and Resilience Strategy
    • ADR-013: Distribution Channel Priority