Spec: Provider Failover with Circuit Breaker
Status: Implemented (v2.0.0)
Issue: #161
Architecture Doc: architecture/provider-failover.md
Problem
The Agenthood runtime currently depends on a single LLM provider per execution. When that provider fails—due to rate limits, temporary outages, or authentication issues—the entire member execution fails and the task is interrupted.
This creates poor production reliability:
- Groq rate limits (30 req/min free tier) kill sessions during burst activity
- Provider outages halt all member work, even though alternative providers are available
- Intermittent failures force manual retry from humans instead of automated recovery
- No retry logic means transient network blips cause permanent task failure
The user expects the Society to handle provider fragility transparently. If Claude is down, use OpenAI. If Groq hits rate limits, switch to Ollama. The member should continue working—the human should never notice the plumbing.
Proposed Solution
Implement a three-phase progressive enhancement to provider resilience:
Phase 1: Basic Failover (MVP)
Try each configured provider in sequence until one succeeds or all fail.
Behavior:
- Load provider chain from
.agenthood/config.json(llm.providers[]) - Attempt providers in order:
[primary, fallback1, fallback2, ...] - On failure, immediately try the next provider
- If all providers fail, throw
AllProvidersFailedErrorwith structured failure details
No circuit breaker yet—this is a simple try/catch loop with failover.
Acceptance Criteria:
- Provider chain loaded from config
- Each provider attempted in sequence
- First success stops the chain and returns result
- All failures collected and reported in
AllProvidersFailedError - Existing member execution flow unchanged (transparent drop-in)
Phase 2: Circuit Breaker
Track provider health and skip known-bad providers.
Behavior:
- Each provider has a circuit state:
CLOSED|OPEN|HALF_OPEN - CLOSED: Normal operation—requests flow through
- OPEN: Provider bypassed (failures exceeded threshold)—skip to next provider
- HALF_OPEN: Cooldown expired—one probe request allowed to test recovery
State Transitions:
CLOSED → (N failures within window) → OPEN
OPEN → (cooldown expires) → HALF_OPEN
HALF_OPEN → (probe succeeds) → CLOSED
HALF_OPEN → (probe fails) → OPEN (extend cooldown)
Configuration:
interface CircuitBreakerConfig { failureThreshold: number; // default: 3 failures failureWindow: number; // default: 60 seconds cooldownDuration: number; // default: 120 seconds probeBeforeCooldown: number; // default: 30 seconds before cooldown expires }
Acceptance Criteria:
- Circuit state tracked per provider
- Providers in
OPENstate are skipped during failover - Circuit opens after N failures within time window
- Circuit transitions to
HALF_OPENafter cooldown - Probe request succeeds → circuit closes → provider restored
- Probe request fails → circuit re-opens with extended cooldown
- Circuit state logged (debug level) on every transition
- Thread execution continues across provider switches
Phase 3: Advanced Recovery
Intelligent failure handling with classification, cooldown strategies, and preemptive probes.
Behavior:
3.1 Failure Classification
Not all failures are equal. Classify errors and apply appropriate cooldown:
| HTTP Status | Classification | Cooldown | Behavior |
|---|---|---|---|
401 | Auth failure | Permanent | Skip provider, log warning once |
402 | Payment required | Permanent | Skip provider, log warning once |
429 | Rate limited | 60–300s | Backoff based on Retry-After header |
408 | Request timeout | 30s | Retry once, then failover |
503 | Service unavailable | 60s | Immediate failover |
404 | Model not found | Permanent | Skip provider, log error |
| Network error | Transient | 30s | Retry with exponential backoff |
3.2 Retry-After Header Parsing
If a provider returns 429 with a Retry-After header:
- Parse the header (seconds or HTTP date)
- Set cooldown to
Retry-Aftervalue (capped at 5 minutes) - Log the retry time to user
3.3 Preemptive Probe Recovery
Send a lightweight probe request 30 seconds before cooldown expires:
- Use a minimal prompt (e.g., "Hello")
- On success → provider restored early
- On failure → cooldown extended by 50%
Prevents hammering a recovering provider with full-sized requests.
3.4 Exponential Backoff for Transient Failures
For network errors and timeouts:
- First retry: immediate
- Second retry: 2 seconds delay
- Third retry: 4 seconds delay
- After 3 retries: failover to next provider
Acceptance Criteria:
- HTTP status codes classified correctly
- Permanent failures skip provider without cooldown
- Rate limit failures parse
Retry-Afterheader - Cooldown durations set per failure classification
- Probe request sent 30s before cooldown expiry
- Probe success restores provider early
- Exponential backoff applies to transient network failures
- All failure details logged (error level) with classification
- Metrics collected: provider health, failure counts, recovery times
Out of Scope
The following are explicitly NOT included in this spec:
- Credential proxy — API key injection via localhost proxy (separate feature)
- Credential proxy — API key injection via localhost proxy (separate feature)
- Thread checkpoint — Persisting conversation state for cross-provider continuity (separate feature, may not be needed)
- Provider cost tracking — Monitoring token usage and costs per provider (separate feature)
- Human escalation UI — Interactive prompt to choose fallback provider when all fail (future enhancement)
These may be addressed in future specs but are not part of this implementation.
Testing Strategy
Unit Tests
Location: src/llm/providerFailover.test.ts
Test cases:
- ✅ Single provider success (no failover)
- ✅ First provider fails → second succeeds
- ✅ All providers fail → AllProvidersFailedError thrown
- ✅ Circuit opens after N failures
- ✅ Circuit transitions to HALF_OPEN after cooldown
- ✅ Probe success closes circuit
- ✅ Probe failure re-opens circuit
- ✅ Permanent failure (401) skips provider without cooldown
- ✅ Rate limit (429) applies Retry-After cooldown
- ✅ Transient failure applies exponential backoff
- ✅ Probe request sent 30s before cooldown expiry
- ✅ Empty provider chain throws immediately
Coverage target: 95%+ (critical path for production reliability)
Integration Tests
Location: tests/integration/providerFailover.test.ts
Test scenarios:
- ✅ Real member execution with mocked provider responses
- ✅ Groq rate limit → Ollama fallback → task completes
- ✅ Claude auth failure → OpenAI fallback → task completes
- ✅ All providers fail → error includes all failure reasons
- ✅ Provider recovers mid-execution → circuit closes
- ✅ Config validation: invalid provider name rejected
- ✅ Thread continuity across provider switches
Mock strategy: Use nock to simulate HTTP responses from LLM APIs
E2E Tests
Location: tests/e2e/resilience.test.ts
Test scenarios:
- ✅
agenthood run the-scribewith Groq rate limit → completes via fallback - ✅
agenthood run the-architectwith network timeout → retries and completes - ✅ Multi-member execution (The Architect → The Scribe) with provider failure mid-chain
- ✅ Provider recovery: Groq fails, recovers after cooldown, reused in next execution
Environment: Use Docker Compose to spin up Ollama for local fallback testing
Acceptance Criteria
Phase 1: Basic Failover
-
ProviderChainclass exists insrc/llm/ProviderFailover.ts - Loads provider chain from
.agenthood/config.json(viaLLMRouter.fromConfig()) - Attempts each provider in sequence until one succeeds
- Collects all failure reasons and throws
AllProvidersFailedErrorif all fail - Integrated via
LLMRouterintorun.ts - Existing member executions work without changes (transparent integration)
- Unit tests pass for basic failover logic
- Integration test demonstrates Groq → Ollama fallback (not yet implemented)
Phase 2: Circuit Breaker
- Circuit state tracked per provider (in-memory Map)
- Circuit opens after configurable
failureThreshold(default: 1) - Open circuit skips provider during failover
- Circuit transitions to HALF_OPEN after configurable
cooldownMs - Probe request tests provider recovery in HALF_OPEN state
- Probe success → CLOSED, probe failure → OPEN with extended cooldown
- Circuit state accessible via
getBreakerState() - Unit tests cover all state transitions
- Integration test demonstrates circuit behavior across multiple executions (not yet implemented)
Phase 3: Advanced Recovery
- HTTP status codes classified (401, 402, 429, 408, 503, 404, network)
- Permanent failures (401, 402) skip provider without cooldown
- ModelNotFoundError (404) skips to fallback model on same provider before tripping
- Rate limit failures (429) apply cooldown;
Retry-Afterheader parsing deferred (hardcoded defaults) - Transient failures (408, 503) use exponential backoff (3 attempts: 1000ms, 2000ms)
- Probe request sent 30 seconds before cooldown expires
- Probe success restores provider early
- All failures logged with classification and cooldown duration
- Unit tests cover all failure classifications
- E2E test demonstrates rate limit → cooldown → probe recovery (not yet implemented)
Open Questions
Q1: Should circuit state persist across runtime restarts?
Context: Currently, circuit state is in-memory. If the runtime crashes or is restarted, all providers reset to CLOSED.
Options:
- A: In-memory only (current)
- B: Persist to
.agenthood/cache/circuit-state.json - C: Use SQLite for circuit state storage
Deferred because: This is an optimization for long-running production deployments. The MVP can ship without persistence—failures will be re-learned after restart.
Decision by: End of Phase 2 implementation
Q2: Should we emit provider failover events for observability?
Context: When a provider fails and failover occurs, the user has no visibility unless they enable debug logging.
Options:
- A: Emit events to EventEmitter for runtime subscribers
- B: Write structured logs to
.agenthood/logs/failover.json - C: Do nothing—debug logs are sufficient
Deferred because: This depends on the broader observability strategy (metrics, telemetry, UI). Can be added later without changing failover logic.
Decision by: When observability/telemetry system is designed
Q3: How should member-specific provider preferences interact with failover?
Context: The Doorman wants fast/cheap providers (Groq, Ollama). The Architect wants deep reasoning (Claude, OpenAI). Should the failover chain be member-aware?
Options:
- A: Global failover chain (all members use same chain)
- B: Per-member provider preferences override global chain
- C: Per-member primary provider, global chain for fallback
Deferred because: This is explicitly out of scope for this spec. Member preferences are a separate feature that can layer on top of failover later.
Decision by: When member preferences spec is written (issue TBD)
Implementation Notes
Directory Structure
src/llm/
├── ProviderFailover.ts # ProviderChain + classifyError + circuit breaker
├── ILLMProvider.ts # Unified provider interface
├── LLMRouter.ts # Router: builds chains from config
├── types.ts # ProviderEntry, LLMConfig, LLMRequest/Response
├── errors.ts # Error classes: AuthError, RateLimitedError, etc.
├── AnthropicProvider.ts
├── GroqProvider.ts
├── OpenAIProvider.ts
└── OllamaProvider.ts
tests/unit/llm/
└── ProviderFailover.test.ts # 36 tests covering all failover scenarios
Configuration Schema
Add to .agenthood/config.json:
{ "llm": { "providers": [ { "name": "anthropic", "model": "claude-sonnet-4-20250514", "apiKey": "...", "models": ["claude-sonnet-4-20250514", "claude-haiku-3-20250301"] }, { "name": "groq", "model": "llama-3.1-70b-versatile", "apiKey": "..." } ], "failover": { "failureThreshold": 1, "cooldownMs": 30000, "probeEnabled": true } } }
The models array on a provider entry defines the model downgrade chain for
Strategy 4. The first entry is the primary model; subsequent entries are
fallbacks tried before failing over to the next provider.
Error Handling
class AllProvidersFailedError extends Error { readonly category: string constructor(errors: string[], category: string = 'unknown') { super(`All providers failed: ${errors.join('; ')}`) this.name = 'AllProvidersFailedError' this.category = category } }
Thread Continuity
Assumption: Threads are stateless—each executeMember() call is independent. No checkpoint/restore needed for MVP.
If thread continuity is required later:
- Persist conversation history to
.agenthood/cache/threads/{threadId}.json - Reload history on provider switch
- This is a separate feature—not blocking this spec
Success Metrics
How we know this is working in production:
- Failover success rate: % of executions that succeed after failover (target: >95%)
- Mean time to recovery: Average time from failure to successful provider switch (target: <5s)
- Provider health score: % uptime per provider over 24h window
- Circuit breaker effectiveness: % of avoided requests to known-bad providers
- Probe recovery rate: % of providers restored via probe vs. timeout
These metrics can be logged to .agenthood/logs/failover-metrics.json or emitted via telemetry (out of scope for this spec, but design should allow future instrumentation).
References
- Architecture Doc: architecture/provider-failover.md
- Issue: #161 — Implement ProviderFailover for resilience
- Pattern: Circuit Breaker Pattern (Martin Fowler)
- Related ADRs:
- ADR-008: TypeScript Runtime over Python
- ADR-009: Groq as Default LLM Provider
- ADR-011: Rate Limiter and Shared State Store
- ADR-012: Error Handling and Resilience Strategy
- ADR-013: Distribution Channel Priority