Provider Failover

The Society does not depend on any single intelligence. It has contingencies.


Overview

The Agenthood is LLM-agnostic. Any member can run on any supported AI provider.
When a provider fails — rate limit, outage, auth error — the system automatically
switches to the next available provider without interrupting the task.

The human sees the work continue. The Society handles the plumbing.


Supported Providers

Six providers are implemented in src/llm/providers/. The table below lists the
default model for four of them; the status table under "Supported providers"
lists all six.

| Provider  | Default model              | Notes                                                 |
|-----------|----------------------------|-------------------------------------------------------|
| Anthropic | Claude Sonnet 4.6          | Primary for most members; precise, detailed reasoning |
| Groq      | llama-3.1-70b-versatile    | Default when no provider is configured; free tier     |
| OpenAI    | GPT-4o                     | Broad general capability; fallback for Anthropic      |
| Ollama    | Local model (configurable) | Air-gapped / offline use; default for the Doorman     |

All providers use a unified ILLMProvider interface
(src/llm/ILLMProvider.ts) with normalized request and
response types. Member skills are written once and run on any provider via LLMRouter
(src/llm/LLMRouter.ts).
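
As a rough illustration of why a unified interface lets a skill run on any provider, here is a minimal sketch. The real definitions in src/llm/ILLMProvider.ts are not shown on this page, so the `LLMRequest` and `LLMResponse` shapes, the `name` field, the `EchoProvider` class, and the `summarize` skill are all hypothetical. Only `complete()` is sketched.

```typescript
// Hypothetical normalized types; the real ones live in src/llm/ILLMProvider.ts.
interface LLMRequest {
  prompt: string;
  model?: string;
}

interface LLMResponse {
  text: string;
  provider: string;
}

// Assumed shape of the unified interface (only complete() is sketched here).
interface ILLMProvider {
  readonly name: string;
  complete(request: LLMRequest): Promise<LLMResponse>;
}

// A stand-in provider used only for illustration; it makes no network calls.
class EchoProvider implements ILLMProvider {
  constructor(public readonly name: string) {}

  async complete(request: LLMRequest): Promise<LLMResponse> {
    return { text: `echo: ${request.prompt}`, provider: this.name };
  }
}

// A skill depends only on the interface, never on a concrete provider class.
async function summarize(provider: ILLMProvider, text: string): Promise<string> {
  const response = await provider.complete({ prompt: `Summarize: ${text}` });
  return response.text;
}
```

Because `summarize` accepts any `ILLMProvider`, swapping the provider requires no change to the skill itself.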

Supported providers

| Provider     | Status     | Details                                                                                 |
|--------------|------------|-----------------------------------------------------------------------------------------|
| OpenAI       | ✅ Shipped | src/llm/providers/OpenAIProvider.ts                                                     |
| Anthropic    | ✅ Shipped | src/llm/providers/AnthropicProvider.ts                                                  |
| Groq         | ✅ Shipped | src/llm/providers/GroqProvider.ts                                                       |
| Ollama       | ✅ Shipped | src/llm/providers/OllamaProvider.ts (local, no API key)                                 |
| OpenCode Zen | ✅ Shipped | src/llm/providers/OpenCodeProvider.ts (pay-as-you-go at api.opencode.ai/zen/v1)         |
| OpenCode Go  | ✅ Shipped | src/llm/providers/OpenCodeGoProvider.ts (subscription at api.opencode.ai/zen/go/v1)     |

Additional providers (DeepSeek, Qwen) may be added in future releases.
When added, they will be slotted into the failover chain behind the
six supported providers.


Failover Chain

The failover chain is user-configured or auto-detected from available API keys:

Primary → Fallback 1 → Fallback 2 → ... → Error (all exhausted)

Example chain:

OpenCode (Zen/Go) → Claude Sonnet 4.6 → GPT-4o → Groq → Ollama
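
Auto-detection from API keys could look roughly like the sketch below. The environment variable names, the provider identifiers, and the choice to always append Ollama (which needs no API key) are assumptions made for illustration; the page does not state how the runtime performs detection.

```typescript
// Preference order follows the example chain; env var names are assumptions.
const KEYED_PROVIDERS: Array<{ name: string; envVar: string }> = [
  { name: "opencode", envVar: "OPENCODE_API_KEY" },
  { name: "anthropic", envVar: "ANTHROPIC_API_KEY" },
  { name: "openai", envVar: "OPENAI_API_KEY" },
  { name: "groq", envVar: "GROQ_API_KEY" },
];

// Build a chain from whichever keys are present and non-empty.
// Assumption: Ollama is appended as the last resort because it is local
// and needs no API key.
function detectChain(env: Record<string, string | undefined>): string[] {
  const chain = KEYED_PROVIDERS
    .filter(({ envVar }) => (env[envVar] ?? "").trim() !== "")
    .map(({ name }) => name);
  chain.push("ollama");
  return chain;
}
```

The environment is passed in as a plain object so the function can be exercised without touching real credentials.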

Thread continuity is preserved across failovers via checkpoint-based thread_id.
The member picks up exactly where it left off on the new provider.
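
The chain traversal described above can be sketched as a simple loop. This is an illustration under stated assumptions, not the LLMRouter implementation: `ChainProvider`, `FailoverResult`, and `completeWithFailover` are hypothetical names, and the checkpoint store itself is not modeled. The point is only that the same thread_id travels with every attempt.

```typescript
// Hypothetical minimal provider shape for this sketch.
interface ChainProvider {
  name: string;
  complete(prompt: string, threadId: string): Promise<string>;
}

interface FailoverResult {
  text: string;
  provider: string;
  attempted: string[];
}

// Try each provider in order. The same threadId is passed to every attempt
// so the next provider can resume from the stored checkpoint.
async function completeWithFailover(
  chain: ChainProvider[],
  prompt: string,
  threadId: string,
): Promise<FailoverResult> {
  const attempted: string[] = [];
  for (const provider of chain) {
    attempted.push(provider.name);
    try {
      const text = await provider.complete(prompt, threadId);
      return { text, provider: provider.name, attempted };
    } catch {
      // Fall through to the next provider in the chain.
    }
  }
  throw new Error(`All providers exhausted: ${attempted.join(" → ")}`);
}
```

A fuller version would consult the failure classification before moving on, rather than treating every error the same way.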


Failure Classification

Not all failures are equal. The system classifies them and applies the right response:

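| HTTP Status | Classification      | Cooldown  | Action                       |
|-------------|---------------------|-----------|------------------------------|
| 401         | Auth failure        | Permanent | Skip provider, alert user    |
| 402         | Payment required    | Permanent | Skip provider, alert user    |
| 429         | Rate limited        | 60–300s   | Cool down, try next provider |
| 408         | Timeout             | 30s       | Retry once, then failover    |
| 503         | Service unavailable | 60s       | Failover immediately         |
| 404         | Model not found     | Permanent | Skip provider                |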
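
The table translates naturally into a lookup. The sketch below encodes exactly the six rows above; the type names, the action labels, and the decision to return `undefined` for unlisted statuses are assumptions, since the page does not say how other status codes are handled.

```typescript
type FailureAction =
  | "skip_and_alert"
  | "skip"
  | "cooldown_then_next"
  | "retry_once_then_failover"
  | "failover_immediately";

interface CooldownRange {
  minMs: number;
  maxMs: number;
}

interface FailureClass {
  classification: string;
  // "permanent" means the provider is not retried automatically.
  cooldown: CooldownRange | "permanent";
  action: FailureAction;
}

// Helper for rows with a single fixed cooldown value.
const fixed = (ms: number): CooldownRange => ({ minMs: ms, maxMs: ms });

const FAILURE_TABLE: Record<number, FailureClass> = {
  401: { classification: "auth_failure", cooldown: "permanent", action: "skip_and_alert" },
  402: { classification: "payment_required", cooldown: "permanent", action: "skip_and_alert" },
  429: { classification: "rate_limited", cooldown: { minMs: 60_000, maxMs: 300_000 }, action: "cooldown_then_next" },
  408: { classification: "timeout", cooldown: fixed(30_000), action: "retry_once_then_failover" },
  503: { classification: "service_unavailable", cooldown: fixed(60_000), action: "failover_immediately" },
  404: { classification: "model_not_found", cooldown: "permanent", action: "skip" },
};

// Returns undefined for statuses the table does not cover (an assumption).
function classifyFailure(status: number): FailureClass | undefined {
  return FAILURE_TABLE[status];
}
```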

Probe Recovery

A provider on cooldown is not written off permanently.

  • 30 seconds before cooldown expiry, the system sends a lightweight probe request
  • If the probe succeeds → provider returns to the active pool
  • If the probe fails → cooldown is extended

This prevents the system from hammering a recovering provider with full requests.
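
The probe timing can be sketched with two small functions. The names here are hypothetical, and two behaviors are assumptions the page does not specify: clamping the probe time for cooldowns shorter than 30 seconds, and restarting the cooldown with a caller-supplied extension when a probe fails.

```typescript
// "30 seconds before cooldown expiry", per the description above.
const PROBE_LEAD_MS = 30_000;

interface Cooldown {
  startedAt: number;   // epoch ms when the cooldown began
  durationMs: number;  // how long the provider stays out of the pool
}

// When the probe should fire: 30s before expiry, but never before the
// cooldown began (assumed clamp for short cooldowns such as the 30s timeout).
function probeTime(cooldown: Cooldown): number {
  const expiry = cooldown.startedAt + cooldown.durationMs;
  return Math.max(cooldown.startedAt, expiry - PROBE_LEAD_MS);
}

// Apply a probe result. null means the provider is back in the active pool;
// otherwise a fresh cooldown of extensionMs starts at `now` (assumed policy).
function applyProbeResult(
  ok: boolean,
  now: number,
  extensionMs: number,
): Cooldown | null {
  return ok ? null : { startedAt: now, durationMs: extensionMs };
}
```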


Circuit Breaker

The Agenthood implements a three-state circuit breaker per provider:

CLOSED      → Normal operation, requests flow through
    ↓ (threshold of failures exceeded)
OPEN        → Provider bypassed, failover active
    ↓ (cooldown period expires)
HALF_OPEN   → One probe request allowed
    ├─ (probe succeeds) → CLOSED: provider restored
    └─ (probe fails)    → OPEN: back to bypass

The circuit breaker is configurable per chain:

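| Parameter        | Default        | Description                                                                                                                         |
|------------------|----------------|-------------------------------------------------------------------------------------------------------------------------------------|
| failureThreshold | 1              | Consecutive failures before circuit opens. Permanent errors (auth, payment, model_not_found) always open immediately regardless.    |
| cooldownMs       | Error-specific | Override the cooldown duration in ms (e.g., 5000 to wait 5s before probe).                                                          |
| probeEnabled     | true           | When false, disables preemptive probe recovery. Providers still recover naturally when cooldown expires.                            |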
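
A minimal sketch of the three-state breaker follows. It is not the Agenthood implementation: the class and method names are hypothetical, `cooldownMs` is a single fixed value rather than error-specific, `probeEnabled` is not modeled, and time is passed in explicitly so the state machine can be exercised deterministically.

```typescript
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

interface BreakerConfig {
  failureThreshold: number; // consecutive failures before the circuit opens
  cooldownMs: number;       // fixed here; error-specific in the real system
}

class CircuitBreaker {
  private state: CircuitState = "CLOSED";
  private failures = 0;
  private openedAt = 0;

  constructor(private readonly config: BreakerConfig) {}

  // Returns the state as of `now`, moving OPEN to HALF_OPEN once the
  // cooldown has elapsed. Call this before each request.
  getState(now: number): CircuitState {
    if (this.state === "OPEN" && now - this.openedAt >= this.config.cooldownMs) {
      this.state = "HALF_OPEN";
    }
    return this.state;
  }

  // Any success (including a successful probe) restores the provider.
  recordSuccess(): void {
    this.failures = 0;
    this.state = "CLOSED";
  }

  // Permanent errors open the circuit immediately and never expire;
  // a failed probe in HALF_OPEN reopens it with a fresh cooldown.
  recordFailure(now: number, permanent = false): void {
    this.failures += 1;
    const probeFailed = this.state === "HALF_OPEN";
    if (permanent || probeFailed || this.failures >= this.config.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = permanent ? Number.POSITIVE_INFINITY : now;
    }
  }
}
```

With the default `failureThreshold` of 1, the first failure opens the circuit; a higher threshold tolerates isolated errors before bypassing the provider.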

Five recovery strategies are available for sustained failures:

  1. Immediate retry — for transient network blips
  2. Exponential backoff — up to 3 attempts with increasing delay (1000ms, 2000ms)
  3. Provider rotation — move to next in chain
  4. Model downgrade — switch to cheaper/faster model on same provider; applies to complete(), stream(), and embed()
  5. Human escalation — all providers exhausted, alert the human
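
Strategy 2 can be sketched as follows. The function names are hypothetical, and the `sleep` function is injected by the caller (an assumption made so the sketch stays free of timers and can be tested without waiting). The delays match the figures above: 1000ms before the second attempt and 2000ms before the third.

```typescript
// Delay before retry number `retryIndex` (0-based): 1000ms, 2000ms, ...
function backoffDelayMs(retryIndex: number, baseMs = 1000): number {
  return baseMs * 2 ** retryIndex;
}

// Run `operation` up to maxAttempts times, sleeping between attempts.
// Rethrows the last error if every attempt fails.
async function withBackoff<T>(
  operation: () => Promise<T>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // No delay after the final attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

In a real runtime, `sleep` would typically wrap a timer; once `withBackoff` gives up, the caller would move on to provider rotation.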

Credential Security (Planned)

In the planned design, API keys never reach the agent directly: an HTTP proxy on
localhost will inject credentials from the OS keychain into outbound LLM requests.
This proxy is not yet implemented.

Currently, credentials are read from environment variables and injected at the
provider constructor level.


Member-Level Provider Preferences

Different members prefer different providers based on their task type. These
preferences are encoded in MemberSpec.preferredProvider in
src/members/MemberRegistry.ts and respected
by LLMRouter when building the failover chain.

| Member        | Preferred Provider | Reason                                                    |
|---------------|--------------------|-----------------------------------------------------------|
| The Scribe    | Anthropic          | Strong at natural language writing                        |
| The Architect | Anthropic          | Long context, reasoning                                   |
| The Reviewer  | Anthropic          | Precise, detailed analysis                                |
| The Auditor   | Anthropic          | Security reasoning, caution                               |
| The Tester    | Anthropic          | Structured output, deterministic                          |
| The Debugger  | Anthropic          | Broad training on error patterns                          |
| The Herald    | Anthropic          | Templated output, low complexity                          |
| The Librarian | Anthropic          | Documentation quality                                     |
| The Doorman   | Ollama → Groq      | Fast, simple validation; no need for a top-tier model     |
| The Oracle    | Anthropic          | Long-context retrieval                                    |
| The Envoy     | Anthropic          | Cross-runtime translation                                 |
| The Sentinel  | Anthropic          | Standards enforcement                                     |
| The Warden    | Anthropic          | Code health analysis                                      |
| The Steward   | Groq               | Lightweight routing, low cost                             |
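
One plausible way for a router to honor these preferences is to move the member's preferred providers to the front of the default chain. The sketch below is an assumption about the mechanism, not the LLMRouter code: the `MemberSpec` slice shown here is hypothetical beyond the `preferredProvider` field named above, and modeling that field as an ordered list (to fit entries such as Ollama → Groq) is a guess.

```typescript
// Hypothetical slice of MemberSpec; only preferredProvider is named in the text.
interface MemberSpec {
  name: string;
  preferredProvider: string[]; // ordered, most preferred first (assumed shape)
}

// Put the member's preferred providers first, then the rest of the default
// chain in its original order. Preferences not in the chain are ignored.
function buildChain(member: MemberSpec, defaultChain: string[]): string[] {
  const preferred = member.preferredProvider.filter((p) => defaultChain.includes(p));
  const rest = defaultChain.filter((p) => !preferred.includes(p));
  return [...preferred, ...rest];
}
```

For the Doorman, a default chain of anthropic, openai, groq, ollama would become ollama, groq, anthropic, openai under this scheme.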