Provider Failover

The Society does not depend on any single intelligence. It has contingencies.

Overview

The Agenthood is LLM-agnostic. Any member can run on any supported AI provider.
When a provider fails — rate limit, outage, auth error — the system automatically
switches to the next available provider without interrupting the task.

The human sees the work continue. The Society handles the plumbing.

Supported Providers

Six providers are implemented in src/llm/providers/, and any member can run on
any of them. The table below lists the default model for four of them.

Provider	Default model	Notes
Anthropic	Claude Sonnet 4.6	Primary for most members; precise, detailed reasoning
Groq	llama-3.1-70b-versatile	Default when no provider is configured; free tier
OpenAI	GPT-4o	Broad general capability; fallback for Anthropic
Ollama	Local model (configurable)	Air-gapped / offline use; default for the Doorman

All providers use a unified ILLMProvider interface
(src/llm/ILLMProvider.ts) with normalized request and
response types. Member skills are written once and run on any provider via LLMRouter
(src/llm/LLMRouter.ts).
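To illustrate what "written once, run on any provider" could look like, here is a minimal sketch. The real ILLMProvider interface is not shown in this document, so every type and name below (LLMRequest, LLMResponse, ProviderSketch, EchoProvider, summarize) is a hypothetical stand-in. Only the method name complete() comes from this page.

```typescript
// Hypothetical normalized request/response shapes. The real types live in
// src/llm/ILLMProvider.ts and may differ.
interface LLMRequest {
  prompt: string;
  maxTokens?: number;
}

interface LLMResponse {
  text: string;
  provider: string; // which provider actually answered
}

// Illustrative stand-in for the unified provider interface.
interface ProviderSketch {
  readonly name: string;
  complete(request: LLMRequest): Promise<LLMResponse>;
}

// A fake provider used only for demonstration. It makes no network calls.
class EchoProvider implements ProviderSketch {
  readonly name: string;

  constructor(name: string) {
    this.name = name;
  }

  async complete(request: LLMRequest): Promise<LLMResponse> {
    return { text: `echo: ${request.prompt}`, provider: this.name };
  }
}

// A "skill" written against the interface, unaware of the concrete provider.
async function summarize(provider: ProviderSketch, input: string): Promise<string> {
  const response = await provider.complete({ prompt: `Summarize: ${input}` });
  return response.text;
}
```

Because summarize() depends only on the interface, swapping EchoProvider for any other implementation requires no change to the skill itself.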

Provider Status

Provider	Status	Details
OpenAI	✅ Shipped	`src/llm/providers/OpenAIProvider.ts`
Anthropic	✅ Shipped	`src/llm/providers/AnthropicProvider.ts`
Groq	✅ Shipped	`src/llm/providers/GroqProvider.ts`
Ollama	✅ Shipped	`src/llm/providers/OllamaProvider.ts` (local, no API key)
OpenCode Zen	✅ Shipped	`src/llm/providers/OpenCodeProvider.ts` — pay-as-you-go at `api.opencode.ai/zen/v1`
OpenCode Go	✅ Shipped	`src/llm/providers/OpenCodeGoProvider.ts` — subscription at `api.opencode.ai/zen/go/v1`

Additional providers (DeepSeek, Qwen) may be added in future releases.
When added, they will be slotted into the failover chain behind the
six supported providers.

Failover Chain

The failover chain is user-configured or auto-detected from available API keys:

Primary → Fallback 1 → Fallback 2 → ... → Error (all exhausted)

Example chain:

OpenCode (Zen/Go) → Anthropic (Claude Sonnet 4.6) → OpenAI (GPT-4o) → Groq → Ollama

Thread continuity is preserved across failovers via checkpoint-based thread_id.
The member picks up exactly where it left off on the new provider.
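The chain walk described above can be sketched as a simple loop: try each provider in order and stop at the first success. This is an illustration under assumptions, not the LLMRouter implementation. ChainProvider, AllProvidersExhaustedError, and completeWithFailover are hypothetical names, and thread_id checkpointing is left out.

```typescript
// Hypothetical minimal provider shape for this sketch.
interface ChainProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

// Raised when every provider in the chain has failed.
class AllProvidersExhaustedError extends Error {
  readonly failures: string[];

  constructor(failures: string[]) {
    super(`All providers exhausted: ${failures.join("; ")}`);
    this.name = "AllProvidersExhaustedError";
    this.failures = failures;
  }
}

// Try each provider in chain order and return the first successful result.
async function completeWithFailover(
  chain: ChainProvider[],
  prompt: string,
): Promise<{ provider: string; text: string }> {
  const failures: string[] = [];
  for (const provider of chain) {
    try {
      const text = await provider.complete(prompt);
      return { provider: provider.name, text };
    } catch (err) {
      // Record why this provider failed, then move on to the next one.
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(`${provider.name}: ${reason}`);
    }
  }
  throw new AllProvidersExhaustedError(failures);
}
```

A fuller version would consult the failure classification before moving on, for example retrying a timeout once instead of rotating immediately.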

Failure Classification

Not all failures are equal. The system classifies them and applies the right response:

HTTP Status	Classification	Cooldown	Action
`401`	Auth failure	Permanent	Skip provider, alert user
`402`	Payment required	Permanent	Skip provider, alert user
`429`	Rate limited	60–300s	Cool down, try next provider
`408`	Timeout	30s	Retry once, then failover
`503`	Service unavailable	60s	Failover immediately
`404`	Model not found	Permanent	Skip provider
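The table above maps naturally onto a lookup. The sketch below is one way to encode it. The names FailureClass, FailurePolicy, FAILURE_POLICIES, and classifyFailure are hypothetical, and the 429 entry uses the lower bound of the 60–300s range, which is an assumption.

```typescript
type FailureClass =
  | "auth_failure"
  | "payment_required"
  | "rate_limited"
  | "timeout"
  | "service_unavailable"
  | "model_not_found";

interface FailurePolicy {
  classification: FailureClass;
  // Cooldown in milliseconds, or "permanent" when the provider is skipped for good.
  cooldownMs: number | "permanent";
  action: string;
}

// Mirrors the table row by row. 429 uses 60s, the low end of the documented
// 60-300s range (an assumption made for this sketch).
const FAILURE_POLICIES: Record<number, FailurePolicy> = {
  401: { classification: "auth_failure", cooldownMs: "permanent", action: "skip provider, alert user" },
  402: { classification: "payment_required", cooldownMs: "permanent", action: "skip provider, alert user" },
  429: { classification: "rate_limited", cooldownMs: 60000, action: "cool down, try next provider" },
  408: { classification: "timeout", cooldownMs: 30000, action: "retry once, then failover" },
  503: { classification: "service_unavailable", cooldownMs: 60000, action: "failover immediately" },
  404: { classification: "model_not_found", cooldownMs: "permanent", action: "skip provider" },
};

// Returns the policy for a status code, or undefined for codes the table does not cover.
function classifyFailure(status: number): FailurePolicy | undefined {
  return FAILURE_POLICIES[status];
}
```

How status codes outside the table are handled is not specified here, so the sketch returns undefined and leaves that decision to the caller.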

Probe Recovery

A provider on cooldown is not written off permanently.

Thirty seconds before cooldown expiry, the system sends a lightweight probe request:

If the probe succeeds → provider returns to the active pool
If the probe fails → cooldown is extended

This prevents the system from hammering a recovering provider with full requests.
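The probe timing can be sketched with plain arithmetic. Two details below are assumptions, since this page does not specify them: the probe time is clamped so it never falls before the cooldown started, and a failed probe restarts the cooldown with the same duration. Cooldown, probeTime, and applyProbeResult are hypothetical names.

```typescript
const PROBE_LEAD_MS = 30000; // probe 30 seconds before cooldown expiry

interface Cooldown {
  startedAt: number; // epoch milliseconds
  durationMs: number;
}

// When to send the probe. Clamped to the cooldown start (assumption), so a
// cooldown of 30s or less is probed right away.
function probeTime(cooldown: Cooldown): number {
  const expiry = cooldown.startedAt + cooldown.durationMs;
  return Math.max(cooldown.startedAt, expiry - PROBE_LEAD_MS);
}

// Apply a probe result: null means the provider is back in the active pool,
// otherwise the returned cooldown replaces the old one.
function applyProbeResult(
  cooldown: Cooldown,
  probeSucceeded: boolean,
  now: number,
): Cooldown | null {
  if (probeSucceeded) {
    return null;
  }
  // Extension policy is an assumption: restart with the same duration.
  return { startedAt: now, durationMs: cooldown.durationMs };
}
```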

Circuit Breaker

The Agenthood implements a three-state circuit breaker per provider:

CLOSED      → Normal operation, requests flow through
    ↓ (threshold of failures exceeded)
OPEN        → Provider bypassed, failover active
    ↓ (cooldown period expires)
HALF_OPEN   → One probe request allowed
    ↓ (probe succeeds)   → CLOSED: provider restored
    ↓ (probe fails)      → OPEN: back to bypass

The circuit breaker is configurable per chain:

Parameter	Default	Description
`failureThreshold`	1	Consecutive failures before circuit opens. Permanent errors (auth, payment, model_not_found) always open immediately regardless.
`cooldownMs`	Error-specific	Override the cooldown duration in ms (e.g., `5000` to wait 5s before probe).
`probeEnabled`	`true`	When `false`, disables preemptive probe recovery. Providers still recover naturally when cooldown expires.
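The three states and their transitions can be sketched as a small state machine. This is an illustration, not the project's implementation: CircuitBreakerSketch and its method names are hypothetical, time is passed in explicitly to keep the example deterministic, and permanent errors and probeEnabled are left out for brevity.

```typescript
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

interface BreakerOptions {
  failureThreshold: number; // consecutive failures before the circuit opens
  cooldownMs: number; // how long to stay OPEN before allowing a probe
}

class CircuitBreakerSketch {
  private state: CircuitState = "CLOSED";
  private consecutiveFailures = 0;
  private openedAt = 0;
  private probeInFlight = false;
  private readonly options: BreakerOptions;

  constructor(options: BreakerOptions) {
    this.options = options;
  }

  // Current state, promoting OPEN to HALF_OPEN once the cooldown has elapsed.
  getState(now: number): CircuitState {
    if (this.state === "OPEN" && now - this.openedAt >= this.options.cooldownMs) {
      this.state = "HALF_OPEN";
    }
    return this.state;
  }

  // True if a request may be sent. In HALF_OPEN only one probe is allowed.
  tryAcquire(now: number): boolean {
    const state = this.getState(now);
    if (state === "CLOSED") return true;
    if (state === "HALF_OPEN" && !this.probeInFlight) {
      this.probeInFlight = true;
      return true;
    }
    return false;
  }

  // Any success (including a probe) closes the circuit.
  recordSuccess(): void {
    this.state = "CLOSED";
    this.consecutiveFailures = 0;
    this.probeInFlight = false;
  }

  // A failed probe reopens immediately; otherwise open once the threshold is hit.
  recordFailure(now: number): void {
    const probeFailed = this.state === "HALF_OPEN";
    this.consecutiveFailures += 1;
    this.probeInFlight = false;
    if (probeFailed || this.consecutiveFailures >= this.options.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = now;
    }
  }
}
```

With the documented default failureThreshold of 1, the first failure opens the circuit, so the CLOSED state only tolerates failures when the threshold is raised.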

Five recovery strategies are available for sustained failures:

Immediate retry — for transient network blips
Exponential backoff — up to 3 attempts with increasing delay (1000ms, 2000ms)
Provider rotation — move to next in chain
Model downgrade — switch to cheaper/faster model on same provider; applies to complete(), stream(), and embed()
Human escalation — all providers exhausted, alert the human
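As an illustration of the exponential backoff strategy, the sketch below makes up to 3 attempts and waits 1000ms, then 2000ms, between them. Doubling the delay is an assumption that matches the two documented values. backoffDelayMs and retryWithBackoff are hypothetical names, and the sleep function is injectable so the behavior can be checked without real waiting.

```typescript
// Delay before the retry that follows failed attempt `attemptIndex` (0-based):
// 1000ms after the first failure, 2000ms after the second.
function backoffDelayMs(attemptIndex: number, baseMs: number = 1000): number {
  return baseMs * Math.pow(2, attemptIndex);
}

// Run `operation`, retrying on failure with increasing delays.
// Rethrows the last error once all attempts are used up.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      // Wait only if another attempt remains.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

When the last attempt fails, the caller would typically move on to the next strategy in the list, such as provider rotation.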

Credential Security (Planned)

In the planned design, API keys never reach the agent directly: an HTTP proxy on
localhost will inject credentials from the OS keychain into outbound LLM requests.
This proxy is not yet implemented.

Currently, credentials are injected at the provider constructor level
from environment variables.
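The current constructor-level approach might look roughly like the sketch below. KeyedProviderSketch, providerFromEnv, and the variable name EXAMPLE_API_KEY are hypothetical, and the Bearer header is only an example scheme, since each provider defines its own authentication. The environment is passed in as a plain object to keep the example self-contained.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical provider that receives its key once, at construction time.
class KeyedProviderSketch {
  readonly name: string;
  private readonly apiKey: string;

  constructor(name: string, apiKey: string) {
    this.name = name;
    this.apiKey = apiKey;
  }

  // Used when building an outbound request. The header scheme is an example only.
  authHeader(): Record<string, string> {
    return { Authorization: `Bearer ${this.apiKey}` };
  }
}

// Build a provider only if its key is present in the environment.
function providerFromEnv(
  name: string,
  envVar: string,
  env: Env,
): KeyedProviderSketch | undefined {
  const key = env[envVar];
  return key ? new KeyedProviderSketch(name, key) : undefined;
}
```

A provider whose key is missing is simply not constructed, which is one way the chain could be auto-detected from available API keys.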

Member-Level Provider Preferences

Different members prefer different providers based on their task type. These
preferences are encoded in MemberSpec.preferredProvider in
src/members/MemberRegistry.ts and respected
by LLMRouter when building the failover chain.

Member	Preferred Provider	Reason
The Scribe	Anthropic	Strong at natural language writing
The Architect	Anthropic	Long context, reasoning
The Reviewer	Anthropic	Precise, detailed analysis
The Auditor	Anthropic	Security reasoning, caution
The Tester	Anthropic	Structured output, deterministic
The Debugger	Anthropic	Broad training on error patterns
The Herald	Anthropic	Templated output, low complexity
The Librarian	Anthropic	Documentation quality
The Doorman	Ollama → Groq	Fast, simple validation — no need for top-tier model
The Oracle	Anthropic	Long-context retrieval
The Envoy	Anthropic	Cross-runtime translation
The Sentinel	Anthropic	Standards enforcement
The Warden	Anthropic	Code health analysis
The Steward	Groq	Lightweight routing, low cost
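One way a router could respect these preferences is to put a member's preferred providers at the front of the default chain and drop duplicates. The sketch below assumes preferences can be an ordered list (as the Doorman's "Ollama → Groq" entry suggests). buildChain is a hypothetical helper, not the LLMRouter API.

```typescript
// Put the member's preferred providers first, then the remaining defaults,
// keeping the first occurrence of each name.
function buildChain(preferred: string[], defaults: string[]): string[] {
  const chain: string[] = [];
  for (const providerName of [...preferred, ...defaults]) {
    if (chain.indexOf(providerName) === -1) {
      chain.push(providerName);
    }
  }
  return chain;
}

// Example: the Doorman prefers Ollama, then Groq, before anything else.
const doormanChain = buildChain(
  ["Ollama", "Groq"],
  ["Anthropic", "OpenAI", "Groq", "Ollama"],
);
```

Here doormanChain starts with Ollama and Groq, followed by the defaults that were not already listed.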