Supervisor

The Supervisor is SPEED’s decision-maker for failure recovery. It analyzes the current state of the system, diagnoses failures, and determines the recovery strategy. Its most critical function is pattern detection: recognizing when multiple tasks fail for the same underlying reason and prescribing a single root-cause fix instead of retrying each task individually.

| Property | Value |
| --- | --- |
| Command | Automatic (invoked by orchestrator) |
| Model tier | planning_model (Opus) |
| Trigger | Task failure, blocked status, repeated failure patterns, coherence/contract check failure |

| Input | Source | Description |
| --- | --- | --- |
| Current SPEED state | Orchestrator | Task statuses, timings, costs |
| Failed/blocked task details | Agent output | Error logs, Debugger analysis |
| Overall progress metrics | Orchestrator | Completion rates, failure counts |
| Failure history | .speed/ state | Previous failures and their resolutions |

| Output | Location | Description |
| --- | --- | --- |
| Recovery decision | Stdout (JSON) | Diagnosis, pattern analysis, recovery actions, halt recommendation |

The Supervisor applies different decision frameworks depending on the failure type:

| Failure Type | Decision | Action |
| --- | --- | --- |
| Transient (network error, timeout, rate limit) | Retry with same configuration | retry |
| Capability (task too complex for model) | Escalate model tier | retry_escalated (sonnet to opus) |
| Specification (task description unclear or impossible) | Re-plan the task | replan |
| Dependency (upstream task output was wrong) | Fix upstream first | retry_dependency |
| Systemic (multiple tasks failing similarly) | Escalate to human | escalate |
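The Supervisor is a model-driven agent, so this decision is made in its reasoning rather than by a lookup in code. Purely as a reading aid, the table above can be sketched as a mapping from failure type to action. The action strings come from the table; the `FailureType` enum and the `recovery_action` helper are hypothetical names introduced for illustration.

```python
from enum import Enum


class FailureType(Enum):
    """Failure categories from the table above (enum name is illustrative)."""

    TRANSIENT = "transient"          # network error, timeout, rate limit
    CAPABILITY = "capability"        # task too complex for the model
    SPECIFICATION = "specification"  # task description unclear or impossible
    DEPENDENCY = "dependency"        # upstream task output was wrong
    SYSTEMIC = "systemic"            # multiple tasks failing similarly


# Action names are copied from the table; the dict itself is an assumption
# about how one might encode it, not part of SPEED.
ACTION_FOR_FAILURE = {
    FailureType.TRANSIENT: "retry",
    FailureType.CAPABILITY: "retry_escalated",
    FailureType.SPECIFICATION: "replan",
    FailureType.DEPENDENCY: "retry_dependency",
    FailureType.SYSTEMIC: "escalate",
}


def recovery_action(failure_type: FailureType) -> str:
    """Return the action the table prescribes for a failure type."""
    return ACTION_FOR_FAILURE[failure_type]
```

A lookup like this only captures the default row for each category; the Supervisor can still deviate when the failure history suggests a shared root cause.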

A blocked status means the Developer reported a problem honestly instead of fabricating a solution. The Supervisor treats it as high-priority input, not as an error.

| Block Type | Action |
| --- | --- |
| Ambiguous spec | escalate to human with the specific question (highest priority) |
| Missing dependency | reorder task priority, or replan if the dependency is absent from the plan |
| Contradictory requirements | replan with contradiction highlighted, or escalate if the product spec itself is contradictory |
| Impossible as specified | Review the claim; if valid, replan; if uncertain, escalate |

When the same failure repeats across multiple tasks:

  • 2+ tasks fail with import errors for the same module: missing foundational task
  • 2+ tasks fail with schema mismatches: migration is wrong or missing
  • 2+ tasks fail with “file not found”: scaffolding task was skipped or failed

Pattern failures get diagnosed as a single root cause with a single fix (fix_root_cause), not retried individually.
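To make the "2+ tasks, same cause" rule concrete, here is a minimal sketch that groups failure records by classification and root cause. It assumes records shaped like the failure history example later on this page (`task_id`, `classification`, `root_cause`). The function name `detect_patterns` and the exact grouping key are assumptions; the Supervisor itself does this comparison in its reasoning, not with this code.

```python
from collections import defaultdict


def detect_patterns(failures, min_tasks=2):
    """Group failure records that share a classification and root cause.

    Returns one entry per group spanning at least `min_tasks` distinct tasks.
    The grouping key is an assumption made for this sketch.
    """
    groups = defaultdict(set)
    for record in failures:
        key = (record["classification"], record["root_cause"])
        groups[key].add(record["task_id"])

    return [
        {
            "classification": classification,
            "root_cause": root_cause,
            "affected_tasks": sorted(task_ids),
        }
        for (classification, root_cause), task_ids in groups.items()
        if len(task_ids) >= min_tasks
    ]
```

Each returned group corresponds to one candidate for fix_root_cause; failures that appear in only one task are left to the per-task decision framework.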

Coherence check failures are handled by failure kind:

  • Interface mismatch: retry the task that needs to change.
  • Schema inconsistency: identify the authoritative definition, then retry the deviating tasks.
  • Missing connections: retry the task that should have made the connection.
  • Plan-level coherence failure: replan the affected tasks.

Post-integration contract check failures (missing table, missing FK, core query not traversable) are traced to the responsible task for retry, or trigger replan if no task covers the gap.

The Supervisor is invoked by the orchestrator through _invoke_supervisor (lib/shared.sh:153+). It runs in two modes: async (background, for non-blocking diagnosis during speed run) and sync (inline, for decisions that must complete before the pipeline continues).

_invoke_supervisor(task_id, trigger_type, mode)
├─ 1. Gather SPEED state
│   ├─ Task counts by status (total, done, running, pending, failed, blocked)
│   ├─ Triggering task details (status, error, title)
│   └─ Debugger analysis (if available)
├─ 2. Load failure history
│   └─ .speed/features/<feature>/failure_history.jsonl
├─ 3. Build supervisor message
│   ├─ Trigger type (blocked, pattern, coherence, contract)
│   ├─ SPEED state summary
│   ├─ Task details + debugger log
│   └─ Full failure history
├─ 4. Send to Supervisor agent
│   ├─ async → _spawn_support_agent (background)
│   └─ sync → claude_run (inline, returns output)
└─ 5. Orchestrator applies recovery actions

Lines 161-168 query task counts by status using task_count_by_status. Lines 178-187 load the triggering task’s details and, if present, the Debugger’s JSON analysis from ${LOGS_DIR}/debugger-${task_id}.json.

Lines 172-175 load the cumulative failure history from failure_history.jsonl. Each line is a JSON record of a previous failure with its resolution. Pattern detection depends on this history: the Supervisor compares current failures against past ones to identify systemic issues.
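The real loader lives in lib/shared.sh; the following Python sketch only illustrates the JSON Lines format, where each non-empty line is one standalone JSON object. The function name `load_failure_history` and the choice to return an empty list for a missing file are assumptions, not documented SPEED behavior.

```python
import json
from pathlib import Path


def load_failure_history(path):
    """Read a JSON Lines file into a list of dicts.

    Assumption for this sketch: a missing file means "no failures yet"
    and yields an empty list rather than an error.
    """
    history_file = Path(path)
    if not history_file.exists():
        return []

    records = []
    with history_file.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

Because each record is parsed independently, new failures can be appended to the file without rewriting earlier lines.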

Lines 189-200 build a prompt with four sections: trigger type label, SPEED state summary, triggering task details (with debugger analysis if available), and the full failure history.
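As a rough illustration of the four-section layout, the sketch below assembles a message from in-memory data. The section labels, field names (`id`, `title`, `status`, `error`), and the function name `build_supervisor_message` are all hypothetical; the actual prompt text is built in lib/shared.sh and may be worded differently.

```python
import json

# Status names taken from the task counts listed in the invocation flow.
STATUSES = ("total", "done", "running", "pending", "failed", "blocked")


def build_supervisor_message(trigger_type, counts, task, history, debugger=None):
    """Assemble trigger label, state summary, task details, and failure history.

    All labels and dict keys here are assumptions made for illustration.
    """
    sections = [f"Trigger: {trigger_type}"]

    state = [f"{name}: {counts.get(name, 0)}" for name in STATUSES]
    sections.append("SPEED State:\n" + "\n".join(state))

    details = [
        f"{key}: {task[key]}"
        for key in ("id", "title", "status", "error")
        if key in task
    ]
    if debugger is not None:
        # Debugger analysis is optional, as in step 1 of the flow.
        details.append("debugger: " + json.dumps(debugger, sort_keys=True))
    sections.append("Task details:\n" + "\n".join(details))

    history_lines = [json.dumps(record, sort_keys=True) for record in history]
    sections.append("Failure History:\n" + "\n".join(history_lines))

    # Sections are separated by a blank line.
    return "\n\n".join(sections)
```

Passing the full history rather than a summary keeps the pattern comparison in the Supervisor's hands, at the cost of a longer prompt.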

Two execution modes (lines 202-211):

| Mode | Function | Use case |
| --- | --- | --- |
| async | _spawn_support_agent | Non-blocking during parallel task execution |
| sync | claude_run | Contract failures, coherence failures (pipeline must wait for decision) |

Both use Sonnet and read-only tools. The Supervisor never modifies files directly.

The orchestrator (_handle_support_completion) parses the Supervisor’s JSON output and applies each action:

| Action | What the orchestrator does |
| --- | --- |
| retry | Reset task to pending, re-enqueue with same config |
| retry_escalated | Reset task, upgrade model (sonnet → opus) |
| replan | Mark affected tasks for re-planning by the Architect |
| retry_dependency | Reorder dependency, retry the blocked task after upstream completes |
| fix_root_cause | Apply a single fix, then retry all affected tasks |
| escalate | Log the human_question, halt the task, notify the user |
| skip | Mark task as skipped, unblock dependents if safe |
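The per-action handling can be pictured as a dispatch over the `actions` array. The sketch below is not `_handle_support_completion` itself: it works on an in-memory task table, and the status strings `needs_replan`, `waiting_on_upstream`, and `halted` are hypothetical placeholders. It deliberately leaves out fix_root_cause, which needs orchestration state beyond a single task, and raises on anything it does not recognize so that gaps stay visible.

```python
def apply_actions(decision, tasks):
    """Apply each Supervisor action to an in-memory task table.

    `tasks` maps task_id -> dict with "status" and "model" keys (assumed shape).
    Returns the human questions raised by `escalate` actions.
    """
    questions = []
    for item in decision.get("actions", []):
        task = tasks[item["task_id"]]
        action = item["action"]
        if action == "retry":
            task["status"] = "pending"  # same config, just re-enqueued
        elif action == "retry_escalated":
            task["status"] = "pending"
            task["model"] = item.get("new_model", "opus")
        elif action == "replan":
            task["status"] = "needs_replan"  # hypothetical status name
        elif action == "retry_dependency":
            task["status"] = "waiting_on_upstream"  # hypothetical status name
        elif action == "escalate":
            task["status"] = "halted"  # hypothetical status name
            questions.append(item.get("human_question", ""))
        elif action == "skip":
            task["status"] = "skipped"
        else:
            raise ValueError(f"unknown action: {action}")
    return questions
```

Collecting the escalation questions separately mirrors the table: the task is halted, and the question is what gets surfaced to the user.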

The trigger_type passed to _invoke_supervisor is one of four values:

| Trigger | When | Typical pattern |
| --- | --- | --- |
| blocked | Developer reports status: "blocked" | Ambiguous spec, missing dependency |
| pattern | 2+ tasks fail with similar errors | Missing foundational task, wrong migration |
| coherence | Coherence check finds interface mismatches | Signature mismatch across branches |
| contract | Post-integration contract check fails | Missing table, missing FK |

During speed run, Tasks 3 and 5 both fail with import_error for src/backend/models/reservation.py. The Debugger analyzed both and identified upstream_dependency: "4" in each case.

SPEED State:

Total tasks: 8
Done: 2
Running: 0
Pending: 1
Failed: 2
Blocked: 0

Failure History:

{"task_id": "3", "classification": "import_error", "root_cause": "reservation.py not found", "upstream": "4"}
{"task_id": "5", "classification": "import_error", "root_cause": "reservation.py not found", "upstream": "4"}

Supervisor Output:

{
  "diagnosis": "Tasks 3 and 5 both fail importing from reservation.py, which is created by Task 4. Task 4 is still pending because its dependency (Task 2) was running. This is a scheduling issue, not a code issue.",
  "pattern_detected": {
    "description": "Multiple tasks import a module created by a not-yet-completed task",
    "affected_tasks": ["3", "5"],
    "root_cause": "Task 4 (Create Reservation model) has not run yet. Tasks 3 and 5 should depend on Task 4 but don't."
  },
  "actions": [
    {
      "task_id": "4",
      "action": "reorder",
      "reason": "Task 4 must complete before Tasks 3 and 5 can succeed"
    },
    {
      "task_id": "3",
      "action": "retry_dependency",
      "reason": "Will succeed once Task 4 provides reservation.py",
      "additional_context": "No code changes needed; the import just needs to resolve"
    },
    {
      "task_id": "5",
      "action": "retry_dependency",
      "reason": "Same root cause as Task 3"
    }
  ],
  "recommendations": [
    "Consider adding depends_on edges from Tasks 3 and 5 to Task 4 to prevent this pattern in future runs"
  ],
  "should_halt": false,
  "halt_reason": null
}

The orchestrator reorders Task 4 to run immediately. Once Task 4 completes, Tasks 3 and 5 are retried. Both succeed because reservation.py now exists.

  • Patterns over individuals. If the same failure appears twice, stop fixing individual tasks and find the root cause.
  • Blocked is not failed. Treat it as high-priority input.
  • Escalate early. Retrying a wrong approach three times costs more than asking once.
  • Be conservative with replans. Replanning discards existing work. Only replan when the plan itself is wrong.
  • If more than 30% of tasks fail, recommend halting and escalating to a human.
  • Consider cost implications of retries and model escalations.
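The 30% guideline above is simple arithmetic, sketched here as a helper. The function name `should_recommend_halt` is hypothetical, and two interpretations are assumptions: "more than 30%" is read as strictly greater, and only tasks in the failed status count toward the ratio (blocked tasks do not).

```python
def should_recommend_halt(failed, total, threshold=0.30):
    """Return True when the share of failed tasks is strictly above the threshold.

    Assumptions: only `failed` tasks count, and an empty plan never halts.
    """
    if total <= 0:
        return False
    return failed / total > threshold
```

With the worked example on this page (2 failed out of 8 total), the ratio is 25%, which stays under the threshold and matches the `should_halt: false` in that output.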

Output Format:

{
  "diagnosis": "Summary of the current situation",
  "pattern_detected": {
    "description": "What pattern is observed",
    "affected_tasks": ["task IDs"],
    "root_cause": "The single underlying issue"
  },
  "actions": [
    {
      "task_id": "3",
      "action": "retry | retry_escalated | retry_dependency | replan | reorder | fix_root_cause | skip | escalate",
      "reason": "Why this action",
      "new_model": "opus",
      "additional_context": "Extra info for the retried agent",
      "human_question": "If escalating, what specific question to ask"
    }
  ],
  "recommendations": ["Human-readable suggestions"],
  "should_halt": false,
  "halt_reason": null
}
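Since the recovery decision arrives as JSON on stdout, a consumer may want to sanity-check it before acting. The following is a minimal sketch of such a check, not the orchestrator's actual parser. Which keys are treated as required, and the rules tying `escalate` to `human_question` and `should_halt` to `halt_reason`, are assumptions. The action set includes retry_dependency because it appears in the action table and the worked example on this page.

```python
import json

# Action names gathered from this page; the set itself is an assumption.
ALLOWED_ACTIONS = {
    "retry", "retry_escalated", "retry_dependency", "replan",
    "reorder", "fix_root_cause", "skip", "escalate",
}


def validate_decision(raw):
    """Parse Supervisor stdout and return a list of problems (empty means OK)."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(decision, dict):
        return ["top-level value must be an object"]

    problems = []
    for key in ("diagnosis", "actions", "should_halt"):  # assumed required
        if key not in decision:
            problems.append(f"missing key: {key}")

    actions = decision.get("actions", [])
    if not isinstance(actions, list):
        problems.append("actions must be a list")
        actions = []

    for index, item in enumerate(actions):
        if not isinstance(item, dict):
            problems.append(f"actions[{index}]: must be an object")
            continue
        if "task_id" not in item:
            problems.append(f"actions[{index}]: missing task_id")
        action = item.get("action")
        if action not in ALLOWED_ACTIONS:
            problems.append(f"actions[{index}]: unknown action {action!r}")
        if action == "escalate" and not item.get("human_question"):
            problems.append(f"actions[{index}]: escalate needs human_question")

    if decision.get("should_halt") and not decision.get("halt_reason"):
        problems.append("should_halt is true but halt_reason is empty")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to log everything wrong with a malformed decision in a single pass.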