Supervisor
Mission
The Supervisor is SPEED’s decision-maker for failure recovery. It analyzes the current state of the system, diagnoses failures, and determines the recovery strategy. Its most critical function is pattern detection: recognizing when multiple tasks fail for the same underlying reason and prescribing a single root-cause fix instead of retrying each task individually.
Invocation
| Property | Value |
|---|---|
| Command | Automatic (invoked by orchestrator) |
| Model tier | planning_model (Opus) |
| Trigger | Task failure, blocked status, repeated failure patterns, coherence/contract check failure |
Inputs
| Input | Source | Description |
|---|---|---|
| Current SPEED state | Orchestrator | Task statuses, timings, costs |
| Failed/blocked task details | Agent output | Error logs, Debugger analysis |
| Overall progress metrics | Orchestrator | Completion rates, failure counts |
| Failure history | .speed/ state | Previous failures and their resolutions |
Outputs
| Output | Location | Description |
|---|---|---|
| Recovery decision | Stdout (JSON) | Diagnosis, pattern analysis, recovery actions, halt recommendation |
Process
The Supervisor applies different decision frameworks depending on the failure type:
Failed Tasks
| Failure Type | Decision | Action |
|---|---|---|
| Transient (network error, timeout, rate limit) | Retry with same configuration | retry |
| Capability (task too complex for model) | Escalate model tier | retry_escalated (sonnet to opus) |
| Specification (task description unclear or impossible) | Re-plan the task | replan |
| Dependency (upstream task output was wrong) | Fix upstream first | retry_dependency |
| Systemic (multiple tasks failing similarly) | Escalate to human | escalate |
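As a rough sketch, the decision table above could be expressed as a classifier. The function name, error substrings, and the `similar_failures` parameter are illustrative assumptions, not code from lib/shared.sh:

```shell
# Hypothetical classifier for the Failed Tasks table. A systemic check
# (2+ similar failures) takes priority over per-task classification.
classify_failure() {
  local error="$1" similar_failures="$2"
  if [ "${similar_failures:-0}" -ge 2 ]; then
    echo "escalate"                  # systemic: escalate to human
    return
  fi
  case "$error" in
    *timeout*|*"rate limit"*|*"network error"*)
      echo "retry" ;;                # transient: same configuration
    *"too complex"*)
      echo "retry_escalated" ;;      # capability: escalate model tier
    *"unclear spec"*|*impossible*)
      echo "replan" ;;               # specification problem
    *"upstream output"*)
      echo "retry_dependency" ;;     # dependency: fix upstream first
    *)
      echo "escalate" ;;             # unknown: ask a human
  esac
}
```

Note that the systemic branch fires before any per-task classification, mirroring the "patterns over individuals" constraint below.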
Blocked Tasks
A blocked status means the Developer reported honestly instead of fabricating output. Treat it as high-priority input, not an error.
| Block Type | Action |
|---|---|
| Ambiguous spec | escalate to human with the specific question (highest priority) |
| Missing dependency | reorder task priority, or replan if the dependency is absent from the plan |
| Contradictory requirements | replan with contradiction highlighted, or escalate if the product spec itself is contradictory |
| Impossible as specified | Review the claim; if valid, replan; if uncertain, escalate |
Pattern Failures
When the same failure repeats across multiple tasks:
- 2+ tasks fail with import errors for the same module: missing foundational task
- 2+ tasks fail with schema mismatches: migration is wrong or missing
- 2+ tasks fail with “file not found”: scaffolding task was skipped or failed
Pattern failures are diagnosed as a single root cause with a single fix (fix_root_cause) rather than retried individually.
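A minimal pattern check over the failure history might look like the following. The awk-based field extraction is a sketch that assumes the JSONL record layout shown in the worked example; the real implementation is not shown here:

```shell
# Print every failure classification that appears 2+ times in the
# failure history, i.e. the candidates for a single root-cause fix.
detect_pattern() {
  local history_file="$1"
  # Pull the "classification" value out of each JSON line, then count repeats.
  awk -F'"classification": "' 'NF > 1 { split($2, a, "\""); print a[1] }' \
    "$history_file" | sort | uniq -c | awk '$1 >= 2 { print $2 }'
}
```

If this prints anything, the Supervisor should prescribe one fix_root_cause action for the whole group instead of per-task retries.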
Coherence Failures
- Interface mismatch: retry the task that needs to change.
- Schema inconsistency: identify the authoritative definition, retry deviating tasks.
- Missing connections: retry the task that should have made the connection.
- Plan-level coherence failure: replan affected tasks.
Contract Failures
Post-integration contract check failures (missing table, missing FK, core query not traversable) are traced to the responsible task for retry, or trigger a replan if no task covers the gap.
How It Works
The Supervisor is invoked by the orchestrator through _invoke_supervisor (lib/shared.sh:153+). It runs in two modes: async (background, for non-blocking diagnosis during speed run) and sync (inline, for decisions that must complete before the pipeline continues).
```
_invoke_supervisor(task_id, trigger_type, mode)
│
├─ 1. Gather SPEED state
│  ├─ Task counts by status (total, done, running, pending, failed, blocked)
│  ├─ Triggering task details (status, error, title)
│  └─ Debugger analysis (if available)
├─ 2. Load failure history
│  └─ .speed/features/<feature>/failure_history.jsonl
├─ 3. Build supervisor message
│  ├─ Trigger type (blocked, pattern, coherence, contract)
│  ├─ SPEED state summary
│  ├─ Task details + debugger log
│  └─ Full failure history
├─ 4. Send to Supervisor agent
│  ├─ async → _spawn_support_agent (background)
│  └─ sync → claude_run (inline, returns output)
└─ 5. Orchestrator applies recovery actions
```
Phase 1: State gathering
Lines 161-168 query task counts by status using task_count_by_status. Lines 178-187 load the triggering task’s details and, if present, the Debugger’s JSON analysis from ${LOGS_DIR}/debugger-${task_id}.json.
Phase 2: Failure history
Lines 172-175 load the cumulative failure history from failure_history.jsonl. Each line is a JSON record of a previous failure with its resolution. Pattern detection depends on this history: the Supervisor compares current failures against past ones to identify systemic issues.
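For illustration, appending a record to this history could be as simple as the helper below. The function name and the FAILURE_HISTORY variable are hypothetical; the real write path in the orchestrator is not shown here:

```shell
# Hypothetical helper: append one failure record to the JSONL history.
# FAILURE_HISTORY would point at .speed/features/<feature>/failure_history.jsonl.
record_failure() {
  local task_id="$1" classification="$2" root_cause="$3"
  printf '{"task_id": "%s", "classification": "%s", "root_cause": "%s"}\n' \
    "$task_id" "$classification" "$root_cause" >> "$FAILURE_HISTORY"
}
```

Because each record is one line, the history can be grepped, counted, and diffed without a JSON parser, which is what makes cheap pattern detection possible.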
Phase 3: Prompt assembly
Lines 189-200 build a prompt with four sections: trigger type label, SPEED state summary, triggering task details (with debugger analysis if available), and the full failure history.
Phase 4: Agent execution
Two execution modes (lines 202-211):
| Mode | Function | Use case |
|---|---|---|
| async | _spawn_support_agent | Non-blocking during parallel task execution |
| sync | claude_run | Contract failures, coherence failures (pipeline must wait for decision) |
Both use Sonnet and read-only tools. The Supervisor never modifies files directly.
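The mode choice follows directly from the trigger type. A sketch, with an illustrative helper name (the real dispatch lives inside the orchestrator):

```shell
# Map a trigger type to an invocation mode. Contract and coherence
# failures gate the pipeline, so they must run synchronously.
invoke_mode() {
  case "$1" in
    contract|coherence) echo "sync" ;;   # pipeline must wait for the decision
    *)                  echo "async" ;;  # blocked/pattern: diagnose in background
  esac
}
```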
Phase 5: Action application
The orchestrator (_handle_support_completion) parses the Supervisor’s JSON output and applies each action:
| Action | What the orchestrator does |
|---|---|
| retry | Reset task to pending, re-enqueue with same config |
| retry_escalated | Reset task, upgrade model (sonnet → opus) |
| replan | Mark affected tasks for re-planning by the Architect |
| retry_dependency | Reorder dependency, retry the blocked task after upstream completes |
| reorder | Move the task earlier in the execution order so dependents can proceed |
| fix_root_cause | Apply a single fix, then retry all affected tasks |
| escalate | Log the human_question, halt the task, notify the user |
| skip | Mark task as skipped, unblock dependents if safe |
Trigger types
| Trigger | When | Typical pattern |
|---|---|---|
| blocked | Developer reports status: "blocked" | Ambiguous spec, missing dependency |
| pattern | 2+ tasks fail with similar errors | Missing foundational task, wrong migration |
| coherence | Coherence check finds interface mismatches | Signature mismatch across branches |
| contract | Post-integration contract check fails | Missing table, missing FK |
Worked Example
During speed run, Tasks 3 and 5 both fail with import_error for src/backend/models/reservation.py. The Debugger analyzed both and identified upstream_dependency: "4" in each case.
What the supervisor receives
```
SPEED State:
Total tasks: 8
Done: 2
Running: 0
Pending: 1
Failed: 2
Blocked: 0

Failure History:
{"task_id": "3", "classification": "import_error", "root_cause": "reservation.py not found", "upstream": "4"}
{"task_id": "5", "classification": "import_error", "root_cause": "reservation.py not found", "upstream": "4"}
```
What the agent returns
```
{
  "diagnosis": "Tasks 3 and 5 both fail importing from reservation.py, which is created by Task 4. Task 4 is still pending because its dependency (Task 2) was running. This is a scheduling issue, not a code issue.",
  "pattern_detected": {
    "description": "Multiple tasks import a module created by a not-yet-completed task",
    "affected_tasks": ["3", "5"],
    "root_cause": "Task 4 (Create Reservation model) has not run yet. Tasks 3 and 5 should depend on Task 4 but don't."
  },
  "actions": [
    {
      "task_id": "4",
      "action": "reorder",
      "reason": "Task 4 must complete before Tasks 3 and 5 can succeed"
    },
    {
      "task_id": "3",
      "action": "retry_dependency",
      "reason": "Will succeed once Task 4 provides reservation.py",
      "additional_context": "No code changes needed — just needs the import to resolve"
    },
    {
      "task_id": "5",
      "action": "retry_dependency",
      "reason": "Same root cause as Task 3"
    }
  ],
  "recommendations": [
    "Consider adding depends_on edges from Tasks 3 and 5 to Task 4 to prevent this pattern in future runs"
  ],
  "should_halt": false,
  "halt_reason": null
}
```
What happens next
The orchestrator reorders Task 4 to run immediately. Once Task 4 completes, Tasks 3 and 5 are retried. Both succeed because reservation.py now exists.
Constraints
- Patterns over individuals. If the same failure appears twice, stop fixing individual tasks and find the root cause.
- Blocked is not failed. Treat it as high-priority input.
- Escalate early. Retrying a wrong approach three times costs more than asking once.
- Be conservative with replans. Replanning discards existing work. Only replan when the plan itself is wrong.
- Halt at scale. If more than 30% of tasks fail, recommend halting and escalating to a human.
- Mind cost. Consider the cost implications of retries and model escalations.
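The 30% threshold above can be sketched as an integer check (the helper name is illustrative):

```shell
# True (exit 0) when more than 30% of tasks have failed.
# Uses integer arithmetic, so 3 of 8 failures = 37% -> halt.
should_halt() {
  local failed="$1" total="$2"
  [ "$total" -gt 0 ] && [ $(( failed * 100 / total )) -gt 30 ]
}
```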
Output Schema
```
{
  "diagnosis": "Summary of the current situation",
  "pattern_detected": {
    "description": "What pattern is observed",
    "affected_tasks": ["task IDs"],
    "root_cause": "The single underlying issue"
  },
  "actions": [
    {
      "task_id": "3",
      "action": "retry | retry_escalated | replan | reorder | fix_root_cause | skip | escalate",
      "reason": "Why this action",
      "new_model": "opus",
      "additional_context": "Extra info for the retried agent",
      "human_question": "If escalating, what specific question to ask"
    }
  ],
  "recommendations": ["Human-readable suggestions"],
  "should_halt": false,
  "halt_reason": null
}
```
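Before applying actions, the orchestrator could sanity-check a reply for the required top-level keys. A grep-based sketch with a hypothetical helper name; a real implementation would use a JSON parser:

```shell
# Return 0 only if the reply mentions every required top-level key.
# String matching only: this does not validate JSON structure.
valid_supervisor_output() {
  local json="$1" key
  for key in diagnosis actions should_halt; do
    printf '%s' "$json" | grep -q "\"$key\"" || return 1
  done
}
```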