Session Intelligence
Warden tracks multiple signals in real time to model session health:
| Signal | What it detects |
|---|---|
| Focus score | How concentrated the agent’s work is. Drops when scope widens or subsystems change rapidly. |
| Loop detection | Repeated command patterns that aren’t producing progress. |
| Verification debt | Edit-heavy sessions that aren’t running tests or builds. |
| Drift velocity | How quickly the session is diverging from its original goal. |
| Trust score | How well the agent is behaving. High trust = fewer interventions. |
| Phase tracking | Whether the session is in exploration, implementation, testing, or finishing. |
When signals indicate degradation, Warden emits a targeted advisory. When the session is healthy, it stays completely silent.
Session Phases
Warden models every session as moving through five phases. The current phase determines intervention thresholds — how sensitive Warden is to signals and how aggressively it intervenes.
| Phase | Turns | Behavior |
|---|---|---|
| Warmup | 0-5 | The agent is reading files, understanding the task, running initial commands. Warden is lenient — exploratory reads and broad searches are expected. |
| Productive | 6-30 | The core working phase. The agent is making edits, running builds, iterating. Warden watches for loops and drift but keeps a light touch. |
| Exploring | varies | The agent has shifted from its original task to investigate something tangential. This isn’t necessarily bad, but Warden tracks how far the exploration goes. |
| Struggling | varies | Error rates are climbing, commands are repeating, the same files are being edited without progress. Warden increases advisory frequency. |
| Late | 80+ | Deep into the session, context pressure is mounting. Warden tightens output compression, issues more targeted advisories, and may suggest wrapping up. |
Phases aren’t strictly sequential. A session can move from Productive to Struggling and back to Productive if the agent resolves its issues. Phase transitions are driven by the signals described below.
Signals in Detail
Focus score tracks how concentrated the agent’s work is across the file system. When the agent edits src/auth/login.ts, then src/auth/session.ts, then src/auth/token.ts, focus is high — it’s working in one subsystem. When it jumps to src/database/connection.ts, then src/ui/header.tsx, then package.json, focus drops. Rapid subsystem switching usually indicates the agent is flailing.
Real scenario: The agent is asked to add a new API endpoint. It starts in the right directory, but after hitting a type error, it starts modifying the database schema, then the test fixtures, then the CI config. Focus score drops from 90 to 40. Warden injects: “Focus has dropped. Consider addressing the type error in the auth module before changing other subsystems.”
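One way to compute such a score, as a minimal sketch: treat a file's subsystem as its first path component under src/ and score focus as the share of recent edits landing in the most-edited subsystem. The `subsystem` and `focus_score` names, the src/ convention, and the 0-100 scaling are illustrative assumptions, not Warden's actual implementation.

```rust
use std::collections::HashMap;

// Hypothetical sketch: a file's subsystem is its first path component
// under src/; files outside src/ each count as their own bucket.
fn subsystem(path: &str) -> &str {
    path.strip_prefix("src/")
        .and_then(|rest| rest.split('/').next())
        .unwrap_or(path)
}

// Focus = share of recent edits that fall in the most-edited subsystem,
// scaled to 0-100. Three edits in src/auth/ score 100; four edits spread
// across four subsystems score 25.
fn focus_score(recent_edits: &[&str]) -> u32 {
    if recent_edits.is_empty() {
        return 100; // nothing edited yet, nothing to penalize
    }
    let mut counts: HashMap<&str, u32> = HashMap::new();
    for path in recent_edits {
        *counts.entry(subsystem(path)).or_insert(0) += 1;
    }
    let dominant = *counts.values().max().unwrap();
    dominant * 100 / recent_edits.len() as u32
}
```

Under this sketch, the auth-only sequence from the paragraph above scores 100, while jumping across database, UI, and the project root drags the score down sharply.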
Loop detection identifies repeated command patterns that aren’t producing progress. The simplest loop is running the same command and getting the same error — but Warden also detects more subtle patterns like edit-build-fail cycles where the edits aren’t addressing the actual error.
Real scenario: The agent runs cargo build and gets a lifetime error. It edits a function signature, builds again, gets a different lifetime error. Edits again, builds again, gets the original error back. After 4 iterations of this cycle, Warden detects the loop: “Loop detected: 4 build-fail cycles on lifetime errors. Consider a different approach — the borrow structure may need redesigning rather than signature tweaks.”
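The simplest form of this detection can be sketched as counting repeated (command, error) pairs inside a sliding window. The window size, threshold, and advisory text below are illustrative assumptions; Warden's real detector also catches the subtler edit-build-fail cycles described above.

```rust
use std::collections::HashMap;

// Hypothetical sketch: a loop is the same (command, error) pair recurring
// within a sliding window of recent turns. Window and threshold are
// illustrative; the real sensitivity adapts to session phase.
fn detect_loop(history: &[(&str, &str)], window: usize, threshold: usize) -> Option<String> {
    let recent = &history[history.len().saturating_sub(window)..];
    let mut counts: HashMap<(&str, &str), usize> = HashMap::new();
    for pair in recent {
        *counts.entry(*pair).or_insert(0) += 1;
    }
    let ((cmd, _err), n) = counts.into_iter().max_by_key(|&(_, n)| n)?;
    if n >= threshold {
        Some(format!("Loop detected: {} repeats of `{}`", n, cmd))
    } else {
        None
    }
}
```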
Verification debt accumulates when the agent edits files without running tests or builds. Editing 2-3 files before building is normal. Editing 8 files across 4 directories without a single build or test run is risky — any of those changes could be broken, and the agent won’t find out until much later.
Real scenario: The agent edits src/handler.rs, src/model.rs, src/routes.rs, src/middleware.rs, src/types.rs, src/config.rs, tests/integration.rs, and Cargo.toml — all without running cargo build or cargo test. Warden counts 8 unverified files and injects: “Verification debt: 8 files edited since last build. Run cargo build or cargo test to catch errors early.”
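A debt counter of this kind can be sketched as a set of edited paths that clears whenever a verification command runs. The struct name, the fixed threshold of 5, and the cargo-prefix matching are illustrative assumptions (the actual threshold adapts to session phase, per the adaptive thresholds section).

```rust
use std::collections::HashSet;

// Hypothetical sketch: track distinct files edited since the last build or
// test, and surface an advisory once the count crosses a threshold.
struct VerificationDebt {
    unverified: HashSet<String>,
}

impl VerificationDebt {
    fn new() -> Self {
        Self { unverified: HashSet::new() }
    }

    fn on_edit(&mut self, path: &str) {
        self.unverified.insert(path.to_string());
    }

    // Returns an advisory to inject, if any.
    fn on_command(&mut self, cmd: &str) -> Option<String> {
        if cmd.starts_with("cargo build") || cmd.starts_with("cargo test") {
            self.unverified.clear(); // any verification run pays the debt down
            return None;
        }
        let n = self.unverified.len();
        if n >= 5 {
            Some(format!("Verification debt: {n} files edited since last build."))
        } else {
            None
        }
    }
}
```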
Drift velocity measures how quickly the session is diverging from its initial goal. It combines subsystem switching, file distance from the original working set, and whether the agent is still touching files related to the original request.
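One way to read that combination, as a hypothetical sketch: blend the rate of subsystem switching with the fraction of recent file touches unrelated to the original request. The 0.5/0.5 weights and the omission of the file-distance component are simplifying assumptions.

```rust
// Hypothetical sketch: drift velocity as a weighted blend of two rates,
// measured over a recent window of turns. Weights are illustrative.
fn drift_velocity(
    subsystem_switches: u32,    // switches observed in the window
    touches_outside_goal: u32,  // file touches unrelated to the original request
    total_touches: u32,
    window_turns: u32,
) -> f64 {
    if total_touches == 0 || window_turns == 0 {
        return 0.0; // no activity, no drift
    }
    let switch_rate = subsystem_switches as f64 / window_turns as f64;
    let away_fraction = touches_outside_goal as f64 / total_touches as f64;
    0.5 * switch_rate + 0.5 * away_fraction
}
```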
Trust score is a composite metric (0-100) that starts at 100 and decreases based on errors (-5), verification debt (-3 per file), subsystem switches (-2), and dead ends (-4); the full formula below adds denials, checkpoint decay, and two bonuses. It controls the advisory injection budget: high-trust sessions get minimal intervention, low-trust sessions get more guidance.
Trust Score Formula
The trust score is recalculated on every turn:
```
trust = 100
    - errors * 5
    - verification_debt * 3
    - subsystem_switches * 2
    - dead_ends * 4
    - turns_since_checkpoint
    - denials * 3
    + milestone_bonus
    + balance_bonus
```
The result is clamped to 0-100. milestone_bonus is awarded when the agent completes a verifiable step (e.g., a passing build after edits). balance_bonus rewards sessions that interleave edits with verification rather than batching all edits first. turns_since_checkpoint decays trust gradually when the agent goes many turns without a verifiable checkpoint (build, test, or lint).
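The formula above transcribes directly into code. This sketch uses illustrative field names (not Warden's actual identifiers) and signed integers so intermediate values can go negative before clamping.

```rust
// Direct transcription of the trust formula; field names are illustrative.
struct TurnStats {
    errors: i32,
    verification_debt: i32, // unverified files
    subsystem_switches: i32,
    dead_ends: i32,
    turns_since_checkpoint: i32,
    denials: i32,
    milestone_bonus: i32,
    balance_bonus: i32,
}

fn trust_score(s: &TurnStats) -> i32 {
    let raw = 100
        - s.errors * 5
        - s.verification_debt * 3
        - s.subsystem_switches * 2
        - s.dead_ends * 4
        - s.turns_since_checkpoint
        - s.denials * 3
        + s.milestone_bonus
        + s.balance_bonus;
    raw.clamp(0, 100) // the result is clamped to 0-100
}
```

For example, a turn with 2 errors and 3 unverified files (everything else zero) scores 100 - 10 - 9 = 81.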
Injection Budget
Warden doesn’t inject advisories randomly. It uses a utility-ranked budget system gated by the trust score.
Signal utilities (how valuable each signal type is when injected):
| Signal | Utility | When it fires |
|---|---|---|
| Safety | 1.0 | Dangerous command about to execute |
| Loop | 0.9 | Repeated command pattern detected |
| Verification | 0.8 | Edit-heavy session without builds/tests |
| Phase | 0.7 | Session phase transition (e.g., entering Struggling) |
| Recovery | 0.6 | Suggestion after a command failure |
| Focus | 0.5 | Subsystem drift detected |
| Pressure | 0.4 | Context budget running low |
Trust-gated budget (how many signals can fire per turn):
| Trust Score | Budget |
|---|---|
| > 85 | Top 1 signal only (highest utility) |
| 50-85 | Top 3 signals |
| 25-49 | Top 5 signals |
| < 25 | Up to 15 signals (aggressive recovery mode) |
This means a healthy, high-trust session only sees the single most critical signal per turn. A struggling session gets comprehensive guidance. Safety signals (utility 1.0) always make the cut regardless of budget.
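The selection step can be sketched as a sort-and-truncate with a safety override. The function names and the exact boundary handling are assumptions; the utilities and budgets come from the tables above.

```rust
// Hypothetical sketch of trust-gated selection: rank pending signals by
// utility, keep the top-k allowed by the budget, and always admit safety
// signals (utility 1.0) regardless of budget.
fn budget_for(trust: u32) -> usize {
    match trust {
        86..=100 => 1, // healthy: single most critical signal
        50..=85 => 3,
        25..=49 => 5,
        _ => 15,       // aggressive recovery mode
    }
}

fn select_signals(mut signals: Vec<(&'static str, f64)>, trust: u32) -> Vec<&'static str> {
    // Sort descending by utility so the budget keeps the most valuable signals.
    signals.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    let budget = budget_for(trust);
    signals
        .into_iter()
        .enumerate()
        .filter(|&(i, (_, utility))| i < budget || utility == 1.0)
        .map(|(_, (name, _))| name)
        .collect()
}
```

With four pending signals, a trust-90 session sees only the safety signal, while a trust-60 session sees the top three.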
The Scorecard System
At the end of each session (or on demand via warden scorecard), Warden generates a quality scorecard with four dimensions:
| Dimension | What it measures |
|---|---|
| Safety | How many safety rules fired, how many were critical, whether any were bypassed |
| Efficiency | Token utilization, output compression ratio, unnecessary command count |
| Focus | Average focus score, number of subsystem switches, drift velocity |
| UX | How many advisories were emitted, whether loops were detected and broken, verification debt at end |
Each dimension scores 0-100. The overall session quality is the weighted average. Scorecard data feeds into the dream state for cross-session learning.
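The weighted average can be sketched in a few lines. Equal weights are an illustrative assumption here; the source does not specify Warden's actual weighting.

```rust
// Hypothetical sketch: overall session quality as a weighted average of the
// four 0-100 dimension scores. Equal weights are an assumption.
fn overall_quality(safety: f64, efficiency: f64, focus: f64, ux: f64) -> f64 {
    let weighted = [(safety, 0.25), (efficiency, 0.25), (focus, 0.25), (ux, 0.25)];
    weighted.iter().map(|(score, w)| score * w).sum()
}
```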
Dream State (Between Sessions)
When a session ends, Warden’s background worker processes the session data through 10 learning tasks:
| Task | Purpose |
|---|---|
| LearnEffectiveness | Which rules and advisories preceded progress vs. were ignored |
| BuildResumePacket | Compact summary of the session for the next one to pick up from |
| LearnSequences | Successful action sequences worth repeating |
| ClusterErrors | Group repeated errors into durable knowledge |
| LearnRepairPatterns | Map error types to fixes that worked |
| LearnConventions | Project conventions from recurring patterns |
| UpdateWorkingSetRanking | Which files/directories are most important by recency-frequency-outcome |
| BuildDeadEndMemory | Approaches that were tried and failed (so the next session avoids them) |
| ScoreArtifacts | Prune weak or outdated learning artifacts |
| ConsolidateEvents | Compress raw event logs into higher-level facts |
These tasks run in priority order during daemon idle time. The highest-value work (effectiveness learning, resume packet) runs first. Lower-priority housekeeping (event consolidation) runs last.
The dream state produces a resume packet — a compact summary of what the session learned. When the next session starts (or after context compaction), the resume packet is injected so the agent doesn’t lose hard-won context. This includes the working set (top files by importance), dead ends to avoid, and conventions discovered.
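A resume packet of this shape might look like the following sketch. The struct and its field names are assumptions inferred from the description, not Warden's actual schema.

```rust
// Hypothetical sketch of a resume packet's shape; field names are
// assumptions based on the description above.
struct ResumePacket {
    working_set: Vec<String>, // top files by importance
    dead_ends: Vec<String>,   // approaches tried and failed
    conventions: Vec<String>, // project conventions discovered
}

// Render the packet as the text injected at the start of the next session.
fn render_resume(p: &ResumePacket) -> String {
    format!(
        "Resume context\nWorking set: {}\nDead ends: {}\nConventions: {}",
        p.working_set.join(", "),
        p.dead_ends.join("; "),
        p.conventions.join("; "),
    )
}
```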
Dream Artifacts (V2)
Beyond the resume packet, the dream state produces four typed artifact categories that persist across sessions:
| Artifact | Source Task | What it contains |
|---|---|---|
| DreamPlaybook | LearnEffectiveness | Ranked list of which rules and advisories actually preceded progress. Feeds the injection budget’s utility weights over time. |
| RepairPattern | LearnRepairPatterns | Maps error signatures (e.g., “E0308 mismatched types”) to the fix sequences that resolved them in past sessions. |
| ProjectConvention | LearnConventions | Durable project patterns: naming conventions, file structure expectations, testing style, preferred imports. |
| SuccessfulSequence | LearnSequences | Multi-step action patterns that led to successful outcomes (e.g., “run lint, fix warnings, then build” or “read test file before editing implementation”). |
Artifacts are scored by the ScoreArtifacts task based on how often they’re used and whether they correlate with positive outcomes. Low-scoring artifacts are pruned during consolidation to keep the knowledge base compact.
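A usage-and-outcome score with pruning can be sketched as below. The log-scaled scoring function and the threshold are illustrative assumptions; the source only says artifacts are scored by usage and outcome correlation.

```rust
// Hypothetical sketch: score an artifact by how often it is used and how
// often its use precedes progress, then prune below a threshold.
struct Artifact {
    name: String,
    times_used: u32,
    positive_outcomes: u32, // uses that were followed by progress
}

fn artifact_score(a: &Artifact) -> f64 {
    if a.times_used == 0 {
        return 0.0; // never used: prime candidate for pruning
    }
    let outcome_rate = a.positive_outcomes as f64 / a.times_used as f64;
    // Log-scaled usage so one heavily-used artifact doesn't dominate.
    (a.times_used as f64).ln_1p() * outcome_rate
}

fn prune(mut artifacts: Vec<Artifact>, threshold: f64) -> Vec<Artifact> {
    artifacts.retain(|a| artifact_score(a) >= threshold);
    artifacts
}
```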
Adaptive Thresholds
Warden doesn’t use fixed thresholds for intervention. The session phase and trust score adjust them in real time:
| Setting | Warmup | Productive | Struggling | Late |
|---|---|---|---|---|
| Output compression max lines | 80 | 80 | 60 | 40 |
| Advisory injection budget | 1 | 1-3 | 3-5 | uncapped |
| Loop detection sensitivity | low | medium | high | high |
| Verification debt warning threshold | 8 files | 5 files | 3 files | 2 files |
This means a fresh session tolerates broad exploration and messy output. A late, struggling session gets tight compression and aggressive guidance. The agent doesn’t need to know about any of this — the adjustments happen transparently based on what Warden observes.
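The threshold table above amounts to a lookup keyed by phase. This sketch takes the upper bound where the table gives a range ("1-3", "3-5") and omits loop sensitivity for brevity; the type and field names are illustrative.

```rust
// Hypothetical sketch: the adaptive threshold table as a lookup by phase.
#[derive(Clone, Copy)]
enum Phase {
    Warmup,
    Productive,
    Struggling,
    Late,
}

struct Thresholds {
    max_output_lines: u32,
    advisory_budget: Option<u32>, // None = uncapped
    debt_warning_files: u32,      // verification debt warning threshold
}

fn thresholds_for(phase: Phase) -> Thresholds {
    match phase {
        Phase::Warmup => Thresholds { max_output_lines: 80, advisory_budget: Some(1), debt_warning_files: 8 },
        Phase::Productive => Thresholds { max_output_lines: 80, advisory_budget: Some(3), debt_warning_files: 5 },
        Phase::Struggling => Thresholds { max_output_lines: 60, advisory_budget: Some(5), debt_warning_files: 3 },
        Phase::Late => Thresholds { max_output_lines: 40, advisory_budget: None, debt_warning_files: 2 },
    }
}
```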
How Interventions Work in Practice
Warden’s interventions are injected as text into the tool call response. The agent reads them as part of the output and can act on them (or ignore them). Here’s what a typical intervention sequence looks like in a real session:
Turn 1-10 (Warmup): The agent reads files, runs rg searches, examines the directory structure. Warden is silent — this is expected exploration behavior.
Turn 15 (Productive): The agent edits src/handler.rs and src/model.rs. No advisory — two files without a build is fine.
Turn 22 (Productive, verification debt rising): The agent has now edited 6 files without running cargo build or cargo test. Warden injects: “Verification debt: 6 files edited since last build. Consider running tests before continuing.”
Turn 28 (Struggling): The agent runs cargo build, gets an error, edits a file, builds again, gets the same error. After the third cycle, Warden injects: “Loop detected: 3 build-fail cycles on the same error. The approach may need rethinking rather than incremental fixes.”
Turn 35 (Productive): The agent takes a different approach, the build passes, and tests are green. Trust score recovers. Warden goes silent again.