PRD-001: lessons-learned - Automatic Mistake Capture & Proactive Lesson Injection
| Field | Value |
|---|---|
| Status | Draft |
| Author | Joe Black |
| Created | 2026-03-28 |
| Last Updated | 2026-03-29 |
| Stakeholders | Individual developers using AI coding agents |
1. Problem Statement
AI coding agents (Claude Code, Codex, Gemini CLI) repeatedly make the same categories of mistakes across sessions. Each recurrence costs the developer time, tokens, and flow state. The agent has no memory of past mistakes — every session starts from zero.
Current state: Mistakes are corrected in-context, then forgotten. The next session hits the same pitfall.
Desired state: A system that automatically captures mistakes from conversation history, structures them as indexed lessons, and proactively injects relevant warnings before the agent can repeat the mistake.
Developer Pain Points (Representative Examples)
These examples span the full development workflow — any developer using AI agents will recognize at least several:
- Test runners: pytest hangs due to TTY detection; jest `--forceExit` needed in watch mode; mocha requires explicit `--exit` in CI
- Package management: `pip install -e .` in wrong venv; `npm link` doesn't resolve peer deps; `pnpm` hoisting behavior differs from npm
- Mock/patch targets: Python `mock.patch` must target the importing module's namespace, not the source module
- CI/CD isolation: pre-commit hooks run in isolated venvs, so deps must be declared explicitly; GitHub Actions `services:` containers don't share localhost with the runner
- CLI architecture: Typer/Click subcommand patterns break positional args; argparse `nargs='*'` swallows subcommand names
- File I/O race conditions: sandbox filesystem tools can't see dirs created by shell; Docker build context doesn't include `.gitignore`d files
- Git footguns: `git stash` silently skips untracked files without `-u`; rebase onto wrong base loses commits
- Database migrations: Alembic autogenerate misses index changes; Django migration dependency cycles from circular model imports
- Async pitfalls: forgetting `await` on Python coroutine (silently returns coroutine object); Node.js unhandled promise rejection silently exits
- Environment leaks: `.env` loaded in wrong order; `NODE_ENV=production` left set in dev shell
2. Goals and Non-Goals
Goals
- Automatically mine conversation logs for mistake → correction patterns, with zero manual intervention after initial setup
- Structure lessons with trigger patterns that enable proactive injection before the mistake recurs
- Inject relevant lessons into the agent's context at the exact moment they're useful — when the agent is about to call a tool in a way that historically causes problems
- Support incremental discovery — continuously learn from new sessions without re-processing old data
- Be fast — the hot-path hook must complete in <50ms to avoid degrading agent performance
- Be cross-agent compatible (V2) — core logic decoupled from any specific agent platform
Non-Goals
- Replacing agent training or fine-tuning — this is a runtime context injection system, not a model improvement
- Handling non-coding domains — focused on software engineering tool usage
- Real-time correction during a mistake — this is proactive (before the tool call), not reactive (after failure)
- Building a general-purpose knowledge base — strictly scoped to mistake patterns and their prevention
3. User Stories
US-1: Automatic Lesson Discovery
As a developer using AI coding agents, I want the system to automatically scan my past conversation logs and extract mistake patterns so that I don't have to manually identify and document every pitfall.
Acceptance Criteria:
- Scanner processes all session JSONL files in `~/.claude/projects/`
- Incremental scanning: only processes new data since last scan
- Detects mistake → correction sequences with configurable heuristics
- Outputs structured candidate lessons with confidence scores
- Deduplicates against existing lessons via content hashing
US-2: Proactive Lesson Injection
As a developer, I want the system to automatically warn the AI agent about known pitfalls before it makes a tool call that historically causes problems, so that mistakes are prevented rather than corrected.
Acceptance Criteria:
- PreToolUse hook fires before Bash, Read, Edit, Write, and Glob tool calls
- Matches current tool input against lesson trigger patterns (command regex, file path globs)
- Injects relevant lessons as `additionalContext` that the agent sees before executing
- Respects injection budget (configurable, default 4KB) and cap (configurable, default 3 lessons)
- Negative lookahead in patterns prevents injection when the fix is already applied
US-3: Session-Scoped Dedup
As a developer, I want each lesson injected at most once per session (unless context is compacted), so that the agent isn't repeatedly nagged about the same thing.
Acceptance Criteria:
- 3-layer dedup: environment variable + session temp file + O_EXCL claim directory
- Handles parallel subagents without double-injection
- On context compaction: high-priority lessons (>= configurable threshold) are cleared from dedup for re-injection
- On session clear: all dedup state wiped
US-4: Manual Lesson Management
As a developer, I want to manually add, review, and manage lessons via CLI commands, so that I can contribute domain knowledge and curate auto-discovered lessons.
Acceptance Criteria:
- `/scan-lessons` command triggers a scan and presents candidates for review
- `/add-lesson` command accepts structured lesson input
- Manifest auto-rebuilds when lessons are added or modified
- Lessons have `needsReview` flag for auto-discovered entries below confidence threshold
US-5: CLI Tool Intelligence Aggregation
As a developer, when enough lessons accumulate for a specific tool (e.g., 5+ for pytest), I want them auto-aggregated into a coherent "tool intelligence" skill file rather than injected individually, reducing noise and improving context quality.
Acceptance Criteria:
- Build script groups lessons by `tool:*` tags
- When a tool reaches 5+ lessons, generates a skill file in `skills/cli-intel/`
- Skill files use standard SKILL.md frontmatter with commandPatterns
- Hook prefers the aggregated skill over individual lessons when available
4. Architecture Overview
System Components
┌─────────────────────────────────────────────────────────────────────┐
│ lessons-learned Plugin │
│ │
│ ┌──────────────┐ ┌──────────────────┐ ┌───────────────────┐ │
│ │ Log Scanner │───▶│ Lesson Store │───▶│ Manifest Builder │ │
│ │ (scripts/) │ │ (data/lessons) │ │ (scripts/) │ │
│ └──────┬───────┘ └──────────────────┘ └───────┬───────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌───────────────────┐ │
│ │ Session JSONL │ │ Manifest JSON │ │
│ │ (~/.claude/) │ │ (data/manifest) │ │
│ └──────────────┘ └───────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────┐ │
│ │ PreToolUse Hook │ │
│ │ (hooks/) │ │
│ └───────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────┐ │
│ │ additionalContext │ │
│ │ → Agent sees │ │
│ │ lesson before │ │
│ │ tool executes │ │
│ └───────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Data Flow
- Offline: Scanner reads session JSONL files → detects mistake patterns → produces candidates → classified into lessons → stored in `lessons.json`
- Build: Manifest builder compiles `lessons.json` → `lesson-manifest.json` (pre-compiled regex, pre-rendered injection text)
- Runtime: PreToolUse hook loads manifest → matches tool input against patterns → dedup check → injects relevant lessons as `additionalContext`
Cross-Agent Compatibility (V2)
lessons-learned/
├── core/ # Pure Node.js — no agent-specific APIs
│ ├── matcher.mjs # Pattern matching (regex test, priority sort)
│ ├── store.mjs # Lesson CRUD (read, add, dedup, hash)
│ ├── manifest.mjs # Manifest build/load/query
│ └── scanner/ # Log scanner (JSONL stream processing)
├── adapters/
│ ├── claude-code/ # hooks.json, stdin/stdout contract, O_EXCL dedup
│ ├── codex/ # Codex hook format (TBD)
│ └── gemini/ # Gemini CLI hook format (TBD)
└── data/ # Shared lesson store + manifest
The core never imports agent-specific modules. Adapters handle:
- Hook registration format (hooks.json for Claude Code, equivalent for others)
- stdin/stdout JSON contract translation
- Dedup state persistence (each agent has different session ID formats)
- Context injection format (`additionalContext` vs. equivalent)
5. Reference Architecture: How the Vercel Plugin's Hook System Works
This plugin models its hook architecture on the Vercel plugin for Claude Code, which is the most sophisticated example of the pattern. Understanding it is essential for contributors.
The Hook Lifecycle
When Claude Code is about to execute a tool (e.g., Bash with command `pytest -v tests/`):
- Claude Code checks `hooks.json` for matching `PreToolUse` hooks
- The hook matcher (e.g., `"Bash|Read|Edit|Write"`) determines if this tool triggers the hook
- Claude Code spawns a child process: `node "${CLAUDE_PLUGIN_ROOT}/hooks/pretooluse-lesson-inject.mjs"`
- Claude Code pipes a JSON payload to the process's stdin:
{
"tool_name": "Bash",
"tool_input": { "command": "pytest -v tests/" },
"session_id": "abc-123",
"cwd": "/Users/joe/project",
"agent_id": "main"
}
- The hook runs its pipeline and writes JSON to stdout:
{
"hookSpecificOutput": {
"additionalContext": "## Lesson: pytest TTY hanging\npytest hangs in Claude Code..."
},
"env": { "LESSONS_SEEN": "pytest-tty-hanging-x7k2" }
}
- Claude Code prepends `additionalContext` to the tool call's context, so the agent sees this warning before it sees the tool's output
- The `env` keys become environment variables available to subsequent hook invocations
- The tool executes normally
The Vercel Plugin's Six-Stage Pipeline
| Stage | What it does | Perf |
|---|---|---|
| 1. parseInput | Read stdin, extract tool_name, tool_input, session_id. Reject unsupported tools immediately. | <1ms |
| 2. loadSkills | Load skill-manifest.json: pre-compiled regex sources, summaries, injection text. Falls back to scanning SKILL.md files if manifest is missing. | <1ms |
| 3. matchSkills | For file tools: test file_path against glob-derived regex. For Bash: test command against regex. Returns MatchReason objects with the matching pattern and type. | <2ms |
| 4. deduplicateSkills | Merge 3 dedup sources (env var + session file + O_EXCL claim dir), filter already-seen, apply context-specific priority boosts, sort by effective priority. | <2ms |
| 5. injectSkills | Read SKILL.md content for matched skills. Budget enforcement: first skill always fits; subsequent checked against 18KB budget. Falls back to summary field if too large. Claims via O_EXCL atomically. | <3ms |
| 6. formatOutput | Wrap in HTML comment markers (<!-- skill:name -->), embed metadata comment for debugging, write JSON to stdout. | <1ms |
Total: Consistently under 10ms, well within the 5-second hook timeout.
6. Our PreToolUse Hook: Stage-by-Stage Design
Our hook follows the same 6-stage architecture but is simpler: lessons are smaller than skills, and pre-rendered in the manifest (no file I/O at injection time).
Stage 1: Parse Input
Read stdin JSON. Extract tool_name, tool_input, session_id, agent_id. If tool_name is not in {Bash, Read, Edit, Write, Glob}, output {} and exit immediately. This rejects ~40% of tool calls (Agent, WebSearch, etc.) with zero pattern matching.
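The early-exit check can be sketched as a small pure function. This is only an illustration: `parseInput` and `SUPPORTED_TOOLS` are hypothetical names, and treating malformed input as "inject nothing" is an assumption rather than a documented requirement.

```javascript
// Tools the hook inspects; any other tool_name is rejected before matching.
const SUPPORTED_TOOLS = new Set(["Bash", "Read", "Edit", "Write", "Glob"]);

// Hypothetical helper: returns the extracted fields, or null when the hook
// should print {} and exit immediately.
function parseInput(rawStdin) {
  let payload;
  try {
    payload = JSON.parse(rawStdin);
  } catch {
    return null; // assumption: malformed payloads inject nothing
  }
  if (payload === null || typeof payload !== "object") return null;
  if (!SUPPORTED_TOOLS.has(payload.tool_name)) return null;
  return {
    toolName: payload.tool_name,
    toolInput: payload.tool_input ?? {},
    sessionId: payload.session_id ?? "",
    agentId: payload.agent_id ?? "",
  };
}
```

Returning `null` rather than throwing keeps the caller's fast path to a single check before it writes `{}`.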
Stage 2: Load Manifest
readFileSync of data/lesson-manifest.json (~15KB for 100 lessons, <1ms). Reconstruct RegExp objects from stored regexSources. Each hook invocation is a fresh process, but the file is small enough that cold-loading is negligible.
If manifest grows >50KB: Pre-group lessons by toolNames into separate files and load only the relevant one.
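A minimal sketch of the reconstruction step, assuming the manifest shape shown in §10.2. `hydrateManifest` is a hypothetical name; in the hook its argument would be the text returned by `readFileSync` for `data/lesson-manifest.json`.

```javascript
// Hypothetical loader step: turn stored { source, flags } pairs back into
// RegExp objects and convert toolNames to a Set for the first-pass filter.
function hydrateManifest(manifestText) {
  const manifest = JSON.parse(manifestText);
  const toRegExp = ({ source, flags }) => new RegExp(source, flags);
  return Object.entries(manifest.lessons).map(([id, lesson]) => ({
    id,
    ...lesson,
    toolNames: new Set(lesson.toolNames),
    commandRegexes: (lesson.commandRegexSources ?? []).map(toRegExp),
    pathRegexes: (lesson.pathRegexSources ?? []).map(toRegExp),
  }));
}
```

Because the sources are already valid regex strings, the only per-invocation cost is the `RegExp` constructor call for each pattern.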
Stage 3: Match Lessons
First-pass filter: Skip any lesson whose toolNames array doesn't include the current tool_name. O(1) per lesson via Set.
Pattern matching (only for lessons passing first-pass):
- Bash: Test `tool_input.command` against `commandRegexSources`
- Read/Edit/Write/Glob: Test `tool_input.file_path` against `pathRegexSources`
Each match produces: { lessonId, slug, matchedPattern, matchType: "command"|"path" }
Negative lookahead: Patterns like \bpytest\b(?!.*(--no-header)) prevent injection when the fix is already applied, avoiding nagging.
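The matching pass might look like the sketch below. `matchLessons` is a hypothetical name, and the lesson objects are assumed to carry a `toolNames` Set plus already-compiled `commandRegexes` and `pathRegexes` arrays.

```javascript
// Hypothetical matcher: first-pass filter on tool name, then regex tests.
function matchLessons(lessons, toolName, toolInput) {
  const isBash = toolName === "Bash";
  const subject = isBash ? toolInput.command : toolInput.file_path;
  if (typeof subject !== "string") return [];
  const matches = [];
  for (const lesson of lessons) {
    if (!lesson.toolNames.has(toolName)) continue; // first-pass filter
    const regexes = isBash ? lesson.commandRegexes : lesson.pathRegexes;
    const hit = regexes.find((re) => re.test(subject));
    if (hit) {
      matches.push({
        lessonId: lesson.id,
        slug: lesson.slug,
        matchedPattern: hit.source,
        matchType: isBash ? "command" : "path",
      });
    }
  }
  return matches;
}
```

One design point worth noting: a lookahead placed after `\bpytest\b` only inspects the rest of the command, so a fix that appears before the tool name (such as a `TERM=dumb` prefix) is not seen. Anchoring the lookahead at the start, for example `^(?!.*TERM=dumb).*\bpytest\b`, is one way to cover that case.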
Stage 4: Deduplicate & Rank
- Merge dedup state from 3 layers into `Set<string>`
- Filter already-seen lessons
- Optional tag boost: if project stack is detected (e.g., `pyproject.toml` exists → Python), boost matching `lang:` lessons by +1
- Sort by `priority` DESC, then `confidence` DESC
- Cap at `config.maxLessonsPerInjection`
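The ranking steps above can be sketched as one chained expression. `rankLessons` and `effectivePriority` are hypothetical names, and the sketch assumes each candidate carries `priority`, `confidence`, and `tags`.

```javascript
// Hypothetical ranking step. `seen` is the merged dedup Set of slugs and
// `stackTags` holds detected project tags such as "lang:python".
function rankLessons(candidates, seen, stackTags, maxLessons) {
  return candidates
    .filter((lesson) => !seen.has(lesson.slug))
    .map((lesson) => {
      const boosted = lesson.tags.some(
        (tag) => tag.startsWith("lang:") && stackTags.has(tag)
      );
      return { ...lesson, effectivePriority: lesson.priority + (boosted ? 1 : 0) };
    })
    .sort(
      (a, b) =>
        b.effectivePriority - a.effectivePriority || b.confidence - a.confidence
    )
    .slice(0, maxLessons);
}
```

Sorting on the boosted value rather than mutating `priority` keeps the stored score intact for logging.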
Stage 5: Inject
For each ranked lesson:
- Read `injection` from manifest (pre-rendered markdown, ~100-200 bytes)
- Budget check: first lesson always fits; subsequent checked against remaining `config.injectionBudgetBytes`
- If `injection` exceeds budget, try `summary` fallback
- Claim atomically: `fs.openSync(claimDir/slug, 'wx')`; EEXIST means another agent claimed it
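A sketch of the budget pass only (the atomic claim is left out here). `selectInjections` is a hypothetical name, and measuring sizes as UTF-8 byte lengths is an assumption based on the budget being expressed in bytes.

```javascript
// Hypothetical budget pass over lessons that are already ranked.
function selectInjections(rankedLessons, budgetBytes) {
  const bytes = (text) => Buffer.byteLength(text, "utf8");
  const injected = [];
  const dropped = [];
  let remaining = budgetBytes;
  for (const lesson of rankedLessons) {
    let text = lesson.injection;
    // The first lesson always fits; later ones must fit the remaining budget.
    if (injected.length > 0 && bytes(text) > remaining) {
      text = lesson.summary; // fall back to the one-line summary
      if (bytes(text) > remaining) {
        dropped.push(lesson.slug);
        continue;
      }
    }
    injected.push({ slug: lesson.slug, text });
    remaining -= bytes(text);
  }
  return { injected, dropped };
}
```

Tracking `dropped` slugs separately makes it straightforward to report them in the debugging metadata.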
Stage 6: Format Output
{
"hookSpecificOutput": {
"additionalContext": "[lessons-learned] Matched 2 lessons for Bash: pytest -v tests/\n\n## Lesson: pytest TTY hanging\n...\n\n## Lesson: verbose output stalls\n...\n\n<!-- lessonInjection: {\"version\":1,\"injected\":[...],\"dropped\":[]} -->"
},
"env": {
"LESSONS_SEEN": "slug1,slug2,previously-seen-slug"
}
}
HTML comment metadata enables debugging — inspect what was injected and why.
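One way to assemble that payload is sketched below. `formatOutput` is a hypothetical name, and because the example elides the contents of the `injected` array, listing slugs there is an assumption.

```javascript
// Hypothetical formatter. `injected` items are { slug, text }; `dropped`
// and `previouslySeen` are arrays of slugs.
function formatOutput(toolName, command, injected, dropped, previouslySeen) {
  if (injected.length === 0) return {};
  const slugs = injected.map((lesson) => lesson.slug);
  const meta = { version: 1, injected: slugs, dropped };
  const header = `[lessons-learned] Matched ${injected.length} lessons for ${toolName}: ${command}`;
  const additionalContext = [
    header,
    ...injected.map((lesson) => lesson.text),
    `<!-- lessonInjection: ${JSON.stringify(meta)} -->`,
  ].join("\n\n");
  // Newly injected slugs first, then earlier ones, without duplicates.
  const seen = [...new Set([...slugs, ...previouslySeen])];
  return {
    hookSpecificOutput: { additionalContext },
    env: { LESSONS_SEEN: seen.join(",") },
  };
}
```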
7. The 3-Layer Dedup System
Prevents the same lesson from being injected multiple times in a session. Three layers address different failure modes:
Layer 1: Environment Variable (LESSONS_SEEN)
- Mechanism: Hook outputs `env: { "LESSONS_SEEN": "slug1,slug2" }`. Claude Code passes this as an env var to the next hook invocation.
- Strengths: Fast reads within a single agent's linear execution chain.
- Limitation: Subagents spawned in parallel don't share env vars.
Layer 2: Session Temp File
- Mechanism: `$TMPDIR/lessons-<sha256(sessionId)>-seen.txt`, a comma-delimited list of seen slugs.
- Strengths: Cross-agent persistence. Both Agent A and Agent B read/write the same file.
- Limitation: Not atomic under concurrent writes.
Layer 3: O_EXCL Claim Directory
- Mechanism: Directory at `$TMPDIR/lessons-<sha256(sessionId)>-seen.d/`. To claim a lesson, create a file named after its slug with `fs.openSync(claimDir/slug, 'wx')`, as in Stage 5. If the file already exists, `openSync` throws EEXIST, meaning another agent already claimed it.
- Strengths: Atomic concurrent dedup. Even if two parallel subagents match the same lesson, only one wins.
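A runnable sketch of the claim, using Node's `fs.openSync` with the `"wx"` flag (exclusive create). `claimDirFor` and `claimLesson` are hypothetical names; hashing the raw session ID string and hex-encoding the digest are assumptions about how `sha256(sessionId)` is rendered in the directory name.

```javascript
import { createHash } from "node:crypto";
import { closeSync, mkdirSync, openSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Claim directory for a session, following the naming used in Layer 3.
function claimDirFor(sessionId, baseDir = tmpdir()) {
  const digest = createHash("sha256").update(sessionId).digest("hex");
  return join(baseDir, `lessons-${digest}-seen.d`);
}

// Returns true if this process won the claim, false if another one did.
function claimLesson(claimDir, slug) {
  mkdirSync(claimDir, { recursive: true });
  try {
    // "wx" opens for writing and fails with EEXIST if the file exists.
    closeSync(openSync(join(claimDir, slug), "wx"));
    return true;
  } catch (err) {
    if (err.code === "EEXIST") return false;
    throw err; // anything else (permissions, disk) is a real error
  }
}
```

Only `EEXIST` is treated as "already claimed"; other errors are rethrown so that a broken temp directory does not silently suppress every lesson.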
Merge Strategy
On each hook invocation: seen = union(envVarSlugs, sessionFileSlugs, claimDirSlugs). A lesson is injected only if its slug is NOT in this merged set.
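As a sketch, the union can be built from the three raw sources in one step. `mergeSeen` is a hypothetical name; its inputs are assumed to be the `LESSONS_SEEN` value, the session file's text, and the filenames listed in the claim directory.

```javascript
// Hypothetical merge of the three dedup layers into one Set of slugs.
function mergeSeen(envValue, sessionFileText, claimDirEntries) {
  const split = (text) =>
    (text ?? "")
      .split(",")
      .map((slug) => slug.trim())
      .filter(Boolean);
  return new Set([
    ...split(envValue),
    ...split(sessionFileText),
    ...claimDirEntries,
  ]);
}
```

A missing environment variable or an empty session file simply contributes nothing to the set.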
Context Compaction Re-injection
When Claude's context window fills, Claude Code runs compaction — summarizing the conversation to free space. After compaction, Claude no longer remembers previously-injected lesson text. If the same pitfall scenario arises again, the lesson needs re-injection.
- On `compact` event: clear lessons with `priority >= config.compactionReinjectionThreshold` (default 7) from dedup state
- On `clear` event: wipe all dedup state
- On `startup`/`resume`: no-op
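The three event rules can be expressed as a function over the seen set. This is a sketch: `resetDedupState` and `priorityBySlug` are hypothetical names, and keeping slugs with unknown priority in the seen set is an assumption.

```javascript
// Hypothetical reset rule. `seen` is a Set of slugs; `priorityBySlug` is a
// Map from slug to priority taken from the manifest.
function resetDedupState(event, seen, priorityBySlug, threshold = 7) {
  if (event === "clear") return new Set();
  if (event === "compact") {
    // Keep only low-priority slugs marked as seen; high-priority lessons
    // become eligible for re-injection. Unknown slugs default to 0 (kept).
    return new Set(
      [...seen].filter((slug) => (priorityBySlug.get(slug) ?? 0) < threshold)
    );
  }
  return new Set(seen); // startup / resume: no-op
}
```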
8. Priority and Confidence Scoring
Priority Computation (Auto-Discovered Lessons)
Priority is a composite score from observable signals. All weights are configurable via data/config.json under the scoring key.
basePriority = 3 (all auto-discovered lessons start here)
+multiSessionBonus (default +2) if pattern seen across 2+ sessions
+multiProjectBonus (default +1) if pattern seen across 2+ projects
+hangTimeoutBonus (default +1) if mistake caused a hang or timeout
+dataLossBonus (default +1) if mistake caused data loss or silent failure
+userCorrectionBonus (default +1) if user explicitly corrected the agent
+fixConfirmedBonus (default +1) if the correction was followed by success
+singleOccurrencePenalty (default -1) if only seen once
Final priority = clamp(sum, 1, 10)
Examples:
| Pattern | Signals | Score |
|---|---|---|
| pytest TTY hang | 5 sessions, 3 projects, hang, user corrected | 3+2+1+1+1 = 8 |
| Mock patch namespace | 3 sessions, 2 projects, user corrected, fix confirmed | 3+2+1+1+1 = 8 |
| Obscure pip flag | 1 session, 1 project, no user correction | 3-1 = 2 |
Manually curated seed lessons start at priority 7-9 (human judgment > heuristics).
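The priority formula and the three example rows can be checked with a short sketch. `computePriority` and the `signals` field names are hypothetical; "seen once" is interpreted here as an occurrence count of exactly 1.

```javascript
// Default weights, mirroring the scoring key described above.
const DEFAULT_SCORING = {
  multiSessionBonus: 2,
  multiProjectBonus: 1,
  hangTimeoutBonus: 1,
  dataLossBonus: 1,
  userCorrectionBonus: 1,
  fixConfirmedBonus: 1,
  singleOccurrencePenalty: -1,
};

// Hypothetical scorer: base 3, add each applicable bonus, clamp to 1..10.
function computePriority(signals, scoring = DEFAULT_SCORING) {
  let score = 3;
  if (signals.sessions >= 2) score += scoring.multiSessionBonus;
  if (signals.projects >= 2) score += scoring.multiProjectBonus;
  if (signals.hangOrTimeout) score += scoring.hangTimeoutBonus;
  if (signals.dataLoss) score += scoring.dataLossBonus;
  if (signals.userCorrected) score += scoring.userCorrectionBonus;
  if (signals.fixConfirmed) score += scoring.fixConfirmedBonus;
  if (signals.occurrences === 1) score += scoring.singleOccurrencePenalty;
  return Math.min(10, Math.max(1, score));
}

// The pytest TTY hang row: 3 + 2 + 1 + 1 + 1
console.log(
  computePriority({ sessions: 5, projects: 3, hangOrTimeout: true, userCorrected: true, occurrences: 5 })
); // prints 8
```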
Confidence Computation
Confidence reflects certainty that this is a real, recurring pattern (not noise):
baseConfidence = 0.4 (heuristic detection is inherently uncertain)
+0.20 if error-correction pair clearly identified (tool error → fix → success)
+0.15 if user explicitly corrected the agent (strongest signal)
+0.10 if same pattern seen in 2+ sessions
+0.10 if same pattern seen in 2+ projects
+0.05 if correction text contains causal language ("because", "root cause", "the issue is")
Final confidence = clamp(sum, 0.0, 1.0)
Lessons with confidence < config.minConfidence (default 0.5) are stored but excluded from the manifest. They're flagged "needsReview": true for manual inspection via /scan-lessons.
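A sketch of the confidence sum and the review flag. `computeConfidence` and the signal names are hypothetical; the sum is accumulated in integer hundredths so that floating-point rounding cannot push a value across the threshold.

```javascript
// Hypothetical scorer working in hundredths (40 = 0.40).
function computeConfidence(signals) {
  let hundredths = 40; // baseConfidence
  if (signals.errorCorrectionPair) hundredths += 20;
  if (signals.userCorrected) hundredths += 15;
  if (signals.multiSession) hundredths += 10;
  if (signals.multiProject) hundredths += 10;
  if (signals.causalLanguage) hundredths += 5;
  return Math.min(100, Math.max(0, hundredths)) / 100;
}

// Lessons under the threshold are stored but flagged for manual review.
function needsReview(confidence, minConfidence = 0.5) {
  return confidence < minConfidence;
}
```

With every signal present the sum reaches exactly 1.0, so the upper clamp only matters if additional bonuses are introduced.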
9. Configuration
All tunable settings live in data/config.json:
{
"$schema": "./schemas/config.schema.json",
"type": "lessons-learned-config",
"version": 1,
// Injection behavior
"injectionBudgetBytes": 4096,
"maxLessonsPerInjection": 3,
"minConfidence": 0.5,
"minPriority": 1,
"compactionReinjectionThreshold": 7,
// Scanner behavior
"scanPaths": ["~/.claude/projects/"],
"autoScanIntervalHours": 24,
"maxCandidatesPerScan": 50,
// Scoring weights
"scoring": {
"multiSessionBonus": 2,
"multiProjectBonus": 1,
"hangTimeoutBonus": 1,
"dataLossBonus": 1,
"userCorrectionBonus": 1,
"fixConfirmedBonus": 1,
"singleOccurrencePenalty": -1,
},
}
All data files (config.json, lessons.json, lesson-manifest.json) include $schema, type, and version fields. JSON Schema files in schemas/ provide IDE autocomplete, validation, and hover docs.
The manifest snapshots config values at build time so the hook never reads config.json at runtime.
10. Data Schemas
10.1 Lesson Store (data/lessons.json)
{
"$schema": "./schemas/lessons.schema.json",
"type": "lessons-learned-store",
"version": 1,
"lessons": [
{
// --- Identity ---
"id": "01JQXYZ...",
// ULID — collision-free, naturally sorted by creation time.
// 48-bit ms timestamp + 80-bit random. Lexicographic sort = chronological.
"slug": "pytest-tty-hanging-x7k2",
// Human-readable slug + 4-char random suffix for guaranteed uniqueness.
// Format: kebab-case-summary-XXXX (X = base36 alphanumeric).
// Used in dedup claim filenames, log output, CLI references.
// --- Content ---
"summary": "pytest hangs in non-interactive envs due to TTY detection",
// One-line description. Self-contained. Fallback injection text when full
// injection exceeds budget. Max ~100 chars.
"mistake": "Running bare `pytest` or `pytest -v` in Claude Code causes the process to hang indefinitely because pytest's rich output module detects a non-interactive terminal and stalls.",
// Root cause explanation. Explains WHY the failure occurs, not just the symptom.
"remediation": "Use `python -m pytest --no-header -rN -p no:faulthandler` or prepend `TERM=dumb`. Pipe through `cat` if rich output is still suspected.",
// Concrete fix. Actionable commands or code changes. Copy-pasteable where possible.
"injection": "## Lesson: pytest TTY hanging\npytest hangs in Claude Code. Use:\n`python -m pytest --no-header -rN -p no:faulthandler`\nor prepend `TERM=dumb`.",
// Pre-rendered markdown for hook injection. The ONLY field read in the hot path.
// Kept under 200 bytes. Generated from summary + remediation at build time.
// --- Trigger Patterns ---
"triggers": {
"toolNames": ["Bash"],
// Which tools this lesson applies to. First-pass O(1) filter.
"commandPatterns": ["\\bpytest\\b(?!.*(--no-header|-p no:faulthandler|TERM=dumb))"],
// Regex patterns tested against command strings (Bash tool_input.command).
// Negative lookahead prevents injection when the fix is already applied.
"pathPatterns": [],
// Glob patterns tested against file_path (Read/Edit/Write/Glob tools).
// Compiled to regex at manifest build time.
"contentPatterns": [],
// (Tentative) Regex for file content or command output. TBD post-harvest.
},
// --- Metadata ---
"priority": 8,
// 1-10. Computed from scoring signals. See § Priority Computation.
"confidence": 0.95,
// 0.0-1.0. How certain this is a real, recurring pattern. See § Confidence.
"needsReview": false,
// True for auto-discovered lessons below confidence threshold.
// Stored but not injected until confirmed.
"tags": ["lang:python", "tool:pytest", "topic:testing", "env:claude-code", "severity:hang"],
// Labeled tags (Datadog-style). Format: "category:value".
// lang: — programming language (python, typescript, go, rust)
// tool: — CLI tool or library (pytest, git, npm, pip, docker)
// topic: — conceptual domain (testing, packaging, ci, filesystem, async)
// env: — execution environment (claude-code, codex, docker, ci)
// severity: — failure type (hang, error, silent, data-loss)
// platform: — OS-specific (macos, linux, windows)
// --- Provenance ---
"sourceSessionIds": ["abc-123"],
// Session IDs where this pattern was observed. Empty for manually authored seeds.
"occurrenceCount": 5,
// Times the scanner detected this pattern across sessions.
"createdAt": "2026-03-28T14:00:00Z",
"updatedAt": "2026-03-28T14:00:00Z",
"contentHash": "sha256:a1b2c3...",
// SHA-256 of (mistake + remediation + triggers). Scanner uses for dedup.
},
],
}
10.2 Manifest (data/lesson-manifest.json)
{
"$schema": "./schemas/manifest.schema.json",
"type": "lessons-learned-manifest",
"version": 1,
"generatedAt": "2026-03-28T14:00:00Z",
// When this manifest was last built. If lessons.json is newer, manifest is stale.
"config": {
// Snapshot of config values at build time.
// Hook reads these instead of loading config.json at runtime.
"injectionBudgetBytes": 4096,
"maxLessonsPerInjection": 3,
"minConfidence": 0.5,
"minPriority": 1,
"compactionReinjectionThreshold": 7,
},
"lessons": {
"01JQXYZ...": {
// Keyed by ULID for direct lookup.
"slug": "pytest-tty-hanging-x7k2",
// For logging and dedup claim filenames.
"priority": 8,
// For sort-time ranking without loading the full store.
"toolNames": ["Bash"],
// First-pass filter. Hook skips if current tool not in this array.
"commandRegexSources": [
{ "source": "\\bpytest\\b(?!.*(--no-header|-p no:faulthandler|TERM=dumb))", "flags": "i" },
],
// Pre-compiled regex sources for command matching.
// Reconstruct at load time: new RegExp(source, flags).
// No glob compilation or pattern parsing at runtime.
"pathRegexSources": [],
// Pre-compiled regex sources for file path matching.
// Globs from lessons.json pathPatterns → regex at build time.
"tags": ["lang:python", "tool:pytest"],
// For optional tag-based priority boosting at runtime.
"injection": "## Lesson: pytest TTY hanging\npytest hangs in Claude Code. Use:\n`python -m pytest --no-header -rN -p no:faulthandler`\nor prepend `TERM=dumb`.",
// Pre-rendered markdown. The ONLY content the hook reads for injection.
// No file I/O, no template rendering.
"summary": "pytest hangs in non-interactive envs due to TTY detection",
// Fallback if injection exceeds remaining budget.
},
},
}
11. Structured Self-Reporting via #lesson Tags
The Core Insight
Instead of building complex heuristics to retroactively identify mistakes from the thousand ways an agent might phrase a correction, we define the output format and let the agent self-report. This inverts the problem: the scanner becomes a simple grep, not a natural language classifier.
How It Works
A SessionStart hook injects a standing instruction into every session telling the agent to use a deterministic #lesson tag whenever it identifies a mistake, troubleshoots an issue, or resolves a problem. The instruction is compact and specific:
## Lesson Reporting Protocol
When you encounter or recover from a mistake during this session, emit a structured
lesson tag in your response. This enables automatic capture for future prevention.
Format:
#lesson
tool: <tool_name>
trigger: <what_command_or_action_triggered_the_issue>
mistake: <what_went_wrong_and_why>
fix: <the_correction_that_resolved_it>
tags: <comma_separated_category:value_tags>
#/lesson
Example:
#lesson
tool: Bash
trigger: pytest -v tests/
mistake: pytest hangs in non-interactive environments due to TTY rich output detection
fix: Use `python -m pytest --no-header -rN -p no:faulthandler` or prepend TERM=dumb
tags: lang:python, tool:pytest, severity:hang
#/lesson
Emit this tag naturally as part of your response whenever you:
- Discover why a tool call failed and apply a different approach
- Catch yourself about to repeat a known mistake
- Receive a user correction ("no", "wrong", "that's not right")
- Identify a root cause after debugging
Do NOT force lesson tags where none apply. Only tag genuine mistake→correction sequences.
Why This Is Transformative
| Aspect | Heuristic Detection (Before) | Structured Self-Reporting (After) |
|---|---|---|
| Scanner complexity | Sliding window, 5+ signal types, NLP heuristics | grep '#lesson' + JSON-like block parse |
| Accuracy | ~80% with false positives | ~95%+ (agent understands its own context) |
| Trigger patterns | Must be reverse-engineered from error text | Agent provides them directly (trigger: field) |
| Root cause quality | Inferred from correction text | Agent explains it with full context |
| Token cost | Large context windows for heuristic analysis | Minimal — structured blocks are small |
| Speed | ~2s for full scan | ~200ms for full scan (simple string match) |
| Cross-agent | Each agent phrases corrections differently | Same tag format works for Claude, Codex, Gemini |
The Two-Tier Scanner
The #lesson tag creates a two-tier detection architecture:
Tier 1 (Primary): Structured tag detection
- Scan for `#lesson` / `#/lesson` block boundaries in assistant messages
- Parse the semi-structured fields (tool, trigger, mistake, fix, tags)
- Extremely fast: raw string search, no JSON.parse needed for detection
- High confidence: the agent consciously decided to emit this tag
Tier 2 (Fallback): Heuristic detection
- For historical sessions that predate the `#lesson` tag injection
- For sessions where the agent didn't comply with the protocol
- Same sliding-window approach described in the original scanner design
- Lower confidence scores (the agent didn't self-identify these as lessons)
Over time, as more sessions include the #lesson tag instruction, Tier 2 becomes less important. Eventually it serves only as a safety net for edge cases.
Compliance Validation
The critical question: will agents actually emit #lesson tags consistently?
This requires empirical validation before we can rely on it as the primary detection mechanism.
Validation plan:
- Phase 0 (Experiment): Before building the full scanner, inject the `#lesson` protocol via SessionStart hook for 2 weeks of normal development work.
- Measure compliance: After 2 weeks, scan sessions for:
  - Count of `#lesson` tags emitted vs. count of mistake patterns detected by Tier 2 heuristics
  - Compliance rate = tags / (tags + heuristic-only detections)
  - Quality assessment: are the self-reported tags accurate and well-structured?
- Compliance thresholds:
  - >80% compliance: Proceed with Tier 1 as primary, Tier 2 as fallback
  - 50-80% compliance: Use both tiers equally, investigate why compliance drops (compaction? competing instructions? edge cases?)
  - <50% compliance: Re-evaluate the injection strategy; the instruction may need to be stronger, differently positioned, or the format simplified
- Known risks to compliance:
  - Context compaction: After summarization, the `#lesson` instruction may be lost. Mitigation: the SessionStart hook re-injects on `compact` events.
  - Instruction competition: Other plugins/skills inject their own instructions. The `#lesson` protocol must be concise enough to survive priority triage.
  - Agent discretion: The instruction says "Do NOT force lesson tags where none apply", so agents may be too conservative. We may need to tune the prompt.
  - Subagent inheritance: Subagents may not receive the SessionStart injection. Mitigation: inject via `SubagentStart` hook as well.
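The compliance rate and threshold decision from the validation plan reduce to a few lines. In this sketch, `assessCompliance` and the decision labels are hypothetical, and a rate of exactly 80% is placed in the middle band because the top band is stated as ">80%".

```javascript
// Hypothetical audit helper: rate = tags / (tags + heuristic-only detections).
function assessCompliance(tagCount, heuristicOnlyCount) {
  const total = tagCount + heuristicOnlyCount;
  if (total === 0) return { rate: null, decision: "insufficient-data" };
  const rate = tagCount / total;
  let decision;
  if (rate > 0.8) decision = "tier1-primary";
  else if (rate >= 0.5) decision = "both-tiers";
  else decision = "re-evaluate";
  return { rate, decision };
}

console.log(assessCompliance(45, 5).decision); // prints "tier1-primary"
```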
Integration with the Hook System
The #lesson tag injection adds two new hooks to hooks.json, one for SessionStart and one for SubagentStart:
{
"SessionStart": [
{
"matcher": "startup|clear|compact",
"hooks": [
{
"type": "command",
"command": "node \"${CLAUDE_PLUGIN_ROOT}/hooks/session-start-lesson-protocol.mjs\""
}
]
}
],
"SubagentStart": [
{
"matcher": ".+",
"hooks": [
{
"type": "command",
"command": "node \"${CLAUDE_PLUGIN_ROOT}/hooks/subagent-start-lesson-protocol.mjs\""
}
]
}
]
}
Both hooks inject the same #lesson protocol instruction via additionalContext. The SessionStart hook also handles dedup reset (as designed in §7). The SubagentStart hook ensures spawned agents also know the protocol.
Token cost: The protocol instruction is ~200 tokens. Injected once per session (and once per subagent). Negligible compared to typical session length.
Impact on Lesson Schema
Self-reported lessons arrive with richer, more accurate data than heuristic-detected ones:
| Field | Heuristic-detected | Self-reported via #lesson |
|---|---|---|
| triggers.toolNames | Inferred from surrounding tool_use blocks | Provided directly (tool: field) |
| triggers.commandPatterns | Reverse-engineered from error context | Provided directly (trigger: field) |
| mistake | Reconstructed from correction text | Agent's own explanation (mistake: field) |
| remediation | Extracted from the successful retry | Agent's own fix (fix: field) |
| tags | Inferred from context | Provided directly (tags: field) |
| confidence | 0.4-0.8 (heuristic uncertainty) | 0.85+ (agent consciously reported it) |
| priority | Computed from signals | Computed from signals + self-report bonus |
The scoring formulas gain a new signal:
basePriority adjustment:
+1 if lesson was self-reported via #lesson tag (agent consciously identified it)
baseConfidence adjustment:
+0.25 if self-reported via #lesson tag (replaces the +0.20 error-correction pair bonus)
Impact on Implementation Phases
This shifts Phase 1 significantly:
Phase 0 (NEW): Compliance Experiment
- Implement `hooks/session-start-lesson-protocol.mjs` (just the instruction injection)
- Implement `hooks/subagent-start-lesson-protocol.mjs`
- Add both to `hooks/hooks.json`
- Install the plugin (even with no scanner or PreToolUse hook yet)
- Use normally for 2 weeks
- Run a simple grep-based audit: count `#lesson` occurrences in new session files
- Assess compliance rate and tag quality
- Decide whether Tier 1 or Tier 2 is the primary scanner
Phase 1 then becomes: Build scanner with Tier 1 (tag detection) primary + Tier 2 (heuristic) fallback, informed by real compliance data.
12. Log Scanner
Overview
A Node.js CLI that processes session JSONL files to discover mistake → correction patterns. Uses a two-tier detection architecture: structured #lesson tag parsing (primary) and heuristic detection (fallback).
Incremental Scanning
scan-state.json tracks per-file progress:
{
"version": 1,
"lastScanAt": "2026-03-28T14:00:00Z",
"files": {
"/path/to/session.jsonl": {
"byteOffset": 1548320,
"mtimeMs": 1774723046519,
"sizeBytes": 2750000
}
}
}
On each scan:
- Enumerate `*.jsonl` files in `config.scanPaths`
- Check `mtimeMs` and `sizeBytes` against scan state; skip unchanged files
- For grown files: `createReadStream(file, { start: byteOffset })` to read only new data
- Update scan state after processing
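The per-file decision can be sketched as a pure function over the saved state and a fresh stat result. `planFileScan` is a hypothetical name; `stat` is assumed to look like Node's `fs.Stats` (`mtimeMs`, `size`), and restarting from byte 0 when a file has shrunk is an assumption the steps above do not cover.

```javascript
// Hypothetical planner. `previous` is the scan-state entry for the file
// (or undefined if the file has never been scanned).
function planFileScan(previous, stat) {
  if (!previous) return { action: "scan", start: 0 };
  const unchanged =
    previous.mtimeMs === stat.mtimeMs && previous.sizeBytes === stat.size;
  if (unchanged) return { action: "skip" };
  // Assumption: a file smaller than the saved offset was truncated or
  // rewritten, so the saved offset is no longer meaningful.
  if (stat.size < previous.byteOffset) return { action: "scan", start: 0 };
  return { action: "scan", start: previous.byteOffset };
}
```

The returned `start` is the value that would be passed as the `start` option of `createReadStream`.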
Streaming Architecture
createReadStream({ start: byteOffset })
→ readline (line-by-line, constant ~64KB buffer)
→ fast pre-filter: regex match "type":"assistant" before JSON.parse
→ Tier 1: scan for #lesson tags (simple string match)
→ Tier 2: sliding window heuristic detector (fallback)
Memory: Constant ~1MB regardless of file size.
Speed: ~100MB/s throughput. Full scan of 200MB (595 files): ~2 seconds. Incremental scan of 5MB new data: ~50ms.
Tier 1: Structured Tag Detection (Primary)
Scans assistant message text blocks for #lesson / #/lesson boundaries:
#lesson
tool: Bash
trigger: pytest -v tests/
mistake: pytest hangs due to TTY detection
fix: Use pytest --no-header -rN -p no:faulthandler
tags: lang:python, tool:pytest, severity:hang
#/lesson
Detection: Simple regex /#lesson\n([\s\S]*?)#\/lesson/g on each assistant text block. No sliding window, no NLP heuristics.
Parsing: Split block by newlines, extract key: value pairs. Flexible — unknown keys are ignored, missing keys get defaults.
Confidence: Self-reported lessons start at confidence: 0.85 (agent consciously identified the pattern).
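The detection and parsing rules above could look roughly like this. The regex is the one given under Detection; `parseLessonBlocks`, `KNOWN_KEYS`, and the specific default values are assumptions made for this sketch, since the final lesson schema is not yet fixed.

```javascript
// Regex from the Detection rule: lazy match between #lesson and #/lesson.
const LESSON_BLOCK = /#lesson\n([\s\S]*?)#\/lesson/g;
// Keys taken from the example block; anything else is ignored.
const KNOWN_KEYS = ['tool', 'trigger', 'mistake', 'fix', 'tags'];

function parseLessonBlocks(text) {
  const lessons = [];
  for (const match of text.matchAll(LESSON_BLOCK)) {
    // Hypothetical defaults for missing keys; self-reported confidence is 0.85.
    const lesson = { tool: 'unknown', trigger: '', mistake: '', fix: '', tags: [], confidence: 0.85 };
    for (const rawLine of match[1].split('\n')) {
      // Split on the first colon only, so values like "no:faulthandler" survive.
      const idx = rawLine.indexOf(':');
      if (idx === -1) continue;
      const key = rawLine.slice(0, idx).trim().toLowerCase();
      const value = rawLine.slice(idx + 1).trim();
      if (!KNOWN_KEYS.includes(key)) continue;
      lesson[key] = key === 'tags'
        ? value.split(',').map((t) => t.trim()).filter(Boolean)
        : value;
    }
    lessons.push(lesson);
  }
  return lessons;
}
```

Splitting on the first colon matters for this format, because both `fix:` values and `tags:` values routinely contain colons of their own.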
Tier 2: Heuristic Detection (Fallback)
For historical sessions and compliance gaps. Operates on a sliding window of conversation turns:
| Pattern | Signal | Confidence Boost |
|---|---|---|
| Tool error output → assistant correction text → new tool call | Error-correction pair | +0.20 |
| User says "no"/"wrong"/"that's not right" → assistant acknowledges | Explicit user correction | +0.15 |
| Same tool called 3+ times with modifications | Retry loop | +0.10 |
| Bash timeout or empty output after >30s | Hang/stall | +0.10 |
| Assistant text contains "can't", "doesn't", "the issue is", "root cause" | Self-diagnosis | +0.05 |
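One way to combine the boosts in the table is a simple additive score. The boost values come from the table; the signal names, the `scoreCandidate` function, and the base of 0.30 are hypothetical, since the table does not state a starting confidence. Scores are accumulated in hundredths so that sums stay exact.

```javascript
// Confidence boosts from the table, in hundredths (0.20 -> 20).
const BOOSTS = {
  errorCorrection: 20, // tool error -> correction text -> new tool call
  userCorrection: 15,  // explicit user correction
  retryLoop: 10,       // same tool called 3+ times with modifications
  hangStall: 10,       // Bash timeout or empty output after >30s
  selfDiagnosis: 5,    // "can't", "the issue is", "root cause", ...
};

// basePoints is an assumed starting confidence (30 -> 0.30), not from the PRD.
function scoreCandidate(signals, basePoints = 30) {
  let points = basePoints;
  // Each signal counts once per window; unknown names contribute nothing.
  for (const name of new Set(signals)) points += BOOSTS[name] || 0;
  // Cap at 1.0 and convert back to a 0..1 confidence.
  return Math.min(points, 100) / 100;
}
```

Under these assumptions a window showing both an error-correction pair and an explicit user correction scores 0.65, still below the 0.85 assigned to self-reported Tier 1 lessons.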
Classification Pipeline
Three modes:
- Tier 1 auto-classify (for `#lesson` tagged entries): Parse structured fields directly into lesson schema. Minimal LLM involvement: just generate the `injection` field and refine the `commandPatterns` regex from the `trigger:` text. Can run fully automated.
- LLM-assisted (via `/scan-lessons` command): Scanner outputs Tier 2 heuristic candidates → current Claude session reviews, classifies, and structures them → writes to lesson store. Highest quality for untagged history.
- Fully automated heuristic (`--auto` flag): Heuristic-only classification for Tier 2. Lower confidence, flagged `needsReview: true`.
13. CLI Tool Intelligence Aggregation
When lessons accumulate densely for a specific tool (5+ lessons tagged tool:<name>), they can be auto-aggregated into a coherent skill file:
tool:pytest → 7 lessons → skills/cli-intel/pytest.md
tool:git → 12 lessons → skills/cli-intel/git.md
tool:docker → 5 lessons → skills/cli-intel/docker.md
Each generated skill follows SKILL.md format with frontmatter:
---
metadata:
name: cli-intel-pytest
commandPatterns: ["\\bpytest\\b"]
priority: 6
summary: 'Known pytest pitfalls in AI agent environments'
---
# pytest: Known Pitfalls
## TTY Detection Hanging
pytest hangs in non-interactive environments...
## Verbose Output Stalling
The `...` in progress output triggers REPL detection...
When to switch: Once individual lessons for a tool exceed the 3-lesson cap, a unified skill is more coherent and provides complete tool-level guidance in one injection.
Coexistence: Individual lessons remain in the store for exact-match scenarios. The hook prefers the aggregated skill when available.
This is a Phase 3+ feature.
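Finding tools that are dense enough to aggregate could be sketched as follows. It assumes each lesson carries a `tags` array of `category:value` strings and that tool names are safe to use as file names; `findAggregationCandidates` is a hypothetical name, and the default threshold of 5 is the one proposed above.

```javascript
// Group lessons by their tool:<name> tag and report tools at or above the threshold.
function findAggregationCandidates(lessons, threshold = 5) {
  const byTool = new Map();
  for (const lesson of lessons) {
    // A lesson counts once per tool even if the tag is repeated.
    const tools = new Set(
      (lesson.tags || [])
        .filter((tag) => tag.startsWith('tool:'))
        .map((tag) => tag.slice('tool:'.length))
    );
    for (const name of tools) {
      if (!byTool.has(name)) byTool.set(name, []);
      byTool.get(name).push(lesson);
    }
  }
  return [...byTool]
    .filter(([, group]) => group.length >= threshold)
    .map(([name, group]) => ({
      tool: name,
      count: group.length,
      // Target path follows the skills/cli-intel/<tool>.md layout shown above.
      skillPath: `skills/cli-intel/${name}.md`,
    }));
}
```

Generating the skill body itself (merging lesson text into sections) is a separate step and is not covered by this sketch.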
14. Directory Structure
lessons-learned/
├── .plugin/
│ └── plugin.json # Plugin manifest
├── hooks/
│ ├── hooks.json # Hook registrations
│ ├── pretooluse-lesson-inject.mjs # Core: 6-stage match → inject pipeline
│ ├── session-start-reset.mjs # Reset dedup on clear/compact
│ ├── session-start-lesson-protocol.mjs # Inject #lesson self-reporting protocol
│ ├── subagent-start-lesson-protocol.mjs # Inject #lesson protocol into subagents
│ └── lib/
│ ├── stdin.mjs # Parse hook stdin JSON
│ ├── dedup.mjs # O_EXCL file-lock dedup (3-layer)
│ └── output.mjs # Format hook stdout JSON
├── commands/
│ ├── scan-lessons.md # /scan-lessons slash command
│ └── add-lesson.md # /add-lesson for manual entry
├── scripts/
│ ├── scan.mjs # CLI: scan logs for candidates
│ ├── build-manifest.mjs # CLI: compile lessons.json → manifest
│ ├── add-lesson.mjs # CLI: add structured lesson to store
│ └── scanner/
│ ├── incremental.mjs # Byte-offset tracking per file
│ ├── structured.mjs # Tier 1: #lesson tag parser (primary)
│ ├── detector.mjs # Tier 2: heuristic pattern detection (fallback)
│ └── extractor.mjs # Extract candidate windows from matches
├── schemas/
│ ├── config.schema.json # JSON Schema for config.json
│ ├── lessons.schema.json # JSON Schema for lessons.json
│ └── manifest.schema.json # JSON Schema for lesson-manifest.json
├── data/
│ ├── config.json # Plugin configuration
│ ├── lessons.json # Source of truth (seed + discovered)
│ ├── lesson-manifest.json # Pre-compiled patterns (generated)
│ └── scan-state.json # Incremental scan bookmarks
├── package.json
└── README.md
15. Implementation Phases
Phase 0: Compliance Experiment
- Implement `session-start-lesson-protocol.mjs` to inject the `#lesson` self-reporting protocol
- Run 10-20 real coding sessions with the protocol active
- Measure compliance: what % of mistakes produce a well-formed `#lesson` tag?
- Categorize failures: missed entirely, malformed, incomplete fields, wrong timing
- Decision gate: If compliance > 70%, Tier 1 (structured) is the primary scanner. If < 40%, Tier 2 (heuristic) is primary. Between 40% and 70%, both tiers run and their results are merged.
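The decision gate can be written down as a small function so the thresholds are explicit. `complianceRate`, `chooseScannerMode`, and the returned mode labels are hypothetical names; the 70% and 40% thresholds are the ones stated in the gate, with the boundary values themselves falling into the merged band.

```javascript
// Fraction of observed mistakes that produced a well-formed #lesson tag.
function complianceRate(wellFormedTags, observedMistakes) {
  // No observed mistakes means there is nothing to measure yet.
  if (observedMistakes === 0) return 0;
  return wellFormedTags / observedMistakes;
}

// Map a compliance rate (0..1) to the scanner configuration from the decision gate.
function chooseScannerMode(compliance) {
  if (compliance > 0.70) return 'tier1-primary';
  if (compliance < 0.40) return 'tier2-primary';
  // 40% to 70% inclusive: run both tiers and merge their results.
  return 'merge-both';
}
```

For example, 17 well-formed tags out of 20 manually identified mistakes gives 0.85 and selects Tier 1 as primary.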
Phase 1: Harvest & Schema Validation
- Build `scripts/scan.mjs` and `scripts/scanner/` (structured parser + heuristic detector, extractor, incremental)
- Run a full scan against all existing sessions and collect ALL candidates
- Analyze candidate shapes: what fields do they need? What trigger patterns emerge?
- Finalize the lesson schema based on real data
- Curate seed lessons from the best candidates
Phase 2: Plugin Skeleton + Hook
- Create `.plugin/plugin.json`, `package.json`, `schemas/`
- Implement `scripts/build-manifest.mjs`
- Implement `hooks/lib/stdin.mjs`, `hooks/lib/dedup.mjs`, `hooks/lib/output.mjs`
- Implement `hooks/pretooluse-lesson-inject.mjs` (6-stage pipeline)
- Implement `hooks/session-start-reset.mjs`, `hooks/subagent-start-lesson-protocol.mjs`, and `hooks/hooks.json`
- Test: Install plugin, verify lessons inject correctly
Phase 3: Store Operations + Commands + CLI Intelligence
- Implement `scripts/add-lesson.mjs` (ULID generation, slug with random suffix)
- Auto-rebuild manifest on lesson store changes
- Create `commands/scan-lessons.md` and `commands/add-lesson.md`
- Begin aggregating dense tool clusters into CLI intelligence skills
Phase 4: Automation + Cross-Agent
- Add SessionStart background scan trigger
- Add `--auto` mode for heuristic-only classification
- Refactor core logic into agent-agnostic `core/` module
- Document adapter interface for Codex/Gemini
16. Key Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| IDs | ULID | Collision-free, chronological sort, no coordination needed |
| Slugs | kebab-case + 4-char random suffix | Human-readable; the random suffix makes collisions unlikely |
| Tags | Datadog-style `category:value` | Enables category-aware scoring and CLI tool aggregation |
| Pattern naming | `commandPatterns` (not `bashPatterns`) | Agent-agnostic: applies to any shell/command tool |
| Config | `data/config.json` with JSON Schema | Single source of truth, IDE autocomplete via `$schema` |
| All data files | `$schema` + `type` + `version` fields | IDE validation, type discrimination, schema evolution |
| Indexing | Linear scan with pre-compiled RegExp | <500 lessons expected; proven at 50+ skills in <5ms |
| Priority | Composite score from configurable signals | Transparent, reproducible, tunable |
| Confidence | Composite score gating injection | Low-confidence lessons suppressed until reviewed |
| Scanner | Node.js streaming JSONL, byte-offset tracking | Constant memory, incremental, ~100MB/s |
| CLI intelligence | Auto-aggregate 5+ lessons per tool into skill | Reduces noise, provides coherent tool-level guidance |
| Self-reporting | `#lesson` structured tags injected via SessionStart | Deterministic scanning (grep) vs. NLP heuristics; requires compliance validation |
| Two-tier scanner | Tier 1 structured (primary) + Tier 2 heuristic (fallback) | Structured tags for new sessions; heuristics for historical logs without tags |
| Dependencies | Zero npm deps for hooks | Reliability — hooks must never fail due to missing packages |
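The slug decision can be illustrated with a short sketch. `makeSlug` and `randomSuffix` are hypothetical helpers, and the lowercase alphanumeric suffix alphabet is an assumption. `Math.random` is used only to keep the sketch dependency-free; it is not cryptographically strong, and a 4-character suffix reduces collisions rather than ruling them out.

```javascript
// Turn a lesson title into kebab-case and append a caller-supplied suffix.
function makeSlug(title, suffix) {
  const base = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse every non-alphanumeric run into one dash
    .replace(/^-+|-+$/g, '');    // trim leading and trailing dashes
  return `${base}-${suffix}`;
}

// Draw a short random suffix (assumed alphabet: a-z and 0-9).
function randomSuffix(length = 4, alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789') {
  let out = '';
  for (let i = 0; i < length; i++) {
    out += alphabet[Math.floor(Math.random() * alphabet.length)];
  }
  return out;
}
```

Passing the suffix in as a parameter keeps `makeSlug` deterministic, which makes it straightforward to unit test.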
17. Verification Plan
| Test | Method | Criteria |
|---|---|---|
| Scanner accuracy | Run against sessions with known mistakes | Detects at least 80% of manually-identified patterns |
| Hook unit test | Pipe JSON to hook, inspect stdout | Correct lesson injected for matching input |
| Hook performance | Time 100 invocations with 100+ lesson manifest | p99 < 50ms |
| Dedup correctness | Same tool called twice in one session | Lesson injected exactly once |
| Concurrent dedup | Simulate 2 parallel subagents matching same lesson | Only one injection via O_EXCL |
| Compaction re-injection | Trigger compact event, re-match same tool | High-priority lesson re-injects |
| Budget enforcement | Create lesson with >4KB injection text | Falls back to summary or drops |
| CLI intelligence | Store 6 lessons tagged `tool:pytest` | Skill file auto-generated |
| `#lesson` compliance | Run 10-20 sessions with protocol active | >70% well-formed tags from real mistakes |
| Structured scanner | Feed sessions with `#lesson` tags to Tier 1 parser | All well-formed tags extracted with correct fields |
| End-to-end | Install plugin, run `pytest tests/` | Lesson appears in agent context |
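For the hook performance row, the p99 check could be computed as below once the 100 invocation timings have been collected. `percentile` and `meetsLatencyBudget` are hypothetical helpers, and the nearest-rank method is one common percentile definition chosen here for simplicity; how the timings are collected is left out.

```javascript
// Nearest-rank percentile: the smallest sample such that p% of samples are <= it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// True when the p99 of the measured durations is under the budget (50ms by default).
function meetsLatencyBudget(durationsMs, budgetMs = 50) {
  return percentile(durationsMs, 99) < budgetMs;
}
```

With exactly 100 samples, nearest-rank p99 is the 99th smallest value, so a single slow outlier does not fail the check but two do.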
18. Dependencies
| Dependency | Scope | Notes |
|---|---|---|
| Node.js built-ins (fs, path, crypto, os, readline) | All | Zero external dependencies for hooks and scanner |
| `ulid` (npm) or inline implementation | Scripts | ~20 lines if inlined; only used at lesson creation time |
| JSON Schema files | Dev/IDE | Manual or generated from TypeScript types |
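An inlined ULID encoder along the lines of the "~20 lines" estimate might look like this. It follows the ULID layout (a 48-bit millisecond timestamp as 10 Crockford base32 characters, then 80 random bits as 16 characters) but omits the spec's optional monotonic mode. `encodeUlid` is a hypothetical name, and the caller supplies the random bytes, for example from `crypto.randomBytes(10)` in `node:crypto`.

```javascript
// Crockford base32 alphabet (no I, L, O, U).
const CROCKFORD = '0123456789ABCDEFGHJKMNPQRSTVWXYZ';

// Encode a millisecond timestamp and 10 random bytes as a 26-character ULID.
function encodeUlid(timeMs, randomBytes) {
  if (!Number.isInteger(timeMs) || timeMs < 0 || timeMs > 2 ** 48 - 1) {
    throw new RangeError('timeMs must be a 48-bit non-negative integer');
  }
  if (randomBytes.length !== 10) {
    throw new RangeError('need exactly 10 random bytes');
  }
  // Timestamp: 10 characters, most significant first.
  let timePart = '';
  let t = timeMs;
  for (let i = 0; i < 10; i++) {
    timePart = CROCKFORD[t % 32] + timePart;
    t = Math.floor(t / 32);
  }
  // Randomness: pack 80 bits into a BigInt, then peel off 5 bits per character.
  let r = 0n;
  for (const byte of randomBytes) r = (r << 8n) | BigInt(byte);
  let randPart = '';
  for (let i = 0; i < 16; i++) {
    randPart = CROCKFORD[Number(r & 31n)] + randPart;
    r >>= 5n;
  }
  return timePart + randPart;
}
```

Because the timestamp occupies the leading characters, IDs created in different milliseconds sort chronologically as plain strings, which is the property the design decisions table relies on.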
19. Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Hook adds latency to every tool call | Low | Medium | <10ms measured; early exit for unmatched tools |
| False positive injections (irrelevant lessons) | Medium | Low | Negative lookahead patterns; confidence threshold; dedup prevents repeat |
| Scanner produces too much noise | Medium | Low | Confidence scoring; needsReview flag; manual curation via /scan-lessons |
| Session JSONL format changes | Low | High | Version-check JSONL structure; fail gracefully on unknown formats |
| Lesson store grows unbounded | Low | Low | Content-hash dedup; CLI intelligence aggregation reduces individual count |
| Concurrent subagent race conditions | Low | Low | O_EXCL claim directory provides atomic dedup |
| `#lesson` tag non-compliance | Medium | Medium | Phase 0 compliance experiment; Tier 2 heuristic fallback; iterate on protocol wording |
| `#lesson` protocol drift across models | Low | Medium | Protocol injected at session start, not baked into model weights; version the protocol format |
20. Success Metrics
| Metric | Target | How to Measure |
|---|---|---|
| Mistake recurrence rate | 50% reduction | Compare pre/post: count retry loops in sessions |
| Hook latency | p99 < 50ms | Performance test suite |
| Lesson coverage | 80%+ of common patterns | Cross-reference scanner output with manual audit |
| False positive rate | < 10% of injections | Sample injections and assess relevance |
| Developer time saved | Measurable reduction in token waste | Compare session token usage before/after |
21. Open Questions
The following questions need answers before or during implementation. They are grouped by phase to indicate when each becomes blocking.
Phase 1 (Harvest & Schema: Blocking)
- Session JSONL stability: Is the `~/.claude/projects/*/session.jsonl` format documented/stable, or should we expect breaking changes? Do we need a version check at parse time?
- Codex/Gemini log formats: For cross-agent compatibility (V2), where do Codex and Gemini CLI store conversation logs? Same JSONL format, or entirely different? This determines whether the scanner core can be shared.
- Schema validation approach: Should we generate JSON Schemas from TypeScript types (single source of truth, requires build step) or maintain them manually (simpler, risk of drift)?
- Content hash scope: The `contentHash` deduplicates lessons. Should it hash just `mistake + remediation`, or also `triggers`? Including triggers means a lesson with refined patterns counts as "new". Is that desirable?
Phase 2 (Plugin + Hook: Blocking)
- Plugin distribution: How will this be installed? Personal GitHub repo + `extraKnownMarketplaces`? Or a local path for development? Is there a plugin publishing/registry process we should follow?
- Hook timeout: The Vercel plugin uses a 5-second timeout. Is this the right value for our hook, or should we be more aggressive (e.g., 2s) to avoid impacting agent responsiveness?
- Tag-based priority boosting: Should the hook attempt project stack detection (checking for `pyproject.toml`, `package.json`, etc.) to boost relevant lessons? This adds I/O to the hot path. Is it worth it, or should we defer to Phase 3?
- Injection format: Should lessons inject as plain markdown, or wrapped in a custom HTML tag/comment for structured parsing by the agent? The Vercel plugin uses `<!-- skill:name -->...<!-- /skill:name -->`. Should we follow suit?
Phase 3 (Commands + CLI Intelligence: Non-blocking)
- CLI intelligence threshold: The current proposal auto-aggregates at 5+ lessons per tool. Is this the right threshold? Should it be configurable? Should aggregation be opt-in or opt-out?
- Lesson lifecycle: Should lessons have an expiration or "last seen" date? If a lesson hasn't matched in 6 months, should it be automatically archived or demoted?
- Community contribution model: If this becomes a shared tool, how should community-contributed lessons be submitted, reviewed, and merged? PR-based? A submit command? A lesson marketplace?
Phase 4 (Automation: Non-blocking)
- Background scan trigger: Should automated scanning be triggered by SessionStart (adds latency), SessionEnd (may not exist in all agents), or a system-level cron? What's the right tradeoff?
- Auto-discovered lesson review UX: When the scanner finds new candidates in `--auto` mode, how should they surface for review? A notification on next session start? A counter in the status bar? A pending queue in `/scan-lessons`?
Phase 0 (#lesson Compliance: Blocking)
- `#lesson` tag format stability: Is the proposed `#lesson` / `#/lesson` delimiter format robust enough, or should we use a more structured format (e.g., YAML frontmatter, JSON block)? The current format optimizes for grep-ability. Is that the right tradeoff vs. parse reliability?
- Compliance across model families: The `#lesson` protocol is injected as context, not trained into the model. Different models (Claude, GPT, Gemini) may comply at different rates. Should we tune the protocol wording per model, or keep it universal and accept varying compliance rates?
- Subagent compliance: Subagents spawned via the Agent tool get the protocol via the `SubagentStart` hook. Do they comply at the same rate as the main agent? Are there tool-calling patterns where the subagent never encounters a "mistake moment" to report?
Architecture (Informed by Real-World Demand)
- Project-specific vs. generalized lesson scoping: Real-world demand signals (obra/superpowers#907, #601, #551) reveal two distinct user expectations:
  - Generalized lessons: "pytest hangs in non-interactive envs". These apply everywhere, regardless of project, and are the seed lessons we've built so far.
  - Project-specific lessons: "our CI uses custom runner X", "don't run migration Y in this repo", "reviewer prefers Z pattern". These are meaningful only within a specific codebase/team.

  Our current design treats all lessons as global. But the demand signals suggest project-scoped lessons are equally (perhaps more) valuable. Key questions:
  - Should the lesson store have a `scope` field (`global|project:<path>|team:<name>`)?
  - Should project-specific lessons live in the repo (e.g., `.lessons/lessons.json`) vs. the global plugin data dir?
  - How do global and project-scoped lessons interact at injection time? Priority boost for project-local matches?
  - Does the scanner detect project-specific vs. general patterns differently?
Cross-Cutting
- Privacy and sharing: Session logs may contain sensitive data (API keys, internal URLs, proprietary code). Should the scanner redact sensitive content from lesson provenance fields? Should lessons ever be shareable across users/teams?
- Telemetry: Should we track injection frequency, match rates, and dedup behavior to tune the system? If so, where does telemetry go: a local file, or opt-in remote?
- Offline vs. connected: Should this plugin ever phone home (for lesson sharing, updates, telemetry)? Or is it strictly offline-first?
This PRD is a living document. Update it as open questions are resolved and as harvesting results refine the schema.