Use when testing Ralph's hat collection presets, validating preset configurations, or auditing the preset library for bugs and UX issues.
Usage
After installing, this skill will be available to your AI coding assistant.
Verify installation:
skills list
Skill Instructions
name: evaluate-presets
description: Use when testing Ralph's hat collection presets, validating preset configurations, or auditing the preset library for bugs and UX issues.
Evaluate Presets
Overview
Systematically test all hat collection presets using shell scripts. Direct CLI invocation—no meta-orchestration complexity.
When to Use
- Testing preset configurations after changes
- Auditing the preset library for quality
- Validating new presets work correctly
- After modifying hat routing logic
Quick Start
Evaluate a single preset:
./tools/evaluate-preset.sh tdd-red-green claude
Evaluate all presets:
./tools/evaluate-all-presets.sh claude
Arguments:
- First arg: preset name (without .yml extension)
- Second arg: backend (claude or kiro, defaults to claude)
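For example, assuming the kiro backend is configured in your environment, the same scripts can be pointed at it (adversarial-review is one of the presets listed below):
# Single preset against the kiro backend
./tools/evaluate-preset.sh adversarial-review kiro
# Full suite against the kiro backend
./tools/evaluate-all-presets.sh kiro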
Bash Tool Configuration
IMPORTANT: When invoking these scripts via the Bash tool, use these settings:
- Single preset evaluation: Use timeout: 600000 (10 minutes max) and run_in_background: true
- All presets evaluation: Use timeout: 600000 (10 minutes max) and run_in_background: true
Since preset evaluations can run for hours (especially the full suite), always run in background mode and use the TaskOutput tool to check progress periodically.
Example invocation pattern:
Bash tool with:
command: "./tools/evaluate-preset.sh tdd-red-green claude"
timeout: 600000
run_in_background: true
After launching, use TaskOutput with block: false to check status without waiting for completion.
What the Scripts Do
evaluate-preset.sh
- Loads test task from tools/preset-test-tasks.yml (if yq available)
- Creates merged config with evaluation settings
- Runs Ralph with --record-session for metrics capture
- Captures output logs, exit codes, and timing
- Extracts metrics: iterations, hats activated, events published
Output structure:
.eval/
├── logs/<preset>/<timestamp>/
│ ├── output.log # Full stdout/stderr
│ ├── session.jsonl # Recorded session
│ ├── metrics.json # Extracted metrics
│ ├── environment.json # Runtime environment
│ └── merged-config.yml # Config used
└── logs/<preset>/latest -> <timestamp>
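The latest symlink makes it easy to inspect the most recent run without hunting for timestamps. A minimal sketch, assuming jq is installed and using tdd-red-green as an example preset:
# Extracted metrics for the most recent tdd-red-green run
jq . .eval/logs/tdd-red-green/latest/metrics.json
# Last lines of the full output log from the same run
tail -n 50 .eval/logs/tdd-red-green/latest/output.log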
evaluate-all-presets.sh
Runs all 12 presets sequentially and generates a summary:
.eval/results/
├── <suite-id>/
│ ├── SUMMARY.md # Markdown report
│ └── <preset>.json # Per-preset metrics
└── latest -> <suite-id>
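Assuming the suite keeps its latest symlink at .eval/results/latest (the path used in the triage workflow below), the report and per-preset metrics can be reviewed directly:
# Read the Markdown summary for the most recent suite run
cat .eval/results/latest/SUMMARY.md
# List the per-preset metrics files
ls .eval/results/latest/*.json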
Presets Under Evaluation
| Preset | Test Task |
|---|---|
| tdd-red-green | Add is_palindrome() function |
| adversarial-review | Review user input handler for security |
| socratic-learning | Understand HatRegistry |
| spec-driven | Specify and implement StringUtils::truncate() |
| mob-programming | Implement a Stack data structure |
| scientific-method | Debug failing mock test assertion |
| code-archaeology | Understand history of config.rs |
| performance-optimization | Profile hat matching |
| api-design | Design a Cache trait |
| documentation-first | Document RateLimiter |
| incident-response | Respond to "tests failing in CI" |
| migration-safety | Plan v1 to v2 config migration |
Interpreting Results
Exit codes from evaluate-preset.sh:
- 0 — Success (LOOP_COMPLETE reached)
- 124 — Timeout (preset hung or took too long)
- Other — Failure (check output.log)
Metrics in metrics.json:
- iterations — How many event loop cycles
- hats_activated — Which hats were triggered
- events_published — Total events emitted
- completed — Whether completion promise was reached
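As a sketch, assuming jq is available and that metrics.json is a flat JSON object with these keys, all four fields can be pulled out in one call:
jq '{iterations, hats_activated, events_published, completed}' \
  .eval/logs/<preset>/latest/metrics.json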
Hat Routing Performance
Critical: Validate that hats get fresh context per Tenet #1 ("Fresh Context Is Reliability").
What Good Looks Like
Each hat should execute in its own iteration:
Iter 1: Ralph → publishes starting event → STOPS
Iter 2: Hat A → does work → publishes next event → STOPS
Iter 3: Hat B → does work → publishes next event → STOPS
Iter 4: Hat C → does work → LOOP_COMPLETE
Red Flags (Same-Iteration Hat Switching)
BAD: Multiple hat personas in one iteration:
Iter 2: Ralph does Blue Team + Red Team + Fixer work
^^^ All in one bloated context!
How to Check
1. Count iterations (from output.log) vs events published (from session.jsonl):
# Count iterations
grep -c "_meta.loop_start\|ITERATION" .eval/logs/<preset>/latest/output.log
# Count events published
grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl
Expected: iterations ≈ events published (one event per iteration).
Bad sign: 2-3 iterations but 5+ events (all the work happening in a single iteration). A combined sketch covering checks 1-3 follows step 3 below.
2. Check for same-iteration hat switching in output.log:
grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
.eval/logs/<preset>/latest/output.log
Red flag: Hat-switching phrases WITHOUT an ITERATION separator between them.
3. Check event timestamps in session.jsonl:
cat .eval/logs/<preset>/latest/session.jsonl | jq -r '.ts'
Red flag: Multiple events with identical timestamps (published in same iteration).
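The three checks can be combined into one quick pass over a run directory. A sketch, assuming the grep patterns above match your log and session formats and that jq is installed (tdd-red-green stands in for the preset under test):
# Point RUN at the run to inspect
RUN=.eval/logs/tdd-red-green/latest
# Iterations vs. events published (expect roughly equal counts)
echo "iterations: $(grep -c "_meta.loop_start\|ITERATION" "$RUN/output.log")"
echo "events:     $(grep -c "bus.publish" "$RUN/session.jsonl")"
# Timestamps that appear more than once (possible same-iteration publishing)
jq -r '.ts' "$RUN/session.jsonl" | sort | uniq -d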
Routing Performance Triage
| Pattern | Diagnosis | Action |
|---|---|---|
| iterations ≈ events | ✅ Good | Hat routing working |
| iterations << events | ⚠️ Same-iteration switching | Check prompt has STOP instruction |
| iterations >> events | ⚠️ Recovery loops | Agent not publishing required events |
| 0 events | ❌ Broken | Events not being read from JSONL |
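As a worked example of the triage table, a small sketch that turns the two counts into a diagnosis; the "2x" thresholds standing in for << and >> are assumptions, not part of the scripts:
# Example run directory (tdd-red-green used as a placeholder)
RUN=.eval/logs/tdd-red-green/latest
ITER=$(grep -c "_meta.loop_start\|ITERATION" "$RUN/output.log")
EVENTS=$(grep -c "bus.publish" "$RUN/session.jsonl")
if [ "$EVENTS" -eq 0 ]; then
  echo "Broken: events not being read from the JSONL"
elif [ "$EVENTS" -gt $((ITER * 2)) ]; then
  echo "Same-iteration switching: check the prompt has a STOP instruction"
elif [ "$ITER" -gt $((EVENTS * 2)) ]; then
  echo "Recovery loops: agent not publishing required events"
else
  echo "Hat routing working: iterations ≈ events"
fi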
Root Cause Checklist
If hat routing is broken:
- Check workflow prompt in hatless_ralph.rs:
  - Does it say "CRITICAL: STOP after publishing"?
  - Is the DELEGATE section clear about yielding control?
- Check hat instructions propagation:
  - Does HatInfo include an instructions field?
  - Are instructions rendered in the ## HATS section?
- Check events context:
  - Is build_prompt(context) using the context parameter?
  - Does the prompt include a ## PENDING EVENTS section?
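A quick way to spot-check the items above from the command line, sketched under the assumption that HatInfo and the prompt builder live in the Rust sources of the repository (adjust the paths to your layout):
# Does the workflow prompt tell the agent to stop after publishing?
grep -rn "STOP after publishing" --include="*.rs" .
# Where is HatInfo defined? Inspect it for an instructions field.
grep -rn "struct HatInfo" --include="*.rs" .
# Are the HATS and PENDING EVENTS sections rendered into the prompt?
grep -rn "## HATS\|## PENDING EVENTS" --include="*.rs" .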
Autonomous Fix Workflow
After evaluation, delegate fixes to subagents:
Step 1: Triage Results
Read .eval/results/latest/SUMMARY.md and identify:
- ❌ FAIL → Create code tasks for fixes
- ⏱️ TIMEOUT → Investigate infinite loops
- ⚠️ PARTIAL → Check for edge cases
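A minimal sketch for pulling the problem rows out of the report, assuming the status markers above appear verbatim in SUMMARY.md:
grep -E "FAIL|TIMEOUT|PARTIAL" .eval/results/latest/SUMMARY.md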
Step 2: Dispatch Task Creation
For each issue, spawn a Task agent:
"Use /code-task-generator to create a task for fixing: [issue from evaluation]
Output to: tasks/preset-fixes/"
Step 3: Dispatch Implementation
For each created task:
"Use /code-assist to implement: tasks/preset-fixes/[task-file].code-task.md
Mode: auto"
Step 4: Re-evaluate
./tools/evaluate-preset.sh <fixed-preset> claude
Prerequisites
- yq (optional): For loading test tasks from YAML. Install: brew install yq
- Cargo: Must be able to build Ralph
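A quick preflight sketch, assuming Ralph builds with a plain cargo build from the repository root:
# yq is optional; only needed to load test tasks from YAML
command -v yq >/dev/null || echo "yq not found (optional: brew install yq)"
# Ralph must build before the evaluation scripts can run it
cargo build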
Related Files
- tools/evaluate-preset.sh — Single preset evaluation
- tools/evaluate-all-presets.sh — Full suite evaluation
- tools/preset-test-tasks.yml — Test task definitions
- tools/preset-evaluation-findings.md — Manual findings doc
- presets/ — The preset collection being evaluated