Design and implement evaluation frameworks for AI agents. Use when testing agent reasoning quality, building graders, doing error analysis, or establishing regression protection. Framework-agnostic concepts that apply to any SDK.
## Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

```
npx agent-skills-cli list
```

## Skill Instructions

```yaml
name: agent-evals
description: Design and implement evaluation frameworks for AI agents. Use when testing agent reasoning quality, building graders, doing error analysis, or establishing regression protection. Framework-agnostic concepts that apply to any SDK.
```
# Agent Evaluations: Measuring Reasoning Quality

Core Thesis: "One of the biggest predictors for whether someone is able to build agentic workflows really well is whether or not they're able to drive a really disciplined evaluation process." – Andrew Ng

Evaluations (evals) are exams for your agent's reasoning. Unlike traditional testing (TDD), which checks code correctness with deterministic PASS/FAIL outcomes, evals measure reasoning quality with probabilistic scores. The distinction is critical:
| Aspect | TDD (Code Testing) | Evals (Agent Evaluation) |
|---|---|---|
| Tests | Does function return correct output? | Did agent make the right decision? |
| Outcome | PASS or FAIL (deterministic) | Scores (probabilistic) |
| Example | "Does get_weather() return valid JSON?" | "Did agent correctly interpret user intent?" |
| Analogy | Testing if calculator works | Testing if student knows WHEN to use multiplication |
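The contrast in the table can be sketched in a few lines. This is an illustrative example, not a real agent: `get_weather` is a stub tool, and the eval cases are invented routing decisions.

```python
def get_weather(city: str) -> dict:
    # Stub tool; a real implementation would call a weather API.
    return {"city": city, "temp_c": 18}

def test_get_weather_returns_dict():
    # TDD-style check: deterministic PASS/FAIL on code correctness.
    assert isinstance(get_weather("Oslo"), dict)

def eval_intent_interpretation(cases: list) -> float:
    # Eval-style check: a probabilistic score aggregated over many cases.
    correct = sum(1 for c in cases if c["agent_choice"] == c["expected_choice"])
    return correct / len(cases)

test_get_weather_returns_dict()
score = eval_intent_interpretation([
    {"agent_choice": "get_weather", "expected_choice": "get_weather"},
    {"agent_choice": "web_search", "expected_choice": "get_weather"},
])
print(score)  # 0.5
```

The unit test either passes or raises; the eval returns a score you track over time.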
## When to Activate
Activate this skill when:
- Building systematic quality checks for any AI agent
- Designing evaluation datasets (typical, edge, error cases)
- Creating graders to define "good" automatically
- Performing error analysis to find failure patterns
- Setting up regression protection for agent changes
- Deciding when to use end-to-end vs component-level evals
## Core Concepts

### 1. Evals as Exams
Think of evals like course exams for your agent:
| Eval Type | Analogy | Purpose |
|---|---|---|
| Initial Eval | Final exam | Does agent pass the course? Handles all scenarios? |
| Regression Eval | Pop quiz | Did the update break what was working? |
| Component Eval | Subject test | Test individual skills (routing, tool use, output) |
| End-to-End Eval | Comprehensive exam | Test the full experience |
### 2. The Two Evaluation Axes
Evals vary on two dimensions:
| | Objective (Code) | Subjective (LLM Judge) |
|---|---|---|
| Per-example ground truth | Invoice dates, expected values | Gold standard talking points |
| No per-example ground truth | Word count limits, format rules | Rubric-based grading |
Examples:
- Invoice date extraction: Objective + per-example ground truth
- Marketing copy length: Objective + no per-example ground truth
- Research article quality: Subjective + per-example ground truth
- Chart clarity rubric: Subjective + no per-example ground truth
### 3. Graders
What: Automated quality checks that turn subjective assessment into measurable scores.
Key Insight: Don't use 1-5 scales (LLMs are poorly calibrated). Use binary criteria instead:
❌ BAD: "Rate this response 1-5 on quality"

✅ GOOD: "Check these 5 criteria (yes/no each):
1. Does it have a clear title?
2. Are axis labels present?
3. Is it the appropriate chart type?
4. Is the data accurately represented?
5. Is the legend clear?"

Binary criteria → sum up → reliable scores (0-5).
LLM-as-Judge Pattern:

```
Determine how many of the 5 gold standard talking points are present
in the provided essay.

Talking points: {talking_points}
Essay: {essay_text}

Return JSON: {"score": <0-5>, "explanation": "..."}
```
Position Bias: Many LLMs prefer the first option when comparing two outputs. Avoid pairwise comparisons; use rubric-based grading instead.
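A minimal sketch of wiring the judge pattern into code. `call_llm` is a stand-in for whatever completion call your provider offers; here it is stubbed with a canned response so the sketch runs.

```python
import json

RUBRIC_PROMPT = """Determine how many of the 5 gold standard talking points
are present in the provided essay.

Talking points: {talking_points}
Essay: {essay_text}

Return JSON: {{"score": <0-5>, "explanation": "..."}}"""

def call_llm(prompt: str) -> str:
    # Stub: swap in your provider's completion call here.
    return '{"score": 4, "explanation": "Missed the cost talking point."}'

def judge_essay(talking_points: list, essay_text: str) -> dict:
    # Literal JSON braces in the template are doubled so str.format
    # leaves them intact.
    prompt = RUBRIC_PROMPT.format(
        talking_points="; ".join(talking_points),
        essay_text=essay_text,
    )
    return json.loads(call_llm(prompt))

result = judge_essay(["cost", "speed", "safety", "scale", "privacy"], "...")
print(result["score"])  # 4 (hard-coded by the stub)
```

Asking for JSON with an explanation makes the judge's score auditable during error analysis.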
### 4. Error Analysis (Most Critical Skill)

The Build-Analyze Loop:

```
Build → Look at outputs → Find issues → Build evals → Improve → Repeat
```

Don't guess what's wrong; MEASURE:
- Build spreadsheet: Case | Component | Error Type
- Count patterns: "45% of errors from web search results"
- Focus effort where errors cluster
- Prioritize by: Error frequency × Feasibility to fix
Trace Analysis Terminology:
- Trace: All intermediate outputs from agent run
- Span: Output of a single step
- Error Analysis: Reading traces to find which component caused failures
Example Error Analysis Table:
| Prompt | Search Terms | Search Results | Best Sources | Final Output |
|---|---|---|---|---|
| Black holes | OK | Too many blogs (45%) | OK | Missing key points |
| Seattle rent | OK | OK | Missed blog | OK |
| Fruit robots | Generic (5%) | Poor quality | Poor | Missing company |
### 5. End-to-End vs Component-Level Evals
End-to-End Evals:
- Test entire agent output quality
- Expensive to run (full workflow)
- Noisy (multiple components introduce variance)
- Use for: Ship decisions, production monitoring
Component-Level Evals:
- Test single component in isolation
- Faster, clearer signal
- Use for: Debugging, tuning specific components
- Example: Eval just the web search quality, not full research agent
Decision Framework:
- Start with end-to-end to find overall quality
- Use error analysis to identify problem component
- Build component-level eval for that component
- Tune component using component eval
- Verify improvement with end-to-end eval
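Step 3 of the framework can look like the sketch below: a component-level eval that scores just the search step against per-case expectations. The function names, the fake search results, and the `expected_sources` field are all illustrative.

```python
def eval_search_component(cases: list, search_fn) -> float:
    """Score the search step in isolation, without running the full agent."""
    hits = 0
    for case in cases:
        results = search_fn(case["query"])
        # Component-level check: did at least one expected source appear?
        if any(src in results for src in case["expected_sources"]):
            hits += 1
    return hits / len(cases)

def fake_search(query: str) -> list:
    # Stubbed search so the sketch is runnable; replace with the real tool.
    return ["nasa.gov/black-holes", "someblog.example"]

cases = [
    {"query": "black holes", "expected_sources": ["nasa.gov/black-holes"]},
    {"query": "seattle rent", "expected_sources": ["zillow.com/research"]},
]
print(eval_search_component(cases, fake_search))  # 0.5
```

Because only one component runs, each case is cheap and a failure points directly at the search step rather than at the whole pipeline.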
### 6. Dataset Design
Quality Over Quantity: Start with 10-20 high-quality cases, NOT 1000 random ones.
Three Categories:
| Category | Count | Purpose |
|---|---|---|
| Typical | 10 | Common use cases |
| Edge | 5 | Unusual but valid inputs |
| Error | 5 | Should fail gracefully |
Use REAL Data: Pull from actual user queries, support tickets, production logs. Synthetic data misses the messiness of reality.
Grow Dataset Over Time: When evals fail to capture your judgment about quality, add more cases to fill the gap.
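One possible storage shape for such a dataset, with the three categories tagged per case. The field names (`id`, `category`, `expected`, and so on) are an illustrative schema, not a standard.

```python
import json

dataset = [
    {"id": "t01", "category": "typical", "input": "Summarize this invoice",
     "expected": {"contains": ["total", "due date"]}},
    {"id": "e01", "category": "edge", "input": "Invoice in Japanese yen",
     "expected": {"contains": ["¥"]}},
    {"id": "x01", "category": "error", "input": "",
     "expected": {"graceful_failure": True}},
]

# Persist as JSON so the dataset lives in version control next to the agent.
serialized = json.dumps(dataset, ensure_ascii=False, indent=2)

# Group case ids by category to check coverage at a glance.
by_category = {}
for case in dataset:
    by_category.setdefault(case["category"], []).append(case["id"])
print(by_category)
```

Keeping the file in version control means a growing dataset is reviewed like code: every new failure case arrives as a diff.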
### 7. Regression Protection

Run evals on EVERY change:

```
Change code → Run eval suite → Compare to baseline
                                      ↓
        pass rate drops → investigate before shipping
        pass rate stable or improved → safe to deploy
```
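A minimal regression gate implementing that comparison. The baseline value and the 2% tolerance are example numbers; tune both to your suite's noise level.

```python
def regression_gate(results: list, baseline_pass_rate: float,
                    tolerance: float = 0.02) -> str:
    """Compare the current pass rate to a stored baseline."""
    pass_rate = sum(results) / len(results)
    if pass_rate < baseline_pass_rate - tolerance:
        return "BLOCK: investigate before shipping"
    return "OK: safe to deploy"

baseline = 0.85  # pass rate recorded from the last shipped version
print(regression_gate([True] * 9 + [False], baseline))      # 90% → OK
print(regression_gate([True] * 7 + [False] * 3, baseline))  # 70% → BLOCK
```

Wired into CI, this turns the eval suite into a merge check, the same way unit tests gate code changes.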
The Eval-Driven Development Loop:

```
prompt v1 → eval 70% → error analysis → fix routing → eval 85%
          → error analysis → fix output format → eval 92%
          → ship
```
## Practical Guidance

### Building Quick-and-Dirty Evals
- Start immediately (don't wait for perfect)
- 10-20 examples is fine to start
- Look at outputs manually alongside metrics
- Iterate on evals as you iterate on agent
- Upgrade evals when they fail to capture your judgment
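The bullets above fit in a few dozen lines. This is a deliberately quick-and-dirty harness: `run_agent` is a stub standing in for your real agent, and the substring grader is the simplest check that could work.

```python
def run_agent(prompt: str) -> str:
    # Stub agent; replace with a call into your real agent.
    return f"Answer to: {prompt}"

def grade(output: str, expected_substring: str) -> bool:
    # Crude grader: does the expected content appear at all?
    return expected_substring.lower() in output.lower()

cases = [
    {"prompt": "capital of France?", "expected": "France"},
    {"prompt": "2+2?", "expected": "4"},
]

passed = 0
for case in cases:
    output = run_agent(case["prompt"])
    ok = grade(output, case["expected"])
    passed += ok
    # Print each output so you can read them manually alongside the metric.
    print(f"[{'PASS' if ok else 'FAIL'}] {case['prompt']!r} -> {output!r}")

pass_rate = passed / len(cases)
print(f"pass rate: {pass_rate:.0%}")
```

Printing every output next to the score is the point: early on, reading the transcripts teaches you more than the number does.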
### Creating Effective Graders

```python
# Grader for structured feedback
def grader_feedback_structure(output: dict) -> dict:
    """
    Check if feedback follows the required structure:
    1. Strengths section present
    2. Gaps section present
    3. Actionable suggestions present
    """
    feedback = output.get("student_feedback", "")
    checks = {
        "has_strengths": "strength" in feedback.lower(),
        "has_gaps": "improvement" in feedback.lower() or "gap" in feedback.lower(),
        "has_actions": "suggest" in feedback.lower() or "recommend" in feedback.lower(),
    }
    score = sum(checks.values())
    return {
        "passed": score == 3,
        "score": score,
        "checks": checks,
        "explanation": f"Passed {score}/3 structure checks",
    }
```
### LLM Grader Template

```python
# Literal JSON braces are doubled so str.format leaves them intact.
GRADER_PROMPT = """
Evaluate the agent response against these criteria:

Response: {response}
Criteria: {criteria}

For each criterion, answer YES or NO:
{criteria_list}

Return JSON:
{{
  "criteria_results": {{"criterion_1": true/false, ...}},
  "total_passed": <count>,
  "total_criteria": <count>,
  "passed": <true if all criteria passed>
}}
"""
```
### Error Analysis Workflow

```python
def analyze_errors(test_results: list) -> dict:
    """
    Systematic error analysis across test cases.
    """
    error_counts = {
        "routing": 0,
        "tool_selection": 0,
        "output_format": 0,
        "content_quality": 0,
        "other": 0,
    }
    for result in test_results:
        if not result["passed"]:
            # Analyze the trace to find the error source.
            # classify_error maps a trace to one of the buckets above.
            error_type = classify_error(result["trace"])
            error_counts[error_type] += 1

    total_errors = sum(error_counts.values())
    return {
        "error_counts": error_counts,
        "percentages": {
            k: v / total_errors * 100
            for k, v in error_counts.items()
            if total_errors > 0  # guard against division by zero
        },
        # Prioritize by frequency: fix the biggest bucket first.
        "recommendation": max(error_counts, key=error_counts.get),
    }
```
## The Complete Quality Loop

```
THE EVAL-DRIVEN LOOP

 1. BUILD quick-and-dirty agent v1
        ↓
 2. CREATE eval dataset (10-20 cases)
        ↓
 3. RUN evals → find 70% pass rate
        ↓
 4. ERROR ANALYSIS → "45% of errors from routing"
        ↓
 5. FIX routing → re-run evals → 85% pass rate
        ↓
 6. ERROR ANALYSIS → "30% of errors from output format"
        ↓
 7. FIX format → re-run evals → 92% pass rate
        ↓
 8. DEPLOY with regression protection
        ↓
 9. MONITOR production → add failed cases to dataset
        ↓
10. REPEAT (continuous improvement)
```
## Anti-Patterns to Avoid
| Anti-Pattern | Why It's Bad | What to Do Instead |
|---|---|---|
| Waiting for perfect evals | Delays useful feedback | Start with 10 quick cases |
| 1000+ test cases first | Quantity without quality | 20 thoughtful cases |
| 1-5 scale ratings | LLMs poorly calibrated | Binary criteria summed |
| Ignoring traces | Miss root cause | Read intermediate outputs |
| End-to-end only | Too noisy for debugging | Add component-level evals |
| Synthetic test data | Misses real-world messiness | Use actual user queries |
| Going by gut | May work on wrong component | Count errors systematically |
| Skipping regression tests | Breaks working features | Run evals on every change |
## Integration with Other Skills
This skill connects to:
- building-with-openai-agents: Evaluating OpenAI agents specifically
- building-with-claude-agent-sdk: Evaluating Claude agents
- building-with-google-adk: Evaluating Google ADK agents
- evaluation: Broader context engineering evaluation
- context-degradation: Detecting context-related failures
## Framework-Agnostic Application
These concepts apply to ANY agent framework:
| Framework | Trace Access | Grader Integration | Dataset Storage |
|---|---|---|---|
| OpenAI Agents SDK | Built-in tracing | Custom graders | JSON/CSV files |
| Claude Agent SDK | Hooks for tracing | Custom graders | JSON/CSV files |
| Google ADK | Evaluation module | Built-in graders | Vertex AI datasets |
| LangChain | LangSmith traces | LangSmith evals | LangSmith datasets |
| Custom | Logging middleware | Custom graders | Any storage |
The thinking is portable. The skill is permanent.
## Skill Metadata

- Created: 2025-12-30
- Source: Andrew Ng's Agentic AI Course + OpenAI AgentKit Build Hour
- Author: Claude Agent Factory
- Version: 1.0.0