llm-evaluation

@tylertitsworth/llm-evaluation
Updated 4/1/2026

LLM evaluation with lm-evaluation-harness — MMLU, HumanEval, GSM8K benchmarks, custom tasks, vLLM/HF/OpenAI backends, metrics, and LLM-as-judge. Use when evaluating or benchmarking language models. NOT for training, fine-tuning, dataset preprocessing, or model serving.

Installation

$ npx agent-skills-cli install @tylertitsworth/llm-evaluation

Supported assistants: Claude Code, Cursor, Copilot, Codex, Antigravity

Details

Path: llm-evaluation/SKILL.md
Branch: main
Scoped Name: @tylertitsworth/llm-evaluation

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


name: llm-evaluation
description: "LLM evaluation with lm-evaluation-harness — MMLU, HumanEval, GSM8K benchmarks, custom tasks, vLLM/HF/OpenAI backends, metrics, and LLM-as-judge. Use when evaluating or benchmarking language models. NOT for training, fine-tuning, dataset preprocessing, or model serving."

LLM Evaluation

lm-evaluation-harness (EleutherAI) — the standard framework with 60+ benchmarks. Version: 0.4.x+ (latest: 0.4.11).

v0.4.10 Breaking Change: The base package no longer installs model backends. Install the extras you need:

pip install lm_eval[hf]    # HuggingFace/PyTorch backend
pip install lm_eval[vllm]  # vLLM backend
pip install lm_eval[api]   # OpenAI-compatible API backends
pip install lm_eval[sglang]  # SGLang backend
pip install lm_eval[hf,vllm]  # multiple backends
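
A quick way to see which optional backends are usable in the current environment is to probe for their key dependency. This is a hypothetical helper, not part of lm-eval, and the backend-to-module mapping below is an illustrative assumption:

```python
# Sketch: probe which optional backend dependencies are importable.
# The backend -> module mapping is an assumption for illustration only.
import importlib.util

def available_backends(dep_map=None):
    """Return the backend names whose key dependency can be imported."""
    deps = dep_map or {
        "hf": "transformers",   # lm_eval[hf]
        "vllm": "vllm",         # lm_eval[vllm]
        "sglang": "sglang",     # lm_eval[sglang]
    }
    return sorted(
        name for name, module in deps.items()
        if importlib.util.find_spec(module) is not None
    )

print(available_backends())
```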

Model Backends and Configuration

Backend Reference

| Backend | model value | When to Use |
|---|---|---|
| HuggingFace (local) | hf | Direct model loading on GPU |
| HuggingFace multimodal | hf-multimodal | VLMs (LLaVA, InternVL, Qwen-VL) on multimodal tasks |
| vLLM (local) | vllm | Fast GPU inference, tensor parallel |
| vLLM multimodal | vllm-vlm | VLMs via vLLM backend (faster than hf-multimodal for large models) |
| SGLang (local) | sglang | SGLang engine for fast local inference |
| SGLang Generate API | sglang-generate | SGLang running as a server (native generate endpoint) |
| OpenAI-compatible (completions) | local-completions | Existing vLLM/Ollama server |
| OpenAI-compatible (chat) | local-chat-completions | Chat-format endpoints |
| OpenAI API | openai-completions | OpenAI hosted models |

model_args Reference

All backends accept model_args as a comma-separated key=value string.
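
Because every backend takes the same comma-separated format, a small helper can build the string from keyword arguments. This is a hypothetical convenience function, not part of lm-eval:

```python
def build_model_args(**kwargs):
    """Render keyword args as lm-eval's comma-separated key=value string."""
    return ",".join(f"{key}={value}" for key, value in kwargs.items())

# Example: assemble model_args for the vllm backend
args = build_model_args(
    pretrained="meta-llama/Llama-3.1-8B-Instruct",
    dtype="bfloat16",
    tensor_parallel_size=2,
)
print(args)
# pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16,tensor_parallel_size=2
```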

HuggingFace (hf) model_args:

| Arg | Purpose | Default |
|---|---|---|
| pretrained | Model ID or local path | required |
| dtype | Weight dtype (auto, float16, bfloat16) | auto |
| revision | Model revision/branch | main |
| trust_remote_code | Allow custom model code | False |
| parallelize | Naive model parallelism across GPUs | False |
| max_length | Override max context length | Model default |
| device | Device (cuda, cuda:0, cpu) | Auto |
| peft | Path to PEFT/LoRA adapter | None |
| delta | Path to delta weights | None |
| autogptq | Use AutoGPTQ quantization | False |
| add_bos_token | Prepend BOS token | False |

vLLM (vllm) model_args:

| Arg | Purpose | Default |
|---|---|---|
| pretrained | Model ID or local path | required |
| dtype | Weight dtype | auto |
| tensor_parallel_size | TP across GPUs | 1 |
| gpu_memory_utilization | KV cache memory fraction | 0.9 |
| max_model_len | Max context length | Model default |
| quantization | Quantization method | None |
| trust_remote_code | Allow custom code | False |
| data_parallel_size | Data parallel replicas | 1 |
| max_num_seqs | Max concurrent sequences | 256 |

SGLang (sglang): same model_args as vllm. sglang-generate: targets a running server with base_url=http://localhost:30000. See references/multimodal.md for examples.

API (local-completions / local-chat-completions) model_args:

| Arg | Purpose | Default |
|---|---|---|
| model | Model name in API | required |
| base_url | Server base URL | required |
| tokenizer_backend | Tokenizer source (huggingface) | None |
| num_concurrent | Concurrent API requests | 1 |
| max_retries | Retry count | 3 |
| tokenized_requests | Send token IDs instead of text | False |

CLI (v0.4.10+)

Breaking change in v0.4.10: The CLI now uses explicit subcommands (lm-eval run, lm-eval ls, lm-eval validate). The legacy lm-eval --model ... form still works but is deprecated.

# Run evaluation (primary command)
lm-eval run \
  --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=2 \
  --tasks mmlu,gsm8k,hellaswag \
  --num_fewshot 5 \
  --batch_size auto \
  --apply_chat_template \
  --log_samples \
  --output_path ./results/

# Load all args from YAML config file (v0.4.11+)
lm-eval run --config eval_config.yaml

# Discover available tasks
lm-eval ls tasks                        # list all task names
lm-eval ls tasks --json | jq '.[].name' # machine-readable

# Validate task configs before running
lm-eval validate --tasks mmlu,gsm8k     # catch YAML errors early
lm-eval validate --tasks my_custom_task --include_path ./tasks/

YAML config file (eval_config.yaml) — recommended for reproducible evaluations:

# eval_config.yaml
model: vllm
model_args: pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=2,gpu_memory_utilization=0.9
tasks:
  - mmlu
  - gsm8k
  - hellaswag
num_fewshot: 5
batch_size: auto
apply_chat_template: true
log_samples: true
output_path: ./results/
seed: [0, 1234, 1234, 1234]

lm-eval run --config eval_config.yaml              # all flags from file; CLI overrides file
lm-eval run --config eval_config.yaml --limit 100  # override: only 100 samples

Python API

import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.9",
    tasks=["mmlu", "gsm8k", "hellaswag"],
    num_fewshot=5,
    batch_size="auto",
    log_samples=True,
    apply_chat_template=True,
)

for task_name, task_result in results["results"].items():
    print(f"{task_name}: {task_result}")
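
results["results"] maps each task name to a dict of metric values (real keys include the filter name, e.g. "exact_match,strict-match", plus bookkeeping entries such as "alias"). A small sketch, under those shape assumptions, for flattening the structure into rows:

```python
def flatten_results(results):
    """Yield (task, metric, value) rows from a simple_evaluate results dict,
    skipping non-numeric bookkeeping entries such as 'alias'."""
    for task, metrics in results["results"].items():
        for metric, value in metrics.items():
            if isinstance(value, (int, float)):
                yield task, metric, value

# Illustrative shape only -- real keys depend on the task configs.
fake = {"results": {"gsm8k": {"alias": "gsm8k", "exact_match,strict-match": 0.71}}}
for row in flatten_results(fake):
    print(row)
# ('gsm8k', 'exact_match,strict-match', 0.71)
```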

Evaluation Settings

Core Settings

| Setting | Purpose | Values |
|---|---|---|
| tasks | Benchmark tasks to run | Comma-separated names or list |
| num_fewshot | Few-shot examples | Integer (0 = zero-shot) |
| batch_size | Batch size | Integer or "auto" |
| limit | Sample limit per task | Integer (count) or float (fraction) |
| apply_chat_template | Use model's chat template | True/False |
| fewshot_as_multiturn | Few-shot as multi-turn conversation | True/False |
| system_instruction | System prompt for chat template | String |
| log_samples | Log individual predictions | True/False |
| output_path | Results output directory | Path string |
| seed | Random seed for reproducibility | List [0,1234,1234,1234] (default) |
| use_cache | Cache model responses | Path string or None |

Generation Settings (for generate_until tasks)

These are set per-task in YAML configs, not globally:

generation_kwargs:
  max_gen_toks: 1024          # max tokens to generate
  temperature: 0.0            # 0.0 = greedy
  top_p: 1.0
  do_sample: false
  stop_sequences: ["\n\n"]    # stop on these strings
  until: ["\n\nQuestion:"]    # alternative stop format

Filter Pipeline

Filters post-process model output before metric computation:

filter_list:
  - name: get-answer
    filter:
      - function: regex
        regex_pattern: "#### (\\d+)"   # extract final answer
      - function: take_first
  - name: remove-whitespace
    filter:
      - function: strip
      - function: lowercase

Built-in filter functions: regex, take_first, strip, lowercase, uppercase, remove_whitespace, map, at.
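
The get-answer pipeline above is equivalent to the following plain-Python sketch. The behavior is inferred from the filter names; the harness's actual implementations may differ in edge cases such as the fallback value:

```python
import re

def apply_get_answer(text, pattern=r"#### (\d+)"):
    """Mimic the regex -> take_first filter chain: keep the first capture,
    falling back to a placeholder when nothing matches (assumed behavior)."""
    matches = re.findall(pattern, text)
    return matches[0] if matches else "[invalid]"

print(apply_get_answer("...so 3 + 4 = 7.\n#### 7"))  # 7
```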

W&B Integration

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=my-model",
    tasks=["mmlu"],
    wandb_args="project=eval,name=llama-8b-mmlu,job_type=eval",
)

Benchmarks

Knowledge and Reasoning

| Benchmark | Task | Metric | Shots | What It Measures |
|---|---|---|---|---|
| MMLU | mmlu | acc | 5 | Broad knowledge (57 subjects) |
| MMLU-Pro | mmlu_pro | acc | 5 | Harder MMLU (10 answer choices, CoT) |
| ARC-Challenge | arc_challenge | acc_norm | 25 | Grade-school science reasoning |
| HellaSwag | hellaswag | acc_norm | 10 | Commonsense completion |
| Winogrande | winogrande | acc | 5 | Commonsense coreference |
| TruthfulQA | truthfulqa_mc2 | mc2 | 0 | Resistance to misconceptions |
| GPQA | gpqa_main_zeroshot | acc | 0 | Graduate-level QA |
| BBH | bbh | acc | 3 | Big Bench Hard (23 tasks, CoT) |
| MuSR | musr | acc | 0 | Multi-step reasoning |

Math and Code

| Benchmark | Task | Metric | Shots | What It Measures |
|---|---|---|---|---|
| GSM8K | gsm8k | exact_match | 5 | Grade-school math (multi-step) |
| MATH | minerva_math | exact_match | 4 | Competition mathematics |
| MATH-Hard | minerva_math_hard | exact_match | 4 | Hard subset of MATH |
| HumanEval | humaneval | pass@1 | 0 | Python code generation |
| MBPP | mbpp | pass@1 | 3 | Python programming |

Instruction Following

| Benchmark | Task | Metric | Shots | What It Measures |
|---|---|---|---|---|
| IFEval | ifeval | strict_acc | 0 | Verifiable instruction following |

Knowledge Probing (v0.4.11+)

| Benchmark | Task | Metric | Shots | What It Measures |
|---|---|---|---|---|
| BEAR | bear | acc | 0 | Fill-in-the-blank world knowledge probing |

Open LLM Leaderboard v2

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=my-model,tensor_parallel_size=4",
    tasks=["mmlu_pro", "gpqa_main_zeroshot", "musr", "minerva_math_hard", "ifeval", "bbh"],
    batch_size="auto",
    apply_chat_template=True,
)

Metrics Reference

| Metric | Description | Used By |
|---|---|---|
| acc | Accuracy (multiple choice) | MMLU, ARC, BoolQ |
| acc_norm | Length-normalized accuracy | HellaSwag, ARC, PIQA |
| exact_match | Exact string match (after normalization) | GSM8K, MATH |
| exact_match,strict-match | Strict exact match | GSM8K |
| exact_match,flexible-extract | Flexible number extraction | GSM8K |
| pass@k | Any of k code samples passes tests | HumanEval, MBPP |
| f1 | Token-level F1 | SQuAD |
| mc2 | Weighted multi-choice accuracy | TruthfulQA |
| word_perplexity | Word-level perplexity | WikiText |
| byte_perplexity | Byte-level perplexity | WikiText |
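
pass@k is commonly computed with the unbiased estimator from the HumanEval paper: with n samples per problem of which c pass, pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch of that formula (not the harness's own code):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn from n (of which c pass) passes the tests."""
    if n - c < k:
        # Fewer failures than draws: at least one passing sample is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=10, c=3, k=1), 6))  # 0.3
```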

Metric Gotchas

  • acc vs acc_norm — HellaSwag scores differ by 5+ points depending on which you use; always use acc_norm
  • GSM8K exact_match — answer extraction regex matters hugely; strict-match vs flexible-extract can differ by 10+ points
  • Few-shot count — MMLU 0-shot vs 5-shot can differ by 10+ points; always report shot count
  • Chat template — instruction-tuned models need apply_chat_template=True or scores will be significantly lower
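
To see why strict-match and flexible-extract diverge, here is a sketch of the two extraction styles. The patterns only approximate the harness defaults; the exact regexes live in the gsm8k task config:

```python
import re

def strict_extract(text):
    """Accept only the canonical GSM8K '#### <number>' answer line."""
    match = re.search(r"#### (-?\d[\d,]*)", text)
    return match.group(1).replace(",", "") if match else None

def flexible_extract(text):
    """Fall back to the last number appearing anywhere in the output."""
    numbers = re.findall(r"-?\d[\d,]*", text)
    return numbers[-1].replace(",", "") if numbers else None

# A correct answer in the wrong format scores 0 under strict-match
# but 1 under flexible-extract -- hence the 10+ point gaps.
answer = "The total is 1,234 apples."
print(strict_extract(answer), flexible_extract(answer))  # None 1234
```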

Custom Evaluation Tasks

Task YAML Schema

task: my_custom_task
dataset_path: json                    # or huggingface dataset name
dataset_name: null
dataset_kwargs:
  data_files:
    test: ./data/test.jsonl
output_type: generate_until           # or multiple_choice, loglikelihood
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
generation_kwargs:
  max_gen_toks: 256
  temperature: 0.0
  do_sample: false
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
filter_list:
  - name: get-answer
    filter:
      - function: regex
        regex_pattern: "(\\d+)"
      - function: take_first
num_fewshot: 0
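
The task above expects a JSONL test file whose records carry the question and answer fields referenced by doc_to_text and doc_to_target. A sketch of generating one (the field names come from the templates above; the rows themselves are illustrative):

```python
import json
from pathlib import Path

# Minimal sample rows matching doc_to_text ({{question}}) and
# doc_to_target ({{answer}}) in the task YAML.
rows = [
    {"question": "What is 6 * 7?", "answer": "42"},
    {"question": "How many days are in a week?", "answer": "7"},
]

path = Path("data/test.jsonl")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```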

output_type Reference

| Type | Description | Metrics |
|---|---|---|
| multiple_choice | Model scores each option | acc, acc_norm |
| loglikelihood | Log probability of target | perplexity, acc |
| loglikelihood_rolling | Rolling log probability | word_perplexity, byte_perplexity |
| generate_until | Free-form generation | exact_match, pass@k, custom |
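
For multiple_choice tasks the model never generates text: the harness scores each option's log-likelihood and picks the argmax. A sketch of how acc and acc_norm can disagree (length normalization shown here is a simplification of what the harness does):

```python
def score_multiple_choice(logliks, lengths, gold):
    """Return (acc_hit, acc_norm_hit) for one document.

    logliks: summed log-likelihood of each answer option
    lengths: length of each option (e.g. in bytes or tokens)
    gold:    index of the correct option
    """
    pred = max(range(len(logliks)), key=lambda i: logliks[i])
    pred_norm = max(range(len(logliks)), key=lambda i: logliks[i] / lengths[i])
    return pred == gold, pred_norm == gold

# Longer options accumulate more negative log-likelihood, so
# normalization can flip the prediction.
acc, acc_norm = score_multiple_choice(
    logliks=[-9.0, -12.0], lengths=[3, 6], gold=1)
print(acc, acc_norm)  # False True
```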

Task Groups

group: my_eval_suite
task:
  - my_custom_task
  - mmlu
  - gsm8k
aggregate_metric_list:
  - metric: acc
    aggregation: mean
    weight_by_size: true

Task Inheritance

include: _default_template.yaml
task: my_variant
dataset_name: hard_subset
num_fewshot: 0

LLM-as-Judge Evaluation

Use a custom harness task with process_results: !function to score model outputs with a local vLLM judge. Results appear in results["results"] alongside standard metrics — loggable to W&B/MLflow with no extra plumbing.

Judge Task YAML

# judge_quality.yaml — custom task that scores responses via LLM judge
task: judge_quality
dataset_path: json
dataset_kwargs:
  data_files:
    test: ./data/judge_test.jsonl  # {"question": "...", "reference": "..."}
output_type: generate_until
doc_to_text: "{{question}}"
doc_to_target: "{{reference}}"
generation_kwargs:
  max_gen_toks: 512
  temperature: 0.0
  do_sample: false
metric_list:
  - metric: judge_score
    aggregation: mean
    higher_is_better: true
process_results: !function utils.judge_score
metadata:
  version: 1.0

Judge Utility (utils.py)

The judge runs against a local vLLM endpoint — any model works as judge, no per-token API costs:

# utils.py — placed alongside judge_quality.yaml
import openai

# Point at your local vLLM (or any OpenAI-compatible) endpoint
_client = openai.OpenAI(
    base_url="http://vllm-svc:8000/v1",  # local vLLM service
    api_key="unused",                      # vLLM doesn't need a real key
)
_JUDGE_MODEL = "meta-llama/Llama-3.1-70B-Instruct"  # whatever model vLLM is serving

JUDGE_PROMPT = """Rate the response on a scale of 1-5 for accuracy and completeness.
Return ONLY a JSON object: {{"score": <int>, "reason": "<one sentence>"}}

Question: {question}
Reference: {reference}
Response: {response}"""

def judge_score(doc, results):
    """Called by the harness via process_results: !function utils.judge_score"""
    response_text = results[0]
    msg = JUDGE_PROMPT.format(
        question=doc["question"],
        reference=doc.get("reference", "N/A"),
        response=response_text,
    )
    completion = _client.chat.completions.create(
        model=_JUDGE_MODEL,
        messages=[{"role": "user", "content": msg}],
        temperature=0.0,
        response_format={"type": "json_object"},  # structured output — no fragile parsing
    )
    import json
    parsed = json.loads(completion.choices[0].message.content)
    return {"judge_score": int(parsed["score"])}

Running the Judge Task

# Evaluate a model and score its outputs with the local vLLM judge
lm-eval run \
  --model vllm \
  --model_args pretrained=my-org/my-model,tensor_parallel_size=2 \
  --tasks judge_quality \
  --include_path ./tasks/ \
  --log_samples \
  --output_path ./results/

# Judge scores appear in standard results alongside any other metrics
# results["results"]["judge_quality"]["judge_score"] = 3.72

Integration with W&B / MLflow

Judge scores flow through the standard results dict — no special handling needed:

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=my-org/my-model",
    tasks=["mmlu", "judge_quality"],
    include_path="./tasks/",
    log_samples=True,
    wandb_args="project=eval,name=my-model-judge",
)
# results["results"]["judge_quality"]["judge_score"] logged automatically to W&B

Key Design Decisions

  • Local vLLM judge (base_url: http://vllm-svc:8000/v1) — zero per-token cost at evaluation scale; swap any model as judge
  • response_format={"type": "json_object"} — structured output replaces fragile int(result.strip()) parsing
  • process_results: !function — native harness hook; scores appear in standard output, not a separate pipeline
  • Any model as judge — change _JUDGE_MODEL to whatever vLLM is serving (Llama, Qwen, Mistral, etc.)

See also: assets/judge_quality.yaml, assets/judge_utils.py, MT-Bench, AlpacaEval

Best Practices

  1. Consistent settings across compared models — same shots, batch size, chat template
  2. Report standard error: results include _stderr fields for each metric; small score differences may be noise
  3. Multiple tasks — no single benchmark tells the whole story
  4. vLLM backend for faster GPU evaluation (tensor_parallel_size for large models)
  5. Save raw results (log_samples=True) for debugging unexpected scores
  6. Match eval to use case — coding? HumanEval. Reasoning? GSM8K/MATH. General? MMLU.

VLM Evaluation (Multimodal)

Use hf-multimodal or vllm-vlm backends (see Backend Reference above). Use --apply_chat_template; set fixed --batch_size for hf-multimodal. Key benchmarks: mmbench_en, mmstar, textvqa, docvqa, scienceqa. See references/multimodal.md for full CLI examples, model_args, and benchmark table.

References

  • benchmarks.md — Benchmark descriptions, scoring baselines, and evaluation gotchas
  • troubleshooting.md — Benchmark execution failures, scoring issues, resource problems, and framework-specific fixes

Cross-References

  • vllm — vLLM backend for fast GPU evaluation
  • openai-api — OpenAI-compatible API backends (local-completions, local-chat-completions)
  • wandb — Log evaluation results to W&B
  • huggingface-transformers — HF model loading for hf backend
  • ollama — Local inference alternative for evaluation
  • flyte-sdk — Orchestrate evaluation pipelines as Flyte workflows
  • sglang — SGLang as evaluation backend
  • mlflow — MLflow GenAI evaluation as alternative framework
  • dvc — Version evaluation datasets and results
  • prometheus-grafana — Monitor evaluation infrastructure metrics

Reference

  • lm-evaluation-harness GitHub
  • Available tasks
  • Task YAML guide
  • Open LLM Leaderboard
  • references/benchmarks.md — detailed benchmark descriptions and scoring baselines
  • references/multimodal.md — VLM backend options, multimodal benchmarks, and SGLang usage
  • assets/custom_task.yaml — example custom evaluation task with generate_until, regex filter, and task group definition
  • assets/judge_quality.yaml — LLM-as-Judge task YAML for the harness
  • assets/judge_utils.py — Judge scoring function using local vLLM endpoint