llm-evaluation

@tylertitsworth/llm-evaluation
Updated 4/1/2026

LLM evaluation with lm-evaluation-harness — MMLU, HumanEval, GSM8K benchmarks, custom tasks, vLLM/HF/OpenAI backends, metrics, and LLM-as-judge. Use when evaluating or benchmarking language models. NOT for training, fine-tuning, dataset preprocessing, or model serving.

Installation

$ npx agent-skills-cli install @tylertitsworth/llm-evaluation

Supported assistants: Claude Code, Cursor, Copilot, Codex, Antigravity

Details

Path: llm-evaluation/SKILL.md
Branch: main
Scoped Name: @tylertitsworth/llm-evaluation

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


name: llm-evaluation
description: "LLM evaluation with lm-evaluation-harness — MMLU, HumanEval, GSM8K benchmarks, custom tasks, vLLM/HF/OpenAI backends, metrics, and LLM-as-judge. Use when evaluating or benchmarking language models. NOT for training, fine-tuning, dataset preprocessing, or model serving."

LLM Evaluation

lm-evaluation-harness (EleutherAI) — the standard framework with 60+ benchmarks. Version: 0.4.x+ (latest: 0.4.11).

v0.4.10 Breaking Change: The base package no longer installs model backends. Install the extras you need:

pip install lm_eval[hf]    # HuggingFace/PyTorch backend
pip install lm_eval[vllm]  # vLLM backend
pip install lm_eval[api]   # OpenAI-compatible API backends
pip install lm_eval[sglang]  # SGLang backend
pip install lm_eval[hf,vllm]  # multiple backends
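
A quick way to see which optional backends are usable in the current environment is to probe for their key dependency. This is a hypothetical helper, not part of lm-eval, and the backend-to-module mapping below is an illustrative assumption:

```python
# Sketch: probe which optional backend dependencies are importable.
# The backend -> module mapping is an assumption for illustration only.
import importlib.util

def available_backends(dep_map=None):
    """Return the backend names whose key dependency can be imported."""
    deps = dep_map or {
        "hf": "transformers",   # lm_eval[hf]
        "vllm": "vllm",         # lm_eval[vllm]
        "sglang": "sglang",     # lm_eval[sglang]
    }
    return sorted(
        name for name, module in deps.items()
        if importlib.util.find_spec(module) is not None
    )

print(available_backends())
```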

Model Backends and Configuration

Backend Reference

| Backend | model value | When to Use |
|---|---|---|
| HuggingFace (local) | hf | Direct model loading on GPU |
| HuggingFace multimodal | hf-multimodal | VLMs (LLaVA, InternVL, Qwen-VL) on multimodal tasks |
| vLLM (local) | vllm | Fast GPU inference, tensor parallel |
| vLLM multimodal | vllm-vlm | VLMs via vLLM backend (faster than hf-multimodal for large models) |
| SGLang (local) | sglang | SGLang engine for fast local inference |
| SGLang Generate API | sglang-generate | SGLang running as a server (native generate endpoint) |
| OpenAI-compatible (completions) | local-completions | Existing vLLM/Ollama server |
| OpenAI-compatible (chat) | local-chat-completions | Chat-format endpoints |
| OpenAI API | openai-completions | OpenAI hosted models |

model_args Reference

All backends accept model_args as a comma-separated key=value string.
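
Because every backend takes the same comma-separated format, a small helper can build the string from keyword arguments. This is a hypothetical convenience function, not part of lm-eval:

```python
def build_model_args(**kwargs):
    """Render keyword args as lm-eval's comma-separated key=value string."""
    return ",".join(f"{key}={value}" for key, value in kwargs.items())

# Example: assemble model_args for the vllm backend
args = build_model_args(
    pretrained="meta-llama/Llama-3.1-8B-Instruct",
    dtype="bfloat16",
    tensor_parallel_size=2,
)
print(args)
# pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16,tensor_parallel_size=2
```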

HuggingFace (hf) model_args:

| Arg | Purpose | Default |
|---|---|---|
| pretrained | Model ID or local path | required |
| dtype | Weight dtype (auto, float16, bfloat16) | auto |
| revision | Model revision/branch | main |
| trust_remote_code | Allow custom model code | False |
| parallelize | Naive model parallelism across GPUs | False |
| max_length | Override max context length | Model default |
| device | Device (cuda, cuda:0, cpu) | Auto |
| peft | Path to PEFT/LoRA adapter | None |
| delta | Path to delta weights | None |
| autogptq | Use AutoGPTQ quantization | False |
| add_bos_token | Prepend BOS token | False |

vLLM (vllm) model_args:

| Arg | Purpose | Default |
|---|---|---|
| pretrained | Model ID or local path | required |
| dtype | Weight dtype | auto |
| tensor_parallel_size | TP across GPUs | 1 |
| gpu_memory_utilization | KV cache memory fraction | 0.9 |
| max_model_len | Max context length | Model default |
| quantization | Quantization method | None |
| trust_remote_code | Allow custom code | False |
| data_parallel_size | Data parallel replicas | 1 |
| max_num_seqs | Max concurrent sequences | 256 |

SGLang (sglang): same model_args as vllm. sglang-generate: targets a running server with base_url=http://localhost:30000. See references/multimodal.md for examples.

API (local-completions / local-chat-completions) model_args:

| Arg | Purpose | Default |
|---|---|---|
| model | Model name in API | required |
| base_url | Server base URL | required |
| tokenizer_backend | Tokenizer source (huggingface) | None |
| num_concurrent | Concurrent API requests | 1 |
| max_retries | Retry count | 3 |
| tokenized_requests | Send token IDs instead of text | False |

CLI (v0.4.10+)

Breaking change in v0.4.10: The CLI now uses explicit subcommands (lm-eval run, lm-eval ls, lm-eval validate). The legacy lm-eval --model ... form still works but is deprecated.

# Run evaluation (primary command)
lm-eval run \
  --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=2 \
  --tasks mmlu,gsm8k,hellaswag \
  --num_fewshot 5 \
  --batch_size auto \
  --apply_chat_template \
  --log_samples \
  --output_path ./results/

# Load all args from YAML config file (v0.4.11+)
lm-eval run --config eval_config.yaml

# Discover available tasks
lm-eval ls tasks                        # list all task names
lm-eval ls tasks --json | jq '.[].name' # machine-readable

# Validate task configs before running
lm-eval validate --tasks mmlu,gsm8k     # catch YAML errors early
lm-eval validate --tasks my_custom_task --include_path ./tasks/

YAML config file (eval_config.yaml) — recommended for reproducible evaluations:

# eval_config.yaml
model: vllm
model_args: pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=2,gpu_memory_utilization=0.9
tasks:
  - mmlu
  - gsm8k
  - hellaswag
num_fewshot: 5
batch_size: auto
apply_chat_template: true
log_samples: true
output_path: ./results/
seed: [0, 1234, 1234, 1234]

lm-eval run --config eval_config.yaml              # all flags from file; CLI overrides file
lm-eval run --config eval_config.yaml --limit 100  # override: only 100 samples

Python API

import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.9",
    tasks=["mmlu", "gsm8k", "hellaswag"],
    num_fewshot=5,
    batch_size="auto",
    log_samples=True,
    apply_chat_template=True,
)

for task_name, task_result in results["results"].items():
    print(f"{task_name}: {task_result}")
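
results["results"] maps each task name to a dict of metric values (real keys include the filter name, e.g. "exact_match,strict-match", plus bookkeeping entries such as "alias"). A small sketch, under those shape assumptions, for flattening the structure into rows:

```python
def flatten_results(results):
    """Yield (task, metric, value) rows from a simple_evaluate results dict,
    skipping non-numeric bookkeeping entries such as 'alias'."""
    for task, metrics in results["results"].items():
        for metric, value in metrics.items():
            if isinstance(value, (int, float)):
                yield task, metric, value

# Illustrative shape only -- real keys depend on the task configs.
fake = {"results": {"gsm8k": {"alias": "gsm8k", "exact_match,strict-match": 0.71}}}
for row in flatten_results(fake):
    print(row)
# ('gsm8k', 'exact_match,strict-match', 0.71)
```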

Evaluation Settings

Core Settings

| Setting | Purpose | Values |
|---|---|---|
| tasks | Benchmark tasks to run | Comma-separated names or list |
| num_fewshot | Few-shot examples | Integer (0 = zero-shot) |
| batch_size | Batch size | Integer or "auto" |
| limit | Sample limit per task | Integer (count) or float (fraction) |
| apply_chat_template | Use model's chat template | True/False |
| fewshot_as_multiturn | Few-shot as multi-turn conversation | True/False |
| system_instruction | System prompt for chat template | String |
| log_samples | Log individual predictions | True/False |
| output_path | Results output directory | Path string |
| seed | Random seed for reproducibility | List [0,1234,1234,1234] (default) |
| use_cache | Cache model responses | Path string or None |

Generation Settings (for generate_until tasks)

These are set per-task in YAML configs, not globally:

generation_kwargs:
  max_gen_toks: 1024          # max tokens to generate
  temperature: 0.0            # 0.0 = greedy
  top_p: 1.0
  do_sample: false
  stop_sequences: ["\n\n"]    # stop on these strings
  until: ["\n\nQuestion:"]    # alternative stop format

Filter Pipeline

Filters post-process model output before metric computation:

filter_list:
  - name: get-answer
    filter:
      - function: regex
        regex_pattern: "#### (\\d+)"   # extract final answer
      - function: take_first
  - name: remove-whitespace
    filter:
      - function: strip
      - function: lowercase

Built-in filter functions: regex, take_first, strip, lowercase, uppercase, remove_whitespace, map, at.
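
The get-answer pipeline above is equivalent to the following plain-Python sketch. The behavior is inferred from the filter names; the harness's actual implementations may differ in edge cases such as the fallback value:

```python
import re

def apply_get_answer(text, pattern=r"#### (\d+)"):
    """Mimic the regex -> take_first filter chain: keep the first capture,
    falling back to a placeholder when nothing matches (assumed behavior)."""
    matches = re.findall(pattern, text)
    return matches[0] if matches else "[invalid]"

print(apply_get_answer("...so 3 + 4 = 7.\n#### 7"))  # 7
```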

W&B Integration

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=my-model",
    tasks=["mmlu"],
    wandb_args="project=eval,name=llama-8b-mmlu,job_type=eval",
)

Benchmarks

Knowledge and Reasoning

| Benchmark | Task | Metric | Shots | What It Measures |
|---|---|---|---|---|
| MMLU | mmlu | acc | 5 | Broad knowledge (57 subjects) |
| MMLU-Pro | mmlu_pro | acc | 5 | Harder MMLU (10 answer choices, CoT) |
| ARC-Challenge | arc_challenge | acc_norm | 25 | Grade-school science reasoning |
| HellaSwag | hellaswag | acc_norm | 10 | Commonsense completion |
| Winogrande | winogrande | acc | 5 | Commonsense coreference |
| TruthfulQA | truthfulqa_mc2 | mc2 | 0 | Resistance to misconceptions |
| GPQA | gpqa_main_zeroshot | acc | 0 | Graduate-level QA |
| BBH | bbh | acc | 3 | Big Bench Hard (23 tasks, CoT) |
| MuSR | musr | acc | 0 | Multi-step reasoning |

Math and Code

| Benchmark | Task | Metric | Shots | What It Measures |
|---|---|---|---|---|
| GSM8K | gsm8k | exact_match | 5 | Grade-school math (multi-step) |
| MATH | minerva_math | exact_match | 4 | Competition mathematics |
| MATH-Hard | minerva_math_hard | exact_match | 4 | Hard subset of MATH |
| HumanEval | humaneval | pass@1 | 0 | Python code generation |
| MBPP | mbpp | pass@1 | 3 | Python programming |

Instruction Following

| Benchmark | Task | Metric | Shots | What It Measures |
|---|---|---|---|---|
| IFEval | ifeval | strict_acc | 0 | Verifiable instruction following |

Knowledge Probing (v0.4.11+)

| Benchmark | Task | Metric | Shots | What It Measures |
|---|---|---|---|---|
| BEAR | bear | acc | 0 | Fill-in-the-blank world knowledge probing |

Open LLM Leaderboard v2

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=my-model,tensor_parallel_size=4",
    tasks=["mmlu_pro", "gpqa_main_zeroshot", "musr", "minerva_math_hard", "ifeval", "bbh"],
    batch_size="auto",
    apply_chat_template=True,
)

Metrics Reference

| Metric | Description | Used By |
|---|---|---|
| acc | Accuracy (multiple choice) | MMLU, ARC, BoolQ |
| acc_norm | Length-normalized accuracy | HellaSwag, ARC, PIQA |
| exact_match | Exact string match (after normalization) | GSM8K, MATH |
| exact_match,strict-match | Strict exact match | GSM8K |
| exact_match,flexible-extract | Flexible number extraction | GSM8K |
| pass@k | Any of k code samples passes tests | HumanEval, MBPP |
| f1 | Token-level F1 | SQuAD |
| mc2 | Weighted multi-choice accuracy | TruthfulQA |
| word_perplexity | Word-level perplexity | WikiText |
| byte_perplexity | Byte-level perplexity | WikiText |
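
pass@k is commonly computed with the unbiased estimator from the HumanEval paper: with n samples per problem of which c pass, pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch of that formula (not the harness's own code):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn from n (of which c pass) passes the tests."""
    if n - c < k:
        # Fewer failures than draws: at least one passing sample is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=10, c=3, k=1), 6))  # 0.3
```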

Metric Gotchas

  • acc vs acc_norm — HellaSwag scores differ by 5+ points depending on which you use; always use acc_norm
  • GSM8K exact_match — answer extraction regex matters hugely; strict-match vs flexible-extract can differ by 10+ points
  • Few-shot count — MMLU 0-shot vs 5-shot can differ by 10+ points; always report shot count
  • Chat template — instruction-tuned models need apply_chat_template=True or scores will be significantly lower
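
To see why strict-match and flexible-extract diverge, here is a sketch of the two extraction styles. The patterns only approximate the harness defaults; the exact regexes live in the gsm8k task config:

```python
import re

def strict_extract(text):
    """Accept only the canonical GSM8K '#### <number>' answer line."""
    match = re.search(r"#### (-?\d[\d,]*)", text)
    return match.group(1).replace(",", "") if match else None

def flexible_extract(text):
    """Fall back to the last number appearing anywhere in the output."""
    numbers = re.findall(r"-?\d[\d,]*", text)
    return numbers[-1].replace(",", "") if numbers else None

# A correct answer in the wrong format scores 0 under strict-match
# but 1 under flexible-extract -- hence the 10+ point gaps.
answer = "The total is 1,234 apples."
print(strict_extract(answer), flexible_extract(answer))  # None 1234
```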

Custom Evaluation Tasks

Task YAML Schema

task: my_custom_task
dataset_path: json                    # or huggingface dataset name
dataset_name: null
dataset_kwargs:
  data_files:
    test: ./data/test.jsonl
output_type: generate_until           # or multiple_choice, loglikelihood
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
generation_kwargs:
  max_gen_toks: 256
  temperature: 0.0
  do_sample: false
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
filter_list:
  - name: get-answer
    filter:
      - function: regex
        regex_pattern: "(\\d+)"
      - function: take_first
num_fewshot: 0
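
The task above expects a JSONL test file whose records carry the question and answer fields referenced by doc_to_text and doc_to_target. A sketch of generating one (the field names come from the templates above; the rows themselves are illustrative):

```python
import json
from pathlib import Path

# Minimal sample rows matching doc_to_text ({{question}}) and
# doc_to_target ({{answer}}) in the task YAML.
rows = [
    {"question": "What is 6 * 7?", "answer": "42"},
    {"question": "How many days are in a week?", "answer": "7"},
]

path = Path("data/test.jsonl")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```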

output_type Reference

| Type | Description | Metrics |
|---|---|---|
| multiple_choice | Model scores each option | acc, acc_norm |
| loglikelihood | Log probability of target | perplexity, acc |
| loglikelihood_rolling | Rolling log probability | word_perplexity, byte_perplexity |
| generate_until | Free-form generation | exact_match, pass@k, custom |
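
For multiple_choice tasks the model never generates text: the harness scores each option's log-likelihood and picks the argmax. A sketch of how acc and acc_norm can disagree (length normalization shown here is a simplification of what the harness does):

```python
def score_multiple_choice(logliks, lengths, gold):
    """Return (acc_hit, acc_norm_hit) for one document.

    logliks: summed log-likelihood of each answer option
    lengths: length of each option (e.g. in bytes or tokens)
    gold:    index of the correct option
    """
    pred = max(range(len(logliks)), key=lambda i: logliks[i])
    pred_norm = max(range(len(logliks)), key=lambda i: logliks[i] / lengths[i])
    return pred == gold, pred_norm == gold

# Longer options accumulate more negative log-likelihood, so
# normalization can flip the prediction.
acc, acc_norm = score_multiple_choice(
    logliks=[-9.0, -12.0], lengths=[3, 6], gold=1)
print(acc, acc_norm)  # False True
```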

Task Groups

group: my_eval_suite
task:
  - my_custom_task
  - mmlu
  - gsm8k
aggregate_metric_list:
  - metric: acc
    aggregation: mean
    weight_by_size: true

Task Inheritance

include: _default_template.yaml
task: my_variant
dataset_name: hard_subset
num_fewshot: 0

LLM-as-Judge Evaluation

Use a custom harness task with process_results: !function to score model outputs with a local vLLM judge. Results appear in results["results"] alongside standard metrics — loggable to W&B/MLflow with no extra plumbing.

Judge Task YAML

# judge_quality.yaml — custom task that scores responses via LLM judge
task: judge_quality
dataset_path: json
dataset_kwargs:
  data_files:
    test: ./data/judge_test.jsonl  # {"question": "...", "reference": "..."}
output_type: generate_until
doc_to_text: "{{question}}"
doc_to_target: "{{reference}}"
generation_kwargs:
  max_gen_toks: 512
  temperature: 0.0
  do_sample: false
metric_list:
  - metric: judge_score
    aggregation: mean
    higher_is_better: true
process_results: !function utils.judge_score
metadata:
  version: 1.0

Judge Utility (utils.py)

The judge runs against a local vLLM endpoint — any model works as judge, no per-token API costs:

# utils.py — placed alongside judge_quality.yaml
import openai

# Point at your local vLLM (or any OpenAI-compatible) endpoint
_client = openai.OpenAI(
    base_url="http://vllm-svc:8000/v1",  # local vLLM service
    api_key="unused",                      # vLLM doesn't need a real key
)
_JUDGE_MODEL = "meta-llama/Llama-3.1-70B-Instruct"  # whatever model vLLM is serving

JUDGE_PROMPT = """Rate the response on a scale of 1-5 for accuracy and completeness.
Return ONLY a JSON object: {{"score": <int>, "reason": "<one sentence>"}}

Question: {question}
Reference: {reference}
Response: {response}"""

def judge_score(doc, results):
    """Called by the harness via process_results: !function utils.judge_score"""
    response_text = results[0]
    msg = JUDGE_PROMPT.format(
        question=doc["question"],
        reference=doc.get("reference", "N/A"),
        response=response_text,
    )
    completion = _client.chat.completions.create(
        model=_JUDGE_MODEL,
        messages=[{"role": "user", "content": msg}],
        temperature=0.0,
        response_format={"type": "json_object"},  # structured output — no fragile parsing
    )
    import json
    parsed = json.loads(completion.choices[0].message.content)
    return {"judge_score": int(parsed["score"])}

Running the Judge Task

# Evaluate a model and score its outputs with the local vLLM judge
lm-eval run \
  --model vllm \
  --model_args pretrained=my-org/my-model,tensor_parallel_size=2 \
  --tasks judge_quality \
  --include_path ./tasks/ \
  --log_samples \
  --output_path ./results/

# Judge scores appear in standard results alongside any other metrics
# results["results"]["judge_quality"]["judge_score"] = 3.72

Integration with W&B / MLflow

Judge scores flow through the standard results dict — no special handling needed:

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=my-org/my-model",
    tasks=["mmlu", "judge_quality"],
    include_path="./tasks/",
    log_samples=True,
    wandb_args="project=eval,name=my-model-judge",
)
# results["results"]["judge_quality"]["judge_score"] logged automatically to W&B

Key Design Decisions

  • Local vLLM judge (base_url: http://vllm-svc:8000/v1) — zero per-token cost at evaluation scale; swap any model as judge
  • response_format={"type": "json_object"} — structured output replaces fragile int(result.strip()) parsing
  • process_results: !function — native harness hook; scores appear in standard output, not a separate pipeline
  • Any model as judge — change _JUDGE_MODEL to whatever vLLM is serving (Llama, Qwen, Mistral, etc.)

See also: assets/judge_quality.yaml, assets/judge_utils.py, MT-Bench, AlpacaEval

Best Practices

  1. Consistent settings across compared models — same shots, batch size, chat template
  2. Report standard error: results include _stderr fields for each metric; small score differences may be noise
  3. Multiple tasks — no single benchmark tells the whole story
  4. vLLM backend for faster GPU evaluation (tensor_parallel_size for large models)
  5. Save raw results (log_samples=True) for debugging unexpected scores
  6. Match eval to use case — coding? HumanEval. Reasoning? GSM8K/MATH. General? MMLU.

VLM Evaluation (Multimodal)

Use hf-multimodal or vllm-vlm backends (see Backend Reference above). Use --apply_chat_template; set fixed --batch_size for hf-multimodal. Key benchmarks: mmbench_en, mmstar, textvqa, docvqa, scienceqa. See references/multimodal.md for full CLI examples, model_args, and benchmark table.

References

  • benchmarks.md — Benchmark descriptions, scoring baselines, and evaluation gotchas
  • troubleshooting.md — Benchmark execution failures, scoring issues, resource problems, and framework-specific fixes

Cross-References

  • vllm — vLLM backend for fast GPU evaluation
  • openai-api — OpenAI-compatible API backends (local-completions, local-chat-completions)
  • wandb — Log evaluation results to W&B
  • huggingface-transformers — HF model loading for hf backend
  • ollama — Local inference alternative for evaluation
  • flyte-sdk — Orchestrate evaluation pipelines as Flyte workflows
  • sglang — SGLang as evaluation backend
  • mlflow — MLflow GenAI evaluation as alternative framework
  • dvc — Version evaluation datasets and results
  • prometheus-grafana — Monitor evaluation infrastructure metrics

Reference

  • lm-evaluation-harness GitHub
  • Available tasks
  • Task YAML guide
  • Open LLM Leaderboard
  • references/benchmarks.md — detailed benchmark descriptions and scoring baselines
  • references/multimodal.md — VLM backend options, multimodal benchmarks, and SGLang usage
  • assets/custom_task.yaml — example custom evaluation task with generate_until, regex filter, and task group definition
  • assets/judge_quality.yaml — LLM-as-Judge task YAML for the harness
  • assets/judge_utils.py — Judge scoring function using local vLLM endpoint