Benchmarking Guide

How LocoLLM adapters are evaluated and how the results fit into the bigger picture.

What `loco eval` Does Today

The evaluation harness compares an adapter against the base model on the adapter’s own evaluation dataset:

uv run loco eval math

This runs both qwen3:4b (base) and locollm-math (adapter) on the problems in adapters/math/eval_dataset.jsonl and prints a side-by-side score.

Three scoring modes are supported, configured per adapter in registry.yaml via the eval_type field:

eval_type	How it scores	Used by
`numeric`	Extract a number from the response, compare to expected answer	math
`code`	Check valid Python syntax + expected keywords present	code
`analysis`	Check that the expected answer string appears in the response	analysis

Each adapter must include at least 50 evaluation examples. See evaluation standards for benchmark construction guidelines.

What’s Not Implemented Yet

The following are documented in evaluation standards as requirements but are not yet built into the harness. Each is a student project:

Out-of-domain check — run your adapter on another domain’s benchmark to verify it doesn’t degrade general capability. The eval-standards doc specifies within 5 percentage points of base.
Cross-model comparison — run the same benchmark against other quantized models (e.g., does our adapter-trained Qwen3-4B beat a stock Phi-3-mini at Q4_K_M on math?). This would answer whether adapter training is better than just picking a different base model.
Frontier API comparison — run the benchmark against GPT-4 / Claude / Gemini to establish an upper bound. Answers the central research question: how close can routed 4-bit specialists get?
Structured results output — results.json with per-difficulty breakdowns, hardware info, inference settings, and version history.
LLM-as-judge scoring — for domains like writing where exact match doesn’t apply.
Deterministic inference settings — temperature 0, fixed max_tokens, enforced by the harness rather than relying on Ollama defaults.

How LocoBench Informed Base Model Selection

The base model (Qwen3-4B at Q4_K_M) was selected using data from LocoBench, an independent benchmarking project that evaluates small language models across standard tasks (MMLU, GSM8K, HellaSwag, etc.) at multiple quantization levels on consumer hardware. The selection rationale is documented in ADR-0001 and base model selection.

LocoBench and LocoLLM are separate projects. LocoBench benchmarks base models. LocoLLM benchmarks what adapter training adds on top.

Benchmarking Guide

What loco eval Does Today

What’s Not Implemented Yet

How LocoBench Informed Base Model Selection

What `loco eval` Does Today