
Benchmarking Guide

How LocoLLM adapters are evaluated and how the results fit into the bigger picture.

The evaluation harness compares an adapter against the base model on the adapter’s own evaluation dataset:

uv run loco eval math

This runs both qwen3:4b (base) and locollm-math (adapter) on the problems in adapters/math/eval_dataset.jsonl and prints a side-by-side score.
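Each line of the evaluation dataset is a JSON object. The exact field names below are an assumption for illustration; check adapters/math/eval_dataset.jsonl for the real schema:

```python
import json

# Hypothetical eval_dataset.jsonl entry; the actual field names may differ.
sample_line = '{"prompt": "What is 17 * 24?", "expected": "408", "difficulty": "easy"}'

record = json.loads(sample_line)
print(record["prompt"], "->", record["expected"])
```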

Three scoring modes are supported, configured per adapter in registry.yaml via the eval_type field:

| eval_type | How it scores | Used by |
| --- | --- | --- |
| numeric | Extract a number from the response and compare it to the expected answer | math |
| code | Check for valid Python syntax and the presence of expected keywords | code |
| analysis | Check that the expected answer string appears in the response | analysis |
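The three modes can be sketched as scoring functions. This is a minimal illustration of the behavior described above, not the harness's actual implementation; function names and the number-extraction details are assumptions:

```python
import ast
import re

def score_numeric(response: str, expected: str) -> bool:
    """Extract the last number in the response and compare to the expected answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return bool(numbers) and float(numbers[-1]) == float(expected)

def score_code(response: str, keywords: list[str]) -> bool:
    """Check that the response parses as valid Python and contains all expected keywords."""
    try:
        ast.parse(response)
    except SyntaxError:
        return False
    return all(keyword in response for keyword in keywords)

def score_analysis(response: str, expected: str) -> bool:
    """Check that the expected answer string appears in the response."""
    return expected.lower() in response.lower()
```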

Each adapter must include at least 50 evaluation examples. See evaluation standards for benchmark construction guidelines.

The following are documented in evaluation standards as requirements but are not yet built into the harness. Each is a student project:

  • Out-of-domain check — run your adapter on another domain’s benchmark to verify it doesn’t degrade general capability. The evaluation standards doc requires out-of-domain scores to stay within 5 percentage points of the base model.
  • Cross-model comparison — run the same benchmark against other quantized models (e.g., does our adapter-trained Qwen3-4B beat a stock Phi-3-mini at Q4_K_M on math?). This would answer whether adapter training is better than just picking a different base model.
  • Frontier API comparison — run the benchmark against GPT-4 / Claude / Gemini to establish an upper bound. Answers the central research question: how close can routed 4-bit specialists get?
  • Structured results output — a results.json file with per-difficulty breakdowns, hardware info, inference settings, and version history.
  • LLM-as-judge scoring — for domains like writing where exact match doesn’t apply.
  • Deterministic inference settings — temperature 0, fixed max_tokens, enforced by the harness rather than relying on Ollama defaults.
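To make the structured-output project concrete, here is one possible shape for results.json, combining per-difficulty breakdowns, hardware info, and pinned inference settings. Every field name and score value below is an illustrative placeholder, not a schema the harness defines:

```python
import json
import platform
from datetime import datetime, timezone

# Hypothetical results.json layout; all field names and numbers are placeholders.
results = {
    "adapter": "locollm-math",
    "base_model": "qwen3:4b",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "inference_settings": {"temperature": 0, "max_tokens": 512},  # deterministic run
    "hardware": {"machine": platform.machine(), "system": platform.system()},
    "scores": {
        "overall": {"base": 0.41, "adapter": 0.68},  # placeholder numbers
        "by_difficulty": {
            "easy": {"base": 0.70, "adapter": 0.90},
            "hard": {"base": 0.12, "adapter": 0.45},
        },
    },
}

print(json.dumps(results, indent=2))
```

Versioned files like this would let later student projects (cross-model and frontier comparisons) diff runs instead of re-reading terminal output.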

How LocoBench Informed Base Model Selection


The base model (Qwen3-4B at Q4_K_M) was selected using data from LocoBench, an independent benchmarking project that evaluates small language models across standard tasks (MMLU, GSM8K, HellaSwag, etc.) at multiple quantization levels on consumer hardware. The selection rationale is documented in ADR-0001 and base model selection.

LocoBench and LocoLLM are separate projects. LocoBench benchmarks base models. LocoLLM benchmarks what adapter training adds on top.