
Base Model Selection

This document explains how the standard base model is chosen, the rationale behind the current selection, and the external benchmarking resources used to inform the decision.

  • Model: Qwen3-4B-Instruct
  • Quantization: Q4_K_M (GGUF format)
  • RAM footprint: ~2.5GB model + ~0.5GB runtime overhead
  • Effective RAM requirement: ~3.5GB (leaving headroom for OS and adapters on 8GB machines)
  • Academic year: 2026-2027
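
Before installing on a student machine, it can help to sanity-check that this RAM budget actually fits. A minimal sketch, assuming the third-party psutil package is available; the figures are the estimates quoted above, not measured values:

```python
# Sanity-check available RAM against the Q4_K_M budget quoted above.
# Figures are approximate estimates from this document, not measurements.
import psutil

MODEL_GB = 2.5      # Qwen3-4B-Instruct weights at Q4_K_M
RUNTIME_GB = 0.5    # Ollama / llama.cpp runtime overhead
ADAPTER_GB = 0.1    # one active LoRA adapter
REQUIRED_GB = MODEL_GB + RUNTIME_GB + ADAPTER_GB

available_gb = psutil.virtual_memory().available / 1024**3
headroom_gb = available_gb - REQUIRED_GB

print(f"Available: {available_gb:.1f}GB, required: ~{REQUIRED_GB:.1f}GB, headroom: {headroom_gb:+.1f}GB")
if headroom_gb < 0.5:
    print("Warning: under 0.5GB headroom; expect swapping once a browser is open.")
```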

The Qwen3-4B-Instruct model ranks first for post-adapter-training performance across 8 diverse tasks in distil labs’ systematic benchmark of 12 small models, outperforming even 8B models after LoRA training. It achieves this while staying within our 8GB RAM constraint in Q4_K_M quantization.

The most compelling finding for LocoLLM: the adapter-trained Qwen3-4B matched or exceeded a 120B+ teacher model on 7 of 8 benchmarks. On SQuAD 2.0, it beat the teacher by 19 percentage points. A 4B model, properly trained with LoRA adapters, can match a model 30x its size. That’s the entire thesis of this project validated in someone else’s data.

One of the most important findings from recent small model research is that smaller models gain more from adapter training than larger ones. The distil labs benchmark showed the tunability ranking inverts the size hierarchy: Llama-3.2-1B and Qwen3-0.6B showed the largest improvements from adapter training, while 8B models gained the least (because they start stronger and have less room to improve).

This directly validates LocoLLM’s architecture. We’re not settling for small models as a compromise. We’re exploiting the fact that small models are precisely the ones that benefit most from the kind of task-specific adaptation we’re building. The adapter approach isn’t compensating for a weakness; it’s leveraging a strength unique to the small model class.

The base model must satisfy all of the following hard requirements:

  1. Fits in 8GB RAM with OS, runtime, and adapter loaded simultaneously. In practice this means the quantized model should be under 3GB.

  2. Instruction-tuned variant available. We need a model that can follow instructions out of the box, not a raw pretrained model.

  3. GGUF format available (or convertible) for Ollama compatibility.

  4. Permissive license for academic use, redistribution, and modification. Apache 2.0, MIT, or equivalent. No “research only” restrictions.

  5. Active maintenance. The model provider is actively developing the model family, releasing updates, and responding to community issues.

Beyond the hard requirements, the following soft criteria are weighed:

  1. Strong adapter training response (tunability). This matters more than raw base performance for LocoLLM, since every query goes through an adapter. A model that improves dramatically with LoRA training is more valuable than one that starts slightly stronger but plateaus.

  2. Good multilingual support. Many of our students work in multiple languages.

  3. Large community. More users means more documentation, tutorials, and LoRA examples to learn from.

  4. Compatible with standard training tools. Works with HuggingFace PEFT, Unsloth, and other common LoRA training frameworks without special modifications.

You don’t have to guess how small models perform or infer from larger siblings. Dedicated benchmarking resources exist for the sub-7B class:

HuggingFace Open LLM Leaderboard https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard

The standard community leaderboard for open models. Tests on IFEval, BBH, MATH, GPQA, MuSR, and MMLU-PRO. Filterable by model size, so you can compare models within a weight class rather than against 70B behemoths.

Open LLM Leaderboard: Best Models by Size https://huggingface.co/collections/open-llm-leaderboard/open-llm-leaderboard-best-models

HuggingFace maintains a curated collection of the top-performing models in each parameter bucket (around 1B, 2B, 7B, 13B, etc.). Useful as a quick reference for what’s currently winning at each weight class.

distil labs: SLM Fine-Tuning Benchmark https://www.distillabs.ai/blog/we-benchmarked-12-small-language-models-across-8-tasks-to-find-the-best-base-model-for-fine-tuning

The most directly relevant resource for LocoLLM. Benchmarks 12 small models (0.1B to 8B) across 8 tasks, measuring both base performance and post-adapter-training performance. This is where the tunability inversion finding comes from and where Qwen3-4B was identified as the top performer after adapter training.

SLM-Bench (EMNLP 2025 Findings) https://aclanthology.org/2025.findings-emnlp.1165/

An academic benchmark specifically designed for small language models. Measures 11 metrics across correctness, computational efficiency, and energy consumption on 4 hardware configurations. Useful for understanding efficiency trade-offs, not just accuracy.

Small Language Models: Survey, Measurements, and Insights https://arxiv.org/html/2409.15790v1

Comprehensive survey covering SLMs from 100M to 5B parameters. Includes inference latency measurements (first token time and decode latency per token), memory footprint analysis on actual hardware (including Jetson for edge deployment), and pre-training dataset quality comparisons. Essential reading for understanding what drives performance at this scale.

When evaluating a candidate base model for LocoLLM:

  1. Check the Open LLM Leaderboard for its ranking within its size class on standard benchmarks
  2. Check distil labs (or run your own version of their methodology) for adapter training response (note: distil labs uses the term “fine-tuning” to include LoRA adapter training — this is common in the literature but distinct from full fine-tuning)
  3. Check SLM-Bench or the SLM survey for practical hardware measurements (latency, memory, energy)
  4. Run LocoLLM’s own evaluation on the specific task domains we care about (see Evaluation Process below)

No single benchmark tells the whole story, but together they give a much clearer picture than guessing from the parent model’s performance.

Important Caveat: Most Benchmarks Test Full Precision

Nearly all of the external benchmarks above evaluate models at full precision (bfloat16 or float16), not at Q4_K_M quantization. This matters because LocoLLM runs quantized models exclusively. We are making base model selections based on full-precision adapter training rankings and assuming those rankings hold after 4-bit quantization. That assumption is probably correct, but we hold it with more confidence than the evidence strictly supports.

What quantization-specific evidence does exist comes from studies on larger models that suggest the picture is nuanced:

Large models recover well. Red Hat’s evaluation of half a million quantized model runs found that quantized models recover 99% of full-precision performance on average (96% minimum). However, they tested Llama 3.1 at 8B, 70B, and 405B. They specifically noted that “smaller models (8B) may experience slight variability” compared to larger ones. Sub-8B models were not systematically tested.

Task sensitivity varies significantly. A study by ionio.ai across Qwen2.5, DeepSeek, Mistral, and LLaMA 3.3 found that Q4_K_M retains roughly 90% of BF16 accuracy on reasoning tasks (BBH), but knowledge-heavy benchmarks like MMLU and C-Eval show 15-20% reductions. An IJCAI 2025 study explicitly warned that “in smaller LLMs, using 4-bit quantization can lead to significant accuracy drops, especially with GPTQ” and noted that “the Open LLM Leaderboard currently provides only limited data on quantized models, highlighting the need for comprehensive evaluation.”

The quantization cliff is real at 1B. The most concerning data point for the small model class comes from an e-commerce study using Llama 3.2 1B: Q5_K_M retained 0.99 accuracy, Q4_K_M dropped to 0.89, and Q3_K_M collapsed to 0.60. This suggests a sharp cliff between quant levels at very small scales. The 4B class may be more resilient than 1B, but we don’t have systematic data proving it.

The gap nobody has filled. Nobody systematically benchmarks the intersection that LocoLLM sits in: multiple 3-4B models, at Q4_K_M specifically, across standard benchmarks, on consumer CPU hardware. distil labs tells you which model fine-tunes best in full precision. Quantization studies tell you how much larger models lose at 4-bit. But “how do these specific small models perform after quantization AND fine-tuning on the same tasks” is genuinely undocumented territory. LocoLLM’s Phase 1 benchmarks will be among the first to generate this data systematically. See the benchmarking guide for the full methodology, hardware options, and the “bang per bit” visualisation plan.

Research Viability: Can Adapter Training Make Quantized Small Models Good Specialists?

This is the central question for LocoLLM, and the evidence from multiple independent research groups converges on a clear answer: yes, and the gap to frontier models is closing from both directions.

QLoRA (Dettmers et al., NeurIPS 2023) demonstrated that training LoRA adapters through a frozen 4-bit quantized base model fully recovers 16-bit adapter training performance. Using NF4 quantization with double quantization on LLaMA models from 7B to 65B, QLoRA matched 16-bit LoRA on MMLU. Their key finding for LocoLLM: training adapters on a small, high-quality dataset produced state-of-the-art results even with smaller models than the previous best. Data quality mattered far more than dataset size: a 9K-sample dataset outperformed a 450K-sample dataset.
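
As a concrete illustration of the QLoRA recipe (frozen 4-bit NF4 base with double quantization, trainable LoRA adapters on top), here is a minimal sketch using HuggingFace transformers, PEFT, and bitsandbytes. The model id and LoRA hyperparameters are illustrative assumptions, not LocoLLM’s actual training configuration:

```python
# Minimal QLoRA-style setup: frozen 4-bit NF4 base + trainable LoRA adapter.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE_ID = "Qwen/Qwen3-4B-Instruct-2507"  # assumed repo id; verify on the Hub

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # base model stays frozen at 4-bit
    bnb_4bit_quant_type="nf4",            # NF4 quantization, as in QLoRA
    bnb_4bit_use_double_quant=True,       # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(BASE_ID, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,   # illustrative hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only adapter weights are trainable
```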

The original QLoRA work focused on 7B+ models. For LocoLLM’s 4B target, the question was whether the same mechanism holds at smaller scale. Newer work confirms it does.

Standard QLoRA has a known underfitting problem when training adapters on quantized models: the adapter sees complex inputs and outputs but has limited trainable capacity. Two recent papers specifically address this:

Q-BLoRA (Shen et al., TACL 2025) rebalances the adapter by simplifying inputs/outputs and increasing rank. It consistently outperforms QLoRA across LLaMA, LLaMA2, Mistral, and Gemma models. The 4-bit inference variant (QA-BLoRA) outperforms other 4-bit models and even surpasses some 16-bit adapter benchmarks.

QR-Adaptor (2025) jointly optimizes quantization bit-width and LoRA rank per layer, allocating more precision and adaptation capacity to critical layers. It achieved a 4.89% accuracy improvement over QLoRA on GSM8K math benchmarks, and in some configurations surpassed 16-bit adapter-trained models while using memory comparable to a 4-bit setting.

These aren’t theoretical improvements. They’re published, benchmarked methods that directly apply to LocoLLM’s training pipeline. As this tooling matures and gets integrated into standard libraries (HuggingFace PEFT, Unsloth), quantized adapter training quality will continue to improve without LocoLLM needing to change its architecture.

Domain-Specific Evidence: Small Specialists Beat Large Generalists

Multiple independent studies across different domains confirm the pattern that LocoLLM depends on:

Cybersecurity. CyberBench (Liu et al., 2024) found that smaller, adapter-trained LLMs can sometimes match or exceed the performance of larger general-purpose models on domain-specific cybersecurity tasks including named entity recognition, summarization, and classification.

Medical and scientific domains. A complexity-aware fine-tuning study (2025) applied its pipeline to Qwen2.5-3B, Phi-4-Mini, and Llama 3.2 3B across medical QA (MedMCQA), mathematics (GSM8K), and general reasoning (MMLU-Pro). The authors found that carefully trained smaller models match or outperform larger open models in mathematics, medicine, and chemistry. Their approach used chain-of-thought distillation from larger teacher models, which aligns with LocoLLM’s potential to use frontier model outputs as adapter training data.

Language exams. A study training LoRA adapters on compact open-source models for Ukrainian language exam tasks showed that parameter-efficient adapter training combined with quantization produced substantial improvements over baseline. The adapter-trained models outperformed GPT-4o mini, Mistral Large, and larger open-weight models, all running on a single A100 GPU.

Cybersecurity with quantization. CyberLLM-FINDS (2025) specifically combined domain-specific adapter training with quantized models under 2B parameters. They found that chain-of-thought reasoning paired with quantized weights performed best, though local inference was constrained to 200-400 effective tokens despite nominal 2048 context support. This is a practical constraint LocoLLM should monitor.

LocoLLM sits at the intersection of five trends that are all moving in the right direction:

  1. Small base models are improving every generation. Qwen3-4B outperforms last year’s Qwen2.5-7B on over half of benchmarks. Each generation of 4B models starts from a higher baseline, which means quantization losses matter less in absolute terms.

  2. Adapter training disproportionately helps small models. The tunability inversion from distil labs data means the technique LocoLLM depends on most, domain-specific LoRA training, is precisely the technique that benefits small models the most relative to their size.

  3. Quantization techniques are improving in parallel. Q-BLoRA, QR-Adaptor, and similar methods are specifically addressing the accuracy gap in quantized adapter training. The 4-bit penalty is shrinking with each new method.

  4. The combination is systematically underexplored. Plenty of people benchmark base models. Plenty benchmark quantization. Plenty benchmark adapter training. Almost nobody benchmarks all three together at the 3-4B scale, which means LocoLLM’s evaluation data will fill a genuine gap in the literature.

  5. 1.58-bit is a parallel bet, not a replacement. If BitNet tooling matures, the same specialist-adapter architecture applies at even more extreme compression. The research question (“can routed specialist adapters close the gap to frontier?”) stays valid regardless of precision format.

The risk isn’t that this approach is a dead end. It’s that general-purpose small models improve so rapidly that task-specific adapter training becomes unnecessary. If Qwen4-4B (or whatever ships in 2027) is good enough at everything that adapters add negligible value, the architecture becomes pointless overhead.

But the distil labs data argues against this. Even with Qwen3-4B being excellent out of the box, adapter training still produced massive gains on specific tasks. An adapter-trained 4B matched a 120B+ teacher. The gap between “good generalist” and “great specialist” persists at every model size and every generation. There’s no sign of it closing.

When selecting a new base model (typically annually), the following evaluation is performed:

Step 1: Hard requirements filter. Eliminate any model that doesn’t meet all hard requirements. This usually reduces the field to 3-5 candidates.

Step 2: External benchmark review. Before running any local tests, review the external resources listed above. If a model performs poorly on the Open LLM Leaderboard for its size class, or shows weak tunability in distil labs’ data, there’s no need to spend time testing it locally.

Step 3: Base model benchmarks. Run each remaining candidate on the existing LocoLLM benchmark suite (all adapter benchmarks combined, using only the base model without adapters). This gives a direct comparison of how well each candidate handles our specific task domains out of the box.

Step 4: Test adapter training. Train a quick test adapter (using the math-reasoning training data) on each candidate. Compare:

  • Training convergence speed
  • Final benchmark score after identical training
  • Adapter size

This is the most important step. Some models respond dramatically better to LoRA adapter training than others at the same parameter count. The distil labs data provides a starting point, but LocoLLM’s task domains may differ.
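
For the adapter-size part of the comparison, a small sketch, assuming each test adapter was saved with PEFT’s save_pretrained() as safetensors; the paths are hypothetical:

```python
# Report adapter parameter count and on-disk size for one trained candidate.
from pathlib import Path
from safetensors import safe_open

def adapter_report(adapter_dir: str) -> None:
    files = list(Path(adapter_dir).glob("*.safetensors"))
    n_params = 0
    for f in files:
        with safe_open(f, framework="pt") as sf:
            for key in sf.keys():
                n_params += sf.get_tensor(key).numel()
    size_mb = sum(f.stat().st_size for f in files) / 1e6
    print(f"{adapter_dir}: {n_params:,} adapter params, ~{size_mb:.0f}MB on disk")

# Hypothetical path for the math-reasoning test adapter:
# adapter_report("adapters/math-reasoning-qwen3-4b")
```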

Step 5: Hardware testing. Install each candidate on the lowest-spec student laptop available and test (a measurement sketch follows the list):

  • Time to first token
  • Tokens per second
  • Memory stability over long conversations
  • Compatibility with Ollama on macOS, Windows, and Linux
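
A minimal sketch of the first two measurements, using Ollama’s local HTTP API (the timing fields it reports are in nanoseconds). The model tag is an assumption; substitute whatever tag the candidate is published under:

```python
# Approximate time-to-first-token and decode speed from Ollama's generate API.
import requests

def measure(model_tag: str, prompt: str) -> None:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model_tag, "prompt": prompt, "stream": False},
        timeout=600,
    ).json()

    # load_duration + prompt_eval_duration approximates time to first token.
    ttft_s = (resp["load_duration"] + resp["prompt_eval_duration"]) / 1e9
    tok_per_s = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"{model_tag}: ~{ttft_s:.1f}s to first token, {tok_per_s:.1f} tok/s decode")

# measure("qwen3:4b", "Explain LoRA adapters in two sentences.")  # tag assumed
```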

| Model | Parameters | Est. Q4_K_M Size | License | Adapter Training Rank (distil labs) | Base Rank (distil labs) |
| --- | --- | --- | --- | --- | --- |
| Qwen3-4B-Instruct | 4B | ~2.5GB | Apache 2.0 | #1 | #3 |
| Qwen3-1.7B | 1.7B | ~1.1GB | Apache 2.0 | #4 | #5 |
| Qwen3-0.6B | 0.6B | ~0.4GB | Apache 2.0 | #6 | #8 |
| Llama 3.2-3B-Instruct | 3.2B | ~2.0GB | Llama 3.2 Community | #5 | #6 |
| Llama 3.2-1B-Instruct | 1B | ~0.7GB | Llama 3.2 Community | #7 | #9 |
| Gemma 3-1B-it | 1B | ~0.7GB | Gemma License | #8 | #7 |
| SmolLM2-1.7B-Instruct | 1.7B | ~1.1GB | Apache 2.0 | #9 | #10 |

Rankings sourced from distil labs benchmark (June 2025). LocoLLM-specific benchmarks to be added during Semester 2, 2026.

A 7B model in Q4_K_M quantization requires approximately 4.5GB of RAM for the model alone. With OS overhead (1-2GB), Ollama runtime (~0.5GB), and an active LoRA adapter (~0.1GB), total usage reaches 6-7GB. This is technically possible on an 8GB machine but leaves almost no headroom, leading to:

  • Swapping to disk under memory pressure (catastrophic for inference speed)
  • Inability to run a web browser or other applications alongside LocoLLM
  • Unreliable performance on machines with shared GPU memory (integrated graphics)

The 3-4B class provides a comfortable margin while still fitting on constrained hardware. And the tunability data shows that smaller models gain more from adapter training, so the gap between a 4B adapter-enhanced model and a 7B general model is narrower than the parameter count suggests.
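
To make the headroom comparison concrete, a small budget calculation using the approximate figures quoted above:

```python
# Rough RAM budgets on an 8GB machine, using this document's estimates.
OS_GB, RUNTIME_GB, ADAPTER_GB, TOTAL_GB = 1.5, 0.5, 0.1, 8.0
model_weights_gb = {"7B @ Q4_K_M": 4.5, "4B @ Q4_K_M": 2.5}

for name, weights in model_weights_gb.items():
    used = weights + OS_GB + RUNTIME_GB + ADAPTER_GB
    print(f"{name}: ~{used:.1f}GB used, ~{TOTAL_GB - used:.1f}GB headroom")
```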

Models in the 1-2B range are tempting for their speed and tiny footprint, and the tunability data shows they gain the most from adapter training in relative terms. However, their absolute post-adapter-training scores still trail the 4B class. The Qwen3-4B consistently produces the best adapter-trained results across diverse tasks.

That said, 1-2B models are worth considering for specific use cases:

  • Chromebooks or tablets with 4GB RAM
  • Situations where inference speed matters more than quality
  • As a secondary “fast model” in a tiered routing setup

If hardware constraints force a smaller model, Llama-3.2-1B-Instruct shows the highest tunability and Qwen3-1.7B offers the best balance of size and post-adapter-training quality below 2B.

Future: 1.58-Bit Native Models (Research Track)

The most significant development on the horizon for LocoLLM is 1.58-bit native quantization, primarily through Microsoft’s BitNet architecture and models built on it.

Unlike post-training quantization (where you train a model in full precision then compress it), BitNet models are trained natively with ternary weights: every weight is -1, 0, or +1. This is 1.58 bits per parameter (log2(3)). Because the model learns to work within these constraints from the start, it avoids the quality loss that comes from compressing a model after the fact.
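
A back-of-envelope check of the 1.58-bit figure and the disk size quoted in the table below (ignoring embeddings, non-ternary layers, and packing overhead, so treat it as an order-of-magnitude estimate):

```python
# log2(3) bits per ternary weight, scaled to a ~2B-parameter model.
import math

bits_per_weight = math.log2(3)   # ternary weights: -1, 0, +1
params = 2e9                     # roughly 2B parameters
size_gb = params * bits_per_weight / 8 / 1e9

print(f"{bits_per_weight:.2f} bits/weight -> ~{size_gb:.2f}GB of weights for a 2B model")
```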

The numbers are significant for LocoLLM’s hardware floor:

| Metric | Qwen3-4B (Q4_K_M) | BitNet b1.58 2B4T |
| --- | --- | --- |
| Model size on disk | ~2.5GB | ~0.4GB |
| RAM usage | ~3.5GB | ~0.8GB |
| Inference speed (CPU) | 15-30 tok/s | ~6 tok/s (unoptimized) |
| Power draw during inference | ~35W | ~10W |
| Minimum viable hardware | 8GB RAM laptop | Raspberry Pi 5 / 4GB Chromebook |

A 0.4GB model running on a Raspberry Pi at human reading speed opens LocoLLM to an entirely different population of users: students in developing regions, schools with no laptop budget, offline kiosks, phone-based access.

LoRA incompatibility. Standard LoRA adapters attach to nn.Linear layers. BitNet replaces these with BitLinear layers that use ternary weights. The two architectures are fundamentally incompatible. Emerging solutions:

  • BitLoRA (2025): A modified PEFT method designed specifically for BitLinear layers. All adapter weights also operate in ternary. Early results are promising but the tooling is not yet production-ready.
  • Falcon-Edge (TII, 2025): 1B and 3B models pre-trained natively in 1.58-bit format with a training paradigm specifically designed to support fine-tuning. Available in both BitNet and bfloat16 variants from the same training run.
  • BitDistill (2025): A framework for distilling existing full-precision models into 1.58-bit BitNet format with performance comparable to the original. Three-stage process: modeling refinement, continued pre-training, and attention distillation.
  • HuggingFace 1.58-bit fine-tuning: HuggingFace demonstrated that existing models can be gradually fine-tuned down to 1.58-bit using warmup quantization techniques, though results are not yet as strong as native pre-training.

No Ollama support. BitNet models require Microsoft’s bitnet.cpp inference runtime or specialized kernels. They cannot currently run through Ollama or standard llama.cpp. This means a separate installation path and a different user experience.

Limited model selection. As of mid-2025, the available natively-trained 1.58-bit models are: BitNet b1.58 2B4T (Microsoft), Falcon-Edge 1B/3B (TII), and a handful of community experiments. The selection will grow, but it’s thin compared to the hundreds of 4-bit quantized models available.

4K context length. BitNet b1.58 2B4T has a maximum context of 4,096 tokens. This limits use cases that require longer context windows. Long-context fine-tuning is recommended but adds complexity.

LocoLLM’s architecture is designed to be base-model-agnostic. The router, evaluation harness, benchmarks, and adapter submission process all work regardless of the underlying model’s precision format. This means we can run a parallel track:

Semester 3 project: “LocoLLM-1bit”

A student team ports the LocoLLM framework to a 1.58-bit base (Falcon-Edge 3B or successor), adapts the adapter training pipeline to use BitLoRA or Falcon-Edge’s native training approach, and benchmarks the same task domains at both precisions.

Research questions:

  • Does routed 1.58-bit task specialization close the gap to 4-bit general models?
  • What is the quality/memory/speed trade-off curve across precisions for the same tasks?
  • Which task domains are most and least sensitive to extreme quantization?
  • Is the tunability inversion (smaller models gain more from adapter training) even more pronounced at 1.58-bit?

This comparison, done rigorously on the same task benchmarks, would be a novel contribution. Nobody has published routed multi-adapter evaluation at 1.58-bit precision.

LocoLLM should consider making 1.58-bit the default pathway when:

  1. At least one 1.58-bit base model exists at 3-4B parameters with competitive benchmark scores
  2. A LoRA-compatible adapter training method (BitLoRA or equivalent) is available through standard tooling (HuggingFace PEFT or similar)
  3. An inference runtime works cross-platform (macOS, Windows, Linux) with a user experience comparable to Ollama
  4. Our own benchmarks confirm that routed 1.58-bit adapters achieve at least 80% of the quality of routed 4-bit adapters on LocoLLM task domains

Until those conditions are met, 4-bit remains the production default and 1.58-bit remains a research track.

A base model change affects the entire ecosystem. All existing adapters must be verified or retrained. This is a significant community effort, so changes should be:

  • Infrequent: Once per academic year at most
  • Well-justified: The new model must be meaningfully better, not just marginally
  • Planned: Announce at least one semester in advance so teams can prepare
  • Backwards-compatible: Maintain the previous model as a fallback for one semester

The migration process for a base model change:

  1. Announce candidate model and rationale
  2. Run full evaluation (Steps 1-5 above)
  3. Test representative adapters from the current ecosystem on the new base
  4. If adapters transfer well: publish conversion guide
  5. If adapters don’t transfer: coordinate retraining effort
  6. Update all documentation, templates, and training scripts
  7. Old base model remains supported for one additional semester