ADR-0006: GGUF and Ollama as Inference Standard
Status
Section titled “Status”Accepted
2026-03-07
Context
Section titled “Context”Multiple quantisation formats and inference runtimes exist for running local LLMs. The main contenders as of early 2026:
| Format | Runtime | Platform Support | Notes |
|---|---|---|---|
| GGUF | Ollama, llama.cpp, LM Studio, GPT4All | macOS, Windows, Linux, CPU+GPU | llama.cpp native format |
| EXL2 | ExLlamaV2, TabbyAPI | NVIDIA GPU only | Fastest tok/s on NVIDIA hardware |
| AWQ | vLLM, TGI, some llama.cpp support | Primarily NVIDIA GPU | Good perplexity at 4-bit |
| GPTQ | AutoGPTQ, vLLM, TGI | Primarily NVIDIA GPU | Older standard, being superseded |
LocoLLM targets students on diverse hardware — MacBooks, office PCs, lab machines, personal laptops. The project also needs a deployment path simple enough that students focus on the AI problem, not the infrastructure problem.
Decision
Section titled “Decision”Standardise on GGUF as the model format and Ollama as the primary inference runtime.
Why GGUF
Section titled “Why GGUF”Portability. GGUF runs on any hardware Ollama supports: NVIDIA GPUs via CUDA, Apple Silicon via Metal, AMD GPUs via ROCm, and pure CPU fallback. A student with a MacBook Air and a student with a desktop GTX 1650 use the same model file. EXL2 and AWQ are effectively NVIDIA-only, which excludes the Apple Silicon users (Poco, and a significant proportion of students).
CPU+GPU hybrid inference. GGUF via llama.cpp can split model layers between GPU VRAM and system RAM. On a 4 GB GPU card, layers that don’t fit in VRAM spill to CPU. Performance degrades gracefully rather than failing entirely. This matters for the 4 GB floor (GTX 1050 Ti, GTX 1650) where VRAM is tight. EXL2 requires the entire model in GPU VRAM.
Quantisation granularity. GGUF’s k-quant variants (Q4_K_M, Q5_K_M, Q3_K_M, etc.) offer fine-grained control over the quality/size tradeoff. Q4_K_M is the project standard — it retains the vast majority of model quality while fitting comfortably in 8 GB with headroom for context. The k-quant approach uses mixed precision per tensor group, preserving quality in sensitive layers while compressing less critical ones.
Ecosystem breadth. GGUF models work in Ollama, LM Studio, GPT4All, llama.cpp directly, and Simon Willison’s llm CLI. Students aren’t locked into one tool. A merged GGUF adapter produced by LocoLLM’s training pipeline can be loaded in any of these.
Why Ollama
Section titled “Why Ollama”Simplicity. One command to install. ollama pull to download models. ollama run to chat. ollama serve for an API. Custom models via a two-line Modelfile pointing at a GGUF. The cognitive overhead is minimal.
Cross-platform consistency. Same commands, same API, same behaviour on macOS, Windows, and Linux. Students helping each other across different operating systems aren’t fighting platform differences.
Local API. Ollama serves an OpenAI-compatible API at localhost:11434. AnythingLLM, Open WebUI, and other tools connect to it directly. This makes Ollama infrastructure rather than an endpoint — other tools build on top of it.
GPU detection. Ollama auto-detects CUDA, Metal, and ROCm and uses the best available backend without user configuration. On Colmena, CUDA_VISIBLE_DEVICES assigns specific GPUs to specific instances. On a student laptop, it just works.
Why not EXL2
Section titled “Why not EXL2”EXL2 (ExLlamaV2) is genuinely faster on NVIDIA GPUs — it saturates memory bandwidth more efficiently than llama.cpp and produces higher tok/s at equivalent quantisation. On a 2060 Super, EXL2 at 5.0 bpw would noticeably outperform GGUF Q4_K_M.
The tradeoff is portability. EXL2 is NVIDIA-only. No Apple Silicon, no CPU fallback, no AMD. The runtime (ExLlamaV2 or TabbyAPI) requires more setup than Ollama. There’s no equivalent of ollama pull — you download models manually from Hugging Face and configure the runtime.
For a project optimising inference speed on a known NVIDIA card, EXL2 is the right choice. For a project where students bring unknown hardware and the deployment path must be frictionless, it’s the wrong choice.
Why not AWQ
Section titled “Why not AWQ”AWQ (Activation-aware Weight Quantisation) produces slightly better perplexity than GGUF at the same bit width. The quality advantage is real but small — typically 0.5-1.5% on benchmarks. AWQ’s primary runtime is vLLM, which is production-grade server software designed for high-throughput serving, not student laptops.
llama.cpp has partial AWQ support, but GGUF remains its native format and gets the most testing and optimisation. Choosing AWQ would mean either running vLLM (heavy) or running AWQ through llama.cpp (second-class support).
Consequences
Section titled “Consequences”- All LocoLLM adapters are exported as merged GGUF files (Q4_K_M standard)
- Students install Ollama once and use it for all models and adapters
- Apple Silicon users (MacBook, Mac Mini, etc.) are first-class participants
- Students on 4 GB GPUs get graceful CPU+GPU hybrid rather than hard failure
- Inference speed on NVIDIA hardware is not optimal — EXL2 would be faster. This is an accepted tradeoff for portability
- LM Studio, GPT4All, and
llmCLI all work with LocoLLM’s model files as alternative interfaces - If Ollama adds native EXL2 support in future, the benefit becomes available without changing the project’s deployment path
- LocoBench benchmarks use GGUF/Ollama, which means results reflect what Ollama users will actually experience rather than peak theoretical throughput