# LocoLLM Research Roadmap
This document outlines the research questions, milestones, and evaluation criteria for the LocoLLM project.

## Core Research Question

Can a routed swarm of adapter-trained small language models (4B parameters, 4-bit quantization) outperform a generalist model of equivalent size on real-world tasks, while running entirely on consumer hardware?

### Sub-Questions

- Specialisation vs generalisation: At what point does a specialist adapter reliably beat the base model, and how much training data is needed?
- Routing accuracy: How well can a lightweight router direct queries to the right specialist, and what is the cost of misrouting?
- Ensemble effects: Does voting across multiple adapters for the same domain improve accuracy beyond the single best adapter? (See the voting sketch after this list.)
- Inference-time enhancements: How much do RE2 prompting and self-consistency voting improve results when compute is effectively free (local inference)?
- Scaling the swarm: How does system performance change as we add more domains (5 → 10 → 20 adapters)?
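
To make the ensemble and self-consistency questions concrete, here is a minimal majority-voting sketch. The `generate_fns` callables (one per adapter) and the answer-extraction rule are hypothetical placeholders, not part of the current codebase:

```python
from collections import Counter
from typing import Callable, Sequence

def extract_answer(completion: str) -> str:
    """Hypothetical normaliser: treat the last non-empty line as the final answer."""
    lines = [ln.strip() for ln in completion.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def majority_vote(
    generate_fns: Sequence[Callable[[str], str]],  # one callable per adapter
    prompt: str,
    samples_per_adapter: int = 3,                  # >1 adds self-consistency
) -> str:
    """Sample every adapter several times and return the most common answer."""
    answers = [
        extract_answer(generate(prompt))
        for generate in generate_fns
        for _ in range(samples_per_adapter)
    ]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```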

## Phased Milestones

### Phase 1: Foundation (Semester 1, 2026)

Goal: Prove the concept with 3-5 adapters in clearly distinct domains.

| Milestone | Deliverable | Success Criteria |
|---|---|---|
| Base model selected | ADR-0001, benchmarks | Model fits in 8GB RAM; strong baseline scores |
| First adapter (math) | Trained LoRA, eval results | Beats base model on GSM8K |
| Evaluation harness | `loco eval` command | Reproducible, automated benchmarks |
| Keyword router | v1 router implementation | Correct routing on 90%+ of test queries |
| 3 domain adapters | math, code, writing (minimum) | Each beats base model on domain benchmark |
| Registry v2 | Domain grouping, benchmark scores | Supports multiple adapters per domain |
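
The keyword-router milestone can be made concrete with a small sketch. The domain names, keyword sets, and `route` function below are illustrative assumptions, not the project's actual v1 implementation:

```python
# Illustrative v1 keyword router; domains and keyword sets are assumptions.
DOMAIN_KEYWORDS = {
    "math": {"solve", "equation", "integral", "probability", "prove"},
    "code": {"function", "bug", "python", "compile", "refactor"},
    "writing": {"essay", "rewrite", "tone", "paragraph", "draft"},
}

def route(query: str, default: str = "base") -> str:
    """Return the domain with the largest keyword overlap with the query,
    or fall back to the base model when nothing matches."""
    tokens = set(query.lower().split())
    best, hits = max(
        ((domain, len(keywords & tokens)) for domain, keywords in DOMAIN_KEYWORDS.items()),
        key=lambda pair: pair[1],
    )
    return best if hits > 0 else default
```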

### Phase 2: Scale & Compete (Semester 2, 2027)

Goal: 6-10 adapters, classifier router, first ensemble experiments.

| Milestone | Deliverable | Success Criteria |
|---|---|---|
| Classifier router | ML-based routing | Handles domain overlap, >85% accuracy |
| Multi-adapter domains | 2+ adapters per domain | Leaderboard shows ranking |
| Ensemble voting | Cross-adapter voting mode | Measurable accuracy gain vs single adapter |
| Out-of-domain testing | Automated regression checks | No adapter degrades base model by >5% |
| Adapter installer | `loco adapters install` | Students can share adapters easily |
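
The out-of-domain testing milestone implies an automated gate. A sketch, reading the >5% threshold as a relative drop (whether it is relative or absolute is an assumption here, as are the score-dictionary shapes):

```python
def regression_failures(
    base_scores: dict[str, float],     # benchmark -> base-model accuracy
    adapter_scores: dict[str, float],  # benchmark -> accuracy with adapter active
    max_relative_drop: float = 0.05,
) -> list[str]:
    """List the out-of-domain benchmarks where the adapter falls more than
    5% (relative) below the base model; an empty list means the gate passes."""
    return [
        bench
        for bench, base in base_scores.items()
        if base > 0 and (base - adapter_scores.get(bench, 0.0)) / base > max_relative_drop
    ]
```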

### Phase 3: Research Depth (Semester 3+, 2027-2028)

Goal: Push the boundaries with confidence routing, adapter composition, and sub-4-bit quantization.

| Milestone | Deliverable | Success Criteria |
|---|---|---|
| Confidence routing | Router with uncertainty estimation | Graceful fallback on ambiguous queries |
| Adapter composition | Multi-adapter inference | Meaningful quality gain on cross-domain tasks |
| 1.58-bit exploration | BitNet-style quantization | Feasibility study on quality vs memory |
| Learned router | Feedback-driven routing | Improves with usage data |
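
A minimal sketch of confidence routing with graceful fallback, assuming the router exposes calibrated per-domain probabilities; the thresholds and the specialist/ensemble/base tiers are illustrative:

```python
def route_with_fallback(
    router_probs: dict[str, float],  # domain -> calibrated router probability
    high: float = 0.7,
    low: float = 0.4,
) -> str:
    """High confidence -> single specialist; medium -> same-domain ensemble;
    low -> plain base model (graceful fallback)."""
    domain, p = max(router_probs.items(), key=lambda kv: kv[1])
    if p >= high:
        return f"specialist:{domain}"
    if p >= low:
        return f"ensemble:{domain}"
    return "base"
```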

## Evaluation Framework

### Per-Adapter Acceptance Criteria

Every adapter submission must demonstrate:

- Domain improvement: Statistically significant accuracy gain over the base model on the target domain benchmark (minimum 50 test cases); see the significance-test sketch after this list
- No catastrophic forgetting: Out-of-domain performance within 5% of the base model
- Reproducibility: Training script, data, config, and logs all included
- Documentation: Training log following `TRAINING_LOG_TEMPLATE.md`
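
One way to operationalise the significance requirement is an exact McNemar-style test on paired per-item correctness; the choice of test is an assumption here, not project policy. A dependency-free sketch:

```python
from math import comb

def mcnemar_exact_p(base_correct: list[bool], adapter_correct: list[bool]) -> float:
    """One-sided exact McNemar test on paired per-item results: p-value for
    the adapter winning more of the disagreement items than chance would
    predict (Binomial(n, 0.5))."""
    base_only = sum(b and not a for b, a in zip(base_correct, adapter_correct))
    adapter_only = sum(a and not b for b, a in zip(base_correct, adapter_correct))
    n = base_only + adapter_only
    if n == 0:
        return 1.0  # the two models never disagree
    return sum(comb(n, k) for k in range(adapter_only, n + 1)) / 2**n
```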

### System-Level Metrics

Track these across the full system over time:

| Metric | Description | Target |
|---|---|---|
| Routing accuracy | % of queries routed to correct domain | >90% (Phase 1), >95% (Phase 2) |
| Composite score | Weighted average across all domain benchmarks | Higher than base model on every domain |
| Latency | Time to route + generate response | <5s on consumer hardware |
| Memory footprint | Peak RAM during inference | <8GB always |
| Adapter swap time | Time to switch between adapters | <500ms |
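
The composite score is the one derived metric here; a minimal sketch, assuming a weighted average with uniform weights by default (the weighting scheme itself is an open design choice):

```python
def composite_score(
    domain_scores: dict[str, float],          # domain -> benchmark score
    weights: dict[str, float] | None = None,  # defaults to uniform weighting
) -> float:
    """Weighted average over per-domain benchmark scores."""
    if weights is None:
        weights = {domain: 1.0 for domain in domain_scores}
    total = sum(weights[domain] for domain in domain_scores)
    return sum(score * weights[domain] for domain, score in domain_scores.items()) / total
```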

### Benchmark Suite

Each domain maintains its own benchmark. The system-level benchmark is the union of all domain benchmarks:

| Domain | Benchmark | Metric | Baseline (base model) |
|---|---|---|---|
| Math | GSM8K subset | Exact match accuracy | 0.41 |
| Code | HumanEval subset | pass@1 | TBD |
| Writing | Custom rubric | LLM-judge score | TBD |
| Summarisation | TBD | ROUGE-L (token overlap) | TBD |
| … | … | … | … |
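
For the exact-match metric, a GSM8K-style scorer typically compares the final number in the model output to the gold answer. The normalisation rules below are assumptions, not the harness's actual logic:

```python
import re

def final_number(text: str) -> str | None:
    """Pull the last number out of a completion, dropping thousands separators."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return numbers[-1].replace(",", "") if numbers else None

def gsm8k_exact_match(prediction: str, reference: str) -> bool:
    """Exact match on the final numeric answer."""
    predicted = final_number(prediction)
    return predicted is not None and predicted == final_number(reference)
```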

## User Stories

LocoLLM serves three user types. While not a traditional application, framing requirements as user stories clarifies what we’re building and defines acceptance criteria for testing.

### Student Developer

| ID | Story | Acceptance Criteria |
|---|---|---|
| SD-01 | As a student, I want to train a new adapter for a domain so that I can contribute a specialist to the swarm | Adapter guide exists; training script template works end-to-end; adapter can be registered in `registry.yaml` |
| SD-02 | As a student, I want to evaluate my adapter against the base model so that I can prove it improves performance | `loco eval <adapter>` runs and produces a comparison report |
| SD-03 | As a student, I want to submit my adapter via PR so that it joins the library | PR template includes checklist; CI validates registry format |
| SD-04 | As a student, I want to see how my adapter ranks against others in the same domain | `loco leaderboard` shows per-domain rankings |
| SD-05 | As a student, I want to build a better router so that query classification improves | Router is a pluggable module with a defined interface (see the sketch below); router accuracy benchmark exists |
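
SD-05 calls for a pluggable router with a defined interface. A sketch of what that contract could look like, assuming a `(domain, confidence)` return shape; the Protocol name and signature are hypothetical:

```python
from typing import Protocol

class Router(Protocol):
    """Hypothetical pluggable-router contract."""
    def route(self, query: str) -> tuple[str, float]:
        """Return (domain, confidence) for a query."""
        ...

class KeywordRouter:
    """A v1 implementation satisfying the contract (cf. the Phase 1 sketch)."""
    def __init__(self, domain_keywords: dict[str, set[str]]) -> None:
        self.domain_keywords = domain_keywords

    def route(self, query: str) -> tuple[str, float]:
        tokens = set(query.lower().split())
        scores = {d: len(kw & tokens) for d, kw in self.domain_keywords.items()}
        best = max(scores, key=scores.get)
        total = sum(scores.values())
        return (best, scores[best] / total) if total else ("base", 0.0)
```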

### End User (Researcher / Tester)

| ID | Story | Acceptance Criteria |
|---|---|---|
| EU-01 | As a user, I want to ask a question and get the best available answer without choosing an adapter manually | Router automatically selects the best adapter; response quality >= manual selection |
| EU-02 | As a user, I want to run LocoLLM on my laptop with 8GB RAM | Setup completes; inference runs without OOM; documented system requirements |
| EU-03 | As a user, I want to install a specific adapter from the library | `loco adapters install <name>` downloads and registers it |
| EU-04 | As a user, I want to see which adapter handled my query | CLI output includes adapter name and confidence (when available) |

### Project Lead / Instructor

| ID | Story | Acceptance Criteria |
|---|---|---|
| PL-01 | As the project lead, I want to compare all adapters for a domain on a common benchmark | Leaderboard is auto-generated from registry benchmark scores |
| PL-02 | As the project lead, I want to promote the best adapter to “active” for a domain | `active` flag in registry; `loco promote <adapter>` command |
| PL-03 | As the project lead, I want to track system-level performance over semesters | Results archived per semester; trend report |
| PL-04 | As the project lead, I want to set the base model for the next academic year | ADR process for base model selection; migration guide |

## Open Research Questions

Tracked in `docs/ideas.md`. Key items:

- Ensemble voting across same-domain adapters — latency vs accuracy tradeoff
- Adapter composition (stacking two LoRAs) — feasibility and interference
- Confidence-based routing with fallback chains
- Adapter retirement policy for long-term registry management
- Sub-4-bit quantization (1.58-bit BitNet) viability

## Semester Planning Template

For each semester, create a plan covering:

- Domains to target: Which new domains will students build adapters for?
- Infrastructure goals: Which system features are prioritised (router upgrade, voting, installer)?
- Research experiments: Which open questions will be investigated?
- Evaluation milestones: What benchmarks and scores are we aiming for?
- Student allocation: How many students per domain? Solo vs team?