
# LocoLLM Research Roadmap

This document outlines the research questions, milestones, and evaluation criteria for the LocoLLM project.

The central research question: can a routed swarm of adapter-trained small language models (4B parameters, 4-bit quantization) outperform a generalist model of equivalent size on real-world tasks, while running entirely on consumer hardware?

  1. Specialisation vs generalisation: At what point does a specialist adapter reliably beat the base model, and how much training data is needed?
  2. Routing accuracy: How well can a lightweight router direct queries to the right specialist, and what is the cost of misrouting?
  3. Ensemble effects: Does voting across multiple adapters for the same domain improve accuracy beyond the single best adapter?
  4. Inference-time enhancements: How much do RE2 prompting and self-consistency voting improve results when compute is free (local inference)?
  5. Scaling the swarm: How does system performance change as we add more domains (5 -> 10 -> 20 adapters)?
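Question 4 is worth grounding concretely. A minimal sketch of RE2 prompting plus self-consistency voting, assuming a hypothetical `generate(prompt)` callable that wraps local sampled inference (since inference is local, extra samples cost time but no money):

```python
from collections import Counter

def re2_prompt(question: str) -> str:
    # RE2 ("re-reading"): present the question twice before answering.
    return f"{question}\nRead the question again: {question}"

def self_consistency(generate, question: str, n_samples: int = 5) -> str:
    """Sample n answers and return the majority-vote answer.

    `generate` is a hypothetical stand-in for local sampled inference.
    """
    prompt = re2_prompt(question)
    answers = [generate(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

This is a sketch, not the project's implementation: in practice the answers would need normalisation (see the exact-match discussion under benchmarks) before voting.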

## Phase 1: Proof of Concept

Goal: Prove the concept with 3-5 adapters in clearly distinct domains.

| Milestone | Deliverable | Success Criteria |
| --- | --- | --- |
| Base model selected | ADR-0001, benchmarks | Model fits 8GB, strong baseline |
| First adapter (math) | Trained LoRA, eval results | Beats base model on GSM8K |
| Evaluation harness | `loco eval` command | Reproducible, automated benchmarks |
| Keyword router | v1 router implementation | Correct routing on 90%+ of test queries |
| 3 domain adapters | math, code, writing (minimum) | Each beats base model on domain benchmark |
| Registry v2 | Domain grouping, benchmark scores | Supports multiple adapters per domain |
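The v1 keyword router can be little more than keyword-overlap scoring. A sketch, with illustrative keyword sets (the real lists would live in project config — these are assumptions):

```python
DOMAIN_KEYWORDS = {
    "math": {"solve", "equation", "calculate", "sum", "integral"},
    "code": {"function", "python", "bug", "compile", "refactor"},
    "writing": {"essay", "rewrite", "tone", "paragraph", "draft"},
}

def route(query: str, default: str = "base") -> str:
    """Return the domain whose keyword set best overlaps the query,
    or `default` (the unadapted base model) when nothing matches."""
    tokens = set(query.lower().split())
    scores = {d: len(kw & tokens) for d, kw in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

A router this naive is exactly why the 90% target matters: it sets the baseline the Phase 2 classifier router must beat.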

## Phase 2: Scale & Compete (Semester 2, 2027)


Goal: 6-10 adapters, classifier router, first ensemble experiments.

| Milestone | Deliverable | Success Criteria |
| --- | --- | --- |
| Classifier router | ML-based routing | Handles domain overlap, >85% accuracy |
| Multi-adapter domains | 2+ adapters per domain | Leaderboard shows ranking |
| Ensemble voting | Cross-adapter voting mode | Measurable accuracy gain vs single adapter |
| Out-of-domain testing | Automated regression checks | No adapter degrades base model by >5% |
| Adapter installer | `loco adapters install` | Students can share adapters easily |
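The cross-adapter voting mode could be as simple as a plurality vote over same-domain adapters, with leaderboard rank breaking ties. A sketch (the adapter callables are hypothetical stand-ins for inference through each LoRA):

```python
from collections import Counter

def ensemble_vote(adapters, query):
    """Plurality vote across same-domain adapters.

    `adapters` is an ordered list (best-ranked first) of callables
    mapping a query to an answer string. Counter.most_common is
    stable on ties, so the highest-ranked adapter's answer wins
    when counts are equal.
    """
    answers = [adapter(query) for adapter in adapters]
    return Counter(answers).most_common(1)[0][0]
```

The latency cost is one full generation per adapter, which is the tradeoff flagged in the ideas backlog.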

## Phase 3: Research Depth (Semester 3+, 2027-2028)


Goal: Push the boundaries — confidence routing, adapter composition, sub-4-bit quantization.

| Milestone | Deliverable | Success Criteria |
| --- | --- | --- |
| Confidence routing | Router with uncertainty estimation | Graceful fallback on ambiguous queries |
| Adapter composition | Multi-adapter inference | Meaningful quality gain on cross-domain tasks |
| 1.58-bit exploration | BitNet-style quantization | Feasibility study on quality vs memory |
| Learned router | Feedback-driven routing | Improves with usage data |
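One plausible shape for confidence routing is a margin threshold over the classifier's per-domain scores, falling back to the base model when the top two domains are too close to call. A sketch under that assumption (threshold value is illustrative):

```python
def route_with_fallback(scores: dict, threshold: float = 0.2,
                        fallback: str = "base") -> str:
    """Route to the top domain only when its margin over the
    runner-up exceeds `threshold`; otherwise fall back.

    `scores` maps domain -> probability (assumed normalised).
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2 or ranked[0][1] - ranked[1][1] >= threshold:
        return ranked[0][0]
    return fallback
```

A fallback chain (specialist, then a second specialist, then base) would generalise this by returning a list rather than a single name.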

## Adapter Acceptance Criteria

Every adapter submission must demonstrate:

  1. Domain improvement: Statistically significant accuracy gain over base model on the target domain benchmark (minimum 50 test cases)
  2. No catastrophic forgetting: Out-of-domain performance within 5% of base model
  3. Reproducibility: Training script, data, config, and log all included
  4. Documentation: Training log following TRAINING_LOG_TEMPLATE.md

## System-Level Metrics

Track these metrics across the full system over time:

| Metric | Description | Target |
| --- | --- | --- |
| Routing accuracy | % of queries routed to correct domain | >90% (Phase 1), >95% (Phase 2) |
| Composite score | Weighted average across all domain benchmarks | Higher than base model on every domain |
| Latency | Time to route + generate response | <5s on consumer hardware |
| Memory footprint | Peak RAM during inference | <8GB always |
| Adapter swap time | Time to switch between adapters | <500ms |

## Domain Benchmarks

Each domain maintains its own benchmark. The system-level benchmark is the union:

| Domain | Benchmark | Metric | Baseline (base model) |
| --- | --- | --- | --- |
| Math | GSM8K subset | Exact match accuracy | 0.41 |
| Code | HumanEval subset | pass@1 | TBD |
| Writing | Custom rubric | LLM-judge score | TBD |
| Summarisation | TBD | ROUGE-L (token overlap) | TBD |
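Exact-match scoring on GSM8K-style answers usually normalises the final number before comparing, so that "1,000" and "1000" count as the same answer. A sketch of the metric (the normalisation rules shown are assumptions; the eval harness would fix the exact set):

```python
def normalise(answer: str) -> str:
    # Strip whitespace, thousands separators, and a trailing period
    # from a numeric answer string.
    return answer.strip().rstrip(".").replace(",", "")

def exact_match_accuracy(predictions, references) -> float:
    """Fraction of predictions whose normalised form matches the
    normalised reference answer."""
    hits = sum(normalise(p) == normalise(r)
               for p, r in zip(predictions, references))
    return hits / len(references)
```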

## User Stories

LocoLLM serves three user types. While not a traditional application, framing requirements as user stories clarifies what we’re building and defines acceptance criteria for testing.

### Student Developers

| ID | Story | Acceptance Criteria |
| --- | --- | --- |
| SD-01 | As a student, I want to train a new adapter for a domain so that I can contribute a specialist to the swarm | Adapter guide exists; training script template works end-to-end; adapter can be registered in registry.yaml |
| SD-02 | As a student, I want to evaluate my adapter against the base model so that I can prove it improves performance | `loco eval <adapter>` runs and produces a comparison report |
| SD-03 | As a student, I want to submit my adapter via PR so that it joins the library | PR template includes checklist; CI validates registry format |
| SD-04 | As a student, I want to see how my adapter ranks against others in the same domain | `loco leaderboard` shows per-domain rankings |
| SD-05 | As a student, I want to build a better router so that query classification improves | Router is a pluggable module with a defined interface; router accuracy benchmark exists |
### End Users

| ID | Story | Acceptance Criteria |
| --- | --- | --- |
| EU-01 | As a user, I want to ask a question and get the best available answer without choosing an adapter manually | Router automatically selects the best adapter; response quality >= manual selection |
| EU-02 | As a user, I want to run LocoLLM on my laptop with 8GB RAM | Setup completes; inference runs without OOM; documented system requirements |
| EU-03 | As a user, I want to install a specific adapter from the library | `loco adapters install <name>` downloads and registers it |
| EU-04 | As a user, I want to see which adapter handled my query | CLI output includes adapter name and confidence (when available) |
### Project Lead

| ID | Story | Acceptance Criteria |
| --- | --- | --- |
| PL-01 | As the project lead, I want to compare all adapters for a domain on a common benchmark | Leaderboard is auto-generated from registry benchmark scores |
| PL-02 | As the project lead, I want to promote the best adapter to “active” for a domain | `active` flag in registry; `loco promote <adapter>` command |
| PL-03 | As the project lead, I want to track system-level performance over semesters | Results archived per semester; trend report |
| PL-04 | As the project lead, I want to set the base model for the next academic year | ADR process for base model selection; migration guide |

## Ideas Backlog

Tracked in docs/ideas.md. Key items:

  • Ensemble voting across same-domain adapters — latency vs accuracy tradeoff
  • Adapter composition (stacking two LoRAs) — feasibility and interference
  • Confidence-based routing with fallback chains
  • Adapter retirement policy for long-term registry management
  • Sub-4-bit quantization (1.58-bit BitNet) viability
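At the weight level, stacking two LoRAs reduces to adding both low-rank updates to the frozen base weight, W' = W + s1·B1A1 + s2·B2A2. A numpy sketch of that merge, with shapes and scales purely illustrative (the interference between adapters is exactly what the feasibility study would measure):

```python
import numpy as np

def compose_loras(W, lora_as, lora_bs, scales):
    """Merge several LoRA updates into a frozen base weight W.

    lora_as[i] has shape (r_i, d_in), lora_bs[i] has shape
    (d_out, r_i); each update contributes scales[i] * B_i @ A_i.
    """
    W_eff = W.copy()
    for A, B, s in zip(lora_as, lora_bs, scales):
        W_eff += s * (B @ A)
    return W_eff
```

Libraries such as PEFT offer weighted adapter merging along these lines; whether two independently trained specialists compose without interfering is the open question.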

## Semester Planning

For each semester, create a plan covering:

  1. Domains to target: Which new domains will students build adapters for?
  2. Infrastructure goals: Which system features are prioritised (router upgrade, voting, installer)?
  3. Research experiments: Which open questions will be investigated?
  4. Evaluation milestones: What benchmarks and scores are we aiming for?
  5. Student allocation: How many students per domain? Solo vs team?