Adapter Development Guide
A practical, step-by-step guide for student teams building their first LocoLLM adapter.
Before You Start
Make sure you have:
- A LocoLLM installation (see README Quick Start)
- Access to a GPU for training (Google Colab free tier works)
- An approved adapter proposal (see CONTRIBUTING.md Step 1)
- Python 3.10+ with the LocoLLM dev dependencies installed (uv sync)
Phase 1: Understand Your Domain (Week 1-2)
Before writing any code or collecting data, spend time understanding where the base model fails in your target domain.
Baseline Testing
Section titled “Baseline Testing”Run at least 20 queries in your target domain through the base model and record the results:
loco query "Your test query here" --no-adapter --save-log baseline_tests.jsonl

For each query, note:
- Did the model get it right?
- If wrong, how was it wrong? (factual error, wrong format, missing steps, irrelevant response)
- What would a good answer look like?
This baseline testing serves two purposes: it tells you whether your adapter domain is worth pursuing (if the base model already handles it well, pick a different domain), and the patterns you observe in failures will guide your training data strategy.
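Once you have reviewed and annotated the baseline log, a short script can tally your failure categories. This is a minimal sketch: the failure_type field is something you add yourself while annotating each record, not something loco writes.

```python
import json
from collections import Counter

# Tally manually annotated failure types from the baseline log.
# Assumes you added a "failure_type" field (e.g. "correct", "wrong_format",
# "missing_steps", "factual_error") to each record while reviewing it.
counts = Counter()
with open("baseline_tests.jsonl") as f:
    for line in f:
        if line.strip():
            counts[json.loads(line).get("failure_type", "unannotated")] += 1

for failure_type, n in counts.most_common():
    print(f"{failure_type}: {n}")
```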
Failure Analysis
Categorize the failures you found. Common patterns:
| Failure Type | Training Strategy |
|---|---|
| Wrong answer format | Include many examples with correct formatting |
| Missing reasoning steps | Use chain-of-thought examples with explicit steps |
| Domain vocabulary errors | Include examples that use domain-specific terms correctly |
| Incorrect application of rules | Create examples covering each rule/formula explicitly |
| Inconsistent quality | More diverse training examples, especially edge cases |
This analysis should go in your TRAINING_LOG.md.
Phase 2: Curate Training Data (Week 2-4)
This is the most important phase. Adapter quality is mostly determined by training data quality. A well-curated dataset of 500 examples will outperform a sloppy dataset of 5000.
Data Sources
Option A: Existing public datasets. Many NLP datasets are freely available on HuggingFace Datasets. Convert them to the LocoLLM JSONL format. Good starting points:
- Math: GSM8K, MATH, AQuA-RAT
- Code: CodeAlpaca, CodeContests
- Writing: Alpaca variants filtered by topic
- Domain QA: SQuAD, Natural Questions (filtered)
Option B: Synthetic generation. Use a frontier model to generate training examples. This is practical and effective, but requires careful quality control.
Template for synthetic data generation:
Generate a {domain} training example for a language model.
The example should have:
- An instruction telling the model what to do
- An input providing the specific problem or context
- An output showing the ideal response

Requirements for the output:
- {quality criteria specific to your domain}
- {format requirements}
- {level of detail expected}

Generate one example now in this JSON format:
{"instruction": "...", "input": "...", "output": "..."}

Generate in batches, review each batch, and discard or fix low-quality examples. Document your rejection rate.
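A minimal generation loop might look like the sketch below. It uses the OpenAI Python SDK purely as an example; substitute whichever frontier-model client your team has access to, and treat the model name, prompt wording, and file path as placeholders. Automated checks only catch malformed output, so you still review every kept example by hand.

```python
import json
from openai import OpenAI  # assumption: OpenAI SDK; any frontier-model client works

client = OpenAI()

# Abridged version of the template above; fill in your domain-specific requirements.
PROMPT_TEMPLATE = """Generate a {domain} training example for a language model.
The example should have an instruction, an input, and an output.
Respond with one JSON object only: {"instruction": "...", "input": "...", "output": "..."}"""


def generate_batch(domain: str, n: int = 20) -> list[dict]:
    """Generate n candidate examples; malformed ones are rejected and counted."""
    examples, rejected = [], 0
    # .replace() rather than .format() so the literal JSON braces in the template are untouched
    prompt = PROMPT_TEMPLATE.replace("{domain}", domain)
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        text = response.choices[0].message.content
        try:
            example = json.loads(text)
        except json.JSONDecodeError:
            rejected += 1
            continue
        if not isinstance(example, dict) or set(example) != {"instruction", "input", "output"}:
            rejected += 1
            continue
        examples.append(example)
    print(f"Kept {len(examples)}/{n} (rejection rate {rejected / n:.0%}) -- record this in TRAINING_LOG.md")
    return examples


# Review each batch by hand before appending it to your dataset, e.g.:
# with open("training/dataset.jsonl", "a") as f:
#     for ex in generate_batch("arithmetic word problems"):
#         f.write(json.dumps(ex) + "\n")
```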
Option C: Expert-written examples. The highest quality but most labor-intensive. Best used for the benchmark set rather than the full training set.
Data Quality Checklist
Before finalizing your dataset, verify the following (a sketch for automating several of these checks follows the list):
- At least 500 training examples
- No duplicate or near-duplicate examples
- Outputs are consistently high quality (would you be happy receiving this response?)
- Difficulty is varied (not all easy, not all hard)
- Edge cases are represented
- Format is consistent (same JSONL schema throughout)
- Benchmark set (50+ examples) is completely separate from training set
- All sources are documented in TRAINING_LOG.md
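A minimal sketch for the checks that can be automated, assuming the file layout described in Phase 5 and run from your adapters/your-domain/ directory:

```python
import json


def load_jsonl(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


train = load_jsonl("training/dataset.jsonl")
bench = load_jsonl("eval/benchmark.jsonl")

# Size and schema consistency
assert len(train) >= 500, f"only {len(train)} training examples"
for ex in train + bench:
    assert set(ex) == {"instruction", "input", "output"}, f"unexpected keys: {sorted(ex)}"

# Exact duplicates within the training set (near-duplicates still need manual review)
keys = [(ex["instruction"], ex["input"]) for ex in train]
assert len(keys) == len(set(keys)), "duplicate training examples found"

# Contamination: benchmark items must never appear in the training set
overlap = set(keys) & {(ex["instruction"], ex["input"]) for ex in bench}
assert not overlap, f"{len(overlap)} benchmark examples leaked into the training set"

print(f"OK: {len(train)} training examples, {len(bench)} benchmark examples")
```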
JSONL Format
Section titled “JSONL Format”{"instruction": "Solve the following arithmetic word problem. Show your reasoning step by step, then give the final answer.", "input": "A bakery makes 240 muffins per day. They sell 75% of them at full price ($3 each) and donate the rest. What is their daily revenue from muffins?", "output": "Step 1: Calculate muffins sold at full price.\n240 muffins x 75% = 240 x 0.75 = 180 muffins sold\n\nStep 2: Calculate revenue.\n180 muffins x $3 = $540\n\nThe bakery's daily muffin revenue is $540."}Phase 3: Train the Adapter (Week 4-5)
Google Colab Setup
If you don’t have local GPU access, use this Colab workflow (a Drive-mount snippet follows the steps):
- Upload your dataset to Google Drive
- Open the LocoLLM training notebook (link in project repo)
- Connect to a T4 runtime (free tier)
- Run the training cells with your dataset path and domain name
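The notebook handles most of the setup. If you need to point it at your Drive manually, the standard Colab mount looks like this (the dataset path is only an example):

```python
# Standard Google Colab Drive mount; the training notebook may already
# include an equivalent cell, in which case skip this.
from google.colab import drive

drive.mount("/content/drive")

# Example path to your uploaded dataset (adjust to wherever you put it):
dataset_path = "/content/drive/MyDrive/locollm/dataset.jsonl"
```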
Local Training
Section titled “Local Training”uv run python scripts/train_adapter.py \ --adapter-name your-domain \ --lora-r 16 \ --epochs 3 \ --lr 2e-4Training a 3B model with LoRA on 500-1000 examples typically takes:
- T4 GPU (Colab free): 2-4 hours
- RTX 3060/4060: 1-2 hours
- RTX 3090/4090: 30-60 minutes
- M1/M2 Mac (MPS): 3-6 hours
Monitoring Training
Watch for these patterns in the training loss:
Healthy training: Loss decreases steadily for the first epoch, then more slowly. Final loss is noticeably lower than starting loss.
Overfitting: Loss on training data keeps decreasing but performance on held-out examples starts degrading. Usually means you need more diverse training data or fewer epochs.
Underfitting: Loss barely decreases. Usually means the learning rate is too low, or your training data format doesn’t match what the model expects.
Unstable training: Loss jumps around wildly. Reduce the learning rate.
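If you record per-epoch train and eval losses, a rough heuristic check for these patterns might look like the sketch below. The thresholds are arbitrary starting points, not part of the LocoLLM tooling.

```python
def diagnose(train_loss: list[float], eval_loss: list[float]) -> str:
    """Rough heuristics for the loss patterns described above (per-epoch values)."""
    if train_loss[-1] > 0.95 * train_loss[0]:
        return "underfitting: loss barely decreased -- raise the LR or check the data format"
    if any(later > 1.2 * earlier for earlier, later in zip(train_loss, train_loss[1:])):
        return "unstable: training loss jumps upward -- reduce the learning rate"
    if eval_loss[-1] > 1.05 * min(eval_loss):
        return "overfitting: eval loss is rising -- fewer epochs or more diverse data"
    return "healthy: loss decreased steadily"


# Example with made-up numbers (eval loss turns upward in the last epoch):
print(diagnose([2.1, 1.3, 1.0, 0.9], [1.9, 1.4, 1.3, 1.5]))  # -> overfitting
```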
Common Pitfalls
Section titled “Common Pitfalls”Data format mismatch. Make sure your JSONL format matches the chat template expected by the base model. The training script handles this conversion, but check that instruction, input, and output fields are correctly populated.
Too few epochs on small data. With only 500 examples, the model might need 3-5 passes (epochs) to learn the patterns. With 5000+ examples, 1-2 epochs is often enough.
LoRA rank too low. If your domain requires the model to learn many new patterns, rank 8 might not have enough capacity. Try rank 32 if rank 16 doesn’t show improvement. The adapter will be larger but still small.
Phase 4: Evaluate (Week 5-6)
Section titled “Phase 4: Evaluate (Week 5-6)”Running the Evaluation
# Your adapter vs the benchmark
uv run python scripts/evaluate.py \
  --adapter adapters/your-domain/ \
  --benchmark adapters/your-domain/eval/benchmark.jsonl \
  --output adapters/your-domain/eval/results.json

# Base model vs the same benchmark
uv run python scripts/evaluate.py \
  --benchmark adapters/your-domain/eval/benchmark.jsonl \
  --output adapters/your-domain/eval/base_results.json

Interpreting Results
Strong result (ship it): Adapter scores 15+ percentage points above base model, minimal out-of-domain degradation.
Moderate result (iterate): Adapter scores 5-15 points above base model. Consider whether the improvement is consistent across difficulty levels. If the adapter only helps on easy questions, the training data may lack challenging examples.
Weak result (rethink): Adapter scores less than 5 points above base model. Possible causes:
- Training data quality issues (most common)
- Domain too broad (narrow your scope)
- Base model already handles this domain reasonably well (pick a different domain)
- Not enough training data
- Hyperparameters need tuning
Negative result (don’t ship): Adapter scores below the base model. Something went wrong in training. Check for data contamination (benchmark examples in training set), format issues, or corrupted training data.
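To see which bucket you land in, compare the two result files. The "score" field name below is an assumption; check what scripts/evaluate.py actually writes before relying on this sketch.

```python
import json

# Assumption: both result files contain a top-level "score" between 0 and 1.
with open("adapters/your-domain/eval/results.json") as f:
    adapter_score = json.load(f)["score"]
with open("adapters/your-domain/eval/base_results.json") as f:
    base_score = json.load(f)["score"]

points = (adapter_score - base_score) * 100                   # absolute gap, in percentage points
relative = (adapter_score - base_score) / base_score * 100    # relative gain, as used in the registry
print(f"Adapter {adapter_score:.2f} vs base {base_score:.2f}: "
      f"{points:+.1f} points ({relative:+.1f}% relative)")
```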
Iteration Cycle
If your first training run doesn’t pass the evaluation bar:
- Analyze which specific benchmark examples the adapter gets wrong
- Look for patterns (e.g., all multi-step problems fail, or a specific type of formatting is wrong)
- Add targeted training examples that address these failure patterns
- Retrain and re-evaluate
- Document each iteration in your TRAINING_LOG.md
Most successful adapters go through 2-3 training iterations.
Phase 5: Package and Submit (Week 6)
Directory Structure
Your final adapter directory should look like this:
adapters/your-domain/
├── adapter_config.json          # Generated by training script
├── adapter_model.safetensors    # Trained weights
├── training/
│   ├── dataset.jsonl            # Full training dataset
│   ├── train_config.yaml        # Exact config used for final run
│   ├── generate_data.py         # Data generation script (if synthetic)
│   └── TRAINING_LOG.md          # Complete development log
└── eval/
    ├── benchmark.jsonl          # 50+ held-out test cases
    ├── results.json             # Adapter evaluation results
    ├── base_results.json        # Base model evaluation results
    └── ood_results.json         # Out-of-domain check results

Registry Entry
Add your adapter to adapters/registry.yaml:
your-domain:
  version: "1.0.0"
  base_model: "Qwen/Qwen2.5-3B-Instruct"
  quantization: "Q4_K_M"
  lora_rank: 16
  created: "2026-S2"
  authors: ["Your Team Name, ISYS6020"]
  benchmark_score: 0.72
  base_model_score: 0.41
  improvement: "+75.6%"
  description: "One-line description of what this adapter does"
  tags: ["relevant", "keywords"]
  size_mb: 24
  router_keywords: ["trigger", "words", "for", "keyword", "router"]
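The improvement field is the relative gain over the base model score, not the percentage-point difference. A quick consistency check, assuming PyYAML is available in your dev environment:

```python
import yaml  # assumption: PyYAML is installed

with open("adapters/registry.yaml") as f:
    entry = yaml.safe_load(f)["your-domain"]

gain = (entry["benchmark_score"] - entry["base_model_score"]) / entry["base_model_score"] * 100
print(f"Computed improvement: {gain:+.1f}%  (registry says {entry['improvement']})")
# With the example scores above: +75.6%, matching the "improvement" field.
```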
Submit Your PR
Open a pull request against the main branch. Use the PR template, which includes the full checklist from CONTRIBUTING.md.
Tips from Previous Cohorts
This section will grow as cohorts complete their projects.
- Start data curation early. It always takes longer than you expect.
- Test the base model extensively before training. Understanding its failures is more valuable than understanding its successes.
- Quality of training outputs matters enormously. Every output in your dataset is teaching the model what “good” looks like for your domain.
- Don’t over-scope. “Math reasoning” is a semester project. “All of mathematics” is a career.
- Document as you go, not at the end. Your TRAINING_LOG.md is both a deliverable and a gift to next semester’s team.