RE2 (Re-Reading) on Qwen3-4B
Status: planned
Claim under test
RE2 (Re-Reading), proposed by Xu et al. 2023, repeats the question inside the prompt, joined by a re-read instruction:
[Question]. Read the question again: [Question]. Now answer.
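As a minimal sketch, the template above can be produced by a small helper. The function name `build_re2_prompt` and the punctuation handling are assumptions made for illustration; the template itself only fixes the wording around the two copies of the question.

```python
def build_re2_prompt(question: str) -> str:
    """Wrap a question in the RE2 template: question, re-read cue, question, answer cue."""
    q = question.strip()
    # Assumption: add a full stop only when the question has no terminal
    # punctuation, so "What is 2 + 2?" does not become "What is 2 + 2?.".
    if not q.endswith((".", "?", "!")):
        q += "."
    return f"{q} Read the question again: {q} Now answer."


if __name__ == "__main__":
    print(build_re2_prompt("What is 2 + 2?"))
```

The baseline condition would pass the stripped question through unchanged, so the two prompts differ only by the repeated question and the two cue phrases.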
The published claim is that RE2 reduces errors caused by the model misreading or partially processing the input on the first pass, particularly when the model’s first-pass attention misses important constraints.
Hypothesis on small models
We expect RE2 to help Qwen3-4B more than it helps frontier models, for three reasons:
- Smaller attention budgets — fewer attention heads and layers; more likely to miss content on a single pass.
- Tighter effective context — re-reading is cheap when the question is short relative to the model’s context window.
- Asymmetric cost — the technique cost (extra input tokens) is borne by both small and large models equally; the gain on small models should be larger.
We expect the effect to be largest on:
- Multi-constraint questions (e.g., “summarise X, but only the parts relevant to Y, in fewer than Z words”)
- Word problems (maths, logic) where attention to detail matters
- Negation and edge cases (“which of these is not true…?”)
We expect the effect to be smallest or absent on:
- Free-form generation (creative writing, brainstorming)
- Tasks where the answer is dictated by the first content words, regardless of subsequent constraints
Methodology
(Draft, to be refined before running.)
- Base model: Qwen3-4B-Instruct at Q4_K_M (LocoLLM standard, per ADR-0001 and ADR-0006)
- Comparison run: Llama 3.1 8B Instruct at Q4_K_M, on a sub-sample, to test the asymmetric-effect hypothesis
- Task suite: TBD. Candidates:
  - A slice of MMLU (multi-choice with negation/constraint variants)
  - A slice of GSM8K (multi-step word problems)
  - A small constructed multi-constraint extraction set (closer to LocoLLM real use)
- Sample count: sized for a 95% CI on accuracy difference of ±2 percentage points (~600 items per condition, depending on baseline accuracy)
- Sampling: temperature=0 for the primary run; an N=5 rerun at temperature=0.3 to characterise variance
- Conditions:
  - Baseline: plain prompt
  - RE2: the baseline question followed by “Read the question again:”, the question repeated, and the “Now answer” suffix
- Pass criterion per item: task-specific (exact match for multi-choice; numeric match for GSM8K; rubric-graded by a frozen judge model for the constructed set, with judge variance characterised separately)
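The sample count depends on assumptions the draft does not yet pin down, so a rough calculation may help when refining it. The sketch below uses the normal approximation for a 95% interval on a difference in accuracy. The function names are hypothetical, and the paired variant assumes the same items are scored under both conditions, which the draft does not state.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal


def items_unpaired(baseline_acc: float, half_width: float, z: float = Z_95) -> int:
    """Items per condition when the two conditions use independent samples.

    Assumes both accuracies are close to baseline_acc, so the variance of
    the difference is about 2 * p * (1 - p) / n.
    """
    p = baseline_acc
    return math.ceil(2 * p * (1 - p) * (z / half_width) ** 2)


def items_paired(discordant_rate: float, half_width: float, z: float = Z_95) -> int:
    """Items when every item is scored under both conditions.

    The per-item difference is -1, 0 or +1, so its variance is at most the
    share of items where the two conditions disagree (a conservative bound).
    """
    return math.ceil(discordant_rate * (z / half_width) ** 2)


if __name__ == "__main__":
    for acc in (0.5, 0.7, 0.9):
        print(f"unpaired, baseline {acc:.0%}:", items_unpaired(acc, 0.02))
    for rate in (0.05, 0.10, 0.20):
        print(f"paired, discordant {rate:.0%}:", items_paired(rate, 0.02))
```

Under these approximations, a ±2 point interval from independent samples needs well over a thousand items per condition even at a 90% baseline. A figure near 600 is reachable in the paired design only when roughly 6% or fewer of the items change outcome between conditions. If the discordant rate turns out higher, either the item count or the target interval width would need revisiting.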
Results
Pending. This page will be updated when the study runs.
Limitations (declared in advance)
- Single base model in the primary condition. The effect on Llama 3.2 3B, Phi-3 Mini, or other 3-4B models may differ.
- Quantisation level fixed at Q4_K_M. Effect may differ at Q8 or F16; not tested in this study.
- A judge-model pass criterion introduces a second source of variance and is avoided where possible. Where used, judge variance is reported alongside the primary effect.
- “Reasoning” is a broad category. Results on the chosen task suite do not generalise automatically to all reasoning tasks.
- No interaction effects with adapters. This study uses the base model only. Whether RE2 stacks with a math adapter, code adapter, or routing setup is a separate study.
Invalidation condition
If RE2 produces an accuracy improvement of less than 1 percentage point on the chosen task suite, with overlapping 95% confidence intervals between baseline and RE2, the claim that RE2 reliably helps Qwen3-4B is not supported by this evidence.
If the effect is positive on Qwen3-4B but equal-or-larger on Llama 3.1 8B in the comparison run, the asymmetric-effect hypothesis is not supported (the technique helps, but not preferentially on small models).
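The two conditions above can be written as explicit checks so the decision rule is fixed before any data arrives. This is a sketch under assumptions: the Wilson score interval is one reasonable choice for a 95% interval on accuracy (the study does not name an interval method), and every function name here is hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal


def wilson_interval(correct: int, total: int, z: float = Z_95) -> tuple[float, float]:
    """Wilson score interval for a proportion (assumed interval method)."""
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def re2_claim_unsupported(base_correct: int, re2_correct: int, total: int,
                          min_gain: float = 0.01) -> bool:
    """True when the gain is under 1 point AND the two 95% intervals overlap."""
    gain = (re2_correct - base_correct) / total
    b_lo, b_hi = wilson_interval(base_correct, total)
    r_lo, r_hi = wilson_interval(re2_correct, total)
    intervals_overlap = b_lo <= r_hi and r_lo <= b_hi
    return gain < min_gain and intervals_overlap


def asymmetry_unsupported(gain_small: float, gain_large: float) -> bool:
    """True when RE2 helps the small model but helps the larger one at least as much."""
    return gain_small > 0 and gain_large >= gain_small
```

Both functions take raw counts or gains as inputs and return a boolean, so they can be run unchanged on the primary run and on the comparison sub-sample once results exist.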
Practitioner takeaway (provisional)
Until results land, the Small Model Strategies summary stands: RE2 is cheap, plausibly helps, and is worth using on questions with multiple constraints. This study will refine that summary with measurement.
Cross-references
- Small Model Strategies §2: Prompting Strategies
- Scaffolding Studies index
- RE2 (general-audience version): ai-toolkit/downloads/re2-prompting.qmd in the AI Skills Passport