Skip to content

RE2 (Re-Reading) on Qwen3-4B

Status: planned

RE2 (Re-Reading), proposed by Xu et al. 2023, prepends a re-read instruction to the prompt:

[Question]. Read the question again: [Question]. Now answer.

The published claim is that RE2 reduces errors caused by the model misreading or partially processing the input on the first pass, particularly when the model’s first-pass attention misses important constraints.

We expect RE2 to help Qwen3-4B more than it helps frontier models, for three reasons:

  • Smaller attention budgets — fewer attention heads and layers; more likely to miss content on a single pass.
  • Tighter effective context — re-reading is cheap when the question is short relative to the model’s context window.
  • Asymmetric cost — the technique cost (extra input tokens) is borne by both small and large models equally; the gain on small models should be larger.

We expect the effect to be largest on:

  • Multi-constraint questions (e.g., “summarise X, but only the parts relevant to Y, in fewer than Z words”)
  • Word problems (maths, logic) where attention to detail matters
  • Negation and edge cases (“which of these is not true…?”)

We expect the effect to be smallest or absent on:

  • Free-form generation (creative writing, brainstorming)
  • Tasks where the answer is dictated by the first content words, regardless of subsequent constraints

(Draft — to be refined before running.)

  • Base model: Qwen3-4B-Instruct at Q4_K_M (LocoLLM standard, per ADR-0001 and ADR-0006)
  • Comparison run: Llama 3.1 8B Instruct at Q4_K_M, on a sub-sample, to test the asymmetric-effect hypothesis
  • Task suite: TBD. Candidates:
    • A slice of MMLU (multi-choice with negation/constraint variants)
    • A slice of GSM8K (multi-step word problems)
    • A small constructed multi-constraint extraction set (closer to LocoLLM real use)
  • Sample count: sized for a 95% CI on accuracy difference of ±2 percentage points (~600 items per condition, depending on baseline accuracy)
  • Sampling: temperature=0 for the primary run; an N=5 rerun at temperature=0.3 to characterise variance
  • Conditions:
    • Baseline — plain prompt
    • RE2 — baseline prompt with the re-read prefix and “Now answer” suffix
  • Pass criterion per item: task-specific (exact match for multi-choice; numeric match for GSM8K; rubric-graded by a frozen judge model for the constructed set, with judge variance characterised separately)

Pending. This page will be updated when the study runs.

  • Single base model in the primary condition. The effect on Llama 3.2 3B, Phi-3 Mini, or other 3-4B models may differ.
  • Quantisation level fixed at Q4_K_M. Effect may differ at Q8 or F16; not tested in this study.
  • A judge-model pass criterion introduces a second source of variance and is avoided where possible. Where used, judge variance is reported alongside the primary effect.
  • “Reasoning” is a broad category. Results on the chosen task suite do not generalise automatically to all reasoning tasks.
  • No interaction effects with adapters. This study uses the base model only. Whether RE2 stacks with a math adapter, code adapter, or routing setup is a separate study.

If RE2 produces an accuracy improvement of less than 1 percentage point on the chosen task suite, with overlapping 95% confidence intervals between baseline and RE2, the claim that RE2 reliably helps Qwen3-4B is not supported by this evidence.

If the effect is positive on Qwen3-4B but equal-or-larger on Llama 3.1 8B in the comparison run, the asymmetric-effect hypothesis is not supported (the technique helps, but not preferentially on small models).

Until results land, the Small Model Strategies summary stands: RE2 is cheap, plausibly helps, and is worth using on questions with multiple constraints. This study will refine that summary with measurement.