Benchmarks

Comprehensive benchmark analysis comparing our distilled Qwen3-4B models against the base model. All tests were run with 4-bit quantization using the lm-eval harness.

Key Insights

  • Highest average: Gemini 2.5 Flash, 54.0% average across all benchmarks (+1.9 vs base)
  • Most consistent: Gemini 2.5 Flash, beats the base model on 5 of 6 benchmarks (+11.4 points total)
  • Best GPQA: GPT-5 Codex, 43.9% on graduate-level science (+13.6 vs base)

Performance Comparison

Average accuracy across all 6 benchmarks. Higher is better.

Base Model              52.1%
#1  Gemini 2.5 Flash    54.0%  (+1.9)
#2  GPT-5 Codex         53.1%  (+1.0)
#3  Kimi K2             51.6%  (-0.5)
#4  GLM 4.6             51.5%  (-0.6)
#5  Gemini 2.5 Pro      51.3%  (-0.8)
#6  Claude 4.5 Opus     51.2%  (-0.9)
#7  Command A           51.2%  (-0.9)
#8  Gemini 3 Pro        50.8%  (-1.3)
#9  GPT-5.1             50.6%  (-1.5)

Full Results

All scores are accuracy percentages. The Avg column is the unweighted mean of the six benchmarks; deltas in the ranking above are relative to the base model.

Model               ARC    GPQA   HellaSwag  MMLU   TruthfulQA  WinoGrande  Avg
Base (Qwen3-4B)     48.6   30.3   48.0       65.5   55.6        64.6        52.1
Gemini 2.5 Flash    51.2   35.4   50.4       66.2   55.3        65.6        54.0
Claude 4.5 Opus     48.1   31.3   49.6       63.4   52.6        62.1        51.2
Gemini 2.5 Pro      48.5   30.8   48.5       64.3   54.4        61.2        51.3
GPT-5 Codex         45.9   43.9   47.7       62.5   57.0        61.3        53.1
Kimi K2             45.8   37.4   49.2       62.0   52.5        62.7        51.6
GLM 4.6             48.7   32.3   48.3       64.3   53.1        62.2        51.5
GPT-5.1             47.8   29.8   48.0       63.6   55.7        58.6        50.6
Command A           45.8   31.8   48.8       63.5   54.9        62.2        51.2
Gemini 3 Pro        46.5   34.9   48.2       62.4   50.5        62.3        50.8
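
The Avg column is just the unweighted mean of the six benchmark accuracies, and the deltas in the ranking above follow from it. Below is a minimal sketch that recomputes both for a few rows, with scores copied from the table; last-digit differences are possible because the table is rounded to one decimal.

```python
# Recompute each model's average and its delta vs. the base model
# from the per-benchmark accuracies in the table above.
# Column order: ARC, GPQA, HellaSwag, MMLU, TruthfulQA, WinoGrande.
scores = {
    "Base (Qwen3-4B)":  [48.6, 30.3, 48.0, 65.5, 55.6, 64.6],
    "Gemini 2.5 Flash": [51.2, 35.4, 50.4, 66.2, 55.3, 65.6],
    "Claude 4.5 Opus":  [48.1, 31.3, 49.6, 63.4, 52.6, 62.1],
}

base_avg = sum(scores["Base (Qwen3-4B)"]) / 6  # 52.1

for model, accs in scores.items():
    avg = sum(accs) / 6      # unweighted mean across the six benchmarks
    delta = avg - base_avg   # +1.9 for Gemini 2.5 Flash, -0.9 for Claude 4.5 Opus
    print(f"{model:18s}  avg={avg:.1f}  delta={delta:+.1f}")
```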

Methodology

Test Configuration

  • Quantization: 4-bit (matching typical deployment)
  • Framework: lm-eval harness (see the sketch at the end of this section)
  • Temperature: 0.6
  • Top-p: 0.95

Benchmarks Used

  • ARC-Challenge: Science reasoning
  • GPQA Diamond: Graduate-level science
  • HellaSwag: Commonsense reasoning
  • MMLU: Multi-task language understanding
  • TruthfulQA: Truthfulness evaluation
  • WinoGrande: Pronoun resolution
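
For reproduction, the configuration above maps onto the lm-eval harness roughly as in the sketch below, using its Python API (lm_eval.simple_evaluate). The checkpoint path is hypothetical, and the exact GPQA Diamond task name and 4-bit loading argument can differ across lm-eval versions, so treat this as an assumption-laden starting point rather than the exact command used.

```python
# Minimal sketch of the evaluation setup described above, via the
# lm-eval harness Python API.
# NOTE: the model path, the GPQA task name, and the 4-bit flag are
# assumptions -- verify them against your lm-eval version before use.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # Hypothetical checkpoint path; 4-bit loading via bitsandbytes.
    model_args="pretrained=./distilled-qwen3-4b,load_in_4bit=True",
    tasks=[
        "arc_challenge",          # ARC-Challenge
        "gpqa_diamond_zeroshot",  # GPQA Diamond (task name may vary by version)
        "hellaswag",              # HellaSwag
        "mmlu",                   # MMLU
        "truthfulqa_mc2",         # TruthfulQA
        "winogrande",             # WinoGrande
    ],
    batch_size="auto",
    # Sampling settings from the test configuration above; these only
    # affect generative tasks, multiple-choice tasks score log-likelihoods.
    gen_kwargs="temperature=0.6,top_p=0.95",
)

for task, metrics in results["results"].items():
    print(task, metrics)
```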