Benchmarks
Comprehensive benchmark analysis comparing our distilled Qwen3-4B models against the base model. All tests were run with 4-bit quantization using the lm-eval harness.
Key Insights
- Highest average: Gemini 2.5 Flash, at 54.0% across all benchmarks (+1.9 vs. base)
- Most consistent: Gemini 2.5 Flash, beating the base model on 5 of 6 benchmarks (+11.4 total)
- Best GPQA: GPT-5 Codex, at 43.9% on graduate-level science (+13.6 vs. base)
Performance Comparison
Average accuracy across all 6 benchmarks. Higher is better.
- Base Model (Qwen3-4B): 52.1%
- #1 Gemini 2.5 Flash: 54.0% (+1.9)
- #2 GPT-5 Codex: 53.1% (+1.0)
- #3 Kimi K2: 51.6% (-0.5)
- #4 GLM 4.6: 51.5% (-0.6)
- #5 Gemini 2.5 Pro: 51.3% (-0.8)
- #6 Claude 4.5 Opus: 51.2% (-0.9)
- #7 Command A: 51.2% (-0.9)
- #8 Gemini 3 Pro: 50.8% (-1.3)
- #9 GPT-5.1: 50.6% (-1.5)
Full Results
All scores are accuracy percentages. The Avg column is the mean of the six benchmark scores; a positive delta in the ranking above indicates an improvement over the base model and a negative delta a regression. A short script recomputing these values follows the table.
| Model | ARC | GPQA | HellaSwag | MMLU | TruthfulQA | WinoGrande | Avg |
|---|---|---|---|---|---|---|---|
| Base (Qwen3-4B) | 48.6 | 30.3 | 48.0 | 65.5 | 55.6 | 64.6 | 52.1 |
| Gemini 2.5 Flash | 51.2 | 35.4 | 50.4 | 66.2 | 55.3 | 65.6 | 54.0 |
| Claude 4.5 Opus | 48.1 | 31.3 | 49.6 | 63.4 | 52.6 | 62.1 | 51.2 |
| Gemini 2.5 Pro | 48.5 | 30.8 | 48.5 | 64.3 | 54.4 | 61.2 | 51.3 |
| GPT-5 Codex | 45.9 | 43.9 | 47.7 | 62.5 | 57.0 | 61.3 | 53.1 |
| Kimi K2 | 45.8 | 37.4 | 49.2 | 62.0 | 52.5 | 62.7 | 51.6 |
| GLM 4.6 | 48.7 | 32.3 | 48.3 | 64.3 | 53.1 | 62.2 | 51.5 |
| GPT-5.1 | 47.8 | 29.8 | 48.0 | 63.6 | 55.7 | 58.6 | 50.6 |
| Command A | 45.8 | 31.8 | 48.8 | 63.5 | 54.9 | 62.2 | 51.2 |
| Gemini 3 Pro | 46.5 | 34.9 | 48.2 | 62.4 | 50.5 | 62.3 | 50.8 |
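The Avg column and the deltas in the ranking above are plain arithmetic over the six per-benchmark scores. The minimal sketch below recomputes them from the table values; only three rows are shown for brevity, and the remaining rows follow the same pattern.

```python
# Recompute the Avg column and the delta vs. the base model from the table above.
# Scores are copied from the table; printed values use two decimals, while the
# table rounds to one, so the last digit can differ slightly.
SCORES = {
    "Base (Qwen3-4B)":  [48.6, 30.3, 48.0, 65.5, 55.6, 64.6],
    "Gemini 2.5 Flash": [51.2, 35.4, 50.4, 66.2, 55.3, 65.6],
    "GPT-5 Codex":      [45.9, 43.9, 47.7, 62.5, 57.0, 61.3],
    # ...remaining rows follow the same pattern.
}

base_avg = sum(SCORES["Base (Qwen3-4B)"]) / len(SCORES["Base (Qwen3-4B)"])

for name, scores in SCORES.items():
    avg = sum(scores) / len(scores)
    print(f"{name}: avg={avg:.2f}, delta={avg - base_avg:+.2f}")
```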
Methodology
Test Configuration
- Quantization: 4-bit, matching typical deployment (see the loading sketch after this list)
- Framework: lm-eval harness (an example invocation is sketched at the end of this section)
- Temperature: 0.6
- Top-p: 0.95
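The distilled checkpoints are not named in this section, so the sketch below uses the base `Qwen/Qwen3-4B` Hugging Face repo as a placeholder and shows one common way to load such a model in 4-bit with transformers and bitsandbytes. The NF4/bfloat16 settings are assumptions, not necessarily the exact quantization configuration used for these runs.

```python
# Hedged sketch: load a Qwen3-4B checkpoint in 4-bit via bitsandbytes.
# "Qwen/Qwen3-4B" is a placeholder; substitute the distilled checkpoint under test.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # assumed quant type; not stated above
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B",
    quantization_config=quant_config,
    device_map="auto",
)
```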
Benchmarks Used
- ARC-Challenge: Science reasoning
- GPQA Diamond: Graduate-level science
- HellaSwag: Commonsense reasoning
- MMLU: Multi-task language understanding
- TruthfulQA: Truthfulness evaluation
- WinoGrande: Pronoun resolution
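As referenced in the test configuration above, here is a minimal sketch of how a comparable run could be launched through the lm-eval Python API. The task variants (e.g. `gpqa_diamond_zeroshot`, `truthfulqa_mc2`) and the checkpoint name are assumptions; task names differ between harness versions, so verify them against your installed release.

```python
# Hedged sketch of an lm-eval run approximating the configuration above.
# Task names are assumptions; check the task list for your installed version.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen3-4B,load_in_4bit=True",  # placeholder checkpoint
    tasks=[
        "arc_challenge",          # ARC-Challenge
        "gpqa_diamond_zeroshot",  # GPQA Diamond (variant assumed)
        "hellaswag",
        "mmlu",
        "truthfulqa_mc2",         # TruthfulQA (mc2 variant assumed)
        "winogrande",
    ],
    batch_size="auto",
    gen_kwargs="temperature=0.6,top_p=0.95",  # only affects generative tasks
)

for task, metrics in results["results"].items():
    print(task, metrics)
```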