Universal benchmark for evaluating AI agents.
Oracle-Normalized Score (ONS)
Category ONS Breakdown
| Rank | Agent | Type | Modality | Harness | Score | 95% CI | Open | Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Oracle Agent | other | – | – | | 0.811–0.969 | No | 2026-03-17 |
| 2 | GPT-5 mini | llm | ascii | MarkovianReasoner | | 0.000–0.000 | No | 2026-03-20 |
| 3 | PPO Dense (2M) | rl | rgb_array | – | | 0.212–0.367 | Yes | 2026-03-22 |
| 4 | Qwen3.5-4B | llm | ascii | markovian_reasoner | | 0.161–0.292 | Yes | 2026-03-25 |
| 5 | PPO Dense (500k) | rl | rgb_array | – | | 0.166–0.287 | Yes | 2026-03-20 |
| 6 | Gemini 2.5 Flash Lite | llm | ascii | markovian_reasoner | | 0.126–0.247 | No | 2026-03-17 |
| 7 | Qwen3.5-4B | llm | language | markovian_reasoner | | 0.085–0.272 | Yes | 2026-03-25 |
| 8 | Qwen3.5-2B | llm | ascii | markovian_reasoner | | 0.069–0.195 | Yes | 2026-03-25 |
| 9 | Qwen3.5-2B | llm | language | markovian_reasoner | | 0.056–0.183 | Yes | 2026-03-25 |
| 10 | Qwen3.5-0.8B | llm | ascii | markovian_reasoner | | 0.048–0.140 | Yes | 2026-03-22 |
| 11 | Qwen3-4B | llm | ascii | markovian_reasoner | | 0.000–0.000 | Yes | 2026-03-21 |
| 12 | Random Agent | other | – | – | | 0.031–0.130 | No | 2026-03-17 |
| 13 | PPO Sparse (500k) | rl | rgb_array | – | | 0.051–0.097 | Yes | 2026-03-20 |
| 14 | Gemini 2.5 Flash Lite | llm | language | markovian_zero_shot | | 0.025–0.102 | No | 2026-03-17 |
| 15 | Qwen3.5-2B | llm | ascii | markovian_zero_shot | | 0.032–0.093 | Yes | 2026-03-22 |
| 16 | Qwen3.5-0.8B | llm | language | markovian_reasoner | | 0.021–0.100 | Yes | 2026-03-22 |
| 17 | Gemini 2.5 Flash Lite | llm | ascii | markovian_zero_shot | | 0.022–0.080 | No | 2026-03-17 |
| 18 | Qwen3-4B | llm | language | markovian_reasoner | | 0.000–0.000 | Yes | 2026-03-21 |
| 19 | Qwen3.5-2B | llm | language | markovian_zero_shot | | 0.012–0.050 | Yes | 2026-03-22 |
| 20 | Qwen3.5-4B | llm | ascii | markovian_zero_shot | | 0.007–0.042 | Yes | 2026-03-22 |
| 21 | Qwen3.5-4B | llm | language | markovian_zero_shot | | 0.006–0.038 | Yes | 2026-03-22 |
| 22 | Qwen3.5-0.8B | llm | ascii | markovian_zero_shot | | 0.006–0.035 | Yes | 2026-03-25 |
| 23 | Qwen3-4B | llm | ascii | markovian_zero_shot | | 0.000–0.000 | Yes | 2026-03-21 |
| 24 | Qwen3-4B | llm | language | markovian_zero_shot | | 0.000–0.000 | Yes | 2026-03-21 |
| 25 | Qwen3.5-0.8B | llm | language | markovian_zero_shot | | 0.002–0.035 | Yes | 2026-03-22 |
DistributionShift-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 48% |
| GPT-5 mini | llm | 8% | 4% | 4% | 0% |
| PPO Dense (2M) | rl | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| PPO Dense (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Random Agent | other | 0% | 4% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 4% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
Reach all 3 goals across shifting maze phases.
FewShotAdaptation-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 92% | 80% | 72% | 80% |
| GPT-5 mini | llm | 56% | 44% | 28% | 44% |
| PPO Dense (2M) | rl | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 36% | 44% | 28% | 24% |
| PPO Dense (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 28% | 44% | 32% | 24% |
| Qwen3.5-4B | llm | 56% | 28% | 32% | 28% |
| Qwen3.5-2B | llm | 28% | 28% | 0% | 0% |
| Qwen3.5-2B | llm | 8% | 32% | 24% | 0% |
| Qwen3.5-0.8B | llm | 24% | 24% | 24% | 0% |
| Qwen3-4B | llm | 12% | 24% | 8% | 12% |
| Random Agent | other | 28% | 20% | 12% | 12% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 12% | 12% | 8% | 12% |
| Qwen3.5-2B | llm | 24% | 12% | 16% | 0% |
| Qwen3.5-0.8B | llm | 12% | 32% | 8% | 16% |
| Gemini 2.5 Flash Lite | llm | 20% | 16% | 20% | 0% |
| Qwen3-4B | llm | 16% | 24% | 8% | 24% |
| Qwen3.5-2B | llm | 24% | 4% | 12% | 4% |
| Qwen3.5-4B | llm | 0% | 4% | 4% | 0% |
| Qwen3.5-4B | llm | 4% | 4% | 0% | 0% |
| Qwen3.5-0.8B | llm | 16% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 8% | 0% | 4% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
Watch demo trials to infer the hidden rule, then navigate to the correct candidate object in the test trial.
NoisyObservation-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 80% | 84% | 68% |
| GPT-5 mini | llm | 100% | 84% | 84% | 68% |
| PPO Dense (2M) | rl | 100% | 56% | 24% | 16% |
| Qwen3.5-4B | llm | 88% | 68% | 60% | 44% |
| PPO Dense (500k) | rl | 96% | 28% | 8% | 24% |
| Gemini 2.5 Flash Lite | llm | 88% | 68% | 32% | 28% |
| Qwen3.5-4B | llm | 92% | 72% | 68% | 32% |
| Qwen3.5-2B | llm | 60% | 36% | 8% | 0% |
| Qwen3.5-2B | llm | 76% | 40% | 20% | 4% |
| Qwen3.5-0.8B | llm | 48% | 36% | 8% | 8% |
| Qwen3-4B | llm | 56% | 32% | 4% | 12% |
| Random Agent | other | 68% | 28% | 8% | 0% |
| PPO Sparse (500k) | rl | 28% | 4% | 12% | 4% |
| Gemini 2.5 Flash Lite | llm | 60% | 28% | 8% | 8% |
| Qwen3.5-2B | llm | 48% | 28% | 8% | 8% |
| Qwen3.5-0.8B | llm | 44% | 32% | 4% | 8% |
| Gemini 2.5 Flash Lite | llm | 20% | 8% | 8% | 8% |
| Qwen3-4B | llm | 44% | 16% | 12% | 0% |
| Qwen3.5-2B | llm | 0% | 24% | 4% | 8% |
| Qwen3.5-4B | llm | 12% | 4% | 0% | 0% |
| Qwen3.5-4B | llm | 12% | 4% | 4% | 4% |
| Qwen3.5-0.8B | llm | 36% | 0% | 0% | 0% |
| Qwen3-4B | llm | 12% | 4% | 0% | 4% |
| Qwen3-4B | llm | 16% | 8% | 0% | 0% |
| Qwen3.5-0.8B | llm | 20% | 20% | 0% | 0% |
Locate and reach the true GOAL amid visual noise.
DelayedGratification-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 100% |
| GPT-5 mini | llm | 100% | 100% | 56% | 100% |
| PPO Dense (2M) | rl | 100% | 88% | 20% | 0% |
| Qwen3.5-4B | llm | 84% | 36% | 20% | 72% |
| PPO Dense (500k) | rl | 100% | 28% | 0% | 100% |
| Gemini 2.5 Flash Lite | llm | 88% | 60% | 24% | 64% |
| Qwen3.5-4B | llm | 64% | 24% | 0% | 52% |
| Qwen3.5-2B | llm | 60% | 52% | 20% | 36% |
| Qwen3.5-2B | llm | 36% | 8% | 0% | 20% |
| Qwen3.5-0.8B | llm | 16% | 16% | 0% | 16% |
| Qwen3-4B | llm | 56% | 28% | 8% | 32% |
| Random Agent | other | 8% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 48% | 16% | 0% | 8% |
| Gemini 2.5 Flash Lite | llm | 20% | 0% | 0% | 24% |
| Qwen3.5-2B | llm | 12% | 0% | 0% | 24% |
| Qwen3.5-0.8B | llm | 20% | 4% | 0% | 4% |
| Gemini 2.5 Flash Lite | llm | 20% | 0% | 4% | 32% |
| Qwen3-4B | llm | 16% | 8% | 0% | 0% |
| Qwen3.5-2B | llm | 16% | 4% | 0% | 12% |
| Qwen3.5-4B | llm | 12% | 4% | 0% | 8% |
| Qwen3.5-4B | llm | 12% | 4% | 0% | 8% |
| Qwen3.5-0.8B | llm | 16% | 0% | 4% | 0% |
| Qwen3-4B | llm | 16% | 4% | 0% | 8% |
| Qwen3-4B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 4% | 0% | 0% | 0% |
Reach the distant true GOAL without collecting any decoy KEY.
FogOfWarExploration-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 92% | 76% |
| GPT-5 mini | llm | 96% | 56% | 12% | 12% |
| PPO Dense (2M) | rl | 92% | 36% | 12% | 28% |
| Qwen3.5-4B | llm | 84% | 60% | 12% | 12% |
| PPO Dense (500k) | rl | 56% | 24% | 12% | 8% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 88% | 96% | 76% | 64% |
| Qwen3.5-2B | llm | 72% | 52% | 0% | 24% |
| Qwen3.5-2B | llm | 48% | 44% | 28% | 16% |
| Qwen3.5-0.8B | llm | 56% | 40% | 20% | 20% |
| Qwen3-4B | llm | 64% | 20% | 16% | 16% |
| Random Agent | other | 52% | 56% | 12% | 16% |
| PPO Sparse (500k) | rl | 40% | 20% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 52% | 36% | 16% | 12% |
| Qwen3.5-2B | llm | 36% | 24% | 4% | 4% |
| Qwen3.5-0.8B | llm | 44% | 36% | 12% | 0% |
| Gemini 2.5 Flash Lite | llm | 36% | 24% | 8% | 0% |
| Qwen3-4B | llm | 28% | 12% | 0% | 4% |
| Qwen3.5-2B | llm | 0% | 28% | 8% | 16% |
| Qwen3.5-4B | llm | 16% | 8% | 12% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 4% | 0% |
| Qwen3.5-0.8B | llm | 20% | 4% | 0% | 4% |
| Qwen3-4B | llm | 4% | 0% | 4% | 0% |
| Qwen3-4B | llm | 12% | 8% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
Find and reach the GOAL despite incomplete map information.
SequenceMemory-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 100% |
| GPT-5 mini | llm | 0% | 0% | 0% | 0% |
| PPO Dense (2M) | rl | 76% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| PPO Dense (500k) | rl | 32% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 8% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 4% | 4% | 0% | 0% |
| Qwen3.5-0.8B | llm | 4% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Random Agent | other | 12% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 32% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 12% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
Memorize shown GEM positions, then visit them in exact order during reproduce phase.
TreasureHunt-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 100% |
| GPT-5 mini | llm | 20% | 4% | 0% | 0% |
| PPO Dense (2M) | rl | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 16% | 0% | 0% | 0% |
| PPO Dense (500k) | rl | 4% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 16% | 8% | 0% | 0% |
| Qwen3.5-4B | llm | 8% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 8% | 8% | 0% | 0% |
| Qwen3.5-2B | llm | 8% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 24% | 0% | 0% | 0% |
| Qwen3-4B | llm | 4% | 0% | 0% | 0% |
| Random Agent | other | 28% | 4% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 8% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 8% | 4% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 8% | 0% | 0% | 0% |
| Qwen3-4B | llm | 12% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 4% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 4% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
Read scroll clues, triangulate hidden treasure positions, and step on each treasure cell to collect all treasures.
ChaseEvade-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 84% | 84% | 72% | 16% |
| GPT-5 mini | llm | 4% | 0% | 0% | 0% |
| PPO Dense (2M) | rl | 100% | 40% | 52% | 8% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| PPO Dense (500k) | rl | 96% | 32% | 60% | 12% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Random Agent | other | 0% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 40% | 60% | 20% | 8% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
Survive the required steps without enemy collision.
CooperativeTransport-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 84% | 64% | 52% |
| GPT-5 mini | llm | 8% | 0% | 0% | 0% |
| PPO Dense (2M) | rl | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 48% | 0% | 0% | 0% |
| PPO Dense (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 12% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 4% | 0% | 0% | 0% |
| Random Agent | other | 0% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
Push all heavy boxes into holes with NPC cooperation.
EmergentStrategy-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 88% | 92% | 52% | 48% |
| GPT-5 mini | llm | 92% | 8% | 16% | 0% |
| PPO Dense (2M) | rl | 96% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 52% | 4% | 0% | 0% |
| PPO Dense (500k) | rl | 4% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 76% | 8% | 4% | 4% |
| Qwen3.5-4B | llm | 20% | 8% | 0% | 0% |
| Qwen3.5-2B | llm | 32% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 24% | 4% | 0% | 0% |
| Qwen3.5-0.8B | llm | 36% | 4% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Random Agent | other | 20% | 4% | 4% | 0% |
| PPO Sparse (500k) | rl | 4% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 20% | 8% | 0% | 0% |
| Qwen3.5-0.8B | llm | 4% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 12% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
Exploit NPC behaviors to lure/scare them onto locking pressure plates, permanently opening barriers to reach the GOAL...
Herding-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 80% | 76% | 44% | 36% |
| GPT-5 mini | llm | 8% | 4% | 0% | 0% |
| PPO Dense (2M) | rl | 100% | 52% | 32% | 0% |
| Qwen3.5-4B | llm | 12% | 4% | 0% | 0% |
| PPO Dense (500k) | rl | 28% | 40% | 44% | 76% |
| Gemini 2.5 Flash Lite | llm | 4% | 4% | 0% | 0% |
| Qwen3.5-4B | llm | 16% | 4% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 4% | 0% | 0% |
| Qwen3.5-2B | llm | 12% | 8% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 4% | 0% | 0% |
| Qwen3-4B | llm | 4% | 8% | 0% | 0% |
| Random Agent | other | 8% | 4% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 4% | 4% | 8% | 0% |
| Qwen3.5-2B | llm | 8% | 8% | 0% | 0% |
| Qwen3.5-0.8B | llm | 8% | 4% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 4% | 8% | 0% | 0% |
| Qwen3-4B | llm | 4% | 4% | 0% | 0% |
| Qwen3.5-2B | llm | 8% | 4% | 0% | 0% |
| Qwen3.5-4B | llm | 4% | 8% | 0% | 0% |
| Qwen3.5-4B | llm | 4% | 4% | 0% | 0% |
| Qwen3.5-0.8B | llm | 4% | 4% | 0% | 0% |
| Qwen3-4B | llm | 4% | 8% | 0% | 0% |
| Qwen3-4B | llm | 4% | 4% | 0% | 0% |
| Qwen3.5-0.8B | llm | 4% | 4% | 0% | 0% |
Move all SHEEP into the pen zone (TARGET cells).
TagHunt-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 96% | 84% | 68% | 64% |
| GPT-5 mini | llm | 72% | 60% | 20% | 8% |
| PPO Dense (2M) | rl | 100% | 96% | 96% | 92% |
| Qwen3.5-4B | llm | 88% | 48% | 8% | 4% |
| PPO Dense (500k) | rl | 100% | 80% | 84% | 48% |
| Gemini 2.5 Flash Lite | llm | 92% | 4% | 0% | 0% |
| Qwen3.5-4B | llm | 76% | 8% | 0% | 0% |
| Qwen3.5-2B | llm | 24% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 52% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 24% | 4% | 0% | 0% |
| Qwen3-4B | llm | 48% | 12% | 0% | 0% |
| Random Agent | other | 20% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 20% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 12% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 16% | 0% | 0% | 0% |
| Qwen3-4B | llm | 28% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 4% | 0% | 0% | 0% |
| Qwen3-4B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
Tag all NPCs by stepping onto them.
CuriosityMaze-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 100% |
| GPT-5 mini | llm | 0% | 0% | 0% | 0% |
| PPO Dense (2M) | rl | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| PPO Dense (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Random Agent | other | 0% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 0% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 0% | – |
Visit at least the required percentage of all reachable cells before the step budget runs out.
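As a rough illustration of the coverage criterion described above, here is a minimal sketch. It assumes a grid of strings where `#` marks a wall, 4-connected movement, and a caller-supplied threshold; the function names, the grid encoding, and the threshold values are hypothetical and are not taken from the benchmark's code.

```python
from collections import deque

def reachable_cells(grid, start):
    """Flood-fill from `start` over non-wall cells using 4-connected moves."""
    rows, cols = len(grid), len(grid[0])
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen

def coverage_met(grid, start, visited, required_fraction):
    """True if the visited cells cover at least `required_fraction` of the reachable cells."""
    reachable = reachable_cells(grid, start)
    covered = len(set(visited) & reachable)
    return covered / len(reachable) >= required_fraction

# Hypothetical 3x3 layout: '#' is a wall, '.' is free space.
grid = ["..#",
        ".#.",
        "..."]
visited = [(0, 0), (0, 1), (1, 0), (2, 0)]
print(coverage_met(grid, (0, 0), visited, 0.5))
```

Only cells reachable from the start count toward the denominator, so walled-off regions do not make the target unattainable.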
DynamicObstacles-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 96% | 88% | 92% | 72% |
| GPT-5 mini | llm | 80% | 92% | 60% | 28% |
| PPO Dense (2M) | rl | 96% | 72% | 12% | 4% |
| Qwen3.5-4B | llm | 64% | 60% | 36% | 0% |
| PPO Dense (500k) | rl | 76% | 20% | 4% | 0% |
| Gemini 2.5 Flash Lite | llm | 64% | 68% | 32% | 8% |
| Qwen3.5-4B | llm | 44% | 16% | 24% | 0% |
| Qwen3.5-2B | llm | 52% | 36% | 12% | 0% |
| Qwen3.5-2B | llm | 36% | 32% | 8% | 0% |
| Qwen3.5-0.8B | llm | 36% | 28% | 8% | 0% |
| Qwen3-4B | llm | 24% | 8% | 0% | 0% |
| Random Agent | other | 8% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 60% | 12% | 12% | 0% |
| Gemini 2.5 Flash Lite | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 16% | 4% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 4% | 12% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 24% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 44% | – |
Reach GOAL without colliding with any NPC.
GoToGoal-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 92% |
| GPT-5 mini | llm | 100% | 88% | 56% | 44% |
| PPO Dense (2M) | rl | 100% | 24% | 16% | 8% |
| Qwen3.5-4B | llm | 88% | 52% | 28% | 8% |
| PPO Dense (500k) | rl | 100% | 40% | 12% | 0% |
| Gemini 2.5 Flash Lite | llm | 100% | 64% | 8% | 4% |
| Qwen3.5-4B | llm | 68% | 20% | 0% | 4% |
| Qwen3.5-2B | llm | 52% | 32% | 0% | 0% |
| Qwen3.5-2B | llm | 52% | 16% | 0% | 0% |
| Qwen3.5-0.8B | llm | 32% | 12% | 0% | 0% |
| Qwen3-4B | llm | 40% | 12% | 0% | 0% |
| Random Agent | other | 44% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 100% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 8% | 4% | 0% | 0% |
| Qwen3.5-2B | llm | 40% | 24% | 0% | 0% |
| Qwen3.5-0.8B | llm | 36% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 28% | 16% | 8% | 4% |
| Qwen3-4B | llm | 40% | 4% | 0% | 4% |
| Qwen3.5-2B | llm | 0% | 8% | 0% | 0% |
| Qwen3.5-4B | llm | 20% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 20% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 20% | – |
Reach the GOAL position.
InstructionFollowing-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 92% |
| GPT-5 mini | llm | 96% | 96% | 80% | 40% |
| PPO Dense (2M) | rl | 20% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 64% | 68% | 4% | 0% |
| PPO Dense (500k) | rl | 16% | 4% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 52% | 56% | 0% | 0% |
| Qwen3.5-4B | llm | 48% | 40% | 0% | 0% |
| Qwen3.5-2B | llm | 32% | 28% | 0% | 0% |
| Qwen3.5-2B | llm | 16% | 16% | 0% | 0% |
| Qwen3.5-0.8B | llm | 12% | 24% | 0% | 0% |
| Qwen3-4B | llm | 28% | 36% | 0% | 0% |
| Random Agent | other | 12% | 20% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 16% | 24% | 0% | 0% |
| Qwen3.5-2B | llm | 4% | 8% | 0% | 0% |
| Qwen3.5-0.8B | llm | 28% | 24% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 28% | 12% | 0% | 0% |
| Qwen3-4B | llm | 32% | 32% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 8% | 0% | 0% |
| Qwen3.5-4B | llm | 12% | 4% | 0% | 0% |
| Qwen3.5-4B | llm | 12% | 8% | 0% | 0% |
| Qwen3.5-0.8B | llm | 4% | 8% | 0% | 0% |
| Qwen3-4B | llm | 12% | 8% | 0% | 0% |
| Qwen3-4B | llm | 12% | 4% | 0% | 0% |
| Qwen3.5-0.8B | llm | 4% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 8% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 0% | – |
Reach the unique target object without touching any distractor.
MazeNavigation-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 100% |
| GPT-5 mini | llm | 80% | 16% | 4% | 0% |
| PPO Dense (2M) | rl | 100% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 84% | 12% | 0% | 0% |
| PPO Dense (500k) | rl | 100% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 68% | 24% | 4% | 0% |
| Qwen3.5-4B | llm | 72% | 4% | 0% | 0% |
| Qwen3.5-2B | llm | 96% | 8% | 0% | 0% |
| Qwen3.5-2B | llm | 76% | 8% | 0% | 0% |
| Qwen3.5-0.8B | llm | 8% | 4% | 0% | 0% |
| Qwen3-4B | llm | 72% | 0% | 0% | 0% |
| Random Agent | other | 4% | 4% | 0% | 0% |
| PPO Sparse (500k) | rl | 100% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 24% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 4% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 0% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 4% | – |
Navigate the maze to reach the GOAL exit.
RecursiveRooms-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 100% |
| GPT-5 mini | llm | 48% | 12% | 12% | 8% |
| PPO Dense (2M) | rl | 12% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 80% | 32% | 0% | 0% |
| PPO Dense (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 40% | 4% | 8% | 8% |
| Qwen3.5-4B | llm | 52% | 12% | 4% | 0% |
| Qwen3.5-2B | llm | 36% | 16% | 0% | 0% |
| Qwen3.5-2B | llm | 28% | 12% | 0% | 12% |
| Qwen3.5-0.8B | llm | 12% | 0% | 8% | 8% |
| Qwen3-4B | llm | 8% | 0% | 0% | 0% |
| Random Agent | other | 12% | 0% | 0% | 4% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 20% | 8% | 4% | 16% |
| Qwen3.5-0.8B | llm | 12% | 4% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 4% | 0% | 0% |
| Qwen3-4B | llm | 8% | 4% | 4% | 0% |
| Qwen3.5-2B | llm | 8% | 4% | 4% | 16% |
| Qwen3.5-4B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 4% | 0% | 0% | 0% |
| Qwen3-4B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 0% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 12% | – |
Navigate through nested rooms to reach GOAL in the deepest room.
ShortestPath-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 96% | 96% |
| GPT-5 mini | llm | 100% | 88% | 72% | 28% |
| PPO Dense (2M) | rl | 100% | 72% | 0% | 0% |
| Qwen3.5-4B | llm | 12% | 8% | 0% | 4% |
| PPO Dense (500k) | rl | 72% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 56% | 20% | 0% | 0% |
| Qwen3.5-4B | llm | 12% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 4% | 0% | 0% | 0% |
| Random Agent | other | 0% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 0% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 16% | – |
Visit all real GOAL objects within the step budget (optimal path × budget multiplier).
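The budget rule above (optimal path × budget multiplier) could be computed along these lines. This is a sketch under stated assumptions: the grid encoding (`#` for walls), 4-connected movement, brute-force ordering of goals, and rounding up with `ceil` are illustrative choices, and all function names are hypothetical. Decoy goals are not modelled.

```python
from collections import deque
from itertools import permutations
from math import ceil

def bfs_distances(grid, source):
    """Shortest step counts from `source` to every reachable non-wall cell."""
    rows, cols = len(grid), len(grid[0])
    dist = {source: 0}
    queue = deque([source])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist

def optimal_path_length(grid, start, goals):
    """Length of the shortest walk from `start` visiting every goal (brute force over orders)."""
    dist = {p: bfs_distances(grid, p) for p in [start, *goals]}
    best = None
    for order in permutations(goals):
        total, current = 0, start
        for goal in order:
            if goal not in dist[current]:
                total = None  # this goal cannot be reached
                break
            total += dist[current][goal]
            current = goal
        if total is not None and (best is None or total < best):
            best = total
    return best

def step_budget(grid, start, goals, multiplier):
    """Step budget = optimal path length x multiplier, rounded up (rounding is an assumption)."""
    optimal = optimal_path_length(grid, start, goals)
    if optimal is None:
        raise ValueError("at least one goal is unreachable")
    return ceil(optimal * multiplier)

# Hypothetical layout with two goals.
grid = ["....",
        ".##.",
        "...."]
print(step_budget(grid, (0, 0), [(0, 3), (2, 0)], 1.5))
```

Brute force over goal orderings is only practical for a handful of goals; a larger instance would need a dedicated tour solver.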
TimingChallenge-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 96% | 100% |
| GPT-5 mini | llm | 36% | 24% | 36% | 36% |
| PPO Dense (2M) | rl | 52% | 32% | 44% | 36% |
| Qwen3.5-4B | llm | 4% | 4% | 0% | 0% |
| PPO Dense (500k) | rl | 56% | 40% | 48% | 28% |
| Gemini 2.5 Flash Lite | llm | 8% | 16% | 28% | 20% |
| Qwen3.5-4B | llm | 0% | 8% | 8% | 0% |
| Qwen3.5-2B | llm | 8% | 4% | 12% | 12% |
| Qwen3.5-2B | llm | 28% | 36% | 16% | 16% |
| Qwen3.5-0.8B | llm | 0% | 12% | 12% | 4% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Random Agent | other | 0% | 8% | 0% | 12% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 44% | 12% | 0% | 4% |
| Qwen3.5-2B | llm | 0% | 4% | 4% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 16% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 4% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 4% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 32% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 32% | – |
Cross the patrol zone without collision, then reach GOAL.
BacktrackPuzzle-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 100% |
| GPT-5 mini | llm | 100% | 60% | 20% | 4% |
| PPO Dense (2M) | rl | 96% | 52% | 0% | 0% |
| Qwen3.5-4B | llm | 88% | 32% | 20% | 24% |
| PPO Dense (500k) | rl | 100% | 8% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 16% | 8% | 8% | 8% |
| Qwen3.5-4B | llm | 76% | 16% | 8% | 0% |
| Qwen3.5-2B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Random Agent | other | 0% | 4% | 0% | 0% |
| PPO Sparse (500k) | rl | 92% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 4% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 16% | 4% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 0% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 60% | – |
Activate the correct SWITCH to open the gate, then backtrack to reach GOAL.
KeyDoorPuzzle-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 88% | 76% |
| GPT-5 mini | llm | 100% | 44% | 4% | 0% |
| PPO Dense (2M) | rl | 84% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 60% | 44% | 0% | 0% |
| PPO Dense (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 32% | 4% | 0% | 0% |
| Qwen3.5-4B | llm | 36% | 28% | 0% | 0% |
| Qwen3.5-2B | llm | 20% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 20% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 4% | 0% | 0% | 0% |
| Qwen3-4B | llm | 4% | 0% | 0% | 0% |
| Random Agent | other | 0% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 0% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 0% | – |
Reach GOAL after unlocking ALL doors with matching keys.
PackingPuzzle-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 96% | 96% | 96% | 96% |
| GPT-5 mini | llm | 0% | 0% | 0% | 0% |
| PPO Dense (2M) | rl | 64% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| PPO Dense (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Random Agent | other | 0% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 0% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 0% | – |
Push each piece onto its matching-type target slot.
PreciseNavigation-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 56% | 32% |
| GPT-5 mini | llm | 88% | 12% | 12% | 12% |
| PPO Dense (2M) | rl | 64% | 0% | 8% | 4% |
| Qwen3.5-4B | llm | 92% | 88% | 36% | 40% |
| PPO Dense (500k) | rl | 88% | 4% | 4% | 0% |
| Gemini 2.5 Flash Lite | llm | 88% | 28% | 36% | 20% |
| Qwen3.5-4B | llm | 80% | 48% | 16% | 20% |
| Qwen3.5-2B | llm | 76% | 28% | 20% | 16% |
| Qwen3.5-2B | llm | 76% | 40% | 40% | 24% |
| Qwen3.5-0.8B | llm | 40% | 12% | 20% | 4% |
| Qwen3-4B | llm | 48% | 4% | 4% | 8% |
| Random Agent | other | 84% | 16% | 44% | 24% |
| PPO Sparse (500k) | rl | 20% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 36% | 4% | 8% | 0% |
| Qwen3.5-2B | llm | 48% | 12% | 32% | 16% |
| Qwen3.5-0.8B | llm | 52% | 16% | 24% | 8% |
| Gemini 2.5 Flash Lite | llm | 44% | 4% | 16% | 4% |
| Qwen3-4B | llm | 16% | 12% | 4% | 0% |
| Qwen3.5-2B | llm | 0% | 12% | 4% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 16% | 16% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 4% |
| Claude Haiku 4.5 | llm | – | – | 32% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 48% | – |
Slide across ice to reach the GOAL by planning trajectories through stopping points.
RecipeAssembly-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 100% |
| GPT-5 mini | llm | 24% | 0% | 0% | 0% |
| PPO Dense (2M) | rl | 88% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 4% | 0% | 0% | 0% |
| PPO Dense (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 16% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Random Agent | other | 0% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 0% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 0% | – |
Collect and deliver all ingredients in recipe order to the crafting station.
ResourceManagement-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 96% |
| GPT-5 mini | llm | 100% | 100% | 12% | 0% |
| PPO Dense (2M) | rl | 100% | 100% | 100% | 100% |
| Qwen3.5-4B | llm | 100% | 100% | 0% | 0% |
| PPO Dense (500k) | rl | 100% | 100% | 100% | 100% |
| Gemini 2.5 Flash Lite | llm | 100% | 100% | 0% | 0% |
| Qwen3.5-4B | llm | 100% | 100% | 0% | 0% |
| Qwen3.5-2B | llm | 100% | 100% | 0% | 0% |
| Qwen3.5-2B | llm | 100% | 100% | 0% | 0% |
| Qwen3.5-0.8B | llm | 100% | 100% | 0% | 0% |
| Qwen3-4B | llm | 100% | 100% | 0% | 0% |
| Random Agent | other | 100% | 100% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 100% | 100% | 0% | 0% |
| Qwen3.5-2B | llm | 100% | 100% | 0% | 0% |
| Qwen3.5-0.8B | llm | 100% | 100% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 100% | 100% | 0% | 0% |
| Qwen3-4B | llm | 100% | 100% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 100% | 0% | 0% |
| Qwen3.5-4B | llm | 100% | 100% | 0% | 0% |
| Qwen3.5-4B | llm | 100% | 100% | 0% | 0% |
| Qwen3.5-0.8B | llm | 100% | 0% | 0% | 0% |
| Qwen3-4B | llm | 100% | 100% | 0% | 0% |
| Qwen3-4B | llm | 100% | 100% | 0% | 0% |
| Qwen3.5-0.8B | llm | 100% | 100% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 76% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 48% | – |
Keep ALL stations above 0 energy for the entire episode (survive max_steps).
SokobanPush-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 84% | 88% | 36% |
| GPT-5 mini | llm | 12% | 0% | 0% | 0% |
| PPO Dense (2M) | rl | 92% | 4% | 0% | 0% |
| Qwen3.5-4B | llm | 20% | 4% | 0% | 0% |
| PPO Dense (500k) | rl | 36% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 12% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 8% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 8% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 4% | 0% | 0% | 0% |
| Random Agent | other | 0% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 16% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 8% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 0% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 0% | – |
Push all BOX objects onto matching TARGET positions.
TileSorting-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 100% |
| GPT-5 mini | llm | 88% | 0% | 12% | 0% |
| PPO Dense (2M) | rl | 100% | 96% | 0% | 0% |
| Qwen3.5-4B | llm | 100% | 72% | 0% | 0% |
| PPO Dense (500k) | rl | 92% | 84% | 56% | 0% |
| Gemini 2.5 Flash Lite | llm | 76% | 32% | 16% | 0% |
| Qwen3.5-4B | llm | 96% | 4% | 0% | 0% |
| Qwen3.5-2B | llm | 80% | 12% | 0% | 0% |
| Qwen3.5-2B | llm | 100% | 4% | 0% | 0% |
| Qwen3.5-0.8B | llm | 68% | 12% | 0% | 0% |
| Qwen3-4B | llm | 72% | 20% | 8% | 0% |
| Random Agent | other | 80% | 8% | 0% | 0% |
| PPO Sparse (500k) | rl | 72% | 4% | 12% | 0% |
| Gemini 2.5 Flash Lite | llm | 72% | 8% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 12% | 4% | 0% |
| Qwen3.5-0.8B | llm | 20% | 12% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 64% | 0% | 0% | 0% |
| Qwen3-4B | llm | 20% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 4% | 0% | 0% |
| Qwen3.5-4B | llm | 16% | 4% | 0% | 0% |
| Qwen3.5-4B | llm | 16% | 4% | 0% | 0% |
| Qwen3.5-0.8B | llm | 4% | 0% | 0% | 0% |
| Qwen3-4B | llm | 16% | 4% | 0% | 0% |
| Qwen3-4B | llm | 0% | 4% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 44% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 24% | – |
Arrange tiles into the goal configuration (1, 2, 3, ..., N-1 in row-major order).
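A minimal sketch of the goal check described above, assuming the board is a list of rows, `N` is the total number of cells, and the single blank cell is encoded as `0` and ends up last. The encoding and the name `is_goal` are illustrative assumptions, not the benchmark's actual implementation.

```python
from typing import List


def is_goal(board: List[List[int]]) -> bool:
    """Return True if tiles read 1..N-1 in row-major order with the blank (0) last.

    Assumption: the blank is encoded as 0 and belongs in the final cell.
    """
    flat = [tile for row in board for tile in row]  # row-major flattening
    n = len(flat)
    return flat == list(range(1, n)) + [0]
```

For a 3x3 board this accepts `[[1, 2, 3], [4, 5, 6], [7, 8, 0]]` and rejects any board with two tiles swapped.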
ToolUse-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 100% |
| GPT-5 mini | llm | 100% | 100% | 100% | 100% |
| PPO Dense (2M) | rl | 100% | 100% | 96% | 0% |
| Qwen3.5-4B | llm | 36% | 52% | 84% | 32% |
| PPO Dense (500k) | rl | 56% | 52% | 100% | 0% |
| Gemini 2.5 Flash Lite | llm | 100% | 92% | 64% | 56% |
| Qwen3.5-4B | llm | 52% | 56% | 0% | 28% |
| Qwen3.5-2B | llm | 100% | 100% | 96% | 100% |
| Qwen3.5-2B | llm | 80% | 88% | 100% | 96% |
| Qwen3.5-0.8B | llm | 56% | 52% | 72% | 52% |
| Qwen3-4B | llm | 0% | 0% | 0% | 8% |
| Random Agent | other | 16% | 24% | 16% | 24% |
| PPO Sparse (500k) | rl | 100% | 68% | 40% | 4% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 100% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 96% | – |
Collect all SCROLLs to spawn the ORB, pick up the ORB, cross the river, and reach the GOAL for full reward (1.0).
DeceptiveReward-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 100% |
| GPT-5 mini | llm | 100% | 100% | 100% | 84% |
| PPO Dense (2M) | rl | 100% | 100% | 100% | 100% |
| Qwen3.5-4B | llm | 88% | 52% | 20% | 16% |
| PPO Dense (500k) | rl | 100% | 100% | 100% | 100% |
| Gemini 2.5 Flash Lite | llm | 100% | 92% | 60% | 24% |
| Qwen3.5-4B | llm | 0% | 0% | 40% | 4% |
| Qwen3.5-2B | llm | 68% | 56% | 24% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 16% | 8% | 24% | 4% |
| Qwen3-4B | llm | 16% | 0% | 0% | 0% |
| Random Agent | other | 4% | 0% | 4% | 0% |
| PPO Sparse (500k) | rl | 100% | 0% | 12% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 28% | 28% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 4% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 4% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 80% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 56% | – |
Resist the coin reward gradient.
GraphColoring-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 100% |
| GPT-5 mini | llm | 0% | 0% | 0% | 0% |
| PPO Dense (2M) | rl | 32% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 16% | 4% | 4% | 0% |
| PPO Dense (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Random Agent | other | 0% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 0% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 0% | – |
Color all nodes so no two adjacent nodes share the same color.
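The success condition above can be sketched as a proper-coloring check. This is an illustrative sketch only: the graph representation (a node list plus an edge list) and the name `is_proper_coloring` are assumptions, not the environment's actual API.

```python
from typing import Dict, Hashable, Iterable, Tuple


def is_proper_coloring(
    nodes: Iterable[Hashable],
    edges: Iterable[Tuple[Hashable, Hashable]],
    colors: Dict[Hashable, int],
) -> bool:
    """Return True if every node is colored and no edge joins two same-colored nodes."""
    # "Color all nodes": an uncolored node means the task is not finished.
    if any(node not in colors for node in nodes):
        return False
    # "No two adjacent nodes share the same color": check each edge once.
    # .get avoids a KeyError if an edge mentions a node missing from `nodes`.
    return all(
        colors.get(u) is not None and colors.get(u) != colors.get(v)
        for u, v in edges
    )
```

A triangle needs three distinct colors under this check; assigning the same color to any two of its nodes fails.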
LightsOut-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 88% | 92% | 100% |
| GPT-5 mini | llm | 4% | 0% | 0% | 0% |
| PPO Dense (2M) | rl | 100% | 12% | 0% | 0% |
| Qwen3.5-4B | llm | 36% | 4% | 0% | 0% |
| PPO Dense (500k) | rl | 56% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 16% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 12% | 4% | 0% | 0% |
| Qwen3.5-2B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 8% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 8% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Random Agent | other | 0% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 4% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 0% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 0% | – |
Turn all lights OFF by toggling switches.
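A rough sketch of the classic Lights Out rule, assuming (this is not stated on the page) that toggling a switch flips that cell and its four orthogonal neighbours on a rectangular grid. The names `toggle` and `all_off` are hypothetical, and the benchmark's actual dynamics may differ.

```python
from typing import List

Grid = List[List[bool]]  # True means the light is ON


def toggle(grid: Grid, r: int, c: int) -> Grid:
    """Return a new grid with cell (r, c) and its orthogonal neighbours flipped.

    Assumption: classic Lights Out neighbourhood; cells off the edge are ignored.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy so the input grid is left unchanged
    for dr, dc in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            new[rr][cc] = not new[rr][cc]
    return new


def all_off(grid: Grid) -> bool:
    """Success condition: no light is still ON."""
    return not any(any(row) for row in grid)
```

Under this rule each toggle is its own inverse, so pressing the same switch twice restores the previous state, and the order of presses does not affect the final grid.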
ProgramSynthesis-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 100% |
| GPT-5 mini | llm | 0% | 0% | 0% | 0% |
| PPO Dense (2M) | rl | 4% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 8% | 0% | 0% | 0% |
| PPO Dense (500k) | rl | 4% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Random Agent | other | 0% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 0% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 0% | – |
Push all GEM objects so they form the same relative pattern as the reference SCROLLs.
RuleInduction-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 88% | 96% | 100% |
| GPT-5 mini | llm | 0% | 0% | 0% | 0% |
| PPO Dense (2M) | rl | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 28% | 20% | 0% | 0% |
| PPO Dense (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 4% | 4% | 4% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 12% | 0% | 0% |
| Qwen3.5-0.8B | llm | 4% | 0% | 4% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Random Agent | other | 0% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 4% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 0% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 32% | – |
Identify real switches via ICE pattern, INTERACT all real ones, pass barrier to GOAL.
SwitchCircuit-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 80% | 80% | 80% | 76% |
| GPT-5 mini | llm | 4% | 0% | 0% | 0% |
| PPO Dense (2M) | rl | 24% | 4% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| PPO Dense (500k) | rl | 4% | 0% | 4% | 4% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Random Agent | other | 0% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 0% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 0% | – |
Plan switch activation order to open all barriers blocking the path to GOAL.
SymbolMatching-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 100% |
| GPT-5 mini | llm | 0% | 0% | 0% | 0% |
| PPO Dense (2M) | rl | 32% | 4% | 0% | 0% |
| Qwen3.5-4B | llm | 48% | 0% | 0% | 0% |
| PPO Dense (500k) | rl | 16% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 8% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Random Agent | other | 4% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 0% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 0% | – |
Pick up each symbol item and deliver it to the matching target of the same type on the right side of the grid.
TaskInterference-v0
| Agent | Type | Easy | Medium | Hard | Expert |
| --- | --- | --- | --- | --- | --- |
| Oracle Agent | other | 100% | 100% | 100% | 100% |
| GPT-5 mini | llm | 16% | 8% | 4% | 0% |
| PPO Dense (2M) | rl | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 52% | 0% | 0% | 0% |
| PPO Dense (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 4% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Random Agent | other | 0% | 0% | 0% | 0% |
| PPO Sparse (500k) | rl | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Gemini 2.5 Flash Lite | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-2B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3-4B | llm | 0% | 0% | 0% | 0% |
| Qwen3.5-0.8B | llm | 0% | 0% | 0% | 0% |
| Claude Haiku 4.5 | llm | – | – | 12% | – |
| Gemini 3.1 Flash Lite | llm | – | – | 0% | – |
Raise both GEM and ORB meters to >= threshold simultaneously.