Agentick Leaderboard

Universal benchmark for evaluating AI agents.

ONS (Oracle-Normalized Score) = (agent − random) / (oracle − random), where 0% = random baseline and 100% = oracle upper bound.
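
The normalization above can be sketched in Python. This is a minimal illustration, not Agentick's own code: the function name, parameter names, and input values are hypothetical and are not taken from the leaderboard.

```python
def oracle_normalized_score(agent_score: float,
                            random_score: float,
                            oracle_score: float) -> float:
    """ONS = (agent - random) / (oracle - random).

    0.0 corresponds to the random baseline, 1.0 to the oracle upper bound.
    """
    gap = oracle_score - random_score
    if gap == 0:
        # Assumption: the score is treated as undefined when the two baselines coincide.
        raise ValueError("oracle and random scores are equal; ONS is undefined")
    return (agent_score - random_score) / gap


# Hypothetical raw scores, chosen only to illustrate the arithmetic.
print(oracle_normalized_score(agent_score=0.5, random_score=0.25, oracle_score=0.75))
```

An agent that does worse than the random baseline gets a negative value under this formula, and one that beats the oracle gets a value above 1.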

Oracle-Normalized Score (ONS)

Category ONS Breakdown

Rank Agent Type Modality Harness Score 95% CI Open Date
1 Oracle Agent other 0.895 0.811–0.969 No 2026-03-17
2 GPT-5 mini llm ascii MarkovianReasoner 0.309 0.000–0.000 No 2026-03-20
3 PPO Dense (2M) rl rgb_array 0.287 0.212–0.367 Yes 2026-03-22
4 Qwen3.5-4B llm ascii markovian_reasoner 0.228 0.161–0.292 Yes 2026-03-25
5 PPO Dense (500k) rl rgb_array 0.226 0.166–0.287 Yes 2026-03-20
6 Gemini 2.5 Flash Lite llm ascii markovian_reasoner 0.187 0.126–0.247 No 2026-03-17
7 Qwen3.5-4B llm language markovian_reasoner 0.181 0.085–0.272 Yes 2026-03-25
8 Qwen3.5-2B llm ascii markovian_reasoner 0.133 0.069–0.195 Yes 2026-03-25
9 Qwen3.5-2B llm language markovian_reasoner 0.122 0.056–0.183 Yes 2026-03-25
10 Qwen3.5-0.8B llm ascii markovian_reasoner 0.094 0.048–0.140 Yes 2026-03-22
11 Qwen3-4B llm ascii markovian_reasoner 0.085 0.000–0.000 Yes 2026-03-21
12 Random Agent other 0.082 0.031–0.130 No 2026-03-17
13 PPO Sparse (500k) rl rgb_array 0.074 0.051–0.097 Yes 2026-03-20
14 Gemini 2.5 Flash Lite llm language markovian_zero_shot 0.064 0.025–0.102 No 2026-03-17
15 Qwen3.5-2B llm ascii markovian_zero_shot 0.062 0.032–0.093 Yes 2026-03-22
16 Qwen3.5-0.8B llm language markovian_reasoner 0.061 0.021–0.100 Yes 2026-03-22
17 Gemini 2.5 Flash Lite llm ascii markovian_zero_shot 0.053 0.022–0.080 No 2026-03-17
18 Qwen3-4B llm language markovian_reasoner 0.050 0.000–0.000 Yes 2026-03-21
19 Qwen3.5-2B llm language markovian_zero_shot 0.031 0.012–0.050 Yes 2026-03-22
20 Qwen3.5-4B llm ascii markovian_zero_shot 0.023 0.007–0.042 Yes 2026-03-22
21 Qwen3.5-4B llm language markovian_zero_shot 0.020 0.006–0.038 Yes 2026-03-22
22 Qwen3.5-0.8B llm ascii markovian_zero_shot 0.020 0.006–0.035 Yes 2026-03-25
23 Qwen3-4B llm ascii markovian_zero_shot 0.020 0.000–0.000 Yes 2026-03-21
24 Qwen3-4B llm language markovian_zero_shot 0.019 0.000–0.000 Yes 2026-03-21
25 Qwen3.5-0.8B llm language markovian_zero_shot 0.016 0.002–0.035 Yes 2026-03-22

Generalization ONS

Rank Agent Type Modality Harness Score
1 Oracle Agent other 0.837
2 GPT-5 mini llm ascii MarkovianReasoner 0.437
3 PPO Dense (2M) rl rgb_array 0.163
4 Qwen3.5-4B llm ascii markovian_reasoner 0.327
5 PPO Dense (500k) rl rgb_array 0.130
6 Gemini 2.5 Flash Lite llm ascii markovian_reasoner 0.287
7 Qwen3.5-4B llm language markovian_reasoner 0.340
8 Qwen3.5-2B llm ascii markovian_reasoner 0.133
9 Qwen3.5-2B llm language markovian_reasoner 0.170
10 Qwen3.5-0.8B llm ascii markovian_reasoner 0.143

Memory ONS

Rank Agent Type Modality Harness Score
1 Oracle Agent other 0.980
2 GPT-5 mini llm ascii MarkovianReasoner 0.347
3 PPO Dense (2M) rl rgb_array 0.282
4 Qwen3.5-4B llm ascii markovian_reasoner 0.247
5 PPO Dense (500k) rl rgb_array 0.228
6 Gemini 2.5 Flash Lite llm ascii markovian_reasoner 0.163
7 Qwen3.5-4B llm language markovian_reasoner 0.295
8 Qwen3.5-2B llm ascii markovian_reasoner 0.212
9 Qwen3.5-2B llm language markovian_reasoner 0.135
10 Qwen3.5-0.8B llm ascii markovian_reasoner 0.133

Multi Agent ONS

Rank Agent Type Modality Harness Score
1 Oracle Agent other 0.692
2 GPT-5 mini llm ascii MarkovianReasoner 0.150
3 PPO Dense (2M) rl rgb_array 0.432
4 Qwen3.5-4B llm ascii markovian_reasoner 0.134
5 PPO Dense (500k) rl rgb_array 0.352
6 Gemini 2.5 Flash Lite llm ascii markovian_reasoner 0.098
7 Qwen3.5-4B llm language markovian_reasoner 0.072
8 Qwen3.5-2B llm ascii markovian_reasoner 0.032
9 Qwen3.5-2B llm language markovian_reasoner 0.050
10 Qwen3.5-0.8B llm ascii markovian_reasoner 0.036

Navigation ONS

Rank Agent Type Modality Harness Score
1 Oracle Agent other 0.975
2 GPT-5 mini llm ascii MarkovianReasoner 0.456
3 PPO Dense (2M) rl rgb_array 0.250
4 Qwen3.5-4B llm ascii markovian_reasoner 0.223
5 PPO Dense (500k) rl rgb_array 0.193
6 Gemini 2.5 Flash Lite llm ascii markovian_reasoner 0.237
7 Qwen3.5-4B llm language markovian_reasoner 0.136
8 Qwen3.5-2B llm ascii markovian_reasoner 0.136
9 Qwen3.5-2B llm language markovian_reasoner 0.128
10 Qwen3.5-0.8B llm ascii markovian_reasoner 0.069

Planning ONS

Rank Agent Type Modality Harness Score
1 Oracle Agent other 0.928
2 GPT-5 mini llm ascii MarkovianReasoner 0.334
3 PPO Dense (2M) rl rgb_array 0.402
4 Qwen3.5-4B llm ascii markovian_reasoner 0.313
5 PPO Dense (500k) rl rgb_array 0.300
6 Gemini 2.5 Flash Lite llm ascii markovian_reasoner 0.249
7 Qwen3.5-4B llm language markovian_reasoner 0.219
8 Qwen3.5-2B llm ascii markovian_reasoner 0.237
9 Qwen3.5-2B llm language markovian_reasoner 0.244
10 Qwen3.5-0.8B llm ascii markovian_reasoner 0.164

Reasoning ONS

Rank Agent Type Modality Harness Score
1 Oracle Agent other 0.961
2 GPT-5 mini llm ascii MarkovianReasoner 0.131
3 PPO Dense (2M) rl rgb_array 0.191
4 Qwen3.5-4B llm ascii markovian_reasoner 0.124
5 PPO Dense (500k) rl rgb_array 0.152
6 Gemini 2.5 Flash Lite llm ascii markovian_reasoner 0.090
7 Qwen3.5-4B llm language markovian_reasoner 0.024
8 Qwen3.5-2B llm ascii markovian_reasoner 0.048
9 Qwen3.5-2B llm language markovian_reasoner 0.007
10 Qwen3.5-0.8B llm ascii markovian_reasoner 0.021

DistributionShift-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 48%
GPT-5 mini llm 8% 4% 4% 0%
PPO Dense (2M) rl 0% 0% 0% 0%
Qwen3.5-4B llm 0% 0% 0% 0%
PPO Dense (500k) rl 0% 0% 0% 0%
Reach all 3 goals across shifting maze phases.

FewShotAdaptation-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 92% 80% 72% 80%
GPT-5 mini llm 56% 44% 28% 44%
PPO Dense (2M) rl 0% 0% 0% 0%
Qwen3.5-4B llm 36% 44% 28% 24%
PPO Dense (500k) rl 0% 0% 0% 0%
Watch demo trials to infer the hidden rule, then navigate to the correct candidate object in the test trial.

NoisyObservation-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 80% 84% 68%
GPT-5 mini llm 100% 84% 84% 68%
PPO Dense (2M) rl 100% 56% 24% 16%
Qwen3.5-4B llm 88% 68% 60% 44%
PPO Dense (500k) rl 96% 28% 8% 24%
Locate and reach the true GOAL amid visual noise.

DelayedGratification-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
GPT-5 mini llm 100% 100% 56% 100%
PPO Dense (2M) rl 100% 88% 20% 0%
Qwen3.5-4B llm 84% 36% 20% 72%
PPO Dense (500k) rl 100% 28% 0% 100%
Reach the distant true GOAL without collecting any decoy KEY.

FogOfWarExploration-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 92% 76%
GPT-5 mini llm 96% 56% 12% 12%
PPO Dense (2M) rl 92% 36% 12% 28%
Qwen3.5-4B llm 84% 60% 12% 12%
PPO Dense (500k) rl 56% 24% 12% 8%
Find and reach the GOAL despite incomplete map information.

SequenceMemory-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
GPT-5 mini llm 0% 0% 0% 0%
PPO Dense (2M) rl 76% 0% 0% 0%
Qwen3.5-4B llm 0% 0% 0% 0%
PPO Dense (500k) rl 32% 0% 0% 0%
Memorize the shown GEM positions, then visit them in the exact order during the reproduce phase.

TreasureHunt-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
GPT-5 mini llm 20% 4% 0% 0%
PPO Dense (2M) rl 0% 0% 0% 0%
Qwen3.5-4B llm 16% 0% 0% 0%
PPO Dense (500k) rl 4% 0% 0% 0%
Read scroll clues, triangulate hidden treasure positions, and step on each treasure cell to collect all treasures.

ChaseEvade-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 84% 84% 72% 16%
GPT-5 mini llm 4% 0% 0% 0%
PPO Dense (2M) rl 100% 40% 52% 8%
Qwen3.5-4B llm 0% 0% 0% 0%
PPO Dense (500k) rl 96% 32% 60% 12%
Survive the required steps without enemy collision.

CooperativeTransport-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 84% 64% 52%
GPT-5 mini llm 8% 0% 0% 0%
PPO Dense (2M) rl 0% 0% 0% 0%
Qwen3.5-4B llm 48% 0% 0% 0%
PPO Dense (500k) rl 0% 0% 0% 0%
Push all heavy boxes into holes with NPC cooperation.

EmergentStrategy-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 88% 92% 52% 48%
GPT-5 mini llm 92% 8% 16% 0%
PPO Dense (2M) rl 96% 0% 0% 0%
Qwen3.5-4B llm 52% 4% 0% 0%
PPO Dense (500k) rl 4% 0% 0% 0%
Exploit NPC behaviors to lure/scare them onto locking pressure plates, permanently opening barriers to reach the GOAL...

Herding-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 80% 76% 44% 36%
GPT-5 mini llm 8% 4% 0% 0%
PPO Dense (2M) rl 100% 52% 32% 0%
Qwen3.5-4B llm 12% 4% 0% 0%
PPO Dense (500k) rl 28% 40% 44% 76%
Move all SHEEP into the pen zone (TARGET cells).

TagHunt-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 96% 84% 68% 64%
GPT-5 mini llm 72% 60% 20% 8%
PPO Dense (2M) rl 100% 96% 96% 92%
Qwen3.5-4B llm 88% 48% 8% 4%
PPO Dense (500k) rl 100% 80% 84% 48%
Tag all NPCs by stepping onto them.

CuriosityMaze-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
GPT-5 mini llm 0% 0% 0% 0%
PPO Dense (2M) rl 0% 0% 0% 0%
Qwen3.5-4B llm 0% 0% 0% 0%
PPO Dense (500k) rl 0% 0% 0% 0%
Visit at least the required percentage of all reachable cells before the step budget runs out.

DynamicObstacles-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 96% 88% 92% 72%
GPT-5 mini llm 80% 92% 60% 28%
PPO Dense (2M) rl 96% 72% 12% 4%
Qwen3.5-4B llm 64% 60% 36% 0%
PPO Dense (500k) rl 76% 20% 4% 0%
Reach GOAL without colliding with any NPC.

GoToGoal-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 92%
GPT-5 mini llm 100% 88% 56% 44%
PPO Dense (2M) rl 100% 24% 16% 8%
Qwen3.5-4B llm 88% 52% 28% 8%
PPO Dense (500k) rl 100% 40% 12% 0%
Reach the GOAL position.

InstructionFollowing-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 92%
GPT-5 mini llm 96% 96% 80% 40%
PPO Dense (2M) rl 20% 0% 0% 0%
Qwen3.5-4B llm 64% 68% 4% 0%
PPO Dense (500k) rl 16% 4% 0% 0%
Reach the unique target object without touching any distractor.

MazeNavigation-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
GPT-5 mini llm 80% 16% 4% 0%
PPO Dense (2M) rl 100% 0% 0% 0%
Qwen3.5-4B llm 84% 12% 0% 0%
PPO Dense (500k) rl 100% 0% 0% 0%
Navigate the maze to reach the GOAL exit.

RecursiveRooms-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
GPT-5 mini llm 48% 12% 12% 8%
PPO Dense (2M) rl 12% 0% 0% 0%
Qwen3.5-4B llm 80% 32% 0% 0%
PPO Dense (500k) rl 0% 0% 0% 0%
Navigate through nested rooms to reach GOAL in the deepest room.

ShortestPath-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 96% 96%
GPT-5 mini llm 100% 88% 72% 28%
PPO Dense (2M) rl 100% 72% 0% 0%
Qwen3.5-4B llm 12% 8% 0% 4%
PPO Dense (500k) rl 72% 0% 0% 0%
Visit all real GOAL objects within the step budget (optimal path × budget multiplier).
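
The step budget named in the description above (optimal path × budget multiplier) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the inputs are made up, and rounding the product up to a whole step is an assumption, since the source does not say how fractional budgets are handled.

```python
import math


def step_budget(optimal_path_length: int, budget_multiplier: float) -> int:
    """Step budget = optimal path length x budget multiplier.

    Assumption: a fractional product is rounded up to the next whole step.
    """
    return math.ceil(optimal_path_length * budget_multiplier)


def within_budget(steps_taken: int, optimal_path_length: int,
                  budget_multiplier: float) -> bool:
    """True if the episode used no more steps than the budget allows."""
    return steps_taken <= step_budget(optimal_path_length, budget_multiplier)


# Hypothetical values: an optimal path of 10 steps and a multiplier of 1.5.
print(step_budget(10, 1.5))
print(within_budget(16, 10, 1.5))
```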

TimingChallenge-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 96% 100%
GPT-5 mini llm 36% 24% 36% 36%
PPO Dense (2M) rl 52% 32% 44% 36%
Qwen3.5-4B llm 4% 4% 0% 0%
PPO Dense (500k) rl 56% 40% 48% 28%
Cross the patrol zone without collision, then reach GOAL.

BacktrackPuzzle-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
GPT-5 mini llm 100% 60% 20% 4%
PPO Dense (2M) rl 96% 52% 0% 0%
Qwen3.5-4B llm 88% 32% 20% 24%
PPO Dense (500k) rl 100% 8% 0% 0%
Activate the correct SWITCH to open the gate, then backtrack to reach GOAL.

KeyDoorPuzzle-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 88% 76%
GPT-5 mini llm 100% 44% 4% 0%
PPO Dense (2M) rl 84% 0% 0% 0%
Qwen3.5-4B llm 60% 44% 0% 0%
PPO Dense (500k) rl 0% 0% 0% 0%
Reach GOAL after unlocking ALL doors with matching keys.

PackingPuzzle-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 96% 96% 96% 96%
GPT-5 mini llm 0% 0% 0% 0%
PPO Dense (2M) rl 64% 0% 0% 0%
Qwen3.5-4B llm 0% 0% 0% 0%
PPO Dense (500k) rl 0% 0% 0% 0%
Push each piece onto its matching-type target slot.

PreciseNavigation-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 56% 32%
GPT-5 mini llm 88% 12% 12% 12%
PPO Dense (2M) rl 64% 0% 8% 4%
Qwen3.5-4B llm 92% 88% 36% 40%
PPO Dense (500k) rl 88% 4% 4% 0%
Slide across ice to reach the GOAL by planning trajectories through stopping points.

RecipeAssembly-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
GPT-5 mini llm 24% 0% 0% 0%
PPO Dense (2M) rl 88% 0% 0% 0%
Qwen3.5-4B llm 4% 0% 0% 0%
PPO Dense (500k) rl 0% 0% 0% 0%
Collect and deliver all ingredients in recipe order to the crafting station.

ResourceManagement-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 96%
GPT-5 mini llm 100% 100% 12% 0%
PPO Dense (2M) rl 100% 100% 100% 100%
Qwen3.5-4B llm 100% 100% 0% 0%
PPO Dense (500k) rl 100% 100% 100% 100%
Keep ALL stations above 0 energy for the entire episode (survive max_steps).

SokobanPush-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 84% 88% 36%
GPT-5 mini llm 12% 0% 0% 0%
PPO Dense (2M) rl 92% 4% 0% 0%
Qwen3.5-4B llm 20% 4% 0% 0%
PPO Dense (500k) rl 36% 0% 0% 0%
Push all BOX objects onto matching TARGET positions.

TileSorting-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
GPT-5 mini llm 88% 0% 12% 0%
PPO Dense (2M) rl 100% 96% 0% 0%
Qwen3.5-4B llm 100% 72% 0% 0%
PPO Dense (500k) rl 92% 84% 56% 0%
Arrange the tiles into the goal configuration (1, 2, 3, ..., N-1 in row-major order).

ToolUse-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
GPT-5 mini llm 100% 100% 100% 100%
PPO Dense (2M) rl 100% 100% 96% 0%
Qwen3.5-4B llm 36% 52% 84% 32%
PPO Dense (500k) rl 56% 52% 100% 0%
Collect all SCROLLs to spawn the ORB, pick up the ORB, cross the river, and reach the GOAL for full reward (1.0).

DeceptiveReward-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
GPT-5 mini llm 100% 100% 100% 84%
PPO Dense (2M) rl 100% 100% 100% 100%
Qwen3.5-4B llm 88% 52% 20% 16%
PPO Dense (500k) rl 100% 100% 100% 100%
Resist the coin reward gradient.

GraphColoring-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
GPT-5 mini llm 0% 0% 0% 0%
PPO Dense (2M) rl 32% 0% 0% 0%
Qwen3.5-4B llm 16% 4% 4% 0%
PPO Dense (500k) rl 0% 0% 0% 0%
Color all nodes so no two adjacent nodes share the same color.
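
The success condition described above can be checked with a short sketch. The graph representation (a node list, an edge list, and a node-to-color dictionary) and the function name are assumptions made for illustration; the source does not describe how the environment encodes the graph.

```python
from typing import Dict, Hashable, Iterable, Tuple


def is_valid_coloring(nodes: Iterable[Hashable],
                      edges: Iterable[Tuple[Hashable, Hashable]],
                      colors: Dict[Hashable, str]) -> bool:
    """True if every node is colored and no edge joins two same-colored nodes."""
    if any(node not in colors for node in nodes):
        return False  # at least one node is still uncolored
    return all(colors[u] != colors[v] for u, v in edges)


# Hypothetical triangle graph: three mutually adjacent nodes need three colors.
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("a", "c")]
print(is_valid_coloring(nodes, edges, {"a": "red", "b": "green", "c": "blue"}))
print(is_valid_coloring(nodes, edges, {"a": "red", "b": "green", "c": "red"}))
```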

LightsOut-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 88% 92% 100%
GPT-5 mini llm 4% 0% 0% 0%
PPO Dense (2M) rl 100% 12% 0% 0%
Qwen3.5-4B llm 36% 4% 0% 0%
PPO Dense (500k) rl 56% 0% 0% 0%
Turn all lights OFF by toggling switches.

ProgramSynthesis-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
GPT-5 mini llm 0% 0% 0% 0%
PPO Dense (2M) rl 4% 0% 0% 0%
Qwen3.5-4B llm 8% 0% 0% 0%
PPO Dense (500k) rl 4% 0% 0% 0%
Push all GEM objects so they form the same relative pattern as the reference SCROLLs.

RuleInduction-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 88% 96% 100%
GPT-5 mini llm 0% 0% 0% 0%
PPO Dense (2M) rl 0% 0% 0% 0%
Qwen3.5-4B llm 28% 20% 0% 0%
PPO Dense (500k) rl 0% 0% 0% 0%
Identify the real switches via the ICE pattern, INTERACT with all real ones, then pass the barrier to reach GOAL.

SwitchCircuit-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 80% 80% 80% 76%
GPT-5 mini llm 4% 0% 0% 0%
PPO Dense (2M) rl 24% 4% 0% 0%
Qwen3.5-4B llm 0% 0% 0% 0%
PPO Dense (500k) rl 4% 0% 4% 4%
Plan switch activation order to open all barriers blocking the path to GOAL.

SymbolMatching-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
GPT-5 mini llm 0% 0% 0% 0%
PPO Dense (2M) rl 32% 4% 0% 0%
Qwen3.5-4B llm 48% 0% 0% 0%
PPO Dense (500k) rl 16% 0% 0% 0%
Pick up each symbol item and deliver it to the matching target of the same type on the right side of the grid.

TaskInterference-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
GPT-5 mini llm 16% 8% 4% 0%
PPO Dense (2M) rl 0% 0% 0% 0%
Qwen3.5-4B llm 52% 0% 0% 0%
PPO Dense (500k) rl 0% 0% 0% 0%
Raise both GEM and ORB meters to >= threshold simultaneously.

What are models saying about Agentick?