Agentick Leaderboard

Universal benchmark for evaluating AI agents.

ONS (Oracle-Normalized Score) = (agent − random) / (oracle − random), where 0.0 = random baseline and 1.0 = oracle upper bound.
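The normalization above can be sketched in a few lines of Python. This is a minimal illustration of the stated formula only; the function name `oracle_normalized_score` and the zero-gap guard are assumptions, not part of the Agentick codebase, and the sketch says nothing about how per-task scores are aggregated into the leaderboard numbers.

```python
def oracle_normalized_score(agent: float, random: float, oracle: float) -> float:
    """Return (agent - random) / (oracle - random).

    0.0 corresponds to the random baseline and 1.0 to the oracle upper bound.
    The argument names mirror the terms in the formula; they are raw scores
    on whatever scale the task reports.
    """
    gap = oracle - random
    if gap == 0:
        # Assumption: a task where oracle and random tie cannot be normalized.
        raise ValueError("oracle and random scores must differ")
    return (agent - random) / gap


if __name__ == "__main__":
    # An agent halfway between the random baseline and the oracle.
    print(oracle_normalized_score(agent=0.5, random=0.0, oracle=1.0))
```

Scores below the random baseline come out negative under this formula, and scores above the oracle exceed 1.0; whether the leaderboard clips such values is not stated.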

Oracle-Normalized Score (ONS)

Category ONS Breakdown

Rank Agent Type Modality Harness Score 95% CI Open Date
1 Oracle Agent other - - 0.895 0.811–0.969 No 2026-03-17
2 Qwen3.5-4B (SFT-250k) llm ascii markovian_zero_shot 0.447 0.380–0.518 Yes 2026-05-11
3 Qwen3.5-4B (SFT-250k) llm ascii markovian_reasoner 0.444 0.372–0.518 Yes 2026-05-11
4 Qwen3.5-4B (SFT-120k) llm ascii markovian_zero_shot 0.354 0.294–0.408 Yes 2026-05-11
5 Qwen3.5-4B (SFT-120k) llm ascii markovian_reasoner 0.349 0.290–0.402 Yes 2026-05-11
6 GPT-5 mini llm ascii MarkovianReasoner 0.309 0.000–0.000 No 2026-03-20
7 PPO Dense (2M) rl rgb_array - 0.287 0.212–0.367 Yes 2026-03-22
8 Qwen3.5-4B llm ascii markovian_reasoner 0.228 0.161–0.292 Yes 2026-03-25
9 PPO Dense (500k) rl rgb_array - 0.226 0.166–0.287 Yes 2026-03-20
10 Gemini 2.5 Flash Lite llm ascii markovian_reasoner 0.187 0.126–0.247 No 2026-03-17
11 Qwen3.5-4B llm language markovian_reasoner 0.181 0.085–0.272 Yes 2026-03-25
12 Qwen3.5-2B llm ascii markovian_reasoner 0.133 0.069–0.195 Yes 2026-03-25
13 Qwen3.5-2B llm language markovian_reasoner 0.122 0.056–0.183 Yes 2026-03-25
14 Qwen3.5-0.8B llm ascii markovian_reasoner 0.094 0.048–0.140 Yes 2026-03-22
15 Qwen3-4B llm ascii markovian_reasoner 0.085 0.000–0.000 Yes 2026-03-21
16 Random Agent other - - 0.082 0.031–0.130 No 2026-03-17
17 PPO Sparse (500k) rl rgb_array - 0.074 0.051–0.097 Yes 2026-03-20
18 Gemini 2.5 Flash Lite llm language markovian_zero_shot 0.064 0.025–0.102 No 2026-03-17
19 Qwen3.5-2B llm ascii markovian_zero_shot 0.062 0.032–0.093 Yes 2026-03-22
20 Qwen3.5-0.8B llm language markovian_reasoner 0.061 0.021–0.100 Yes 2026-03-22
21 Gemini 2.5 Flash Lite llm ascii markovian_zero_shot 0.053 0.022–0.080 No 2026-03-17
22 Qwen3-4B llm language markovian_reasoner 0.050 0.000–0.000 Yes 2026-03-21
23 Qwen3.5-2B llm language markovian_zero_shot 0.031 0.012–0.050 Yes 2026-03-22
24 Qwen3.5-4B llm ascii markovian_zero_shot 0.023 0.007–0.042 Yes 2026-03-22
25 Qwen3.5-4B llm language markovian_zero_shot 0.020 0.006–0.038 Yes 2026-03-22
26 Qwen3.5-0.8B llm ascii markovian_zero_shot 0.020 0.006–0.035 Yes 2026-03-25
27 Qwen3-4B llm ascii markovian_zero_shot 0.020 0.000–0.000 Yes 2026-03-21
28 Qwen3-4B llm language markovian_zero_shot 0.019 0.000–0.000 Yes 2026-03-21
29 Qwen3.5-0.8B llm language markovian_zero_shot 0.016 0.002–0.035 Yes 2026-03-22
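The page does not say how the 95% CI column is computed. As one common way to obtain such an interval, the sketch below uses a percentile bootstrap over per-episode scores. The function name `bootstrap_ci`, the resample count, and the choice of the percentile method are all assumptions made for illustration, not the leaderboard's documented procedure.

```python
import random
from statistics import mean


def bootstrap_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `samples`.

    Hypothetical helper: resamples the data with replacement, records the
    mean of each resample, and reads off the alpha/2 and 1 - alpha/2
    percentiles of those means.
    """
    rng = random.Random(seed)  # seeded so the result is reproducible
    n = len(samples)
    means = sorted(
        mean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Made-up per-episode scores, purely for demonstration.
    episode_scores = [0.0, 1.0, 1.0, 0.0, 0.5, 1.0, 0.0, 0.0, 1.0, 0.5]
    print(bootstrap_ci(episode_scores))
```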

Generalization ONS

Rank Agent Type Modality Harness Score
1 Oracle Agent other - - 0.837
2 Qwen3.5-4B (SFT-250k) llm ascii markovian_zero_shot 0.350
3 Qwen3.5-4B (SFT-250k) llm ascii markovian_reasoner 0.337
4 Qwen3.5-4B (SFT-120k) llm ascii markovian_zero_shot 0.357
5 Qwen3.5-4B (SFT-120k) llm ascii markovian_reasoner 0.347
6 GPT-5 mini llm ascii MarkovianReasoner 0.437
7 PPO Dense (2M) rl rgb_array - 0.163
8 Qwen3.5-4B llm ascii markovian_reasoner 0.327
9 PPO Dense (500k) rl rgb_array - 0.130
10 Gemini 2.5 Flash Lite llm ascii markovian_reasoner 0.287

Memory ONS

Rank Agent Type Modality Harness Score
1 Oracle Agent other - - 0.980
2 Qwen3.5-4B (SFT-250k) llm ascii markovian_zero_shot 0.412
3 Qwen3.5-4B (SFT-250k) llm ascii markovian_reasoner 0.422
4 Qwen3.5-4B (SFT-120k) llm ascii markovian_zero_shot 0.287
5 Qwen3.5-4B (SFT-120k) llm ascii markovian_reasoner 0.292
6 GPT-5 mini llm ascii MarkovianReasoner 0.347
7 PPO Dense (2M) rl rgb_array - 0.282
8 Qwen3.5-4B llm ascii markovian_reasoner 0.247
9 PPO Dense (500k) rl rgb_array - 0.228
10 Gemini 2.5 Flash Lite llm ascii markovian_reasoner 0.163

Multi Agent ONS

Rank Agent Type Modality Harness Score
1 Oracle Agent other - - 0.692
2 Qwen3.5-4B (SFT-250k) llm ascii markovian_zero_shot 0.348
3 Qwen3.5-4B (SFT-250k) llm ascii markovian_reasoner 0.338
4 Qwen3.5-4B (SFT-120k) llm ascii markovian_zero_shot 0.246
5 Qwen3.5-4B (SFT-120k) llm ascii markovian_reasoner 0.238
6 GPT-5 mini llm ascii MarkovianReasoner 0.150
7 PPO Dense (2M) rl rgb_array - 0.432
8 Qwen3.5-4B llm ascii markovian_reasoner 0.134
9 PPO Dense (500k) rl rgb_array - 0.352
10 Gemini 2.5 Flash Lite llm ascii markovian_reasoner 0.098

Navigation ONS

Rank Agent Type Modality Harness Score
1 Oracle Agent other - - 0.975
2 Qwen3.5-4B (SFT-250k) llm ascii markovian_zero_shot 0.568
3 Qwen3.5-4B (SFT-250k) llm ascii markovian_reasoner 0.545
4 Qwen3.5-4B (SFT-120k) llm ascii markovian_zero_shot 0.427
5 Qwen3.5-4B (SFT-120k) llm ascii markovian_reasoner 0.425
6 GPT-5 mini llm ascii MarkovianReasoner 0.456
7 PPO Dense (2M) rl rgb_array - 0.250
8 Qwen3.5-4B llm ascii markovian_reasoner 0.223
9 PPO Dense (500k) rl rgb_array - 0.193
10 Gemini 2.5 Flash Lite llm ascii markovian_reasoner 0.237

Planning ONS

Rank Agent Type Modality Harness Score
1 Oracle Agent other - - 0.928
2 Qwen3.5-4B (SFT-250k) llm ascii markovian_zero_shot 0.557
3 Qwen3.5-4B (SFT-250k) llm ascii markovian_reasoner 0.573
4 Qwen3.5-4B (SFT-120k) llm ascii markovian_zero_shot 0.459
5 Qwen3.5-4B (SFT-120k) llm ascii markovian_reasoner 0.450
6 GPT-5 mini llm ascii MarkovianReasoner 0.334
7 PPO Dense (2M) rl rgb_array - 0.402
8 Qwen3.5-4B llm ascii markovian_reasoner 0.313
9 PPO Dense (500k) rl rgb_array - 0.300
10 Gemini 2.5 Flash Lite llm ascii markovian_reasoner 0.249

Reasoning ONS

Rank Agent Type Modality Harness Score
1 Oracle Agent other - - 0.961
2 Qwen3.5-4B (SFT-250k) llm ascii markovian_zero_shot 0.445
3 Qwen3.5-4B (SFT-250k) llm ascii markovian_reasoner 0.446
4 Qwen3.5-4B (SFT-120k) llm ascii markovian_zero_shot 0.350
5 Qwen3.5-4B (SFT-120k) llm ascii markovian_reasoner 0.341
6 GPT-5 mini llm ascii MarkovianReasoner 0.131
7 PPO Dense (2M) rl rgb_array - 0.191
8 Qwen3.5-4B llm ascii markovian_reasoner 0.124
9 PPO Dense (500k) rl rgb_array - 0.152
10 Gemini 2.5 Flash Lite llm ascii markovian_reasoner 0.090

DistributionShift-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 48%
Qwen3.5-4B (SFT-250k) llm 8% 4% 0% 0%
Qwen3.5-4B (SFT-250k) llm 8% 0% 0% 0%
Qwen3.5-4B (SFT-120k) llm 0% 0% 0% 0%
Qwen3.5-4B (SFT-120k) llm 4% 0% 0% 0%
Reach all 3 goals across shifting maze phases.

FewShotAdaptation-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 92% 80% 72% 80%
Qwen3.5-4B (SFT-250k) llm 40% 20% 16% 20%
Qwen3.5-4B (SFT-250k) llm 24% 20% 12% 28%
Qwen3.5-4B (SFT-120k) llm 28% 32% 20% 24%
Qwen3.5-4B (SFT-120k) llm 36% 28% 8% 24%
Watch demo trials to infer the hidden rule, then navigate to the correct candidate object in the test trial.

NoisyObservation-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 80% 84% 68%
Qwen3.5-4B (SFT-250k) llm 92% 72% 84% 64%
Qwen3.5-4B (SFT-250k) llm 92% 72% 88% 60%
Qwen3.5-4B (SFT-120k) llm 96% 80% 92% 56%
Qwen3.5-4B (SFT-120k) llm 92% 80% 88% 56%
Locate and reach the true GOAL amid visual noise.

DelayedGratification-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
Qwen3.5-4B (SFT-250k) llm 96% 84% 64% 100%
Qwen3.5-4B (SFT-250k) llm 96% 80% 56% 100%
Qwen3.5-4B (SFT-120k) llm 76% 60% 32% 56%
Qwen3.5-4B (SFT-120k) llm 76% 56% 36% 48%
Reach the distant true GOAL without collecting any decoy KEY.

FogOfWarExploration-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 92% 76%
Qwen3.5-4B (SFT-250k) llm 96% 76% 52% 44%
Qwen3.5-4B (SFT-250k) llm 100% 72% 60% 56%
Qwen3.5-4B (SFT-120k) llm 72% 72% 36% 24%
Qwen3.5-4B (SFT-120k) llm 72% 76% 40% 24%
Find and reach the GOAL despite incomplete map information.

SequenceMemory-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
Qwen3.5-4B (SFT-250k) llm 24% 12% 0% 4%
Qwen3.5-4B (SFT-250k) llm 32% 16% 0% 0%
Qwen3.5-4B (SFT-120k) llm 16% 0% 4% 0%
Qwen3.5-4B (SFT-120k) llm 28% 12% 0% 0%
Memorize the shown GEM positions, then visit them in exact order during the reproduce phase.

TreasureHunt-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
Qwen3.5-4B (SFT-250k) llm 8% 0% 0% 0%
Qwen3.5-4B (SFT-250k) llm 8% 0% 0% 0%
Qwen3.5-4B (SFT-120k) llm 8% 4% 0% 0%
Qwen3.5-4B (SFT-120k) llm 0% 0% 0% 0%
Read scroll clues, triangulate hidden treasure positions, and step on each treasure cell to collect all treasures.

ChaseEvade-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 84% 84% 72% 16%
Qwen3.5-4B (SFT-250k) llm 76% 48% 16% 16%
Qwen3.5-4B (SFT-250k) llm 64% 32% 8% 12%
Qwen3.5-4B (SFT-120k) llm 32% 48% 12% 0%
Qwen3.5-4B (SFT-120k) llm 36% 40% 12% 0%
Survive the required steps without enemy collision.

CooperativeTransport-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 84% 64% 52%
Qwen3.5-4B (SFT-250k) llm 32% 0% 0% 0%
Qwen3.5-4B (SFT-250k) llm 40% 0% 0% 0%
Qwen3.5-4B (SFT-120k) llm 20% 0% 0% 0%
Qwen3.5-4B (SFT-120k) llm 16% 0% 0% 0%
Push all heavy boxes into holes with NPC cooperation.

EmergentStrategy-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 88% 92% 52% 48%
Qwen3.5-4B (SFT-250k) llm 60% 32% 16% 0%
Qwen3.5-4B (SFT-250k) llm 56% 24% 4% 0%
Qwen3.5-4B (SFT-120k) llm 52% 36% 4% 0%
Qwen3.5-4B (SFT-120k) llm 36% 40% 0% 8%
Exploit NPC behaviors to lure/scare them onto locking pressure plates, permanently opening barriers to reach the GOAL...

Herding-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 80% 76% 44% 36%
Qwen3.5-4B (SFT-250k) llm 60% 36% 4% 4%
Qwen3.5-4B (SFT-250k) llm 76% 32% 4% 12%
Qwen3.5-4B (SFT-120k) llm 20% 12% 0% 4%
Qwen3.5-4B (SFT-120k) llm 32% 0% 0% 0%
Move all SHEEP into the pen zone (TARGET cells).

TagHunt-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 96% 84% 68% 64%
Qwen3.5-4B (SFT-250k) llm 92% 96% 80% 28%
Qwen3.5-4B (SFT-250k) llm 100% 100% 84% 28%
Qwen3.5-4B (SFT-120k) llm 100% 88% 48% 16%
Qwen3.5-4B (SFT-120k) llm 100% 72% 64% 20%
Tag all NPCs by stepping onto them.

CuriosityMaze-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
Qwen3.5-4B (SFT-250k) llm 16% 0% 0% 0%
Qwen3.5-4B (SFT-250k) llm 12% 0% 0% 0%
Qwen3.5-4B (SFT-120k) llm 4% 0% 0% 0%
Qwen3.5-4B (SFT-120k) llm 0% 0% 0% 0%
Visit at least the required percentage of all reachable cells before the step budget runs out.
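The coverage criterion described above can be illustrated with a small check. This is a sketch under assumptions: cells are represented as `(row, col)` tuples, the required percentage is passed as a fraction, and the helper name `coverage_reached` is hypothetical rather than taken from the environment.

```python
def coverage_reached(visited, reachable, required_fraction):
    """Return True if enough of the reachable cells have been visited.

    visited: iterable of (row, col) cells the agent has stepped on
             (repeats are allowed and counted once).
    reachable: iterable of all cells that can be reached at all.
    required_fraction: threshold in [0, 1], e.g. 0.8 for 80%.
    """
    reachable = set(reachable)
    if not reachable:
        # Assumption: an empty map is trivially covered.
        return True
    covered = len(set(visited) & reachable)
    return covered / len(reachable) >= required_fraction


if __name__ == "__main__":
    cells = {(0, 0), (0, 1), (1, 0), (1, 1)}
    path = [(0, 0), (0, 1), (0, 0)]  # revisits (0, 0)
    print(coverage_reached(path, cells, 0.5))
    print(coverage_reached(path, cells, 0.75))
```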

DynamicObstacles-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 96% 88% 92% 72%
Qwen3.5-4B (SFT-250k) llm 92% 88% 64% 48%
Qwen3.5-4B (SFT-250k) llm 92% 92% 60% 40%
Qwen3.5-4B (SFT-120k) llm 88% 72% 64% 36%
Qwen3.5-4B (SFT-120k) llm 80% 84% 52% 36%
Reach GOAL without colliding with any NPC.

GoToGoal-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 92%
Qwen3.5-4B (SFT-250k) llm 100% 92% 76% 68%
Qwen3.5-4B (SFT-250k) llm 100% 88% 56% 44%
Qwen3.5-4B (SFT-120k) llm 100% 76% 44% 40%
Qwen3.5-4B (SFT-120k) llm 100% 76% 28% 28%
Reach the GOAL position.

InstructionFollowing-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 92%
Qwen3.5-4B (SFT-250k) llm 72% 60% 72% 20%
Qwen3.5-4B (SFT-250k) llm 76% 76% 72% 24%
Qwen3.5-4B (SFT-120k) llm 60% 56% 24% 16%
Qwen3.5-4B (SFT-120k) llm 68% 52% 20% 12%
Reach the unique target object without touching any distractor.

MazeNavigation-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
Qwen3.5-4B (SFT-250k) llm 100% 28% 24% 4%
Qwen3.5-4B (SFT-250k) llm 100% 20% 24% 8%
Qwen3.5-4B (SFT-120k) llm 100% 28% 4% 4%
Qwen3.5-4B (SFT-120k) llm 100% 20% 4% 4%
Navigate the maze to reach the GOAL exit.

RecursiveRooms-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
Qwen3.5-4B (SFT-250k) llm 80% 72% 36% 28%
Qwen3.5-4B (SFT-250k) llm 84% 68% 40% 28%
Qwen3.5-4B (SFT-120k) llm 80% 24% 8% 16%
Qwen3.5-4B (SFT-120k) llm 72% 36% 20% 24%
Navigate through nested rooms to reach GOAL in the deepest room.

ShortestPath-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 96% 96%
Qwen3.5-4B (SFT-250k) llm 72% 72% 64% 64%
Qwen3.5-4B (SFT-250k) llm 72% 76% 52% 52%
Qwen3.5-4B (SFT-120k) llm 68% 64% 24% 36%
Qwen3.5-4B (SFT-120k) llm 56% 68% 40% 40%
Visit all real GOAL objects within the step budget (optimal path × budget multiplier).
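To make the budget rule concrete, the sketch below computes an optimal path length with breadth-first search and scales it by a multiplier. It is deliberately simplified to a single goal on a small grid; the grid encoding (`#` for walls), the rounding up with `math.ceil`, and the names `bfs_distance` and `step_budget` are assumptions, not the environment's actual implementation.

```python
import math
from collections import deque


def bfs_distance(grid, start, goal):
    """Shortest number of 4-connected steps from start to goal, or None.

    grid: list of equal-length strings; '#' marks a wall (assumed encoding).
    start, goal: (row, col) tuples.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (
                0 <= nr < rows
                and 0 <= nc < cols
                and grid[nr][nc] != "#"
                and (nr, nc) not in seen
            ):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None  # goal is unreachable


def step_budget(optimal_steps, multiplier):
    """Budget = optimal path length x multiplier, rounded up (assumed)."""
    return math.ceil(optimal_steps * multiplier)


if __name__ == "__main__":
    grid = ["..#", "..#", "..."]
    optimal = bfs_distance(grid, (0, 0), (2, 2))
    print(optimal, step_budget(optimal, 1.5))
```

With several goals, the optimal path would have to account for the visiting order as well, which this single-goal sketch does not attempt.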

TimingChallenge-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 96% 100%
Qwen3.5-4B (SFT-250k) llm 96% 68% 60% 80%
Qwen3.5-4B (SFT-250k) llm 80% 68% 72% 68%
Qwen3.5-4B (SFT-120k) llm 52% 48% 64% 68%
Qwen3.5-4B (SFT-120k) llm 44% 60% 72% 64%
Cross the patrol zone without collision, then reach GOAL.

BacktrackPuzzle-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
Qwen3.5-4B (SFT-250k) llm 100% 88% 80% 40%
Qwen3.5-4B (SFT-250k) llm 100% 96% 76% 56%
Qwen3.5-4B (SFT-120k) llm 100% 84% 64% 24%
Qwen3.5-4B (SFT-120k) llm 100% 76% 68% 16%
Activate the correct SWITCH to open the gate, then backtrack to reach GOAL.

KeyDoorPuzzle-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 88% 76%
Qwen3.5-4B (SFT-250k) llm 96% 20% 0% 0%
Qwen3.5-4B (SFT-250k) llm 96% 8% 0% 0%
Qwen3.5-4B (SFT-120k) llm 76% 12% 0% 0%
Qwen3.5-4B (SFT-120k) llm 64% 0% 0% 0%
Reach GOAL after unlocking ALL doors with matching keys.

PackingPuzzle-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 96% 96% 96% 96%
Qwen3.5-4B (SFT-250k) llm 36% 20% 12% 0%
Qwen3.5-4B (SFT-250k) llm 32% 20% 24% 4%
Qwen3.5-4B (SFT-120k) llm 28% 24% 0% 0%
Qwen3.5-4B (SFT-120k) llm 12% 8% 0% 0%
Push each piece onto its matching-type target slot.

PreciseNavigation-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 56% 32%
Qwen3.5-4B (SFT-250k) llm 84% 44% 24% 8%
Qwen3.5-4B (SFT-250k) llm 92% 48% 36% 8%
Qwen3.5-4B (SFT-120k) llm 92% 32% 20% 0%
Qwen3.5-4B (SFT-120k) llm 92% 28% 12% 4%
Slide across ice to reach the GOAL by planning trajectories through stopping points.

RecipeAssembly-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
Qwen3.5-4B (SFT-250k) llm 88% 80% 52% 72%
Qwen3.5-4B (SFT-250k) llm 88% 88% 52% 60%
Qwen3.5-4B (SFT-120k) llm 56% 48% 16% 20%
Qwen3.5-4B (SFT-120k) llm 76% 60% 12% 52%
Collect and deliver all ingredients in recipe order to the crafting station.

ResourceManagement-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 96%
Qwen3.5-4B (SFT-250k) llm 100% 100% 100% 60%
Qwen3.5-4B (SFT-250k) llm 100% 100% 100% 68%
Qwen3.5-4B (SFT-120k) llm 100% 100% 92% 20%
Qwen3.5-4B (SFT-120k) llm 100% 100% 76% 8%
Keep ALL stations above 0 energy for the entire episode (survive max_steps).

SokobanPush-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 84% 88% 36%
Qwen3.5-4B (SFT-250k) llm 48% 8% 4% 0%
Qwen3.5-4B (SFT-250k) llm 48% 12% 0% 0%
Qwen3.5-4B (SFT-120k) llm 40% 0% 0% 0%
Qwen3.5-4B (SFT-120k) llm 24% 4% 0% 0%
Push all BOX objects onto matching TARGET positions.

TileSorting-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
Qwen3.5-4B (SFT-250k) llm 100% 84% 44% 12%
Qwen3.5-4B (SFT-250k) llm 100% 88% 56% 8%
Qwen3.5-4B (SFT-120k) llm 84% 64% 56% 0%
Qwen3.5-4B (SFT-120k) llm 100% 80% 48% 0%
Arrange tiles to goal configuration (1, 2, 3, ..., N-1 in row-major order).
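The goal configuration can be written out explicitly. The sketch below assumes a sliding-tile layout in which N is the number of cells, tiles 1 to N-1 fill the grid in row-major order, and the single empty cell (encoded here as `0`) sits in the last position. The blank's encoding and position, and the names `goal_configuration` and `is_solved`, are assumptions for illustration.

```python
def goal_configuration(rows, cols):
    """Build the row-major goal board: 1, 2, ..., N-1, then the blank (0)."""
    n = rows * cols
    flat = list(range(1, n)) + [0]  # 0 stands for the empty cell (assumed)
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def is_solved(board):
    """Return True if `board` (list of rows) matches the goal configuration."""
    rows, cols = len(board), len(board[0])
    return board == goal_configuration(rows, cols)


if __name__ == "__main__":
    print(goal_configuration(3, 3))
    print(is_solved([[1, 2, 3], [4, 5, 6], [7, 0, 8]]))
```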

ToolUse-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
Qwen3.5-4B (SFT-250k) llm 100% 100% 100% 100%
Qwen3.5-4B (SFT-250k) llm 100% 100% 100% 100%
Qwen3.5-4B (SFT-120k) llm 100% 100% 100% 100%
Qwen3.5-4B (SFT-120k) llm 100% 100% 100% 100%
Collect all SCROLLs to spawn the ORB, pick up the ORB, cross the river, and reach the GOAL for full reward (1.0).

DeceptiveReward-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
Qwen3.5-4B (SFT-250k) llm 100% 100% 100% 92%
Qwen3.5-4B (SFT-250k) llm 100% 100% 100% 92%
Qwen3.5-4B (SFT-120k) llm 100% 100% 72% 72%
Qwen3.5-4B (SFT-120k) llm 100% 100% 100% 0%
Resist the coin reward gradient.

GraphColoring-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
Qwen3.5-4B (SFT-250k) llm 76% 16% 0% 4%
Qwen3.5-4B (SFT-250k) llm 60% 24% 4% 0%
Qwen3.5-4B (SFT-120k) llm 24% 0% 0% 0%
Qwen3.5-4B (SFT-120k) llm 20% 8% 0% 0%
Color all nodes so no two adjacent nodes share the same color.
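The success condition is the standard proper-coloring check. The sketch below assumes the graph is given as a list of undirected edges and the coloring as a dictionary from node to color; the function name `is_valid_coloring` and this representation are illustrative choices, not the environment's interface.

```python
def is_valid_coloring(edges, colors):
    """Return True if every node on an edge is colored and no edge joins
    two nodes of the same color.

    edges: iterable of (u, v) pairs (undirected).
    colors: dict mapping node -> color; nodes missing from it are uncolored.
    """
    edges = list(edges)
    nodes = {node for edge in edges for node in edge}
    if any(node not in colors for node in nodes):
        return False  # "color all nodes" is part of the goal
    return all(colors[u] != colors[v] for u, v in edges)


if __name__ == "__main__":
    triangle = [(0, 1), (1, 2), (0, 2)]
    print(is_valid_coloring(triangle, {0: "red", 1: "green", 2: "blue"}))
    print(is_valid_coloring(triangle, {0: "red", 1: "red", 2: "blue"}))
```

Isolated nodes (those on no edge) are not visible in an edge list, so a fuller version would take the node set as a separate argument.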

LightsOut-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 88% 92% 100%
Qwen3.5-4B (SFT-250k) llm 48% 24% 72% 40%
Qwen3.5-4B (SFT-250k) llm 40% 32% 56% 56%
Qwen3.5-4B (SFT-120k) llm 36% 12% 16% 4%
Qwen3.5-4B (SFT-120k) llm 40% 4% 16% 8%
Turn all lights OFF by toggling switches.

ProgramSynthesis-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
Qwen3.5-4B (SFT-250k) llm 12% 4% 4% 0%
Qwen3.5-4B (SFT-250k) llm 8% 16% 4% 0%
Qwen3.5-4B (SFT-120k) llm 12% 0% 0% 0%
Qwen3.5-4B (SFT-120k) llm 0% 0% 0% 0%
Push all GEM objects so they form the same relative pattern as the reference SCROLLs.

RuleInduction-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 88% 96% 100%
Qwen3.5-4B (SFT-250k) llm 12% 36% 36% 40%
Qwen3.5-4B (SFT-250k) llm 24% 16% 48% 28%
Qwen3.5-4B (SFT-120k) llm 28% 24% 40% 24%
Qwen3.5-4B (SFT-120k) llm 40% 48% 44% 28%
Identify the real switches via the ICE pattern, INTERACT with all real ones, then pass the barrier to reach GOAL.

SwitchCircuit-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 80% 80% 80% 76%
Qwen3.5-4B (SFT-250k) llm 64% 40% 0% 0%
Qwen3.5-4B (SFT-250k) llm 72% 48% 0% 0%
Qwen3.5-4B (SFT-120k) llm 48% 24% 0% 0%
Qwen3.5-4B (SFT-120k) llm 52% 24% 0% 0%
Plan switch activation order to open all barriers blocking the path to GOAL.

SymbolMatching-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
Qwen3.5-4B (SFT-250k) llm 64% 20% 8% 12%
Qwen3.5-4B (SFT-250k) llm 68% 12% 8% 12%
Qwen3.5-4B (SFT-120k) llm 64% 24% 4% 4%
Qwen3.5-4B (SFT-120k) llm 60% 8% 0% 0%
Pick up each symbol item and deliver it to the matching target of the same type on the right side of the grid.

TaskInterference-v0

Agent Type Easy Medium Hard Expert
Oracle Agent other 100% 100% 100% 100%
Qwen3.5-4B (SFT-250k) llm 100% 100% 100% 100%
Qwen3.5-4B (SFT-250k) llm 100% 100% 100% 100%
Qwen3.5-4B (SFT-120k) llm 100% 100% 100% 88%
Qwen3.5-4B (SFT-120k) llm 100% 100% 100% 92%
Raise both GEM and ORB meters to >= threshold simultaneously.

What are models saying about Agentick?