Introducing Agentick: A Universal Benchmark for AI Agents
Procedurally generated tasks. Multi-modal observations. Every agent type. One benchmark.

Agentick is an open-source benchmark for evaluating AI agents across the core challenges of sequential decision-making. It supports RL agents, LLM agents, VLM agents, hybrid systems, hand-written bots, planners, and even human play - all through a standard Gymnasium interface.
📊 The Leaderboard is Live
See how current agents compare — and submit your own results.
The Gap
The AI agents space has split into two worlds. RL researchers build agents that learn from scratch through environment interaction - PPO, DQN, SAC - but these agents are sample-inefficient, task-specific, and struggle to scale. Meanwhile, foundation model researchers prompt GPT-4, Gemini, or open-source LLMs to act as agents, leveraging internet-scale knowledge for zero-shot reasoning - but these models weren't trained for control and haven't learned from their own experience in interactive environments.
These two paradigms occupy opposite ends of a spectrum, and between them lies a rich design space of hybrid approaches: fine-tuned LLMs, RL post-training of foundation models, FM-guided reward shaping, and more.
The problem? There's no unified benchmark where you can meaningfully compare agents across this entire spectrum. RL benchmarks use state or pixel observations. LLM benchmarks use text. They test different capabilities with different metrics. You can't put a PPO agent and GPT-4 side by side and ask: which one is actually better at planning?
Agentick fills this gap.
Capability Decomposition: What Should a General Agent Master?
Rather than producing a single aggregate score, Agentick decomposes evaluation into six capability axes - the core properties we believe a general autonomous agent needs to master:
| Capability | What it tests | # Tasks |
|---|---|---|
| Navigation | Spatial reasoning, pathfinding, reactive control | 8 |
| Planning | Multi-step lookahead, constraint satisfaction, backtracking | 9 |
| Reasoning | Logical inference, causal reasoning, abstraction | 8 |
| Memory | Information retention, temporal integration, partial observability | 4 |
| Generalization | Distribution shift, few-shot adaptation, noise robustness | 3 |
| Multi-Agent | Cooperation, competition, emergent strategy | 5 |
This decomposition lets you build capability radar charts showing exactly where an agent excels and where it falls short. An RL agent might dominate navigation but fail at reasoning. An LLM might ace planning but struggle with memory across long episodes. These profiles are far more informative than a single number.
Check out the live leaderboard to see how current agents compare across these axes.
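As a sketch of how such a capability profile might be assembled, per-task normalized scores can be averaged within each axis. The task-to-category mapping and the scores below are hypothetical illustrations, not Agentick's actual aggregation code:

```python
# Sketch: aggregate per-task normalized scores into a capability profile.
# Task names and scores are hypothetical illustrations.
from collections import defaultdict

def capability_profile(task_scores, task_categories):
    """Average per-task normalized scores within each capability axis."""
    sums, counts = defaultdict(float), defaultdict(int)
    for task, score in task_scores.items():
        cat = task_categories[task]
        sums[cat] += score
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

scores = {"MazeNavigation-v0": 0.9, "GoToGoal-v0": 0.7, "SokobanPush-v0": 0.2}
categories = {"MazeNavigation-v0": "navigation",
              "GoToGoal-v0": "navigation",
              "SokobanPush-v0": "planning"}
profile = capability_profile(scores, categories)
# averages: navigation ≈ 0.8, planning ≈ 0.2
```

The resulting dict is exactly what you would feed a radar-chart plotting routine, one spoke per axis.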
One Benchmark, Every Agent
The key design choice that makes this work: multiple observation modes for every task. The same underlying environment state is rendered in whatever format your agent needs:
ASCII: a token-efficient grid representation. An LLM can parse this in a few tokens and reason about spatial relationships:
#####
#@..#
#.#.#
#..G#
#####
Legend: @=agent G=goal #=wall .=empty
Language: verbose descriptions with spatial context, configurable verbosity and perspective:
You are at position (1,1) facing north in a 5x5 room.
A goal is visible to the southeast at distance 3.
Walls to the north and west. Path clear to south and east.
Valid actions: move_down (1), move_right (3)
Pixels: 512x512 sprite-based isometric rendering. Rich visual observations for VLM agents and human evaluation.
Semantic: parsed semantic fields (position, surroundings, valid actions, inventory), useful for LLM agents that prefer structured input over free text:
{"description": "You are at (1,1) facing north...",
 "position": {"x": 1, "y": 1},
 "orientation": "north",
 "surroundings": {"north": "wall", "east": "empty"},
 "valid_actions": ["move_down", "move_right"],
 "inventory": [], "energy": 1.0, "step_count": 0}
State: raw numpy arrays of the full grid layers (terrain, objects, agents, metadata), for programmatic agents, planners, and MLP-based RL via the built-in FlattenObservationWrapper:
{"grid": {"terrain": [[1,1,...], ...], "objects": [...]},
 "agent": {"position": [1,1], "orientation": "north"},
 "info": {"step_count": 0, "valid_actions": [1, 3]}}
This means you can take the exact same task - say, SokobanPush - and evaluate a PPO agent on pixel observations, GPT-4 on ASCII text, a fine-tuned Qwen-VL on isometric renders, and a hand-coded BFS planner on the state dict. Same task, same seeds, same metrics. Fair comparison.
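A quick illustration of why the ASCII mode is convenient for programmatic agents: the grid can be parsed into coordinates in a few lines. The grid below mirrors the example above; the helper itself is hypothetical, not part of Agentick:

```python
def parse_ascii_grid(text):
    """Map each cell symbol to the list of (row, col) coordinates where it appears."""
    cells = {}
    for r, row in enumerate(text.strip().splitlines()):
        for c, ch in enumerate(row):
            cells.setdefault(ch, []).append((r, c))
    return cells

grid = """\
#####
#@..#
#.#.#
#..G#
#####"""
cells = parse_ascii_grid(grid)
agent = cells["@"][0]   # (1, 1)
goal = cells["G"][0]    # (3, 3)
```

The same trick works as a preprocessing step inside an LLM harness, e.g. to compute distances before prompting.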
Not Just Eval - Training-First Design
Most benchmarks are eval-only: you bring a pre-trained agent and measure its performance. Agentick is designed for the full pipeline - training, data collection, fine-tuning, and evaluation.
Train RL agents directly:
from stable_baselines3 import PPO
import agentick
env = agentick.make("MazeNavigation-v0", render_mode="rgb_array")
model = PPO("CnnPolicy", env, verbose=1)
model.learn(total_timesteps=500_000)
Collect expert trajectories from oracle policies:
from agentick.oracles import get_oracle
from agentick.data.collector import DataCollector
env = agentick.make("SokobanPush-v0", render_mode="language")
oracle = get_oracle("SokobanPush-v0", env)
collector = DataCollector(env, oracle, record_modalities=["language"])
dataset = collector.collect(num_episodes=1000, seeds=range(1000))
dataset.export_to_huggingface("data/sokoban_expert/", format="conversation")
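The "conversation" export format presumably pairs each observation with the oracle's action as a chat turn. A rough sketch of what such a conversion could look like — the field names and message schema here are assumptions, not Agentick's actual export format:

```python
def steps_to_conversation(steps, system_prompt):
    """Turn a list of (observation, action) steps into one chat-style SFT sample.
    Schema is an assumed illustration of a conversation-format export."""
    messages = [{"role": "system", "content": system_prompt}]
    for obs, action in steps:
        messages.append({"role": "user", "content": obs})
        messages.append({"role": "assistant", "content": str(action)})
    return {"messages": messages}

sample = steps_to_conversation(
    [("You are at (1,1). Goal visible to the southeast.", 4)],
    "You are an AI agent playing grid-world tasks.",
)
```

Samples in this shape drop directly into most open-source SFT trainers that accept chat-formatted data.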
Fine-tune LLMs on expert demonstrations:
Oracle policies are provided for all 37 tasks. Generate your own trajectories, or grab one of our pre-built datasets on HuggingFace:
| Dataset | Episodes | Notes |
|---|---|---|
| agentick-oracle-trajectories-120k | 120K | Good starting point |
| agentick-oracle-trajectories-250k | 250K | Broader coverage |
| agentick-oracle-trajectories-500k | 500K | Full scale |
Each dataset includes per-step oracle actions, ASCII and language observations, rewards, done flags, and step info across all 37 tasks and difficulty levels. Load them directly with the HuggingFace datasets library and SFT your favorite open-source model.
This training-first design means Agentick isn't just measuring where agents are today - it's infrastructure for making them better.
Coding API: Write Agents in Python
Every environment exposes a Coding API - a programmatic interface with spatial queries, pathfinding, entity lookups, and high-level action primitives. It's designed for hand-coded bots, code-generating LLMs, and anyone who wants to write agent logic in Python rather than training a model.
from agentick.coding_api import AgentickAPI
env = agentick.make("KeyDoorPuzzle-v0", difficulty="medium")
api = AgentickAPI(env)
obs, info = env.reset(seed=42)
api.update(obs, info)
# Spatial queries
api.agent_position # (3, 5)
api.get_nearest("key") # EntityInfo(type="key", position=(7, 2), distance=8)
api.get_entities_of_type("door") # [EntityInfo(...), ...]
api.is_walkable(4, 3) # True
api.is_reachable(7, 2) # True
# BFS pathfinding — returns action sequences
actions = api.path_to(7, 2) # [1, 3, 3, 1, 3, ...]
actions = api.go_to_nearest("key") # pathfind to closest key
actions = api.flee_from(5, 5) # single action moving away
# Execute
for action in actions:
    obs, reward, done, trunc, info = env.step(action)
    api.update(obs, info)
    if done or trunc:
        break
The API also exposes grid inspection (get_cell, get_object, get_walls, get_walkable_cells), inventory management (has_in_inventory), and interaction primitives (interact_with).
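Under the hood, path_to is presumably a breadth-first search over walkable cells that records the move taken into each cell. A minimal standalone version of that idea — using move labels instead of Agentick's integer action codes, so the sketch stays self-contained:

```python
from collections import deque

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def path_to(start, goal, walkable):
    """BFS shortest path over walkable cells; returns the move labels along it.
    (The real api.path_to returns integer action codes.)"""
    prev = {start: None}          # cell -> (previous cell, move that entered it)
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while prev[cell] is not None:   # walk back to the start
                cell, move = prev[cell]
                path.append(move)
            return path[::-1]
        for label, (dr, dc) in MOVES.items():
            nxt = (cell[0] + dr, cell[1] + dc)
            if nxt in walkable and nxt not in prev:
                prev[nxt] = (cell, label)
                queue.append(nxt)
    return None  # goal unreachable

# Walkable cells of the 5x5 maze from the ASCII example earlier.
walkable = {(1, 1), (1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2), (3, 3)}
actions = path_to((1, 1), (3, 3), walkable)  # a 4-move shortest path
```

Mapping labels back to a given task's action indices is then a one-line lookup table.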
This is how the oracle policies for all 37 tasks were built - coded up through this API by a frontier coding LLM with iteration and refinement. Those oracles then generate the expert trajectory datasets linked above, closing the loop from code → trajectories → SFT.
LLM Agent Harnesses
When evaluating LLMs as agents, how you prompt matters as much as which model you use. Agentick ships with built-in harness presets that control the prompting strategy:
Markovian: each step is independent — the model sees only the current observation with no history. Fast, token-efficient, and memoryless.
System prompt:
You are an AI agent playing grid-world tasks in the Agentick benchmark.
Your goal is to navigate the grid and complete the task objective.
## Action Space
0: NOOP 1: MOVE_UP 2: MOVE_DOWN 3: MOVE_LEFT 4: MOVE_RIGHT 5: INTERACT
## Task Objective
Navigate the maze to reach the GOAL exit. Collect keys to open doors.
Respond with ONLY the action number, nothing else.
Observation → Model → Response:
User: You are at (3,2) facing south. Key visible to the east at distance 2.
Walls to the north and west. Valid actions: move_down, move_right
Model: 4
Markovian reasoner: the same single-step view, but the model reasons before acting. Trades tokens for better decisions on tasks that require planning or inference.
System prompt (appended):
IMPORTANT: Before choosing an action, reason step-by-step but be
CONCISE (2-4 sentences max):
1. What do you observe? What is your goal?
2. Which action best advances you toward the goal?
3. Output your final answer on the LAST line as: ACTION: <number>
Observation → Model → Response:
User: You are at (3,2) facing south. Key visible to the east at distance 2.
Walls to the north and west. Valid actions: move_down, move_right
Model: I see a key to my east at distance 2. I need to collect it to unlock
the door blocking the maze exit. Moving right gets me closer.
ACTION: 4
Both harnesses support any observation mode (ASCII, language, pixels for VLMs) and any backend (OpenAI, Gemini, HuggingFace, vLLM). See the LLM/VLM agents docs for the full setup.
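On the harness side, the model's reply has to be mapped back to an action index. A tolerant parser covering both response formats might look like this — a hypothetical helper, not Agentick's actual parsing code:

```python
import re

def parse_action(response, num_actions=6):
    """Extract an action index from a bare-number reply or a final 'ACTION: <n>' line."""
    matches = re.findall(r"ACTION:\s*(\d+)", response)
    text = matches[-1] if matches else response.strip()  # prompt says the LAST line wins
    try:
        action = int(text)
    except ValueError:
        return None  # unparseable reply; caller can fall back to NOOP
    return action if 0 <= action < num_actions else None

parse_action("4")                            # → 4
parse_action("Moving right...\nACTION: 4")   # → 4
parse_action("I am not sure.")               # → None
```

Returning None for out-of-range or unparseable replies lets the harness log a failure instead of crashing the episode.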
Running Experiments
Agentick includes an experiment runner for reproducible evaluation. Define your setup in YAML and run:
# eval_gpt4_navigation.yaml
name: gpt4-navigation-eval
agent:
  type: llm
  hyperparameters:
    backend: openai
    model: gpt-4o
  harness: markovian_reasoner
  observation_modes: [ascii]
tasks: navigation  # or "full", "planning", ["GoToGoal-v0", ...]
difficulties: [easy, medium, hard, expert]
n_seeds: 25
split: eval
uv run python -m agentick.experiments.run --config eval_gpt4_navigation.yaml
The runner handles seed generation, episode management, crash-safe checkpointing, metric computation, and cost tracking for API-based agents. Results include per-task success rates, normalized scores, and capability breakdowns.
For RL training, use standard libraries directly — Agentick environments are Gymnasium-compatible, so SB3, CleanRL, and any Gym-compatible framework work out of the box. For fine-tuning LLMs on oracle trajectories, see the SFT pipeline docs.
The Tasks
37 tasks, each procedurally generated with 4 difficulty levels (easy → expert). Every run produces a unique layout, so agents can't memorize solutions.
Easy tasks use 7x7 grids with simple mechanics. Expert tasks use grids of 15-20 cells per side with multiple interacting systems, decoys, and tight step budgets. This controlled scaling lets you see exactly where an agent's capabilities break down.
Leaderboard and Reproducibility
Agentick includes a deterministic seed system for reproducible evaluation. Seeds are derived from a SHA-256 hash of "{task}::{difficulty}::eval", so every submission runs on the exact same 25 episodes per task-difficulty pair (37 tasks × 4 difficulties × 25 seeds = 3,700 total episodes).
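The post doesn't spell out the exact derivation, but a SHA-256-based scheme along these lines reproduces the key property — every submission gets identical seeds. The truncation to 64 bits and the offset-by-index details below are assumptions:

```python
import hashlib

def eval_seeds(task, difficulty, n_seeds=25):
    """Derive a deterministic list of episode seeds from the eval key.
    The exact derivation is an assumption; only the hashed key format
    ("{task}::{difficulty}::eval") comes from the benchmark description."""
    key = f"{task}::{difficulty}::eval"
    digest = hashlib.sha256(key.encode()).digest()
    base = int.from_bytes(digest[:8], "big")  # truncate digest to a 64-bit base seed
    return [(base + i) % (2**31) for i in range(n_seeds)]

seeds = eval_seeds("SokobanPush-v0", "hard")
assert seeds == eval_seeds("SokobanPush-v0", "hard")  # identical on every run
```

Because the seeds depend only on the string key, any two submissions — or any two machines — replay the same episodes.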
Scoring is normalized against random and oracle baselines:
score = (agent_return - random) / (oracle - random)
Per-category scores break down into the six capability axes. Submit your results to appear on the leaderboard and see how your agent's capability profile compares.
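In code, the normalization above and its one edge case (an oracle no better than random) are straightforward:

```python
def normalized_score(agent_return, random_return, oracle_return):
    """Oracle-normalized score: 0 = random baseline, 1 = oracle performance."""
    gap = oracle_return - random_return
    if gap == 0:
        return 0.0  # degenerate task: no headroom over random
    return (agent_return - random_return) / gap

normalized_score(7.5, 0.0, 10.0)  # → 0.75
```

Scores above 1.0 (beating the oracle) and below 0.0 (worse than random) remain possible and meaningful.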
Preliminary Results
We ran initial evaluations to understand where current agents stand. The results are early — we're still collecting data — but they already reveal striking patterns.
Frontier LLMs on Hard Tasks
We evaluated three frontier LLMs on hard difficulty across navigation, planning, and reasoning using ASCII observations and a chain-of-thought reasoning harness.
Full Benchmark Overview
For agents with complete evaluations across all tasks and difficulties, we can compute Oracle-Normalized Scores (ONS).
Observation Mode & Reasoning Harness: Qwen Model Family
How do observation mode (ASCII vs. language) and the reasoning harness affect performance across model scales? We evaluated the full Qwen model family — Qwen3-4B and Qwen3.5 at 0.8B, 2B, and 4B parameters — with every combination.
The pattern is striking: the reasoning harness consistently multiplies performance by 3-10x across every model and observation mode. ASCII observations outperform language descriptions, and Qwen3.5-4B with ASCII + Reasoner achieves the highest score (22.8%) among all local models tested.
Category Breakdown
ONS across all six capability categories for the three fully-evaluated agents.
What's Next
- More evaluations — open-source models are being evaluated across the full benchmark
- Fine-tuning — SFT datasets available on HuggingFace (120k, 250k, 500k episodes). RL post-training support coming soon.
- VLM evaluation — pixel observation benchmarks for vision-language models
- Better RL baselines — longer training, curriculum learning, multi-task agents
- New tasks — community contributions welcome
Help us populate the leaderboard! If you have API access to frontier models and want to contribute, see the submission instructions.
Get Started
git clone https://github.com/roger-creus/agentick.git && cd agentick
uv sync --extra all
import agentick
env = agentick.make("GoToGoal-v0", difficulty="easy")
obs, info = env.reset(seed=42)
for _ in range(100):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        break
env.close()
uv run agentick webapp # Play tasks yourself in the browser
uv run agentick list-tasks # See all 37 tasks
Browse the documentation, explore the task catalog, or check out the example configs to get running.
Built for the Research Community
Agentick is MIT-licensed and designed for open research. Whether you're training RL agents, evaluating foundation models, building hybrid systems, or studying what makes agents "general" - we built this to be the common ground where all of that work can be measured and compared.
We'd love to see:
- Leaderboard submissions from open-source and commercial models
- New training recipes - RL post-training, SFT, behavior cloning, curriculum learning
- Analysis of capability profiles across agent paradigms
- Community contributions - new tasks, observation modes, agent harnesses
The benchmark, documentation, examples, and leaderboard are all live. Give it a try and let us know what you find.