Introducing Agentick: A Universal Benchmark for AI Agents
Procedurally generated tasks. Multi-modal observations. Every agent type. One benchmark.

Agentick is an open-source benchmark for evaluating AI agents across the core challenges of sequential decision-making. It supports RL agents, LLM agents, VLM agents, hybrid systems, hand-written bots, planners, and even human play - all through a standard Gymnasium interface.
📊 The Leaderboard is Live
See how current agents compare — and submit your own results.
The Gap
The AI agents space has split into two worlds. RL researchers build agents that learn from scratch through environment interaction - PPO, DQN, SAC - but these agents are sample-inefficient, task-specific, and struggle to scale. Meanwhile, foundation model researchers prompt GPT-4, Gemini, or open-source LLMs to act as agents, leveraging internet-scale knowledge for zero-shot reasoning - but these models weren't trained for control and haven't learned from their own experience in interactive environments.
These two paradigms occupy opposite ends of a spectrum, and between them lies a rich design space of hybrid approaches: fine-tuned LLMs, RL post-training of foundation models, FM-guided reward shaping, and more.
The problem? There's no unified benchmark where you can meaningfully compare agents across this entire spectrum. RL benchmarks use state or pixel observations. LLM benchmarks use text. They test different capabilities with different metrics. You can't put a PPO agent and GPT-4 side by side and ask: which one is actually better at planning?
Agentick fills this gap.
Capability Decomposition: What Should a General Agent Master?
Rather than producing a single aggregate score, Agentick decomposes evaluation into six capability axes - the core properties we believe a general autonomous agent needs to master:
| Capability | What it tests | # Tasks |
|---|---|---|
| Navigation | Spatial reasoning, pathfinding, reactive control | 8 |
| Planning | Multi-step lookahead, constraint satisfaction, backtracking | 9 |
| Reasoning | Logical inference, causal reasoning, abstraction | 8 |
| Memory | Information retention, temporal integration, partial observability | 4 |
| Generalization | Distribution shift, few-shot adaptation, noise robustness | 3 |
| Multi-Agent | Cooperation, competition, emergent strategy | 5 |
This decomposition lets you build capability radar charts showing exactly where an agent excels and where it falls short. An RL agent might dominate navigation but fail at reasoning. An LLM might ace planning but struggle with memory across long episodes. These profiles are far more informative than a single number.
Check out the live leaderboard to see how current agents compare across these axes.
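As a sketch of how such a capability profile might be assembled, per-task normalized scores can be averaged within each axis. The task-to-category mapping and the scores below are hypothetical illustrations, not Agentick's actual aggregation code:

```python
# Sketch: aggregate per-task normalized scores into a capability profile.
# Task names and scores are hypothetical illustrations.
from collections import defaultdict

def capability_profile(task_scores, task_categories):
    """Average per-task normalized scores within each capability axis."""
    sums, counts = defaultdict(float), defaultdict(int)
    for task, score in task_scores.items():
        cat = task_categories[task]
        sums[cat] += score
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

scores = {"MazeNavigation-v0": 0.9, "GoToGoal-v0": 0.7, "SokobanPush-v0": 0.2}
categories = {"MazeNavigation-v0": "navigation",
              "GoToGoal-v0": "navigation",
              "SokobanPush-v0": "planning"}
profile = capability_profile(scores, categories)
# averages: navigation ≈ 0.8, planning ≈ 0.2
```

The resulting dict is exactly what you would feed a radar-chart plotting routine, one spoke per axis.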
One Benchmark, Every Agent
The key design choice that makes this work: multiple observation modes for every task. The same underlying environment state is rendered in whatever format your agent needs:
ASCII: a token-efficient grid representation. An LLM can parse this in a few tokens and reason about spatial relationships:
#####
#@..#
#.#.#
#..G#
#####
Legend: @=agent G=goal #=wall .=empty
Language: verbose descriptions with spatial context, configurable verbosity and perspective:
You are at position (1,1) facing north in a 5x5 room.
A goal is visible to the southeast at distance 3.
Walls to the north and west. Path clear to south and east.
Valid actions: move_down (1), move_right (3)
Pixels: 512x512 sprite-based isometric rendering. Rich visual observations for VLM agents and human evaluation.
Semantic: parsed semantic fields (position, surroundings, valid actions, inventory), useful for LLM agents that prefer structured input over free text:
{"description": "You are at (1,1) facing north...",
 "position": {"x": 1, "y": 1},
 "orientation": "north",
 "surroundings": {"north": "wall", "east": "empty"},
 "valid_actions": ["move_down", "move_right"],
 "inventory": [], "energy": 1.0, "step_count": 0}
State: raw numpy arrays of the full grid layers (terrain, objects, agents, metadata), for programmatic agents, planners, and MLP-based RL via the built-in FlattenObservationWrapper:
{"grid": {"terrain": [[1,1,...], ...], "objects": [...]},
 "agent": {"position": [1,1], "orientation": "north"},
 "info": {"step_count": 0, "valid_actions": [1, 3]}}
This means you can take the exact same task - say, SokobanPush - and evaluate a PPO agent on pixel observations, GPT-4 on ASCII text, a fine-tuned Qwen-VL on isometric renders, and a hand-coded BFS planner on the state dict. Same task, same seeds, same metrics. Fair comparison.
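A quick illustration of why the ASCII mode is convenient for programmatic agents: the grid can be parsed into coordinates in a few lines. The grid below mirrors the example above; the helper itself is hypothetical, not part of Agentick:

```python
def parse_ascii_grid(text):
    """Map each cell symbol to the list of (row, col) coordinates where it appears."""
    cells = {}
    for r, row in enumerate(text.strip().splitlines()):
        for c, ch in enumerate(row):
            cells.setdefault(ch, []).append((r, c))
    return cells

grid = """\
#####
#@..#
#.#.#
#..G#
#####"""
cells = parse_ascii_grid(grid)
agent = cells["@"][0]   # (1, 1)
goal = cells["G"][0]    # (3, 3)
```

The same trick works as a preprocessing step inside an LLM harness, e.g. to compute distances before prompting.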
Not Just Eval - Training-First Design
Most benchmarks are eval-only: you bring a pre-trained agent and measure its performance. Agentick is designed for the full pipeline - training, data collection, fine-tuning, and evaluation.
Train RL agents directly:
from stable_baselines3 import PPO
import agentick
env = agentick.make("MazeNavigation-v0", render_mode="rgb_array")
model = PPO("CnnPolicy", env, verbose=1)
model.learn(total_timesteps=500_000)
Collect expert trajectories from oracle policies:
from agentick.oracles import get_oracle
from agentick.data.collector import DataCollector
env = agentick.make("SokobanPush-v0", render_mode="language")
oracle = get_oracle("SokobanPush-v0", env)
collector = DataCollector(env, oracle, record_modalities=["language"])
dataset = collector.collect(num_episodes=1000, seeds=range(1000))
dataset.export_to_huggingface("data/sokoban_expert/", format="conversation")
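The "conversation" export format presumably pairs each observation with the oracle's action as a chat turn. A rough sketch of what such a conversion could look like — the field names and message schema here are assumptions, not Agentick's actual export format:

```python
def steps_to_conversation(steps, system_prompt):
    """Turn a list of (observation, action) steps into one chat-style SFT sample.
    Schema is an assumed illustration of a conversation-format export."""
    messages = [{"role": "system", "content": system_prompt}]
    for obs, action in steps:
        messages.append({"role": "user", "content": obs})
        messages.append({"role": "assistant", "content": str(action)})
    return {"messages": messages}

sample = steps_to_conversation(
    [("You are at (1,1). Goal visible to the southeast.", 4)],
    "You are an AI agent playing grid-world tasks.",
)
```

Samples in this shape drop directly into most open-source SFT trainers that accept chat-formatted data.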
Fine-tune LLMs on expert demonstrations:
Oracle policies are provided for all 37 tasks. Generate your own trajectories, or grab one of our pre-built datasets on HuggingFace:
| Dataset | Episodes | Notes |
|---|---|---|
| agentick-oracle-trajectories-120k | 120K | Good starting point |
| agentick-oracle-trajectories-250k | 250K | Broader coverage |
| agentick-oracle-trajectories-500k | 500K | Full scale |
Each dataset includes per-step oracle actions, ASCII and language observations, rewards, done flags, and step info across all 37 tasks and difficulty levels. Load them directly with the HuggingFace datasets library and SFT your favorite open-source model.
This training-first design means Agentick isn't just measuring where agents are today - it's infrastructure for making them better.
Coding API: Write Agents in Python
Every environment exposes a Coding API - a programmatic interface with spatial queries, pathfinding, entity lookups, and high-level action primitives. It's designed for hand-coded bots, code-generating LLMs, and anyone who wants to write agent logic in Python rather than training a model.
from agentick.coding_api import AgentickAPI
env = agentick.make("KeyDoorPuzzle-v0", difficulty="medium")
api = AgentickAPI(env)
obs, info = env.reset(seed=42)
api.update(obs, info)
# Spatial queries
api.agent_position # (3, 5)
api.get_nearest("key") # EntityInfo(type="key", position=(7, 2), distance=8)
api.get_entities_of_type("door") # [EntityInfo(...), ...]
api.is_walkable(4, 3) # True
api.is_reachable(7, 2) # True
# BFS pathfinding — returns action sequences
actions = api.path_to(7, 2) # [1, 3, 3, 1, 3, ...]
actions = api.go_to_nearest("key") # pathfind to closest key
actions = api.flee_from(5, 5) # single action moving away
# Execute
for action in actions:
    obs, reward, done, trunc, info = env.step(action)
    api.update(obs, info)
    if done or trunc:
        break
The API also exposes grid inspection (get_cell, get_object, get_walls, get_walkable_cells), inventory management (has_in_inventory), and interaction primitives (interact_with).
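Under the hood, path_to is presumably a breadth-first search over walkable cells that records the move taken into each cell. A minimal standalone version of that idea — using move labels instead of Agentick's integer action codes, so the sketch stays self-contained:

```python
from collections import deque

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def path_to(start, goal, walkable):
    """BFS shortest path over walkable cells; returns the move labels along it.
    (The real api.path_to returns integer action codes.)"""
    prev = {start: None}          # cell -> (previous cell, move that entered it)
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while prev[cell] is not None:   # walk back to the start
                cell, move = prev[cell]
                path.append(move)
            return path[::-1]
        for label, (dr, dc) in MOVES.items():
            nxt = (cell[0] + dr, cell[1] + dc)
            if nxt in walkable and nxt not in prev:
                prev[nxt] = (cell, label)
                queue.append(nxt)
    return None  # goal unreachable

# Walkable cells of the 5x5 maze from the ASCII example earlier.
walkable = {(1, 1), (1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2), (3, 3)}
actions = path_to((1, 1), (3, 3), walkable)  # a 4-move shortest path
```

Mapping labels back to a given task's action indices is then a one-line lookup table.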
This is how the oracle policies for all 37 tasks were built - coded up through this API by a frontier coding LLM with iteration and refinement. Those oracles then generate the expert trajectory datasets linked above, closing the loop from code → trajectories → SFT.
LLM Agent Harnesses
When evaluating LLMs as agents, how you prompt matters as much as which model you use. Agentick ships with built-in harness presets that control the prompting strategy:
Markovian: each step is independent — the model sees only the current observation with no history. Fast, token-efficient, and memoryless.
System prompt:
You are an AI agent playing grid-world tasks in the Agentick benchmark.
Your goal is to navigate the grid and complete the task objective.
## Action Space
0: NOOP 1: MOVE_UP 2: MOVE_DOWN 3: MOVE_LEFT 4: MOVE_RIGHT 5: INTERACT
## Task Objective
Navigate the maze to reach the GOAL exit. Collect keys to open doors.
Respond with ONLY the action number, nothing else.
Observation → Model → Response:
User: You are at (3,2) facing south. Key visible to the east at distance 2.
Walls to the north and west. Valid actions: move_down, move_right
Model: 4
Markovian reasoner: the same single-step view, but the model reasons before acting. Trades tokens for better decisions on tasks that require planning or inference.
System prompt (appended):
IMPORTANT: Before choosing an action, reason step-by-step but be
CONCISE (2-4 sentences max):
1. What do you observe? What is your goal?
2. Which action best advances you toward the goal?
3. Output your final answer on the LAST line as: ACTION: <number>
Observation → Model → Response:
User: You are at (3,2) facing south. Key visible to the east at distance 2.
Walls to the north and west. Valid actions: move_down, move_right
Model: I see a key to my east at distance 2. I need to collect it to unlock
the door blocking the maze exit. Moving right gets me closer.
ACTION: 4
Both harnesses support any observation mode (ASCII, language, pixels for VLMs) and any backend (OpenAI, Gemini, HuggingFace, vLLM). See the LLM/VLM agents docs for the full setup.
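On the harness side, the model's reply has to be mapped back to an action index. A tolerant parser covering both response formats might look like this — a hypothetical helper, not Agentick's actual parsing code:

```python
import re

def parse_action(response, num_actions=6):
    """Extract an action index from a bare-number reply or a final 'ACTION: <n>' line."""
    matches = re.findall(r"ACTION:\s*(\d+)", response)
    text = matches[-1] if matches else response.strip()  # prompt says the LAST line wins
    try:
        action = int(text)
    except ValueError:
        return None  # unparseable reply; caller can fall back to NOOP
    return action if 0 <= action < num_actions else None

parse_action("4")                            # → 4
parse_action("Moving right...\nACTION: 4")   # → 4
parse_action("I am not sure.")               # → None
```

Returning None for out-of-range or unparseable replies lets the harness log a failure instead of crashing the episode.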
Running Experiments
Agentick includes an experiment runner for reproducible evaluation. Define your setup in YAML and run:
# eval_gpt4_navigation.yaml
name: gpt4-navigation-eval
agent:
  type: llm
  hyperparameters:
    backend: openai
    model: gpt-4o
  harness: markovian_reasoner
  observation_modes: [ascii]
tasks: navigation  # or "full", "planning", ["GoToGoal-v0", ...]
difficulties: [easy, medium, hard, expert]
n_seeds: 25
split: eval
uv run python -m agentick.experiments.run --config eval_gpt4_navigation.yaml
The runner handles seed generation, episode management, crash-safe checkpointing, metric computation, and cost tracking for API-based agents. Results include per-task success rates, normalized scores, and capability breakdowns.
For RL training, use standard libraries directly — Agentick environments are Gymnasium-compatible, so SB3, CleanRL, and any Gym-compatible framework work out of the box. For fine-tuning LLMs on oracle trajectories, see the SFT pipeline docs.
The Tasks
37 tasks, each procedurally generated with 4 difficulty levels (easy → expert). Every run produces a unique layout, so agents can't memorize solutions.
Easy tasks use 7x7 grids with simple mechanics. Expert tasks use grids of 15-20 cells per side with multiple interacting systems, decoys, and tight step budgets. This controlled scaling lets you see exactly where an agent's capabilities break down.
Leaderboard and Reproducibility
Agentick includes a deterministic seed system for reproducible evaluation. Seeds are derived from a SHA-256 hash of "{task}::{difficulty}::eval", so every submission runs on the exact same 25 episodes per task-difficulty pair (37 tasks × 4 difficulties × 25 seeds = 3,700 total episodes).
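The post doesn't spell out the exact derivation, but a SHA-256-based scheme along these lines reproduces the key property — every submission gets identical seeds. The truncation to 64 bits and the offset-by-index details below are assumptions:

```python
import hashlib

def eval_seeds(task, difficulty, n_seeds=25):
    """Derive a deterministic list of episode seeds from the eval key.
    The exact derivation is an assumption; only the hashed key format
    ("{task}::{difficulty}::eval") comes from the benchmark description."""
    key = f"{task}::{difficulty}::eval"
    digest = hashlib.sha256(key.encode()).digest()
    base = int.from_bytes(digest[:8], "big")  # truncate digest to a 64-bit base seed
    return [(base + i) % (2**31) for i in range(n_seeds)]

seeds = eval_seeds("SokobanPush-v0", "hard")
assert seeds == eval_seeds("SokobanPush-v0", "hard")  # identical on every run
```

Because the seeds depend only on the string key, any two submissions — or any two machines — replay the same episodes.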
Scoring is normalized against random and oracle baselines:
score = (agent_return - random) / (oracle - random)
Per-category scores break down into the six capability axes. Submit your results to appear on the leaderboard and see how your agent's capability profile compares.
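In code, the normalization above and its one edge case (an oracle no better than random) are straightforward:

```python
def normalized_score(agent_return, random_return, oracle_return):
    """Oracle-normalized score: 0 = random baseline, 1 = oracle performance."""
    gap = oracle_return - random_return
    if gap == 0:
        return 0.0  # degenerate task: no headroom over random
    return (agent_return - random_return) / gap

normalized_score(7.5, 0.0, 10.0)  # → 0.75
```

Scores above 1.0 (beating the oracle) and below 0.0 (worse than random) remain possible and meaningful.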
Preliminary Results
We ran initial evaluations to understand where current agents stand. The results are early — we're still collecting data — but they already reveal striking patterns.
Frontier LLMs on Hard Tasks
We evaluated three frontier LLMs on hard difficulty across navigation, planning, and reasoning using ASCII observations and a chain-of-thought reasoning harness.
Full Benchmark Overview
For agents with complete evaluations across all tasks and difficulties, we can compute Oracle-Normalized Scores (ONS).
Observation Mode & Reasoning Harness: Qwen Model Family
How do observation mode (ASCII vs. language) and the reasoning harness affect performance across model scales? We evaluated the full Qwen model family — Qwen3-4B and Qwen3.5 at 0.8B, 2B, and 4B parameters — with every combination.
The pattern is striking: the reasoning harness consistently multiplies performance by 3-10x across every model and observation mode. ASCII observations outperform language descriptions, and Qwen3.5-4B with ASCII + Reasoner achieves the highest score (22.8%) among all local models tested.
Category Breakdown
ONS across all six capability categories for the three fully-evaluated agents.
What's Next
- More evaluations — open-source models are being evaluated across the full benchmark
- Fine-tuning — SFT datasets available on HuggingFace (120k, 250k, 500k episodes). RL post-training support coming soon.
- VLM evaluation — pixel observation benchmarks for vision-language models
- Better RL baselines — longer training, curriculum learning, multi-task agents
- New tasks — community contributions welcome
Help us populate the leaderboard! If you have API access to frontier models and want to contribute, see the submission instructions.
Get Started
git clone https://github.com/roger-creus/agentick.git && cd agentick
uv sync --extra all
import agentick
env = agentick.make("GoToGoal-v0", difficulty="easy")
obs, info = env.reset(seed=42)
for _ in range(100):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        break
env.close()
uv run agentick webapp # Play tasks yourself in the browser
uv run agentick list-tasks # See all 37 tasks
Browse the documentation, explore the task catalog, or check out the example configs to get running.
Built for the Research Community
Agentick is MIT-licensed and designed for open research. Whether you're training RL agents, evaluating foundation models, building hybrid systems, or studying what makes agents "general" - we built this to be the common ground where all of that work can be measured and compared.
We'd love to see:
- Leaderboard submissions from open-source and commercial models
- New training recipes - RL post-training, SFT, behavior cloning, curriculum learning
- Analysis of capability profiles across agent paradigms
- Community contributions - new tasks, observation modes, agent harnesses
The benchmark, documentation, examples, and leaderboard are all live. Give it a try and let us know what you find.