Experiments
Run systematic evaluations with YAML configs or the Python API.
YAML Config
# my_experiment.yaml
name: gpt4o-language
description: "GPT-4o on navigation tasks"

agent:
  type: llm
  hyperparameters:
    backend: openai
    model: gpt-4o
    harness: markovian_zero_shot
    observation_modes: [language]
    api_key_env: OPENAI_API_KEY
    max_tokens: 100
    temperature: 0.0

tasks: [GoToGoal-v0, MazeNavigation-v0]
difficulties: [easy, medium]
split: eval                 # "eval" for benchmarking, "train" for training
n_seeds: 25                 # Seeds per (task, difficulty)
n_episodes: 1               # Episodes per seed
render_modes: [language]
record_trajectories: true
output_dir: results/gpt4o
Run it:
uv run python -m agentick.experiments.run --config my_experiment.yaml
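Before launching a run against a paid API, it can help to estimate how many episodes the config implies. The sketch below is plain arithmetic over the values in my_experiment.yaml; it assumes every (task, difficulty, seed) combination is run n_episodes times, as the config comments suggest. The helper total_episodes is hypothetical and not part of agentick.

```python
# Values copied from my_experiment.yaml.
tasks = ["GoToGoal-v0", "MazeNavigation-v0"]
difficulties = ["easy", "medium"]
n_seeds = 25      # seeds per (task, difficulty)
n_episodes = 1    # episodes per seed


def total_episodes(n_tasks, n_difficulties, n_seeds, n_episodes):
    """Hypothetical helper: episodes implied by the config grid."""
    return n_tasks * n_difficulties * n_seeds * n_episodes


total = total_episodes(len(tasks), len(difficulties), n_seeds, n_episodes)
print(total)  # 2 tasks x 2 difficulties x 25 seeds x 1 episode = 100
```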
Python API
from agentick.experiments.runner import ExperimentRunner
from agentick.experiments.config import ExperimentConfig
config = ExperimentConfig.from_yaml("my_experiment.yaml")
runner = ExperimentRunner(config)
results = runner.run()
Benchmark Suites
Set tasks to a predefined suite name instead of listing tasks individually (choose one):
tasks: "full" # All tasks
tasks: "navigation" # 8 navigation tasks
tasks: "planning" # 9 planning tasks
tasks: "reasoning" # 9 reasoning tasks
tasks: "memory" # 4 memory tasks
tasks: "generalization" # 3 generalization tasks
tasks: "multi_agent" # 5 multi-agent tasks
Seed System
Seeds are generated per task per difficulty from the split field using SHA-256 hashing. No need to list explicit seeds:
- split: eval gives 25 deterministic eval seeds per (task, difficulty). Used for benchmarking and the leaderboard.
- split: train gives 2000 train seeds per (task, difficulty). Used for RL/SFT training.
Never train on eval seeds.
from agentick.leaderboard.seeds import generate_task_seeds, get_train_seeds, get_eval_seeds
eval_seeds = get_eval_seeds("GoToGoal-v0", "medium") # 25 seeds
train_seeds = get_train_seeds("GoToGoal-v0", "medium") # 2000 seeds
seeds = generate_task_seeds("GoToGoal-v0", "medium", "eval", 10) # Custom count
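To illustrate how hashing can yield reproducible seeds without a stored list, here is a minimal sketch. It is not agentick's actual derivation: the function derive_seeds, the key layout, and the 32-bit truncation are all assumptions chosen for illustration, so its output will not match generate_task_seeds.

```python
import hashlib


def derive_seeds(task, difficulty, split, count):
    """Hypothetical sketch of SHA-256 seed derivation (not agentick's scheme).

    Each seed is the first 4 bytes of SHA-256 over a key built from the
    task, difficulty, split, and seed index, read as an unsigned integer.
    """
    seeds = []
    for index in range(count):
        key = f"{task}|{difficulty}|{split}|{index}".encode("utf-8")
        digest = hashlib.sha256(key).digest()
        seeds.append(int.from_bytes(digest[:4], "big"))
    return seeds


eval_like = derive_seeds("GoToGoal-v0", "medium", "eval", 25)
train_like = derive_seeds("GoToGoal-v0", "medium", "train", 25)
```

Because the split name is part of the hashed key, the two splits produce different seed sequences, and the same arguments always return the same list on any machine.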
7 official suites, all using per-task eval seeds:
| Suite | Tasks |
|---|---|
| agentick-full-v2 | 38 |
| agentick-navigation-v2 | 8 |
| agentick-planning-v2 | 9 |
| agentick-reasoning-v2 | 9 |
| agentick-memory-v2 | 4 |
| agentick-generalization-v2 | 3 |
| agentick-multiagent-v2 | 5 |
Scoring
Results are normalized to [0, 1]:
score = (agent_return - random_baseline) / (optimal_return - random_baseline)
Here random_baseline is the expected return of a random agent and optimal_return comes from oracle performance, so a score of 0 matches the random agent and a score of 1 matches the oracle. Scores are aggregated per capability and overall.
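The formula can be written as a small function. This is a sketch of the arithmetic only: the name normalized_score and the zero-span check are assumptions, and how agentick treats returns below the random baseline or above the oracle (for example, by clipping) is not covered here.

```python
def normalized_score(agent_return, random_baseline, optimal_return):
    """Hypothetical helper applying the normalization formula."""
    span = optimal_return - random_baseline
    if span == 0:
        # Assumption: a task where random and oracle tie cannot be normalized.
        raise ValueError("optimal_return must differ from random_baseline")
    return (agent_return - random_baseline) / span


# Illustrative numbers, not measured results.
print(normalized_score(5.0, 1.0, 9.0))  # (5 - 1) / (9 - 1) = 0.5
```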
Results Format
Results are saved to output_dir/{name}_{timestamp}/:
results/gpt4o_20260302_120000/
├── config.yaml          # Experiment config
├── metadata.json        # Runtime metadata (agent, platform, git hash)
├── summary.json         # Aggregate results
├── figures/             # Auto-generated plots
└── per_task/
    ├── GoToGoal-v0/
    │   ├── metrics.json
    │   └── episodes/
    │       └── diff_easy_seed_0_ep_0.json
    └── MazeNavigation-v0/
        ├── metrics.json
        └── episodes/
metadata.json includes: agentick_version, python_version, platform, git_hash, agent_name, agent_type, model, backend, observation_modes, harness.
per_task/{task}/metrics.json contains per-difficulty episode data and aggregate metrics (mean_return, success_rate, mean_length).
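Given the directory layout above, per-task metrics can be collected with the standard library alone. The sketch below assumes only that each per_task/{task}/metrics.json is valid JSON; the function load_task_metrics is hypothetical, and the internal structure of each file is left unparsed because it is not specified here.

```python
import json
from pathlib import Path


def load_task_metrics(run_dir):
    """Hypothetical helper: map task name -> parsed metrics.json content."""
    metrics = {}
    # Each task has its own folder under per_task/ named after the task ID.
    for path in sorted(Path(run_dir).glob("per_task/*/metrics.json")):
        with path.open(encoding="utf-8") as handle:
            metrics[path.parent.name] = json.load(handle)
    return metrics
```

Calling load_task_metrics("results/gpt4o_20260302_120000") would return a dictionary keyed by task name, such as GoToGoal-v0 and MazeNavigation-v0, from which the aggregate fields can be read.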
Example Configs
Pre-built configs in examples/experiments/configs/:
# Run a predefined config
uv run python -m agentick.experiments.run --config examples/experiments/configs/random_agent.yaml