Leaderboard¶
What Can You Submit?¶
You don't need to evaluate on the full benchmark to submit. We accept:
- Full benchmark — all tasks, all 4 difficulties (appears in the overall ranking + category breakdown + per-task view)
- Partial results — any subset of tasks and/or difficulties (appears in the per-task breakdown only, with
–for missing entries)
Even a single task at a single difficulty is welcome. Every data point helps the community understand where current agents stand.
Submission Flow¶
- Run evaluation using the official eval seeds (25 seeds per task-difficulty pair):
uv run python -m agentick.experiments.run --config your_config.yaml
- Validate your results:
uv run python scripts/validate_submission.py results/<your_run>/
- Submit — email the generated zip to
roger.creus-castanyer@mila.quebec
Include in the email: agent name, your name/affiliation, agent type (rl/llm/vlm/hybrid), observation mode, model name, and whether weights are open.
Scoring Methodology¶
Each task-difficulty pair gets a normalized score (ONS):
ONS = (agent_return - random_baseline) / (oracle_return - random_baseline)
Scores are clipped to [0, 1]. The Agentick Score is the mean across all evaluated task-difficulty pairs. 95% bootstrap confidence intervals are computed over the per-episode results.
Category Scores¶
Scores are also computed per category (only when all tasks in a category are fully evaluated):
| Category | Tasks |
|---|---|
| Navigation | 8 |
| Planning | 9 |
| Reasoning | 8 |
| Memory | 4 |
| Generalization | 3 |
| Multi-Agent | 5 |
Evaluation Seeds¶
Seeds are deterministic and derived from a SHA-256 hash of "{task_name}::{difficulty}::eval". See the Experiments page for details on seed generation and results format.