Submit Your Agent
Evaluate your agent on all Agentick tasks and submit your results to the leaderboard.
1. Run the evaluation
Evaluate your agent on every task at all four difficulty levels, using the official eval seeds (25 seeds per task-difficulty pair). Save the results as JSON files in the following layout:
```
results/my_agent/
  per_task/
    GoToGoal-v0/
      easy.json
      medium.json
      hard.json
      expert.json
    MazeNavigation-v0/
      easy.json
      ...
    ... (all tasks)
```
Each difficulty JSON must contain:
```json
{
  "task_name": "GoToGoal-v0",
  "difficulty": "easy",
  "seeds": [123, 456, ...],            // 25 official eval seeds
  "episode_returns": [1.0, 0.8, ...],  // one float per seed
  "success_flags": [true, false, ...], // one bool per seed
  "episode_steps": [42, 100, ...]      // optional
}
```
The official eval seeds are generated deterministically. You can obtain them in Python:
```python
from agentick.leaderboard.seeds import get_eval_seeds

seeds = get_eval_seeds("GoToGoal-v0", "easy")  # returns tuple of 25 ints
```
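The per-difficulty schema above can be produced with a small helper. The function below is an illustrative sketch, not part of the official tooling; it only encodes the layout and field names described on this page:

```python
import json
from pathlib import Path


def save_difficulty_results(root, task_name, difficulty,
                            seeds, episode_returns, success_flags,
                            episode_steps=None):
    """Write one <difficulty>.json file in the required submission layout.

    Hypothetical helper: the name and signature are ours, but the output
    path (results/<agent>/per_task/<task>/<difficulty>.json) and the JSON
    keys follow the submission format exactly.
    """
    # One episode per official eval seed, and 25 seeds per task-difficulty.
    assert len(seeds) == len(episode_returns) == len(success_flags) == 25, \
        "expected exactly 25 episodes, one per official eval seed"
    record = {
        "task_name": task_name,
        "difficulty": difficulty,
        "seeds": list(seeds),
        "episode_returns": [float(r) for r in episode_returns],
        "success_flags": [bool(s) for s in success_flags],
    }
    if episode_steps is not None:  # optional field
        record["episode_steps"] = [int(n) for n in episode_steps]
    out = Path(root) / "per_task" / task_name / f"{difficulty}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(record, indent=2))
    return out
```

In an evaluation loop you would call this once per (task, difficulty) after running your agent on the 25 seeds from `get_eval_seeds`.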
2. Validate your results
Run the validation script to check completeness and correctness:
```bash
python scripts/validate_submission.py results/my_agent/ \
    --agent-name "My Agent Name"
```
This checks that every task is present with all four difficulties, that the seeds match the official eval seeds, and that episode counts are correct (25 per task-difficulty pair); it then computes your Agentick score. If everything passes, it packages your results into a submission zip file.
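If you want a quick local sanity pass before invoking the official validator, something like the sketch below works. `precheck` is a hypothetical convenience function of ours, not part of `scripts/validate_submission.py`; in practice you would pass `get_eval_seeds` as the seed source:

```python
import json
from pathlib import Path


def precheck(results_root, expected_seeds_fn):
    """Return a list of problems found in a results directory.

    expected_seeds_fn(task_name, difficulty) should return the official
    eval seeds, e.g. agentick.leaderboard.seeds.get_eval_seeds.
    """
    problems = []
    for path in sorted(Path(results_root).glob("per_task/*/*.json")):
        data = json.loads(path.read_text())
        task, diff = data["task_name"], data["difficulty"]
        # Seeds must match the official eval seeds exactly.
        if tuple(data["seeds"]) != tuple(expected_seeds_fn(task, diff)):
            problems.append(f"{path}: seeds do not match official eval seeds")
        # One return and one success flag per seed, 25 episodes total.
        n = len(data["seeds"])
        for key in ("episode_returns", "success_flags"):
            if len(data[key]) != n:
                problems.append(f"{path}: {key} length != number of seeds")
        if n != 25:
            problems.append(f"{path}: expected 25 episodes, got {n}")
    return problems
```

An empty list means the layout passed this light check; the official validator remains the source of truth.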
3. Email your submission
Send the zip file to:
roger.creus-castanyer@mila.quebec
Use the subject line: [Agentick Submission] Your Agent Name
In the email body, include:
- Agent name
- Your name / affiliation
- Brief description of the agent
- Agent type: rl, llm, vlm, hybrid, human, or other
- Observation mode: ascii, language, language_structured, rgb_array, or state_dict
- Harness preset (if using the agentick agent harness)
- Model name (e.g., gpt-4o, claude-3.5-sonnet, custom-cnn)
- Open weights: yes or no
Scoring methodology
The Agentick score is computed as the mean of per-category success rates (equally weighted across 6 categories: navigation, planning, reasoning, memory, generalization, multi-agent). This ensures each capability contributes equally regardless of the number of tasks it contains. A 95% bootstrap confidence interval is computed over the category means.
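Under one plausible reading of this methodology (resampling episodes within each category; the official script may use a different resampling scheme), the score and interval can be sketched as:

```python
import random
from statistics import mean


def agentick_score(category_success, n_boot=1000, seed=0):
    """Mean of per-category success rates, plus a 95% bootstrap CI.

    category_success: dict mapping category name -> list of per-episode
    success flags, pooled over that category's tasks. Each category
    contributes equally regardless of how many episodes it contains.
    """
    rng = random.Random(seed)
    cats = sorted(category_success)
    # Point estimate: unweighted mean of the six category success rates.
    score = mean(mean(map(float, category_success[c])) for c in cats)
    # Bootstrap: resample episodes within each category, re-average.
    boots = []
    for _ in range(n_boot):
        rates = []
        for c in cats:
            flags = category_success[c]
            sample = [flags[rng.randrange(len(flags))] for _ in flags]
            rates.append(mean(map(float, sample)))
        boots.append(mean(rates))
    boots.sort()
    lo = boots[int(0.025 * n_boot)]
    hi = boots[int(0.975 * n_boot)]
    return score, (lo, hi)
```

For example, one category at 100% success and one at 0% yields a score of 0.5 no matter how many episodes each contains.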
Important notes
- You must use the official eval seeds — results with non-matching seeds will be rejected.
- Each (task, difficulty) must have exactly 25 episodes (one per seed).
- Submissions are verified by the Agentick team before appearing on the leaderboard.
- Your agent may be re-evaluated for reproducibility checking.