Submit Your Agent

Evaluate your agent on all Agentick tasks and submit your results to the leaderboard.

1. Run the evaluation

Run your agent on every task at all 4 difficulty levels (easy, medium, hard, expert), using the official eval seeds (25 seeds per task-difficulty pair). Save the results as JSON files in the following layout:

results/my_agent/
  per_task/
    GoToGoal-v0/
      easy.json
      medium.json
      hard.json
      expert.json
    MazeNavigation-v0/
      easy.json
      ...
    ... (all tasks)

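Each difficulty JSON must contain the fields below. The // comments are annotations for this page only; standard JSON does not allow comments, so leave them out of your files: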

{
  "task_name": "GoToGoal-v0",
  "difficulty": "easy",
  "seeds": [123, 456, ...],           // 25 official eval seeds
  "episode_returns": [1.0, 0.8, ...], // one float per seed
  "success_flags": [true, false, ...], // one bool per seed
  "episode_steps": [42, 100, ...]      // optional
}

The official eval seeds are generated deterministically. You can obtain them in Python:

from agentick.leaderboard.seeds import get_eval_seeds

seeds = get_eval_seeds("GoToGoal-v0", "easy")  # returns tuple of 25 ints
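As a minimal sketch of how one result file could be written in the layout above: the helper name `write_result` and its signature are hypothetical and not part of Agentick. It takes the seeds as an argument, so in practice you would pass in the tuple returned by `get_eval_seeds`.

```python
import json
from pathlib import Path

# Difficulty names taken from the directory layout shown above.
DIFFICULTIES = ("easy", "medium", "hard", "expert")


def write_result(root, task_name, difficulty, seeds,
                 episode_returns, success_flags, episode_steps=None):
    """Write one per-task, per-difficulty JSON file and return its path.

    Hypothetical helper: not part of the Agentick package.
    """
    if difficulty not in DIFFICULTIES:
        raise ValueError(f"unknown difficulty: {difficulty!r}")
    if not (len(seeds) == len(episode_returns) == len(success_flags)):
        raise ValueError("seeds, episode_returns and success_flags must have equal length")

    record = {
        "task_name": task_name,
        "difficulty": difficulty,
        "seeds": [int(s) for s in seeds],
        "episode_returns": [float(r) for r in episode_returns],
        "success_flags": [bool(f) for f in success_flags],
    }
    # episode_steps is optional, so only emit it when provided.
    if episode_steps is not None:
        record["episode_steps"] = [int(n) for n in episode_steps]

    path = Path(root) / "per_task" / task_name / f"{difficulty}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, indent=2))
    return path
```

Calling `write_result("results/my_agent", "GoToGoal-v0", "easy", seeds, returns, flags)` once per task and difficulty would produce the tree shown in step 1.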

2. Validate your results

Run the validation script to check completeness and correctness:

python scripts/validate_submission.py results/my_agent/ \
    --agent-name "My Agent Name"

The script checks that every task and all 4 difficulty levels are present, that the seeds match the official eval seeds, and that episode counts are correct (25 per task-difficulty). It then computes your Agentick score. If every check passes, it packages your results into a submission zip file.
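Before running the official script, a quick local pre-check on a single loaded result file can catch the most common mistakes. The sketch below is not the logic of `validate_submission.py`; the function name `check_result` is hypothetical, and it assumes the seeds must appear in the same order as the official list.

```python
REQUIRED_KEYS = ("task_name", "difficulty", "seeds",
                 "episode_returns", "success_flags")
EPISODES_PER_FILE = 25  # 25 official eval seeds per task-difficulty


def check_result(record, expected_seeds):
    """Return a list of problems found in one result dict (empty means OK).

    Hypothetical pre-check, not the official validator.
    """
    missing = [k for k in REQUIRED_KEYS if k not in record]
    if missing:
        return [f"missing keys: {missing}"]

    problems = []
    # Assumption: seed order must match the official order exactly.
    if list(record["seeds"]) != list(expected_seeds):
        problems.append("seeds do not match the official eval seeds")
    for key in ("seeds", "episode_returns", "success_flags"):
        if len(record[key]) != EPISODES_PER_FILE:
            problems.append(
                f"{key} has {len(record[key])} entries, expected {EPISODES_PER_FILE}"
            )
    # episode_steps is optional, but if present it should align with the seeds.
    if "episode_steps" in record and len(record["episode_steps"]) != len(record["seeds"]):
        problems.append("episode_steps length differs from seeds length")
    return problems
```

An empty list from `check_result` only means the file looks structurally sound; the official validation script remains the authority on whether a submission is accepted.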

3. Email your submission

Send the zip file to:

roger.creus-castanyer@mila.quebec

Use the subject line: [Agentick Submission] Your Agent Name

In the email body, include:

Scoring methodology

The Agentick score is computed as the mean of per-category success rates (equally weighted across 6 categories: navigation, planning, reasoning, memory, generalization, multi-agent). This ensures each capability contributes equally regardless of the number of tasks it contains. A 95% bootstrap confidence interval is computed over the category means.
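To make the equal weighting concrete, the following sketch computes a score as the unweighted mean of per-category success rates and a percentile bootstrap interval over those category means. It is an illustration under stated assumptions, not the official scoring code: the function names, the pooling of success flags within a category, and the resampling details (number of resamples, percentile method) are all assumptions.

```python
import random
from statistics import mean


def category_success_rates(success_by_category):
    """Map each category to the fraction of successful episodes.

    Assumption: all episodes in a category are pooled before averaging.
    """
    return {
        category: mean([1.0 if flag else 0.0 for flag in flags])
        for category, flags in success_by_category.items()
    }


def equal_weight_score(success_by_category):
    """Unweighted mean of per-category success rates."""
    rates = category_success_rates(success_by_category)
    return mean(list(rates.values()))


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`.

    Assumption: resampling is done over the category means themselves.
    """
    rng = random.Random(seed)
    values = list(values)
    means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Toy data: the second category has twice as many episodes as the first.
toy = {
    "navigation": [True, True, False, False],
    "planning": [True] * 8,
}
score = equal_weight_score(toy)  # (0.5 + 1.0) / 2 = 0.75
```

In the toy data, pooling all 12 episodes would give 10/12, whereas the equal-weight score is 0.75: the larger category does not dominate, which is the property the scoring methodology describes.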

Important notes
