Submit Your Agent
Evaluate your agent on all Agentick tasks and submit your results to the leaderboard.
1. Run the evaluation
Evaluate your agent on every task at all four difficulty levels, using the official eval seeds (25 seeds per task-difficulty pair). Save the results as JSON files in the following layout:
```
results/my_agent/
  per_task/
    GoToGoal-v0/
      easy.json
      medium.json
      hard.json
      expert.json
    MazeNavigation-v0/
      easy.json
      ...
    ... (all tasks)
```
Each difficulty JSON must contain:
```json
{
  "task_name": "GoToGoal-v0",
  "difficulty": "easy",
  "seeds": [123, 456, ...],            // 25 official eval seeds
  "episode_returns": [1.0, 0.8, ...],  // one float per seed
  "success_flags": [true, false, ...], // one bool per seed
  "episode_steps": [42, 100, ...]      // optional
}
```
The official eval seeds are generated deterministically. You can obtain them in Python:
```python
from agentick.leaderboard.seeds import get_eval_seeds

seeds = get_eval_seeds("GoToGoal-v0", "easy")  # returns tuple of 25 ints
```
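The per-difficulty schema above can be produced with a small helper. The function below is an illustrative sketch, not part of the official tooling; it only encodes the layout and field names described on this page:

```python
import json
from pathlib import Path


def save_difficulty_results(root, task_name, difficulty,
                            seeds, episode_returns, success_flags,
                            episode_steps=None):
    """Write one <difficulty>.json file in the required submission layout.

    Hypothetical helper: the name and signature are ours, but the output
    path (results/<agent>/per_task/<task>/<difficulty>.json) and the JSON
    keys follow the submission format exactly.
    """
    # One episode per official eval seed, and 25 seeds per task-difficulty.
    assert len(seeds) == len(episode_returns) == len(success_flags) == 25, \
        "expected exactly 25 episodes, one per official eval seed"
    record = {
        "task_name": task_name,
        "difficulty": difficulty,
        "seeds": list(seeds),
        "episode_returns": [float(r) for r in episode_returns],
        "success_flags": [bool(s) for s in success_flags],
    }
    if episode_steps is not None:  # optional field
        record["episode_steps"] = [int(n) for n in episode_steps]
    out = Path(root) / "per_task" / task_name / f"{difficulty}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(record, indent=2))
    return out
```

In an evaluation loop you would call this once per (task, difficulty) after running your agent on the 25 seeds from `get_eval_seeds`.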
2. Validate your results
Run the validation script to check completeness and correctness:
```bash
python scripts/validate_submission.py results/my_agent/ \
    --agent-name "My Agent Name"
```
This checks that every task is present with all four difficulties, that the seeds match the official eval seeds, and that episode counts are correct (25 per task-difficulty pair); it then computes your Agentick score. If everything passes, it packages your results into a submission zip file.
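If you want a quick local sanity pass before invoking the official validator, something like the sketch below works. `precheck` is a hypothetical convenience function of ours, not part of `scripts/validate_submission.py`; in practice you would pass `get_eval_seeds` as the seed source:

```python
import json
from pathlib import Path


def precheck(results_root, expected_seeds_fn):
    """Return a list of problems found in a results directory.

    expected_seeds_fn(task_name, difficulty) should return the official
    eval seeds, e.g. agentick.leaderboard.seeds.get_eval_seeds.
    """
    problems = []
    for path in sorted(Path(results_root).glob("per_task/*/*.json")):
        data = json.loads(path.read_text())
        task, diff = data["task_name"], data["difficulty"]
        # Seeds must match the official eval seeds exactly.
        if tuple(data["seeds"]) != tuple(expected_seeds_fn(task, diff)):
            problems.append(f"{path}: seeds do not match official eval seeds")
        # One return and one success flag per seed, 25 episodes total.
        n = len(data["seeds"])
        for key in ("episode_returns", "success_flags"):
            if len(data[key]) != n:
                problems.append(f"{path}: {key} length != number of seeds")
        if n != 25:
            problems.append(f"{path}: expected 25 episodes, got {n}")
    return problems
```

An empty list means the layout passed this light check; the official validator remains the source of truth.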
3. Email your submission
Send the zip file to:
roger.creus-castanyer@mila.quebec
Use the subject line: [Agentick Submission] Your Agent Name
In the email body, include:
- Agent name
- Your name / affiliation
- Brief description of the agent
- Agent type: rl, llm, vlm, hybrid, human, or other
- Observation mode: ascii, language, language_structured, rgb_array, or state_dict
- Harness preset (if using the agentick agent harness)
- Model name (e.g., gpt-4o, claude-3.5-sonnet, custom-cnn)
- Open weights: yes or no
Scoring methodology
The Agentick score is computed as the mean of per-category success rates (equally weighted across 6 categories: navigation, planning, reasoning, memory, generalization, multi-agent). This ensures each capability contributes equally regardless of the number of tasks it contains. A 95% bootstrap confidence interval is computed over the category means.
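Under one plausible reading of this methodology (resampling episodes within each category; the official script may use a different resampling scheme), the score and interval can be sketched as:

```python
import random
from statistics import mean


def agentick_score(category_success, n_boot=1000, seed=0):
    """Mean of per-category success rates, plus a 95% bootstrap CI.

    category_success: dict mapping category name -> list of per-episode
    success flags, pooled over that category's tasks. Each category
    contributes equally regardless of how many episodes it contains.
    """
    rng = random.Random(seed)
    cats = sorted(category_success)
    # Point estimate: unweighted mean of the six category success rates.
    score = mean(mean(map(float, category_success[c])) for c in cats)
    # Bootstrap: resample episodes within each category, re-average.
    boots = []
    for _ in range(n_boot):
        rates = []
        for c in cats:
            flags = category_success[c]
            sample = [flags[rng.randrange(len(flags))] for _ in flags]
            rates.append(mean(map(float, sample)))
        boots.append(mean(rates))
    boots.sort()
    lo = boots[int(0.025 * n_boot)]
    hi = boots[int(0.975 * n_boot)]
    return score, (lo, hi)
```

For example, one category at 100% success and one at 0% yields a score of 0.5 no matter how many episodes each contains.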
Important notes
- You must use the official eval seeds — results with non-matching seeds will be rejected.
- Each (task, difficulty) must have exactly 25 episodes (one per seed).
- Submissions are verified by the Agentick team before appearing on the leaderboard.
- Your agent may be re-evaluated for reproducibility checking.