About me
I am a PhD student at Mila Québec & Université de Montréal since Fall 2024, working with Professors Pablo Samuel Castro and Glen Berseth. I obtained my MSc at Mila Québec & Université de Montréal in Montréal, Canada, and my BSc in Data Science and Engineering at Universitat Politècnica de Catalunya (UPC) in Barcelona, Spain.
From Barcelona, Spain
Currently in Montreal, Canada
Research
My research centers on general, autonomous agents built on Deep Reinforcement Learning (RL) and Foundation Models (LLMs, VLMs). I explore how to integrate the structured learning and adaptability of RL with the broad priors and reasoning abilities of foundation models — using them to improve exploration, credit assignment, and skill discovery. I'm particularly interested in how RL can make foundation models more agentic, unifying reasoning and control for general-purpose AI agents.
Interests
- Deep Reinforcement Learning
- Foundation Models (LLMs, VLMs)
- AI Agents
Experience
- Research Intern @ Vmax AI
- Research Intern @ Ubisoft LaForge
- Teaching Assistant @ University of Montreal
- Junior Data Scientist @ HP Inc
- Research Assistant @ UPC
- Basketball Coach @ Sagrada Familia Claror
Recent News
- January 2026: ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning accepted at International Conference on Learning Representations (ICLR) 2026.
- January 2026: Started a research internship at Vmax, working on LLM post-training and synthetic data generation at scale for SWE-agents.
- June 2025: Stable Gradients for Stable Learning at Scale in Deep Reinforcement Learning published at NeurIPS 2025 as a Spotlight (top 3% of submissions).
- February 2025: Obtained the AI Research Scholarship from Université de Montréal.
- December 2024: Obtained the Academic Excellence Scholarship from Université de Montréal.
- September 2024: Started my PhD at Université de Montréal and Mila under the supervision of Professors Pablo Samuel Castro and Glen Berseth.
- August 2024: Graduated with a Research MSc in Computer Science & Artificial Intelligence from Université de Montréal and Mila. I made it to the Dean's List! Read my thesis.
- June 2024: Received the End-of-Studies Scholarship from Université de Montréal to complete my MSc thesis.
- May 2024: Surprise-Adaptive Intrinsic Motivation for Unsupervised Reinforcement Learning published at Reinforcement Learning Conference (RLC) 2024.
- April 2024: RLeXplore: Accelerating Research in Intrinsically-Motivated Reinforcement Learning published at Transactions on Machine Learning Research (TMLR) 2024.
- February 2024: Improving Intrinsic Exploration by Creating Stationary Objectives published at International Conference on Learning Representations (ICLR) 2024.
- December 2023: Surprise-Adaptive Intrinsic Motivation for Unsupervised Reinforcement Learning accepted at Intrinsically Motivated Open-ended Learning workshop @ NeurIPS 2023. Selected for Oral Presentation!
- 2021: PixelEDL: Unsupervised Skill Discovery and Learning from Pixels accepted at Embodied AI workshop @ Computer Vision and Pattern Recognition (CVPR) 2021.
- 2021: Unsupervised Skill-Discovery and Skill-Learning in Minecraft accepted at Unsupervised Reinforcement Learning workshop @ International Conference on Machine Learning (ICML) 2021.
- 2021: PiCoEDL: Discovery and Learning of Minecraft Navigation Goals from Pixels and Coordinates accepted at Embodied AI workshop @ Computer Vision and Pattern Recognition (CVPR) 2021.
- 2021: Integration of Convolutional Neural Networks in Mobile Applications accepted at Workshop on AI Engineering @ International Conference on Software Engineering (ICSE) 2021.
- 2021: Which Design Decisions in AI-enabled Mobile Applications Contribute to Greener AI? published at Empirical Software Engineering Journal (EMSE) 2022.
- 2021: Enhancing sequence-to-sequence modelling for RDF triples to natural text accepted at WebNLG workshop @ EMNLP 2020.
Publications
ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning
Roger Creus Castanyer, Faisal Mohamed, Pablo Samuel Castro, Cyrus Neary, Glen Berseth
Accepted at International Conference on Learning Representations (ICLR) 2026
We present ARM-FM, a framework for automated, compositional reward design in reinforcement learning that uses foundation models to generate reward machines (formal automata for specifying objectives) directly from natural language. By pairing the high-level reasoning of foundation models with the structured formalism of reward machines, ARM-FM enables robust, generalizable RL agents and demonstrates effectiveness, including zero-shot generalization, on diverse, challenging environments.
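To make the idea concrete, here is a minimal Python sketch of a reward machine as a finite-state automaton whose transitions fire on high-level events and emit rewards. The two-stage task and event names are illustrative assumptions, not the automata ARM-FM actually generates.

```python
class RewardMachine:
    """Minimal reward machine: a finite-state automaton over abstract events."""

    def __init__(self, transitions, initial_state):
        # transitions: {(state, event): (next_state, reward)}
        self.transitions = transitions
        self.state = initial_state

    def step(self, event):
        """Advance the machine on an observed event; return the emitted reward."""
        if (self.state, event) in self.transitions:
            self.state, reward = self.transitions[(self.state, event)]
            return reward
        return 0.0  # no matching transition: stay in place, no reward

# Hypothetical task "pick up the key, then open the door", split into stages.
rm = RewardMachine(
    transitions={
        ("start", "got_key"): ("has_key", 0.5),
        ("has_key", "opened_door"): ("done", 1.0),
    },
    initial_state="start",
)
print(rm.step("got_key"), rm.step("opened_door"))  # 0.5 1.0
```

The automaton state acts as a compositional progress signal, which is what makes sub-task rewards reusable across tasks.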
Stable Gradients for Stable Learning at Scale in Deep Reinforcement Learning
Roger Creus Castanyer, Johan Obando-Ceron, Lu Li, Pierre-Luc Bacon, Glen Berseth, Aaron Courville, Pablo Samuel Castro
Published at NeurIPS 2025 (Spotlight)
This work investigates why scaling deep reinforcement learning networks often degrades performance, identifying the interplay of non-stationarity and gradient pathologies from suboptimal architectures as key causes. Through empirical analysis, we propose simple, easily integrated interventions that stabilize gradient flow, enabling robust performance across varying depths and widths. Our approach is compatible with standard algorithms and achieves strong results across diverse agents and environments, offering a practical path toward scaling deep RL effectively.
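As an illustration of what "gradient flow" means in practice, here is a small PyTorch sketch that logs per-layer gradient norms of a toy deep network. It is a diagnostic aid under my own simplifying assumptions, not the paper's interventions.

```python
import torch

# Toy deep MLP, standing in for the value networks studied when scaling depth.
layers = []
for _ in range(8):
    layers += [torch.nn.Linear(64, 64), torch.nn.ReLU()]
net = torch.nn.Sequential(*layers)

# One backward pass on a synthetic batch, then inspect per-layer gradients.
loss = net(torch.randn(32, 64)).pow(2).mean()
loss.backward()
for name, p in net.named_parameters():
    if p.grad is not None:
        # Norms that decay or blow up with depth signal unstable gradient flow.
        print(name, p.grad.norm().item())
```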
Improving Intrinsic Exploration by Creating Stationary Objectives
Roger Creus Castanyer, Joshua Romoff, Glen Berseth
Published at International Conference on Learning Representations (ICLR) 2024
Accepted at Agent Learning in Open Endedness workshop @ NeurIPS 2023
We identify that any intrinsic reward function derived from count-based methods is non-stationary and hence induces a difficult objective to optimize for the agent. The key contribution of our work lies in transforming the original non-stationary rewards into stationary rewards through an augmented state representation. We introduce the Stationary Objectives For Exploration (SOFE) framework. Our experiments show that SOFE improves the agents' performance in challenging exploration problems, including sparse-reward tasks, pixel-based observations, 3D navigation, and procedurally generated environments.
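A minimal sketch of the core idea, assuming a Gymnasium environment; the coarse state hashing and count normalization here are my own simplifications, not the paper's exact construction.

```python
import numpy as np
import gymnasium as gym

class CountAugmentedObs(gym.Wrapper):
    """Append visitation counts to the observation, so a count-based bonus
    becomes a stationary function of the augmented state (SOFE-style)."""

    def __init__(self, env, n_bins=32):
        super().__init__(env)
        self.counts = np.zeros(n_bins, dtype=np.float32)
        low = np.concatenate([env.observation_space.low, np.zeros(n_bins)])
        high = np.concatenate([env.observation_space.high, np.full(n_bins, np.inf)])
        self.observation_space = gym.spaces.Box(low, high, dtype=np.float64)

    def _augment(self, obs):
        # Coarse discretization via rounding + hashing (illustrative choice).
        idx = hash(tuple(np.round(obs, 1))) % len(self.counts)
        self.counts[idx] += 1.0
        return np.concatenate([obs, self.counts / self.counts.sum()])

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return self._augment(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return self._augment(obs), reward, terminated, truncated, info

env = CountAugmentedObs(gym.make("CartPole-v1"))
obs, _ = env.reset(seed=0)
print(obs.shape)  # original 4 dims + 32 count features
```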
Surprise-Adaptive Intrinsic Motivation for Unsupervised Reinforcement Learning
Adriana Hugessen*, Roger Creus Castanyer*, Faisal Mohamed*, Glen Berseth
Published at the Reinforcement Learning Conference (RLC) 2024
(Oral Presentation) Accepted at Intrinsically Motivated Open-ended Learning workshop @ NeurIPS 2023
Both surprise-minimizing and surprise-maximizing (curiosity) objectives for unsupervised reinforcement learning (RL) have been shown to be effective in different environments, depending on the environment's level of natural entropy. However, neither method can perform well across all entropy regimes. In an effort to find a single surprise-based method that will encourage emergent behaviors in any environment, we propose an agent that can adapt its objective depending on the entropy conditions it faces, by framing the choice as a multi-armed bandit problem. We devise a novel intrinsic feedback signal for the bandit which captures the ability of the agent to control the entropy in its environment. We demonstrate that such agents can learn to control entropy and exhibit emergent behaviors in both high- and low-entropy regimes.
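A toy sketch of the objective-selection bandit: arm 0 stands for surprise minimization and arm 1 for surprise maximization. The UCB rule and the synthetic feedback signal are stand-ins for the paper's entropy-control feedback, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
values, counts = np.zeros(2), np.zeros(2)  # running value and pulls per arm

def select_arm(t):
    # Upper-confidence-bound selection over the two exploration objectives.
    ucb = values + np.sqrt(2 * np.log(t + 1) / (counts + 1e-8))
    return int(np.argmax(ucb))

for t in range(100):
    arm = select_arm(t)
    # Hypothetical feedback: how much the chosen objective let the agent
    # change entropy (here, arm 1 is better on average by construction).
    feedback = rng.normal(0.2 if arm == 1 else 0.0)
    counts[arm] += 1
    values[arm] += (feedback - values[arm]) / counts[arm]

print(values, counts)  # the bandit concentrates pulls on the useful objective
```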
RLeXplore: Accelerating Research in Intrinsically-Motivated Reinforcement Learning
Mingqi Yuan*, Roger Creus Castanyer*, Bo Li, Xin Jin, Wenjun Zeng, Glen Berseth
Published at Transactions on Machine Learning Research (TMLR) 2024
Extrinsic rewards can effectively guide reinforcement learning (RL) agents in specific tasks. However, extrinsic rewards frequently fall short in complex environments due to the significant human effort needed for their design and annotation. This limitation underscores the necessity for intrinsic rewards, which offer auxiliary and dense signals and can enable agents to learn in an unsupervised manner. Although various intrinsic reward formulations have been proposed, their implementation and optimization details are insufficiently explored and lack standardization, thereby hindering research progress. To address this gap, we introduce RLeXplore, a unified, highly modularized, and plug-and-play framework offering reliable implementations of eight state-of-the-art intrinsic reward algorithms. Furthermore, we conduct an in-depth study that identifies critical implementation details and establishes well-justified standard practices in intrinsically-motivated RL. The source code for RLeXplore is available at https://github.com/RLE-Foundation/RLeXplore.
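For flavor, here is the generic shape of an intrinsic-reward plug-in, with a simple count-based bonus standing in for the eight algorithms the framework ships. This is an illustration of the general recipe, not RLeXplore's actual API; see the linked repository for that.

```python
import numpy as np
import gymnasium as gym

class CountBonus:
    """Toy intrinsic reward: 1/sqrt(N(s)) for a coarsely discretized state."""

    def __init__(self):
        self.counts = {}

    def compute(self, obs):
        key = tuple(np.round(obs, 1))  # illustrative discretization
        self.counts[key] = self.counts.get(key, 0) + 1
        return 1.0 / np.sqrt(self.counts[key])

env, bonus = gym.make("CartPole-v1"), CountBonus()
obs, _ = env.reset(seed=0)
for _ in range(200):
    obs, r_ext, terminated, truncated, _ = env.step(env.action_space.sample())
    r_total = r_ext + 0.1 * bonus.compute(obs)  # extrinsic + scaled intrinsic
    if terminated or truncated:
        obs, _ = env.reset()
```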
Unsupervised Skill-Discovery and Skill-Learning in Minecraft
Juan José Nieto, Roger Creus Castanyer, Xavier Giró-i-Nieto
Accepted at Unsupervised Reinforcement Learning workshop @ ICML 2021.
Pre-training Reinforcement Learning agents in a task-agnostic manner has shown promising results. However, previous works still struggle to discover and learn meaningful skills in high-dimensional state spaces, such as pixel spaces. We approach the problem by leveraging unsupervised skill discovery and self-supervised learning of state representations.
PixelEDL: Unsupervised Skill Discovery and Learning from Pixels
Roger Creus Castanyer, Juan José Nieto, Xavier Giró-i-Nieto
Accepted at Embodied AI workshop @ CVPR 2021.
We tackle embodied visual navigation in a task-agnostic set-up by putting the focus on the unsupervised discovery of skills that provide a good coverage of states. Our approach intersects with empowerment: we address the reward-free skill discovery and learning tasks to discover what can be done in an environment and how.
PiCoEDL: Discovery and Learning of Minecraft Navigation Goals from Pixels and Coordinates
Juan José Nieto, Roger Creus Castanyer, Xavier Giró-i-Nieto
Accepted at Embodied AI workshop @ CVPR 2021.
Defining a reward function in Reinforcement Learning (RL) is not always possible, or can be very costly. For this reason, there is great interest in training agents in a task-agnostic manner, making use of intrinsic motivations and unsupervised techniques. We hypothesize that RL agents will also benefit from unsupervised pre-training with no extrinsic rewards, analogously to how humans mostly learn, especially in the early stages of life.
Which Design Decisions in AI-enabled Mobile Applications Contribute to Greener AI?
Roger Creus Castanyer, Silverio Martínez-Fernández, Xavier Franch
Published at Empirical Software Engineering Journal (EMSE) 2022
The construction, evolution, and usage of complex artificial intelligence (AI) models demand expensive computational resources. While currently available high-performance computing environments support this complexity well, the deployment of AI models on mobile devices, which is an increasing trend, is challenging. Our objective is to systematically assess the trade-off between accuracy and complexity when deploying complex AI models (e.g., neural networks) to mobile devices, which have an implicit resource limitation.
Integration of Convolutional Neural Networks in Mobile Applications
Roger Creus Castanyer, Silverio Martínez-Fernández, Xavier Franch
Accepted at Workshop on AI Engineering @ ICSE 2021.
When building Deep Learning (DL) models, data scientists and software engineers manage the trade-off between their accuracy, or any other suitable success criteria, and their complexity. In an environment with high computational power, a common practice is making the models go deeper by designing more sophisticated architectures. However, in the context of mobile devices, which possess less computational power, keeping complexity under control is a must.
Enhancing sequence-to-sequence modelling for RDF triples to natural text
Oriol Domingo, David Bergés, Roser Cantenys, Roger Creus Castanyer, José Adrian Rodríguez Fonollosa
Accepted at WebNLG workshop 2021
This work establishes key guidelines on how, which, and when Machine Translation (MT) techniques are worth applying to the RDF-to-Text task. Not only do we apply and compare the most prominent MT architecture, the Transformer, but we also analyze state-of-the-art techniques such as Byte Pair Encoding and Back Translation to demonstrate an improvement in generalization.
MT-adapted datasheets for datasets: Template and repository
Marta Costa-jussà, Roger Creus Castanyer, Oriol Domingo, Albert Domínguez, Miquel Escobar, Cayetana López, Marina Garcia, Margarita Geleta
In this report we adopt the standardized model proposed by Gebru et al. (2018) to document the popular machine translation datasets EuroParl (Koehn, 2005) and News-Commentary (Barrault et al., 2019). Within this documentation process, we have adapted the original datasheet to the particular case of data consumers within the Machine Translation area. We also propose a repository for collecting the adapted datasheets in this research area.
Intrinsic Exploration for Reinforcement Learning Beyond Rewards
Roger Creus Castanyer
MSc Thesis
This thesis advances intrinsic motivation in reinforcement learning by tackling the instability of non-stationary rewards with SOFE, an approach that stabilizes exploration through augmented states; introducing S-Adapt, an adaptive entropy-based mechanism enabling emergent behaviors without extrinsic rewards; and developing RLeXplore, a standardized framework for consistent implementation of intrinsic reward methods. Collectively, these contributions improve stability, adaptability, and reproducibility, fostering more autonomous agent behavior in complex environments.
Unsupervised Skill Learning from Pixels
Roger Creus Castanyer
BSc Thesis
This work focuses on the self-acquirement of the fundamental task-agnostic knowledge available within an environment. The aim is to discover and learn baseline representations and behaviors that can later be useful for solving embodied visual navigation downstream tasks.
Projects
Centralized control for multi-agent RL in a complex Real-Time-Strategy game
This repository contains the source code for the project "Centralized control for multi-agent RL in a complex Real-Time-Strategy game", submitted as the final project in the COMP579 - Reinforcement Learning course at McGill University, taught by Prof. Doina Precup in Winter 2023.
Blokus RL Learning Environment
This project implements the Blokus board game as an environment using the Gymnasium framework. The environment is designed for training AI agents to play Blokus.
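A hypothetical usage sketch of the standard Gymnasium interaction loop; the "Blokus-v0" environment id is my assumption, so check the repository for the real registration and entry point.

```python
import gymnasium as gym

# "Blokus-v0" is a hypothetical id; the repo defines the actual registration.
env = gym.make("Blokus-v0")
obs, info = env.reset(seed=0)
done = False
while not done:
    # A real agent would mask illegal placements; random play shown here.
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```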
Ball Sort RL Learning Environment
This project provides a Reinforcement Learning environment for the Ball Sort color puzzle game. The environment is based on OpenAI's Gym framework. It also provides baseline Deep Reinforcement Learning models that solve some levels of the game.
xgenius
xgenius is a command-line tool for managing remote jobs and containerized experiments across multiple clusters. It simplifies the process of building Docker images, converting them to Singularity format, and submitting jobs to clusters using SLURM.
RLLTE
RLLTE: Long-Term Evolution Project of Reinforcement Learning
Wave Defense RL Learning Environment
This project provides a Reinforcement Learning environment for the custom Wave Defense game. The environment is based on OpenAI's Gym framework. It also provides baseline Deep Reinforcement Learning models that solve the game.
Details also in this video: https://www.youtube.com/watch?v=VOmj7_nnPJ0&t=1s&ab_channel=RogerCreusCastanyer
Learning to play a simple game with Genetic Neuroevolution
This project provides an implementation of a genetic neuroevolution algorithm in MATLAB that learns to play a custom game.