<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Agents | Learning And Signal Processing</title><link>https://ucl-lasp.github.io/tag/agents/</link><atom:link href="https://ucl-lasp.github.io/tag/agents/index.xml" rel="self" type="application/rss+xml"/><description>Agents</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sun, 25 Jan 2026 00:00:00 +0000</lastBuildDate><image><url>https://ucl-lasp.github.io/media/icon_hu488c70cfa50b07216f285734af4abcd1_22080_512x512_fill_lanczos_center_3.png</url><title>Agents</title><link>https://ucl-lasp.github.io/tag/agents/</link></image><item><title>Fundamentals of RL &amp; Agents</title><link>https://ucl-lasp.github.io/project/rl-fundamentals/</link><pubDate>Sun, 25 Jan 2026 00:00:00 +0000</pubDate><guid>https://ucl-lasp.github.io/project/rl-fundamentals/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Reinforcement Learning (RL) has achieved remarkable success, yet fundamental challenges remain in making agents sample-efficient, scalable, and capable of long-term reasoning. Our research delves into the theoretical underpinnings of RL to build more robust autonomous agents.&lt;/p>
&lt;h2 id="active-projects">Active Projects&lt;/h2>
&lt;h3 id="1-scalable-environments-with-jax">1. Scalable Environments with JAX&lt;/h3>
&lt;p>&lt;strong>Goal:&lt;/strong> Accelerate RL research by orders of magnitude.
&lt;strong>Details:&lt;/strong> Building on our work &lt;strong>Navix&lt;/strong>, we leverage JAX to create vectorised grid-world environments that compile directly to XLA. The resulting massive parallelisation lets us train agents in seconds rather than hours and explore meta-learning frontiers previously out of reach.&lt;/p>
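To illustrate the vectorisation principle (this is a simplified NumPy sketch, not Navix's actual API; Navix achieves the same effect with JAX's `vmap` and `jit` compiling to XLA):

```python
import numpy as np

GRID_SIZE = 5
# Action deltas: up, down, left, right
DELTAS = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]])

def step_batch(positions, actions):
    """Advance every environment in the batch with one array operation.

    positions: (n_envs, 2) integer agent coordinates
    actions:   (n_envs,)  integers in {0, 1, 2, 3}
    """
    moved = positions + DELTAS[actions]       # one gather + add for all envs
    return np.clip(moved, 0, GRID_SIZE - 1)  # keep agents inside the grid

positions = np.zeros((4, 2), dtype=int)  # 4 environments, all starting at (0, 0)
actions = np.array([1, 3, 0, 1])         # down, right, up, down
positions = step_batch(positions, actions)
print(positions.tolist())  # [[1, 0], [0, 1], [0, 0], [1, 0]]
```

Because every environment is stepped by the same array expression, thousands of rollouts cost roughly the same wall-clock time as one; under JAX the same pattern additionally fuses and compiles onto accelerators.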
&lt;h3 id="2-temporal-credit-assignment">2. Temporal Credit Assignment&lt;/h3>
&lt;p>&lt;strong>Goal:&lt;/strong> Solve the &amp;ldquo;needle in a haystack&amp;rdquo; problem in long-horizon tasks.
&lt;strong>Details:&lt;/strong> When a reward is delayed, how does the agent know which past action caused it? We are developing new mechanisms for &lt;strong>credit assignment&lt;/strong> that go beyond simple backpropagation through time, allowing agents to connect cause and effect over thousands of steps.&lt;/p>
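As a point of reference for the problem (this sketch shows eligibility traces, a classical credit-assignment mechanism, not the group's own method; the state names are hypothetical):

```python
def td_lambda_episode(values, trajectory, reward, alpha=0.5, gamma=1.0, lam=0.9):
    """One episode of TD(lambda) with a single reward at the final step."""
    traces = {s: 0.0 for s in values}
    for t, s in enumerate(trajectory):
        traces[s] += 1.0  # mark the current state as recently visited
        terminal = t == len(trajectory) - 1
        r = reward if terminal else 0.0
        next_v = 0.0 if terminal else values[trajectory[t + 1]]
        delta = r + gamma * next_v - values[s]
        for k in traces:  # credit flows to every traced (earlier) state
            values[k] += alpha * delta * traces[k]
            traces[k] *= gamma * lam  # older states receive less credit
    return values

V = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
V = td_lambda_episode(V, ["s0", "s1", "s2"], reward=1.0)
print(V)  # earlier states get smaller but non-zero credit: s0 < s1 < s2
```

The exponentially decaying trace already struggles when cause and effect are thousands of steps apart, which is precisely the regime the mechanisms above target.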
&lt;h3 id="3-sample-efficiency-via-invariances">3. Sample Efficiency via Invariances&lt;/h3>
&lt;p>&lt;strong>Goal:&lt;/strong> Learn faster by understanding symmetries.
&lt;strong>Details:&lt;/strong> We incorporate group theory into RL agents. By explicitly encoding known invariances (e.g., rotation, translation) into the network structure or the learning objective, we drastically reduce the number of samples needed to master a task.&lt;/p>
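One simple way to encode a known symmetry, shown here as an illustrative sketch rather than the group's specific construction: average a value estimate over the orbit of the 90-degree rotation group C4, which makes any wrapped function exactly invariant by construction.

```python
import numpy as np

def rotations(obs):
    """All four 90-degree rotations of a square observation."""
    return [np.rot90(obs, k) for k in range(4)]

def invariant_value(value_fn, obs):
    """Symmetrise an arbitrary value function over C4 by orbit averaging."""
    return sum(value_fn(o) for o in rotations(obs)) / 4.0

# A deliberately non-invariant 'network': it only reads the top-left cell.
value_fn = lambda o: float(o[0, 0])

obs = np.arange(9.0).reshape(3, 3)
v1 = invariant_value(value_fn, obs)
v2 = invariant_value(value_fn, np.rot90(obs))
print(v1 == v2)  # True: rotated inputs receive identical values
```

Since every rotated copy of a state is guaranteed the same value, the agent never needs separate experience for each orientation, which is where the sample savings come from.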
&lt;h2 id="related-publications">Related Publications&lt;/h2>
&lt;div class="pub-list-item" style="margin-bottom: 1rem;">
&lt;i class="far fa-file-alt pub-icon" aria-hidden="true">&lt;/i>
&lt;span style="font-weight: bold;">Near-Optimal Sample Complexity in Reward-Free Kernel-Based Reinforcement Learning&lt;/span>&lt;br>
A Kayal, S Vakili, L Toni, A Bernacchia. &lt;em>AISTATS 2025&lt;/em>.&lt;br>
&lt;a class="btn btn-outline-primary btn-page-header btn-sm" href="https://ucl-lasp.github.io/publication/kayal-2025-sample/" target="_blank" rel="noopener">Details&lt;/a>
&lt;a class="btn btn-outline-primary btn-page-header btn-sm" href="https://arxiv.org/pdf/2502.07715.pdf" target="_blank" rel="noopener">PDF&lt;/a>
&lt;/div>
&lt;div class="pub-list-item" style="margin-bottom: 1rem;">
&lt;i class="far fa-file-alt pub-icon" aria-hidden="true">&lt;/i>
&lt;span style="font-weight: bold;">Reward-Free Kernel-Based Reinforcement Learning&lt;/span>&lt;br>
A Kayal, S Vakili, L Toni, A Bernacchia. &lt;em>ICML 2024&lt;/em>.&lt;br>
&lt;a class="btn btn-outline-primary btn-page-header btn-sm" href="https://ucl-lasp.github.io/publication/kayal-2024-reward/" target="_blank" rel="noopener">Details&lt;/a>
&lt;a class="btn btn-outline-primary btn-page-header btn-sm" href="https://arxiv.org/pdf/2502.07715.pdf" target="_blank" rel="noopener">PDF&lt;/a>
&lt;/div>
&lt;div class="pub-list-item" style="margin-bottom: 1rem;">
&lt;i class="far fa-file-alt pub-icon" aria-hidden="true">&lt;/i>
&lt;span style="font-weight: bold;">Navix: Scaling MiniGrid Environments with JAX&lt;/span>&lt;br>
E Pignatelli, J Liesen, RT Lange, C Lu, PS Castro, L Toni. &lt;em>NeurIPS 2025 Datasets and Benchmarks Track&lt;/em>.&lt;br>
&lt;a class="btn btn-outline-primary btn-page-header btn-sm" href="https://ucl-lasp.github.io/publication/pignatelli-2025-navix/" target="_blank" rel="noopener">Details&lt;/a>
&lt;a class="btn btn-outline-primary btn-page-header btn-sm" href="https://arxiv.org/abs/2407.19396" target="_blank" rel="noopener">PDF&lt;/a>
&lt;/div>
&lt;div class="pub-list-item" style="margin-bottom: 1rem;">
&lt;i class="far fa-file-alt pub-icon" aria-hidden="true">&lt;/i>
&lt;span style="font-weight: bold;">Assessing the zero-shot capabilities of LLMs for action evaluation in RL&lt;/span>&lt;br>
E Pignatelli, J Ferret, T Rocktäschel, E Grefenstette, D Paglieri, S Coward, et al. &lt;em>arXiv preprint 2024&lt;/em>.&lt;br>
&lt;a class="btn btn-outline-primary btn-page-header btn-sm" href="https://ucl-lasp.github.io/publication/pignatelli-2024-assessing/" target="_blank" rel="noopener">Details&lt;/a>
&lt;a class="btn btn-outline-primary btn-page-header btn-sm" href="https://arxiv.org/pdf/2409.12798.pdf" target="_blank" rel="noopener">PDF&lt;/a>
&lt;/div>
&lt;div class="pub-list-item" style="margin-bottom: 1rem;">
&lt;i class="far fa-file-alt pub-icon" aria-hidden="true">&lt;/i>
&lt;span style="font-weight: bold;">A survey of temporal credit assignment in deep reinforcement learning&lt;/span>&lt;br>
E Pignatelli, J Ferret, M Geist, T Mesnard, H van Hasselt, O Pietquin, L Toni. &lt;em>arXiv preprint 2023&lt;/em>.&lt;br>
&lt;a class="btn btn-outline-primary btn-page-header btn-sm" href="https://ucl-lasp.github.io/publication/pignatelli-2023-survey/" target="_blank" rel="noopener">Details&lt;/a>
&lt;a class="btn btn-outline-primary btn-page-header btn-sm" href="https://arxiv.org/pdf/2312.01072.pdf" target="_blank" rel="noopener">PDF&lt;/a>
&lt;/div>
&lt;div class="pub-list-item" style="margin-bottom: 1rem;">
&lt;i class="far fa-file-alt pub-icon" aria-hidden="true">&lt;/i>
&lt;span style="font-weight: bold;">Bayesian Optimization from Human Feedback: Near-Optimal Regret Bounds&lt;/span>&lt;br>
A Kayal, S Vakili, L Toni, D Shiu, A Bernacchia. &lt;em>ICML 2025&lt;/em>.&lt;br>
&lt;a class="btn btn-outline-primary btn-page-header btn-sm" href="https://ucl-lasp.github.io/publication/kayal-2025-bayesian/" target="_blank" rel="noopener">Details&lt;/a>
&lt;a class="btn btn-outline-primary btn-page-header btn-sm" href="https://arxiv.org/pdf/2411.01190.pdf" target="_blank" rel="noopener">PDF&lt;/a>
&lt;/div></description></item></channel></rss>