Implements the P3O algorithm from the NeurIPS 2025 paper Sequential Monte Carlo for Policy Optimization in Continuous POMDPs. This code was written by Sahel Iqbal and Hany Abdulsamad.
P3O is a policy optimization algorithm for partially observable Markov decision processes (POMDPs) with continuous state, action, and observation spaces. See the scripts in `examples/` for demonstrations of how to train policies using P3O.
Install JAX for the available hardware. Then run

```bash
pip install -e .
```

for an editable install.
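To check that JAX was installed for the intended backend, a quick sanity check such as the following can help (the printed device list depends on your hardware):

```python
import jax
import jax.numpy as jnp

# Show which devices JAX can use (CPU, GPU, or TPU, depending on the install).
print(jax.devices())

# Run a tiny jitted computation to confirm the backend works end to end.
print(jax.jit(lambda x: (x ** 2).sum())(jnp.arange(4.0)))
```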
We provide multiple environments to test P3O's optimal information-gathering behavior:

- `pendulum`: a pendulum swing-up task, where only the angular position is observable.
- `cartpole`: a cart-pole swing-up task, where only the angular and Cartesian positions are observable.
- `light-dark-2d`: a 2D navigation task with location-dependent noise (see the sketch below for the general idea).
- `triangulation`: a 2D navigation task with heading-only observations.
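As a rough illustration of what location-dependent noise means in a light-dark task, here is a minimal sketch in JAX. This is not the observation model used in this repository; the light position, noise scaling, and function name are assumptions made for illustration only:

```python
import jax
import jax.numpy as jnp

LIGHT_X = 5.0  # hypothetical x-coordinate of the bright region

def noisy_observation(key, position):
    """Observe a 2D position with noise that grows away from the light.

    `position` is the true (x, y) location; the further it lies from the
    bright region, the noisier the observation, so a good policy detours
    toward the light to localize itself before heading to the goal.
    """
    noise_scale = 0.1 + 0.5 * jnp.abs(position[0] - LIGHT_X)
    return position + noise_scale * jax.random.normal(key, shape=position.shape)

obs = noisy_observation(jax.random.PRNGKey(0), jnp.array([1.0, 2.0]))
```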
Each environment can be run with two policies:

- a policy with history inputs (`recurrent`)
- a policy with belief-state inputs (`attention`)
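For a sense of how the two input types differ, below is a minimal sketch in plain JAX. The encoders, dimensions, and parameter names are illustrative assumptions, not the architectures used in this repository: the recurrent encoder folds an observation history into a fixed-size vector, while the attention encoder summarizes a weighted particle belief with a single permutation-invariant readout.

```python
import jax
import jax.numpy as jnp

# Hypothetical dimensions and parameters, for illustration only.
OBS_DIM, STATE_DIM, HIDDEN_DIM = 2, 2, 16
key = jax.random.PRNGKey(0)
k1, k2, k3, k4, k5 = jax.random.split(key, 5)
params = {
    "W_h": 0.1 * jax.random.normal(k1, (HIDDEN_DIM, HIDDEN_DIM)),
    "W_o": 0.1 * jax.random.normal(k2, (HIDDEN_DIM, OBS_DIM)),
    "W_k": 0.1 * jax.random.normal(k3, (STATE_DIM, HIDDEN_DIM)),
    "W_v": 0.1 * jax.random.normal(k4, (STATE_DIM, HIDDEN_DIM)),
    "query": jax.random.normal(k5, (HIDDEN_DIM,)),
}

def recurrent_encoder(history):
    """Fold an observation history of shape (T, OBS_DIM) into one hidden vector."""
    def step(h, obs):
        h = jnp.tanh(params["W_h"] @ h + params["W_o"] @ obs)
        return h, None
    h_final, _ = jax.lax.scan(step, jnp.zeros(HIDDEN_DIM), history)
    return h_final  # would be fed to an action head

def attention_encoder(particles, weights):
    """Summarize a weighted particle belief of shape (N, STATE_DIM) with one
    attention readout, so the encoding is permutation-invariant in the particles."""
    scores = (particles @ params["W_k"]) @ params["query"]   # (N,)
    attn = jax.nn.softmax(scores + jnp.log(weights + 1e-8))
    return attn @ (particles @ params["W_v"])                # (HIDDEN_DIM,)

# Example inputs: a 10-step observation history and a 100-particle belief.
history = jnp.zeros((10, OBS_DIM))
particles = jax.random.normal(key, (100, STATE_DIM))
weights = jnp.ones(100) / 100
print(recurrent_encoder(history).shape, attention_encoder(particles, weights).shape)
```

Either summary vector would then be passed to an action head; the attention variant has the added property of being invariant to the ordering of the particles in the belief.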
For example, for the light-dark environment run

```bash
python examples/lightdark2d/p3o_recurrent.py
```

or

```bash
python examples/lightdark2d/p3o_attention.py
```

We provide the following baselines for comparison:
- Deep Variational Reinforcement Learning for POMDPs (DVRL) - see `baselines/dvrl`.
- Stochastic Latent Actor-Critic (SLAC) - see `baselines/slac`.
- DualSMC - see `baselines/dsmc`.
See `baselines/README.md` for details.
If you find the code useful, please cite our paper:
```bibtex
@inproceedings{abdulsamad2025sequential,
  title     = {Sequential {Monte Carlo} for policy optimization in continuous {POMDPs}},
  author    = {Hany Abdulsamad and Sahel Iqbal and Simo S{\"a}rkk{\"a}},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
}
```