Merged

33 commits
f779a9f
Feat: adding TRPO algorithm (WIP)
cyprienc Aug 7, 2021
98bc5b2
Feat: adding TRPO algorithm (WIP)
cyprienc Aug 9, 2021
97ece67
Feat: adding TRPO algorithm (WIP)
cyprienc Aug 17, 2021
799b140
Feat: adding TRPO algorithm (WIP)
cyprienc Aug 17, 2021
dc73462
Feat: adding TRPO algorithm (WIP)
cyprienc Aug 19, 2021
9b8a222
feat: TRPO - addressing PR comments
cyprienc Sep 11, 2021
869dce9
refactor: TRPO - policier
cyprienc Sep 11, 2021
347dcc0
feat: using updated ActorCriticPolicy from SB3
cyprienc Sep 11, 2021
35d7256
Bump version for `get_distribution` support
araffin Sep 13, 2021
9cfcb54
Add basic test
araffin Sep 13, 2021
974174a
Reformat
araffin Sep 13, 2021
b6bd449
[ci skip] Fix changelog
araffin Sep 13, 2021
c88951c
fix: setting train mode for trpo
cyprienc Sep 13, 2021
1f7e99d
fix: batch_size type hint in trpo.py
cyprienc Sep 13, 2021
6540371
style: renaming variables + docstring in trpo.py
cyprienc Sep 15, 2021
3a26c05
Merge branch 'master' into master
araffin Sep 23, 2021
f003e88
Merge branch 'master' into master
araffin Sep 27, 2021
a33409e
Merge branch 'master' into master
araffin Sep 29, 2021
8ecf40e
Rename + cleanup
araffin Sep 29, 2021
45f4ea6
Move grad computation to separate method
araffin Sep 29, 2021
cc4b5ab
Remove grad norm clipping
araffin Sep 29, 2021
fc7a6c7
Remove n epochs and add sub-sampling
araffin Sep 29, 2021
66723ff
Update defaults
araffin Sep 29, 2021
63a263f
Merge branch 'master' into master
araffin Dec 1, 2021
bf583de
Merge branch 'master' into cyprienc/master
araffin Dec 10, 2021
e983348
Add Doc
araffin Dec 27, 2021
439d79b
Add more test and fixes for CNN
araffin Dec 27, 2021
d9483dc
Update doc + add benchmark
araffin Dec 28, 2021
fff84e4
Add tests + update doc
araffin Dec 28, 2021
95dddf4
Fix doc
araffin Dec 28, 2021
661fe15
Improve names for conjugate gradient
araffin Dec 29, 2021
a24e7c0
Update comments
araffin Dec 29, 2021
342fe53
Update changelog
araffin Dec 29, 2021
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -28,6 +28,7 @@ See documentation for the full list of included features.
- [Truncated Quantile Critics (TQC)](https://arxiv.org/abs/2005.04269)
- [Quantile Regression DQN (QR-DQN)](https://arxiv.org/abs/1710.10044)
- [PPO with invalid action masking (MaskablePPO)](https://arxiv.org/abs/2006.14171)
- [Trust Region Policy Optimization (TRPO)](https://arxiv.org/abs/1502.05477)

**Gym Wrappers**:
- [Time Feature Wrapper](https://arxiv.org/abs/1712.00378)
7 changes: 7 additions & 0 deletions docs/common/utils.rst
@@ -0,0 +1,7 @@
.. _utils:

Utils
=====

.. automodule:: sb3_contrib.common.utils
    :members:
3 changes: 2 additions & 1 deletion docs/guide/algos.rst
@@ -9,7 +9,8 @@ along with some useful characteristics: support for discrete/continuous actions,
Name ``Box`` ``Discrete`` ``MultiDiscrete`` ``MultiBinary`` Multi Processing
============ =========== ============ ================= =============== ================
TQC ✔️ ❌ ❌ ❌ ✔️
QR-DQN ️❌ ️✔️ ❌ ❌ ✔️
TRPO ✔️ ✔️ ✔️ ✔️ ✔️
QR-DQN ️❌ ️✔️ ❌ ❌ ✔️
============ =========== ============ ================= =============== ================


13 changes: 13 additions & 0 deletions docs/guide/examples.rst
@@ -44,3 +44,16 @@ Train a PPO with invalid action masking agent on a toy environment.
    model = MaskablePPO("MlpPolicy", env, verbose=1)
    model.learn(5000)
    model.save("ppo_mask")

TRPO
----

Train a Trust Region Policy Optimization (TRPO) agent on the Pendulum environment.

.. code-block:: python

    from sb3_contrib import TRPO

    model = TRPO("MlpPolicy", "Pendulum-v0", gamma=0.9, verbose=1)
    model.learn(total_timesteps=100_000, log_interval=4)
    model.save("trpo_pendulum")
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -32,13 +32,15 @@ RL Baselines3 Zoo also offers a simple interface to train, evaluate agents and d
    :caption: RL Algorithms

    modules/tqc
    modules/trpo
    modules/qrdqn
    modules/ppo_mask

.. toctree::
    :maxdepth: 1
    :caption: Common

    common/utils
    common/wrappers

.. toctree::
12 changes: 6 additions & 6 deletions docs/misc/changelog.rst
@@ -4,8 +4,9 @@ Changelog
==========


Release 1.3.1a6 (WIP)
Release 1.3.1a7 (WIP)
-------------------------------
**Add TRPO**

Breaking Changes:
^^^^^^^^^^^^^^^^^
@@ -15,6 +16,7 @@

New Features:
^^^^^^^^^^^^^
- Added ``TRPO`` (@cyprienc)
- Added experimental support to train off-policy algorithms with multiple envs (note: ``HerReplayBuffer`` currently not supported)

Bug Fixes:
@@ -34,7 +36,7 @@ Documentation:
Release 1.3.0 (2021-10-23)
-------------------------------

**Invalid action masking for PPO**
**Add Invalid action masking for PPO**

.. warning::

@@ -52,6 +54,7 @@ New Features:
- Added ``MaskablePPO`` algorithm (@kronion)
- ``MaskablePPO`` Dictionary Observation support (@glmcdona)


Bug Fixes:
^^^^^^^^^^

@@ -75,9 +78,6 @@ Breaking Changes:
^^^^^^^^^^^^^^^^^
- Upgraded to Stable-Baselines3 >= 1.2.0

New Features:
^^^^^^^^^^^^^

Bug Fixes:
^^^^^^^^^^
- QR-DQN and TQC updated so that their policies are switched between train and eval mode at the correct time (@ayeright)
@@ -221,4 +221,4 @@ Stable-Baselines3 is currently maintained by `Antonin Raffin`_ (aka `@araffin`_)
Contributors:
-------------

@ku2482 @guyk1971 @minhlong94 @ayeright @kronion @glmcdona
@ku2482 @guyk1971 @minhlong94 @ayeright @kronion @glmcdona @cyprienc
151 changes: 151 additions & 0 deletions docs/modules/trpo.rst
@@ -0,0 +1,151 @@
.. _trpo:

.. automodule:: sb3_contrib.trpo

TRPO
====

`Trust Region Policy Optimization (TRPO) <https://arxiv.org/abs/1502.05477>`_
is an iterative approach for optimizing policies with guaranteed monotonic improvement.
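
At each update, TRPO (approximately) solves a surrogate objective under a KL trust-region
constraint. Notation follows the paper: :math:`\hat{A}_t` is an advantage estimate and
:math:`\delta` is the maximum allowed KL divergence (the trust-region size).

.. math::

    \max_{\theta} \; \mathbb{E}_t \left[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t \right]
    \quad \text{s.t.} \quad
    \mathbb{E}_t \left[ D_{\text{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_{\theta}(\cdot \mid s_t)\right) \right] \le \delta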

.. rubric:: Available Policies

.. autosummary::
    :nosignatures:

    MlpPolicy
    CnnPolicy
    MultiInputPolicy


Notes
-----

- Original paper: https://arxiv.org/abs/1502.05477
- OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/


Can I use?
----------

- Recurrent policies: ❌
- Multi processing: ✔️
- Gym spaces:


============= ====== ===========
Space Action Observation
============= ====== ===========
Discrete ✔️ ✔️
Box ✔️ ✔️
MultiDiscrete ✔️ ✔️
MultiBinary ✔️ ✔️
Dict ❌ ✔️
============= ====== ===========


Example
-------

.. code-block:: python

    import gym
    import numpy as np

    from sb3_contrib import TRPO

    env = gym.make("Pendulum-v0")

    model = TRPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=10000, log_interval=4)
    model.save("trpo_pendulum")

    del model  # remove to demonstrate saving and loading

    model = TRPO.load("trpo_pendulum")

    obs = env.reset()
    while True:
        action, _states = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
        env.render()
        if done:
            obs = env.reset()


Results
-------

Results on the MuJoCo benchmark (1M steps on ``-v3`` envs with MuJoCo v2.1.0) using 3 seeds.
The complete learning curves are available in the `associated PR <https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/pull/40>`_.


===================== ============
Environments TRPO
===================== ============
HalfCheetah 1803 +/- 46
Ant 3554 +/- 591
Hopper 3372 +/- 215
Walker2d 4502 +/- 234
Swimmer 359 +/- 2
===================== ============


How to replicate the results?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Clone RL-Zoo and check out the branch ``feat/trpo``:

.. code-block:: bash

    git clone https://github.com/cyprienc/rl-baselines3-zoo
    cd rl-baselines3-zoo/

Run the benchmark (replace ``$ENV_ID`` by the envs mentioned above):

.. code-block:: bash

    python train.py --algo trpo --env $ENV_ID --n-eval-envs 10 --eval-episodes 20 --eval-freq 50000


Plot the results:

.. code-block:: bash

    python scripts/all_plots.py -a trpo -e HalfCheetah Ant Hopper Walker2d Swimmer -f logs/ -o logs/trpo_results
    python scripts/plot_from_file.py -i logs/trpo_results.pkl -latex -l TRPO


Parameters
----------

.. autoclass:: TRPO
    :members:
    :inherited-members:

.. _trpo_policies:

TRPO Policies
-------------

.. autoclass:: MlpPolicy
    :members:
    :inherited-members:

.. autoclass:: stable_baselines3.common.policies.ActorCriticPolicy
    :members:
    :noindex:

.. autoclass:: CnnPolicy
    :members:

.. autoclass:: stable_baselines3.common.policies.ActorCriticCnnPolicy
    :members:
    :noindex:

.. autoclass:: MultiInputPolicy
    :members:

.. autoclass:: stable_baselines3.common.policies.MultiInputActorCriticPolicy
    :members:
    :noindex:
1 change: 1 addition & 0 deletions sb3_contrib/__init__.py
@@ -3,6 +3,7 @@
from sb3_contrib.ppo_mask import MaskablePPO
from sb3_contrib.qrdqn import QRDQN
from sb3_contrib.tqc import TQC
from sb3_contrib.trpo import TRPO

# Read version from file
version_file = os.path.join(os.path.dirname(__file__), "version.txt")
96 changes: 95 additions & 1 deletion sb3_contrib/common/utils.py
@@ -1,6 +1,7 @@
from typing import Optional
from typing import Callable, Optional, Sequence

import torch as th
from torch import nn


def quantile_huber_loss(
@@ -67,3 +68,96 @@ def quantile_huber_loss(
    else:
        loss = loss.mean()
    return loss


def conjugate_gradient_solver(
    matrix_vector_dot_fn: Callable[[th.Tensor], th.Tensor],
    b: th.Tensor,
    max_iter: int = 10,
    residual_tol: float = 1e-10,
) -> th.Tensor:
    """
    Finds an approximate solution to a set of linear equations Ax = b

    Sources:
     - https://github.com/ajlangley/trpo-pytorch/blob/master/conjugate_gradient.py
     - https://github.com/joschu/modular_rl/blob/master/modular_rl/trpo.py#L122

    Reference:
     - https://epubs.siam.org/doi/abs/10.1137/1.9781611971446.ch6

    :param matrix_vector_dot_fn:
        a function that right multiplies a matrix A by a vector v
    :param b:
        the right hand term in the set of linear equations Ax = b
    :param max_iter:
        the maximum number of iterations (default is 10)
    :param residual_tol:
        residual tolerance for early stopping of the solving (default is 1e-10)
    :return x:
        the approximate solution to the system of equations defined by `matrix_vector_dot_fn`
        and b
    """

    # The vector is not initialized at 0 because of instability issues when the gradient becomes small.
    # Small random Gaussian noise is used for the initialization instead.
    x = 1e-4 * th.randn_like(b)
    residual = b - matrix_vector_dot_fn(x)
    # Equivalent to th.linalg.norm(residual) ** 2 (L2 norm squared)
    residual_squared_norm = th.matmul(residual, residual)

    if residual_squared_norm < residual_tol:
        # If the gradient becomes extremely small,
        # the denominator in alpha will become zero,
        # leading to a division by zero
        return x

    p = residual.clone()

    for i in range(max_iter):
        # A @ p (matrix-vector multiplication)
        A_dot_p = matrix_vector_dot_fn(p)

        alpha = residual_squared_norm / p.dot(A_dot_p)
        x += alpha * p

        if i == max_iter - 1:
            return x

        residual -= alpha * A_dot_p
        new_residual_squared_norm = th.matmul(residual, residual)

        if new_residual_squared_norm < residual_tol:
            return x

        beta = new_residual_squared_norm / residual_squared_norm
        residual_squared_norm = new_residual_squared_norm
        p = residual + beta * p
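
For intuition, the same recursion can be exercised on a tiny symmetric positive-definite system. The sketch below is standalone NumPy; the ``cg_solve`` name and the 2x2 system are illustrative, not part of the library:

```python
import numpy as np

def cg_solve(matvec, b, max_iter=10, residual_tol=1e-10):
    # Mirror of the torch solver above: start from small noise, not zero,
    # to avoid a zero denominator in alpha when the gradient is tiny
    rng = np.random.default_rng(0)
    x = 1e-4 * rng.standard_normal(b.shape)
    residual = b - matvec(x)
    residual_squared_norm = residual @ residual
    if residual_squared_norm < residual_tol:
        return x
    p = residual.copy()
    for i in range(max_iter):
        A_dot_p = matvec(p)
        alpha = residual_squared_norm / (p @ A_dot_p)
        x = x + alpha * p
        if i == max_iter - 1:
            return x
        residual = residual - alpha * A_dot_p
        new_norm = residual @ residual
        if new_norm < residual_tol:
            return x
        p = residual + (new_norm / residual_squared_norm) * p
        residual_squared_norm = new_norm
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # symmetric positive definite
b = np.array([1.0, 2.0])
x = cg_solve(lambda v: A @ v, b)
print(np.allclose(A @ x, b, atol=1e-4))  # True: CG recovers the solution
```

For a 2x2 system, conjugate gradient converges in at most two iterations (up to floating-point error), which is why the residual tolerance triggers well before ``max_iter``.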


def flat_grad(
    output,
    parameters: Sequence[nn.parameter.Parameter],
    create_graph: bool = False,
    retain_graph: bool = False,
) -> th.Tensor:
    """
    Returns the gradients of the given output w.r.t. the passed sequence of parameters,
    concatenated into a single flat tensor. Order of parameters is preserved.

    :param output: functional output to compute the gradient for
    :param parameters: sequence of ``Parameter``
    :param create_graph: If ``True``, graph of the derivative will be constructed,
        allowing to compute higher order derivative products. Default: ``False``.
    :param retain_graph: If ``False``, the graph used to compute the grad will be freed.
        Defaults to the value of ``create_graph``.
    :return: Tensor containing the flattened gradients
    """
    grads = th.autograd.grad(
        output,
        parameters,
        create_graph=create_graph,
        retain_graph=retain_graph,
        allow_unused=True,
    )
    return th.cat([th.ravel(grad) for grad in grads if grad is not None])
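
A minimal usage sketch of the flattening logic, inlined so it runs without the package installed (the ``nn.Linear`` model and shapes are illustrative):

```python
import torch as th
from torch import nn

# A tiny model with two parameter tensors: weight (3x2) and bias (3,)
model = nn.Linear(2, 3)
loss = model(th.randn(4, 2)).sum()

# Same logic as flat_grad: per-parameter grads, flattened and concatenated
grads = th.autograd.grad(loss, list(model.parameters()), allow_unused=True)
flat = th.cat([th.ravel(g) for g in grads if g is not None])

print(flat.shape)  # torch.Size([9]): all 6 + 3 = 9 gradient entries in one vector
```

A single flat gradient vector is what the conjugate gradient solver above expects as its right-hand side ``b``.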
2 changes: 2 additions & 0 deletions sb3_contrib/trpo/__init__.py
@@ -0,0 +1,2 @@
from sb3_contrib.trpo.policies import CnnPolicy, MlpPolicy, MultiInputPolicy
from sb3_contrib.trpo.trpo import TRPO
16 changes: 16 additions & 0 deletions sb3_contrib/trpo/policies.py
@@ -0,0 +1,16 @@
# This file is here just to define MlpPolicy/CnnPolicy
# that work for TRPO
from stable_baselines3.common.policies import (
    ActorCriticCnnPolicy,
    ActorCriticPolicy,
    MultiInputActorCriticPolicy,
    register_policy,
)

MlpPolicy = ActorCriticPolicy
CnnPolicy = ActorCriticCnnPolicy
MultiInputPolicy = MultiInputActorCriticPolicy

register_policy("MlpPolicy", ActorCriticPolicy)
register_policy("CnnPolicy", ActorCriticCnnPolicy)
register_policy("MultiInputPolicy", MultiInputPolicy)