
Commit dd6e361

Implement HER (#120)
* Added working her version, Online sampling is missing.
* Updated test_her.
* Added first version of online her sampling. Still problems with tensor dimensions.
* Reformat
* Fixed tests
* Added some comments.
* Updated changelog.
* Add missing init file
* Fixed some small bugs.
* Reduced arguments for HER, small changes.
* Added getattr. Fixed bug for online sampling.
* Updated save/load functions. Small changes.
* Added her to init.
* Updated save method.
* Updated her ratio.
* Move obs_wrapper
* Added DQN test.
* Fix potential bug
* Offline and online her share same sample_goal function.
* Changed lists into arrays.
* Updated her test.
* Fix online sampling
* Fixed action bug. Updated time limit for episodes.
* Updated convert_dict method to take keys as arguments.
* Renamed obs dict wrapper.
* Seed bit flipping env
* Remove get_episode_dict
* Add fast online sampling version
* Added documentation.
* Vectorized reward computation
* Vectorized goal sampling
* Update time limit for episodes in online her sampling.
* Fix max episode length inference
* Bug fix for Fetch envs
* Fix for HER + gSDE
* Reformat (new black version)
* Added info dict to compute new reward. Check her_replay_buffer again.
* Fix info buffer
* Updated done flag.
* Fixes for gSDE
* Offline her version now uses HerReplayBuffer as episode storage.
* Fix num_timesteps computation
* Fix get torch params
* Vectorized version for offline sampling.
* Modified offline her sampling to use sample method of her_replay_buffer
* Updated HER tests.
* Updated documentation
* Cleanup docstrings
* Updated to review comments
* Fix pytype
* Update according to review comments.
* Removed random goal strategy. Updated sample transitions.
* Updated migration. Removed time signal removal.
* Update doc
* Fix potential load issue
* Add VecNormalize support for dict obs
* Updated saving/loading replay buffer for HER.
* Fix test memory usage
* Fixed save/load replay buffer.
* Fixed save/load replay buffer
* Fixed transition index after loading replay buffer in online sampling
* Better error handling
* Add tests for get_time_limit
* More tests for VecNormalize with dict obs
* Update doc
* Improve HER description
* Add test for sde support
* Add comments
* Add comments
* Remove check that was always valid
* Fix for terminal observation
* Updated buffer size in offline version and reset of HER buffer
* Reformat
* Update doc
* Remove np.empty + add doc
* Fix loading
* Updated loading replay buffer
* Separate online and offline sampling + bug fixes
* Update tensorboard log name
* Version bump
* Bug fix for special case

Co-authored-by: Antonin Raffin <[email protected]>
Co-authored-by: Antonin RAFFIN <[email protected]>
1 parent 15e94a6 commit dd6e361

34 files changed: +1899 −102 lines

Makefile

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ pytest:
 	./scripts/run_tests.sh
 
 type:
-	pytype
+	pytype -j auto
 
 lint:
 	# stop the build if there are Python syntax errors or undefined names

README.md

Lines changed: 1 addition & 12 deletions
@@ -35,20 +35,9 @@ These algorithms will make it easier for the research community and industry to
 | Type hints | :heavy_check_mark: |
 
 
-### Roadmap to V1.0
-
-Please look at the issue for more details.
-Planned features:
-
-- [ ] HER
-
 ### Planned features (v1.1+)
 
-- [ ] DQN extensions (prioritized replay, double q-learning, ...)
-- [ ] Support for `Tuple` and `Dict` observation spaces
-- [ ] Recurrent Policies
-- [ ] TRPO
-
+Please take a look at the [Roadmap](https://github.com/DLR-RM/stable-baselines3/issues/1) and [Milestones](https://github.com/DLR-RM/stable-baselines3/milestones).
 
 ## Migration guide: from Stable-Baselines (SB2) to Stable-Baselines3 (SB3)
 

docs/guide/examples.rst

Lines changed: 76 additions & 2 deletions
@@ -18,8 +18,7 @@ notebooks:
 - `Atari Games`_
 - `RL Baselines zoo`_
 - `PyBullet`_
-
-.. - `Hindsight Experience Replay`_
+- `Hindsight Experience Replay`_
 
 .. _Getting Started: https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/stable_baselines_getting_started.ipynb
 .. _Training, Saving, Loading: https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/saving_loading_dqn.ipynb
@@ -343,6 +342,81 @@ will compute a running average and standard deviation of input features (it can
     env.norm_reward = False
 
 
+Hindsight Experience Replay (HER)
+---------------------------------
+
+For this example, we are using `Highway-Env <https://github.com/eleurent/highway-env>`_ by `@eleurent <https://github.com/eleurent>`_.
+
+
+.. image:: ../_static/img/colab-badge.svg
+   :target: https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/stable_baselines_her.ipynb
+
+
+.. figure:: https://raw.githubusercontent.com/eleurent/highway-env/gh-media/docs/media/parking-env.gif
+
+    The highway-parking-v0 environment.
+
+The parking env is a goal-conditioned continuous control task, in which the vehicle must park in a given space with the appropriate heading.
+
+.. note::
+
+  The hyperparameters in the following example were optimized for that environment.
+
+
+.. code-block:: python
+
+  import gym
+  import highway_env
+  import numpy as np
+
+  from stable_baselines3 import HER, SAC, DDPG, TD3
+  from stable_baselines3.common.noise import NormalActionNoise
+
+  env = gym.make("parking-v0")
+
+  # Create 4 artificial transitions per real transition
+  n_sampled_goal = 4
+
+  # SAC hyperparams:
+  model = HER(
+      "MlpPolicy",
+      env,
+      SAC,
+      n_sampled_goal=n_sampled_goal,
+      goal_selection_strategy="future",
+      # IMPORTANT: because the env is not wrapped with a TimeLimit wrapper
+      # we have to manually specify the max number of steps per episode
+      max_episode_length=100,
+      verbose=1,
+      buffer_size=int(1e6),
+      learning_rate=1e-3,
+      gamma=0.95,
+      batch_size=256,
+      online_sampling=True,
+      policy_kwargs=dict(net_arch=[256, 256, 256]),
+  )
+
+  model.learn(int(2e5))
+  model.save("her_sac_highway")
+
+  # Load saved model
+  model = HER.load("her_sac_highway", env=env)
+
+  obs = env.reset()
+
+  # Evaluate the agent
+  episode_reward = 0
+  for _ in range(100):
+      action, _ = model.predict(obs, deterministic=True)
+      obs, reward, done, info = env.step(action)
+      env.render()
+      episode_reward += reward
+      if done or info.get("is_success", False):
+          print("Reward:", episode_reward, "Success?", info.get("is_success", False))
+          episode_reward = 0.0
+          obs = env.reset()
+
+
 Record a Video
 --------------

docs/guide/migration.rst

Lines changed: 8 additions & 0 deletions
@@ -163,6 +163,14 @@ Despite this change, no change in performance should be expected.
 To match SB2 behavior, you need to explicitly pass ``deterministic=True``
 
 
+HER
+^^^
+
+The ``HER`` implementation now also supports online sampling of the new goals; this sampling is vectorized.
+The goal selection strategy ``RANDOM`` is no longer supported.
+``HER`` now supports the ``VecNormalize`` wrapper, but only when ``online_sampling=True``.
+For performance reasons, the maximum number of steps per episode must be specified (see the :ref:`HER <her>` documentation).
+
 
 New logger API
 --------------
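
As a minimal sketch of what these migration notes mean in practice (reusing the ``BitFlippingEnv`` example from the new ``docs/modules/her.rst`` added below; the exact hyperparameter values are illustrative only):

.. code-block:: python

  from stable_baselines3 import HER, SAC
  from stable_baselines3.common.bit_flipping_env import BitFlippingEnv

  # Goal-conditioned toy env shipped with SB3 (continuous=True so SAC can be used)
  env = BitFlippingEnv(n_bits=15, continuous=True, max_steps=15)

  model = HER(
      "MlpPolicy",
      env,
      SAC,                               # the wrapped off-policy algorithm
      n_sampled_goal=4,
      goal_selection_strategy="future",  # "random" is no longer accepted
      online_sampling=True,              # also required for VecNormalize support
      max_episode_length=15,             # must be given (or inferable from the env)
      verbose=1,
  )
  model.learn(1000)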

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@ Main Features
    modules/a2c
    modules/ddpg
    modules/dqn
+   modules/her
    modules/ppo
    modules/sac
    modules/td3

docs/misc/changelog.rst

Lines changed: 7 additions & 3 deletions
@@ -4,7 +4,7 @@ Changelog
 ==========
 
 
-Pre-Release 0.10.0a0 (WIP)
+Pre-Release 0.10.0a1 (WIP)
 ------------------------------
 
 Breaking Changes:
@@ -14,11 +14,14 @@ Breaking Changes:
 New Features:
 ^^^^^^^^^^^^^
 - Allow custom actor/critic network architectures using ``net_arch=dict(qf=[400, 300], pi=[64, 64])`` for off-policy algorithms (SAC, TD3, DDPG)
+- Added Hindsight Experience Replay ``HER``. (@megan-klaiber)
+- ``VecNormalize`` now supports ``gym.spaces.Dict`` observation spaces
 - Support logging videos to Tensorboard (@SwamyDev)
 
 Bug Fixes:
 ^^^^^^^^^^
 - Fix GAE computation for on-policy algorithms (off-by one for the last value) (thanks @Wovchena)
+- Fixed potential issue when loading a different environment
 - Fix ignoring the exclude parameter when recording logs using json, csv or log as logging format (@SwamyDev)
 - Make ``make_vec_env`` support the ``env_kwargs`` argument when using an env ID str (@ManifoldFR)
 - Fix model creation initializing CUDA even when `device="cpu"` is provided
@@ -37,6 +40,7 @@ Others:
 Documentation:
 ^^^^^^^^^^^^^^
 - Added first draft of migration guide
+- Enabled doc for ``CnnPolicies``
 
 
 Pre-Release 0.9.0 (2020-10-03)
@@ -68,6 +72,7 @@ New Features:
 
 Bug Fixes:
 ^^^^^^^^^^
+- Added ``unwrap_vec_wrapper()`` to ``common.vec_env`` to extract ``VecEnvWrapper`` if needed
 - Fixed a bug where the environment was reset twice when using ``evaluate_policy``
 - Fix logging of ``clip_fraction`` in PPO (@diditforlulz273)
 - Fixed a bug where cuda support was wrongly checked when passing the GPU index, e.g., ``device="cuda:0"`` (@liorcohen5)
@@ -160,7 +165,6 @@ Documentation:
 - Fixed typo in custom policy doc (@RaphaelWag)
 
 
-
 Pre-Release 0.7.0 (2020-06-10)
 ------------------------------
 
@@ -461,4 +465,4 @@ And all the contributors:
 @MarvineGothic @jdossgollin @SyllogismRXS @rusu24edward @jbulow @Antymon @seheevic @justinkterry @edbeeching
 @flodorner @KuKuXia @NeoExtended @PartiallyTyped @mmcenta @richardwu @kinalmehta @rolandgvc @tkelestemur @mloo3
 @tirafesi @blurLake @koulakis @joeljosephjin @shwang @rk37 @andyshih12 @RaphaelWag @xicocaio
-@diditforlulz273 @liorcohen5 @ManifoldFR @mloo3 @SwamyDev @wmmc88
+@diditforlulz273 @liorcohen5 @ManifoldFR @mloo3 @SwamyDev @wmmc88 @megan-klaiber

docs/modules/a2c.rst

Lines changed: 20 additions & 1 deletion
@@ -11,7 +11,7 @@ It uses multiple workers to avoid the use of a replay buffer.
 
 
 .. warning::
-
+
   If you find training unstable or want to match performance of stable-baselines A2C, consider using
   ``RMSpropTFLike`` optimizer from ``stable_baselines3.common.sb2_compat.rmsprop_tf_like``.
   You can change optimizer with ``A2C(policy_kwargs=dict(optimizer_class=RMSpropTFLike))``.
@@ -79,3 +79,22 @@ Parameters
 .. autoclass:: A2C
   :members:
   :inherited-members:
+
+
+A2C Policies
+-------------
+
+.. autoclass:: MlpPolicy
+  :members:
+  :inherited-members:
+
+.. autoclass:: stable_baselines3.common.policies.ActorCriticPolicy
+  :members:
+  :noindex:
+
+.. autoclass:: CnnPolicy
+  :members:
+
+.. autoclass:: stable_baselines3.common.policies.ActorCriticCnnPolicy
+  :members:
+  :noindex:

docs/modules/ddpg.rst

Lines changed: 5 additions & 3 deletions
@@ -98,7 +98,9 @@ DDPG Policies
   :members:
   :inherited-members:
 
+.. autoclass:: stable_baselines3.td3.policies.TD3Policy
+  :members:
+  :noindex:
 
-.. .. autoclass:: CnnPolicy
-..   :members:
-..   :inherited-members:
+.. autoclass:: CnnPolicy
+  :members:

docs/modules/dqn.rst

Lines changed: 4 additions & 0 deletions
@@ -90,5 +90,9 @@ DQN Policies
   :members:
   :inherited-members:
 
+.. autoclass:: stable_baselines3.dqn.policies.DQNPolicy
+  :members:
+  :noindex:
+
 .. autoclass:: CnnPolicy
   :members:

docs/modules/her.rst

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
+.. _her:
+
+.. automodule:: stable_baselines3.her
+
+
+HER
+====
+
+`Hindsight Experience Replay (HER) <https://arxiv.org/abs/1707.01495>`_
+
+HER is an algorithm that works with off-policy methods (DQN, SAC, TD3 and DDPG for example).
+HER uses the fact that even if a desired goal was not achieved, another goal may have been achieved during a rollout.
+It creates "virtual" transitions by relabeling transitions (changing the desired goal) from past episodes.
+
+
+.. warning::
+
+  HER requires the environment to inherit from `gym.GoalEnv <https://github.com/openai/gym/blob/3394e245727c1ae6851b504a50ba77c73cd4c65b/gym/core.py#L160>`_
+
+
+.. warning::
+
+  For performance reasons, the maximum number of steps per episode must be specified.
+  In most cases, it will be inferred if you specify ``max_episode_steps`` when registering the environment
+  or if you use a ``gym.wrappers.TimeLimit`` (and ``env.spec`` is not None).
+  Otherwise, you can directly pass ``max_episode_length`` to the model constructor.
+
+
+.. warning::
+
+  ``HER`` supports the ``VecNormalize`` wrapper but only when ``online_sampling=True``
+
+
+Notes
+-----
+
+- Original paper: https://arxiv.org/abs/1707.01495
+- OpenAI paper: `Plappert et al. (2018)`_
+- OpenAI blog post: https://openai.com/blog/ingredients-for-robotics-research/
+
+
+.. _Plappert et al. (2018): https://arxiv.org/abs/1802.09464
+
+Can I use?
+----------
+
+Please refer to the documentation of the model used with HER (DQN, SAC, TD3 or DDPG) for that section.
+
+Example
+-------
+
+.. code-block:: python
+
+  from stable_baselines3 import HER, DDPG, DQN, SAC, TD3
+  from stable_baselines3.her.goal_selection_strategy import GoalSelectionStrategy
+  from stable_baselines3.common.bit_flipping_env import BitFlippingEnv
+  from stable_baselines3.common.vec_env import DummyVecEnv
+  from stable_baselines3.common.vec_env.obs_dict_wrapper import ObsDictWrapper
+
+  model_class = DQN  # works also with SAC, DDPG and TD3
+  N_BITS = 15
+
+  env = BitFlippingEnv(n_bits=N_BITS, continuous=model_class in [DDPG, SAC, TD3], max_steps=N_BITS)
+
+  # Available strategies (cf. paper): future, final, episode
+  goal_selection_strategy = 'future'  # equivalent to GoalSelectionStrategy.FUTURE
+
+  # If True, the HER transitions will be sampled online
+  online_sampling = True
+  # Time limit for the episodes
+  max_episode_length = N_BITS
+
+  # Initialize the model
+  model = HER('MlpPolicy', env, model_class, n_sampled_goal=4, goal_selection_strategy=goal_selection_strategy, online_sampling=online_sampling,
+              verbose=1, max_episode_length=max_episode_length)
+  # Train the model
+  model.learn(1000)
+
+  model.save("./her_bit_env")
+  model = HER.load('./her_bit_env', env=env)
+
+  obs = env.reset()
+  for _ in range(100):
+      action, _ = model.model.predict(obs, deterministic=True)
+      obs, reward, done, _ = env.step(action)
+
+      if done:
+          obs = env.reset()
+
+
+Parameters
+----------
+
+.. autoclass:: HER
+  :members:
+
+Goal Selection Strategies
+-------------------------
+
+.. autoclass:: GoalSelectionStrategy
+  :members:
+  :inherited-members:
+  :undoc-members:
+
+
+Obs Dict Wrapper
+----------------
+
+.. autoclass:: ObsDictWrapper
+  :members:
+  :inherited-members:
+  :undoc-members:
+
+
+HER Replay Buffer
+-----------------
+
+.. autoclass:: HerReplayBuffer
+  :members:
+  :inherited-members:
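
Since the time-limit warning above is a common stumbling block, here is a minimal sketch of two ways the episode length can be provided, reusing ``BitFlippingEnv`` from the example above (the ``"BitFlippingHer-v0"`` id is purely illustrative and not registered by the library):

.. code-block:: python

  import gym

  from stable_baselines3 import HER, SAC
  from stable_baselines3.common.bit_flipping_env import BitFlippingEnv

  # Option 1: register the env with ``max_episode_steps`` so that ``gym.make()``
  # wraps it in a TimeLimit and the episode length can be inferred from ``env.spec``.
  gym.envs.registration.register(
      id="BitFlippingHer-v0",  # hypothetical id, chosen here for illustration
      entry_point=BitFlippingEnv,
      kwargs=dict(n_bits=15, continuous=True),
      max_episode_steps=15,
  )
  env = gym.make("BitFlippingHer-v0")
  model = HER("MlpPolicy", env, SAC, goal_selection_strategy="future", online_sampling=True)

  # Option 2: no registration and no TimeLimit wrapper,
  # so the limit must be passed explicitly to the constructor.
  env = BitFlippingEnv(n_bits=15, continuous=True, max_steps=15)
  model = HER("MlpPolicy", env, SAC, goal_selection_strategy="future",
              online_sampling=True, max_episode_length=15)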
