diff --git a/README.md b/README.md
index 9799ed5da..e05699512 100644
--- a/README.md
+++ b/README.md
@@ -25,11 +25,11 @@ coach -p CartPole_DQN -r
 Doom Health Gathering
 PyBullet Minitaur
 Gym Extensions Ant

 Blog posts from the Intel® AI website:
 * [Release 0.8.0](https://ai.intel.com/reinforcement-learning-coach-intel/) (initial release)
 * [Release 0.9.0](https://ai.intel.com/reinforcement-learning-coach-carla-qr-dqn/)
 * [Release 0.10.0](https://ai.intel.com/introducing-reinforcement-learning-coach-0-10-0/)
-* [Release 0.11.0](https://ai.intel.com/rl-coach-data-science-at-scale) (current release)
+* [Release 0.11.0](https://ai.intel.com/rl-coach-data-science-at-scale)
+* Release 0.12.0 (current release)

 Contacting the Coach development team is also possible through the email [coach@intel.com](coach@intel.com)
@@ -277,6 +277,7 @@ dashboard
 * [Clipped Proximal Policy Optimization (CPPO)](https://arxiv.org/pdf/1707.06347.pdf) | **Multi Worker Single Node** ([code](rl_coach/agents/clipped_ppo_agent.py))
 * [Generalized Advantage Estimation (GAE)](https://arxiv.org/abs/1506.02438) ([code](rl_coach/agents/actor_critic_agent.py#L86))
 * [Sample Efficient Actor-Critic with Experience Replay (ACER)](https://arxiv.org/abs/1611.01224) | **Multi Worker Single Node** ([code](rl_coach/agents/acer_agent.py))
+* [Soft Actor-Critic (SAC)](https://arxiv.org/abs/1801.01290) ([code](rl_coach/agents/soft_actor_critic_agent.py))

 ### General Agents
 * [Direct Future Prediction (DFP)](https://arxiv.org/abs/1611.01779) | **Multi Worker Single Node** ([code](rl_coach/agents/dfp_agent.py))
diff --git a/benchmarks/README.md b/benchmarks/README.md
index 603e8823e..fe22f89a5 100644
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -37,6 +37,7 @@ The environments that were used for testing include:
 |**[ACER](acer)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Atari | |
 |**[Clipped PPO](clipped_ppo)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Mujoco | |
 |**[DDPG](ddpg)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Mujoco | |
+|**[SAC](sac)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Mujoco | |
 |**[NEC](nec)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Atari | |
 |**[HER](ddpg_her)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Fetch | |
 |**[DFP](dfp)** | ![#ceffad](https://placehold.it/15/ceffad/000000?text=+) |Doom | Doom Battle was not verified |
diff --git a/benchmarks/clipped_ppo/README.md b/benchmarks/clipped_ppo/README.md
index 9eedf095f..f29152add 100644
--- a/benchmarks/clipped_ppo/README.md
+++ b/benchmarks/clipped_ppo/README.md
@@ -1,6 +1,6 @@
 # Clipped PPO

-Each experiment uses 3 seeds and is trained for 10k environment steps.
+Each experiment uses 3 seeds and is trained for 10M environment steps.
 The parameters used for Clipped PPO are the same parameters as described in the [original paper](https://arxiv.org/abs/1707.06347).

 ### Inverted Pendulum Clipped PPO - single worker
diff --git a/benchmarks/ddpg/README.md b/benchmarks/ddpg/README.md
index 0163d09e1..59a5cdff2 100644
--- a/benchmarks/ddpg/README.md
+++ b/benchmarks/ddpg/README.md
@@ -1,6 +1,6 @@
 # DDPG

-Each experiment uses 3 seeds and is trained for 2k environment steps.
+Each experiment uses 3 seeds and is trained for 2M environment steps.
 The parameters used for DDPG are the same parameters as described in the [original paper](https://arxiv.org/abs/1509.02971).

 ### Inverted Pendulum DDPG - single worker
diff --git a/benchmarks/sac/README.md b/benchmarks/sac/README.md
new file mode 100644
index 000000000..50e7b6ebc
--- /dev/null
+++ b/benchmarks/sac/README.md
@@ -0,0 +1,48 @@
+# Soft Actor-Critic
+
+Each experiment uses 3 seeds and is trained for 3M environment steps.
+The parameters used for SAC are the same parameters as described in the [original paper](https://arxiv.org/abs/1801.01290).
+
+### Inverted Pendulum SAC - single worker
+
+```bash
+coach -p Mujoco_SAC -lvl inverted_pendulum
+```
+
+<img src="inverted_pendulum_sac.png" alt="Inverted Pendulum SAC" width="800"/>
+
+
+### Hopper SAC - single worker
+
+```bash
+coach -p Mujoco_SAC -lvl hopper
+```
+
+<img src="hopper_sac.png" alt="Hopper SAC" width="800"/>
+
+
+### Half Cheetah SAC - single worker
+
+```bash
+coach -p Mujoco_SAC -lvl half_cheetah
+```
+
+<img src="half_cheetah_sac.png" alt="Half Cheetah SAC" width="800"/>
+
+
+### Walker 2D SAC - single worker
+
+```bash
+coach -p Mujoco_SAC -lvl walker2d
+```
+
+<img src="walker2d_sac.png" alt="Walker 2D SAC" width="800"/>
+
+
+### Humanoid SAC - single worker
+
+```bash
+coach -p Mujoco_SAC -lvl humanoid
+```
+
+<img src="humanoid_sac.png" alt="Humanoid SAC" width="800"/>
diff --git a/benchmarks/sac/half_cheetah_sac.png b/benchmarks/sac/half_cheetah_sac.png
new file mode 100644
index 000000000..00b18b54d
Binary files /dev/null and b/benchmarks/sac/half_cheetah_sac.png differ
diff --git a/benchmarks/sac/hopper_sac.png b/benchmarks/sac/hopper_sac.png
new file mode 100644
index 000000000..68d250f72
Binary files /dev/null and b/benchmarks/sac/hopper_sac.png differ
diff --git a/benchmarks/sac/humanoid_sac.png b/benchmarks/sac/humanoid_sac.png
new file mode 100644
index 000000000..72b73e524
Binary files /dev/null and b/benchmarks/sac/humanoid_sac.png differ
diff --git a/benchmarks/sac/inverted_pendulum_sac.png b/benchmarks/sac/inverted_pendulum_sac.png
new file mode 100644
index 000000000..0c174fad3
Binary files /dev/null and b/benchmarks/sac/inverted_pendulum_sac.png differ
diff --git a/benchmarks/sac/walker2d_sac.png b/benchmarks/sac/walker2d_sac.png
new file mode 100644
index 000000000..c4ae67fea
Binary files /dev/null and b/benchmarks/sac/walker2d_sac.png differ
diff --git a/docs/_images/algorithms.png b/docs/_images/algorithms.png
index 983df679b..b3310c076 100644
Binary files a/docs/_images/algorithms.png and b/docs/_images/algorithms.png differ
diff --git a/docs/_images/sac.png b/docs/_images/sac.png
new file mode 100644
index 000000000..6d51b3944
Binary files /dev/null and b/docs/_images/sac.png differ
diff --git a/docs/_modules/index.html b/docs/_modules/index.html
index 20c647694..a8cf7d4c0 100644
--- a/docs/_modules/index.html
+++ b/docs/_modules/index.html
@@ -179,6 +179,7 @@

 All modules for which code is available
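The new SAC benchmark README pairs each MuJoCo environment with a single `coach` invocation, and states that every experiment uses 3 seeds. A sweep over the five environments and three seeds can be sketched as below; this is a dry run that only prints the commands, and the `--seed` flag is an assumption to verify against `coach --help` for your Coach version:

```shell
#!/usr/bin/env bash
# Sketch: enumerate the 3-seed SAC benchmark runs from the README above.
# Prints each command instead of executing it; drop `echo` to actually train.
# NOTE: `--seed` is assumed CLI syntax, not confirmed by this diff.
for level in inverted_pendulum hopper half_cheetah walker2d humanoid; do
  for seed in 0 1 2; do
    echo coach -p Mujoco_SAC -lvl "$level" --seed "$seed"
  done
done
```

Each printed line corresponds to one of the 15 training runs behind the benchmark plots added in this diff.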