This repository was archived by the owner on Dec 11, 2022. It is now read-only.

Commit 74db141

guyk1971 authored and shadiendrawis committed
SAC algorithm (#282)

* SAC algorithm
* SAC - updates to the agent (learn_from_batch), sac_head and sac_q_head to fix a problem in the gradient calculation; the SAC agent is now able to train. gym_environment - fixed an error in access to gym.spaces
* Soft Actor Critic - code cleanup
* code cleanup
* V-head initialization fix
* SAC benchmarks
* SAC documentation
* typo fix
* documentation fixes
* documentation and version update
* README typo
1 parent 33dc29e commit 74db141
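
For orientation, the targets that a SAC `learn_from_batch` step regresses the Q-heads and V-head onto are defined in the paper linked in the diff below (https://arxiv.org/abs/1801.01290). The following NumPy sketch restates them; it is illustrative only and is not the code added in this commit — the function name `sac_targets` and the fixed temperature `alpha=0.2` are assumptions.

```python
import numpy as np

def sac_targets(r, done, v_next_target, q1, q2, log_pi, gamma=0.99, alpha=0.2):
    """Illustrative SAC regression targets (paper notation), not Coach's code.

    All arguments are 1-D arrays over a sampled replay batch:
      r             -- rewards
      done          -- episode-termination flags (0/1)
      v_next_target -- target V-network values for the next states, V_bar(s')
      q1, q2        -- the two Q-heads evaluated at fresh actions a ~ pi(.|s)
      log_pi        -- log pi(a|s) for those fresh actions
    """
    # Q-head target: y_Q = r + gamma * (1 - done) * V_bar(s')
    y_q = r + gamma * (1.0 - done) * v_next_target

    # V-head target: y_V = min(Q1, Q2) - alpha * log pi(a|s)
    # (the entropy term is what makes the value function "soft")
    y_v = np.minimum(q1, q2) - alpha * log_pi
    return y_q, y_v
```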

File tree: 92 files changed, +2813 −403 lines


README.md

Lines changed: 3 additions & 2 deletions
```diff
@@ -25,11 +25,11 @@ coach -p CartPole_DQN -r
 <img src="img/doom_health.gif" alt="Doom Health Gathering"/> <img src="img/minitaur.gif" alt="PyBullet Minitaur" width = "249" height ="200"/> <img src="img/ant.gif" alt="Gym Extensions Ant"/>
 <br><br>
 
-Blog posts from the Intel® AI website:
 * [Release 0.8.0](https://ai.intel.com/reinforcement-learning-coach-intel/) (initial release)
 * [Release 0.9.0](https://ai.intel.com/reinforcement-learning-coach-carla-qr-dqn/)
 * [Release 0.10.0](https://ai.intel.com/introducing-reinforcement-learning-coach-0-10-0/)
-* [Release 0.11.0](https://ai.intel.com/rl-coach-data-science-at-scale) (current release)
+* [Release 0.11.0](https://ai.intel.com/rl-coach-data-science-at-scale)
+* Release 0.12.0 (current release)
 
 Contacting the Coach development team is also possible through the email [[email protected]]([email protected])
 
@@ -277,6 +277,7 @@ dashboard
 * [Clipped Proximal Policy Optimization (CPPO)](https://arxiv.org/pdf/1707.06347.pdf) | **Multi Worker Single Node** ([code](rl_coach/agents/clipped_ppo_agent.py))
 * [Generalized Advantage Estimation (GAE)](https://arxiv.org/abs/1506.02438) ([code](rl_coach/agents/actor_critic_agent.py#L86))
 * [Sample Efficient Actor-Critic with Experience Replay (ACER)](https://arxiv.org/abs/1611.01224) | **Multi Worker Single Node** ([code](rl_coach/agents/acer_agent.py))
+* [Soft Actor-Critic (SAC)](https://arxiv.org/abs/1801.01290) ([code](rl_coach/agents/soft_actor_critic_agent.py))
 
 ### General Agents
 * [Direct Future Prediction (DFP)](https://arxiv.org/abs/1611.01779) | **Multi Worker Single Node** ([code](rl_coach/agents/dfp_agent.py))
```
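
The new list entry points at `rl_coach/agents/soft_actor_critic_agent.py`. A minimal preset wiring that agent to a Gym environment might look like the sketch below, assuming the agent exposes a `SoftActorCriticAgentParameters` class per Coach's usual `<Agent>Parameters` naming; the level string `'Hopper-v2'` is likewise just an example:

```python
# Hypothetical preset sketch. SoftActorCriticAgentParameters is assumed from
# Coach's <Agent>Parameters convention -- verify against
# rl_coach/agents/soft_actor_critic_agent.py before relying on it.
from rl_coach.agents.soft_actor_critic_agent import SoftActorCriticAgentParameters
from rl_coach.environments.gym_environment import GymVectorEnvironment
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import SimpleSchedule

graph_manager = BasicRLGraphManager(
    agent_params=SoftActorCriticAgentParameters(),
    env_params=GymVectorEnvironment(level='Hopper-v2'),  # example level
    schedule_params=SimpleSchedule())
```

In practice the `Mujoco_SAC` preset added by this commit (see the benchmark READMEs below) is the supported entry point.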

benchmarks/README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -37,6 +37,7 @@ The environments that were used for testing include:
 |**[ACER](acer)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Atari | |
 |**[Clipped PPO](clipped_ppo)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Mujoco | |
 |**[DDPG](ddpg)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Mujoco | |
+|**[SAC](sac)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Mujoco | |
 |**[NEC](nec)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Atari | |
 |**[HER](ddpg_her)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Fetch | |
 |**[DFP](dfp)** | ![#ceffad](https://placehold.it/15/ceffad/000000?text=+) |Doom | Doom Battle was not verified |
```

benchmarks/clipped_ppo/README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 # Clipped PPO
 
-Each experiment uses 3 seeds and is trained for 10k environment steps.
+Each experiment uses 3 seeds and is trained for 10M environment steps.
 The parameters used for Clipped PPO are the same parameters as described in the [original paper](https://arxiv.org/abs/1707.06347).
 
 ### Inverted Pendulum Clipped PPO - single worker
```

benchmarks/ddpg/README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 # DDPG
 
-Each experiment uses 3 seeds and is trained for 2k environment steps.
+Each experiment uses 3 seeds and is trained for 2M environment steps.
 The parameters used for DDPG are the same parameters as described in the [original paper](https://arxiv.org/abs/1509.02971).
 
 ### Inverted Pendulum DDPG - single worker
```

benchmarks/sac/README.md

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,48 @@
1+
# Soft Actor Critic
2+
3+
Each experiment uses 3 seeds and is trained for 3M environment steps.
4+
The parameters used for SAC are the same parameters as described in the [original paper](https://arxiv.org/abs/1801.01290).
5+
6+
### Inverted Pendulum SAC - single worker
7+
8+
```bash
9+
coach -p Mujoco_SAC -lvl inverted_pendulum
10+
```
11+
12+
<img src="inverted_pendulum_sac.png" alt="Inverted Pendulum SAC" width="800"/>
13+
14+
15+
### Hopper Clipped SAC - single worker
16+
17+
```bash
18+
coach -p Mujoco_SAC -lvl hopper
19+
```
20+
21+
<img src="hopper_sac.png" alt="Hopper SAC" width="800"/>
22+
23+
24+
### Half Cheetah Clipped SAC - single worker
25+
26+
```bash
27+
coach -p Mujoco_SAC -lvl half_cheetah
28+
```
29+
30+
<img src="half_cheetah_sac.png" alt="Half Cheetah SAC" width="800"/>
31+
32+
33+
### Walker 2D Clipped SAC - single worker
34+
35+
```bash
36+
coach -p Mujoco_SAC -lvl walker2d
37+
```
38+
39+
<img src="walker2d_sac.png" alt="Walker 2D SAC" width="800"/>
40+
41+
42+
### Humanoid Clipped SAC - single worker
43+
44+
```bash
45+
coach -p Mujoco_SAC -lvl humanoid
46+
```
47+
48+
<img src="humanoid_sac.png" alt="Humanoid SAC" width="800"/>
benchmarks/sac/half_cheetah_sac.png (new image, 65.6 KB)

benchmarks/sac/hopper_sac.png (new image, 97.3 KB)

benchmarks/sac/humanoid_sac.png (new image, 89.6 KB)

benchmarks/sac/inverted_pendulum_sac.png (new image, 48.9 KB)

benchmarks/sac/walker2d_sac.png (new image, 76.6 KB)

0 commit comments