
Commit 1b49e41

aslanides authored and copybara-github committed
Take mean, rather than sum, of the q-learning loss over batch in DQN baseline.
While this probably makes little to no difference to the optimization, it does allow easier comparison of losses for different agents by making the loss invariant to the batch size. Resolves #23.

PiperOrigin-RevId: 309683380
Change-Id: Id5fbefbd10af4e9ee58ab8add887fd8e8c50c033
1 parent 8e118a7 commit 1b49e41

1 file changed: 1 addition, 1 deletion

bsuite/baselines/tf/dqn/agent.py

Lines changed: 1 addition & 1 deletion
@@ -126,7 +126,7 @@ def _training_step(self, transitions: Sequence[tf.Tensor]) -> tf.Tensor:
     # One-step Q-learning loss.
     target = r_t + d_t * self._discount * qa_t
     td_error = qa_tm1 - target
-    loss = 0.5 * tf.reduce_sum(td_error**2)  # []
+    loss = 0.5 * tf.reduce_mean(td_error**2)  # []

     # Update the online network via SGD.
     variables = self._online_network.trainable_variables
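
For context, the sketch below illustrates why averaging rather than summing makes the loss invariant to the batch size. It is a standalone example, not the bsuite agent itself: the tensor names, shapes, and random values are illustrative assumptions.

# Minimal sketch, assuming TF2 eager mode; tensors stand in for a batch of
# transitions and are not the bsuite agent's internals.
import tensorflow as tf

batch_size = 32
discount = 0.99

# Placeholder batch of transitions, each of shape [batch_size].
r_t = tf.random.uniform([batch_size])      # rewards
d_t = tf.ones([batch_size])                # continuation flags (1 - done)
qa_tm1 = tf.random.uniform([batch_size])   # Q(s_{t-1}, a_{t-1})
qa_t = tf.random.uniform([batch_size])     # bootstrap value at s_t

target = r_t + d_t * discount * qa_t
td_error = qa_tm1 - target

loss_sum = 0.5 * tf.reduce_sum(td_error ** 2)    # grows with batch_size
loss_mean = 0.5 * tf.reduce_mean(td_error ** 2)  # invariant to batch_size

With the summed loss, doubling the batch size roughly doubles the reported loss even if per-transition errors are unchanged; the mean keeps reported losses comparable across agents and batch sizes, which is the motivation stated in the commit message.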
