Thanks for sharing the code, but I have a question.
According to buffer.py, here:
def _shift_sequences(self, obs, actions, rewards, terminals):
    obs = obs[1:]
    actions = actions[:-1]
    rewards = rewards[:-1]
    terminals = terminals[:-1]
    return obs, actions, rewards, terminals
I think the intent is to align states with rewards. But in trainer.py, here:
obs, actions, rewards, terms = self.buffer.sample()
obs = torch.tensor(obs, dtype=torch.float32).to(self.device) # t, t+seq_len
actions = torch.tensor(actions, dtype=torch.float32).to(self.device) # t-1, t+seq_len-1
rewards = torch.tensor(rewards, dtype=torch.float32).to(self.device).unsqueeze(-1) # t-1 to t+seq_len-1
nonterms = torch.tensor(1-terms, dtype=torch.float32).to(self.device).unsqueeze(-1) # t-1 to t+seq_len-1
Why does the sequence of rewards start at t-1?
When prefilling the buffer, a transition (s_t, a_t, r_t+1, d_t+1) is pushed into the buffer, and r_t+1 corresponds to s_t+1. So after calling _shift_sequences, the states and the rewards should already be aligned, which is why I think the rewards should start at t rather than t-1.
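To make the indexing concrete, here is a toy sketch with plain lists (a hypothetical length-5 sequence, assuming slot t in the buffer stores (s_t, a_t, r_t+1, d_t+1)) of what the slicing in _shift_sequences produces:

```python
# Hypothetical toy sequence, assuming buffer slot t holds (s_t, a_t, r_t+1).
# The integer values just encode the time index of each element.
obs = [0, 1, 2, 3, 4]      # s_0 .. s_4
actions = [0, 1, 2, 3, 4]  # a_0 .. a_4
rewards = [1, 2, 3, 4, 5]  # r_1 .. r_5 (reward received on entering s_{t+1})

# Same slicing as _shift_sequences:
obs_s = obs[1:]           # s_1 .. s_4
actions_s = actions[:-1]  # a_0 .. a_3
rewards_s = rewards[:-1]  # r_1 .. r_4

# At index i: obs_s[i] = s_{i+1}, actions_s[i] = a_i, rewards_s[i] = r_{i+1},
# i.e. each state is paired with the reward received on entering it.
for o, a, r in zip(obs_s, actions_s, rewards_s):
    print(f"s_{o}  <-  a_{a},  r_{r}")
```

Under this assumption the shifted rewards line up with the shifted states, so I would expect the reward sequence to start at t, not t-1.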