
Why does the sequence of rewards start at t-1? #4

@xjh1020

Description


Thanks for sharing the code, but I have a question.
According to buffer.py, here:

def _shift_sequences(self, obs, actions, rewards, terminals):
    obs = obs[1:]
    actions = actions[:-1]
    rewards = rewards[:-1]
    terminals = terminals[:-1]
    return obs, actions, rewards, terminals
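
To make the effect of the shift concrete, here is a minimal sketch with toy NumPy arrays (values are illustrative, not from the repo):

```python
import numpy as np

# Toy arrays of length 5, just to see what the slicing produces
# (values are illustrative, not from the repo).
obs       = np.arange(5)          # 0, 1, 2, 3, 4
actions   = np.arange(5) + 10     # 10, 11, 12, 13, 14
rewards   = np.arange(5) + 100    # 100, 101, 102, 103, 104
terminals = np.zeros(5)

# The same slicing as _shift_sequences: obs drops its first element,
# the other arrays drop their last, so every output has length 4.
obs_s, act_s = obs[1:], actions[:-1]
rew_s, term_s = rewards[:-1], terminals[:-1]

# obs_s[k] is obs[k+1], while act_s[k] and rew_s[k] are actions[k] and
# rewards[k]: each "next observation" is paired with the action and
# reward stored one buffer index earlier.
```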

I think you want to align states with rewards, but in trainer.py, here:

obs, actions, rewards, terms = self.buffer.sample()
obs = torch.tensor(obs, dtype=torch.float32).to(self.device)                         # t to t+seq_len
actions = torch.tensor(actions, dtype=torch.float32).to(self.device)                 # t-1 to t+seq_len-1
rewards = torch.tensor(rewards, dtype=torch.float32).to(self.device).unsqueeze(-1)   # t-1 to t+seq_len-1
nonterms = torch.tensor(1-terms, dtype=torch.float32).to(self.device).unsqueeze(-1)  # t-1 to t+seq_len-1

Why does the sequence of rewards start at t-1?
When prefilling the buffer, a transition (s_t, a_t, r_t+1, d_t+1) is pushed into the buffer, but r_t+1 corresponds to s_t+1. So when _shift_sequences is called, the states and rewards should already be aligned, and I think the rewards should start at t rather than t-1.
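
The indexing argument above can be sketched as follows, assuming each push stores (s_t, a_t, r_t+1, d_t+1) at the same buffer index t, as described in the question (variable names and values are illustrative, not from the repo):

```python
# Hypothetical episode: at each step t we push (s_t, a_t, r_{t+1}, d_{t+1})
# into parallel lists, mirroring the push order described in the question.
obs, actions, rewards, terminals = [], [], [], []
for t in range(4):
    s_t, a_t = t, 10 + t
    r_next, d_next = 100 + (t + 1), 0   # r_{t+1}, d_{t+1}
    obs.append(s_t)
    actions.append(a_t)
    rewards.append(r_next)
    terminals.append(d_next)

# Apply the same shift as _shift_sequences:
obs_s     = obs[1:]       # s_1, s_2, s_3
rewards_s = rewards[:-1]  # r_1, r_2, r_3 (since rewards[i] == r_{i+1})

# obs_s[k] is s_{k+1} and rewards_s[k] is r_{k+1}: under this push
# convention the shifted rewards line up with the shifted observations
# and start at t, not t-1, which is the point of the question.
```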
