
Why does the sequence of rewards start at t-1? #4

@xjh1020

Description


Thanks for sharing the code, but I have a question.
According to buffer.py, here:

def _shift_sequences(self, obs, actions, rewards, terminals):
    obs = obs[1:]
    actions = actions[:-1]
    rewards = rewards[:-1]
    terminals = terminals[:-1]
    return obs, actions, rewards, terminals
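
To make the effect of the shift concrete, here is a minimal sketch with toy NumPy arrays (values are illustrative, not from the repo):

```python
import numpy as np

# Toy arrays of length 5, just to see what the slicing produces
# (values are illustrative, not from the repo).
obs       = np.arange(5)          # 0, 1, 2, 3, 4
actions   = np.arange(5) + 10     # 10, 11, 12, 13, 14
rewards   = np.arange(5) + 100    # 100, 101, 102, 103, 104
terminals = np.zeros(5)

# The same slicing as _shift_sequences: obs drops its first element,
# the other arrays drop their last, so every output has length 4.
obs_s, act_s = obs[1:], actions[:-1]
rew_s, term_s = rewards[:-1], terminals[:-1]

# obs_s[k] is obs[k+1], while act_s[k] and rew_s[k] are actions[k] and
# rewards[k]: each "next observation" is paired with the action and
# reward stored one buffer index earlier.
```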

I think you want to align states with rewards, but in trainer.py, here:

obs, actions, rewards, terms = self.buffer.sample()
obs = torch.tensor(obs, dtype=torch.float32).to(self.device)                         # t to t+seq_len
actions = torch.tensor(actions, dtype=torch.float32).to(self.device)                 # t-1 to t+seq_len-1
rewards = torch.tensor(rewards, dtype=torch.float32).to(self.device).unsqueeze(-1)   # t-1 to t+seq_len-1
nonterms = torch.tensor(1-terms, dtype=torch.float32).to(self.device).unsqueeze(-1)  # t-1 to t+seq_len-1

Why does the sequence of rewards start at t-1?
When prefilling the buffer, a transition (s_t, a_t, r_t+1, d_t+1) is pushed into the buffer, but r_t+1 corresponds to s_t+1. So when _shift_sequences is called, the states and rewards should already be aligned, and I think the rewards should start at t rather than t-1.
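
The indexing argument above can be sketched as follows, assuming each push stores (s_t, a_t, r_t+1, d_t+1) at the same buffer index t, as described in the question (variable names and values are illustrative, not from the repo):

```python
# Hypothetical episode: at each step t we push (s_t, a_t, r_{t+1}, d_{t+1})
# into parallel lists, mirroring the push order described in the question.
obs, actions, rewards, terminals = [], [], [], []
for t in range(4):
    s_t, a_t = t, 10 + t
    r_next, d_next = 100 + (t + 1), 0   # r_{t+1}, d_{t+1}
    obs.append(s_t)
    actions.append(a_t)
    rewards.append(r_next)
    terminals.append(d_next)

# Apply the same shift as _shift_sequences:
obs_s     = obs[1:]       # s_1, s_2, s_3
rewards_s = rewards[:-1]  # r_1, r_2, r_3 (since rewards[i] == r_{i+1})

# obs_s[k] is s_{k+1} and rewards_s[k] is r_{k+1}: under this push
# convention the shifted rewards line up with the shifted observations
# and start at t, not t-1, which is the point of the question.
```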
