Description
Hello!
In my research with PPO I have 365 samples, which is basically one feature vector for every day of the year. When training the model on this data I kept the default parameter n_steps=2048 and simply set total_timesteps to 100,000.
However, I do not quite understand what happens when the agent finishes these 365 steps. Does the environment keep restarting on the same data until the rollout reaches 2048 steps overall, and only then update the policy?
The results I got seemed a bit better with n_steps=2048 than with n_steps=365, keeping total_timesteps the same in both cases. Could the reason be that updating the policy less often (every 2048 steps, so roughly 49 updates over 100,000 timesteps) is more stable than updating every 365 steps (roughly 274 updates)?
I would greatly appreciate any tips regarding these parameters, and especially an explanation of how a rollout reaches 2048 steps when an episode is only 365 steps long!
Here is sample code for the two parameter settings (env and logdir are defined earlier in my script):

from stable_baselines3 import PPO

# case I: default n_steps (2048)
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=logdir)
model.learn(total_timesteps=100000)

# case II: n_steps matched to the episode length
model = PPO('MlpPolicy', env, n_steps=365, verbose=1, tensorboard_log=logdir)
model.learn(total_timesteps=100000)
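
For reference, here is a toy check I put together to see how episode resets interact with the rollout (YearEnv is a hypothetical stand-in for my real environment, assuming a recent stable_baselines3 that uses gymnasium): it counts how many times the env is reset during a single 2048-step rollout. If collection really continues across episode boundaries, I would expect about 2048 // 365 = 5 mid-rollout resets plus the initial one.

import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO

class YearEnv(gym.Env):
    """Toy episodic env that terminates after 365 steps, one per 'day'."""
    def __init__(self):
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self.day = 0
        self.resets = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.day = 0
        self.resets += 1  # count every reset, including mid-rollout auto-resets
        return self.observation_space.sample(), {}

    def step(self, action):
        self.day += 1
        terminated = self.day >= 365  # episode ends after one "year"
        return self.observation_space.sample(), 0.0, terminated, False, {}

env = YearEnv()
model = PPO('MlpPolicy', env, n_steps=2048, verbose=0)
model.learn(total_timesteps=2048)  # exactly one rollout before the first update
# expecting 1 initial reset + 5 episode ends inside the rollout = 6
print(env.resets)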