The documentation of PPO describes the training process as training ppo_agent on one large batch of experiences for 40 epochs. However, if the number of experiences is high (e.g. 1024 experiences), you might want to train PPO on mini-batches instead (e.g. 4 mini-batches of 256 experiences, trained for 40 epochs).
The only way I found to do that is to build a dataset from replay_buffer and fetch experiences by iterating over it. However, this produces randomly sampled batches instead of evenly partitioned mini-batches:
# Use 1 epoch per train() call; the 40-epoch loop is done manually below
ppo_agent = PPOClipAgent(num_epochs=1, ...)
# Build a dataset iterator over the replay buffer (uniform random sampling)
dataset = replay_buffer.as_dataset(sample_batch_size=256, num_steps=2, num_parallel_calls=2).prefetch(2)
dataset_iter = iter(dataset)
# Training part: 40 epochs, 4 mini-batches per epoch
loss = 0
for _ in range(40):
    for _ in range(4):
        mini_batch_experiences, _ = next(dataset_iter)
        # train() returns a LossInfo namedtuple; accumulate its loss field
        loss += ppo_agent.train(mini_batch_experiences).loss
replay_buffer.clear()
loss /= (40 * 4)
However, this approach has the following issue: it samples 256 experiences from the buffer uniformly at random, which does not guarantee that each experience is selected the same number of times. Is there a better method to train PPO? Also, for some reason this takes far more time to train than the single-batch approach above and gives worse training results, so am I missing something else here?
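For reference, what I have in mind is something like the following rough, untested sketch, where each epoch makes one deterministic pass over the buffer so that every experience is used exactly once per epoch (this assumes the single_deterministic_pass flag of as_dataset actually behaves that way):

num_epochs = 40
loss = 0.0
num_updates = 0
for _ in range(num_epochs):
    # One deterministic pass over the buffer: non-overlapping mini-batches,
    # each experience visited once per epoch (assumption, not verified)
    dataset = replay_buffer.as_dataset(
        sample_batch_size=256,
        num_steps=2,
        single_deterministic_pass=True)
    for mini_batch_experiences, _ in dataset:
        loss += ppo_agent.train(mini_batch_experiences).loss
        num_updates += 1
replay_buffer.clear()
loss /= num_updates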