PPO with Mini-Batches Tutorial #805

Open
kochlisGit opened this issue Dec 14, 2022 · 0 comments

Comments

@kochlisGit

The PPO documentation describes the training process as follows:

from tf_agents.agents.ppo.ppo_clip_agent import PPOClipAgent
from tf_agents.replay_buffers.tf_uniform_replay_buffer import TFUniformReplayBuffer

# Build PPO agent
ppo_agent = PPOClipAgent(num_epochs=40, ...)

# Build replay buffer
replay_buffer = TFUniformReplayBuffer(
    data_spec=ppo_agent.collect_data_spec,
    batch_size=env.batch_size,
    max_length=1000)

# Train the agent on all collected experience, then clear the buffer
experiences = replay_buffer.gather_all()  # gather_all() returns the trajectories directly
loss = ppo_agent.train(experiences).loss
replay_buffer.clear()

However, that way ppo_agent is trained on one large batch of experiences for 40 epochs. If the number of experiences is large (e.g. 1024), you might instead want to train PPO on mini-batches (e.g. 4 mini-batches of 256 experiences each, for 40 epochs).

The only way I found to do that is to build a dataset from the replay_buffer and fetch experiences by iterating over it. However, this yields randomly sampled batches instead of evenly partitioned mini-batches:

# Use 1 epoch per train() call
ppo_agent = PPOClipAgent(num_epochs=1, ...)

# Build a dataset iterator over the replay buffer
dataset = replay_buffer.as_dataset(
    sample_batch_size=256, num_steps=2, num_parallel_calls=2).prefetch(2)
dataset_iter = iter(dataset)

# Training part: 40 epochs, 4 randomly sampled mini-batches per epoch
loss = 0
for _ in range(40):
    for _ in range(4):
        mini_batch_experiences, _ = next(dataset_iter)
        loss += ppo_agent.train(mini_batch_experiences).loss
replay_buffer.clear()
loss /= (40 * 4)

However, this approach has the following issue: it samples 256 experiences from the buffer uniformly at random, which does not guarantee that every experience is selected equally often. Is there a better method to train PPO on mini-batches? Also, for some reason, this takes much longer to train than the single-batch approach above and gives worse training results, so am I missing something else here?
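
One workaround I have been considering is to gather everything once and split it into fixed, non-overlapping mini-batches myself. Below is a rough sketch, not an official TF-Agents recipe: it assumes the first dimension of the gathered trajectories is the batch dimension, and the names num_passes, num_minibatches and mean_loss are placeholders chosen for the example.

import tensorflow as tf

num_passes = 40        # full passes over the buffer
num_minibatches = 4    # equal, non-overlapping slices per pass

experiences = replay_buffer.gather_all()
batch_size = tf.nest.flatten(experiences)[0].shape[0]
minibatch_size = batch_size // num_minibatches

total_loss = 0.0
for _ in range(num_passes):
    # Shuffle the batch indices once per pass, then slice without replacement,
    # so every experience is used exactly once per pass.
    indices = tf.random.shuffle(tf.range(batch_size))
    for i in range(num_minibatches):
        idx = indices[i * minibatch_size:(i + 1) * minibatch_size]
        mini_batch = tf.nest.map_structure(
            lambda t: tf.gather(t, idx, axis=0), experiences)
        total_loss += ppo_agent.train(mini_batch).loss
replay_buffer.clear()
mean_loss = total_loss / (num_passes * num_minibatches)

If there is a built-in way to do this kind of epoch-wise partitioning, a pointer to it in the tutorial would be very helpful.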
