
DreamerV3: Hardware Resources Underutilized? #288

Closed
defrag-bambino opened this issue May 15, 2024 · 3 comments

@defrag-bambino

Hi,

When I run DreamerV3 experiments, especially ones with a replay_ratio > 1.0, training takes quite a long time. During these runs my hardware resources are barely used (e.g. only 1-2 CPU cores at around 50% each), so there is clearly more computational power available.
I was wondering if there is anything I can do to make SheepRL use more of the available hardware resources. I am already running multiple environments in parallel. I also tried increasing num_threads, but this seems to have no effect.

Here is a simple example training command:

sheeprl fabric.accelerator=cuda fabric.strategy=ddp fabric.devices=1 fabric.precision=16-mixed exp=dreamer_v3 algo=dreamer_v3_S env=gym env.id=CartPole-v1 algo.total_steps=10000 algo.cnn_keys.encoder=\[\] algo.mlp_keys.encoder=\["vector"\] algo.cnn_keys.decoder=\[\] algo.mlp_keys.decoder=\["vector"\] env.num_envs=12 num_threads=16 checkpoint.every=1000 metric.log_every=100 algo.replay_ratio=10.0

Training this for around 8000 steps, where it reached the ~500 reward threshold, took about 3 hours. The log data lists a Time/sps_train of ~0.046 (which I assume is environment steps per second).

Thanks in advance for this great library!

@belerico
Member

belerico commented May 15, 2024

Hi @defrag-bambino, the slowdown when raising the replay ratio is expected: the higher the replay ratio, the more gradient steps the agent computes per policy step. Since the training steps happen mainly on the GPU, I would look at the GPU stats rather than the CPU stats (the CPU is used mainly for storing experiences in the buffer and running a fairly simple environment in this case).
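As a back-of-envelope sketch (assuming the replay ratio simply multiplies the number of gradient updates per policy step, which is an approximation rather than SheepRL's exact bookkeeping), the run above implies roughly:

# Rough estimate only: assumes gradient updates scale linearly with replay_ratio.
policy_steps = 8_000    # environment steps reported above
replay_ratio = 10.0     # algo.replay_ratio from the command above
gradient_steps = policy_steps * replay_ratio
print(gradient_steps)   # ~80k gradient updates, i.e. ~10x the work of replay_ratio=1.0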

Furthermore, I suggest not using fabric.strategy=ddp when running on a single device.

Another suggestion to speed up training is to use this branch, where we have introduced compilation through torch.compile, which should speed up your training on the right GPU.
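For illustration only, this is the general pattern torch.compile follows when wrapping a module (a minimal sketch with a placeholder network, not SheepRL's actual DreamerV3 models or the integration in that branch; assumes PyTorch >= 2.0):

import torch
import torch.nn as nn

# Placeholder model standing in for the agent's networks (assumption, for illustration).
model = nn.Sequential(nn.Linear(4, 256), nn.SiLU(), nn.Linear(256, 2)).cuda()
compiled_model = torch.compile(model)  # JIT-compiles the forward pass

x = torch.randn(32, 4, device="cuda")
out = compiled_model(x)  # first call triggers compilation; subsequent calls reuse the compiled graph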

If you try out that branch, could you kindly report your findings in this issue?

Thank you

@belerico
Member

Hi @defrag-bambino, has this fixed your issue? Are there any other considerations you want to share?

@defrag-bambino
Author

Yes, this is OK for now! Thanks
