
DreamerV3: Hardware Resources Underutilized? #288

Closed
defrag-bambino opened this issue May 15, 2024 · 3 comments

@defrag-bambino

Hi,

When I run DreamerV3 experiments, especially ones with a replay_ratio > 1.0, training takes quite a long time. During these runs my hardware resources are barely used (e.g. only 1-2 CPU cores at around 50% each), so there is clearly more computational power available.
I was wondering if there is anything I can do to make SheepRL use more of the available hardware resources. I am already running multiple environments in parallel. I also tried increasing num_threads, but this seems to have no effect.

Here is a simple example training command:

sheeprl fabric.accelerator=cuda fabric.strategy=ddp fabric.devices=1 fabric.precision=16-mixed exp=dreamer_v3 algo=dreamer_v3_S env=gym env.id=CartPole-v1 algo.total_steps=10000 algo.cnn_keys.encoder=\[\] algo.mlp_keys.encoder=\["vector"\] algo.cnn_keys.decoder=\[\] algo.mlp_keys.decoder=\["vector"\] env.num_envs=12 num_threads=16 checkpoint.every=1000 metric.log_every=100 algo.replay_ratio=10.0

Training this for around 8000 steps, where it reached the ~500 reward threshold, took about 3 hours. The log data lists a Time/sps_train of ~0.046 (which I assume is environment steps per second).

Thanks in advance for this great library!

@belerico
Member

belerico commented May 15, 2024

Hi @defrag-bambino, the slowdown when raising the replay ratio is expected: the higher the replay ratio, the more gradient steps the agent computes per policy step. Since the training steps happen mainly on the GPU, I would look at the GPU stats rather than the CPU stats (the CPU is used mainly for storing experiences in the buffer and running a fairly simple environment in this case).
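As a back-of-envelope sketch (assuming the replay ratio simply multiplies the number of gradient updates per policy step, which is an approximation rather than SheepRL's exact bookkeeping), the run above implies roughly:

# Rough estimate only: assumes gradient updates scale linearly with replay_ratio.
policy_steps = 8_000    # environment steps reported above
replay_ratio = 10.0     # algo.replay_ratio from the command above
gradient_steps = policy_steps * replay_ratio
print(gradient_steps)   # ~80k gradient updates, i.e. ~10x the work of replay_ratio=1.0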

Furthermore, I suggest not using fabric.strategy=ddp when running on a single device.

Another suggestion to speed up training is to use this branch, where we have introduced compilation through torch.compile, which should speed up your training on the right GPU.
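For illustration only, this is the general pattern torch.compile follows when wrapping a module (a minimal sketch with a placeholder network, not SheepRL's actual DreamerV3 models or the integration in that branch; assumes PyTorch >= 2.0):

import torch
import torch.nn as nn

# Placeholder model standing in for the agent's networks (assumption, for illustration).
model = nn.Sequential(nn.Linear(4, 256), nn.SiLU(), nn.Linear(256, 2)).cuda()
compiled_model = torch.compile(model)  # JIT-compiles the forward pass

x = torch.randn(32, 4, device="cuda")
out = compiled_model(x)  # first call triggers compilation; subsequent calls reuse the compiled graph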

If you try out that branch, could you kindly report your findings in this issue?

Thank you

@belerico
Member

Hi @defrag-bambino, has this fixed your issue? Are there any other considerations you want to share?

@defrag-bambino
Author

Yes, this is OK for now! Thanks
