Hi! I've been using ts for a very long time and am currently running into a small problem.
I'm planning to run around 500-1000 experiments, for which ts is perfect. I have successfully done so with many more in the past.
I let it use all 16 GPUs with TS_SLOTS=16, and it runs all 16 in parallel. So far, so good. However, the individual experiments are not very memory-hungry. They each take around 10-20% of GPU memory. So, I could run 64 or even 128 in parallel.
So I start the server with TS_VISIBLE_DEVICES=0..15 TS_SLOTS=128 ts --set_gpu_free_perc 20 and schedule the jobs with a bash script (~1000 calls to ts -G 1 ...).
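Roughly, the submission loop looks like this (a sketch: the training script and its --seed argument are hypothetical placeholders for the real experiment command, and the echo only prints the enqueue commands for illustration; drop it to actually enqueue):

```shell
#!/bin/sh
# Sketch of the submission loop: each iteration would enqueue one
# single-GPU job via task-spooler. "train.py" and --seed stand in
# for the real experiment command and its arguments.
for seed in $(seq 1 1000); do
    echo ts -G 1 python train.py --seed "$seed"
done
```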
The first 16 jobs are running while the rest sit in the allocating state. Starting an individual experiment and allocating its 20% of GPU memory takes anywhere from a few seconds to a few minutes, so at launch time all of the memory is still free.
When does ts evaluate whether at least 20% of memory is free? If it only checks at start time, it might launch all 128 jobs simultaneously and then likely run into OOM as soon as one of the runs allocates a bit more than expected.
I hope I have explained my issue clearly; it is related to #8.
Do you have a suggestion?