When is the GPU memory measured when running a large list #59

Open
bermeitinger-b opened this issue May 30, 2024 · 1 comment

Comments


bermeitinger-b commented May 30, 2024

Hi! I've been using ts for a very long time and am currently running into a small problem.

I'm planning to run around 500-1000 experiments, for which ts is perfect. I have successfully done so with many more in the past.

I let it use all 16 GPUs with TS_SLOTS=16, and it runs all 16 in parallel. So far, so good. However, the individual experiments are not very memory-hungry. They each take around 10-20% of GPU memory. So, I could run 64 or even 128 in parallel.
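As a rough check of those numbers, here is a minimal sketch of the arithmetic. It assumes every job uses the same fixed share of one GPU's memory and ignores any headroom; `max_parallel_jobs` is a hypothetical helper, not part of ts.

```python
def max_parallel_jobs(num_gpus: int, percent_per_job: int) -> int:
    """Upper bound on concurrent jobs if each job takes a fixed
    percentage of a single GPU's memory (no safety margin)."""
    jobs_per_gpu = 100 // percent_per_job  # whole jobs that fit on one GPU
    return num_gpus * jobs_per_gpu


print(max_parallel_jobs(16, 20))  # 5 jobs per GPU at 20% each
print(max_parallel_jobs(16, 10))  # 10 jobs per GPU at 10% each
```

With 16 GPUs, 64 and 128 parallel jobs correspond to 4 and 8 jobs per GPU, which stays below these bounds and leaves some margin.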

So, I start TS_VISIBLE_DEVICES=0..15 TS_SLOTS=128 ts --set_gpu_free_perc 20 and schedule the jobs with a bash script (~1000 calls to ts -G 1 ...).

The first 16 are running while the rest are in the allocating state. An individual experiment takes a few seconds to minutes to start up and allocate its roughly 20% of GPU memory, so right after launch all memory still appears free.
When does ts evaluate whether at least 20% of memory is free? If it checks only at the beginning, it might start all 128 simultaneously and probably run into OOM if one of the runs allocates a bit more.
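To make the concern concrete, here is a toy simulation of a single GPU. It is only a sketch under stated assumptions and does not describe how ts actually schedules: one job is submitted per tick, a job is admitted when the memory measured at that moment is at least `free_threshold` percent free, and each job's memory only becomes visible `startup_delay` ticks after admission. All names and parameters are hypothetical.

```python
def simulate_admission(num_jobs, free_threshold=20, job_usage=20, startup_delay=3):
    """Toy model: admit a job if the currently *visible* free memory (in percent)
    is at least free_threshold. A job's usage becomes visible only
    startup_delay ticks after it was admitted."""
    admitted_at = []  # tick at which each admitted job started
    for tick in range(num_jobs):  # one submission attempt per tick
        # Only jobs that have finished starting up show up in the measurement.
        visible_usage = sum(
            job_usage for start in admitted_at if tick - start >= startup_delay
        )
        if 100 - visible_usage >= free_threshold:
            admitted_at.append(tick)
    # Memory demand once every admitted job has allocated, in percent of one GPU.
    eventual_usage = len(admitted_at) * job_usage
    return len(admitted_at), eventual_usage


print(simulate_admission(8, startup_delay=0))  # allocation is visible immediately
print(simulate_admission(8, startup_delay=3))  # allocation lags behind admission
```

In this model, an instant allocation never lets the GPU exceed 100%, while a startup lag lets extra jobs slip in before the earlier ones show up in the measurement, so the eventual demand overshoots the GPU's capacity.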

I hope I have explained my issue clearly. It is related to #8.

Do you have a suggestion?

@lucasb-eyer

According to this comment, GPU memory is not measured anymore, and a GPU counts as occupied as soon as one job runs on it: #11 (comment)
