[BaseMultivariate] OutOfMemoryError on GPU #1202
Comments
The following code trains just fine with at most 14 GB of GPU memory on my RTX 3090, meaning it should run fine on any V100 too. You should avoid looping through tryouts, even with all the
Hello, thanks @elephaint for your answer. In your example you still have
You're right, the combo
Argh, I was stupid. The following runs with a very low mem cost on my GPU:
This should close the issue, re-open if required.
Hello again @elephaint, sorry perhaps I didn't express myself clearly enough in my first message.
Thanks in advance!
I think these are indeed two issues.
Exactly @jmoralez, for the first issue it's actually what I suspected; for the 2nd, it's indeed at this line that the error happens. I'll try running it on other types of graphics cards... Edit: Same issue with an NVIDIA T4 on Google Colab.
What happened + What you expected to happen
Hello Nixtla community,
I suspect a bug, or a sampling optimization problem, especially for `BaseMultivariate` models!

First of all, the number of additional columns in `df` (exogenous variables) has a strong impact on memory demand, even if they are not taken into account by the model (see the example below with `TSMixer`). I think this problem also exists for univariate solutions, but the impact scales far less dramatically than for multivariate ones.

Secondly, even if you keep only the necessary columns, it's still hard for the samples to fit into GPU memory during training when you have tens of thousands of series.
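Since unused columns still inflate memory, a simple workaround is to drop everything the model does not consume before fitting. A minimal sketch with pandas (the frame and column names are hypothetical, not from the issue's actual data):

```python
import pandas as pd

# Hypothetical long-format input frame: id / timestamp / target plus
# exogenous columns the model will not use.
df = pd.DataFrame({
    "unique_id": ["A", "A", "B", "B"],
    "ds": pd.to_datetime(["2024-01-01", "2024-01-02"] * 2),
    "y": [1.0, 2.0, 3.0, 4.0],
    "exog_1": [0.1, 0.2, 0.3, 0.4],   # unused exogenous feature
    "exog_2": ["x", "y", "x", "y"],   # unused static-like column
})

# Keep only the columns a plain multivariate fit actually needs.
df_slim = df[["unique_id", "ds", "y"]]
print(list(df_slim.columns))  # → ['unique_id', 'ds', 'y']
```

Passing `df_slim` instead of `df` avoids paying the memory cost for columns the model ignores.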
I know that multivariate models scale badly with the number of series, but it seems feasible if I refer to the TSMixer paper: experiments on M5 data (30,490 series with static features) with an NVIDIA Tesla V100 GPU.
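A rough back-of-the-envelope shows why multivariate batches blow up: each window spans all series at once, so its size scales with the number of series, and every extra column carried along multiplies it again. The window length and batch size below are assumptions for illustration, not values from the issue:

```python
# Rough size of one float32 multivariate batch, assuming each sample is a
# window of (input_size + horizon) time steps across ALL series at once.
n_series = 30_490          # M5, as cited from the TSMixer paper
window_len = 35            # assumed: input_size=28 + horizon=7
batch_size = 32            # assumed
n_channels = 1 + 4         # target plus 4 extra columns carried along

bytes_per_float = 4
batch_bytes = batch_size * window_len * n_series * n_channels * bytes_per_float
print(f"{batch_bytes / 1024**3:.2f} GiB per batch")  # → 0.64 GiB per batch
```

With `n_channels = 1` the same batch would be five times smaller, which matches the observation that extra columns inflate memory even when unused; intermediate activations during training multiply this baseline further.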
Versions / Dependencies
neuralforecast==1.7.5
Databricks Runtime: 14.3 LTS ML
GPU: g5.8xlarge
Reproduction script
Crashes once 4 static features are included:
OutOfMemoryError: CUDA out of memory. Tried to allocate 10.88 GiB. GPU 0 has a total capacity of 21.99 GiB of which 10.24 GiB is free. Process 31599 has 11.74 GiB memory in use. Of the allocated memory 949.35 MiB is allocated by PyTorch, and 10.52 GiB is reserved by PyTorch but unallocated.
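The message shows 10.52 GiB reserved by PyTorch but unallocated, i.e. fragmentation in the caching allocator. A general PyTorch mitigation for that symptom (a suggestion, not something confirmed in this thread) is to enable expandable segments before CUDA is initialized:

```python
import os

# Must be set BEFORE torch initializes CUDA: lets the caching allocator
# grow segments instead of reserving large fixed blocks (PyTorch >= 2.0).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# import torch  # import torch only after setting the variable
```

This only reduces fragmentation overhead; it does not change the fundamental scaling of multivariate batches with the number of series.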
Issue Severity
High: It blocks me from completing my task.