
[BaseMultivariate] OutOfMemoryError on GPU #1202

Open

Antoine-Schwartz opened this issue Nov 15, 2024 · 6 comments

@Antoine-Schwartz
What happened + What you expected to happen

Hello Nixtla community,

I suspect a bug, or at least a sampling-optimization problem, specifically with BaseMultivariate models!

First, the number of additional columns in df (exogenous variables) has a strong impact on memory demand, even when the model does not use them (see the TSMixer example below). I think this problem also exists for univariate models, but the impact grows much more steeply for multivariate ones.

Secondly, even if you keep only the necessary columns, it's still hard to fit the samples into GPU memory during training when you have tens of thousands of series.
I know that multivariate models scale badly with the number of series, but it seems feasible if I refer to the TSMixer paper: experiments on M5 data (30,490 series with static features) with an NVIDIA Tesla V100 GPU.
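
For context, a rough back-of-envelope estimate (my own assumption about the layout, not the actual neuralforecast internals): if each training step materializes a float32 window tensor of roughly (n_series, n_channels, n_windows, window_len), then with 30,000 weekly series of length 208, input_size=104 and h=52, every extra column costs close to a gigabyte per materialized batch:

# Order-of-magnitude estimate only; the real dataloader layout may differ.
n_series = 30_000
input_size, h = 104, 52
series_len = 208

window_len = input_size + h                  # 156
n_windows = series_len - window_len + 1      # 53 windows per series
bytes_per_channel = n_series * n_windows * window_len * 4  # float32

for n_extra_cols in range(5):
    n_channels = 2 + n_extra_cols            # e.g. target + mask + extra columns
    gib = n_channels * bytes_per_channel / 2**30
    print(f"{n_extra_cols} extra columns -> ~{gib:.1f} GiB per materialized window tensor")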

Versions / Dependencies

neuralforecast==1.7.5
Databricks Runtime: 14.3 LTS ML
GPU: g5.8xlarge

Reproduction script

import gc
import time

import torch
from neuralforecast import NeuralForecast
from neuralforecast.models import TSMixer
from utilsforecast.data import generate_series

n_series = 30000

# Add columns to df (static features) until memory crash
for n_static_features in range(0, 10):

    print(f"====== Nb of static feat: {n_static_features} ======")

    df = generate_series(
        n_series=n_series,
        freq="W",
        min_length=208,
        max_length=208,
        equal_ends=True,
        n_static_features=n_static_features,
    )

    model = TSMixer(
        h=52,
        input_size=104,
        n_series=n_series, 
        max_steps=60,
        val_check_steps=60,
        enable_model_summary=False,
    )
    nf = NeuralForecast(models=[model], freq='W')

    nf.fit(df=df)

    del df, model, nf
    gc.collect()
    torch.cuda.empty_cache()
    time.sleep(60)

Crash starting at 4 static features:
OutOfMemoryError: CUDA out of memory. Tried to allocate 10.88 GiB. GPU 0 has a total capacity of 21.99 GiB of which 10.24 GiB is free. Process 31599 has 11.74 GiB memory in use. Of the allocated memory 949.35 MiB is allocated by PyTorch, and 10.52 GiB is reserved by PyTorch but unallocated.

[screenshot: GPU memory usage over the test iterations]

Issue Severity

High: It blocks me from completing my task.

@elephaint
Contributor

The following code trains just fine using at most 14 GB of GPU memory on my RTX 3090, which means it should also run on any V100.

You should avoid looping through tryouts: even with all the gc and empty_cache shenanigans, it's nearly impossible to properly clear the memory (one way to isolate each run in its own process is sketched after the code below).

from neuralforecast import NeuralForecast
from neuralforecast.models import TSMixer
from utilsforecast.data import generate_series
import torch

n_series = 30490
n_static_features = 10

df = generate_series(
    n_series=n_series,
    freq="W",
    min_length=208,
    max_length=208,
    equal_ends=True,
    n_static_features=0,
)

model = TSMixer(
    h=52,
    input_size=104,
    n_series=n_series, 
    max_steps=60,
    val_check_steps=60,
    enable_model_summary=False,
)
nf = NeuralForecast(models=[model], freq='W')

nf.fit(df=df)
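
On the looping point: a minimal sketch (my own, not part of neuralforecast) of running each tryout in its own process, so that all CUDA memory is released when the child process exits:

import multiprocessing as mp

def run_trial(n_static_features: int) -> None:
    # Import inside the child so CUDA state lives and dies with the process.
    from neuralforecast import NeuralForecast
    from neuralforecast.models import TSMixer
    from utilsforecast.data import generate_series

    df = generate_series(
        n_series=30_000,
        freq="W",
        min_length=208,
        max_length=208,
        equal_ends=True,
        n_static_features=n_static_features,
    )
    model = TSMixer(h=52, input_size=104, n_series=30_000, max_steps=60)
    nf = NeuralForecast(models=[model], freq="W")
    nf.fit(df=df)

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # fresh interpreter per trial, no inherited CUDA context
    for n in range(10):
        p = ctx.Process(target=run_trial, args=(n,))
        p.start()
        p.join()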

@Antoine-Schwartz
Author

Hello, thanks @elephaint for your answer.

In your example you still have n_static_features=0 in the generate_series call. Does it really run within 14 GB on your RTX 3090 with 10 static features?

You're right that the gc + empty_cache combo isn't perfect, but it does allow you to get very close to 0 memory usage between tests, as shown by the graph above :)

@elephaint
Contributor

elephaint commented Nov 19, 2024

Argh, I was stupid. TSMixer doesn't even support exogenous variables, so you're starting from the wrong model. You should use TSMixerx and pass the exogenous variables in the correct way.

The following runs with a very low mem cost on my GPU:

from neuralforecast import NeuralForecast
from neuralforecast.models import TSMixerx
from utilsforecast.data import generate_series

n_series = 30490
n_static_features = 10

df = generate_series(
    n_series=n_series,
    freq="W",
    min_length=208,
    max_length=208,
    equal_ends=True,
    n_static_features=n_static_features,
)

static_df = df.groupby("unique_id", as_index=False).first().drop(columns=["y", "ds"])

model = TSMixerx(
    h=52,
    input_size=104,
    n_series=n_series, 
    max_steps=5,
    val_check_steps=60,
    enable_model_summary=False,
    stat_exog_list=[f"static_{i}" for i in range(n_static_features)],
)
nf = NeuralForecast(models=[model], freq='W')

nf.fit(df=df, static_df=static_df)

This should close the issue; re-open if required.

@Antoine-Schwartz
Author

Hello again @elephaint, sorry, perhaps I didn't express myself clearly enough in my first message.

  1. I'm well aware that TSMixer (without the x) doesn't support covariates; however, my example above was meant to demonstrate that even though the static columns are not used, they still increase the total memory required during training (a possible workaround is sketched after this list).

  2. I re-ran your code with TSMixerx, and no surprise: OutOfMemoryError: CUDA out of memory. Tried to allocate 22.12 GiB. GPU 0 has a total capacity of 21.99 GiB of which 20.57 GiB is free.
    Could this be a set-up problem with Databricks GPUs, or more generally with the NVIDIA A10G?
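
A possible user-side workaround for point 1 (my own sketch, not an official fix): drop the columns the model won't use before calling fit, so they never reach the dataset builder. For a plain TSMixer that is just the id/time/target triplet:

# Hypothetical mitigation: keep only the columns the model actually consumes.
used_cols = ["unique_id", "ds", "y"]
nf.fit(df=df[used_cols])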

Thanks in advance!

@jmoralez
Member

I think these are indeed two issues.

  1. I think we currently build the dataset with every column in the dataframe without checking whether they're actually used by the models; we can probably be smarter here and filter the dataframe first to keep only the features that will actually be used.
  2. Does the error happen here?

     windows = windows[:, :, final_condition, :]

     The windows are a view, so they don't consume memory until we materialize them; we could try keeping them as a view and only materialize the sample that is taken below (a self-contained sketch of this idea follows):

     # Sample windows
     n_windows = windows.shape[2]
     if self.batch_size is not None:
         w_idxs = np.random.choice(
             n_windows,
             size=self.batch_size,
             replace=(n_windows < self.batch_size),
         )
         windows = windows[:, :, w_idxs, :]
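
For illustration, a self-contained sketch of that idea (toy shapes and a hypothetical final_condition mask, not the actual neuralforecast code): draw the window indices from the valid set first, and only then materialize the selected windows, so the full boolean-masked copy is never created:

import numpy as np
import torch

# Toy stand-in for the window tensor: (n_series, n_channels, n_windows, window_len).
windows = torch.randn(64, 3, 500, 156)
final_condition = torch.rand(500) > 0.2   # hypothetical validity mask over windows
batch_size = 128

# Instead of materializing windows[:, :, final_condition, :] (a copy of every
# valid window), sample from the valid indices first...
valid_idxs = torch.nonzero(final_condition, as_tuple=True)[0].numpy()
w_idxs = np.random.choice(
    valid_idxs,
    size=batch_size,
    replace=(len(valid_idxs) < batch_size),
)

# ...and only materialize the sampled windows.
sampled = windows[:, :, torch.as_tensor(w_idxs), :]
print(sampled.shape)  # torch.Size([64, 3, 128, 156])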

@Antoine-Schwartz
Author

Antoine-Schwartz commented Nov 20, 2024

Exactly @jmoralez. For the first issue, that's what I suspected; for the second, it is indeed at that line that the error happens.
However, I don't understand how the code can pass on a 14 GB GPU.

I'll try running it on other types of graphics cards...

Edit: Same issue with an NVIDIA T4 on Google Colab
