VRAM usage increases in version 3.1.0 #3038

Open
2 of 4 tasks
aW3st opened this issue Feb 19, 2025 · 0 comments
aW3st (Contributor) commented Feb 19, 2025

System Info

Using the 3.1.0 Docker container on an AWS g6.12xlarge instance. --env output:

2025-02-19T17:51:35.116359Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.84.0
Commit sha: 463228ebfc444f60fa351da34a2ba158af0fe9d8
Docker label: sha-463228e
nvidia-smi:
Wed Feb 19 17:51:34 2025
   +-----------------------------------------------------------------------------------------+
   | NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
   |-----------------------------------------+------------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
   |                                         |                        |               MIG M. |
   |=========================================+========================+======================|
   |   0  NVIDIA L4                      On  |   00000000:38:00.0 Off |                    0 |
   | N/A   45C    P0             27W /   72W |       1MiB /  23034MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+
   |   1  NVIDIA L4                      On  |   00000000:3A:00.0 Off |                    0 |
   | N/A   42C    P0             26W /   72W |       1MiB /  23034MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+
   |   2  NVIDIA L4                      On  |   00000000:3C:00.0 Off |                    0 |
   | N/A   45C    P0             26W /   72W |       1MiB /  23034MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+
   |   3  NVIDIA L4                      On  |   00000000:3E:00.0 Off |                    0 |
   | N/A   41C    P0             28W /   72W |       1MiB /  23034MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+

   +-----------------------------------------------------------------------------------------+
   | Processes:                                                                              |
   |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
   |        ID   ID                                                               Usage      |
   |=========================================================================================|
   |  No running processes found                                                             |
   +-----------------------------------------------------------------------------------------+
xpu-smi:
N/A

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Running the following command:

   docker run --gpus all -p 8000:80 --shm-size 1g \
     ghcr.io/huggingface/text-generation-inference:3.1.0 \
     --model-id hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
     --num-shard=4 --quantize awq --max-total-tokens 25000

results in the following memory usage:

[Image: memory usage with version 3.1.0]

Running the same command with version 3.0.1 uses ~6.5 GiB less VRAM:

[Image: memory usage with version 3.0.1]

I tried to run the same experiment with version 3.0.2, but it raised a CUDA-related error and failed to start. Perhaps that's a clue as to the source of the issue?
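For anyone trying to reproduce the comparison, here is a minimal sketch of how the two image tags can be measured back to back on the same host. It assumes the docker CLI and nvidia-smi are available on the host; the container names, the sequential loop over tags, and the 600-second warm-up wait are arbitrary choices for illustration, not part of the original setup.

   # Launch each TGI version with the same flags, wait for the shards to load,
   # then record per-GPU memory usage before tearing the container down.
   for tag in 3.0.1 3.1.0; do
     docker run -d --rm --name "tgi-$tag" --gpus all -p 8000:80 --shm-size 1g \
       ghcr.io/huggingface/text-generation-inference:"$tag" \
       --model-id hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
       --num-shard=4 --quantize awq --max-total-tokens 25000
     sleep 600   # arbitrary wait for model loading and warmup to finish
     echo "== $tag =="
     nvidia-smi --query-gpu=index,memory.used --format=csv,noheader
     docker stop "tgi-$tag"
   done

Comparing the two nvidia-smi readings taken this way is how the ~6.5 GiB difference above shows up.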

Expected behavior

I wouldn't expect a minor/patch version upgrade to substantially increase memory usage. Upgrading caused the service running this model to crash with OOM errors.
