VRAM usage increases in version 3.1.0 #3038

Open
2 of 4 tasks
aW3st opened this issue Feb 19, 2025 · 0 comments
aW3st (Contributor) commented Feb 19, 2025

System Info

Using the 3.1.0 Docker container on an AWS g6.12xlarge instance. --env output:

2025-02-19T17:51:35.116359Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.84.0
Commit sha: 463228ebfc444f60fa351da34a2ba158af0fe9d8
Docker label: sha-463228e
nvidia-smi:
Wed Feb 19 17:51:34 2025
   +-----------------------------------------------------------------------------------------+
   | NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
   |-----------------------------------------+------------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
   |                                         |                        |               MIG M. |
   |=========================================+========================+======================|
   |   0  NVIDIA L4                      On  |   00000000:38:00.0 Off |                    0 |
   | N/A   45C    P0             27W /   72W |       1MiB /  23034MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+
   |   1  NVIDIA L4                      On  |   00000000:3A:00.0 Off |                    0 |
   | N/A   42C    P0             26W /   72W |       1MiB /  23034MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+
   |   2  NVIDIA L4                      On  |   00000000:3C:00.0 Off |                    0 |
   | N/A   45C    P0             26W /   72W |       1MiB /  23034MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+
   |   3  NVIDIA L4                      On  |   00000000:3E:00.0 Off |                    0 |
   | N/A   41C    P0             28W /   72W |       1MiB /  23034MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+

   +-----------------------------------------------------------------------------------------+
   | Processes:                                                                              |
   |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
   |        ID   ID                                                               Usage      |
   |=========================================================================================|
   |  No running processes found                                                             |
   +-----------------------------------------------------------------------------------------+
xpu-smi:
N/A

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Running the following command:

   docker run --gpus all -p 8000:80 --shm-size 1g \
     ghcr.io/huggingface/text-generation-inference:3.1.0 \
     --model-id hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
     --num-shard=4 --quantize awq --max-total-tokens 25000

results in the following memory usage:

[Image: memory usage with version 3.1.0]

Running the same command with version 3.0.1 uses ~6.5 GiB less VRAM:

[Image: memory usage with version 3.0.1]

I tried to run the same experiment with version 3.0.2, but it raised a CUDA-related error and failed to start. Perhaps that's a clue as to the source of the issue?
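For anyone trying to reproduce the comparison, here is a minimal sketch of how the two image tags can be measured back to back on the same host. It assumes the docker CLI and nvidia-smi are available on the host; the container names, the sequential loop over tags, and the 600-second warm-up wait are arbitrary choices for illustration, not part of the original setup.

   # Launch each TGI version with the same flags, wait for the shards to load,
   # then record per-GPU memory usage before tearing the container down.
   for tag in 3.0.1 3.1.0; do
     docker run -d --rm --name "tgi-$tag" --gpus all -p 8000:80 --shm-size 1g \
       ghcr.io/huggingface/text-generation-inference:"$tag" \
       --model-id hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
       --num-shard=4 --quantize awq --max-total-tokens 25000
     sleep 600   # arbitrary wait for model loading and warmup to finish
     echo "== $tag =="
     nvidia-smi --query-gpu=index,memory.used --format=csv,noheader
     docker stop "tgi-$tag"
   done

Comparing the two nvidia-smi readings taken this way is how the ~6.5 GiB difference above shows up.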

Expected behavior

I wouldn't expect a minor/patch version upgrade to substantially increase memory usage. Upgrading caused the service running this model to crash with OOM errors.
