System Info

Using the 3.1.0 Docker container on an AWS g6.12xlarge instance. --env output:
2025-02-19T17:51:35.116359Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.84.0
Commit sha: 463228ebfc444f60fa351da34a2ba158af0fe9d8
Docker label: sha-463228e
nvidia-smi:
Wed Feb 19 17:51:34 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03 Driver Version: 550.144.03 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:38:00.0 Off | 0 |
| N/A 45C P0 27W / 72W | 1MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L4 On | 00000000:3A:00.0 Off | 0 |
| N/A 42C P0 26W / 72W | 1MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L4 On | 00000000:3C:00.0 Off | 0 |
| N/A 45C P0 26W / 72W | 1MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L4 On | 00000000:3E:00.0 Off | 0 |
| N/A 41C P0 28W / 72W | 1MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
xpu-smi:
N/A
Information
- Docker
- The CLI directly

Tasks
- An officially supported command
- My own modifications
Reproduction
Running

docker run --gpus all -p 8000:80 --shm-size 1g ghcr.io/huggingface/text-generation-inference:3.1.0 --model-id hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --num-shard=4 --quantize awq --max-total-tokens 25000

results in the following memory usage:
Running the same command with version 3.0.1 uses ~6.5 GiB less VRAM:
I tried to run the same experiment with version 3.0.2, but it raised a CUDA-related error and failed to start. Perhaps that's a clue as to the source of the issue?
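For anyone wanting to reproduce the comparison, here is a minimal sketch (my own scaffolding, not from the original report: the container names, the sleep duration, and the nvidia-smi query are assumptions; only the image tag differs between the two runs):

for TAG in 3.0.1 3.1.0; do
  # Launch one version at a time with the exact arguments from the report.
  docker run -d --name tgi-$TAG --gpus all -p 8000:80 --shm-size 1g \
    ghcr.io/huggingface/text-generation-inference:$TAG \
    --model-id hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --num-shard=4 --quantize awq --max-total-tokens 25000
  sleep 600  # rough wait for all four shards to load; adjust for your instance
  # Record per-GPU memory once the model is resident.
  nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
  docker rm -f tgi-$TAG
done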
Expected behavior
I don't expect a minor/patch version upgrade to result in substantially increased memory usage. Upgrading caused our service running the model to crash with OOM errors.
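As a stopgap while this is investigated (my suggestion, not part of the original report), the launcher's --cuda-memory-fraction flag can cap how much VRAM TGI reserves per GPU, so an over-budget run should fail at startup rather than OOM the service later; the 0.9 below is an arbitrary example value:

docker run --gpus all -p 8000:80 --shm-size 1g \
  ghcr.io/huggingface/text-generation-inference:3.1.0 \
  --model-id hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --num-shard=4 --quantize awq --max-total-tokens 25000 \
  --cuda-memory-fraction 0.9  # cap reservable memory at 90% of each GPU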