0.3.4 #1038 works on RTX 4070 but doesn't work on GTX 1650 4GB VRAM #1043
Same problem on a Quadro A2000 6GB VRAM:
C:\Users\MATTIA\Desktop\rustsrc\mistral.rs.0.3.4.1038>.\mistralrs-server -i -n 10 gguf -m K:.cache\lm-studio\models\bartowski\Mistral-7B-Instruct-v0.3-GGUF -f Mistral-7B-Instruct-v0.3-Q6_K.gguf

I have the same issue using the
Hi @misureaudio, the error is likely due to the fact that device mapping, when paged attention is disabled, does not map the cache alongside the layers. This issue occurred with paged attention before its fix in #1011. When it worked on the 4070, was the model fully loaded on the GPU rather than being split between the GPU and CPU?
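To illustrate the failure mode described above, here is a minimal, self-contained sketch with toy types (`Device`, `Layer`, `KvCache` are illustrative, not mistral.rs internals): layers get mapped across devices, but every non-paged cache block stays on the default device, so the first CPU-mapped layer hits a device mismatch when writing its cache, mirroring the `DeviceMismatchBinaryOp { ..., op: "slice-set" }` error in the log.

```rust
// Toy reproduction of the mapping bug: layers are split across devices,
// but the KV cache is left entirely on the default device (cuda[0]).

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Device {
    Cuda(usize),
    Cpu,
}

struct Layer {
    device: Device,
}

struct KvCache {
    device: Device,
}

/// Errors when a layer and its cache live on different devices,
/// mimicking candle's device-mismatch check on a binary op.
fn write_to_cache(layer: &Layer, cache: &KvCache) -> Result<(), String> {
    if layer.device != cache.device {
        return Err(format!(
            "DeviceMismatchBinaryOp {{ lhs: {:?}, rhs: {:?}, op: \"slice-set\" }}",
            cache.device, layer.device
        ));
    }
    Ok(())
}

fn main() {
    // Layers 0-9 on cuda[0], layers 10-31 on the CPU (as in the log below),
    // but every cache block allocated on cuda[0]:
    let layers: Vec<Layer> = (0..32)
        .map(|i| Layer { device: if i < 10 { Device::Cuda(0) } else { Device::Cpu } })
        .collect();
    let caches: Vec<KvCache> = (0..32).map(|_| KvCache { device: Device::Cuda(0) }).collect();

    // The first CPU-mapped layer (layer 10) trips the mismatch.
    let first_err = layers
        .iter()
        .zip(&caches)
        .find_map(|(l, c)| write_to_cache(l, c).err());
    println!("{:?}", first_err);
}
```

A fully-GPU-resident model never triggers this, which would explain why the same build works on the 4070 when nothing spills to the CPU.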
Hi, probably yes; on the 4070 laptop I do my testing with Phi3.5 and ISQ.
@misureaudio @dancixx this bug, where the non-paged cache was not being mapped, was fixed in a recent PR; can you please try it again?
Same problem with #1047, RTX A2000 - 6GB VRAM:
C:\Users\MATTIA\Desktop\rustsrc\mistral.rs.0.3.4.1047>.\mistralrs-server -i -n 10 gguf -m \WD-BACKUP02\mattia\AI-Models\lmstudio-community\Meta-Llama-3.1-8B-Instruct-GGUF -f Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
I see that the non-paged cache is mapping correctly now, so it's not a cache mapping issue. However, for certain models like Llama 3.1 8B GGUF, errors occur specifically when the model is split between the GPU and CPU, whereas splitting it across multiple GPUs works fine.
@misureaudio @dancixx |
Describe the bug
C:\Users\misur\Desktop\rustsrc\mistral.rs.0.3.4.1038>.\mistralrs-server -i -n 10 gguf -m ..\GGUF -f Mistral-7B-Instruct-v0.3-Q8_0.gguf
2025-01-09T09:40:18.289603Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2025-01-09T09:40:18.289988Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-01-09T09:40:18.290347Z INFO mistralrs_server: Model kind is: gguf quantized from gguf (no adapters)
2025-01-09T09:40:18.293064Z INFO mistralrs_core::pipeline::paths: Loading Mistral-7B-Instruct-v0.3-Q8_0.gguf locally at ..\GGUF\Mistral-7B-Instruct-v0.3-Q8_0.gguf
2025-01-09T09:40:18.467512Z INFO mistralrs_core::gguf::content: Model config:
general.architecture: llama
general.file_type: 7
general.name: Mistral-7B-Instruct-v0.3
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 1000000
llama.vocab_size: 32768
quantize.imatrix.chunks_count: 228
quantize.imatrix.dataset: /training_data/calibration_data.txt
quantize.imatrix.entries_count: 224
quantize.imatrix.file: /models/Mistral-7B-Instruct-v0.3-GGUF/Mistral-7B-Instruct-v0.3.imatrix
2025-01-09T09:40:18.496830Z INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is llama, kind: Unigram, num tokens: 32768, num added tokens: 0, num merges: 0, num scores: 32768
2025-01-09T09:40:18.497589Z INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template:
{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
2025-01-09T09:40:18.534091Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 7.5
2025-01-09T09:40:18.534307Z INFO mistralrs_core::utils::normal: Skipping BF16 because CC < 8.0
2025-01-09T09:40:18.566049Z INFO mistralrs_core::utils::normal: DType selected is F16.
2025-01-09T09:40:18.868166Z INFO mistralrs_core::device_map: Model has 32 repeating layers.
2025-01-09T09:40:18.868307Z INFO mistralrs_core::device_map: Loading model according to the following repeating layer mappings:
2025-01-09T09:40:18.868543Z INFO mistralrs_core::device_map: Layer 0: cuda[0]
2025-01-09T09:40:18.868679Z INFO mistralrs_core::device_map: Layer 1: cuda[0]
2025-01-09T09:40:18.868849Z INFO mistralrs_core::device_map: Layer 2: cuda[0]
2025-01-09T09:40:18.868905Z INFO mistralrs_core::device_map: Layer 3: cuda[0]
2025-01-09T09:40:18.868972Z INFO mistralrs_core::device_map: Layer 4: cuda[0]
2025-01-09T09:40:18.869026Z INFO mistralrs_core::device_map: Layer 5: cuda[0]
2025-01-09T09:40:18.869096Z INFO mistralrs_core::device_map: Layer 6: cuda[0]
2025-01-09T09:40:18.869140Z INFO mistralrs_core::device_map: Layer 7: cuda[0]
2025-01-09T09:40:18.869191Z INFO mistralrs_core::device_map: Layer 8: cuda[0]
2025-01-09T09:40:18.869232Z INFO mistralrs_core::device_map: Layer 9: cuda[0]
2025-01-09T09:40:18.869281Z INFO mistralrs_core::device_map: Layer 10: cpu
2025-01-09T09:40:18.869330Z INFO mistralrs_core::device_map: Layer 11: cpu
2025-01-09T09:40:18.869386Z INFO mistralrs_core::device_map: Layer 12: cpu
2025-01-09T09:40:18.869441Z INFO mistralrs_core::device_map: Layer 13: cpu
2025-01-09T09:40:18.869523Z INFO mistralrs_core::device_map: Layer 14: cpu
2025-01-09T09:40:18.869596Z INFO mistralrs_core::device_map: Layer 15: cpu
2025-01-09T09:40:18.869719Z INFO mistralrs_core::device_map: Layer 16: cpu
2025-01-09T09:40:18.869777Z INFO mistralrs_core::device_map: Layer 17: cpu
2025-01-09T09:40:18.869827Z INFO mistralrs_core::device_map: Layer 18: cpu
2025-01-09T09:40:18.869871Z INFO mistralrs_core::device_map: Layer 19: cpu
2025-01-09T09:40:18.869935Z INFO mistralrs_core::device_map: Layer 20: cpu
2025-01-09T09:40:18.870010Z INFO mistralrs_core::device_map: Layer 21: cpu
2025-01-09T09:40:18.870065Z INFO mistralrs_core::device_map: Layer 22: cpu
2025-01-09T09:40:18.870112Z INFO mistralrs_core::device_map: Layer 23: cpu
2025-01-09T09:40:18.870159Z INFO mistralrs_core::device_map: Layer 24: cpu
2025-01-09T09:40:18.870226Z INFO mistralrs_core::device_map: Layer 25: cpu
2025-01-09T09:40:18.870274Z INFO mistralrs_core::device_map: Layer 26: cpu
2025-01-09T09:40:18.870325Z INFO mistralrs_core::device_map: Layer 27: cpu
2025-01-09T09:40:18.870376Z INFO mistralrs_core::device_map: Layer 28: cpu
2025-01-09T09:40:18.870452Z INFO mistralrs_core::device_map: Layer 29: cpu
2025-01-09T09:40:18.870504Z INFO mistralrs_core::device_map: Layer 30: cpu
2025-01-09T09:40:18.870566Z INFO mistralrs_core::device_map: Layer 31: cpu
2025-01-09T09:40:26.607766Z INFO mistralrs_core::device_map: Model has 32 repeating layers.
2025-01-09T09:40:26.607920Z INFO mistralrs_core::device_map: Loading model according to the following repeating layer mappings:
2025-01-09T09:40:26.607998Z INFO mistralrs_core::device_map: Layer 0: cuda[0]
2025-01-09T09:40:26.608057Z INFO mistralrs_core::device_map: Layer 1: cuda[0]
2025-01-09T09:40:26.608108Z INFO mistralrs_core::device_map: Layer 2: cuda[0]
2025-01-09T09:40:26.608166Z INFO mistralrs_core::device_map: Layer 3: cuda[0]
2025-01-09T09:40:26.608219Z INFO mistralrs_core::device_map: Layer 4: cuda[0]
2025-01-09T09:40:26.608270Z INFO mistralrs_core::device_map: Layer 5: cuda[0]
2025-01-09T09:40:26.608320Z INFO mistralrs_core::device_map: Layer 6: cuda[0]
2025-01-09T09:40:26.608370Z INFO mistralrs_core::device_map: Layer 7: cuda[0]
2025-01-09T09:40:26.608418Z INFO mistralrs_core::device_map: Layer 8: cuda[0]
2025-01-09T09:40:26.608467Z INFO mistralrs_core::device_map: Layer 9: cuda[0]
2025-01-09T09:40:26.608516Z INFO mistralrs_core::device_map: Layer 10: cpu
2025-01-09T09:40:26.608565Z INFO mistralrs_core::device_map: Layer 11: cpu
2025-01-09T09:40:26.608612Z INFO mistralrs_core::device_map: Layer 12: cpu
2025-01-09T09:40:26.608666Z INFO mistralrs_core::device_map: Layer 13: cpu
2025-01-09T09:40:26.608717Z INFO mistralrs_core::device_map: Layer 14: cpu
2025-01-09T09:40:26.608770Z INFO mistralrs_core::device_map: Layer 15: cpu
2025-01-09T09:40:26.608822Z INFO mistralrs_core::device_map: Layer 16: cpu
2025-01-09T09:40:26.608873Z INFO mistralrs_core::device_map: Layer 17: cpu
2025-01-09T09:40:26.608923Z INFO mistralrs_core::device_map: Layer 18: cpu
2025-01-09T09:40:26.609053Z INFO mistralrs_core::device_map: Layer 19: cpu
2025-01-09T09:40:26.609109Z INFO mistralrs_core::device_map: Layer 20: cpu
2025-01-09T09:40:26.609167Z INFO mistralrs_core::device_map: Layer 21: cpu
2025-01-09T09:40:26.609224Z INFO mistralrs_core::device_map: Layer 22: cpu
2025-01-09T09:40:26.609282Z INFO mistralrs_core::device_map: Layer 23: cpu
2025-01-09T09:40:26.609338Z INFO mistralrs_core::device_map: Layer 24: cpu
2025-01-09T09:40:26.609395Z INFO mistralrs_core::device_map: Layer 25: cpu
2025-01-09T09:40:26.609452Z INFO mistralrs_core::device_map: Layer 26: cpu
2025-01-09T09:40:26.609508Z INFO mistralrs_core::device_map: Layer 27: cpu
2025-01-09T09:40:26.609564Z INFO mistralrs_core::device_map: Layer 28: cpu
2025-01-09T09:40:26.609621Z INFO mistralrs_core::device_map: Layer 29: cpu
2025-01-09T09:40:26.609752Z INFO mistralrs_core::device_map: Layer 30: cpu
2025-01-09T09:40:26.609814Z INFO mistralrs_core::device_map: Layer 31: cpu
2025-01-09T09:40:26.609973Z INFO mistralrs_core::pipeline::paths: Using literal chat template.
2025-01-09T09:40:26.655508Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "
", eos_toks = "", unk_tok =
2025-01-09T09:40:26.658311Z INFO mistralrs_server: Model loaded.
2025-01-09T09:40:26.667452Z INFO mistralrs_core: Enabling GEMM reduced precision in BF16.
2025-01-09T09:40:26.685026Z INFO mistralrs_core: Enabling GEMM reduced precision in F16.
2025-01-09T09:40:26.691946Z INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2025-01-09T09:40:26.692086Z INFO mistralrs_core: Beginning dummy run.
2025-01-09T09:40:33.174922Z ERROR mistralrs_core::engine: prompt step - Model failed with error: DeviceMismatchBinaryOp { lhs: Cuda { gpu_id: 0 }, rhs: Cpu, op: "slice-set" }
2025-01-09T09:40:33.175279Z INFO mistralrs_core: Dummy run completed in 6.48313s.
2025-01-09T09:40:33.175441Z INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1, dry_params: Some(DrySamplingParams { sequence_breakers: ["\n", ":", """, "*"], multiplier: 0.0, base: 1.75, allowed_length: 2 }) }
====================
Welcome to interactive mode! Because this model is a text model, you can enter prompts and chat with the model.
Commands:
\help: Display this message.
\exit: Quit interactive mode.
\system <system message here>: Add a system message to the chat without running the model.
Ex: \system Always respond as a pirate.
====================
C:\Users\misur\Desktop\rustsrc\mistral.rs.0.3.4.1038>
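The log above maps layers 0-9 to cuda[0] and layers 10-31 to the CPU, then fails with `DeviceMismatchBinaryOp` on `slice-set`. The fix direction discussed in the thread (mapping the cache alongside the layers) can be sketched with toy types; `Device` and `cache_devices_from_mapping` are hypothetical illustrations, not mistral.rs code:

```rust
// Sketch: derive each cache block's device from the layer mapping itself,
// so a GPU+CPU split behaves like a multi-GPU split (every layer/cache
// pair agrees on a device and slice-set stays device-local).

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Device {
    Cuda(usize),
    Cpu,
}

/// The cache for layer i simply follows layer i's device.
fn cache_devices_from_mapping(layer_devices: &[Device]) -> Vec<Device> {
    layer_devices.to_vec()
}

fn main() {
    // The mapping from the log: layers 0-9 on cuda[0], 10-31 on the CPU.
    let mapping: Vec<Device> = (0..32)
        .map(|i| if i < 10 { Device::Cuda(0) } else { Device::Cpu })
        .collect();
    let caches = cache_devices_from_mapping(&mapping);

    // Every layer/cache pair now agrees, so no device-mismatch op can occur.
    assert!(mapping.iter().zip(&caches).all(|(l, c)| l == c));
    println!("all layer/cache devices match");
}
```

This also matches the later observation in the thread: splitting across multiple GPUs works because each device ends up holding both its layers and their caches, while the GPU+CPU path was the one leaving caches behind.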
Latest commit or version
0.3.4 #1038 on Windows 11 24H2
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Oct_30_01:18:48_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 566.36 Driver Version: 566.36 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce GTX 1650 ... WDDM | 00000000:01:00.0 Off | N/A |
| N/A 40C P8 3W / 40W | 0MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+