0.3.4 #1038 works on RTX 4070 but doesn't on GTX1650 4GB VRAM #1043

Open
misureaudio opened this issue Jan 9, 2025 · 8 comments

Labels
bug Something isn't working

Comments

misureaudio commented Jan 9, 2025

Describe the bug

C:\Users\misur\Desktop\rustsrc\mistral.rs.0.3.4.1038>.\mistralrs-server -i -n 10 gguf -m ..\GGUF -f Mistral-7B-Instruct-v0.3-Q8_0.gguf
2025-01-09T09:40:18.289603Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2025-01-09T09:40:18.289988Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-01-09T09:40:18.290347Z INFO mistralrs_server: Model kind is: gguf quantized from gguf (no adapters)
2025-01-09T09:40:18.293064Z INFO mistralrs_core::pipeline::paths: Loading Mistral-7B-Instruct-v0.3-Q8_0.gguf locally at ..\GGUF\Mistral-7B-Instruct-v0.3-Q8_0.gguf
2025-01-09T09:40:18.467512Z INFO mistralrs_core::gguf::content: Model config:
general.architecture: llama
general.file_type: 7
general.name: Mistral-7B-Instruct-v0.3
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 1000000
llama.vocab_size: 32768
quantize.imatrix.chunks_count: 228
quantize.imatrix.dataset: /training_data/calibration_data.txt
quantize.imatrix.entries_count: 224
quantize.imatrix.file: /models/Mistral-7B-Instruct-v0.3-GGUF/Mistral-7B-Instruct-v0.3.imatrix
2025-01-09T09:40:18.496830Z INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is llama, kind: Unigram, num tokens: 32768, num added tokens: 0, num merges: 0, num scores: 32768
2025-01-09T09:40:18.497589Z INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
2025-01-09T09:40:18.534091Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 7.5
2025-01-09T09:40:18.534307Z INFO mistralrs_core::utils::normal: Skipping BF16 because CC < 8.0
2025-01-09T09:40:18.566049Z INFO mistralrs_core::utils::normal: DType selected is F16.
2025-01-09T09:40:18.868166Z INFO mistralrs_core::device_map: Model has 32 repeating layers.
2025-01-09T09:40:18.868307Z INFO mistralrs_core::device_map: Loading model according to the following repeating layer mappings:
2025-01-09T09:40:18.868543Z INFO mistralrs_core::device_map: Layer 0: cuda[0]
2025-01-09T09:40:18.868679Z INFO mistralrs_core::device_map: Layer 1: cuda[0]
2025-01-09T09:40:18.868849Z INFO mistralrs_core::device_map: Layer 2: cuda[0]
2025-01-09T09:40:18.868905Z INFO mistralrs_core::device_map: Layer 3: cuda[0]
2025-01-09T09:40:18.868972Z INFO mistralrs_core::device_map: Layer 4: cuda[0]
2025-01-09T09:40:18.869026Z INFO mistralrs_core::device_map: Layer 5: cuda[0]
2025-01-09T09:40:18.869096Z INFO mistralrs_core::device_map: Layer 6: cuda[0]
2025-01-09T09:40:18.869140Z INFO mistralrs_core::device_map: Layer 7: cuda[0]
2025-01-09T09:40:18.869191Z INFO mistralrs_core::device_map: Layer 8: cuda[0]
2025-01-09T09:40:18.869232Z INFO mistralrs_core::device_map: Layer 9: cuda[0]
2025-01-09T09:40:18.869281Z INFO mistralrs_core::device_map: Layer 10: cpu
2025-01-09T09:40:18.869330Z INFO mistralrs_core::device_map: Layer 11: cpu
2025-01-09T09:40:18.869386Z INFO mistralrs_core::device_map: Layer 12: cpu
2025-01-09T09:40:18.869441Z INFO mistralrs_core::device_map: Layer 13: cpu
2025-01-09T09:40:18.869523Z INFO mistralrs_core::device_map: Layer 14: cpu
2025-01-09T09:40:18.869596Z INFO mistralrs_core::device_map: Layer 15: cpu
2025-01-09T09:40:18.869719Z INFO mistralrs_core::device_map: Layer 16: cpu
2025-01-09T09:40:18.869777Z INFO mistralrs_core::device_map: Layer 17: cpu
2025-01-09T09:40:18.869827Z INFO mistralrs_core::device_map: Layer 18: cpu
2025-01-09T09:40:18.869871Z INFO mistralrs_core::device_map: Layer 19: cpu
2025-01-09T09:40:18.869935Z INFO mistralrs_core::device_map: Layer 20: cpu
2025-01-09T09:40:18.870010Z INFO mistralrs_core::device_map: Layer 21: cpu
2025-01-09T09:40:18.870065Z INFO mistralrs_core::device_map: Layer 22: cpu
2025-01-09T09:40:18.870112Z INFO mistralrs_core::device_map: Layer 23: cpu
2025-01-09T09:40:18.870159Z INFO mistralrs_core::device_map: Layer 24: cpu
2025-01-09T09:40:18.870226Z INFO mistralrs_core::device_map: Layer 25: cpu
2025-01-09T09:40:18.870274Z INFO mistralrs_core::device_map: Layer 26: cpu
2025-01-09T09:40:18.870325Z INFO mistralrs_core::device_map: Layer 27: cpu
2025-01-09T09:40:18.870376Z INFO mistralrs_core::device_map: Layer 28: cpu
2025-01-09T09:40:18.870452Z INFO mistralrs_core::device_map: Layer 29: cpu
2025-01-09T09:40:18.870504Z INFO mistralrs_core::device_map: Layer 30: cpu
2025-01-09T09:40:18.870566Z INFO mistralrs_core::device_map: Layer 31: cpu
2025-01-09T09:40:26.607766Z INFO mistralrs_core::device_map: Model has 32 repeating layers.
2025-01-09T09:40:26.607920Z INFO mistralrs_core::device_map: Loading model according to the following repeating layer mappings:
2025-01-09T09:40:26.607998Z INFO mistralrs_core::device_map: Layer 0: cuda[0]
2025-01-09T09:40:26.608057Z INFO mistralrs_core::device_map: Layer 1: cuda[0]
2025-01-09T09:40:26.608108Z INFO mistralrs_core::device_map: Layer 2: cuda[0]
2025-01-09T09:40:26.608166Z INFO mistralrs_core::device_map: Layer 3: cuda[0]
2025-01-09T09:40:26.608219Z INFO mistralrs_core::device_map: Layer 4: cuda[0]
2025-01-09T09:40:26.608270Z INFO mistralrs_core::device_map: Layer 5: cuda[0]
2025-01-09T09:40:26.608320Z INFO mistralrs_core::device_map: Layer 6: cuda[0]
2025-01-09T09:40:26.608370Z INFO mistralrs_core::device_map: Layer 7: cuda[0]
2025-01-09T09:40:26.608418Z INFO mistralrs_core::device_map: Layer 8: cuda[0]
2025-01-09T09:40:26.608467Z INFO mistralrs_core::device_map: Layer 9: cuda[0]
2025-01-09T09:40:26.608516Z INFO mistralrs_core::device_map: Layer 10: cpu
2025-01-09T09:40:26.608565Z INFO mistralrs_core::device_map: Layer 11: cpu
2025-01-09T09:40:26.608612Z INFO mistralrs_core::device_map: Layer 12: cpu
2025-01-09T09:40:26.608666Z INFO mistralrs_core::device_map: Layer 13: cpu
2025-01-09T09:40:26.608717Z INFO mistralrs_core::device_map: Layer 14: cpu
2025-01-09T09:40:26.608770Z INFO mistralrs_core::device_map: Layer 15: cpu
2025-01-09T09:40:26.608822Z INFO mistralrs_core::device_map: Layer 16: cpu
2025-01-09T09:40:26.608873Z INFO mistralrs_core::device_map: Layer 17: cpu
2025-01-09T09:40:26.608923Z INFO mistralrs_core::device_map: Layer 18: cpu
2025-01-09T09:40:26.609053Z INFO mistralrs_core::device_map: Layer 19: cpu
2025-01-09T09:40:26.609109Z INFO mistralrs_core::device_map: Layer 20: cpu
2025-01-09T09:40:26.609167Z INFO mistralrs_core::device_map: Layer 21: cpu
2025-01-09T09:40:26.609224Z INFO mistralrs_core::device_map: Layer 22: cpu
2025-01-09T09:40:26.609282Z INFO mistralrs_core::device_map: Layer 23: cpu
2025-01-09T09:40:26.609338Z INFO mistralrs_core::device_map: Layer 24: cpu
2025-01-09T09:40:26.609395Z INFO mistralrs_core::device_map: Layer 25: cpu
2025-01-09T09:40:26.609452Z INFO mistralrs_core::device_map: Layer 26: cpu
2025-01-09T09:40:26.609508Z INFO mistralrs_core::device_map: Layer 27: cpu
2025-01-09T09:40:26.609564Z INFO mistralrs_core::device_map: Layer 28: cpu
2025-01-09T09:40:26.609621Z INFO mistralrs_core::device_map: Layer 29: cpu
2025-01-09T09:40:26.609752Z INFO mistralrs_core::device_map: Layer 30: cpu
2025-01-09T09:40:26.609814Z INFO mistralrs_core::device_map: Layer 31: cpu
2025-01-09T09:40:26.609973Z INFO mistralrs_core::pipeline::paths: Using literal chat template.
2025-01-09T09:40:26.655508Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "</s>", unk_tok = <unk>
2025-01-09T09:40:26.658311Z INFO mistralrs_server: Model loaded.
2025-01-09T09:40:26.667452Z INFO mistralrs_core: Enabling GEMM reduced precision in BF16.
2025-01-09T09:40:26.685026Z INFO mistralrs_core: Enabling GEMM reduced precision in F16.
2025-01-09T09:40:26.691946Z INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2025-01-09T09:40:26.692086Z INFO mistralrs_core: Beginning dummy run.
2025-01-09T09:40:33.174922Z ERROR mistralrs_core::engine: prompt step - Model failed with error: DeviceMismatchBinaryOp { lhs: Cuda { gpu_id: 0 }, rhs: Cpu, op: "slice-set" }
2025-01-09T09:40:33.175279Z INFO mistralrs_core: Dummy run completed in 6.48313s.
2025-01-09T09:40:33.175441Z INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1, dry_params: Some(DrySamplingParams { sequence_breakers: ["\n", ":", """, "*"], multiplier: 0.0, base: 1.75, allowed_length: 2 }) }
====================

Welcome to interactive mode! Because this model is a text model, you can enter prompts and chat with the model.

Commands:

  • \help: Display this message.
  • \exit: Quit interactive mode.
  • \system <system message here>:
    Add a system message to the chat without running the model.
    Ex: \system Always respond as a pirate.
    ====================

How much is log(-2)?
2025-01-09T09:40:47.954416Z ERROR mistralrs_core::engine: prompt step - Model failed with error: DeviceMismatchBinaryOp { lhs: Cuda { gpu_id: 0 }, rhs: Cpu, op: "slice-set" }
2025-01-09T09:40:47.954662Z ERROR mistralrs_server::interactive_mode: Got a model error: "device mismatch in slice-set, lhs: Cuda { gpu_id: 0 }, rhs: Cpu", response: ChatCompletionResponse { id: "1", choices: [Choice { finish_reason: "error", index: 0, message: ResponseMessage { content: Some(""), role: "assistant", tool_calls: [] }, logprobs: None }], created: 1736415647, model: "..\GGUF", system_fingerprint: "local", object: "chat.completion", usage: Usage { completion_tokens: 0, prompt_tokens: 20, total_tokens: 20, avg_tok_per_sec: 53.0504, avg_prompt_tok_per_sec: inf, avg_compl_tok_per_sec: NaN, total_time_sec: 0.377, total_prompt_time_sec: 0.0, total_completion_time_sec: 0.0 } }

C:\Users\misur\Desktop\rustsrc\mistral.rs.0.3.4.1038>

Latest commit or version

0.3.4 #1038 on Windows 11 24H2

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Oct_30_01:18:48_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 566.36 Driver Version: 566.36 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce GTX 1650 ... WDDM | 00000000:01:00.0 Off | N/A |
| N/A 40C P8 3W / 40W | 0MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

misureaudio added the bug label Jan 9, 2025

misureaudio (Author) commented Jan 9, 2025

Same problem on an RTX A2000 with 6 GB VRAM:

C:\Users\MATTIA\Desktop\rustsrc\mistral.rs.0.3.4.1038>.\mistralrs-server -i -n 10 gguf -m K:\.cache\lm-studio\models\bartowski\Mistral-7B-Instruct-v0.3-GGUF -f Mistral-7B-Instruct-v0.3-Q6_K.gguf
2025-01-09T12:42:42.378864Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2025-01-09T12:42:42.379020Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-01-09T12:42:42.379123Z INFO mistralrs_server: Model kind is: gguf quantized from gguf (no adapters)
2025-01-09T12:42:46.321470Z INFO mistralrs_core::pipeline::paths: Loading Mistral-7B-Instruct-v0.3-Q6_K.gguf locally at K:\.cache\lm-studio\models\bartowski\Mistral-7B-Instruct-v0.3-GGUF\Mistral-7B-Instruct-v0.3-Q6_K.gguf
2025-01-09T12:42:46.597271Z INFO mistralrs_core::gguf::content: Model config:
general.architecture: llama
general.file_type: 18
general.name: Mistral-7B-Instruct-v0.3
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 1000000
llama.vocab_size: 32768
quantize.imatrix.chunks_count: 228
quantize.imatrix.dataset: /training_data/calibration_data.txt
quantize.imatrix.entries_count: 224
quantize.imatrix.file: /models/Mistral-7B-Instruct-v0.3-GGUF/Mistral-7B-Instruct-v0.3.imatrix
2025-01-09T12:42:46.648090Z INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is llama, kind: Unigram, num tokens: 32768, num added tokens: 0, num merges: 0, num scores: 32768
2025-01-09T12:42:46.649484Z INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
2025-01-09T12:42:46.708409Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.6
2025-01-09T12:42:46.806418Z INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-01-09T12:42:48.203797Z INFO mistralrs_core::device_map: Model has 32 repeating layers.
2025-01-09T12:42:48.204058Z INFO mistralrs_core::device_map: Loading model according to the following repeating layer mappings:
2025-01-09T12:42:48.204239Z INFO mistralrs_core::device_map: Layer 0: cuda[0]
2025-01-09T12:42:48.204464Z INFO mistralrs_core::device_map: Layer 1: cuda[0]
2025-01-09T12:42:48.204577Z INFO mistralrs_core::device_map: Layer 2: cuda[0]
2025-01-09T12:42:48.204873Z INFO mistralrs_core::device_map: Layer 3: cuda[0]
2025-01-09T12:42:48.205102Z INFO mistralrs_core::device_map: Layer 4: cuda[0]
2025-01-09T12:42:48.205204Z INFO mistralrs_core::device_map: Layer 5: cuda[0]
2025-01-09T12:42:48.205313Z INFO mistralrs_core::device_map: Layer 6: cuda[0]
2025-01-09T12:42:48.205461Z INFO mistralrs_core::device_map: Layer 7: cuda[0]
2025-01-09T12:42:48.205569Z INFO mistralrs_core::device_map: Layer 8: cuda[0]
2025-01-09T12:42:48.205669Z INFO mistralrs_core::device_map: Layer 9: cuda[0]
2025-01-09T12:42:48.205754Z INFO mistralrs_core::device_map: Layer 10: cpu
2025-01-09T12:42:48.205846Z INFO mistralrs_core::device_map: Layer 11: cpu
2025-01-09T12:42:48.205935Z INFO mistralrs_core::device_map: Layer 12: cpu
2025-01-09T12:42:48.206027Z INFO mistralrs_core::device_map: Layer 13: cpu
2025-01-09T12:42:48.206139Z INFO mistralrs_core::device_map: Layer 14: cpu
2025-01-09T12:42:48.206224Z INFO mistralrs_core::device_map: Layer 15: cpu
2025-01-09T12:42:48.206317Z INFO mistralrs_core::device_map: Layer 16: cpu
2025-01-09T12:42:48.206415Z INFO mistralrs_core::device_map: Layer 17: cpu
2025-01-09T12:42:48.206491Z INFO mistralrs_core::device_map: Layer 18: cpu
2025-01-09T12:42:48.206575Z INFO mistralrs_core::device_map: Layer 19: cpu
2025-01-09T12:42:48.206673Z INFO mistralrs_core::device_map: Layer 20: cpu
2025-01-09T12:42:48.206763Z INFO mistralrs_core::device_map: Layer 21: cpu
2025-01-09T12:42:48.206842Z INFO mistralrs_core::device_map: Layer 22: cpu
2025-01-09T12:42:48.206922Z INFO mistralrs_core::device_map: Layer 23: cpu
2025-01-09T12:42:48.207013Z INFO mistralrs_core::device_map: Layer 24: cpu
2025-01-09T12:42:48.207104Z INFO mistralrs_core::device_map: Layer 25: cpu
2025-01-09T12:42:48.207190Z INFO mistralrs_core::device_map: Layer 26: cpu
2025-01-09T12:42:48.207290Z INFO mistralrs_core::device_map: Layer 27: cpu
2025-01-09T12:42:48.207379Z INFO mistralrs_core::device_map: Layer 28: cpu
2025-01-09T12:42:48.207457Z INFO mistralrs_core::device_map: Layer 29: cpu
2025-01-09T12:42:48.207519Z INFO mistralrs_core::device_map: Layer 30: cpu
2025-01-09T12:42:48.207606Z INFO mistralrs_core::device_map: Layer 31: cpu
2025-01-09T12:43:24.033183Z INFO mistralrs_core::device_map: Model has 32 repeating layers.
2025-01-09T12:43:24.033505Z INFO mistralrs_core::device_map: Loading model according to the following repeating layer mappings:
2025-01-09T12:43:24.033709Z INFO mistralrs_core::device_map: Layer 0: cuda[0]
2025-01-09T12:43:24.033877Z INFO mistralrs_core::device_map: Layer 1: cuda[0]
2025-01-09T12:43:24.034036Z INFO mistralrs_core::device_map: Layer 2: cuda[0]
2025-01-09T12:43:24.034197Z INFO mistralrs_core::device_map: Layer 3: cuda[0]
2025-01-09T12:43:24.034356Z INFO mistralrs_core::device_map: Layer 4: cuda[0]
2025-01-09T12:43:24.034526Z INFO mistralrs_core::device_map: Layer 5: cuda[0]
2025-01-09T12:43:24.034685Z INFO mistralrs_core::device_map: Layer 6: cuda[0]
2025-01-09T12:43:24.034845Z INFO mistralrs_core::device_map: Layer 7: cuda[0]
2025-01-09T12:43:24.035003Z INFO mistralrs_core::device_map: Layer 8: cuda[0]
2025-01-09T12:43:24.035168Z INFO mistralrs_core::device_map: Layer 9: cuda[0]
2025-01-09T12:43:24.035325Z INFO mistralrs_core::device_map: Layer 10: cpu
2025-01-09T12:43:24.035490Z INFO mistralrs_core::device_map: Layer 11: cpu
2025-01-09T12:43:24.035650Z INFO mistralrs_core::device_map: Layer 12: cpu
2025-01-09T12:43:24.035827Z INFO mistralrs_core::device_map: Layer 13: cpu
2025-01-09T12:43:24.035994Z INFO mistralrs_core::device_map: Layer 14: cpu
2025-01-09T12:43:24.036163Z INFO mistralrs_core::device_map: Layer 15: cpu
2025-01-09T12:43:24.036329Z INFO mistralrs_core::device_map: Layer 16: cpu
2025-01-09T12:43:24.036507Z INFO mistralrs_core::device_map: Layer 17: cpu
2025-01-09T12:43:24.036675Z INFO mistralrs_core::device_map: Layer 18: cpu
2025-01-09T12:43:24.036845Z INFO mistralrs_core::device_map: Layer 19: cpu
2025-01-09T12:43:24.037011Z INFO mistralrs_core::device_map: Layer 20: cpu
2025-01-09T12:43:24.037182Z INFO mistralrs_core::device_map: Layer 21: cpu
2025-01-09T12:43:24.037350Z INFO mistralrs_core::device_map: Layer 22: cpu
2025-01-09T12:43:24.037527Z INFO mistralrs_core::device_map: Layer 23: cpu
2025-01-09T12:43:24.037695Z INFO mistralrs_core::device_map: Layer 24: cpu
2025-01-09T12:43:24.037866Z INFO mistralrs_core::device_map: Layer 25: cpu
2025-01-09T12:43:24.038031Z INFO mistralrs_core::device_map: Layer 26: cpu
2025-01-09T12:43:24.038201Z INFO mistralrs_core::device_map: Layer 27: cpu
2025-01-09T12:43:24.038470Z INFO mistralrs_core::device_map: Layer 28: cpu
2025-01-09T12:43:24.038685Z INFO mistralrs_core::device_map: Layer 29: cpu
2025-01-09T12:43:24.038836Z INFO mistralrs_core::device_map: Layer 30: cpu
2025-01-09T12:43:24.038975Z INFO mistralrs_core::device_map: Layer 31: cpu
2025-01-09T12:43:24.039123Z INFO mistralrs_core::pipeline::paths: Using literal chat template.
2025-01-09T12:43:24.129289Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "</s>", unk_tok = <unk>
2025-01-09T12:43:24.136983Z INFO mistralrs_server: Model loaded.
2025-01-09T12:43:24.137686Z INFO mistralrs_core: Enabling GEMM reduced precision in BF16.
2025-01-09T12:43:24.202665Z INFO mistralrs_core: Enabling GEMM reduced precision in F16.
2025-01-09T12:43:24.216459Z INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2025-01-09T12:43:24.216781Z INFO mistralrs_core: Beginning dummy run.
2025-01-09T12:43:24.715606Z ERROR mistralrs_core::engine: prompt step - Model failed with error: DeviceMismatchBinaryOp { lhs: Cuda { gpu_id: 0 }, rhs: Cpu, op: "slice-set" }
2025-01-09T12:43:24.715815Z INFO mistralrs_core: Dummy run completed in 0.4989111s.
2025-01-09T12:43:24.715927Z INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1, dry_params: Some(DrySamplingParams { sequence_breakers: ["\n", ":", """, "*"], multiplier: 0.0, base: 1.75, allowed_length: 2 }) }

Welcome to interactive mode! Because this model is a text model, you can enter prompts and chat with the model.

Commands:

  • \help: Display this message.
  • \exit: Quit interactive mode.
  • \system <system message here>:
    Add a system message to the chat without running the model.
    Ex: \system Always respond as a pirate.
    ====================

How much is log(-2)?
2025-01-09T12:43:42.346385Z ERROR mistralrs_core::engine: prompt step - Model failed with error: DeviceMismatchBinaryOp { lhs: Cuda { gpu_id: 0 }, rhs: Cpu, op: "slice-set" }
2025-01-09T12:43:42.346706Z ERROR mistralrs_server::interactive_mode: Got a model error: "device mismatch in slice-set, lhs: Cuda { gpu_id: 0 }, rhs: Cpu", response: ChatCompletionResponse { id: "1", choices: [Choice { finish_reason: "error", index: 0, message: ResponseMessage { content: Some(""), role: "assistant", tool_calls: [] }, logprobs: None }], created: 1736426622, model: "K:\.cache\lm-studio\models\bartowski\Mistral-7B-Instruct-v0.3-GGUF", system_fingerprint: "local", object: "chat.completion", usage: Usage { completion_tokens: 0, prompt_tokens: 20, total_tokens: 20, avg_tok_per_sec: 60.422962, avg_prompt_tok_per_sec: inf, avg_compl_tok_per_sec: NaN, total_time_sec: 0.331, total_prompt_time_sec: 0.0, total_completion_time_sec: 0.0 } }

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Oct_30_01:18:48_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0

C:\Users\MATTIA>nvidia-smi
Thu Jan 9 13:47:24 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 561.17 Driver Version: 561.17 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A2000 WDDM | 00000000:21:00.0 On | 0 |
| 30% 31C P8 8W / 70W | 1056MiB / 5754MiB | 8% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1864 C+G ...n\131.0.2903.112\msedgewebview2.exe N/A |
| 0 N/A N/A 2080 C ...ta\Local\Programs\Ollama\ollama.exe N/A |
| 0 N/A N/A 3164 C+G ...n\NVIDIA app\CEF\NVIDIA Overlay.exe N/A |
| 0 N/A N/A 3560 C+G ...ogram Files\Java\jre7\bin\javaw.exe N/A |
| 0 N/A N/A 6604 C+G ...n\NVIDIA app\CEF\NVIDIA Overlay.exe N/A |
| 0 N/A N/A 9620 C+G ...nt.CBS_cw5n1h2txyewy\SearchHost.exe N/A |
| 0 N/A N/A 9864 C+G ...s\System32\ApplicationFrameHost.exe N/A |
| 0 N/A N/A 9972 C+G ...2txyewy\StartMenuExperienceHost.exe N/A |
| 0 N/A N/A 11008 C+G ...siveControlPanel\SystemSettings.exe N/A |
| 0 N/A N/A 12416 C+G ...ekyb3d8bbwe\PhoneExperienceHost.exe N/A |
| 0 N/A N/A 13084 C+G ...rogram Files\Java\jre7\bin\java.exe N/A |
| 0 N/A N/A 14460 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 16760 C+G ...crosoft\Edge\Application\msedge.exe N/A |
| 0 N/A N/A 17464 C+G ...m Files\Mozilla Firefox\firefox.exe N/A |
| 0 N/A N/A 17736 C+G ...__8wekyb3d8bbwe\WindowsTerminal.exe N/A |
| 0 N/A N/A 17916 C+G C:\Windows\System32\ShellHost.exe N/A |
| 0 N/A N/A 18420 C+G ...m Files\Mozilla Firefox\firefox.exe N/A |
| 0 N/A N/A 18916 C+G ...CBS_cw5n1h2txyewy\TextInputHost.exe N/A |
+-----------------------------------------------------------------------------------------+

dancixx commented Jan 9, 2025

I have the same issue with Llama 3.3 70B GGUF 4-bit. With dummy device mapping, layer processing just restarts at the 39th layer with CUDA_OUT_OF_MEMORY, but with custom device mapping the same rhs device-mismatch issue occurs.

cdoko (Collaborator) commented Jan 10, 2025

Got a model error: "device mismatch in slice-set, lhs: Cuda { gpu_id: 0 }, rhs: Cpu", response: ChatCompletionResponse { id: "1", choices: [Choice { finish_reason: "error", index: 0, message: ResponseMessage { content: Some(""), role: "assistant", tool_calls: [] }, logprobs: None }], created: 1736426622, model: "K:\.cache\lm-studio\models\bartowski\Mistral-7B-Instruct-v0.3-GGUF", system_fingerprint: "local", object: "chat.completion", usage: Usage { completion_tokens: 0, prompt_tokens: 20, total_tokens: 20, avg_tok_per_sec: 60.422962, avg_prompt_tok_per_sec: inf, avg_compl_tok_per_sec: NaN, total_time_sec: 0.331, total_prompt_time_sec: 0.0, total_completion_time_sec: 0.0 } }

Hi @misureaudio, the error is likely because device mapping does not map the cache alongside the layers when paged attention is disabled. This issue occurred with paged attention before its fix in #1011.
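
For illustration only, a minimal sketch against candle's tensor API (candle is the backend mistral.rs builds on; the shapes and the append_kv helper below are invented, not mistral.rs code): slice_set requires both operands to live on the same device, so a KV cache left on cuda[0] cannot accept keys/values produced by a CPU-mapped layer, which matches the DeviceMismatchBinaryOp { op: "slice-set" } in the logs above.

use candle_core::{DType, Device, Result, Tensor};

// Hypothetical helper mirroring how a KV cache appends new keys/values along the
// sequence dimension (dim 2 here). slice_set copies new_kv into cache in place and
// errors with DeviceMismatchBinaryOp if the two tensors are on different devices.
fn append_kv(cache: &Tensor, new_kv: &Tensor, seq_offset: usize) -> Result<()> {
    cache.slice_set(new_kv, 2, seq_offset)
}

fn main() -> Result<()> {
    let gpu = Device::new_cuda(0)?;
    let cpu = Device::Cpu;

    // Cache allocated on cuda[0] (i.e. not mapped alongside its layer) ...
    let cache = Tensor::zeros((1, 8, 64, 128), DType::F16, &gpu)?;
    // ... while a CPU-mapped layer produces its keys/values on the CPU.
    let new_kv = Tensor::zeros((1, 8, 16, 128), DType::F16, &cpu)?;

    // Reproduces: device mismatch in slice-set, lhs: Cuda { gpu_id: 0 }, rhs: Cpu.
    assert!(append_kv(&cache, &new_kv, 0).is_err());

    // Allocating the cache on the same device as the layer makes the write succeed.
    let cache_cpu = Tensor::zeros((1, 8, 64, 128), DType::F16, &cpu)?;
    append_kv(&cache_cpu, &new_kv, 0)
}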

When it worked on the 4070, was the model fully loaded on the GPU, rather than being split between the GPU and CPU?

misureaudio (Author) commented

@cdoko

Hi, probably yes; on the 4070 laptop I do my testing with Phi-3.5 and ISQ.

EricLBuehler (Owner) commented

@misureaudio @dancixx this bug with the non-paged cache not being mapped was fixed in a recent PR; can you please try it again?

misureaudio (Author) commented Jan 10, 2025

Same problem with #1047

RTX A2000 - 6GB VRAM

C:\Users\MATTIA\Desktop\rustsrc\mistral.rs.0.3.4.1047>.\mistralrs-server -i -n 10 gguf -m \\WD-BACKUP02\mattia\AI-Models\lmstudio-community\Meta-Llama-3.1-8B-Instruct-GGUF -f Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
2025-01-10T13:02:06.429938Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2025-01-10T13:02:06.430132Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-01-10T13:02:06.430251Z INFO mistralrs_server: Model kind is: gguf quantized from gguf (no adapters)
2025-01-10T13:02:06.807609Z INFO mistralrs_core::pipeline::paths: Loading Meta-Llama-3.1-8B-Instruct-Q8_0.gguf locally at \\WD-BACKUP02\mattia\AI-Models\lmstudio-community\Meta-Llama-3.1-8B-Instruct-GGUF\Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
2025-01-10T13:02:09.225937Z INFO mistralrs_core::gguf::content: Model config:
general.architecture: llama
general.basename: Meta-Llama-3.1
general.file_type: 7
general.finetune: Instruct
general.languages: en, de, fr, it, pt, hi, es, th
general.license: llama3.1
general.name: Meta Llama 3.1 8B Instruct
general.quantization_version: 2
general.size_label: 8B
general.tags: facebook, meta, pytorch, llama, llama-3, text-generation
general.type: model
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 131072
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 500000
llama.vocab_size: 128256
quantize.imatrix.chunks_count: 125
quantize.imatrix.dataset: /training_dir/calibration_datav3.txt
quantize.imatrix.entries_count: 224
quantize.imatrix.file: /models_out/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct.imatrix
2025-01-10T13:02:09.228093Z INFO mistralrs_core::utils::log: Model has 32 repeating layers.
2025-01-10T13:02:09.228167Z INFO mistralrs_core::utils::log: Loading model according to the following repeating layer mappings:
2025-01-10T13:02:09.228238Z INFO mistralrs_core::utils::log: Layers 0-9: cuda[0]
2025-01-10T13:02:09.228294Z INFO mistralrs_core::utils::log: Layers 10-31: cpu
2025-01-10T13:02:09.585426Z INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is gpt2, kind: Bpe, num tokens: 128256, num added tokens: 0, num merges: 280147, num scores: 0
2025-01-10T13:02:09.607884Z INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
2025-01-10T13:02:09.667880Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.6
2025-01-10T13:02:09.808690Z INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-01-10T13:04:15.344999Z INFO mistralrs_core::pipeline::paths: Using literal chat template.
2025-01-10T13:04:15.689541Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", unk_tok = None
2025-01-10T13:04:15.710752Z INFO mistralrs_server: Model loaded.
2025-01-10T13:04:15.711874Z INFO mistralrs_core: Enabling GEMM reduced precision in BF16.
2025-01-10T13:04:15.793502Z INFO mistralrs_core: Enabling GEMM reduced precision in F16.
2025-01-10T13:04:15.834877Z INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2025-01-10T13:04:15.835373Z INFO mistralrs_core: Beginning dummy run.
2025-01-10T13:04:16.460450Z ERROR mistralrs_core::engine: prompt step - Model failed with error: UnsupportedDTypeForOp(BF16, "matmul")
2025-01-10T13:04:16.460877Z INFO mistralrs_core: Dummy run completed in 0.6251258s.
2025-01-10T13:04:16.461205Z INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1, dry_params: Some(DrySamplingParams { sequence_breakers: ["\n", ":", """, "*"], multiplier: 0.0, base: 1.75, allowed_length: 2 }) }

Welcome to interactive mode! Because this model is a text model, you can enter prompts and chat with the model.

Commands:

  • \help: Display this message.
  • \exit: Quit interactive mode.
  • \system <system message here>:
    Add a system message to the chat without running the model.
    Ex: \system Always respond as a pirate.
    ====================

How much is log(-2)?
2025-01-10T13:04:36.135042Z ERROR mistralrs_core::engine: prompt step - Model failed with error: UnsupportedDTypeForOp(BF16, "matmul")
2025-01-10T13:04:36.135235Z ERROR mistralrs_server::interactive_mode: Got a model error: "unsupported dtype BF16 for op matmul", response: ChatCompletionResponse { id: "1", choices: [Choice { finish_reason: "error", index: 0, message: ResponseMessage { content: Some(""), role: "assistant", tool_calls: [] }, logprobs: None }], created: 1736514275, model: "\\WD-BACKUP02\mattia\AI-Models\lmstudio-community\Meta-Llama-3.1-8B-Instruct-GGUF", system_fingerprint: "local", object: "chat.completion", usage: Usage { completion_tokens: 0, prompt_tokens: 50, total_tokens: 50, avg_tok_per_sec: 101.41988, avg_prompt_tok_per_sec: inf, avg_compl_tok_per_sec: NaN, total_time_sec: 0.493, total_prompt_time_sec: 0.0, total_completion_time_sec: 0.0 } }

cdoko (Collaborator) commented Jan 10, 2025

@EricLBuehler

this bug with the non-paged cache not being mapped was fixed in a recent PR; can you please try it again?

I see that the non-paged cache is mapping correctly now, so it's not a cache-mapping issue. However, for certain models like Llama 3.1 8B GGUF, errors occur specifically when the model is split between the GPU and CPU, whereas splitting it across multiple GPUs works fine.
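
The UnsupportedDTypeForOp(BF16, "matmul") failure above points the same way: BF16 is selected for the cuda[0] layers (CC 8.6), but candle's CPU backend does not appear to implement a BF16 matmul, so activations handed off to the CPU-mapped layers would need a cast first. A rough sketch of that idea (the prepare_for_layer helper is invented, not mistral.rs code):

use candle_core::{DType, Device, Result, Tensor};

// Hypothetical helper: move an activation to the layer's device and pick a dtype that
// device can matmul in, assuming BF16 matmul is unavailable on the CPU backend.
fn prepare_for_layer(xs: &Tensor, device: &Device) -> Result<Tensor> {
    let xs = xs.to_device(device)?;
    match device {
        Device::Cpu if xs.dtype() == DType::BF16 => xs.to_dtype(DType::F32),
        _ => Ok(xs),
    }
}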

cdoko (Collaborator) commented Jan 11, 2025

@misureaudio @dancixx
I've merged some fixes, and it should be working now.
