
Streamed inference not as smooth (fast?) as with e.g. Ollama - Llama 3.1 #630

Closed
ChristianWeyer opened this issue Jul 25, 2024 · 40 comments · Fixed by #685
Labels
bug Something isn't working

Comments

@ChristianWeyer

Describe the bug

Have a look :-)

Inference-macOS.mov

Latest commit or version

0.22
MBP M3 Max

ChristianWeyer added the bug label on Jul 25, 2024
@EricLBuehler
Owner

Hi @ChristianWeyer if you could please try to gather some T/s metrics, that'd be amazing for a quantitative comparison!

@ChristianWeyer
Author

Sure!

Ollama has --verbose:

❯ ollama run llama3.1:8b-instruct-fp16 --verbose
>>> tell me a joke
Here's one:

What do you call a fake noodle?

An impasta.

total duration:       1.29921225s
load duration:        34.187542ms
prompt eval count:    15 token(s)
prompt eval duration: 483.086ms
prompt eval rate:     31.05 tokens/s
eval count:           18 token(s)
eval duration:        781.205ms
eval rate:            23.04 tokens/s
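
(For reference, the eval rate is simply the eval count divided by the eval duration; a minimal check with the numbers above, in plain Rust:)

// Sanity check of how the "eval rate" line is derived: decode tokens / decode time.
fn main() {
    let eval_tokens = 18.0_f64;   // "eval count"
    let eval_secs = 0.781205_f64; // "eval duration"
    println!("eval rate: {:.2} tokens/s", eval_tokens / eval_secs); // prints 23.04
}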

Is there anything similar for mistral.rs @EricLBuehler?

@EricLBuehler
Owner

Yes, mistral.rs has a --throughput flag, which goes before the model selector (plain). It can also be used with the server.

@ChristianWeyer
Author

Is there a trick to see the throughput values in interactive mode? Or does it not work with -i?

❯ cargo run --release --features metal -- -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama
    Finished `release` profile [optimized] target(s) in 0.43s
     Running `target/release/mistralrs-server -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama`
2024-07-25T18:40:01.534562Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-07-25T18:40:01.534616Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-07-25T18:40:01.534652Z  INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
2024-07-25T18:40:01.535339Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-07-25T18:40:01.535540Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-07-25T18:40:02.049554Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00004.safetensors", "model-00002-of-00004.safetensors", "model-00003-of-00004.safetensors", "model-00004-of-00004.safetensors"]
2024-07-25T18:40:02.205875Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-07-25T18:40:02.783612Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-07-25T18:40:02.786192Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2024-07-25T18:40:02.786199Z  INFO mistralrs_core::pipeline::normal: Loading model `meta-llama/Meta-Llama-3.1-8B-Instruct` on metal[4294968875].
2024-07-25T18:40:02.786317Z  INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_size: 4096, intermediate_size: 14336, vocab_size: 128256, num_hidden_layers: 32, num_attention_heads: 32, num_key_value_heads: 8, use_flash_attn: false, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 8.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }) }
100%|█████████████████████████████████████████████████████████████| 82/82 [00:07<00:00, 29.71it/s]
100%|██████████████████████████████████████████████████████████| 104/104 [00:00<00:00, 128.57it/s]
100%|███████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 80.46it/s]
100%|█████████████████████████████████████████████████████████████| 5/5 [00:01<00:00, 1787.63it/s]
2024-07-25T18:40:12.803523Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2024-07-25T18:40:13.201142Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2024-07-25T18:40:13.214374Z  INFO mistralrs_server: Model loaded.
2024-07-25T18:40:13.214438Z  INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1 }
> tell me a joke
Here's one:

What do you call a fake noodle?

An impasta.
>

@EricLBuehler
Owner

@ChristianWeyer not at the moment, it is only for the server. I will add that tomorrow, but in the meantime, if you start up an OpenAI-compatible server (perhaps with both Ollama and mistral.rs) we can isolate whether the issue is in model performance or in the streaming implementation.
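
(As a sketch of how that comparison could be scripted against either server's OpenAI-compatible /v1/chat/completions endpoint: the snippet below streams one completion over SSE and records per-chunk arrival gaps, so raw T/s and streaming smoothness can be judged separately. The port, model id, and prompt are placeholders, and it assumes the reqwest crate with the blocking and json features plus serde_json; it is not part of mistral.rs or Ollama.)

// Hypothetical timing harness for a streamed chat completion.
use std::io::{BufRead, BufReader};
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = serde_json::json!({
        "model": "default",                                          // placeholder model id
        "messages": [{"role": "user", "content": "tell me a joke"}],
        "stream": true
    });
    let resp = reqwest::blocking::Client::new()
        .post("http://localhost:1234/v1/chat/completions")           // adjust to the server's port
        .json(&body)
        .send()?;

    let start = Instant::now();
    let mut last = start;
    let mut gaps_ms: Vec<u128> = Vec::new();
    for line in BufReader::new(resp).lines() {
        let line = line?;
        // SSE frames look like `data: {...}`; the stream ends with `data: [DONE]`.
        if let Some(payload) = line.strip_prefix("data: ") {
            if payload == "[DONE]" {
                break;
            }
            let now = Instant::now();
            gaps_ms.push(now.duration_since(last).as_millis());
            last = now;
        }
    }
    println!("chunks: {}", gaps_ms.len());
    println!("total: {:.2?}", start.elapsed());
    println!("max inter-chunk gap: {:?} ms", gaps_ms.iter().max());
    Ok(())
}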

@ChristianWeyer
Author

Have you been able to update the code for the interactive mode @EricLBuehler ?

@EricLBuehler
Owner

@ChristianWeyer yes in #655.

@ChristianWeyer
Author

ChristianWeyer commented Aug 14, 2024

Sorry for the late reply @EricLBuehler.

I just tried to run the latest commit (8cab33b) with
cargo run --release --features metal -- -i plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama

and got an error:

error[E0004]: non-exhaustive patterns: `DType::I32` not covered
   --> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/7ad6494/candle-core/src/sort.rs:145:23
    |
145 |                 match storage.dtype() {
    |                       ^^^^^^^^^^^^^^^ pattern `DType::I32` not covered
    |
note: `DType` defined here
   --> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/7ad6494/candle-core/src/dtype.rs:8:10
    |
8   | pub enum DType {
    |          ^^^^^
...
14  |     I32,
    |     --- not covered
    = note: the matched value is of type `DType`
help: ensure that all possible cases are being handled by adding a match arm with a wildcard pattern or an explicit pattern as shown
    |
152 ~                     DType::I64 => "asort_asc_i64",
153 ~                     DType::I32 => todo!(),
    |

error[E0004]: non-exhaustive patterns: `DType::I32` not covered
   --> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/7ad6494/candle-core/src/sort.rs:155:23
    |
155 |                 match storage.dtype() {
    |                       ^^^^^^^^^^^^^^^ pattern `DType::I32` not covered
    |
note: `DType` defined here
   --> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/7ad6494/candle-core/src/dtype.rs:8:10
    |
8   | pub enum DType {
    |          ^^^^^
...
14  |     I32,
    |     --- not covered
    = note: the matched value is of type `DType`
help: ensure that all possible cases are being handled by adding a match arm with a wildcard pattern or an explicit pattern as shown
    |
162 ~                     DType::I64 => "asort_desc_i64",
163 ~                     DType::I32 => todo!(),
    |

   Compiling pyo3-macros v0.22.2
   Compiling rust-embed-impl v8.5.0
   Compiling derive_builder v0.20.0
   Compiling esaxx-rs v0.1.10
   Compiling darling v0.11.0
   Compiling utoipa-gen v4.3.0
For more information about this error, try `rustc --explain E0004`.
error: could not compile `candle-core` (lib) due to 2 previous errors
warning: build failed, waiting for other jobs to finish...

@EricLBuehler
Owner

@ChristianWeyer sorry for the trouble, I think this should be fixed in #681.

@ChristianWeyer
Author

Sure, no problem @EricLBuehler. Now it compiles.

But at runtime it crashes:

     Running `target/release/mistralrs-server -i plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama`
2024-08-14T11:51:59.682284Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-08-14T11:51:59.682438Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-08-14T11:51:59.682537Z  INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
2024-08-14T11:51:59.684890Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-14T11:51:59.685635Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-14T11:52:03.626990Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00004.safetensors", "model-00002-of-00004.safetensors", "model-00003-of-00004.safetensors", "model-00004-of-00004.safetensors"]
2024-08-14T11:52:04.101942Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-14T11:52:05.642770Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-14T11:52:05.645004Z  INFO mistralrs_core::pipeline::normal: Loading model `meta-llama/Meta-Llama-3.1-8B-Instruct` on metal[4294968915].
Error: Metal error Error while loading function: "Function bgemm was not found in the library"

@xfer

xfer commented Aug 14, 2024

I have a similar issue (Error: Metal error Error while loading function: "Function bgemm was not found in the library"), but I'm using google/gemma-2-9b-it on a M1 Mac Studio.

@ac3xx
Contributor

ac3xx commented Aug 14, 2024

Exact same issue using microsoft/Phi-3-vision-128k-instruct on a M1 Max.

@EricLBuehler
Owner

@xfer @ac3xx @ChristianWeyer can you try to rollback to v0.2.4:

git fetch origin tag v0.2.4
git checkout v0.2.4

And then rebuild to see if it works?

@ac3xx
Contributor

ac3xx commented Aug 14, 2024

@xfer @ac3xx @ChristianWeyer can you try to rollback to v0.2.4:

git fetch origin tag v0.2.4
git checkout v0.2.4

And then rebuild to see if it works?

I completely forgot to update my comment - I did this earlier and it ran fine. Let me know if you need a bisect/etc.

@EricLBuehler
Owner

I completely forgot to update my comment - I did this earlier and it ran fine. Let me know if you need a bisect/etc.

Yeah a bisect would be very helpful!

@ac3xx
Contributor

ac3xx commented Aug 15, 2024

% cargo run --release --features metal -- -i vision-plain -m microsoft/Phi-3-vision-128k-instruct -a phi3v
   Compiling mistralrs-core v0.2.4 (/Users/jl/Code/mistral.rs/mistralrs-core)
error[E0308]: arguments to this method are incorrect
   --> mistralrs-core/src/pipeline/isq.rs:128:30
    |
128 | ...                   .apply_isq(dtype, &n_quantized, device)
    |                        ^^^^^^^^^        ------------  ------ expected `&AtomicUsize`, found `candle_core::Device`
    |                                         |
    |                                         expected `candle_core::Device`, found `&AtomicUsize`
    |
note: method defined here
   --> /Users/jl/Code/mistral.rs/mistralrs-quant/src/lib.rs:126:8
    |
126 |     fn apply_isq(
    |        ^^^^^^^^^
help: swap these arguments
    |
128 |                             .apply_isq(dtype, device, &n_quantized)
    |                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For more information about this error, try `rustc --explain E0308`.
error: could not compile `mistralrs-core` (lib) due to 1 previous error

@EricLBuehler #683 has broken compilation on master as an FYI.


Yeah a bisect would be very helpful!

Correcting for the red-herring commits (due to the wrong candle commit), the regression is caused by the rewrite of the automatic dtype inference. Specifically, that change made the newer version of try_into_dtype call determine_auto_dtype_all, which is missing a case (candle_core::Error::Metal(_)) that gets thrown due to the lack of BF16 support. Forcing f16 works fine.

I've opened #685 with the missing error case added, confirmed working without -d f16.
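
(To make that failure mode concrete, here is a self-contained toy, not the actual candle-core/mistral.rs code; all names are illustrative. The auto-selection probes BF16 first, and on Metal the probe fails with a backend-specific error variant rather than an "unsupported dtype" one, so that variant also has to mean "fall back to the next dtype".)

// Toy model of the bug; types and names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DType { BF16, F16 }

#[derive(Debug)]
#[allow(dead_code)]
enum ProbeError { UnsupportedDType, Metal(String), Other(String) }

// Pretend probe: on a Metal device without BF16 support, the BF16 probe fails with a
// Metal-specific error (e.g. a missing kernel), not with UnsupportedDType.
fn probe(dtype: DType, metal_without_bf16: bool) -> Result<(), ProbeError> {
    match (dtype, metal_without_bf16) {
        (DType::BF16, true) => Err(ProbeError::Metal("bgemm not found".into())),
        _ => Ok(()),
    }
}

fn determine_auto_dtype(metal_without_bf16: bool) -> Result<DType, ProbeError> {
    for dtype in [DType::BF16, DType::F16] {
        match probe(dtype, metal_without_bf16) {
            Ok(()) => return Ok(dtype),
            // The missing arm: without `ProbeError::Metal(_)` here, the BF16 probe error
            // escapes to the caller instead of falling back to F16.
            Err(ProbeError::UnsupportedDType) | Err(ProbeError::Metal(_)) => continue,
            Err(e) => return Err(e),
        }
    }
    Err(ProbeError::Other("no usable dtype".into()))
}

fn main() {
    assert_eq!(determine_auto_dtype(true).unwrap(), DType::F16);
    println!("falls back to F16 on Metal without BF16 support");
}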

@EricLBuehler
Owner

@xfer @ChristianWeyer I just merged @ac3xx's PR #685 which should fix this issue. I also merged #685 which should fix the compilation issue. So, I think master should be working now, but confirmation from someone with a Metal machine would be great.

@xfer

xfer commented Aug 15, 2024

@EricLBuehler for gemma-2-2b-it and gemma-2-2b it is working fine!

Also sorry for not testing the bisect 😞

@ChristianWeyer
Author

OK, so then here - finally - the stats you requested @EricLBuehler:

Ollama:
total duration: 2.240256875s
load duration: 32.448458ms
prompt eval count: 15 token(s)
prompt eval duration: 560.735ms
prompt eval rate: 26.75 tokens/s
eval count: 37 token(s)
eval duration: 1.646012s
eval rate: 22.48 tokens/s

mistral.rs:
2024-08-15T12:54:06.636383Z INFO mistralrs_server::interactive_mode: Average T/s: 10.96718959597559

@EricLBuehler
Owner

EricLBuehler commented Aug 15, 2024

@EricLBuehler for gemma-2-2b-it and gemma-2-2b it is working fine!

Great, glad to hear @xfer! No worries about the bisect.

OK, so then here - finally - the stats you requested @EricLBuehler:

@ChristianWeyer thanks for letting me know. I'll see what optimizations we can make.

@ChristianWeyer
Author

Do you need more help to identify potential performance issues @EricLBuehler?

@EricLBuehler
Owner

EricLBuehler commented Aug 17, 2024

@ChristianWeyer if you could please paste the output of interactive mode with all the logging during loading, that would be very helpful!

@ChristianWeyer
Author

The latest commit (575286b) gives me this error @EricLBuehler:

cargo run --release --features metal -- -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama

error[E0425]: cannot find value `rhs` in this scope
   --> mistralrs-quant/src/utils/ops.rs:306:31
    |
306 |         let original_device = rhs.device();
    |                               ^^^ not found in this scope

error[E0061]: this method takes 2 arguments but 1 argument was supplied
   --> mistralrs-quant/src/utils/ops.rs:308:14
    |
308 |             .apply_op2_no_bwd(&Leftshift(n))?
    |              ^^^^^^^^^^^^^^^^ ------------- an argument of type `&candle_core::Tensor` is missing
    |
note: method defined here
   --> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/2386e4e/candle-core/src/custom_op.rs:162:12
    |
162 |     pub fn apply_op2_no_bwd<C: CustomOp2>(&self, rhs: &Self, c: &C) -> Result<Self> {
    |            ^^^^^^^^^^^^^^^^
help: provide the argument
    |
308 |             .apply_op2_no_bwd(/* &candle_core::Tensor */, &Leftshift(n))?
    |                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   Compiling mistralrs-vision v0.2.5 (/Users/christianweyer/Sources/mistral.rs/mistralrs-vision)
Some errors have detailed explanations: E0061, E0425.
For more information about an error, try `rustc --explain E0061`.
error: could not compile `mistralrs-quant` (lib) due to 2 previous errors
warning: build failed, waiting for other jobs to finish...

@EricLBuehler
Owner

EricLBuehler commented Aug 19, 2024

@ChristianWeyer thanks for letting me know, 70c647c should fix this now.

@ChristianWeyer
Author

@ChristianWeyer if you could please paste the output of interactive mode with all the logging during loading, that would be very helpful!

Voila:

❯ cargo run --release --features metal -- -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama
    Finished `release` profile [optimized] target(s) in 0.60s
     Running `target/release/mistralrs-server -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama`
2024-08-19T14:10:52.254964Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-08-19T14:10:52.255064Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-08-19T14:10:52.255104Z  INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
2024-08-19T14:10:52.255541Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-19T14:10:52.255857Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-19T14:11:22.505704Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00004.safetensors", "model-00002-of-00004.safetensors", "model-00003-of-00004.safetensors", "model-00004-of-00004.safetensors"]
2024-08-19T14:11:22.843371Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-19T14:11:23.354823Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-19T14:11:23.357734Z  INFO mistralrs_core::pipeline::normal: Loading model `meta-llama/Meta-Llama-3.1-8B-Instruct` on metal[4294968463].
2024-08-19T14:11:23.366631Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2024-08-19T14:11:23.366839Z  INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_size: 4096, intermediate_size: 14336, vocab_size: 128256, num_hidden_layers: 32, num_attention_heads: 32, num_key_value_heads: 8, use_flash_attn: false, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 8.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }), quantization_config: None }
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 82/82 [00:06<00:00, 20.91it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████| 104/104 [00:00<00:00, 135.13it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 107.32it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 52.92it/s]
2024-08-19T14:11:31.688832Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2024-08-19T14:11:31.698495Z  INFO mistralrs_server: Model loaded.
2024-08-19T14:11:31.698591Z  INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1 }
>

@ChristianWeyer
Author

Did that help @EricLBuehler?

@EricLBuehler
Owner

@ChristianWeyer thanks, yes that did help. I'm concerned that the Metal ordinal seems to be an unsigned integer overflow: metal[4294968463], so maybe it's using the CPU somehow. Can you please confirm the GPU is being utilized?

Sorry for the late reply.
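
(For what it's worth, the arithmetic behind the overflow suspicion: the printed ordinal is larger than u32::MAX, so it only makes sense if something wrapped past 2^32. A quick check, with no assumption about where the value actually comes from:)

fn main() {
    let printed: u64 = 4_294_968_463; // from `metal[4294968463]` in the log above
    assert!(printed > u32::MAX as u64);
    println!("excess over 2^32: {}", printed - (1u64 << 32)); // prints 1167
}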

@ChristianWeyer
Author

(On holidays… back at the weekend 🌴)

@ChristianWeyer
Author

@ChristianWeyer thanks, yes that did help. I'm concerned that the Metal ordinal seems to be an unsigned integer overflow: metal[4294968463], so maybe it's using the CPU somehow. Can you please confirm the GPU is being utilized?

Sorry for the late reply.

Tried with commit cccdd27 - and ran into this error:

   Compiling mistralrs-server v0.3.0 (/Users/christianweyer/Sources/mistral.rs/mistralrs-server)
error[E0658]: use of unstable library feature 'absolute_path'
  --> mistralrs-server/src/util.rs:11:34
   |
11 |         url::Url::from_file_path(std::path::absolute(url_unparsed)?)
   |                                  ^^^^^^^^^^^^^^^^^^^
   |
   = note: see issue #92750 <https://github.com/rust-lang/rust/issues/92750> for more information

For more information about this error, try `rustc --explain E0658`.
error: could not compile `mistralrs-server` (bin "mistralrs-server") due to 1 previous error

@EricLBuehler
Owner

@ChristianWeyer as of v0.3.0, our MSRV is now 1.79. This error indicates that you have an older version installed; can you please run rustup update?

@ChristianWeyer
Author

Yes, this worked, thanks.

Using asitop, I can see that the GPU is used @EricLBuehler

@ChristianWeyer
Author

... any more ideas @EricLBuehler? :-)

@EricLBuehler
Owner

@ChristianWeyer I do plan to start optimizing the Candle Metal backend so it will hopefully receive the same performance enhancements as the CUDA backend has.

@ChristianWeyer
Author

@ChristianWeyer I do plan to start optimizing the Candle Metal backend so it will hopefully receive the same performance enhancements as the CUDA backend has.

Cool - let me know when I can start testing :-)

@ChristianWeyer
Author

... not to be too impatient ... have you already found some time to boost Metal @EricLBuehler? :-)

@EricLBuehler
Owner

Hi @ChristianWeyer! My M3 Max machine arrived today - so expect some performance improvements in the coming days for sure!

@EricLBuehler
Owner

EricLBuehler commented Oct 26, 2024

Hi @ChristianWeyer! I just merged #887 which increases decoding T/s (e.g. for Llama 3.1 8b @ q4k) by 26% (30 -> 38)!

Let me know if you can see a difference!

@ChristianWeyer
Author

Hi @ChristianWeyer! I just merged #887 which increases decoding T/s (e.g. for Llama 3.1 8b @ q4k) by 26% (30 -> 38)!

Let me know if you can see a difference!

@EricLBuehler When running

cargo run --release --features metal -- -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama

I now get:

2024-10-28T16:19:59.809221Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-10-28T16:19:59.809289Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-10-28T16:19:59.809366Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2024-10-28T16:19:59.810497Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
thread 'main' panicked at mistralrs-core/src/pipeline/normal.rs:230:58:
Could not get file "tokenizer.json" from API: MissingHeader("etag")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@ChristianWeyer
Author

@EricLBuehler When running

cargo run --release --features metal -- -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama

I now get:

2024-10-28T16:19:59.809221Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-10-28T16:19:59.809289Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-10-28T16:19:59.809366Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2024-10-28T16:19:59.810497Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
thread 'main' panicked at mistralrs-core/src/pipeline/normal.rs:230:58:
Could not get file "tokenizer.json" from API: MissingHeader("etag")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Does it work on your side @EricLBuehler ?

@EricLBuehler
Owner

@ChristianWeyer check out #903! We are now faster than or comparable to llama.cpp and MLX.
