
Streamed inference not as smooth (fast?) as with e.g. Ollama - Llama 3.1 #630

Closed
ChristianWeyer opened this issue Jul 25, 2024 · 40 comments · Fixed by #685
Labels
bug Something isn't working

Comments

@ChristianWeyer

Describe the bug

Have a look :-)

Inference-macOS.mov

Latest commit or version

0.22
MBP M3 Max

ChristianWeyer added the bug label on Jul 25, 2024
@EricLBuehler
Owner

Hi @ChristianWeyer if you could please try to gather some T/s metrics, that'd be amazing for a quantitative comparison!

@ChristianWeyer
Author

Sure!

Ollama has --verbose:

❯ ollama run llama3.1:8b-instruct-fp16 --verbose
>>> tell me a joke
Here's one:

What do you call a fake noodle?

An impasta.

total duration:       1.29921225s
load duration:        34.187542ms
prompt eval count:    15 token(s)
prompt eval duration: 483.086ms
prompt eval rate:     31.05 tokens/s
eval count:           18 token(s)
eval duration:        781.205ms
eval rate:            23.04 tokens/s
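
(For reference, the eval rate is simply the eval count divided by the eval duration; a minimal check with the numbers above, in plain Rust:)

// Sanity check of how the "eval rate" line is derived: decode tokens / decode time.
fn main() {
    let eval_tokens = 18.0_f64;   // "eval count"
    let eval_secs = 0.781205_f64; // "eval duration"
    println!("eval rate: {:.2} tokens/s", eval_tokens / eval_secs); // prints 23.04
}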

Is there anything similar for mistral.rs @EricLBuehler?

@EricLBuehler
Owner

Yes, mistral.rs has a --throughput flag, which goes before the model selector (plain). It can also be used with the server.

@ChristianWeyer
Author

Is there a trick to see the throughput values in interactive mode? Or does it not work with -i?

❯ cargo run --release --features metal -- -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama
    Finished `release` profile [optimized] target(s) in 0.43s
     Running `target/release/mistralrs-server -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama`
2024-07-25T18:40:01.534562Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-07-25T18:40:01.534616Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-07-25T18:40:01.534652Z  INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
2024-07-25T18:40:01.535339Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-07-25T18:40:01.535540Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-07-25T18:40:02.049554Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00004.safetensors", "model-00002-of-00004.safetensors", "model-00003-of-00004.safetensors", "model-00004-of-00004.safetensors"]
2024-07-25T18:40:02.205875Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-07-25T18:40:02.783612Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-07-25T18:40:02.786192Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2024-07-25T18:40:02.786199Z  INFO mistralrs_core::pipeline::normal: Loading model `meta-llama/Meta-Llama-3.1-8B-Instruct` on metal[4294968875].
2024-07-25T18:40:02.786317Z  INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_size: 4096, intermediate_size: 14336, vocab_size: 128256, num_hidden_layers: 32, num_attention_heads: 32, num_key_value_heads: 8, use_flash_attn: false, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 8.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }) }
100%|█████████████████████████████████████████████████████████████| 82/82 [00:07<00:00, 29.71it/s]
100%|██████████████████████████████████████████████████████████| 104/104 [00:00<00:00, 128.57it/s]
100%|███████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 80.46it/s]
100%|█████████████████████████████████████████████████████████████| 5/5 [00:01<00:00, 1787.63it/s]
2024-07-25T18:40:12.803523Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2024-07-25T18:40:13.201142Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2024-07-25T18:40:13.214374Z  INFO mistralrs_server: Model loaded.
2024-07-25T18:40:13.214438Z  INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1 }
> tell me a joke
Here's one:

What do you call a fake noodle?

An impasta.
>

@EricLBuehler
Owner

@ChristianWeyer not at the moment, it is only for the server. I will add that tomorrow, but in the meantime, if you start up an OpenAI-compatible server (perhaps with both Ollama and mistral.rs) we can isolate whether the issue is in model performance or in the streaming implementation.
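
(As a sketch of how that comparison could be scripted against either server's OpenAI-compatible /v1/chat/completions endpoint: the snippet below streams one completion over SSE and records per-chunk arrival gaps, so raw T/s and streaming smoothness can be judged separately. The port, model id, and prompt are placeholders, and it assumes the reqwest crate with the blocking and json features plus serde_json; it is not part of mistral.rs or Ollama.)

// Hypothetical timing harness for a streamed chat completion.
use std::io::{BufRead, BufReader};
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = serde_json::json!({
        "model": "default",                                          // placeholder model id
        "messages": [{"role": "user", "content": "tell me a joke"}],
        "stream": true
    });
    let resp = reqwest::blocking::Client::new()
        .post("http://localhost:1234/v1/chat/completions")           // adjust to the server's port
        .json(&body)
        .send()?;

    let start = Instant::now();
    let mut last = start;
    let mut gaps_ms: Vec<u128> = Vec::new();
    for line in BufReader::new(resp).lines() {
        let line = line?;
        // SSE frames look like `data: {...}`; the stream ends with `data: [DONE]`.
        if let Some(payload) = line.strip_prefix("data: ") {
            if payload == "[DONE]" {
                break;
            }
            let now = Instant::now();
            gaps_ms.push(now.duration_since(last).as_millis());
            last = now;
        }
    }
    println!("chunks: {}", gaps_ms.len());
    println!("total: {:.2?}", start.elapsed());
    println!("max inter-chunk gap: {:?} ms", gaps_ms.iter().max());
    Ok(())
}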

@ChristianWeyer
Author

Have you been able to update the code for the interactive mode @EricLBuehler ?

@EricLBuehler
Owner

@ChristianWeyer yes in #655.

@ChristianWeyer
Author

ChristianWeyer commented Aug 14, 2024

Sorry for the late reply @EricLBuehler.

I just tried to run the latest commit (8cab33b) with
cargo run --release --features metal -- -i plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama

and got an error:

error[E0004]: non-exhaustive patterns: `DType::I32` not covered
   --> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/7ad6494/candle-core/src/sort.rs:145:23
    |
145 |                 match storage.dtype() {
    |                       ^^^^^^^^^^^^^^^ pattern `DType::I32` not covered
    |
note: `DType` defined here
   --> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/7ad6494/candle-core/src/dtype.rs:8:10
    |
8   | pub enum DType {
    |          ^^^^^
...
14  |     I32,
    |     --- not covered
    = note: the matched value is of type `DType`
help: ensure that all possible cases are being handled by adding a match arm with a wildcard pattern or an explicit pattern as shown
    |
152 ~                     DType::I64 => "asort_asc_i64",
153 ~                     DType::I32 => todo!(),
    |

error[E0004]: non-exhaustive patterns: `DType::I32` not covered
   --> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/7ad6494/candle-core/src/sort.rs:155:23
    |
155 |                 match storage.dtype() {
    |                       ^^^^^^^^^^^^^^^ pattern `DType::I32` not covered
    |
note: `DType` defined here
   --> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/7ad6494/candle-core/src/dtype.rs:8:10
    |
8   | pub enum DType {
    |          ^^^^^
...
14  |     I32,
    |     --- not covered
    = note: the matched value is of type `DType`
help: ensure that all possible cases are being handled by adding a match arm with a wildcard pattern or an explicit pattern as shown
    |
162 ~                     DType::I64 => "asort_desc_i64",
163 ~                     DType::I32 => todo!(),
    |

   Compiling pyo3-macros v0.22.2
   Compiling rust-embed-impl v8.5.0
   Compiling derive_builder v0.20.0
   Compiling esaxx-rs v0.1.10
   Compiling darling v0.11.0
   Compiling utoipa-gen v4.3.0
For more information about this error, try `rustc --explain E0004`.
error: could not compile `candle-core` (lib) due to 2 previous errors
warning: build failed, waiting for other jobs to finish...

@EricLBuehler
Owner

@ChristianWeyer sorry for the trouble, I think this should be fixed in #681.

@ChristianWeyer
Author

Sure, no problem @EricLBuehler. Now it compiles.

But at runtime it crashes:

     Running `target/release/mistralrs-server -i plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama`
2024-08-14T11:51:59.682284Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-08-14T11:51:59.682438Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-08-14T11:51:59.682537Z  INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
2024-08-14T11:51:59.684890Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-14T11:51:59.685635Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-14T11:52:03.626990Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00004.safetensors", "model-00002-of-00004.safetensors", "model-00003-of-00004.safetensors", "model-00004-of-00004.safetensors"]
2024-08-14T11:52:04.101942Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-14T11:52:05.642770Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-14T11:52:05.645004Z  INFO mistralrs_core::pipeline::normal: Loading model `meta-llama/Meta-Llama-3.1-8B-Instruct` on metal[4294968915].
Error: Metal error Error while loading function: "Function bgemm was not found in the library"

@xfer

xfer commented Aug 14, 2024

I have a similar issue (Error: Metal error Error while loading function: "Function bgemm was not found in the library"), but I'm using google/gemma-2-9b-it on a M1 Mac Studio.

@ac3xx
Contributor

ac3xx commented Aug 14, 2024

Exact same issue using microsoft/Phi-3-vision-128k-instruct on a M1 Max.

@EricLBuehler
Owner

@xfer @ac3xx @ChristianWeyer can you try to rollback to v0.2.4:

git fetch origin tag v0.2.4
git checkout v0.2.4

And then rebuild to see if it works?

@ac3xx
Contributor

ac3xx commented Aug 14, 2024

@xfer @ac3xx @ChristianWeyer can you try to rollback to v0.2.4:

git fetch origin tag v0.2.4
git checkout v0.2.4

And then rebuild to see if it works?

I completely forgot to update my comment - I did this earlier and it ran fine. Let me know if you need a bisect/etc.

@EricLBuehler
Owner

I completely forgot to update my comment - I did this earlier and it ran fine. Let me know if you need a bisect/etc.

Yeah a bisect would be very helpful!

@ac3xx
Contributor

ac3xx commented Aug 15, 2024

% cargo run --release --features metal -- -i vision-plain -m microsoft/Phi-3-vision-128k-instruct -a phi3v
   Compiling mistralrs-core v0.2.4 (/Users/jl/Code/mistral.rs/mistralrs-core)
error[E0308]: arguments to this method are incorrect
   --> mistralrs-core/src/pipeline/isq.rs:128:30
    |
128 | ...                   .apply_isq(dtype, &n_quantized, device)
    |                        ^^^^^^^^^        ------------  ------ expected `&AtomicUsize`, found `candle_core::Device`
    |                                         |
    |                                         expected `candle_core::Device`, found `&AtomicUsize`
    |
note: method defined here
   --> /Users/jl/Code/mistral.rs/mistralrs-quant/src/lib.rs:126:8
    |
126 |     fn apply_isq(
    |        ^^^^^^^^^
help: swap these arguments
    |
128 |                             .apply_isq(dtype, device, &n_quantized)
    |                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For more information about this error, try `rustc --explain E0308`.
error: could not compile `mistralrs-core` (lib) due to 1 previous error

@EricLBuehler #683 has broken compilation on master as an FYI.


Yeah a bisect would be very helpful!

Correcting for the red-herring commits (due to the wrong candle commit), the regression is caused by the rewrite of the automatic dtype inference. Specifically, that change made the newer version of try_into_dtype call determine_auto_dtype_all, which is missing a case (candle_core::Error::Metal(_)) that gets thrown due to the lack of BF16 support. Forcing f16 works fine.

I've opened #685 with the missing error case added, confirmed working without -d f16.
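
(To make that failure mode concrete, here is a self-contained toy, not the actual candle-core/mistral.rs code; all names are illustrative. The auto-selection probes BF16 first, and on Metal the probe fails with a backend-specific error variant rather than an "unsupported dtype" one, so that variant also has to mean "fall back to the next dtype".)

// Toy model of the bug; types and names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DType { BF16, F16 }

#[derive(Debug)]
#[allow(dead_code)]
enum ProbeError { UnsupportedDType, Metal(String), Other(String) }

// Pretend probe: on a Metal device without BF16 support, the BF16 probe fails with a
// Metal-specific error (e.g. a missing kernel), not with UnsupportedDType.
fn probe(dtype: DType, metal_without_bf16: bool) -> Result<(), ProbeError> {
    match (dtype, metal_without_bf16) {
        (DType::BF16, true) => Err(ProbeError::Metal("bgemm not found".into())),
        _ => Ok(()),
    }
}

fn determine_auto_dtype(metal_without_bf16: bool) -> Result<DType, ProbeError> {
    for dtype in [DType::BF16, DType::F16] {
        match probe(dtype, metal_without_bf16) {
            Ok(()) => return Ok(dtype),
            // The missing arm: without `ProbeError::Metal(_)` here, the BF16 probe error
            // escapes to the caller instead of falling back to F16.
            Err(ProbeError::UnsupportedDType) | Err(ProbeError::Metal(_)) => continue,
            Err(e) => return Err(e),
        }
    }
    Err(ProbeError::Other("no usable dtype".into()))
}

fn main() {
    assert_eq!(determine_auto_dtype(true).unwrap(), DType::F16);
    println!("falls back to F16 on Metal without BF16 support");
}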

@EricLBuehler
Owner

@xfer @ChristianWeyer I just merged @ac3xx's PR #685 which should fix this issue. I also merged #685 which should fix the compilation issue. So, I think master should be working now, but confirmation from someone with a Metal machine would be great.

@xfer

xfer commented Aug 15, 2024

@EricLBuehler for gemma-2-2b-it and gemma-2-2b it is working fine!

Also sorry for not testing the bisect 😞

@ChristianWeyer
Author

OK, so then here - finally - the stats you requested @EricLBuehler:

Ollama:
total duration: 2.240256875s
load duration: 32.448458ms
prompt eval count: 15 token(s)
prompt eval duration: 560.735ms
prompt eval rate: 26.75 tokens/s
eval count: 37 token(s)
eval duration: 1.646012s
eval rate: 22.48 tokens/s

mistral.rs:
2024-08-15T12:54:06.636383Z INFO mistralrs_server::interactive_mode: Average T/s: 10.96718959597559

@EricLBuehler
Owner

EricLBuehler commented Aug 15, 2024

@EricLBuehler for gemma-2-2b-it and gemma-2-2b it is working fine!

Great, glad to hear @xfer! No worries about the bisect.

OK, so then here - finally - the stats you requested @EricLBuehler:

@ChristianWeyer thanks for letting me know. I'll see what optimizations we can make.

@ChristianWeyer
Author

Do you need more help to identify potential performance issues @EricLBuehler?

@EricLBuehler
Owner

EricLBuehler commented Aug 17, 2024

@ChristianWeyer if you could please paste the output of interactive mode with all the logging during loading, that would be very helpful!

@ChristianWeyer
Author

The latest commit (575286b) gives me this error @EricLBuehler:

cargo run --release --features metal -- -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama

error[E0425]: cannot find value `rhs` in this scope
   --> mistralrs-quant/src/utils/ops.rs:306:31
    |
306 |         let original_device = rhs.device();
    |                               ^^^ not found in this scope

error[E0061]: this method takes 2 arguments but 1 argument was supplied
   --> mistralrs-quant/src/utils/ops.rs:308:14
    |
308 |             .apply_op2_no_bwd(&Leftshift(n))?
    |              ^^^^^^^^^^^^^^^^ ------------- an argument of type `&candle_core::Tensor` is missing
    |
note: method defined here
   --> /Users/christianweyer/.cargo/git/checkouts/candle-c6a149c3b35a488f/2386e4e/candle-core/src/custom_op.rs:162:12
    |
162 |     pub fn apply_op2_no_bwd<C: CustomOp2>(&self, rhs: &Self, c: &C) -> Result<Self> {
    |            ^^^^^^^^^^^^^^^^
help: provide the argument
    |
308 |             .apply_op2_no_bwd(/* &candle_core::Tensor */, &Leftshift(n))?
    |                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   Compiling mistralrs-vision v0.2.5 (/Users/christianweyer/Sources/mistral.rs/mistralrs-vision)
Some errors have detailed explanations: E0061, E0425.
For more information about an error, try `rustc --explain E0061`.
error: could not compile `mistralrs-quant` (lib) due to 2 previous errors
warning: build failed, waiting for other jobs to finish...

@EricLBuehler
Owner

EricLBuehler commented Aug 19, 2024

@ChristianWeyer thanks for letting me know, 70c647c should fix this now.

@ChristianWeyer
Author

@ChristianWeyer if you could please paste the output of interactive mode with all the logging during loading, that would be very helpful!

Voila:

❯ cargo run --release --features metal -- -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama
    Finished `release` profile [optimized] target(s) in 0.60s
     Running `target/release/mistralrs-server -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama`
2024-08-19T14:10:52.254964Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-08-19T14:10:52.255064Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-08-19T14:10:52.255104Z  INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
2024-08-19T14:10:52.255541Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-19T14:10:52.255857Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-19T14:11:22.505704Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00004.safetensors", "model-00002-of-00004.safetensors", "model-00003-of-00004.safetensors", "model-00004-of-00004.safetensors"]
2024-08-19T14:11:22.843371Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-19T14:11:23.354823Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
2024-08-19T14:11:23.357734Z  INFO mistralrs_core::pipeline::normal: Loading model `meta-llama/Meta-Llama-3.1-8B-Instruct` on metal[4294968463].
2024-08-19T14:11:23.366631Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2024-08-19T14:11:23.366839Z  INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_size: 4096, intermediate_size: 14336, vocab_size: 128256, num_hidden_layers: 32, num_attention_heads: 32, num_key_value_heads: 8, use_flash_attn: false, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 8.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }), quantization_config: None }
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 82/82 [00:06<00:00, 20.91it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████| 104/104 [00:00<00:00, 135.13it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 107.32it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 52.92it/s]
2024-08-19T14:11:31.688832Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2024-08-19T14:11:31.698495Z  INFO mistralrs_server: Model loaded.
2024-08-19T14:11:31.698591Z  INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1 }
>

@ChristianWeyer
Author

Did that help @EricLBuehler?

@EricLBuehler
Owner

@ChristianWeyer thanks, yes that did help. I'm concerned that the Metal ordinal seems to be an unsigned integer overflow: metal[4294968463], so maybe it's using the CPU somehow. Can you please confirm the GPU is being utilized?

Sorry for the late reply.
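
(For what it's worth, the arithmetic behind the overflow suspicion: the printed ordinal is larger than u32::MAX, so it only makes sense if something wrapped past 2^32. A quick check, with no assumption about where the value actually comes from:)

fn main() {
    let printed: u64 = 4_294_968_463; // from `metal[4294968463]` in the log above
    assert!(printed > u32::MAX as u64);
    println!("excess over 2^32: {}", printed - (1u64 << 32)); // prints 1167
}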

@ChristianWeyer
Author

(On holidays… back at the weekend 🌴)

@ChristianWeyer
Author

@ChristianWeyer thanks, yes that did help. I'm concerned that the Metal ordinal seems to be an unsigned integer overflow: metal[4294968463], so maybe it's using the CPU somehow. Can you please confirm the GPU is being utilized?

Sorry for the late reply.

Tried with commit cccdd27 - and ran into this error:

   Compiling mistralrs-server v0.3.0 (/Users/christianweyer/Sources/mistral.rs/mistralrs-server)
error[E0658]: use of unstable library feature 'absolute_path'
  --> mistralrs-server/src/util.rs:11:34
   |
11 |         url::Url::from_file_path(std::path::absolute(url_unparsed)?)
   |                                  ^^^^^^^^^^^^^^^^^^^
   |
   = note: see issue #92750 <https://github.com/rust-lang/rust/issues/92750> for more information

For more information about this error, try `rustc --explain E0658`.
error: could not compile `mistralrs-server` (bin "mistralrs-server") due to 1 previous error

@EricLBuehler
Owner

@ChristianWeyer as of v0.3.0, our MSRV is now 1.79. This error indicates that you have an older version installed; can you please run rustup update?

@ChristianWeyer
Author

Yes, this worked, thanks.

Using asitop, I can see that the GPU is used @EricLBuehler

@ChristianWeyer
Author

... any more ideas @EricLBuehler? :-)

@EricLBuehler
Owner

@ChristianWeyer I do plan to start optimizing the Candle Metal backend so it will hopefully receive the same performance enhancements as the CUDA backend has.

@ChristianWeyer
Author

@ChristianWeyer I do plan to start optimizing the Candle Metal backend so it will hopefully receive the same performance enhancements as the CUDA backend has.

Cool - let me know when I can start testing :-)

@ChristianWeyer
Author

... not to be too impatient ... have you already found some time to boost Metal @EricLBuehler? :-)

@EricLBuehler
Owner

Hi @ChristianWeyer! My M3 Max machine arrived today - so expect some performance improvements in the coming days for sure!

@EricLBuehler
Owner

EricLBuehler commented Oct 26, 2024

Hi @ChristianWeyer! I just merged #887 which increases decoding T/s (e.g. for Llama 3.1 8b @ q4k) by 26% (30 -> 38)!

Let me know if you can see a difference!

@ChristianWeyer
Author

Hi @ChristianWeyer! I just merged #887 which increases decoding T/s (e.g. for Llama 3.1 8b @ q4k) by 26% (30 -> 38)!

Let me know if you can see a difference!

@EricLBuehler When running

cargo run --release --features metal -- -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama

I now get:

2024-10-28T16:19:59.809221Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-10-28T16:19:59.809289Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-10-28T16:19:59.809366Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2024-10-28T16:19:59.810497Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
thread 'main' panicked at mistralrs-core/src/pipeline/normal.rs:230:58:
Could not get file "tokenizer.json" from API: MissingHeader("etag")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@ChristianWeyer
Author

@EricLBuehler When running

cargo run --release --features metal -- -i --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama

I now get:

2024-10-28T16:19:59.809221Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-10-28T16:19:59.809289Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-10-28T16:19:59.809366Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2024-10-28T16:19:59.810497Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Meta-Llama-3.1-8B-Instruct`
thread 'main' panicked at mistralrs-core/src/pipeline/normal.rs:230:58:
Could not get file "tokenizer.json" from API: MissingHeader("etag")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Does it work on your side @EricLBuehler ?

@EricLBuehler
Owner

@ChristianWeyer check out #903! We are now faster than or comparable to llama.cpp and MLX.
