Streamed inference not as smooth (fast?) as with e.g. Ollama - Llama 3.1 #630
Comments
Hi @ChristianWeyer, if you could please try to gather some T/s metrics, that'd be amazing for a quantitative comparison!
Sure! Ollama has
Is there anything similar for mistral.rs, @EricLBuehler?
Yes, mistral.rs has --throughput before the model selector (plain). It can be used with the server.
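For reference, a hypothetical invocation with the flag placed as described above. Only --throughput and the plain selector come from this thread; the binary name, port, and model ID are assumptions and may differ between versions:

```bash
# Hypothetical example: binary name, port, and model ID are assumptions,
# but --throughput sits before the model selector (plain) as noted above.
./mistralrs-server --port 1234 --throughput plain -m meta-llama/Meta-Llama-3.1-8B-Instruct
```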
Is there a trick to see the throughput values in interactive mode? Or does it not work with -i?
@ChristianWeyer not at the moment, it is only for the server. I will add that tomorrow, but in the meantime, if you start up an OAI server (perhaps for both), we can isolate whether the issue is in model performance or the streaming implementation.
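As a rough sketch of that comparison: start the OpenAI-compatible server and stream a completion from it, watching how quickly and evenly the chunks arrive. The endpoint path, port, and payload below follow the standard OpenAI chat-completions format and are assumptions about this particular setup:

```bash
# Rough streaming smoke test against an OpenAI-compatible server
# (port, endpoint path, and model name are assumptions).
curl -N http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "default",
        "messages": [{"role": "user", "content": "Write three sentences about Rust."}],
        "stream": true
      }'
```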
Have you been able to update the code for the interactive mode, @EricLBuehler?
@ChristianWeyer yes, in #655.
Sorry for the late reply, @EricLBuehler. I just tried to run the latest commit (8cab33b) and got an error:
@ChristianWeyer sorry for the trouble, I think this should be fixed in #681.
Sure, no problem @EricLBuehler. Now it compiles, but at runtime it crashes:
I have a similar issue (Error: Metal error Error while loading function: "Function bgemm was not found in the library"), but I'm using
Exact same issue using
@xfer @ac3xx @ChristianWeyer can you try to roll back to v0.2.4:

And then rebuild to see if it works?
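The exact rollback command was not captured above; a plausible equivalent, assuming the release is tagged v0.2.4 and a Metal build is wanted, might be:

```bash
# Hypothetical rollback and rebuild (tag name and feature flag are assumptions).
git checkout v0.2.4
cargo build --release --features metal
```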
I completely forgot to update my comment - I did this earlier and it ran fine. Let me know if you need a bisect/etc.
Yeah, a bisect would be very helpful!
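For anyone following along, a bisect between the last known-good release and master could look roughly like this; the good/bad refs are assumptions based on the comments above:

```bash
# Sketch of a git bisect session; the good/bad refs are assumptions.
git bisect start
git bisect bad master      # current master crashes at runtime
git bisect good v0.2.4     # v0.2.4 is reported to work
# At each step: rebuild, run the model, then mark the checked-out commit:
#   git bisect good    (if it runs fine)
#   git bisect bad     (if it crashes)
# When done, git reports the first bad commit; then clean up:
git bisect reset
```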
@EricLBuehler #683 has broken compilation on master as an FYI.
Correcting for the red-herring commits (because of the wrong candle commit), it's caused by the rewrite of the automatic dtype inference. Specifically, this change has led to the newer version of

I've opened #685 with the missing error case added, confirmed working without
@xfer @ChristianWeyer I just merged @ac3xx's PR #685, which should fix this issue. I also merged #685, which should fix the compilation issue. So, I think
@EricLBuehler for

Also, sorry for not testing the bisect 😞
OK, so then here - finally - the stats you requested, @EricLBuehler:

Ollama:

mistral.rs:
Great, glad to hear @xfer! No worries about the bisect.
@ChristianWeyer thanks for letting me know. I'll see what optimizations we can make.
Do you need more help to identify potential performance issues, @EricLBuehler?
@ChristianWeyer if you could please paste the output of interactive mode with all the logging during loading, that would be very helpful!
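A launch along these lines would produce those loading logs. Only -i and the plain selector come from this thread; the binary name and model ID are assumptions:

```bash
# Hypothetical interactive launch; -i and plain are from the thread, the rest is assumed.
./mistralrs-server -i plain -m meta-llama/Meta-Llama-3.1-8B-Instruct
```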
The latest commit (575286b) gives me this error, @EricLBuehler:
@ChristianWeyer thanks for letting me know; 70c647c should fix this now.
Voilà:
Did that help, @EricLBuehler?
@ChristianWeyer thanks, yes that did help. I'm concerned that the Metal ordinal seems to be an unsigned integer overflow:

Sorry for the late reply.
(On holidays… back at the weekend 🌴)
Tried with commit cccdd27 - and ran into this error:
@ChristianWeyer as of v0.3.0, our MSRV is now 1.79. This error indicates that you have an older version installed; can you please run
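The original command was not captured here; a typical check-and-update sequence, assuming rustup manages the toolchain, would be:

```bash
# Check the installed Rust version and update if it is older than 1.79.
rustc --version
rustup update stable
```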
Yes, this worked, thanks. Using
... any more ideas @EricLBuehler? :-)
@ChristianWeyer I do plan to begin optimizing the Candle Metal backend, so it will hopefully receive the same performance enhancements as the CUDA backend has.
Cool - let me know when I can start testing :-)
... not to be too impatient ... have you already found some time to boost Metal, @EricLBuehler? :-)
Hi @ChristianWeyer! My M3 Max machine arrived today - so expect some performance improvements in the coming days for sure!
Hi @ChristianWeyer! I just merged #887, which increases decoding T/s (e.g. for Llama 3.1 8B @ Q4K) by 26% (30 -> 38)! Let me know if you can see a difference!
@EricLBuehler When running

I now get:
Does it work on your side, @EricLBuehler?
@ChristianWeyer check out #903! We are now faster than or comparable to llama.cpp and MLX.
Describe the bug
Have a look :-)
Inference-macOS.mov
Latest commit or version
0.22
MBP M3 Max