Concurrent request handling #1062
I am not sure if your case is similar, but I am facing the same issue:
I am possibly looking for the same solution; I hope we find one. PS: I was also looking at #771 and #897 👀 This library also provides a server: llama-cpp-python/llama_cpp/server/app.py, lines 165 to 168 at 8207280. Probably that may help.
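For anyone who wants to try the bundled server mentioned above, here is a minimal sketch of calling it from Python. It assumes the server was started with `python -m llama_cpp.server --model <path-to-gguf>` on the default host and port; adjust the URL and payload to your setup.

```python
# Sketch: query the bundled OpenAI-compatible server from Python.
# Assumes: python -m llama_cpp.server --model <path-to-gguf>  (default port 8000)
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```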
If the hardware's computing power is insufficient, the benefits of parallel inference are low. I implemented simple parallel inference with this project and tested it on a V100S; with 2 concurrent requests, the efficiency was no higher than serving a single request. Supporting parallel inference (batch processing) is a very complex task, involving issues such as the kv-cache and logits. Instead, you can use the api_like_OAI.py provided by llama.cpp as an alternative. That service supports parallel inference, although per-request performance is slightly lower during parallel execution.
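If you want to reproduce that kind of comparison yourself, a rough benchmark sketch is below. The URL and payload are placeholders pointing at whichever OpenAI-style server you run; it simply times N identical requests fired concurrently for N = 1 and N = 2.

```python
# Rough benchmark sketch (placeholder URL/payload): compare total wall-clock
# time for 1 vs 2 concurrent completion requests against a local server.
import time
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
PAYLOAD = {"prompt": "Write a haiku about GPUs.", "max_tokens": 128}

def one_request() -> float:
    t0 = time.perf_counter()
    r = requests.post(URL, json=PAYLOAD, timeout=300)
    r.raise_for_status()
    return time.perf_counter() - t0

for n in (1, 2):
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(lambda _: one_request(), range(n)))
    print(f"{n} concurrent request(s): {time.perf_counter() - t0:.1f}s total")
```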
Hi! I just made such a solution for myself. Here is the code: https://github.com/sergey-zinchenko/llama-cpp-python/tree/model_lock_per_request I introduced async locking of all the model state for all kinds of requests, streaming and not. All requests are handled one by one, so it is not truly concurrent, but at least the server will not crash or interrupt the request it is handling at the moment.
@sergey-zinchenko Can you provide more details, like what changes you made and how to adopt them in llama-cpp-python?
@malik-787 In short, I added a global async lock mechanism that handles requests one by one and limits the maximum number of waiting requests at the uvicorn level. The server stops crashing and no longer aborts ongoing inference; in my PR, incoming requests simply wait for the ongoing one to finish. IMHO this approach is more or less better for multi-user scenarios and for k8s deployments.
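This is not the exact code from the linked branch, just a minimal sketch of the same idea: a single `asyncio.Lock` serializes access to the shared `Llama` instance, so overlapping requests queue up instead of corrupting each other's state or crashing the server.

```python
# Sketch: serialize model access with one asyncio.Lock (paths/routes are placeholders).
import asyncio
from fastapi import FastAPI
from llama_cpp import Llama

app = FastAPI()
llm = Llama(model_path="model.gguf")   # placeholder path
model_lock = asyncio.Lock()

@app.post("/v1/completions")
async def completions(body: dict):
    async with model_lock:             # requests are handled one at a time
        # run the blocking inference in a thread so the event loop stays responsive
        return await asyncio.to_thread(
            llm.create_completion,
            body["prompt"],
            max_tokens=body.get("max_tokens", 128),
        )
```

To cap how many requests may pile up while one is being served, uvicorn's `--limit-concurrency` option can be used to reject excess connections with a 503 instead of letting the queue grow without bound.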
Hey there!! 🙏
I am currently working on a project that sends requests to the model through a Flask API, and when users send requests concurrently the model is not able to handle them. Is there any way I can handle multiple concurrent requests to the model and serve multiple users at the same time?
Please help! @abetlen
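In the meantime, a minimal Flask sketch of the same serialization idea discussed above may help, assuming a single shared `Llama` instance: a `threading.Lock` makes concurrent requests queue up for the model instead of colliding. Route names and paths below are placeholders, not code from this project.

```python
# Sketch: Flask endpoint that serializes access to one shared Llama instance.
import threading
from flask import Flask, request, jsonify
from llama_cpp import Llama

app = Flask(__name__)
llm = Llama(model_path="model.gguf")   # placeholder path
model_lock = threading.Lock()

@app.route("/generate", methods=["POST"])
def generate():
    prompt = request.get_json()["prompt"]
    with model_lock:                   # only one inference runs at a time
        out = llm.create_completion(prompt, max_tokens=128)
    return jsonify(out)

if __name__ == "__main__":
    # threaded=True lets Flask accept requests concurrently;
    # the lock still serializes the actual inference.
    app.run(threaded=True)
```

Requests are answered one after another, so latency grows under load, but the process no longer crashes when several users hit it at once.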