Using the int4 QVQ 72B model: https://huggingface.co/kosbu/QVQ-72B-Preview-AWQ
Basic config: 4x 2080 Ti 22G, tp=4
python3 -m sglang.launch_server --model-path /root/model/QVQ-72B-Preview-AWQ --host 0.0.0.0 --port 30000 --tp 4 --mem-fraction-static 0.7
As you can see, the prefill stage takes 20 s.
What can I do to optimize the speed? Or is there an option to skip prefill when serving only a single request?
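For anyone hitting the same wall, one knob worth trying is chunked prefill, which SGLang exposes as a launch flag (check `python3 -m sglang.launch_server --help` on your version, as flag names change between releases). It splits a long prompt's prefill into smaller chunks, which mainly smooths memory pressure and latency spikes rather than shrinking total prefill compute, so treat this as a speculative tweak, not a confirmed fix:

```shell
# Same setup as above, plus a chunked-prefill cap (value 2048 is an
# illustrative guess; tune for your prompt lengths and VRAM headroom)
python3 -m sglang.launch_server \
  --model-path /root/model/QVQ-72B-Preview-AWQ \
  --host 0.0.0.0 --port 30000 \
  --tp 4 \
  --mem-fraction-static 0.7 \
  --chunked-prefill-size 2048
```

On compute-bound cards like the 2080 Ti, total prefill time is ultimately limited by FP16/INT4 throughput, so expect incremental gains at best.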
Well. This is quite strange. What GPUs and prompts are you using?
Okay, for the 2080 that's normal. 😂 We may not have a better plan for this. Sorry.
One more thing: it is not quick on dual A6000s (with NVLink) either; the prefill batch takes a lot of time.
@WuNein This could also make sense. The A6000 is indeed not a high-performance GPU. And I think this is related:
#2488
The bad thing is that we do not have these consumer devices to test on.