Description
I've been trying various Hugging Face models on Triton using the ONNX Runtime backend. The models are first converted from Hugging Face to ONNX using one of the onnxruntime converters and then deployed on Triton.
The problem is that the converted GPT2 model performs very poorly as the sequence length increases, whereas a converted BERT model scales quite well. We have been using GPT2LMHeadModel and bert-large-uncased-whole-word-masking-finetuned-squad.
I've been using this script to convert the GPT2 Hugging Face model to ONNX and this script to convert the BERT Hugging Face model to ONNX.
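As a sanity check before deploying, the exported models can be inspected locally with onnxruntime (a minimal sketch; the file path and the choice of execution provider are assumptions, not necessarily the exact setup used):

```python
import onnxruntime as ort

# Hypothetical path into the Triton model repository; adjust to the actual layout.
MODEL_PATH = "model_repository/gpt2/1/model.onnx"

# CUDAExecutionProvider is assumed here since Triton serves the model on GPU.
sess = ort.InferenceSession(MODEL_PATH, providers=["CUDAExecutionProvider"])

# Print the exported input/output tensor names and shapes so the Triton
# config.pbtxt and the client requests can be matched against them.
print("inputs :", [(i.name, i.shape, i.type) for i in sess.get_inputs()])
print("outputs:", [(o.name, o.shape, o.type) for o in sess.get_outputs()])
```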
For benchmarking, we send 50 sequential inference requests with random inputs (at each of several fixed sequence lengths) to the server and measure each request's RTT. We then report the mean and standard deviation of the RTTs; a sketch of the client loop is included after the results. Here are the results:
| Model | Sequence length | Mean time (ms) | STD time (ms) |
| --- | --- | --- | --- |
| BERT | 64 tokens | 4.95 | 0.08 |
| BERT | 128 tokens | 7.56 | 0.28 |
| BERT | 256 tokens | 9.98 | 0.19 |
| GPT2 | 64 tokens | 633.78 | 7.88 |
| GPT2 | 128 tokens | 1329.68 | 23.24 |
| GPT2 | 256 tokens | 2789.59 | 294.28 |
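For reference, the benchmarking client is essentially a loop like the following (a minimal sketch using tritonclient's HTTP API; the model name, tensor names, and dtypes are assumptions based on a typical export, not the exact client used):

```python
import time
import numpy as np
import tritonclient.http as httpclient

# Hypothetical names; the real model/tensor names depend on the exported ONNX graph.
MODEL_NAME = "gpt2"
SEQ_LEN = 256
N_REQUESTS = 50

client = httpclient.InferenceServerClient(url="localhost:8000")
latencies = []

for _ in range(N_REQUESTS):
    # Random token ids with a fixed sequence length and batch size 1.
    ids = np.random.randint(0, 50257, size=(1, SEQ_LEN), dtype=np.int64)
    inp = httpclient.InferInput("input_ids", list(ids.shape), "INT64")
    inp.set_data_from_numpy(ids)

    start = time.perf_counter()
    client.infer(MODEL_NAME, inputs=[inp])
    latencies.append((time.perf_counter() - start) * 1000.0)  # RTT in ms

print(f"Mean time: {np.mean(latencies):.2f} ms")
print(f"STD time:  {np.std(latencies):.2f} ms")
```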
Setting aside the higher absolute execution time (which may simply reflect the models' different sizes), BERT clearly scales very well with increasing sequence length: going from 64 to 256 tokens roughly doubles the RTT. In contrast, GPT2's execution time grows roughly linearly with the sequence length (it more than quadruples over the same range), which is very undesirable.
I ran some tests outside of Triton with ONNX Runtime directly, and the same behavior appeared when IOBinding was not enabled, which leads me to the question: is this happening inside Triton because IOBinding is not working correctly for this GPT2 model? I see that IOBinding is always enabled in the onnxruntime_backend, but it might not be working correctly here.
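For reference, this is roughly how the two cases can be compared outside of Triton (a minimal sketch; the tensor names, shapes, and CUDA device are assumptions, and the real GPT2 export may require additional inputs such as past key/value tensors):

```python
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
ids = np.random.randint(0, 50257, size=(1, 256), dtype=np.int64)

# Case 1: plain run() -- inputs and outputs are copied between host and device
# on every call.
t0 = time.perf_counter()
sess.run(None, {"input_ids": ids})
print("without IOBinding:", (time.perf_counter() - t0) * 1000, "ms")

# Case 2: run_with_iobinding() -- bind the input to GPU memory up front and
# keep the output on the device, avoiding per-call host<->device copies.
binding = sess.io_binding()
ids_gpu = ort.OrtValue.ortvalue_from_numpy(ids, "cuda", 0)
binding.bind_ortvalue_input("input_ids", ids_gpu)
binding.bind_output("logits", "cuda")  # output name is an assumption

t0 = time.perf_counter()
sess.run_with_iobinding(binding)
print("with IOBinding:   ", (time.perf_counter() - t0) * 1000, "ms")
```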
Triton Information
Triton Docker Image version: 22.10