Description
I've been trying various Hugging Face models on Triton using the ONNX Runtime backend. The models are first converted from Hugging Face to ONNX using one of the onnxruntime converters and then deployed on Triton.
The problem is that the converted GPT2 model performs very poorly as the sequence length increases, whereas a converted BERT model scales quite well. We have been using GPT2LMHeadModel and bert-large-uncased-whole-word-masking-finetuned-squad.
I've been using this script to convert the GPT2 Hugging Face model to ONNX and this script to convert the BERT Hugging Face model to ONNX.
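As a sanity check before deploying, the exported models can be inspected locally with onnxruntime (a minimal sketch; the file path and the choice of execution provider are assumptions, not necessarily the exact setup used):

```python
import onnxruntime as ort

# Hypothetical path into the Triton model repository; adjust to the actual layout.
MODEL_PATH = "model_repository/gpt2/1/model.onnx"

# CUDAExecutionProvider is assumed here since Triton serves the model on GPU.
sess = ort.InferenceSession(MODEL_PATH, providers=["CUDAExecutionProvider"])

# Print the exported input/output tensor names and shapes so the Triton
# config.pbtxt and the client requests can be matched against them.
print("inputs :", [(i.name, i.shape, i.type) for i in sess.get_inputs()])
print("outputs:", [(o.name, o.shape, o.type) for o in sess.get_outputs()])
```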
For benchmarking, we send 50 sequential inference requests with random inputs (at each of several fixed sequence lengths) to the server and measure each request's RTT. We then report the mean and standard deviation of the RTTs; a sketch of the client loop is included after the results. Here are the results:
| Model | Sequence length | Mean time (ms) | STD time (ms) |
| --- | --- | --- | --- |
| BERT | 64 tokens | 4.95 | 0.08 |
| BERT | 128 tokens | 7.56 | 0.28 |
| BERT | 256 tokens | 9.98 | 0.19 |
| GPT2 | 64 tokens | 633.78 | 7.88 |
| GPT2 | 128 tokens | 1329.68 | 23.24 |
| GPT2 | 256 tokens | 2789.59 | 294.28 |
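For reference, the benchmarking client is essentially a loop like the following (a minimal sketch using tritonclient's HTTP API; the model name, tensor names, and dtypes are assumptions based on a typical export, not the exact client used):

```python
import time
import numpy as np
import tritonclient.http as httpclient

# Hypothetical names; the real model/tensor names depend on the exported ONNX graph.
MODEL_NAME = "gpt2"
SEQ_LEN = 256
N_REQUESTS = 50

client = httpclient.InferenceServerClient(url="localhost:8000")
latencies = []

for _ in range(N_REQUESTS):
    # Random token ids with a fixed sequence length and batch size 1.
    ids = np.random.randint(0, 50257, size=(1, SEQ_LEN), dtype=np.int64)
    inp = httpclient.InferInput("input_ids", list(ids.shape), "INT64")
    inp.set_data_from_numpy(ids)

    start = time.perf_counter()
    client.infer(MODEL_NAME, inputs=[inp])
    latencies.append((time.perf_counter() - start) * 1000.0)  # RTT in ms

print(f"Mean time: {np.mean(latencies):.2f} ms")
print(f"STD time:  {np.std(latencies):.2f} ms")
```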
Setting aside the higher absolute execution time (which may simply reflect the models' different sizes), BERT clearly scales very well with increasing sequence length: going from 64 to 256 tokens roughly doubles the RTT. In contrast, GPT2's execution time grows roughly linearly with the sequence length (it more than quadruples over the same range), which is very undesirable.
I ran some tests outside of Triton with ONNX Runtime directly, and the same behavior appeared when IOBinding was not enabled, which leads me to the question: is this happening inside Triton because IOBinding is not working correctly for this GPT2 model? I see that IOBinding is always enabled in the onnxruntime_backend, but it might not be working correctly here.
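For reference, this is roughly how the two cases can be compared outside of Triton (a minimal sketch; the tensor names, shapes, and CUDA device are assumptions, and the real GPT2 export may require additional inputs such as past key/value tensors):

```python
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
ids = np.random.randint(0, 50257, size=(1, 256), dtype=np.int64)

# Case 1: plain run() -- inputs and outputs are copied between host and device
# on every call.
t0 = time.perf_counter()
sess.run(None, {"input_ids": ids})
print("without IOBinding:", (time.perf_counter() - t0) * 1000, "ms")

# Case 2: run_with_iobinding() -- bind the input to GPU memory up front and
# keep the output on the device, avoiding per-call host<->device copies.
binding = sess.io_binding()
ids_gpu = ort.OrtValue.ortvalue_from_numpy(ids, "cuda", 0)
binding.bind_ortvalue_input("input_ids", ids_gpu)
binding.bind_output("logits", "cuda")  # output name is an assumption

t0 = time.perf_counter()
sess.run_with_iobinding(binding)
print("with IOBinding:   ", (time.perf_counter() - t0) * 1000, "ms")
```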
Triton Information
Triton Docker Image version: 22.10