MNN batch inference time not more efficient than single image #3184

Open
danielbr33 opened this issue Jan 28, 2025 · 2 comments
Labels
User — the user asks a question about how to use MNN, or uses MNN incorrectly and hits a bug.

Comments

@danielbr33

You can use the C++ API to do it. However, it is not more efficient to use multi-batch for inference.

Why is multi-batch inference not more efficient in MNN?

Is this true for all operators?

Originally posted by @mingyunzzu in #673

Can somebody explain why inference with a batch isn't more efficient in MNN? When I run detection on a single image it takes 7 milliseconds, and when I run it on a batch of 32 images it takes 8 milliseconds per image. This is only the inference time measured around runSession, without image preparation and post-processing. What can I do to get better results?
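For reference, a minimal sketch of the measurement around runSession with the MNN C++ session API (the model path is a placeholder and filling the input is omitted):

#include <MNN/Interpreter.hpp>
#include <chrono>
#include <cstdio>
#include <memory>
#include <vector>

int main() {
    // "model.mnn" is a placeholder path.
    std::shared_ptr<MNN::Interpreter> net(MNN::Interpreter::createFromFile("model.mnn"));
    MNN::ScheduleConfig config;
    config.type      = MNN_FORWARD_CPU;
    config.numThread = 4;
    auto session = net->createSession(config);

    // Resize the input to the desired batch (1 or 32) and rebuild the session.
    auto input = net->getSessionInput(session, nullptr);
    const int batch = 32;
    net->resizeTensor(input, {batch, 3, 413, 413});
    net->resizeSession(session);

    // ... fill `input`, e.g. via a host tensor and input->copyFromHostTensor(...) ...

    auto t0 = std::chrono::high_resolution_clock::now();
    net->runSession(session);
    auto t1 = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("total %.3f ms, %.3f ms per image\n", ms, ms / batch);
    return 0;
}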

@jxt1234
Collaborator

jxt1234 commented Jan 29, 2025

  1. The bug from that issue has been resolved. Batch inference time now depends on the device's compute FLOPS: if a single image already reaches the peak FLOPS, a batch of images will not be more efficient.
  2. To get the device's peak compute FLOPS, you can use ./run_test.out speed/MatMulBConst.
  3. Normally a GPU has more FLOPS than a CPU, so you can use OpenCL instead of the CPU backend to forward (see the sketch after this list).
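A minimal sketch of point 3, assuming MNN was built with the OpenCL backend enabled (-DMNN_OPENCL=ON); backupType is the fallback if OpenCL is unavailable at runtime:

#include <MNN/Interpreter.hpp>

// Create the session on the OpenCL backend instead of the CPU.
MNN::Session* createGpuSession(MNN::Interpreter* net) {
    MNN::ScheduleConfig config;
    config.type       = MNN_FORWARD_OPENCL;
    config.backupType = MNN_FORWARD_CPU;  // used if OpenCL is not available
    return net->createSession(config);
}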

@jxt1234 added the User label on Jan 29, 2025
@danielbr33
Author

I've run the modified test with dimensions closer to mine; the results are below. My image is a tensor of shape (3, 413, 413).
Do I understand correctly that there is no large difference in the FLOPS results between the 10x and 100x larger sizes, and that this is why batching does not give a better detection time?

I've also tried quantizing the model, which reduced its size from 6.9 MB to 1.8 MB, but the time increased from 7.5 ms to 11 ms, which also seems strange to me. I used low precision in my model's BackendConfig, roughly as sketched below.
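For reference, the low-precision setup looks roughly like this (a sketch; whether Precision_Low actually helps depends on the CPU, and the test log below reports fp16:0 and i8sdot:0 for this machine):

#include <MNN/Interpreter.hpp>

// Create a CPU session with low-precision arithmetic requested.
MNN::Session* createLowPrecisionSession(MNN::Interpreter* net, int numThread) {
    MNN::ScheduleConfig config;
    config.type      = MNN_FORWARD_CPU;
    config.numThread = numThread;

    MNN::BackendConfig backendConfig;
    backendConfig.precision = MNN::BackendConfig::Precision_Low;
    config.backendConfig = &backendConfig;  // read when the session is created

    return net->createSession(config);
}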

Do you have any other advice on what I can do to reduce the inference time if I cannot use a GPU?

(base) daniel@Daniel-PC:~/Desktop/MNN/build$ ./run_test.out speed/MatMulBConst
CPU Group: [ 14 12 15 13 ], 800000 - 3600000
CPU Group: [ 11 8 6 4 2 0 9 10 7 5 3 1 ], 800000 - 4900000
The device supports: i8sdot:0, fp16:0, i8mm: 0, sve2: 0
running speed/MatMulBConstTest.
MatMul B Const (Conv1x1): [540, 540, 320], run 100
_runConst, 203, cost time: 9.487000 ms
[540, 540, 320], Avg time: 1.366700 ms , flops: 68.275406 G
MatMul B Const (Conv1x1): [1024, 1024, 1024], run 100
_runConst, 203, cost time: 18.649000 ms
[1024, 1024, 1024], Avg time: 13.539290 ms , flops: 79.305626 G
MatMul B Const (Conv1x1): [3, 416, 416], run 1000
_runConst, 203, cost time: 0.081000 ms
[3, 416, 416], Avg time: 0.011036 ms , flops: 47.043137 G
MatMul B Const (Conv1x1): [30, 416, 416], run 1000
_runConst, 203, cost time: 0.153000 ms
[30, 416, 416], Avg time: 0.079503 ms , flops: 65.301689 G
MatMul B Const (Conv1x1): [300, 416, 416], run 100
_runConst, 203, cost time: 0.807000 ms
[300, 416, 416], Avg time: 0.753230 ms , flops: 68.925568 G
speed/MatMulBConstTest cost time: 1693.446 ms
√√√ all <speed/MatMulBConst> tests passed.
TEST_NAME_UNIT: 单元测试
TEST_CASE_AMOUNT_UNIT: {"blocked":0,"failed":0,"passed":1,"skipped":0}
TEST_CASE={"name":"单元测试","failed":0,"passed":1}
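As a sanity check on the numbers above, the reported flops column is consistent with M*N*K divided by the average time, so the 30x and 300x sizes are already near the device's plateau of roughly 65-80 GFLOPS:

#include <cstdio>

// Recompute the flops column from the logged average times (assumed formula: M*N*K / time).
int main() {
    auto gflops = [](double m, double n, double k, double avg_ms) {
        return m * n * k / (avg_ms * 1e-3) / 1e9;
    };
    printf("%.1f G\n", gflops(540, 540, 320, 1.366700));     // ~68.3, matches 68.275406 G
    printf("%.1f G\n", gflops(1024, 1024, 1024, 13.539290)); // ~79.3, matches 79.305626 G
    printf("%.1f G\n", gflops(3, 416, 416, 0.011036));       // ~47.0, matches 47.043137 G
    return 0;
}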
