Question about model inference optimization #3134
Comments
Hi @geraldstanje, you will have to use the benchmarking tool as shown in this example.
@agunapal Can you run torch.compile in the init function of a TorchServe handler? Are there any problems with that? E.g. here: https://github.com/pytorch/serve/blob/master/examples/Huggingface_Transformers/Transformer_handler_generalized.py#L34 Do you have an example somewhere?
Hi @geraldstanje, you can download this mar file. Here we use torch.compile with BERT: https://github.com/pytorch/serve/blob/master/benchmarks/models_config/bert_torch_compile_gpu.yaml#L24
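For reference, a minimal sketch of what compiling inside a custom handler's initialize() could look like (the handler class name and the compile mode below are illustrative, not taken from the linked BERT config):

```python
# handler.py -- illustrative sketch of calling torch.compile in a TorchServe handler
import torch
from ts.torch_handler.base_handler import BaseHandler


class CompiledModelHandler(BaseHandler):
    def initialize(self, context):
        # BaseHandler.initialize loads the eager model from the model archive
        # onto self.model / self.device.
        super().initialize(context)
        self.model.eval()
        # Compile once at worker startup; the actual compilation is still
        # deferred until the first forward pass (see the warmup discussion below).
        self.model = torch.compile(self.model, mode="reduce-overhead")
```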
@agunapal OK, I want to look at what's inside the .mar file. Will I need https://github.com/pytorch/serve/blob/master/model-archiver/README.md for that?
You can wget the mar file and then do an unzip.
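Since a .mar file is just a zip archive, its contents can also be listed from Python; a small sketch (the file name is a placeholder):

```python
# Illustrative: list the files packed into a downloaded .mar archive.
import zipfile

with zipfile.ZipFile("bert_model.mar") as mar:  # placeholder file name
    for name in mar.namelist():
        print(name)
```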
@agunapal What could be the reason that torch.compile doesn't complete immediately? It seems torch.compile requires some warmup requests to run (not sure if that's specific to mode="reduce-overhead" only). Can you run this in initialize as well? Do you see any problems if the entire warmup takes longer than 30 seconds?
Eval may not be needed. The first torch.compile iteration can take time, so usually you need to send a few (3-4) requests to warm up. You can also check how we address this with AOT compile; you can find the example under the pt2 examples directory.
@agunapal The problem seems to be some lazy execution: I run torch.compile and it seems to stop there; only when I send a predict request does it actually run the compilation. How can I disable the lazy execution, or how can I check whether lazy execution is causing this behavior?
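One common workaround (a sketch based on the warmup suggestion above, not an official TorchServe recipe) is to run a few dummy forward passes at the end of initialize() so that compilation is triggered before the first real request arrives:

```python
# Illustrative warmup helper, called at the end of initialize() after torch.compile.
# The dummy input shape (batch of 1, sequence length 128) is an assumption; use
# inputs shaped like your real requests so the compiled graph matches production.
import torch


def warmup(model, device, num_iters=4):
    input_ids = torch.ones((1, 128), dtype=torch.long, device=device)
    attention_mask = torch.ones((1, 128), dtype=torch.long, device=device)
    with torch.inference_mode():
        for _ in range(num_iters):
            model(input_ids=input_ids, attention_mask=attention_mask)
```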
📚 The doc issue
There is a typo:
A larger batch size means a higher throughput at the cost of lower latency.
The correct version should be:
A larger batch size means a higher throughput at the cost of higher latency.
I have some more questions about model inference latency optimization.
I'm currently reading:
https://github.com/pytorch/serve/blob/master/docs/performance_guide.md#torchserve-on-cpu-
https://github.com/pytorch/serve/blob/master/docs/performance_guide.md#torchserve-on-gpu
https://github.com/pytorch/serve/blob/master/docs/configuration.md
https://huggingface.co/docs/transformers/en/perf_torch_compile
I'm currently running model inference for a SetFit model (https://huggingface.co/blog/setfit) on an ml.g4dn.xlarge instance on AWS (vCPUs: 4, memory: 16 GiB, memory per vCPU: 4 GiB, physical processor: Intel Xeon family, GPUs: 1, GPU architecture: NVIDIA T4 Tensor Core, video memory: 16 GiB).
One thing that helped was to use torch.compile with mode="reduce-overhead".
I'm not sure how to set all these parameters to tune for low latency and high throughput:
I measured that a single model inference takes about 20 ms, and I want a max latency of around 50 ms, so I set max_batch_delay to 30 ms and max_batch_size to 100 (which seems a bit high at the moment).
How should min_worker and max_worker be set? Should they be set to the number of CPU cores?
Should I also increase default_workers_per_model?
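For reference, these batching and worker settings map onto TorchServe configuration roughly as in the snippet below; the model name, mar file, and values are illustrative (taken from the numbers above), not a recommendation:

```properties
# config.properties -- illustrative values only
default_workers_per_model=2
models={\
  "setfit": {\
    "1.0": {\
      "defaultVersion": true,\
      "marName": "setfit.mar",\
      "minWorkers": 1,\
      "maxWorkers": 2,\
      "batchSize": 100,\
      "maxBatchDelay": 30,\
      "responseTimeout": 120\
    }\
  }\
}
```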
Also, does BetterTransformer work with SetFit models as well?
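If it helps, a hypothetical way to try BetterTransformer on the transformer body of a SetFit model might look like the sketch below; whether this is supported or beneficial for a given SetFit checkpoint is an assumption to verify, not something confirmed in this thread:

```python
# Hypothetical sketch: apply optimum's BetterTransformer to the HF backbone
# inside a SetFit model. The model id is a placeholder.
from optimum.bettertransformer import BetterTransformer
from setfit import SetFitModel

model = SetFitModel.from_pretrained("your-org/your-setfit-model")  # placeholder

# SetFitModel wraps a sentence-transformers body; its first module holds the
# underlying Hugging Face transformer (assumption about the model layout).
hf_backbone = model.model_body[0].auto_model
model.model_body[0].auto_model = BetterTransformer.transform(hf_backbone)
```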
I have not used a profiler yet; I'm just looking to understand all these settings first.
Suggest a potential alternative/fix
No response