Question about model inference optimization #3134
Comments
Hi @geraldstanje, you will have to use the benchmarking tool as shown in this example.
@agunapal Can you run torch.compile in the init function of a TorchServe handler? Are there any problems with that? E.g. here: https://github.com/pytorch/serve/blob/master/examples/Huggingface_Transformers/Transformer_handler_generalized.py#L34 Do you have an example somewhere?
Hi @geraldstanje, you can download this mar file. Here we use torch.compile with BERT: https://github.com/pytorch/serve/blob/master/benchmarks/models_config/bert_torch_compile_gpu.yaml#L24
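For reference, a minimal sketch of what compiling inside a custom handler's initialize() could look like (the handler class name and the compile mode below are illustrative, not taken from the linked BERT config):

```python
# handler.py -- illustrative sketch of calling torch.compile in a TorchServe handler
import torch
from ts.torch_handler.base_handler import BaseHandler


class CompiledModelHandler(BaseHandler):
    def initialize(self, context):
        # BaseHandler.initialize loads the eager model from the model archive
        # onto self.model / self.device.
        super().initialize(context)
        self.model.eval()
        # Compile once at worker startup; the actual compilation is still
        # deferred until the first forward pass (see the warmup discussion below).
        self.model = torch.compile(self.model, mode="reduce-overhead")
```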
@agunapal OK, I want to look at what's inside the .mar file. Will I need https://github.com/pytorch/serve/blob/master/model-archiver/README.md for that?
You can wget the mar file and then do an unzip.
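Since a .mar file is just a zip archive, its contents can also be listed from Python; a small sketch (the file name is a placeholder):

```python
# Illustrative: list the files packed into a downloaded .mar archive.
import zipfile

with zipfile.ZipFile("bert_model.mar") as mar:  # placeholder file name
    for name in mar.namelist():
        print(name)
```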
@agunapal What could be the reason that torch.compile doesn't complete immediately? It seems torch.compile requires some warmup requests to run (not sure if that's specific to mode="reduce-overhead" only). Can you run this in initialize as well? Do you see any problems if the entire warmup takes longer than 30 seconds?
Eval may not be needed. The first torch.compile iteration can take time, so usually you need to send a few (3-4) requests to warm up. You can also check how we address this with AOT compile; you can find the example under the pt2 examples directory.
@agunapal The problem seems to be some lazy execution: I run torch.compile and it seems to stop there; only when I send a predict request does it actually run the compilation. How can I disable the lazy execution, or how can I check whether lazy execution is causing this behavior?
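One common workaround (a sketch based on the warmup suggestion above, not an official TorchServe recipe) is to run a few dummy forward passes at the end of initialize() so that compilation is triggered before the first real request arrives:

```python
# Illustrative warmup helper, called at the end of initialize() after torch.compile.
# The dummy input shape (batch of 1, sequence length 128) is an assumption; use
# inputs shaped like your real requests so the compiled graph matches production.
import torch


def warmup(model, device, num_iters=4):
    input_ids = torch.ones((1, 128), dtype=torch.long, device=device)
    attention_mask = torch.ones((1, 128), dtype=torch.long, device=device)
    with torch.inference_mode():
        for _ in range(num_iters):
            model(input_ids=input_ids, attention_mask=attention_mask)
```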
📚 The doc issue
There is a typo:
A larger batch size means a higher throughput at the cost of lower latency.
The correct version should be:
A larger batch size means a higher throughput at the cost of higher latency.
I have some more questions about model inference latency optimization.
I'm currently reading:
https://github.com/pytorch/serve/blob/master/docs/performance_guide.md#torchserve-on-cpu-
https://github.com/pytorch/serve/blob/master/docs/performance_guide.md#torchserve-on-gpu
https://github.com/pytorch/serve/blob/master/docs/configuration.md
https://huggingface.co/docs/transformers/en/perf_torch_compile
I'm currently running model inference for a SetFit model (https://huggingface.co/blog/setfit) on an ml.g4dn.xlarge instance on AWS (vCPUs: 4, memory: 16 GiB, memory per vCPU: 4 GiB, physical processor: Intel Xeon family, GPUs: 1, GPU architecture: NVIDIA T4 Tensor Core, video memory: 16 GiB).
One thing that helped was to use torch.compile with mode="reduce-overhead".
I'm not sure how to set all these parameters to tune for low latency and high throughput:
I measured that a single model inference takes about 20 ms, and I want a max latency of around 50 ms, so I set max_batch_delay to 30 ms and max_batch_size to 100 (which seems a bit high at the moment).
How should min_worker and max_worker be set? Should they be set to the number of CPU cores?
Should I also increase default_workers_per_model?
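For reference, these batching and worker settings map onto TorchServe configuration roughly as in the snippet below; the model name, mar file, and values are illustrative (taken from the numbers above), not a recommendation:

```properties
# config.properties -- illustrative values only
default_workers_per_model=2
models={\
  "setfit": {\
    "1.0": {\
      "defaultVersion": true,\
      "marName": "setfit.mar",\
      "minWorkers": 1,\
      "maxWorkers": 2,\
      "batchSize": 100,\
      "maxBatchDelay": 30,\
      "responseTimeout": 120\
    }\
  }\
}
```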
Also, does BetterTransformer work with SetFit models as well?
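If it helps, a hypothetical way to try BetterTransformer on the transformer body of a SetFit model might look like the sketch below; whether this is supported or beneficial for a given SetFit checkpoint is an assumption to verify, not something confirmed in this thread:

```python
# Hypothetical sketch: apply optimum's BetterTransformer to the HF backbone
# inside a SetFit model. The model id is a placeholder.
from optimum.bettertransformer import BetterTransformer
from setfit import SetFitModel

model = SetFitModel.from_pretrained("your-org/your-setfit-model")  # placeholder

# SetFitModel wraps a sentence-transformers body; its first module holds the
# underlying Hugging Face transformer (assumption about the model layout).
hf_backbone = model.model_body[0].auto_model
model.model_body[0].auto_model = BetterTransformer.transform(hf_backbone)
```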
I have not used a profiler yet; I'm just looking to understand all these settings first.
Suggest a potential alternative/fix
No response