Default TGI Inference parameter values #2978

Open
2 of 4 tasks
ashwincv0112 opened this issue Jan 31, 2025 · 1 comment
System Info

Hi team,

We are trying to get the default parameter values that are used when invoking a fine-tuned model deployed with TGI (latest version).

In the logs we can see the information below.

{ best_of: None, 
temperature: None, 
repetition_penalty: None, 
frequency_penalty: None, 
top_k: None, 
top_p: None,
typical_p: None, 
do_sample: false, 
max_new_tokens: Some(672), 
return_full_text: None, 
stop: [], 
truncate: None, 
watermark: false, 
details: false, 
decoder_input_details: false, 
seed: None, 
top_n_tokens: None, 
grammar: None, 
adapter_id: None } 
total_time="10.779314795s" 
validation_time="536.816µs" 
queue_time="60.971µs"
inference_time="10.778717208s" 
time_per_token="16.039757ms" 
seed="None"}

The objective of this exercise is to get the same level of output accuracy from a fine-tuned model and from the base model + LoRA adapters (deployed using TGI's multi-LoRA functionality).

We get the expected output from the fine-tuned model, but when using multi-LoRA the output accuracy drops drastically.

We are using the following configuration when invoking:

While using the fine-tuned model:

'parameters': {
            'max_new_tokens': token_limit,
        },

While using the multi-LoRA functionality:

'parameters': {
            'max_new_tokens': token_limit,
            'adapter_id': 'adapter1',
        },
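For completeness, this is roughly how we send both requests to the server's /generate route (a minimal sketch using the requests library; TGI_URL, input_prompt, and token_limit are placeholders for our actual endpoint and values):

import requests

TGI_URL = "http://localhost:8080/generate"   # placeholder endpoint
input_prompt = "..."                         # placeholder prompt
token_limit = 672                            # placeholder token budget

# Fine-tuned model: only max_new_tokens is set, everything else falls back to defaults
finetuned_payload = {
    "inputs": input_prompt,
    "parameters": {"max_new_tokens": token_limit},
}

# Multi-LoRA: identical parameters plus the adapter routing field
multilora_payload = {
    "inputs": input_prompt,
    "parameters": {"max_new_tokens": token_limit, "adapter_id": "adapter1"},
}

for payload in (finetuned_payload, multilora_payload):
    resp = requests.post(TGI_URL, json=payload, timeout=120)
    resp.raise_for_status()
    print(resp.json()["generated_text"])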

We referred to the following link:

https://github.com/huggingface/text-generation-inference/blob/38773453ae0d29fba3dc79a38d589ebdc5451093/router/src/lib.rs

Could you clarify whether there is any difference between the default values used in the two approaches mentioned above? Also, could you suggest a way to improve the output accuracy while using multi-LoRA?

Thanks.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Multi-LoRA deployment:

docker run -it \
  --gpus all \
  --shm-size 1g \
  -v /home/ubuntu/data:/data \
  -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id=/data/starcoder2-3b \
  --lora-adapters=adapter1=/data/adapter1 \
  --dtype bfloat16

Fine-tuned model deployment:

docker run --gpus all -d -p 8080:80 \
	-v /home/ubuntu/data_backup:/data \
	ghcr.io/huggingface/text-generation-inference:latest \
	--model-id=/data/ 
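To rule out differences in server-side configuration between the two containers, a quick check (a sketch; the base URLs are placeholders, and the exact field names returned may vary slightly between TGI versions) is to compare what each deployment reports on its /info route:

import requests

# Placeholder base URLs; adjust to wherever each container is exposed
ENDPOINTS = {
    "finetuned": "http://localhost:8080",
    "multi_lora": "http://localhost:8081",
}

for name, base_url in ENDPOINTS.items():
    info = requests.get(f"{base_url}/info", timeout=30).json()
    # /info reports the effective server configuration (model id, dtype,
    # token limits, ...), which makes differences between the two
    # deployments easy to spot
    print(name, {k: info.get(k) for k in ("model_id", "model_dtype", "max_input_tokens", "max_total_tokens")})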

Expected behavior

With the same parameter values, we should get the same output (or at least the same output accuracy).


ashwincv0112 commented Feb 5, 2025

Hi Team,

We are trying to match the output of a TGI-deployed fine-tuned model with that of a model deployed using TGI's multi-LoRA functionality (a base model, Starcoder2-3B, plus two different fine-tuned adapters).

Even after keeping all the inference parameters the same, we get completely different outputs for the same prompts.

Please find the list of parameters used below:

{
    'inputs': input_prompt,
    'parameters': {
        'max_new_tokens': token_limit,
        'adapter_id': 'adapter1',
        'best_of': None,
        'decoder_input_details': False,
        'details': False,
        'do_sample': False,
        'frequency_penalty': None,
        'grammar': None,
        'repetition_penalty': None,
        'return_full_text': None,
        'seed': None,
        'temperature': None,
        'top_k': None,
        'top_n_tokens': None,
        'top_p': None,
        'truncate': None,
        'typical_p': None,
        'watermark': False
    }
}
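For a deterministic side-by-side check, this is roughly the comparison we run (a sketch; the endpoint URLs, input_prompt, and token_limit are placeholders, and with do_sample set to False both servers should decode greedily, so any divergence comes from the models rather than from sampling):

import difflib
import requests

FINETUNED_URL = "http://localhost:8080/generate"   # placeholder
MULTILORA_URL = "http://localhost:8081/generate"   # placeholder
input_prompt = "..."                               # placeholder prompt
token_limit = 672                                  # placeholder token budget

def generate(url, extra_params=None):
    parameters = {"max_new_tokens": token_limit, "do_sample": False}
    parameters.update(extra_params or {})
    resp = requests.post(url, json={"inputs": input_prompt, "parameters": parameters}, timeout=120)
    resp.raise_for_status()
    return resp.json()["generated_text"]

finetuned_out = generate(FINETUNED_URL)
multilora_out = generate(MULTILORA_URL, {"adapter_id": "adapter1"})

# Greedy decoding is deterministic, so any diff below reflects a real
# difference between the fine-tuned weights and base model + adapter
print("\n".join(difflib.unified_diff(
    finetuned_out.splitlines(), multilora_out.splitlines(), lineterm="")))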

Could you provide some input on this?

Thanks.
