
After using AutoAWQ to quantize the model, an error occurs when running inference on the model #636

Open
xuanzhangyang opened this issue Oct 21, 2024 · 0 comments


quantization:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import AwqConfig, AutoConfig
from awq import AutoAWQForCausalLM

# Use the local model
model_id = './Llama-3.1-Nemotron-70B-Instruct-HF'
# Or, use the original model
# model_id = 'meta-llama/Llama-2-7b-chat-hf'

quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

model = AutoAWQForCausalLM.from_pretrained(model_id, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)

quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

model.model.config.quantization_config = quantization_config


import os
output = "./Llama-3.1-Nemotron-70B-Instruct-HF-awq"
if not os.path.exists(output):
    os.mkdir(output)

model.save_quantized(output)
tokenizer.save_pretrained(output)
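
For reference, a quick sanity check (my own sketch, assuming the standard config.json layout written by save_quantized/save_pretrained) to confirm that the manually injected quantization_config actually ends up on disk:

import json
import os

output = "./Llama-3.1-Nemotron-70B-Instruct-HF-awq"

# Read the config written by save_quantized() and print the quantization block.
with open(os.path.join(output, "config.json")) as f:
    config = json.load(f)

# Expect bits=4, group_size=128, version="gemm" if the injection above worked.
print(config.get("quantization_config"))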

inference:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "./Llama-3.1-Nemotron-70B-Instruct-HF-awq"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [{"role": "user", "content": "hello"}]
tokenized_message = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True)
response_token_ids = model.generate(tokenized_message['input_ids'].cuda(), attention_mask=tokenized_message['attention_mask'].cuda(), max_new_tokens=4096, pad_token_id=tokenizer.eos_token_id)
generated_tokens = response_token_ids[:, len(tokenized_message['input_ids'][0]):]
generated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(generated_text)

but I got this error:

loc("/home/code/pytorch/AutoAWQ/awq/modules/triton/gemm.py":229:35): error: invalid element type in packLLEElements. Expected 'f16' but got 'f32'
loc("/home/code/pytorch/AutoAWQ/awq/modules/triton/gemm.py":229:35): error: invalid element type in packLLEElements. Expected 'f16' but got 'f32'
loc("/home/code/pytorch/AutoAWQ/awq/modules/triton/gemm.py":229:35): error: invalid element type in packLLEElements. Expected 'f16' but got 'f32'
loc("/home/code/pytorch/AutoAWQ/awq/modules/triton/gemm.py":229:35): error: invalid element type in packLLEElements. Expected 'f16' but got 'f32'
.
.
.
loc("/home/xuan/code/pytorch/AutoAWQ/awq/modules/triton/gemm.py":229:35): error: 'llvm.insertvalue' op Type mismatch: cannot insert 'f32' into '!llvm.struct<(f16, f16, f16, f16, f16, f16, f16, f16, f16, f16, f16, f16, f16, f16, f16, f16)>'

env info:

torch         2.4.1
triton        3.0.0
autoawq       0.2.6+cu121 /home/code/pytorch/AutoAWQ
transformers  4.46.0.dev0 /home/code/pytorch/hunningface/transformers
CUDA Version: 12.1
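
For comparison, the same checkpoint can also be loaded through AutoAWQ's own loader instead of the transformers integration. A minimal sketch, based on the from_quantized usage shown in the AutoAWQ README (whether it goes through the same Triton GEMM kernel is untested here):

import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "./Llama-3.1-Nemotron-70B-Instruct-HF-awq"

# Load the quantized weights with AutoAWQ directly instead of AutoModelForCausalLM.
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

messages = [{"role": "user", "content": "hello"}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
)
out = model.generate(
    inputs["input_ids"].cuda(),
    attention_mask=inputs["attention_mask"].cuda(),
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))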