
After using AutoAWQ to quantize the model, an error occurs when running inference on the model #636

Open
xuanzhangyang opened this issue Oct 21, 2024 · 0 comments


quantization:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import AwqConfig, AutoConfig
from awq import AutoAWQForCausalLM

# Use the local model
model_id = './Llama-3.1-Nemotron-70B-Instruct-HF'
# Or, use the original model
# model_id = 'meta-llama/Llama-2-7b-chat-hf'

quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

model = AutoAWQForCausalLM.from_pretrained(model_id, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)

quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

model.model.config.quantization_config = quantization_config


import os
output = "./Llama-3.1-Nemotron-70B-Instruct-HF-awq"
if not os.path.exists(output):
    os.mkdir(output)

model.save_quantized(output)
tokenizer.save_pretrained(output)
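
For reference, a quick sanity check (my own sketch, assuming the standard config.json layout written by save_quantized/save_pretrained) to confirm that the manually injected quantization_config actually ends up on disk:

import json
import os

output = "./Llama-3.1-Nemotron-70B-Instruct-HF-awq"

# Read the config written by save_quantized() and print the quantization block.
with open(os.path.join(output, "config.json")) as f:
    config = json.load(f)

# Expect bits=4, group_size=128, version="gemm" if the injection above worked.
print(config.get("quantization_config"))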

inference:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "./Llama-3.1-Nemotron-70B-Instruct-HF-awq"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [{"role": "user", "content": "hello"}]
tokenized_message = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True)
response_token_ids = model.generate(tokenized_message['input_ids'].cuda(), attention_mask=tokenized_message['attention_mask'].cuda(), max_new_tokens=4096, pad_token_id=tokenizer.eos_token_id)
generated_tokens = response_token_ids[:, len(tokenized_message['input_ids'][0]):]
generated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(generated_text)

but I got this error:

loc("/home/code/pytorch/AutoAWQ/awq/modules/triton/gemm.py":229:35): error: invalid element type in packLLEElements. Expected 'f16' but got 'f32'
loc("/home/code/pytorch/AutoAWQ/awq/modules/triton/gemm.py":229:35): error: invalid element type in packLLEElements. Expected 'f16' but got 'f32'
loc("/home/code/pytorch/AutoAWQ/awq/modules/triton/gemm.py":229:35): error: invalid element type in packLLEElements. Expected 'f16' but got 'f32'
loc("/home/code/pytorch/AutoAWQ/awq/modules/triton/gemm.py":229:35): error: invalid element type in packLLEElements. Expected 'f16' but got 'f32'
.
.
.
loc("/home/xuan/code/pytorch/AutoAWQ/awq/modules/triton/gemm.py":229:35): error: 'llvm.insertvalue' op Type mismatch: cannot insert 'f32' into '!llvm.struct<(f16, f16, f16, f16, f16, f16, f16, f16, f16, f16, f16, f16, f16, f16, f16, f16)>'

env info:

torch         2.4.1
triton        3.0.0
autoawq       0.2.6+cu121 /home/code/pytorch/AutoAWQ
transformers  4.46.0.dev0 /home/code/pytorch/hunningface/transformers
CUDA Version: 12.1
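
For comparison, the same checkpoint can also be loaded through AutoAWQ's own loader instead of the transformers integration. A minimal sketch, based on the from_quantized usage shown in the AutoAWQ README (whether it goes through the same Triton GEMM kernel is untested here):

import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "./Llama-3.1-Nemotron-70B-Instruct-HF-awq"

# Load the quantized weights with AutoAWQ directly instead of AutoModelForCausalLM.
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

messages = [{"role": "user", "content": "hello"}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
)
out = model.generate(
    inputs["input_ids"].cuda(),
    attention_mask=inputs["attention_mask"].cuda(),
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))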