Add FALCON-40B Inference-Kernel Support #3656
base: master
Conversation
…t/DeepSpeed into ds-inference/add-falcon-support
If the model is loaded using a path |
Got this error with this PR on 4xA30:
DeepSpeed version: 0.9.3+0df4059d |
@RezaYazdaniAminabadi I am unable to replicate the latency (getting >100ms). Can you share more information about your environment? |
Hi @Yard1 |
@RezaYazdaniAminabadi The times I get (I generate twice and take the second run, because the first one is always a bit slower): Generations:
With kernel inject:
|
@RezaYazdaniAminabadi I have access only to A100-80GB (p4de.24xlarge). I ran the following script:
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
import transformers
import torch
import deepspeed
import time
from deepspeed.accelerator import get_accelerator
model = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True).bfloat16()
model = deepspeed.init_inference(model, mp_size=2, replace_with_kernel_inject=True)
input_prompt = [
"Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:"
]
input_tokens = tokenizer.batch_encode_plus(input_prompt, return_tensors="pt",)
token_num = input_tokens['input_ids'].size(-1)
for t in input_tokens:
    if torch.is_tensor(input_tokens[t]):
        input_tokens[t] = input_tokens[t].to(get_accelerator().current_device_name())
input_tokens.pop('token_type_ids')
# Warmup
sequences = model.generate(**input_tokens, min_new_tokens=512, max_new_tokens=512, do_sample=True)
st = time.monotonic()
for i in range(2):
    sequences = model.generate(**input_tokens, min_new_tokens=512, max_new_tokens=512, do_sample=True)
tt = time.monotonic() - st
print(f"Time taken {tt/2} time per new token {tt/512/2}")
if torch.distributed.get_rank() == 0:
    print(f"Result: {tokenizer.batch_decode(sequences, skip_special_tokens=True)[0]}")
I just ran this as
Results:
ds_report:
I have tried with 2 A100s and 4 A100s and got similar results for both. I have noticed that latency increases linearly when increasing the batch size, and that each next token takes longer to generate (which can be pretty dramatic with a large number of input/output tokens). Given that the main change I made compared to your script was to increase the number of tokens from 300 to 512, I wager that's the problem. I have seen similar behavior with the 7B model without deepspeed, so I assume it's due to the architecture. Still, that is very suboptimal. Are there any optimization tweaks that can be done on deepspeed's side to fix this, or should this be taken up with the model's authors? |
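For anyone trying to reproduce the length dependence described above, here is a minimal timing sketch. It assumes `model`, `tokenizer`, and `input_tokens` are already set up exactly as in the script above; it is an illustration, not part of this PR:

```python
# Hedged sketch: time generation at a few output lengths to see how the average
# per-token latency grows with sequence length (assumes the setup from the script above).
import time
import torch

for new_tokens in (128, 256, 512):
    torch.cuda.synchronize()
    start = time.monotonic()
    model.generate(**input_tokens, min_new_tokens=new_tokens,
                   max_new_tokens=new_tokens, do_sample=True)
    torch.cuda.synchronize()
    elapsed = time.monotonic() - start
    if torch.distributed.get_rank() == 0:
        print(f"{new_tokens} new tokens: {elapsed:.2f}s total, "
              f"{1000 * elapsed / new_tokens:.1f} ms/token on average")
```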
Hi @Yard1, Thanks for sharing your script.
Please let me know if that changes the latency. |
Thanks @thies1006 for verifying that this works on your side. I think your perf improvement is smaller (about 50%); however, since you are doing model parallelism, the inference performance depends heavily on the communication bandwidth you can achieve across these GPUs. |
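One way to sanity-check how much of the gap comes from the interconnect is a quick all-reduce bandwidth probe. The sketch below is only an illustration under assumed sizes (the message size and iteration counts are made up), launched with `torchrun --nproc_per_node=<num_gpus>`:

```python
# Hedged sketch: measure achievable all-reduce bus bandwidth across the GPUs used
# for tensor parallelism. Payload size and iteration counts are arbitrary assumptions.
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

x = torch.ones(64 * 1024 * 1024, dtype=torch.bfloat16, device="cuda")  # ~128 MB payload
for _ in range(5):  # warmup
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.monotonic()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
avg = (time.monotonic() - start) / iters

# Ring all-reduce bus-bandwidth estimate: 2 * (n-1)/n * bytes / time
world = dist.get_world_size()
bus_gbps = 2 * (world - 1) / world * x.numel() * x.element_size() / avg / 1e9
if dist.get_rank() == 0:
    print(f"avg all-reduce {avg * 1e3:.2f} ms, ~{bus_gbps:.1f} GB/s bus bandwidth")
```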
@RezaYazdaniAminabadi |
Hi @RezaYazdaniAminabadi, here's my updated script:
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
import transformers
import torch
import deepspeed
import time
from deepspeed.accelerator import get_accelerator
model = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True).bfloat16()
model = deepspeed.init_inference(model, mp_size=2, replace_with_kernel_inject=True)
input_prompt = [
"Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:"
]
input_tokens = tokenizer.batch_encode_plus(input_prompt, return_tensors="pt",)
token_num = input_tokens['input_ids'].size(-1)
for t in input_tokens:
    if torch.is_tensor(input_tokens[t]):
        input_tokens[t] = input_tokens[t].to(get_accelerator().current_device_name())
input_tokens.pop('token_type_ids')
# Warmup
sequences = model.generate(**input_tokens, min_length=512, max_length=512, do_sample=True)
torch.cuda.synchronize()
st = time.monotonic()
for i in range(2):
    torch.cuda.synchronize()
    sequences = model.generate(**input_tokens, min_length=512, max_length=512, do_sample=True)
    torch.cuda.synchronize()
tt = time.monotonic() - st
print(f"Time taken {tt/2} time per new token {tt/512/2}")
if torch.distributed.get_rank() == 0:
    print(f"Result: {tokenizer.batch_decode(sequences, skip_special_tokens=True)[0]}")
With those changes (adding
This is still slower than what you were seeing, @RezaYazdaniAminabadi. Could you check if you get a similar result when using 512 instead of 300 tokens? EDIT: The results with 300 tokens match what you have gotten more closely:
It would appear that the Falcon model has an issue with |
@RezaYazdaniAminabadi this solution will not work on Falcon-7B since the modelling file is different. I think this is a bug HuggingFace needs to solve, but just FYI. Maybe some workaround is possible, like changing the attention layer number used for injection. |
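For reference, here is a hedged sketch of the kind of workaround being suggested: pointing DeepSpeed's `injection_policy` at the decoder-layer class from the remote code. The layer class lookup and the projection names below are my assumptions about the Falcon-7B modeling file, not something verified in this PR:

```python
# Hedged sketch: AutoTP-style injection for Falcon-7B via injection_policy.
# "self_attention.dense" and "mlp.dense_4h_to_h" are assumed output-projection
# names from the remote-code modeling file and may need adjusting.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True).bfloat16()

# Grab the decoder-layer class from the loaded remote code instead of importing it.
falcon_layer_cls = type(model.transformer.h[0])

model = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.bfloat16,
    # an all-reduce is inserted after these output projections on each layer
    injection_policy={falcon_layer_cls: ("self_attention.dense", "mlp.dense_4h_to_h")},
)
```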
Yes, I know it does not work there. I will look into it and see how it can be supported. Thanks for letting me know. Best, |
Can you share your command and env? @RezaYazdaniAminabadi I always got this error:
[2023-06-21 06:59:34,573] [ERROR] [launch.py:320:sigkill_handler] ['/opt/conda/envs/bin/python', '-u', 'test_ds.py', '--local_rank=7'] exits with return code = -9
My env: deepspeed: 0.9.3
Below is the script:
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
model = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True).bfloat16()
input_prompt = [
# Warmup
sequences = model.generate(**input_tokens, min_new_tokens=512, max_new_tokens=512, do_sample=True) |
This is really nice work! Look forward to Falcon 7b! |
Hi guys, sorry I was so slow on this thread. I will start working more on this toward the weekend and bring in FALCON-7B support too. I am actually amazed by how much interest there is in this work.
I will work on adding some of this support for this model. I would also appreciate it if anyone would like to help improve this or apply some already-developed techniques to improve it. @alexwong2024,
can you please try it and see if the issue is resolved? |
I wanted to help, but writing CUDA/C++ code is not really my strength. I'm happy to do some testing once it reaches that stage. I would like to try Falcon-7B in DeepSpeed inference because I believe MQA can bring down the latency a lot and make Falcon-7B very production friendly. This can be especially important for cases where streaming output is not available. On another note, I tried the HuggingFace implementation of Falcon-7B. Somehow the latency is really high, around twice as much as models with a similar architecture that don't have MQA, like StableLM 7B. Using or not using Accelerate doesn't make a difference. I wonder if MQA is not implemented correctly. |
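A quick back-of-the-envelope calculation shows why MQA matters so much for serving: with a single shared K/V head, the KV cache shrinks by roughly the number of query heads. The numbers below are assumptions for a Falcon-7B-like configuration (32 layers, 71 heads of dimension 64, bf16), not measured values:

```python
# Hedged sketch: KV-cache size with and without MQA, under assumed Falcon-7B-like dims.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem  # K and V

mha = kv_cache_bytes(layers=32, kv_heads=71, head_dim=64, seq_len=2048, batch=8)
mqa = kv_cache_bytes(layers=32, kv_heads=1, head_dim=64, seq_len=2048, batch=8)
print(f"MHA-style (71 KV heads): {mha / 2**30:.2f} GiB, MQA (1 KV head): {mqa / 2**30:.2f} GiB")
```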
… from over 10 minutes to about 15 sec)
Hi everyone, I have added some changes here that can boost the loading time of this model significantly (from 10 min to less than 15 sec). To test this please use this script as follows:
You need to create the mp_sharded checkpoints to get the fastest loading time. To do this, pass the flag |
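The flag name is cut off above, so the following is only an assumption-labeled sketch of how sharded-checkpoint creation and reloading usually looks with `deepspeed.init_inference`; the argument names and the generated file name are my guesses, not necessarily what this PR uses:

```python
# Hedged sketch: create mp-sharded checkpoints once, then reload them for fast startup.
# save_mp_checkpoint_path / checkpoint and the json file name are assumptions.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b", trust_remote_code=True).bfloat16()

# First run: write one checkpoint shard per tensor-parallel rank plus an index json.
engine = deepspeed.init_inference(
    model,
    mp_size=4,
    dtype=torch.bfloat16,
    replace_with_kernel_inject=True,
    save_mp_checkpoint_path="/tmp/falcon-40b-mp4",  # hypothetical output directory
)

# Later runs: point `checkpoint` at the generated index so each rank loads only its shard.
# engine = deepspeed.init_inference(
#     model, mp_size=4, dtype=torch.bfloat16, replace_with_kernel_inject=True,
#     checkpoint="/tmp/falcon-40b-mp4/ds_inference_config.json",  # hypothetical file name
# )
```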
I actually have a question for you guys: has anyone tested the inference of this model on the text_generation_inference system from HuggingFace? |
Yes. What information do you need? |
I tried FLAN-T5-XXL on TGI and compared the performance with DeepSpeed (DS) and FasterTransformer (FT) on Deep Java Library (DJL). I used g5.12xlarge on AWS and fixed tensor_parallel_degree=4. For generation_len=256 and batch_size=1, FT takes ~3s, while DS and TGI double the latency. TGI is known for its continuous batching technique, and DJL also has dynamic batching; I didn't test that. I think FT rewrites everything in CUDA for T5, while DS and TGI probably only rewrite some modules/layers? I guess that causes the latency difference. |
@RezaYazdaniAminabadi So for the Falcon kernel you created (06/20): it is faster than the TextGeneration Flash implementation at sequence lengths < 256, but the kernel crashes on longer sequences. We cannot do TGI's continuous batching since DeepSpeed dropped the KV-cache part. I think the next big thing to do is enabling a way to massage the KV cache for LLM inference. This will catch up. LLAMA is still beating TGI, FYI. Great job! |
Thanks for the feedback; it's great to see some of the downsides and benefits of our pipeline, and it helps us improve the stack. I just wanted to know whether these model-loading slowness problems are solved in their pipeline, so that I can use it! |
@RezaYazdaniAminabadi Hi, why is this PR closed? Is it due to the lack of some KV-cache support for Falcon? Apart from that, I'm interested in supporting meta-tensor loading for Falcon-40B and other models like LLAMA2-70B and GPT-3 in the future, but I don't know how to do that. I think DeepSpeed Currently, it will fail on |
Hi @dc3671, I have most of the fixes, however, I wanted to better understand the contributions I am bringing here. I will reopen this soon. |
I worked a bit on this PR and added the Meta-tensor loading support. Also, Falcon-7B is runnable now. I have added a script,
Next, I am going to try testing the newest Falcon model (180B).
I added a PR to fix this small problem: #4654. Hi @RezaYazdaniAminabadi, did you try the latest change to Falcon-40B from last month? It now uses an in-repo model file, which seems incompatible with DeepSpeed's AutoTP algorithm. |
Hi @RezaYazdaniAminabadi, thanks for your contribution. I used this script and hit the following issue. My environment is deepspeed=0.12.3, transformers=4.34.0, torch=2.0.1, and the instance is p4de. Could you help me understand the reason? [2023-12-08 11:58:57,763] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1 |
This PR adds the Policy, Containers and some kernels for running the FALCON-40B model with tensor-model parallelism.
FALCON-40B Architecture Overview
The FALCON model is an interesting model with an inference-friendly structure: 1) it shares the K and V heads across the query heads by broadcasting data in groups of 16 heads, which reduces the KV-cache by 16x and allows inference to run very efficiently with much higher throughput; 2) similar to the GPT-J and GPT-NeoX architectures, it uses parallel MLP and attention blocks, which on the one hand helps overlap computation when there is too little work to saturate the GPU cores, and on the other hand reduces communication when using tensor-model parallelism, as it requires only one all-reduce at the end of each layer.
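As a rough illustration of that block structure, here is a minimal PyTorch sketch with assumed Falcon-40B-like dimensions (128 query heads sharing 8 K/V heads, i.e. groups of 16). It is not the kernel implementation in this PR; rotary embeddings, the KV cache, and the 40B variant's separate per-branch layer norms are omitted:

```python
# Minimal sketch (assumed dims, simplified): grouped K/V heads broadcast across
# query heads, plus parallel attention/MLP branches feeding a single residual add.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FalconStyleBlock(nn.Module):
    def __init__(self, hidden=8192, n_q_heads=128, n_kv_heads=8):
        super().__init__()
        self.hd = hidden // n_q_heads                 # head dimension
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.ln = nn.LayerNorm(hidden)
        # fused QKV projection: all Q heads, but only n_kv K and V heads
        self.qkv = nn.Linear(hidden, (n_q_heads + 2 * n_kv_heads) * self.hd, bias=False)
        self.attn_out = nn.Linear(hidden, hidden, bias=False)
        self.mlp_up = nn.Linear(hidden, 4 * hidden, bias=False)
        self.mlp_down = nn.Linear(4 * hidden, hidden, bias=False)

    def forward(self, x):
        b, s, _ = x.shape
        h = self.ln(x)                                # one normalized input feeds both branches
        q, k, v = self.qkv(h).split(
            [self.n_q * self.hd, self.n_kv * self.hd, self.n_kv * self.hd], dim=-1)
        q = q.view(b, s, self.n_q, self.hd).transpose(1, 2)
        k = k.view(b, s, self.n_kv, self.hd).transpose(1, 2)
        v = v.view(b, s, self.n_kv, self.hd).transpose(1, 2)
        group = self.n_q // self.n_kv                 # 16 query heads per K/V head
        k = k.repeat_interleave(group, dim=1)         # broadcast K to its query group
        v = v.repeat_interleave(group, dim=1)         # broadcast V to its query group
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, s, -1)
        # parallel branches summed into one residual: with tensor parallelism this
        # needs only a single all-reduce per layer, after the two branch outputs
        return x + self.attn_out(attn) + self.mlp_down(F.gelu(self.mlp_up(h)))
```

A toy forward pass such as `FalconStyleBlock()(torch.randn(2, 16, 8192))` exercises the same shapes on CPU.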
Testing the model using Multi-GPU Inference
For running this model, I used the following code snippet with the same query used on the HuggingFace website to test this model, on 4 A100-40GB GPUs. One side note: you cannot run this model, as is, on older NVIDIA architectures such as V100, since it uses a special operation (`F.scaled_dot_product_attention`) that only runs on GPU hardware with compute capability 8.0 or higher. With DeepSpeed-Inference kernel support, you can run it on 4 V100-32GB as well, without any code changes to the original model.
Generation Result:
Performance Evaluation
For measuring the performance, I ran the same query 10 times and took the average token latency. I used PyTorch 2.0.1+cu118 as the baseline. Compared to PyTorch, DeepSpeed-Inference obtains a 2.5x speedup, reducing the per-token latency from 93 ms to 36 ms.
TODO: