Make modeling compatible with Nanotron + few optims #23

Closed

Conversation

@NouamaneTazi (Contributor) commented Oct 25, 2023

  • Implement GPT Neo's rope (rotary position embeddings)
  • Implement GQA (grouped-query attention; see the sketch after this list)
  • flash-attn with KV cache for fast inference
  • Make sure logits match (slight differences sometimes occur with padded tokens)
  • Make compatible with fast-llm's checkpoints
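
A minimal, standalone sketch of grouped-query attention, not taken from this PR's diff: several query heads share each key/value head, and the KV heads are repeated to match the query heads before running standard scaled-dot-product attention. Names and sizes below are illustrative only.

import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    # q: (batch, num_heads, seq, head_dim); k, v: (batch, num_kv_heads, seq, head_dim)
    group_size = q.shape[1] // k.shape[1]
    # Repeat each KV head so every query head in its group attends to the same keys/values.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

batch, seq, head_dim = 2, 16, 64
q = torch.randn(batch, 8, seq, head_dim)  # 8 query heads
k = torch.randn(batch, 2, seq, head_dim)  # 2 KV heads -> groups of 4
v = torch.randn(batch, 2, seq, head_dim)
print(gqa_attention(q, k, v).shape)  # torch.Size([2, 8, 16, 64])
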
Convert nanotron checkpoint to transformers
torchrun examples/starcoder2/convert_brrr_to_trfrs.py --checkpoint-path /fsx/shared/brrr/starcoder2_7b_4k_smol_data_750000/ --model-name bigcode/starcoder2-tokenizer --save-path /scratch/brrr/converted/starcoder2_7b_4k_smol_data_750000/
Run inference
# pip install git+https://github.com/bigcode-project/transformers.git@refs/pull/23/head
# pip install flash-attn --no-build-isolation

from pathlib import Path
from pprint import pprint

import torch
from transformers import AutoTokenizer, GPTBigCodeForCausalLM

# CUDA_VISIBLE_DEVICES=1 torchrun examples/starcoder2/convert_brrr_to_trfrs.py --checkpoint-path /fsx/shared/brrr/starcoder2_7b_4k_smol_data_750000/ --model-name bigcode/starcoder2-tokenizer --save-path /scratch/brrr/converted/starcoder2_7b_4k_smol_data_750000/
checkpoint_path = Path("/scratch/brrr/converted/starcoder2_7b_4k_smol_data_750000/")
model_name = "bigcode/starcoder2-tokenizer"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPTBigCodeForCausalLM.from_pretrained(
    checkpoint_path, torch_dtype=torch.bfloat16, device_map="cuda"
)

dummy_inputs = [
    "Passage: Daniel went back to the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:",
    "def fib(n)",
    "This film was probably inspired by Godzilla",
]
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left"

# inputs = tokenizer(dummy_inputs, return_tensors="pt", padding=True).to("cuda")
inputs = tokenizer(
    dummy_inputs, return_tensors="pt", padding="max_length", max_length=11
).to("cuda")

output = model.generate(**inputs, max_new_tokens=128, do_sample=False, use_cache=True)

# Strip the prompt tokens from each row and decode only the newly generated continuation.
for i in range(len(inputs.input_ids)):
    out = output[i][len(inputs.input_ids[i]) :]
    pprint(
        {
            "input": dummy_inputs[i],
            "generation": tokenizer.decode(out, clean_up_tokenization_spaces=False),
            "generation_ids": out,
        }
    )

Please make sure flash-attn>=2.4.2 is installed.
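
A quick sanity check (not from the PR), assuming the flash_attn package exposes __version__ (recent releases do) and that packaging is available:

import flash_attn
from packaging import version

# Fail early if the installed flash-attn is older than the required 2.4.2.
assert version.parse(flash_attn.__version__) >= version.parse("2.4.2"), (
    f"flash-attn {flash_attn.__version__} is too old; please install >=2.4.2"
)
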

cc @loubnabnl @xrsrke

@NouamaneTazi changed the title from "implement GPT Neo's rope" to "implement GPT Neo's rope + GQA" on Nov 1, 2023
@NouamaneTazi changed the title from "implement GPT Neo's rope + GQA" to "Make modeling compatible with Nanotron + few optims" on Jan 8, 2024
@RaymondLi0 commented

Models converted from fast-llm use the gqa branch. Maybe we can try to merge it into this one and unify the two implementations?

@UniverseFly commented

Thanks for the great work. I am not sure the following assertion is correct. When I tried to train the model and fed it an attention mask where only the first few tokens are masked (e.g., [[False, False, True, True, ..., True], ...]), the assertion failed. It worked after commenting out these lines:

assert ~(
    sequence_mask[:, :-1] & (~sequence_mask[:, 1:])  # True is never followed by False
).any(), f"Can't mask in the middle of sequence, please use USE_FAST=0 instead.\nGot sequence_mask: {sequence_mask}"

@NouamaneTazi (Contributor, Author) commented Feb 5, 2024

Closed in favor of #28
