Make modeling compatible with Nanotron + few optims #23

Closed

Conversation

@NouamaneTazi (Contributor) commented Oct 25, 2023

  • Implement GPT Neo's rope (rotary position embeddings)
  • Implement GQA (grouped-query attention; see the sketch after this list)
  • flash-attn with KV cache for fast inference
  • Make sure logits match (slight differences sometimes occur with padded tokens)
  • Make compatible with fast-llm's checkpoints
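
A minimal, standalone sketch of grouped-query attention, not taken from this PR's diff: several query heads share each key/value head, and the KV heads are repeated to match the query heads before running standard scaled-dot-product attention. Names and sizes below are illustrative only.

import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    # q: (batch, num_heads, seq, head_dim); k, v: (batch, num_kv_heads, seq, head_dim)
    group_size = q.shape[1] // k.shape[1]
    # Repeat each KV head so every query head in its group attends to the same keys/values.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

batch, seq, head_dim = 2, 16, 64
q = torch.randn(batch, 8, seq, head_dim)  # 8 query heads
k = torch.randn(batch, 2, seq, head_dim)  # 2 KV heads -> groups of 4
v = torch.randn(batch, 2, seq, head_dim)
print(gqa_attention(q, k, v).shape)  # torch.Size([2, 8, 16, 64])
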
Convert nanotron checkpoint to transformers
torchrun examples/starcoder2/convert_brrr_to_trfrs.py --checkpoint-path /fsx/shared/brrr/starcoder2_7b_4k_smol_data_750000/ --model-name bigcode/starcoder2-tokenizer --save-path /scratch/brrr/converted/starcoder2_7b_4k_smol_data_750000/
Run inference
# pip install git+https://github.com/bigcode-project/transformers.git@refs/pull/23/head
# pip install flash-attn --no-build-isolation

from pathlib import Path
from pprint import pprint

import torch
from transformers import AutoTokenizer, GPTBigCodeForCausalLM

# CUDA_VISIBLE_DEVICES=1 torchrun examples/starcoder2/convert_brrr_to_trfrs.py --checkpoint-path /fsx/shared/brrr/starcoder2_7b_4k_smol_data_750000/ --model-name bigcode/starcoder2-tokenizer --save-path /scratch/brrr/converted/starcoder2_7b_4k_smol_data_750000/
checkpoint_path = Path("/scratch/brrr/converted/starcoder2_7b_4k_smol_data_750000/")
model_name = "bigcode/starcoder2-tokenizer"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPTBigCodeForCausalLM.from_pretrained(
    checkpoint_path, torch_dtype=torch.bfloat16, device_map="cuda"
)

dummy_inputs = [
    "Passage: Daniel went back to the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:",
    "def fib(n)",
    "This film was probably inspired by Godzilla",
]
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left"

# inputs = tokenizer(dummy_inputs, return_tensors="pt", padding=True).to("cuda")
inputs = tokenizer(
    dummy_inputs, return_tensors="pt", padding="max_length", max_length=11
).to("cuda")

output = model.generate(**inputs, max_new_tokens=128, do_sample=False, use_cache=True)

# Strip the prompt tokens from each row and decode only the newly generated continuation.
for i in range(len(inputs.input_ids)):
    out = output[i][len(inputs.input_ids[i]) :]
    pprint(
        {
            "input": dummy_inputs[i],
            "generation": tokenizer.decode(out, clean_up_tokenization_spaces=False),
            "generation_ids": out,
        }
    )

Please make sure flash-attn>=2.4.2 is installed.
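
A quick sanity check (not from the PR), assuming the flash_attn package exposes __version__ (recent releases do) and that packaging is available:

import flash_attn
from packaging import version

# Fail early if the installed flash-attn is older than the required 2.4.2.
assert version.parse(flash_attn.__version__) >= version.parse("2.4.2"), (
    f"flash-attn {flash_attn.__version__} is too old; please install >=2.4.2"
)
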

cc @loubnabnl @xrsrke

@NouamaneTazi changed the title from "implement GPT Neo's rope" to "implement GPT Neo's rope + GQA" on Nov 1, 2023
@NouamaneTazi changed the title from "implement GPT Neo's rope + GQA" to "Make modeling compatible with Nanotron + few optims" on Jan 8, 2024
@RaymondLi0 commented

Models converted from fast-llm use the gqa branch. Maybe we can try to merge it into this one and unify the two implementations?

@UniverseFly commented

Thanks for the great work. I am not sure the following assertion is correct. When I tried to train the model and fed it an attention mask where only the first few tokens are masked (e.g., [[False, False, True, True, ..., True], ...]), the assertion failed. It worked after commenting out these lines:

assert ~(
    sequence_mask[:, :-1] & (~sequence_mask[:, 1:])  # True is never followed by False
).any(), f"Can't mask in the middle of sequence, please use USE_FAST=0 instead.\nGot sequence_mask: {sequence_mask}"

@NouamaneTazi (Contributor, Author) commented Feb 5, 2024

Closed in favor of #28
