Empty Generations / Failing Reproducing 40% on HumanEval #148

Open
leonardtang opened this issue Nov 30, 2023 · 3 comments

@leonardtang

leonardtang commented Nov 30, 2023

Hi all, I've set up StarCoder as follows:

gen_checkpoint = "bigcode/starcoder"
gen_device = "cuda"
gen_tokenizer, gen_model = setup_model_tokenizer(
    gen_checkpoint, bit_4=False, device=gen_device, bnb_config=None
)

def setup_model_tokenizer(
    path,
    device=None,
    bit_4=False,
    bit_8=False,
    max_memory=None,
    bnb_config=None,
):
    tokenizer = setup_tokenizer(path)
    if torch.cuda.device_count() > 1:
        model = AutoModelForCausalLM.from_pretrained(
            path,
            trust_remote_code=True,
            device_map="auto",
            load_in_4bit=bit_4,
            load_in_8bit=bit_8,
            max_memory=max_memory,
            quantization_config=bnb_config,
        ).eval()
    else:
        if not bit_4 and not bit_8:
            model = (
                AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True)
                .to(device)
                .eval()
            )
        else:
            model = AutoModelForCausalLM.from_pretrained(
                path,
                trust_remote_code=True,
                load_in_4bit=bit_4,
                load_in_8bit=bit_8,
                quantization_config=bnb_config,
            ).eval()
    return tokenizer, model

gen_outputs_dict = gen_model.generate(
    **gen_inputs,
    pad_token_id=gen_tokenizer.eos_token_id,
    max_new_tokens=NEW_TOKENS,
    return_dict_in_generate=True,
    do_sample=True,
    temperature=TEMP,
    top_p=0.95,
    top_k=0,
    stopping_criteria=construct_stopping_criteria(
        "code", STOP_SEQS, gen_tokenizer, gen_device
    ),
)

The stop tokens I'm using are a subset of those found in the Codex paper: STOP_SEQS = ["\nclass", "\ndef"].
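
For reference, stop sequences can also be applied as a post-processing step on the decoded completion. The following is a minimal sketch, not the `construct_stopping_criteria` helper used above (its code is not shown in this issue); the names `truncate_at_stop_sequences` and `is_empty_generation` are hypothetical.

```python
from typing import Sequence

STOP_SEQS = ["\nclass", "\ndef"]  # same subset as quoted above


def truncate_at_stop_sequences(completion: str, stop_seqs: Sequence[str]) -> str:
    """Cut the completion at the earliest occurrence of any stop sequence."""
    cut = len(completion)
    for stop in stop_seqs:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]


def is_empty_generation(completion: str) -> bool:
    """Treat a completion as empty if nothing but whitespace is left."""
    return not completion.strip()


# A completion that contains a body followed by a new function is cut
# just before the "\ndef" that starts the new function.
body = truncate_at_stop_sequences(
    "    return x + 1\n\ndef helper():\n    pass\n", STOP_SEQS
)

# A completion that begins directly with a stop sequence truncates to "".
nothing = truncate_at_stop_sequences("\ndef other():\n    pass\n", STOP_SEQS)
```

Under this sketch, any completion whose first characters are already a stop sequence counts as empty, which is one way an empty generation could arise.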

However, I'm consistently getting empty generations: just an EOS token. Concretely, about 20% of my generations on HumanEval are empty.

I'm using the suggested prompt as well, i.e. "<filename>solutions/solution_1.py\n# Here is the correct implementation of the code exercise\n".

I'm getting around 15% on HumanEval, not 40% as stated in the paper. I'm setting TEMP = 0.2 and NEW_TOKENS=128. Would somebody be able to point out what might be going wrong?
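
For context when comparing the two numbers: HumanEval scores are usually reported as pass@k. Assuming both the 15% and the 40% figures are pass@1 (the issue does not say so explicitly), the unbiased estimator from the Codex paper can be sketched as below. `pass_at_k` is an illustrative name, not a function from any library mentioned here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass the unit tests
    k: the k in pass@k (must satisfy k <= n)
    """
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k=1 the estimate reduces to the fraction of passing samples.
example = pass_at_k(n=10, c=3, k=1)
```

An empty generation counts as a failing sample here, so a high rate of empty outputs lowers the score directly.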

@loubnabnl
Contributor

Can you try again with the framework we used for evaluation? It is at https://github.com/bigcode-project/bigcode-evaluation-harness and has an argument for adding a prefix. In your code it's not clear whether you stripped the prompts (this impacts performance). We also use more stop words.
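
A minimal sketch of the two prompt-side points above, stripping the prompt and prepending a prefix. This is an assumption about the intended preprocessing, not a copy of the harness code; `build_prompt` is a hypothetical helper, and the prefix string is the one quoted earlier in this thread.

```python
PREFIX = (
    "<filename>solutions/solution_1.py\n"
    "# Here is the correct implementation of the code exercise\n"
)


def build_prompt(task_prompt: str, prefix: str = "") -> str:
    """Strip surrounding whitespace from the task prompt, then prepend the prefix."""
    return prefix + task_prompt.strip()


# Toy prompt in the HumanEval style, ending with a trailing newline.
raw_prompt = 'def add_one(x: int) -> int:\n    """Return x + 1."""\n'

stripped = build_prompt(raw_prompt)
prefixed = build_prompt(raw_prompt, prefix=PREFIX)
```

The prefix is added before the stripped prompt, so the text passed to the tokenizer ends at the closing docstring quotes rather than at a newline.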

@jack-jjm

I'm having a similar problem - lots of empty generations on a straightforward prompt from HumanEval. For example, this code:

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
device = "cuda" # for GPU usage or "cpu" for CPU usage

prompt = """\
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    \""" Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    \"""
"""

tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_auth_token="<auth_token>")
model = AutoModelForCausalLM.from_pretrained(checkpoint, use_auth_token="<auth_token>", device_map="cuda").to(device)

inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

Just generates this output:

Loading checkpoint shards: 100%|██████████| 7/7 [00:32<00:00,  4.63s/it]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    """

def has_close_elements_v2(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    """

def has_close_elements_v3(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    """

@loubnabnl
Contributor

Hi, this prompt is not stripped; you need to remove the trailing \n for it to work properly. I also just ran the code from the harness, and it reproduces the reported numbers.
