-
From memory, there's a note in the newer [...]; I suspect that [...]
EDIT: Just to confirm, you are aware that you have:
and
so you are guaranteed different results with your two calls to [...].
-
Edit:
After some investigation I've identified the problem.
When sampling, the top_k value is not being evaluated before being passed into the function:
https://github.com/abetlen/llama-cpp-python/blob/1a13d76c487df1c8560132d10bda62d6e2f4fa93/llama_cpp/llama.py#LL367C1-L367C1
The value is passed as-is and is not changed to n_vocab if top_k=0.
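For reference, a minimal sketch of the kind of guard that would do that mapping (a hypothetical helper, not the actual llama.py code; n_vocab simply stands for the model's vocabulary size):

```python
# Hypothetical sketch of the missing guard (not the actual llama.py code):
# treat a non-positive top_k as "top-k disabled" by widening it to the full vocabulary.
def effective_top_k(top_k: int, n_vocab: int) -> int:
    return n_vocab if top_k <= 0 else top_k

# e.g. effective_top_k(0, 32000) == 32000, effective_top_k(40, 32000) == 40
```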
Why is that a problem?
In the source code of llama.cpp we can see that when k=0 and min_keep=1, the sampler will always default to keeping at most a single candidate, ensuring we only receive the candidate with the highest logit. This is not the expected behaviour, because a value of k=0 is meant to mark that top_k sampling is disabled, according to the llama.cpp source code.
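As a rough Python paraphrase of the clamping behaviour just described (illustrative only, not a literal translation of the C++ source):

```python
# Approximate paraphrase of the top-k clamping described above (illustrative only).
def top_k_filter(logits: list[float], k: int, min_keep: int = 1) -> list[float]:
    k = max(k, min_keep)      # with k=0 and min_keep=1, k becomes 1
    k = min(k, len(logits))   # never keep more candidates than exist
    return sorted(logits, reverse=True)[:k]

# With k=0 only the single highest logit survives:
print(top_k_filter([1.2, 3.4, 0.5, 2.2], k=0))  # -> [3.4]
```

With only one candidate left, the subsequent top_p and temperature steps cannot change the outcome, which matches the deterministic output described below.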
Hello.
I've noticed a strange occurrence when trying to generate output. For a given context, the bindings API will always return the same output. Additionally, it seems that the top_p and temp values are being completely ignored. This is not the case when running llama.cpp itself.
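For context, a hypothetical call of roughly this shape (placeholder model path, prompt and parameters; it is not the example script shown further below) illustrates what generating through the bindings API looks like:

```python
# Hypothetical minimal call (assumed model path, prompt and top_k value);
# not the example script referenced below.
from llama_cpp import Llama

llm = Llama(model_path="./models/ggml-model.bin")  # placeholder path

# If top_k=0 is passed through unmapped (see the edit above), the candidate list
# collapses to one token, so temperature and top_p cannot influence the output.
out = llm("Q: Name a planet in our solar system. A:",
          max_tokens=32, temperature=0.8, top_p=0.95, top_k=0)
print(out["choices"][0]["text"])
```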
I am using the latest version (v0.1.50) of llama-cpp-python. I've installed it with cuBLAS support over pip and have also tried compiling it myself; both builds produce the same results.
My example script:
Output example (always the same, regardless of top_p and temp):
Now, using llama.cpp I always get a different result:
Sorry if this is an incorrect place to post something like this; it's my first time posting.