The wikitext2 perplexity calculation method is based on this Huggingface article:

It is calculated with a window size of `max_seq_length = 4096` tokens. At each step, the window shifts by `stride = 512` tokens, and its first `max_seq_length - stride` tokens are treated as context tokens. This means that their logits are not taken into account, allowing this rolling perplexity to be calculated without overlap.
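For reference, here is a minimal sketch of that rolling-window loop, in the spirit of the Huggingface perplexity article. The checkpoint name, dataset loading and dtype are illustrative assumptions; the actual `run_wikitext-2_benchmark.py` script may differ:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint, for illustration only
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
model.eval()

# Tokenize the whole wikitext-2 test split as one long sequence.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_seq_length = 4096  # window size
stride = 512           # how far the window shifts at each step

nlls = []
prev_end_loc = 0
seq_len = encodings.input_ids.size(1)
for begin_loc in range(0, seq_len, stride):
    end_loc = min(begin_loc + max_seq_length, seq_len)
    trg_len = end_loc - prev_end_loc  # only these trailing tokens are scored
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    # The first max_seq_length - stride tokens are context only: setting their
    # labels to -100 excludes their logits from the loss, so no token is
    # scored twice across windows.
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        # outputs.loss is the mean negative log-likelihood over the scored tokens.
        outputs = model(input_ids, labels=target_ids)
        nlls.append(outputs.loss * trg_len)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / prev_end_loc)
print(f"wikitext-2 perplexity: {ppl.item():.2f}")
```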
I benchmarked llama2-7B with this config:

By running `python3 run_wikitext-2_benchmark.py -config` with this config and fp16 precision, I got a perplexity of 5.02. With fp16 precision I got a perplexity of 5.15.

It is close to the score reported here: ggml-org/llama.cpp#2352