Integrate an evaluation harness #12

Vectorrent · 2024-10-13T07:05:53Z

We will need to test our models against common, industry-standard benchmarks. EleutherAI's lm-evaluation-harness is what most people use today:
https://github.com/EleutherAI/lm-evaluation-harness

The process will involve:

1. Updating test.py to load the model with the Transformers API
2. Overwriting its weights with your model's latest checkpoint
3. Then, executing the harness's benchmark tasks from the same script
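
As a rough illustration of the steps above, here is a minimal sketch that calls the harness from a script. It assumes the latest checkpoint has been saved in a Transformers-compatible directory (for example with `save_pretrained`) and that the custom architecture needs `trust_remote_code`. The function names `build_model_args` and `run_benchmarks` are hypothetical and not part of this repository.

```python
from typing import Any, Dict, List


def build_model_args(checkpoint_dir: str, trust_remote_code: bool = True) -> str:
    """Build the comma-separated model_args string for the harness's "hf" backend."""
    return f"pretrained={checkpoint_dir},trust_remote_code={trust_remote_code}"


def run_benchmarks(checkpoint_dir: str, tasks: List[str], limit: int = 20) -> Dict[str, Any]:
    """Run a few harness tasks against a saved checkpoint (assumed layout)."""
    # Imported lazily so build_model_args works without the harness installed.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",                                   # Hugging Face Transformers backend
        model_args=build_model_args(checkpoint_dir),  # where to load the weights from
        tasks=tasks,                                  # e.g. ["arc_easy"]
        limit=limit,                                  # small sample count for a smoke test
    )
    return results["results"]
```

Something like `run_benchmarks("path/to/checkpoint", ["arc_easy"])` would then return the per-task metrics, where the path is a placeholder. Keeping `limit` small makes it cheap to check that the pipeline runs at all before starting a full evaluation.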

Vectorrent · 2024-10-23T07:38:22Z

I added an eval.py script, which covers most of this work, but it doesn't work correctly yet. Eval suites tend to fail almost immediately with tokenization errors. I'm not sure whether that's because of a poorly-trained tokenizer, an under-trained model, or the custom architecture, and I'm not sure how to fix it right now.
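
One possible way to narrow this down is to check the tokenizer separately from the harness. The sketch below is only a guess at common failure modes (missing special tokens, empty encodings, token IDs outside the embedding table), not a confirmed diagnosis. `diagnose_tokenizer` and `diagnose_checkpoint` are hypothetical helpers, and the checkpoint is assumed to load through the Transformers `Auto*` classes.

```python
from typing import Any, List


def diagnose_tokenizer(tokenizer: Any, embedding_rows: int, samples: List[str]) -> List[str]:
    """Return human-readable problems found when encoding a few sample strings."""
    problems = []
    # Batched evaluation usually needs a pad token; generation needs an EOS token.
    if getattr(tokenizer, "pad_token_id", None) is None:
        problems.append("tokenizer has no pad token")
    if getattr(tokenizer, "eos_token_id", None) is None:
        problems.append("tokenizer has no eos token")
    for text in samples:
        ids = tokenizer.encode(text)
        if not ids:
            problems.append(f"empty encoding for {text!r}")
        # IDs outside the embedding table cause index errors inside the model.
        out_of_range = [i for i in ids if i < 0 or i >= embedding_rows]
        if out_of_range:
            problems.append(f"ids out of range for {text!r}: {out_of_range}")
    return problems


def diagnose_checkpoint(checkpoint_dir: str, samples: List[str]) -> List[str]:
    """Load a checkpoint (assumed Transformers-compatible) and run the checks."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(checkpoint_dir, trust_remote_code=True)
    rows = model.get_input_embeddings().num_embeddings
    return diagnose_tokenizer(tokenizer, rows, samples)
```

If this returns an empty list for typical benchmark prompts, the tokenizer is less likely to be the cause, and the custom architecture's integration with the harness would be the next thing to look at.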

Vectorrent · 2024-11-22T08:24:44Z

I was not aware of the evaluate library. It looks pretty nice, and since we already use the Hugging Face Transformers API, it would probably be easy to set up. We should try this one.
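
For reference, a minimal sketch of what scoring with the evaluate library could look like. It assumes we already have model outputs and reference answers as lists of strings; `exact_match_score` is a hypothetical helper, and `exact_match` is just one of the library's metrics, picked here for illustration.

```python
from typing import Any, Dict, List, Optional


def exact_match_score(
    predictions: List[str],
    references: List[str],
    metric: Optional[Any] = None,
) -> Dict[str, float]:
    """Score predictions against references with the exact_match metric."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if metric is None:
        # Imported lazily; evaluate.load fetches the metric by name.
        import evaluate

        metric = evaluate.load("exact_match")
    return metric.compute(predictions=predictions, references=references)
```

The optional `metric` argument is only there so another metric object (or a stub in tests) can be passed in without changing the function.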

Vectorrent added the enhancement, good first issue, and help wanted labels Oct 13, 2024

Vectorrent changed the title ~~Integrate the Pythia test harness~~ Integrate an evaluation harness Nov 22, 2024
