
Demo polyfactory framework #1

Open
adrianeboyd opened this issue Jul 12, 2024 · 3 comments

@adrianeboyd (Contributor) commented:

Hi, it's nice to come across a cross-library/model benchmark like this!

When looking at evaluations for structured-output libraries, I feel like "valid response" is a very low bar when used as the only metric, and adding accuracy-related metrics would make these benchmarks more informative.

I fully acknowledge that this is a bit on the ornery side, but since it only took a few lines of code (it was very easy to do in this repo!), I wanted to submit a demo PR for a new framework that uses polyfactory to generate valid responses from the response model. It manages 100% reliability with a latency of 0.000, maybe 0.001 on a bad day.
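
For context, here is a minimal sketch of what such a polyfactory-based "framework" boils down to. The response model and its field are made up for illustration and are not taken from this repo:

```python
from pydantic import BaseModel
from polyfactory.factories.pydantic_factory import ModelFactory


# Hypothetical response model standing in for whatever the benchmark expects.
class ClassificationResponse(BaseModel):
    labels: list[str]


class ClassificationResponseFactory(ModelFactory[ClassificationResponse]):
    __model__ = ClassificationResponse


# Always produces a schema-valid (if meaningless) instance, essentially instantly.
fake_response = ClassificationResponseFactory.build()
```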

I'd potentially be interested in contributing to work on additional metrics/tasks in the future, in particular named entity recognition!

adrianeboyd changed the title from "polyfactory framework" to "Demo polyfactory framework" on Jul 12, 2024
@stephenleo (Owner) commented:

Hey, I fully agree on adding an accuracy metric. The code already supports it, but I haven't published the results because I'm currently using a synthetic dataset whose real accuracy is ambiguous. In the long run I'd love to report metrics on standard datasets, but I've had difficulty finding a multilabel classification dataset with many possible classes. I'll continue to look! Do submit a PR if you can find one.

Love your PR, thanks for your submission. I'll run the benchmarks today and update the README in your branch before merging!

@adrianeboyd (Contributor, Author) commented:

Sorry, I did come across the hidden accuracy numbers after posting. (I think you might want to refactor the scoring so that you evaluate the whole dataset at the end rather than averaging a per-item metric, since many of the standard metrics (micro-F1 for NER, etc.) can't be computed as a per-item average.)
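
As a quick illustrative sketch (the data and label names are made up, and this uses scikit-learn rather than anything in the repo), pooling the predictions and scoring once at the end gives a different number from averaging per-item scores:

```python
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up gold labels and predictions for three items.
gold = [{"sports"}, {"politics", "economy"}, {"economy"}]
pred = [{"sports"}, {"politics"}, {"economy", "sports"}]

# Binarize against the union of labels seen in gold and predictions.
mlb = MultiLabelBinarizer().fit([g | p for g, p in zip(gold, pred)])
y_true = mlb.transform(gold)
y_pred = mlb.transform(pred)

# Micro-F1 pools every TP/FP/FN across the whole dataset before computing F1...
print(f1_score(y_true, y_pred, average="micro"))    # 0.75

# ...which is not the same as averaging a per-item F1.
print(f1_score(y_true, y_pred, average="samples"))  # ~0.78
```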

@stephenleo (Owner) commented:

Yes, I'm logging the predictions for each iteration, so I can calculate the whole-dataset metric at the end. I'll push the metric calculation code soon, but I won't publish the metrics until I find a good multilabel classification dataset.
