
Demo polyfactory framework #1

Open
adrianeboyd opened this issue Jul 12, 2024 · 3 comments

@adrianeboyd (Contributor) commented:

Hi, it's nice to come across a cross-library/model benchmark like this!

When looking at evaluations for structured-output libraries, I feel like "valid response" is a very low bar when used as the only metric, and adding accuracy-related metrics would make these benchmarks more informative.

I fully acknowledge that this is a bit on the ornery side, but since it only took a few lines of code (it was very easy to do in this repo!), I wanted to submit a demo PR for a new framework that uses polyfactory to generate valid responses from the response model. It manages 100% reliability with a latency of 0.000, maybe 0.001 on a bad day.
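
For context, here is a minimal sketch of what such a polyfactory-based "framework" boils down to. The response model and its field are made up for illustration and are not taken from this repo:

```python
from pydantic import BaseModel
from polyfactory.factories.pydantic_factory import ModelFactory


# Hypothetical response model standing in for whatever the benchmark expects.
class ClassificationResponse(BaseModel):
    labels: list[str]


class ClassificationResponseFactory(ModelFactory[ClassificationResponse]):
    __model__ = ClassificationResponse


# Always produces a schema-valid (if meaningless) instance, essentially instantly.
fake_response = ClassificationResponseFactory.build()
```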

I'd potentially be interested in contributing to work on additional metrics/tasks in the future, in particular named entity recognition!

adrianeboyd changed the title from "polyfactory framework" to "Demo polyfactory framework" on Jul 12, 2024
@stephenleo (Owner) commented:

Hey, I fully agree on adding an accuracy metric. The code already supports it, but I haven't published the results because I'm currently using a synthetic dataset whose real accuracy is ambiguous. In the long run I'd love to report metrics on standard datasets, but I've had difficulty finding a multilabel classification dataset with many possible classes. I'll continue to look! Do submit a PR if you can find one.

Love your PR, thanks for your submission. I'll run the benchmarks today and update the README in your branch before merging!

@adrianeboyd (Contributor, Author) commented:

Sorry, I did come across the hidden accuracy numbers after posting. (I think you might want to refactor the scoring so that you evaluate the whole dataset at the end rather than averaging a per-item metric, since many of the standard metrics (micro-F1 for NER, etc.) can't be computed as a per-item average.)
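
As a quick illustrative sketch (the data and label names are made up, and this uses scikit-learn rather than anything in the repo), pooling the predictions and scoring once at the end gives a different number from averaging per-item scores:

```python
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up gold labels and predictions for three items.
gold = [{"sports"}, {"politics", "economy"}, {"economy"}]
pred = [{"sports"}, {"politics"}, {"economy", "sports"}]

# Binarize against the union of labels seen in gold and predictions.
mlb = MultiLabelBinarizer().fit([g | p for g, p in zip(gold, pred)])
y_true = mlb.transform(gold)
y_pred = mlb.transform(pred)

# Micro-F1 pools every TP/FP/FN across the whole dataset before computing F1...
print(f1_score(y_true, y_pred, average="micro"))    # 0.75

# ...which is not the same as averaging a per-item F1.
print(f1_score(y_true, y_pred, average="samples"))  # ~0.78
```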

@stephenleo (Owner) commented:

Yes, I'm logging the predictions for each iteration, so I can calculate the whole-dataset metric at the end. I'll push the metric calculation code soon, but I won't publish the metrics until I find a good multilabel classification dataset.
