Benchmark various LLM Structured Output frameworks (Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, LMFormatEnforcer, etc.) on tasks such as multi-label classification, named entity recognition, and synthetic data generation.
- Multi-label classification

| Framework | Model | Reliability | Latency p95 (s) |
|---|---|---|---|
| Fructose | gpt-3.5-turbo-0125 | 1.000 | 0.869 |
| Outlines | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 1.977 |
| Llamaindex | gpt-3.5-turbo-0125 | 0.997 | 0.838 |
| Mirascope | gpt-3.5-turbo-0125 | 0.995 | 1.280 |
| Instructor | gpt-3.5-turbo-0125 | 0.992 | 1.299 |
| Marvin | gpt-3.5-turbo-0125 | 0.971 | 1.151 |
| LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 0.950 | 7.248 |
- Multi-label classification:
    - Task: Given a text, predict the labels associated with it.
    - Data:
        - Base data: Alexa intent detection dataset
        - The benchmarking test is run using synthetic data generated by running `python -m data_sources.generate_dataset generate-multilabel-data`.
        - The synthetic data is generated by sampling and combining rows from the base data so that each row has multiple classes, according to a distribution over the number of classes per row. See `python -m data_sources.generate_dataset generate-multilabel-data --help` for more details.
    - Prompt: `"Classify the following text: {text}"`
    - Evaluation Metrics:
        - Reliability: The percentage of times the framework returns valid labels without errors.
    - Experiment Details: Run each row through the framework `n_runs` number of times and log the percent of successful runs for each row. Reliability is the average of all the rows' `percent_successful` values (see the sketch after this list).
- Named Entity Recognition (🚧 Coming soon)
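To make the reliability metric concrete, here is a minimal pandas sketch of the aggregation described above. The example rows and values are made up; only the idea of a per-row `percent_successful` over `n_runs` attempts and the row-wise averaging come from the methodology above, and the repository's own implementation may differ.

```python
import pandas as pd

# Hypothetical per-row results: for each input row, percent_successful is the
# fraction of its n_runs attempts that returned valid labels without errors.
per_row = pd.DataFrame(
    {
        "text": ["wake me up at 7am", "play some jazz and order a pizza"],
        "percent_successful": [1.0, 0.9],  # e.g. 9 of 10 runs succeeded
    }
)

# Reliability is the average of the per-row success rates across the dataset.
reliability = per_row["percent_successful"].mean()
print(f"Reliability: {reliability:.3f}")  # -> Reliability: 0.950
```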
- Install the requirements using `pip install -r requirements.txt`
- Set the OpenAI API key: `export OPENAI_API_KEY=sk-...`
- Run the benchmark using `python -m main run-benchmark`
- Raw results are stored in the `results` directory.
- Generate the results using `python -m main generate-results`
- To get help on the command line arguments, add `--help` after the command. Eg., `python -m main run-benchmark --help`
- Create a new pandas dataframe pickle file with the following columns:
    - `text`: The text to be sent to the framework
    - `labels`: List of labels associated with the text
    - See `data/multilabel_classification.pkl` for an example. A sketch of building such a file is shown after this list.
- Add the path to the new pickle file in the `./config.yaml` file under the `source_data_pickle_path` key for all the frameworks you want to test.
- Run the benchmark using `python -m main run-benchmark` to test the new data on all the frameworks!
- Generate the results using `python -m main generate-results`
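As a minimal sketch of the first step above, the snippet below builds a compatible pickle file with pandas. The example texts, labels, and the output path `data/my_custom_data.pkl` are made up for illustration; only the `text` and `labels` column names come from the instructions above.

```python
import pandas as pd

# Each row pairs an input text with the list of labels it should receive.
df = pd.DataFrame(
    {
        "text": [
            "wake me up at 7am and play the news",
            "order a large pepperoni pizza",
        ],
        "labels": [
            ["alarm_set", "news_query"],
            ["takeaway_order"],
        ],
    }
)

# Save the pickle, then point source_data_pickle_path in ./config.yaml at it.
df.to_pickle("data/my_custom_data.pkl")
```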
The easiest way to create a new framework is to reference the `./frameworks/instructor_framework.py` file. Detailed steps are as follows:
1. Create a `.py` file in the `frameworks` directory with the name of the framework. Eg., `instructor_framework.py` for the instructor framework.
2. In this `.py` file create a class that inherits `BaseFramework` from `frameworks.base`.
3. The class should define an `init` method that initializes the base class. Here are the arguments the base class expects:
    - `name` (str): Name of the task that the framework is being tested on. Obtained from the `./config.yaml` file. Eg., `"multilabel_classification"`
    - `prompt` (str): Prompt template used. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `llm_model` (str): LLM model to be used. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `llm_model_family` (str): LLM model family to be used. Currently supported values are `"openai"` and `"transformers"`. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `retries` (int): Number of retries for the framework. Default is `0`. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `source_data_pickle_path` (str): Path to the source data pickle file. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `sample_rows` (int): Number of rows to sample from the source data. Useful for testing on a smaller subset of data. Default is `0`, which uses all rows in `source_data_pickle_path` for the benchmarking. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `response_model` (Any): The response model to be used. Internally passed by the benchmarking script.
4. The class should define a `run` method that takes three arguments:
    - `inputs`: A dictionary of `{"text": str}` where `str` is the text to be sent to the framework
    - `n_runs`: Number of times to repeat each text
    - `expected_response`: Output expected from the framework
5. This `run` method should create another `run_experiment` function that takes `inputs` as argument, runs that input through the framework, and returns the output.
6. The `run_experiment` function should be annotated with the `@experiment` decorator from `frameworks.base` with `n_runs` and `expected_response` as arguments.
7. The `run` method should call the `run_experiment` function and return the four outputs `predictions`, `percent_successful`, `accuracy`, and `latencies`.
8. Import this new class in `frameworks/__init__.py`.
9. Add a new entry in the `./config.yaml` file with the name of the class as the key. The yaml entry can have the following fields:
    - `name`: Name of the task that the framework is being tested on. Eg., `"multilabel_classification"`
    - `n_runs`: Number of times to repeat each text
    - `init_kwargs`: All the arguments that need to be passed to the `init` method of the class, including those mentioned in step 3 above.
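Putting the steps together, here is a minimal sketch of what such a framework class could look like, loosely modeled on an Instructor-style integration. It is illustrative only: the class and file names are made up, the attribute names (`self.prompt`, `self.llm_model`, `self.retries`, `self.response_model`) are assumed to be set by `BaseFramework` from the arguments listed in step 3, and the `instructor`/OpenAI client calls are an assumption rather than a copy of the repository's `./frameworks/instructor_framework.py`.

```python
# frameworks/my_instructor_framework.py (hypothetical file name)
from typing import Any

import instructor
from openai import OpenAI

from frameworks.base import BaseFramework, experiment


class MyInstructorFramework(BaseFramework):
    def __init__(self, **kwargs) -> None:
        # BaseFramework is assumed to store name, prompt, llm_model, retries,
        # source_data_pickle_path, sample_rows, response_model, etc. as attributes.
        super().__init__(**kwargs)
        # Patch the OpenAI client so responses are validated against the Pydantic
        # response model (standard instructor usage; an assumption here).
        self.client = instructor.from_openai(OpenAI())

    def run(self, inputs: dict, n_runs: int, expected_response: Any = None):
        # Decorate the single-input experiment so it is repeated n_runs times
        # and scored against expected_response (step 6 above).
        @experiment(n_runs=n_runs, expected_response=expected_response)
        def run_experiment(inputs):
            # Fill the prompt template (e.g. "Classify the following text: {text}")
            # and request a response conforming to self.response_model.
            return self.client.chat.completions.create(
                model=self.llm_model,
                response_model=self.response_model,
                max_retries=self.retries,
                messages=[{"role": "user", "content": self.prompt.format(**inputs)}],
            )

        # Per step 7, the decorated function yields these four outputs.
        predictions, percent_successful, accuracy, latencies = run_experiment(inputs)
        return predictions, percent_successful, accuracy, latencies
```

The matching `./config.yaml` entry (step 9) would then use `MyInstructorFramework` as its key, with `name`, `n_runs`, and the `init_kwargs` listed in step 3.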
- Framework related tasks:

| Framework | Multi-label classification | Named Entity Recognition | Synthetic Data Generation |
|---|---|---|---|
| Instructor | ✅ OpenAI | 💭 Planning | 💭 Planning |
| Mirascope | ✅ OpenAI | 💭 Planning | 💭 Planning |
| Fructose | ✅ OpenAI | 💭 Planning | 💭 Planning |
| Marvin | ✅ OpenAI | 💭 Planning | 💭 Planning |
| Llamaindex | ✅ OpenAI | 💭 Planning | 💭 Planning |
| Outlines | ✅ HF Transformers | 💭 Planning | 💭 Planning |
| LM format enforcer | ✅ HF Transformers | 💭 Planning | 💭 Planning |
| Jsonformer | ❌ No Enum Support | 💭 Planning | 💭 Planning |
| Strictjson | ❌ Non-standard schema | 💭 Planning | 💭 Planning |
| Guidance | 🚧 In Progress | 💭 Planning | 💭 Planning |
| DsPy | 🚧 In Progress | 💭 Planning | 💭 Planning |
| Langchain | 🚧 In Progress | 💭 Planning | 💭 Planning |

- Others:
    - Latency metrics
    - CICD pipeline for benchmark run automation
    - Async run
Contributions are welcome! Here are the steps to contribute:
- Please open an issue with any new framework you would like to add. This will help avoid duplication of effort.
- Once the issue is assigned to you, please submit a PR with the new framework!
To cite LLM Structured Output Benchmarks in your work, please use the following bibtex reference:
```bibtex
@software{marie_stephen_leo_2024_12327267,
  author    = {Marie Stephen Leo},
  title     = {{stephenleo/llm-structured-output-benchmarks: Release for Zenodo}},
  month     = jun,
  year      = 2024,
  publisher = {Zenodo},
  version   = {v0.0.1},
  doi       = {10.5281/zenodo.12327267},
  url       = {https://doi.org/10.5281/zenodo.12327267}
}
```
If this work helped you in any way, please consider ⭐ this repository to give me feedback so I can spend more time on this project.