Benchmark various LLM Structured Output frameworks (Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, LMFormatEnforcer, etc.) on tasks such as multi-label classification, named entity recognition, and synthetic data generation.
- Multi-label classification

| Framework | Model | Reliability | Latency p95 (s) |
|---|---|---|---|
| Fructose | gpt-3.5-turbo-0125 | 1.000 | 0.869 |
| Outlines | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 1.977 |
| Llamaindex | gpt-3.5-turbo-0125 | 0.997 | 0.838 |
| Mirascope | gpt-3.5-turbo-0125 | 0.995 | 1.280 |
| Instructor | gpt-3.5-turbo-0125 | 0.992 | 1.299 |
| Marvin | gpt-3.5-turbo-0125 | 0.971 | 1.151 |
| LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 0.950 | 7.248 |
- Multi-label classification:
    - Task: Given a text, predict the labels associated with it.
    - Data:
        - Base data: Alexa intent detection dataset
        - The benchmarking test is run using synthetic data generated by running `python -m data_sources.generate_dataset generate-multilabel-data`.
        - The synthetic data is generated by sampling and combining rows from the base data so that each row has multiple classes, according to a distribution over the number of classes per row. See `python -m data_sources.generate_dataset generate-multilabel-data --help` for more details.
    - Prompt: `"Classify the following text: {text}"`
    - Evaluation Metrics:
        - Reliability: The percentage of times the framework returns valid labels without errors.
    - Experiment Details: Run each row through the framework `n_runs` number of times and log the percent of successful runs for each row. Reliability is the average of all the rows' `percent_successful` values (see the sketch after this list).
- Named Entity Recognition (🚧 Coming soon)
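To make the reliability metric concrete, here is a minimal pandas sketch of the aggregation described above. The example rows and values are made up; only the idea of a per-row `percent_successful` over `n_runs` attempts and the row-wise averaging come from the methodology above, and the repository's own implementation may differ.

```python
import pandas as pd

# Hypothetical per-row results: for each input row, percent_successful is the
# fraction of its n_runs attempts that returned valid labels without errors.
per_row = pd.DataFrame(
    {
        "text": ["wake me up at 7am", "play some jazz and order a pizza"],
        "percent_successful": [1.0, 0.9],  # e.g. 9 of 10 runs succeeded
    }
)

# Reliability is the average of the per-row success rates across the dataset.
reliability = per_row["percent_successful"].mean()
print(f"Reliability: {reliability:.3f}")  # -> Reliability: 0.950
```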
- Install the requirements using `pip install -r requirements.txt`
- Set the OpenAI API key: `export OPENAI_API_KEY=sk-...`
- Run the benchmark using `python -m main run-benchmark`
- Raw results are stored in the `results` directory.
- Generate the results using `python -m main generate-results`
- To get help on the command line arguments, add `--help` after the command. Eg., `python -m main run-benchmark --help`
- Create a new pandas dataframe pickle file with the following columns:
    - `text`: The text to be sent to the framework
    - `labels`: List of labels associated with the text
    - See `data/multilabel_classification.pkl` for an example. A sketch of building such a file is shown after this list.
- Add the path to the new pickle file in the `./config.yaml` file under the `source_data_pickle_path` key for all the frameworks you want to test.
- Run the benchmark using `python -m main run-benchmark` to test the new data on all the frameworks!
- Generate the results using `python -m main generate-results`
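As a minimal sketch of the first step above, the snippet below builds a compatible pickle file with pandas. The example texts, labels, and the output path `data/my_custom_data.pkl` are made up for illustration; only the `text` and `labels` column names come from the instructions above.

```python
import pandas as pd

# Each row pairs an input text with the list of labels it should receive.
df = pd.DataFrame(
    {
        "text": [
            "wake me up at 7am and play the news",
            "order a large pepperoni pizza",
        ],
        "labels": [
            ["alarm_set", "news_query"],
            ["takeaway_order"],
        ],
    }
)

# Save the pickle, then point source_data_pickle_path in ./config.yaml at it.
df.to_pickle("data/my_custom_data.pkl")
```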
The easiest way to create a new framework is to reference the `./frameworks/instructor_framework.py` file. Detailed steps are as follows:
1. Create a `.py` file in the `frameworks` directory with the name of the framework. Eg., `instructor_framework.py` for the instructor framework.
2. In this `.py` file create a class that inherits `BaseFramework` from `frameworks.base`.
3. The class should define an `init` method that initializes the base class. Here are the arguments the base class expects:
    - `name` (str): Name of the task that the framework is being tested on. Obtained from the `./config.yaml` file. Eg., `"multilabel_classification"`
    - `prompt` (str): Prompt template used. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `llm_model` (str): LLM model to be used. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `llm_model_family` (str): LLM model family to be used. Currently supported values are `"openai"` and `"transformers"`. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `retries` (int): Number of retries for the framework. Default is `0`. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `source_data_pickle_path` (str): Path to the source data pickle file. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `sample_rows` (int): Number of rows to sample from the source data. Useful for testing on a smaller subset of data. Default is `0`, which uses all rows in `source_data_pickle_path` for the benchmarking. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `response_model` (Any): The response model to be used. Internally passed by the benchmarking script.
4. The class should define a `run` method that takes three arguments:
    - `inputs`: A dictionary of `{"text": str}` where `str` is the text to be sent to the framework
    - `n_runs`: Number of times to repeat each text
    - `expected_response`: Output expected from the framework
5. This `run` method should create another `run_experiment` function that takes `inputs` as argument, runs that input through the framework, and returns the output.
6. The `run_experiment` function should be annotated with the `@experiment` decorator from `frameworks.base` with `n_runs` and `expected_response` as arguments.
7. The `run` method should call the `run_experiment` function and return the four outputs `predictions`, `percent_successful`, `accuracy`, and `latencies`.
8. Import this new class in `frameworks/__init__.py`.
9. Add a new entry in the `./config.yaml` file with the name of the class as the key. The yaml entry can have the following fields:
    - `name`: Name of the task that the framework is being tested on. Eg., `"multilabel_classification"`
    - `n_runs`: Number of times to repeat each text
    - `init_kwargs`: All the arguments that need to be passed to the `init` method of the class, including those mentioned in step 3 above.
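Putting the steps together, here is a minimal sketch of what such a framework class could look like, loosely modeled on an Instructor-style integration. It is illustrative only: the class and file names are made up, the attribute names (`self.prompt`, `self.llm_model`, `self.retries`, `self.response_model`) are assumed to be set by `BaseFramework` from the arguments listed in step 3, and the `instructor`/OpenAI client calls are an assumption rather than a copy of the repository's `./frameworks/instructor_framework.py`.

```python
# frameworks/my_instructor_framework.py (hypothetical file name)
from typing import Any

import instructor
from openai import OpenAI

from frameworks.base import BaseFramework, experiment


class MyInstructorFramework(BaseFramework):
    def __init__(self, **kwargs) -> None:
        # BaseFramework is assumed to store name, prompt, llm_model, retries,
        # source_data_pickle_path, sample_rows, response_model, etc. as attributes.
        super().__init__(**kwargs)
        # Patch the OpenAI client so responses are validated against the Pydantic
        # response model (standard instructor usage; an assumption here).
        self.client = instructor.from_openai(OpenAI())

    def run(self, inputs: dict, n_runs: int, expected_response: Any = None):
        # Decorate the single-input experiment so it is repeated n_runs times
        # and scored against expected_response (step 6 above).
        @experiment(n_runs=n_runs, expected_response=expected_response)
        def run_experiment(inputs):
            # Fill the prompt template (e.g. "Classify the following text: {text}")
            # and request a response conforming to self.response_model.
            return self.client.chat.completions.create(
                model=self.llm_model,
                response_model=self.response_model,
                max_retries=self.retries,
                messages=[{"role": "user", "content": self.prompt.format(**inputs)}],
            )

        # Per step 7, the decorated function yields these four outputs.
        predictions, percent_successful, accuracy, latencies = run_experiment(inputs)
        return predictions, percent_successful, accuracy, latencies
```

The matching `./config.yaml` entry (step 9) would then use `MyInstructorFramework` as its key, with `name`, `n_runs`, and the `init_kwargs` listed in step 3.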
- Framework related tasks:

| Framework | Multi-label classification | Named Entity Recognition | Synthetic Data Generation |
|---|---|---|---|
| Instructor | ✅ OpenAI | 💭 Planning | 💭 Planning |
| Mirascope | ✅ OpenAI | 💭 Planning | 💭 Planning |
| Fructose | ✅ OpenAI | 💭 Planning | 💭 Planning |
| Marvin | ✅ OpenAI | 💭 Planning | 💭 Planning |
| Llamaindex | ✅ OpenAI | 💭 Planning | 💭 Planning |
| Outlines | ✅ HF Transformers | 💭 Planning | 💭 Planning |
| LM format enforcer | ✅ HF Transformers | 💭 Planning | 💭 Planning |
| Jsonformer | ❌ No Enum Support | 💭 Planning | 💭 Planning |
| Strictjson | ❌ Non-standard schema | 💭 Planning | 💭 Planning |
| Guidance | 🚧 In Progress | 💭 Planning | 💭 Planning |
| DsPy | 🚧 In Progress | 💭 Planning | 💭 Planning |
| Langchain | 🚧 In Progress | 💭 Planning | 💭 Planning |

- Others:
    - Latency metrics
    - CICD pipeline for benchmark run automation
    - Async run
Contributions are welcome! Here are the steps to contribute:
- Please open an issue with any new framework you would like to add. This will help avoid duplication of effort.
- Once the issue is assigned to you, please submit a PR with the new framework!
To cite LLM Structured Output Benchmarks in your work, please use the following bibtex reference:
```bibtex
@software{marie_stephen_leo_2024_12327267,
  author    = {Marie Stephen Leo},
  title     = {{stephenleo/llm-structured-output-benchmarks: Release for Zenodo}},
  month     = jun,
  year      = 2024,
  publisher = {Zenodo},
  version   = {v0.0.1},
  doi       = {10.5281/zenodo.12327267},
  url       = {https://doi.org/10.5281/zenodo.12327267}
}
```
If this work helped you in any way, please consider ⭐ this repository to give me feedback so I can spend more time on this project.