John Snow Labs LangTest 2.1.0: Elevate Your Language Model Testing with Enhanced API Integration, Expanded File Support, Improved Benchmarking Workflows, and an Enhanced User Experience with Various Bug Fixes and Enhancements
📢 Highlights
John Snow Labs is thrilled to announce the release of LangTest 2.1.0! This update brings exciting new features and improvements designed to streamline your language model testing workflows and provide deeper insights.
-
🔗 Enhanced API-based LLM Integration: LangTest now supports testing API-based Large Language Models (LLMs). This allows you to seamlessly integrate diverse LLM models with LangTest and conduct performance evaluations across various datasets.
-
📂 Expanded File Format Support: LangTest 2.1.0 introduces support for additional file formats, further increasing its flexibility in handling different data structures used in LLM testing.
-
📊 Improved Multi-Dataset Handling: We've made significant improvements in how LangTest manages multiple datasets. This simplifies workflows and allows for more efficient testing across a wider range of data sources.
-
🖥️ New Benchmarking Commands: LangTest now boasts a set of new commands specifically designed for benchmarking language models. These commands provide a structured approach to evaluating model performance and comparing results across different models and datasets.
-
💡Data Augmentation for Question Answering: LangTest introduces improved data augmentation techniques specifically for question-answering. This leads to an evaluation of your language models' ability to handle variations and potential biases in language, ultimately resulting in more robust and generalizable models.
🔥 Key Enhancements:
Streamlined Integration and Enhanced Functionality for API-Based Large Language Models:
This feature empowers you to seamlessly integrate virtually any language model hosted on an external API platform. Whether you prefer OpenAI, Hugging Face, or even custom vLLM solutions, LangTest now adapts to your workflow. input_processor
and output_parser
functions are not required for openai api compatible server.
Key Features:
-
Effortless API Integration: Connect to any API system by specifying the API URL, parameters, and a custom function for parsing the returned results. This intuitive approach allows you to leverage your preferred language models with minimal configuration.
-
Customizable Parameters: Define the URL, parameters specific to your chosen API, and a parsing function tailored to extract the desired output. This level of control ensures compatibility with diverse API structures.
-
Unparalleled Flexibility: Generic API Support removes platform limitations. Now, you can seamlessly integrate language models from various sources, including OpenAI, Hugging Face, and even custom vLLM solutions hosted on private platforms.
How it Works:
Parameters:
Define the input_processer
function for creating a payload and the output_parser
function is used to extract the output from the response.
GOOGLE_API_KEY = "<YOUR API KEY>"
model_url = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:generateContent?key={GOOGLE_API_KEY}"
# headers
headers = {
"Content-Type": "application/json",
}
# function to create a payload
def input_processor(content):
return {"contents": [
{
"role": "user",
"parts": [
{
"text": content
}
]
}
]}
# function to extract output from model response
def output_parser(response):
try:
return response['candidates'][0]['content']['parts'][0]['text']
except:
return ""
To take advantage of this feature, users can utilize the following setup code:
from langtest import Harness
# Initialize Harness with API parameters
harness = Harness(
task="question-answering",
model={
"model": {
"url": url,
"headers": headers,
"input_processor": input_processor,
"output_parser": output_parser,
},
"hub": "web",
},
data={
"data_source": "OpenBookQA",
"split": "test-tiny",
}
)
# Generate, Run and get Report
harness.generate().run().report()
Streamlined Data Handling and Evaluation
This feature streamlines your testing workflows by enabling LangTest to process a wider range of file formats directly.
Key Features:
-
Effortless File Format Handling: LangTest now seamlessly ingests data from various file formats, including pickles (.pkl) in addition to previously supported formats. Simply provide the data source path in your harness configuration, and LangTest takes care of the rest.
-
Simplified Data Source Management: LangTest intelligently recognizes the file extension and automatically selects the appropriate processing method. This eliminates the need for manual configuration, saving you time and effort.
-
Enhanced Maintainability: The underlying code structure is optimized for flexibility. Adding support for new file formats in the future requires minimal effort, ensuring LangTest stays compatible with evolving data storage practices.
How it works:
from langtest import Harness
harness = Harness(
task="question-answering",
model={
"model": "http://localhost:1234/v1/chat/completions",
"hub": "lm-studio",
},
data={
"data_source": "path/to/file.pkl", #
},
)
# generate, run and report
harness.generate().run().report()
Multi-Dataset Handling and Evaluation
This feature empowers you to efficiently benchmark your language models across a wider range of datasets.
Key Features:
-
Effortless Multi-Dataset Testing: LangTest now seamlessly integrates and executes tests on multiple datasets within a single harness configuration. This streamlined approach eliminates the need for repetitive setups, saving you time and resources.
-
Enhanced Fairness Evaluation: By testing models across diverse datasets, LangTest helps identify and mitigate potential biases. This ensures your models perform fairly and accurately on a broader spectrum of data, promoting ethical and responsible AI development.
-
Robust Accuracy Assessment: Multi-dataset support empowers you to conduct more rigorous accuracy testing. By evaluating models on various datasets, you gain a deeper understanding of their strengths and weaknesses across different data distributions. This comprehensive analysis strengthens your confidence in the model's real-world performance.
How it works:
Initiate the Harness class
harness = Harness(
task="question-answering",
model={"model": "gpt-3.5-turbo-instruct", "hub": "openai"},
data=[
{"data_source": "BoolQ", "split": "test-tiny"},
{"data_source": "NQ-open", "split": "test-tiny"},
{"data_source": "MedQA", "split": "test-tiny"},
{"data_source": "LogiQA", "split": "test-tiny"},
],
)
Configure the accuracy tests in Harness class
harness.configure(
{
"tests": {
"defaults": {"min_pass_rate": 0.65},
"robustness": {
"uppercase": {"min_pass_rate": 0.66},
"dyslexia_word_swap": {"min_pass_rate": 0.60},
"add_abbreviation": {"min_pass_rate": 0.60},
"add_slangs": {"min_pass_rate": 0.60},
"add_speech_to_text_typo": {"min_pass_rate": 0.60},
},
}
}
)
harness.generate() generates testcases, .run() executes them, and .report() compiles results.
harness.generate().run().report()
Streamlined Evaluation Workflows with Enhanced CLI Commands
LangTest's evaluation capabilities, focusing on report management and leaderboards. These enhancements empower you to:
-
Streamlined Reporting and Tracking: Effortlessly save and load detailed evaluation reports directly from the command line using
langtest eval
, enabling efficient performance tracking and comparative analysis over time, with manual file review options in the~/.langtest
or./.langtest
folder. -
Enhanced Leaderboards: Gain valuable insights with the new langtest
show-leaderboard
command. This command displays existing leaderboards, providing a centralized view of ranked model performance across evaluations. -
Average Model Ranking: Leaderboard now includes the average ranking for each evaluated model. This metric provides a comprehensive understanding of model performance across various datasets and tests.
How it works:
First, create the parameter.json
or parameter.yaml
in the working directory
JSON Format
{
"task": "question-answering",
"model": {
"model": "google/flan-t5-base",
"hub": "huggingface"
},
"data": [
{
"data_source": "MedMCQA"
},
{
"data_source": "PubMedQA"
},
{
"data_source": "MMLU"
},
{
"data_source": "MedQA"
}
],
"config": {
"model_parameters": {
"max_tokens": 64,
"device": 0,
"task": "text2text-generation"
},
"tests": {
"defaults": {
"min_pass_rate": 0.70
},
"robustness": {
"add_typo": {
"min_pass_rate": 0.70
}
}
}
}
}
Yaml Format
task: question-answering
model:
model: google/flan-t5-base
hub: huggingface
data:
- data_source: MedMCQA
- data_source: PubMedQA
- data_source: MMLU
- data_source: MedQA
config:
model_parameters:
max_tokens: 64
device: 0
task: text2text-generation
tests:
defaults:
min_pass_rate: 0.70
robustness:
add_typo:
min_pass_rate: 0.7
And open the terminal or cmd in your system
langtest eval --model <your model name or endpoint> \
--hub <model hub like hugging face, lm-studio, web ...> \
-c < your configuration file like parameter.json or parameter.yaml>
Finally, we can know the leaderboard and rank of the model.
To visualize the leaderboard anytime using the CLI command
langtest show-leaderboard
📒 New Notebooks
Notebooks | Colab Link |
---|---|
Generic API-based Model Testing | |
Multi-Dataset | |
Langtest Eval Cli Command |
🐛 Fixes
⚡ Enhancements
- Improved the error handling in Harness run method [#990]
- Websites Updates [#1001]
- Updated new version for dependencies [#992]
- Improved the data augmentation for Question-Answering task [#991]
What's Changed
- Feautre/integration with web api by @chakravarthik27 in #986
- Refactor TestFactory class to handle exceptions in async tests by @chakravarthik27 in #990
- data augmentation support for question-answering task by @chakravarthik27 in #991
- Updated dependencies by @chakravarthik27 in #992
- Fix/implement the multiple dataset support for accuracy tests by @chakravarthik27 in #998
- Feature/add support for other file formats by @chakravarthik27 in #993
- Bug Fix: Generated results are none by @chakravarthik27 in #1000
- Feature/implement load & save for benchmark reports by @chakravarthik27 in #999
- Fix/bug fixes langtest 2 1 0 rc1 by @chakravarthik27 in #1003
- website updates by @chakravarthik27 in #1001
- Fix/bug fixes langtest 2 1 0 rc1 by @chakravarthik27 in #1004
- Release/2.0.1 by @chakravarthik27 in #1005
Full Changelog: 2.0.0...2.1.0