Feature/enhance harness report to include detailed score counts and grouped results #1132

chakravarthik27 · 2024-10-26T15:06:44Z

This pull request introduces several changes to the langtest package, focusing on enhancing the evaluation framework and improving code structure. The key changes include the addition of the EvalTemplate class, modifications to the is_pass_llm_eval function, and updates to the model_report function.

Enhancements to Evaluation Framework:

Addition of EvalTemplate Class: Introduced the EvalTemplate class in langtest/metrics/llm_eval.py to build a prompt for evaluating student answers based on a given rubric. This class includes a method build_prompt that constructs a grading prompt. (langtest/metrics/llm_eval.py)
Updates to is_pass_llm_eval Function: Modified the is_pass_llm_eval function in langtest/utils/custom_types/helpers.py to accept an eval_template parameter. This allows for customizable evaluation templates, improving the flexibility of the evaluation process. (langtest/utils/custom_types/helpers.py) [1] [2]
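For readers unfamiliar with the pattern, here is a minimal sketch of how a rubric-driven prompt builder and an optional eval_template override could fit together. It is not the langtest implementation: `RubricTemplate`, `DEFAULT_RUBRIC`, `resolve_template`, and the `{query}`/`{result}`/`{answer}` placeholders are hypothetical names chosen for illustration.

```python
from typing import Mapping, Optional

# Hypothetical default rubric: grade label -> what the grade means.
DEFAULT_RUBRIC = {
    "CORRECT": "the student answer agrees with the true answer",
    "INCORRECT": "the student answer contradicts the true answer",
}


class RubricTemplate:
    """Illustrative stand-in for a template class (not langtest's EvalTemplate)."""

    @staticmethod
    def build_prompt(rubric: Mapping[str, str] = DEFAULT_RUBRIC) -> str:
        # One rubric line per grade, e.g. "CORRECT: the student answer ..."
        grades = "\n".join(f"{grade}: {meaning}" for grade, meaning in rubric.items())
        options = " or ".join(rubric)
        # {query}, {result} and {answer} stay as literal placeholders so the
        # caller can fill them per sample with str.format().
        return (
            "You are a teacher grading a quiz.\n"
            "Score the student answer using this rubric:\n"
            f"{grades}\n\n"
            "QUESTION: {query}\n"
            "STUDENT ANSWER: {result}\n"
            "TRUE ANSWER: {answer}\n"
            f"GRADE ({options}):"
        )


def resolve_template(eval_template: Optional[str] = None) -> str:
    """Use the caller's template when given, else fall back to the default rubric."""
    if eval_template is not None:
        return eval_template
    return RubricTemplate.build_prompt()


prompt = resolve_template().format(
    query="What is 2 + 2?", result="4", answer="4"
)
```

Keeping the template as a plain string with named placeholders means a custom template only has to use the same placeholder names to be a drop-in replacement.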

Code Structure and Typing Improvements:

Typing Enhancements: Updated type annotations to include Mapping and Union for better type safety and clarity. (langtest/metrics/llm_eval.py, langtest/utils/custom_types/helpers.py) [1] [2]
Changes in BaseQASample Class: Modified the config attribute in the BaseQASample class to use a Mapping type for better structure and clarity. (langtest/utils/custom_types/sample.py)
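As a rough illustration of the typing change, the sketch below annotates a config attribute with Mapping and Union. The class `QASampleSketch`, the `ConfigValue` alias, and the `"model"` key are assumptions made for this example and do not mirror the actual BaseQASample fields.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Union

# Hypothetical value type: a scalar setting or a nested block of string settings.
ConfigValue = Union[str, int, float, Mapping[str, str]]


@dataclass
class QASampleSketch:
    """Illustrative sample type; not langtest's BaseQASample."""

    question: str
    # Mapping (rather than dict) signals that the sample only reads the config.
    config: Optional[Mapping[str, ConfigValue]] = None

    def eval_model(self, default: str = "default-model") -> str:
        # Fall back to an empty mapping when no config was supplied.
        cfg = self.config if self.config is not None else {}
        return str(cfg.get("model", default))


sample = QASampleSketch("What is 2 + 2?", config={"model": "my-llm"})
```

Annotating with Mapping accepts a plain dict as well as read-only views such as types.MappingProxyType, which is the main practical gain over Dict.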

Reporting Improvements:

Enhanced model_report Function: Improved the model_report function to handle multiple keys in the summary dictionary, calculate pass rates more accurately, and rearrange the columns in the final report for better readability. (langtest/utils/report_utils.py)
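The reporting idea can be sketched as follows, under stated assumptions: the summary maps each test type to a count per label (for example ratings "1" to "5"), a caller-supplied set of labels counts as passing, and the count columns are placed before the pass-rate columns. The function `build_report` and the column names are hypothetical and do not reproduce the code in langtest/utils/report_utils.py.

```python
from typing import Dict, Iterable, List, Mapping, Union

Row = Dict[str, Union[str, int, bool]]


def build_report(
    summary: Mapping[str, Mapping[str, int]],
    min_pass_rate: float,
    passing_labels: Iterable[str],
) -> List[Row]:
    """Build one report row per test type from per-label counts."""
    passing = set(passing_labels)
    # Collect every label seen in any test type so all rows share the same columns.
    labels = sorted({label for counts in summary.values() for label in counts})
    rows: List[Row] = []
    for test_type, counts in summary.items():
        total = sum(counts.values())
        passed = sum(n for label, n in counts.items() if label in passing)
        rate = passed / total if total else 0.0
        # Dict insertion order fixes the column order:
        # identifier, per-label counts, then the pass-rate columns.
        row: Row = {"test_type": test_type}
        for label in labels:
            row[f"count_{label}"] = counts.get(label, 0)
        row["pass_rate"] = f"{rate:.0%}"
        row["minimum_pass_rate"] = f"{min_pass_rate:.0%}"
        row["pass"] = rate >= min_pass_rate
        rows.append(row)
    return rows


report = build_report(
    {"uppercase": {"1": 2, "3": 3, "4": 5, "5": 10}, "lowercase": {"1": 6, "2": 4}},
    min_pass_rate=0.7,
    passing_labels=("4", "5"),
)
```

Each row can be passed straight to a table renderer or a DataFrame constructor, since all rows carry the same keys in the same order.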

These changes collectively enhance the flexibility, readability, and maintainability of the codebase.


chakravarthik27 added 7 commits October 22, 2024 23:04

Updated: model report function in utils, to support different label counts like rating 1 to 5,

93f64ec

Refactor model_report function in utils to support different label counts

86455f4

updated: eval_template parameter add to is_pass_eval func and remove unncessary comments.

3f16138

updated: annotation in BaseQASample

09e0227

updated: llm_eval.py to add EvalTemplate class for building grading prompts

936680e

updated: eval_prompt template configureable

c54fe53

updated: is_pass_llm_eval function to handle eval_template parameter

0e9df84

chakravarthik27 self-assigned this Oct 26, 2024

chakravarthik27 added 4 commits October 28, 2024 13:00

fixed: errors in report_utils.py

99a6b23

fixed: issues in testcases, and llm_eval.py and helpers.py for improved grading functionality

9123d9a

updated: LlmEval class to support dynamic grading lists and modified transformer_prompt_eval function for flexible grade handling

3e2a559

updated: refined grade list pattern in LlmEval class for improved regex matching

94df648

chakravarthik27 requested a review from dcecchini November 11, 2024 12:12

chakravarthik27 merged commit 70a7d3a into release/2.5.0 Nov 18, 2024
3 checks passed

chakravarthik27 linked an issue Nov 23, 2024 that may be closed by this pull request

Enhance Harness Report to Include Detailed Score Counts and Grouped Results #1073

Closed
