Feature/enhance harness report to include detailed score counts and grouped results #1132
This pull request introduces several changes to the `langtest` package, focusing on enhancing the evaluation framework and improving code structure. The key changes include the addition of the `EvalTemplate` class, modifications to the `is_pass_llm_eval` function, and updates to the `model_report` function.

Enhancements to Evaluation Framework:
- Addition of `EvalTemplate` class: Introduced the `EvalTemplate` class in `langtest/metrics/llm_eval.py` to build a prompt for evaluating student answers based on a given rubric. This class includes a method, `build_prompt`, that constructs a grading prompt. (`langtest/metrics/llm_eval.py`)
- Updates to `is_pass_llm_eval` function: Modified the `is_pass_llm_eval` function in `langtest/utils/custom_types/helpers.py` to accept an `eval_template` parameter. This allows for customizable evaluation templates, improving the flexibility of the evaluation process. (`langtest/utils/custom_types/helpers.py`) [1] [2]

Code Structure and Typing Improvements:
- Typing enhancements: Updated type annotations to include `Mapping` and `Union` for better type safety and clarity. (`langtest/metrics/llm_eval.py`, `langtest/utils/custom_types/helpers.py`) [1] [2]
- Changes in `BaseQASample` class: Modified the `config` attribute in the `BaseQASample` class to use a `Mapping` type for better structure and clarity. (`langtest/utils/custom_types/sample.py`)

Reporting Improvements:
- `model_report` function: Improved the `model_report` function to handle multiple keys in the summary dictionary, calculate pass rates more accurately, and rearrange the columns in the final report for better readability. (`langtest/utils/report_utils.py`)

These changes collectively enhance the flexibility, readability, and maintainability of the codebase.
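For reviewers who want a picture of the rubric-based template and the new `eval_template` parameter, here is a minimal sketch. It is not the code from this PR: `RubricTemplateSketch`, `is_pass_sketch`, `DEFAULT_RUBRIC`, the `grader` callable, and the `{query}`/`{result}`/`{answer}` placeholders are all hypothetical names chosen for illustration, and the assumption that `eval_template` may be either a ready-made prompt string or a rubric mapping is mine.

```python
from typing import Callable, Mapping, Optional, Union

# Hypothetical default rubric; the labels used in langtest may differ.
DEFAULT_RUBRIC = {
    "CORRECT": "the student answer matches the true answer",
    "INCORRECT": "the student answer contradicts or misses the true answer",
}


class RubricTemplateSketch:
    """Illustrative stand-in for a rubric-driven prompt builder."""

    @staticmethod
    def build_prompt(rubric: Mapping[str, str]) -> str:
        # One bullet per rubric label, followed by the placeholders
        # that are filled in for each sample.
        lines = [
            "You are a teacher grading a quiz.",
            "Grade the student answer using exactly one of these labels:",
        ]
        lines += [f"- {label}: {meaning}" for label, meaning in rubric.items()]
        lines += [
            "",
            "QUESTION: {query}",
            "STUDENT ANSWER: {result}",
            "TRUE ANSWER: {answer}",
            "GRADE:",
        ]
        return "\n".join(lines)


def is_pass_sketch(
    grader: Callable[[str], str],
    question: str,
    expected: str,
    actual: str,
    eval_template: Optional[Union[str, Mapping[str, str]]] = None,
) -> bool:
    """Return True when the grader labels the answer CORRECT.

    Assumption: eval_template is None (use the default rubric), a prompt
    string with the three placeholders, or a rubric mapping.
    """
    if eval_template is None:
        template = RubricTemplateSketch.build_prompt(DEFAULT_RUBRIC)
    elif isinstance(eval_template, str):
        template = eval_template
    else:
        template = RubricTemplateSketch.build_prompt(eval_template)

    prompt = template.format(query=question, result=actual, answer=expected)
    return grader(prompt).strip().upper() == "CORRECT"
```

In this sketch the grader is any callable that maps a prompt to a label, so a stub can replace the LLM in unit tests. Rubric text containing literal braces would need escaping before `str.format` is applied.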
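The reporting change (multiple keys in the summary dictionary, pass rate, column order) can be illustrated with a small stand-alone sketch. This is an assumption-laden illustration rather than the actual `model_report` logic: the function name `build_report_rows`, the `"true"`/`"false"` outcome keys, the column names, and the choice to count every outcome toward the total while only `"true"` counts as a pass are all hypothetical.

```python
from typing import Dict, List, Mapping

# Assumed key names for the pass/fail outcomes; the real summary may differ.
PASS_KEY, FAIL_KEY = "true", "false"


def build_report_rows(
    summary: Mapping[str, Mapping[str, int]],
    minimum_pass_rate: float = 0.65,
) -> List[Dict[str, object]]:
    """Turn per-test outcome counts into report rows.

    summary maps a test type to a mapping of outcome label -> count.
    Labels other than PASS_KEY/FAIL_KEY (for example score buckets)
    are kept as extra "<label>_count" columns at the end of the row.
    """
    rows = []
    for test_type, counts in summary.items():
        total = sum(counts.values())
        pass_count = counts.get(PASS_KEY, 0)
        # Guard against empty groups instead of dividing by zero.
        pass_rate = pass_count / total if total else 0.0

        # Dicts keep insertion order, so this fixes the leading columns.
        row: Dict[str, object] = {
            "test_type": test_type,
            "fail_count": counts.get(FAIL_KEY, 0),
            "pass_count": pass_count,
            "pass_rate": pass_rate,
            "minimum_pass_rate": minimum_pass_rate,
            "pass": pass_rate >= minimum_pass_rate,
        }
        # Extra score labels follow in a stable, sorted order.
        for label in sorted(counts):
            if label not in (PASS_KEY, FAIL_KEY):
                row[f"{label}_count"] = counts[label]
        rows.append(row)
    return rows


rows = build_report_rows(
    {
        "uppercase": {"true": 3, "false": 1},
        "add_typo": {"true": 1, "false": 2, "partial": 1},
    }
)
```

Keeping the fixed columns first and appending the extra score counts afterwards is one way to get a stable column order when different test types report different outcome labels.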