Refined prompt construction for feedback #1058

Draft
wants to merge 529 commits into
base: main
Conversation

piotrm0 (Contributor) commented Apr 8, 2024

Items to add to release announcement:

  • Heading: delete this list if this PR does not introduce any changes that need announcing.

Other details that are good to know but need not be announced:

  • There should be something here at least.

Work in progress. Designing feedback prompts from several common parts, allowing different prompt sizes depending on the allowable space:

import sys
from typing import Dict, List, Optional, Tuple

# SerialModel is TruLens's pydantic-based serialization base class.
from trulens_eval.utils.serial import SerialModel


class ScoringPromptBase(SerialModel):
    """Common parts to build a scoring prompt out of."""

    prefix_template: str = """"""
    """Text to include before all other parts."""

    interp_template: str = """You are a {purpose} scorer."""
    details_template: str = """"""

    low_score: int = 1
    high_score: int = 10

    shots_template: str = """EXAMPLES:"""
    shot_template: str = """
INPUTS:
{shot_inputs}
EXPECTED OUTPUT: {shot_output}"""

    suffix_template: str = """
Answer only with an integer from {low_score} ({low_interp}) to {high_score} ({high_interp}).

INPUTS:
{inputs}
SCORE: """
    """Text to include after all other parts."""


class ScoringPrompt(ScoringPromptBase):
    """Specific parts to build a scoring prompt out of."""

    interp: str
    """Minimal interpretation of the score."""

    low_interp: str
    """Interpretation of a low score."""

    high_interp: str
    """Interpretation of a high score."""

    details: str
    """Text to include after the purpose to provide some more details about
    the purpose."""

    shots: List[Tuple[Dict[str, str], int]] = []

    def build(
        self,
        max_tokens: int = sys.maxsize,
        inputs: Optional[Dict[str, str]] = None
    ) -> str:
        """Build a prompt for the given inputs while staying under the token
        limit.

        The built prompt will have at least:
            - filled prefix_template,
            - filled interp_template,
            - filled suffix_template.

        If space allows, it will also include:
            - filled details_template,
            - filled shots_template (if at least one shot is included),
            - filled shot_template for each shot.

        Args:
            max_tokens: The maximum number of tokens to use. Defaults to
                sys.maxsize (effectively infinite).

            inputs: The inputs to fill in the prompts with, other than the
                ones defining the scoring task; those are contained in self.

        Returns:
            str: The built prompt.

        Raises:
            ValueError: If the prompt would be too long to fit within the
                token limit even at its minimum size.
        """

piotrm0 and others added 30 commits November 28, 2023 15:34
* first

* working version, maybe

* note

* fixes

* note

* nit
* remove extra reset cell

* fix langchain prompt import, quickstart

* async fix imports, update install pins

* fix langchain prompttemplate imports

* ada embeddings in quickstart, pinned install versions

* pin package versions

* clear output
* fix quickstart imports

* fix langchain trulens imports
Co-authored-by: Josh Reini <[email protected]>
* move model comparison to use cases (expected location)

* multimodal example with trullama

* remove extra commit
Co-authored-by: Shayak Sen <[email protected]>
Co-authored-by: Josh Reini <[email protected]>
* version bump quickstarts

* version bump py quickstarts

* version bump all_tools

* format quickstarts

* version bump init

* update package one-liner
* added wrapper for dynamically generated functions in boto3

* docs and remove debug prints

* typo

* remove unused

* testing out streaming counting
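The "wrapper for dynamically generated functions in boto3" mentioned above hints at a general problem: boto3 client methods are generated at runtime rather than defined on a class, so an instrumenting wrapper has to intercept attribute lookup and wrap whatever callable comes back. A generic sketch of that pattern (no boto3 required; all names here are illustrative):

```python
import functools
from typing import Any, List


class CallRecorder:
    """Proxy that wraps every callable attribute of a target object,
    recording method names as calls happen. Works even when the target's
    methods are generated dynamically (as boto3 client methods are)."""

    def __init__(self, target: Any) -> None:
        self._target = target
        self.calls: List[str] = []

    def __getattr__(self, name: str) -> Any:
        # Called only for attributes not found on the proxy itself, so it
        # covers dynamically generated methods on the wrapped target.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        @functools.wraps(attr)
        def wrapped(*args: Any, **kwargs: Any) -> Any:
            self.calls.append(name)  # record the call, then delegate
            return attr(*args, **kwargs)

        return wrapped
```

The same lookup-time wrapping would apply to a real `boto3.client(...)` object in place of the plain target used here.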
not_toxic -> toxic, fix docstring. code itself is correct.
* add generator for answer relevance using SummEval

save wip work

test (#627)

wip

* remove sqlite

* groundedness eval across 100 examples

* rm

* Update groundedness_smoke_tests.ipynb

fix typo

* typo

---------

Co-authored-by: Josh Reini <[email protected]>
* add langchain prompt template

* add langchain template to _langchain_evaluate

* pass through criteria, use standard cot reasons template
* change assertion from dict to object

* get model, usage as attr not from dict
daniel-huang-1230 and others added 26 commits March 22, 2024 21:57
* fix: italise TruLens-Eval ref

* fix: italise TruLens-Eval ref in root scripts.

* docs: add contribution instructions for proper names with mod to inverted commas.

* Update standards.md

Markdown lint prefers _ to * for emphasis.

---------

Co-authored-by: Josh Reini <[email protected]>
Co-authored-by: Piotr Mardziel <[email protected]>
* working on glossary

* finish glossary draft

* nits

* Add some info regarding makefiles.

---------

Co-authored-by: Josh Reini <[email protected]>
* more pipelines docs

* adjust trigger for release tests

* one more time

* one more time

* again

* one more

* one more try

* nit

* add a docs pipeline

---------

Co-authored-by: Josh Reini <[email protected]>
* Add if_missing.

* add new enum to docs feedbacks page

* make re_0_10 rating a bit more robust

* adjust rating extraction test

* check for integers only and remove unneeded imports
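The "more robust" integer-only rating extraction described in these commits might look something like this sketch (hypothetical; the actual TruLens `re_0_10` parser may differ in name and behavior):

```python
import re
from typing import Optional


def extract_rating(response: str, low: int = 0, high: int = 10) -> Optional[int]:
    """Extract an integer score in [low, high] from an LLM response.

    Scans for standalone integers and returns the first one that falls
    inside the expected range; returns None if no valid score is found.
    """
    for match in re.findall(r"-?\d+", response):
        value = int(match)
        if low <= value <= high:
            return value
    return None
```

Restricting matches to integers within the declared range rejects stray numbers (dates, percentages, out-of-range values) that a naive number search would pick up.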
* fix image

* feedback_function index updates

* implementation and provider docs

* feedback implementations llm-based

* classification implementations

* feedback base provider docstrings

* formatting of numbered lists

* more example admonitions

* tru custom app docs

* instrumentation api docs

* virtual app api ref

* add missing title
* fix some proper names

* nits

* too many, giving up

* remove _ from mkdocs

* llama indexes

---------

Co-authored-by: Josh Reini <[email protected]>
* Spell fix

* Added user feedback button to the sidebar

* Updated share feedback text
* pin packaging

* remove packaging, remove base langchain

* remove langchain requirement

* update comment

* move nltk to required

* nltk required, download punkt on init

* add packaging requirement

* move punkt download

* bump langchain version

* pin packaging 23.2

* logger debug for optional packages
* Fix import and favicon

* Update requirements.txt

---------

Co-authored-by: Josh Reini <[email protected]>
* removed pkg_resources

* add reqs

* remove duplicate

* preserve note from duplicate

* format

* fix for py3.8

* format

* nit

* remove distutils as well and add notes

* notes

* nits

* fix static_resource for py38 again
…val utils, and docs update (#991)

* implement recommendation metrics for benchmark framework

ece fix

Revert "ece fix"

This reverts commit c58ee7e.

run actual evals

add context relevance inference api to hugs ffs

fmt

larger dataset + smarter backoff + recall

nb update (wip)

fix how we handle ties in precision and recall

saving results for GPT-3.5, GPT-4, Claude-1, and Claude-2

remove secrets
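One common way to "fix how we handle ties in precision and recall" is to include every item whose score ties with the k-th ranked score, so the metrics do not depend on an arbitrary ordering among equal scores. A hedged sketch of that approach (function and argument names are illustrative, not the benchmark framework's actual API):

```python
from typing import List, Tuple


def precision_recall_at_k(
    scored: List[Tuple[float, bool]],  # (score, is_relevant) pairs
    k: int,
) -> Tuple[float, float]:
    ranked = sorted(scored, key=lambda p: p[0], reverse=True)
    cutoff = ranked[k - 1][0]
    # Tie-aware selection: take every item scoring at or above the k-th
    # score, so equal scores are treated identically regardless of order.
    selected = [rel for score, rel in ranked if score >= cutoff]
    n_relevant = sum(rel for _, rel in scored)
    tp = sum(selected)
    precision = tp / len(selected)
    recall = tp / n_relevant if n_relevant else 0.0
    return precision, recall
```

With ties at the cutoff, the selected set may be larger than k; that is the point, since any choice of exactly k items among tied scores would be arbitrary.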

* finished evals with truera context relevance model

* add Verb 2S top 1 prompt

* update ECE method

pushed to server
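For context on the ECE update: Expected Calibration Error is commonly computed by binning predictions by confidence and taking a weighted average of the per-bin gap between accuracy and mean confidence. A minimal sketch under that standard definition (the PR's updated method may differ):

```python
from typing import List, Tuple


def expected_calibration_error(
    preds: List[Tuple[float, bool]],  # (confidence, was_correct) pairs
    n_bins: int = 10,
) -> float:
    # Partition predictions into equal-width confidence bins.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    # Weighted average of |accuracy - mean confidence| across bins.
    ece = 0.0
    for b in bins:
        if not b:
            continue
        acc = sum(c for _, c in b) / len(b)
        avg_conf = sum(c for c, _ in b) / len(b)
        ece += len(b) / len(preds) * abs(acc - avg_conf)
    return ece
```

A perfectly calibrated scorer (confidence matches empirical accuracy in every bin) yields an ECE of 0.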

* save csv results for tmp scaling

* save

* implement meeting bank generator

* example notebook for comprehensiveness benchmark WIP

* gainsight benchmarking done

remove secrets

* prepping comprehensiveness benchmark notebook

* remove unused test script

* moving results csvs

* updates models

* intermediate results code change

* good stopping point

* cleanup

* symlink docs

* huge doc updates

* fix doc symlink

* fix score range in docstring

* add docstring for truera's context relevance model

* update comprehensiveness notebook

* update comprehensiveness notebook

* fix

* file renames

* new symlinks

* update mkdocs

---------

Co-authored-by: Josh Reini <[email protected]>
Co-authored-by: Josh Reini <[email protected]>
* atlas quickstart

* header updates
* first

* assistants api (rag) quickstart

* fix indent
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Apr 8, 2024
@piotrm0 piotrm0 marked this pull request as draft April 8, 2024 05:59