# Choosing the right temperature for your LLM

The `temperature` setting in language models is like a dial that adjusts how predictable or surprising the responses from the model will be, helping application developers fine-tune the AI's creativity to suit different tasks.

In general, a low temperature leads to "safer", more expected words, while a higher temperature encourages the model to choose less obvious words. This is why higher temperature is commonly associated with more creative outputs.

Under the hood, `temperature` adjusts how the model calculates the likelihood of each word it might pick next.

The `temperature` parameter affects each output token by scaling the logits (the raw output scores from the model) before they are passed through the softmax function that turns them into probabilities. Lower temperatures sharpen the distinction between high and low scores, making the high scores more dominant, while higher temperatures flatten this distinction, giving lower-scoring words a better chance of being chosen.

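For intuition, here's a minimal sketch of temperature-scaled softmax in plain TypeScript (illustrative only, not promptfoo code):

```ts
// Divide each logit by the temperature before applying softmax.
// Low temperatures exaggerate differences between scores; high temperatures flatten them.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  const scaled = logits.map((logit) => logit / temperature);
  const maxLogit = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((logit) => Math.exp(logit - maxLogit));
  const total = exps.reduce((sum, value) => sum + value, 0);
  return exps.map((value) => value / total);
}

const logits = [2.0, 1.0, 0.1];
console.log(softmaxWithTemperature(logits, 0.2)); // sharply peaked on the top token
console.log(softmaxWithTemperature(logits, 0.9)); // flatter, so lower-scoring tokens are picked more often
```
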
## Finding the optimal temperature

The best way to find the optimal temperature parameter is to run a systematic *evaluation*.

The optimal temperature will always depend on your specific use case. That's why it's important to:

- Quantitatively measure the performance of your prompt+model outputs at various temperature settings.
- Ensure consistency in the model's behavior, which is particularly important when deploying LLMs in production environments.
- Compare the model's performance against a set of predefined criteria or benchmarks.

By running a temperature eval, you can make data-driven decisions that balance the reliability and creativity of your LLM app.

## Prerequisites

Before setting up an evaluation to compare the performance of your LLM at different temperatures, you'll need to initialize a configuration file. Run the following command to create a `promptfooconfig.yaml` file:

```bash
npx promptfoo@latest init
```

This command sets up a basic configuration file in your current directory, which you can then customize for your evaluation needs. For more information on getting started with promptfoo, refer to the [getting started guide](/docs/getting-started).

## Evaluating

Here's an example configuration that compares the outputs of GPT-3.5 at a low temperature (0.2) and a high temperature (0.9):

```yaml title=promptfooconfig.yaml
prompts:
  - 'Respond to the following instruction: {{message}}'

providers:
  - openai:gpt-3.5-turbo-0613:
      id: openai-gpt-3.5-turbo-lowtemp
      config:
        temperature: 0.2
  - openai:gpt-3.5-turbo-0613:
      id: openai-gpt-3.5-turbo-hightemp
      config:
        temperature: 0.9

tests:
  - vars:
      message: What's the capital of France?
  - vars:
      message: Write a poem about the sea.
  - vars:
      message: Generate a list of potential risks for a space mission.
  - vars:
      message: Did Henry VIII have any grandchildren?
```

In the above configuration, we use a boilerplate prompt because we're more interested in comparing temperature settings than prompts. We define two providers that call the same model (GPT-3.5) with different temperature settings. The `id` field helps us distinguish between the two when reviewing the results.

The `tests` section includes the test cases that will be run against both temperature settings.

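If you want to evaluate many more instructions, you don't have to list them all inline — promptfoo can also load test cases from an external file such as a CSV. As a sketch (the filename is hypothetical, and the exact format depends on your promptfoo version; see the test configuration docs):

```yaml
# each row of the CSV supplies a value for the {{message}} variable
tests: file://temperature_tests.csv
```
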
To run the evaluation, use the following command:

```bash
npx promptfoo@latest eval
```

This command shows the outputs side-by-side in the command line.

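If you'd rather inspect results outside the terminal, the CLI can also write them to a file with the `--output` flag (the filename here is just an example):

```bash
npx promptfoo@latest eval --output results.json
```
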
## Adding automated checks

To automatically check for expected outputs, you can define assertions in your test cases. Assertions allow you to specify the criteria that the LLM output should meet, and `promptfoo` will evaluate the output against these criteria.

For the example of Henry VIII's grandchildren, you might want to ensure that the output is factually correct. You can use a model-graded assertion, such as `llm-rubric`, to automatically check that the output does not contain any hallucinated information.

Here's how you can add an assertion to the test case:

```yaml
tests:
  - description: 'Check for hallucination on Henry VIII grandchildren question'
    vars:
      message: Did Henry VIII have any grandchildren?
    // highlight-start
    assert:
      - type: llm-rubric
        value: Henry VIII didn't have any grandchildren
    // highlight-end
```

This assertion will use a language model to determine whether the LLM output adheres to the criteria.

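The grading model itself is configurable. For example, many promptfoo versions let you point model-graded assertions at a specific grader via `defaultTest` — treat the snippet below as a sketch and check the model-graded assertion docs for your version:

```yaml
defaultTest:
  options:
    # model used to grade llm-rubric and other model-graded assertions
    provider: openai:gpt-4
```
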
In the above example comparing different GPT-3.5 temperatures, we notice that GPT actually _hallucinates_ an incorrect answer to the question about Henry VIII's grandchildren. It gets it correct with low temperature, but incorrect with high temperature:



There are many other [assertion types](/docs/configuration/expected-outputs). For example, we can check that the answer to the "space mission risks" question includes all of the following terms:

```yaml
tests:
  - vars:
      message: Generate a list of potential risks for a space mission.
    assert:
      - type: icontains-all
        value: ['radiation', 'isolation', 'environment']
```

In this case, a higher temperature leads to more creative results, but also leads to a mention of "as an AI language model":



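To flag that kind of boilerplate automatically, you could add a negated string assertion to the same test case — assuming your promptfoo version supports the `not-` prefix on assertion types:

```yaml
tests:
  - vars:
      message: Generate a list of potential risks for a space mission.
    assert:
      - type: icontains-all
        value: ['radiation', 'isolation', 'environment']
      # fail the test if the output falls back on "as an AI language model"
      - type: not-icontains
        value: as an AI language model
```
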
It's worth spending a few minutes to set up these automated checks. They help streamline the evaluation process and quickly identify bad outputs.

After the evaluation is complete, you can use the web viewer to review the outputs and compare the performance at different temperatures:

```bash
npx promptfoo@latest view
```

## Evaluating randomness

LLMs are inherently nondeterministic, which means their outputs will vary with each call at nonzero temperatures (and sometimes even at zero temperature). OpenAI introduced the `seed` parameter to improve reproducibility of outputs, and other providers will probably follow suit.

Set a constant seed in the provider config:

```yaml
providers:
  - openai:gpt-3.5-turbo-0613:
      id: openai-gpt-3.5-turbo-lowtemp
      config:
        temperature: 0.2
        // highlight-next-line
        seed: 0
  - openai:gpt-3.5-turbo-0613:
      id: openai-gpt-3.5-turbo-hightemp
      config:
        temperature: 0.9
        // highlight-next-line
        seed: 0
```

The `eval` command also has a `--repeat` parameter, which runs each test multiple times:

```bash
npx promptfoo@latest eval --repeat 3
```

The above command runs the LLM three times for each test case, helping you get a more complete sample of how it performs at a given temperature.

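Note that promptfoo caches provider responses, which can hide run-to-run variability when repeating identical requests. If your results look suspiciously stable across repeats, consider disabling the cache for that run (the exact mechanism may vary by version; a `--no-cache` flag is assumed here):

```bash
npx promptfoo@latest eval --repeat 3 --no-cache
```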