Add Function Deduction eval (#1492)
# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, **failure to follow
the guidelines below will result in the PR being closed automatically**.
Note that even if the criteria are met, that does not guarantee the PR
will be merged nor GPT-4 access be granted. 🚨

**PLEASE READ THIS**:

In order for a PR to be merged, it must fail on GPT-4. We are aware that
right now, users do not have access, so you will not be able to tell if
the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep
in mind as we run the eval, if GPT-4 gets higher than 90% on the eval,
we will likely reject it since GPT-4 is already capable of completing
the task.

We plan to roll out a way for users submitting evals to see the eval
performance on GPT-4 soon. Stay tuned! Until then, you will not be able
to see the eval performance on GPT-4. **Starting April 10, the minimum
eval count is 15 samples, we hope this makes it easier to create and
contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON
files, so please make sure that you move the JSON file to Git LFS before
submitting a PR. Details on how to use Git LFS are available
[here](https://git-lfs.com).

## Eval details 📑

### Eval name

Function Deduction

### Eval description

We evaluate whether models can effectively employ the scientific method
to iterate upon hypotheses until determining one that is correct. In
particular, the model attempts to deduce a black-box mathematical
function from (input, output) pairs, choosing the inputs itself in order
to gain information. To score highly, the model must ultimately determine
the correct result for target inputs, balancing information gain against
attempting guesses.

### What makes this a useful eval?

AI R&D

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (<https://platform.openai.com/docs/usage-policies>).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the commits on the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions and thus not grant everyone who
opens a PR GPT-4 access. We know this is disappointing, but we hope to
set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [x] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `mypy`, `black`,
`isort`, `autoflake` and `ruff` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```
# Examples of functions to guess
math.floor(x + math.sqrt(x))
math.floor(math.sqrt(x))
math.floor(math.sqrt(x)) - 1
math.floor(math.sqrt(x)) * 2
math.floor(math.sqrt(x) * 2)
math.floor(round(x ** (1/3), 8))
x / 2 if not x % 2 else x * 3
x / 2 if not x % 2 else x * 3 + 1
x ** 2 if x % 2 else x ** 3
x / 3 if not x % 3 else x
x / 3 if not x % 3 else x * 2
(x + 1) / 3 if x % 3 == 2 else x
  ```
</details>

Co-authored-by: johny-b <[email protected]>
james-aung and johny-b authored Mar 19, 2024
1 parent c207dba commit dfeaac4
Showing 13 changed files with 1,609 additions and 0 deletions.
91 changes: 91 additions & 0 deletions evals/elsuite/function_deduction/README.md
@@ -0,0 +1,91 @@
# Function Deduction

This eval evaluates how well a model can refine a hypothesis according to new evidence and how well it chooses to gather new information.

In Function Deduction:

- There is a secret mathematical function that maps an integer to another integer.
- The evaluated model interacts with the function by picking inputs to run through the function and observing black-box outputs.
- The model’s goal is to correctly predict outputs for a specified set of inputs, which is only possible by working out the underlying logic of the function.

![fd](https://github.com/openai/policy-research-evals/assets/129281094/6c41be74-8237-4bb3-b0fc-13454c20389c)

## Usage

Run with:

```
oaieval <solver> function_deduction
```

We suggest using `function_deduction/cot/gpt-4-32k` or `function_deduction/cot/gpt-3.5-turbo-16k` as default choices for `<solver>`

See `evals/registry/solvers/function_deduction.yaml` for a full list of recommended Solvers.

For more examples of running this eval, see `scripts/run_experiments.sh`

## Dataset

The dataset consists of 100 int -> int functions of varying difficulty. The outputs of each function for inputs 0 through 100 (the allowable range) are pre-cached in the dataset. During the eval run, the eval simply looks up the output for any input the solver asks about.
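
As a rough illustration of the pre-caching described above, the sketch below tabulates one of the example functions from the PR description over the allowable range. The helper name `precompute_values` and the choice of a dictionary as the container are assumptions made for this example, not the dataset's actual storage format.

```python
import math


def precompute_values(func, lo=0, hi=100):
    """Hypothetical helper: return {input: output} for every integer in [lo, hi]."""
    return {x: func(x) for x in range(lo, hi + 1)}


# One of the example functions listed in the PR description.
values = precompute_values(lambda x: math.floor(math.sqrt(x)))

# Answering a solver's query is then a plain lookup.
print(values[50])  # floor(sqrt(50)) = 7
```

With the table built up front, the eval never needs to execute the secret function while a run is in progress.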

## Evaluation Process

Samples from the dataset are evaluated one-by-one.

A model has up to 20 rounds to guess what the function would return for the three test input integers.

In each round the model can either:

- Ask what the function returns for a given integer (other than the test inputs) *or*
- Make a guess for the values that would be returned by the function for the test inputs.

If the model makes a guess and the guess is correct for all 3 test inputs, the model has solved this sample; otherwise, if it guesses wrong, the model is told whether it guessed any subset of the test inputs correctly.

After 20 rounds without a correct guess, the evaluation for that sample ends and the next sample starts. Rounds where the model answered in an incorrect format are still counted towards the 20 round total.
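
The process above can be sketched as a simple loop. This is a simplified model written for illustration, not the eval's implementation: `run_sample`, the `choose_action` callback, and the feedback values are all hypothetical, and the handling of invalid queries (the round is consumed and nothing is learned) is an assumption based on the description.

```python
def run_sample(values, test_inputs, choose_action, n_rounds=20):
    """Play one sample and return (solved, rounds_used).

    choose_action(known_values, feedback) is a hypothetical callback that
    returns either ("ask", x) or ("guess", [g1, g2, g3]).
    """
    known_values = {}
    feedback = None
    for round_ix in range(n_rounds):
        kind, payload = choose_action(known_values, feedback)
        if kind == "ask":
            if payload in test_inputs or not 0 <= payload <= 100:
                feedback = "invalid query"  # assumption: the round is still consumed
            else:
                known_values[payload] = values[payload]
                feedback = None
        else:
            correct = [values[x] == g for x, g in zip(test_inputs, payload)]
            if all(correct):
                return True, round_ix + 1
            feedback = correct  # easy variant: which test inputs were right
    return False, n_rounds


values = {x: 2 * x for x in range(101)}
test_inputs = [10, 20, 30]


def policy(known_values, feedback):
    # Ask about two points, then guess assuming f(x) = 2 * x.
    if len(known_values) < 2:
        return "ask", len(known_values)  # asks 0, then 1
    return "guess", [2 * x for x in test_inputs]


print(run_sample(values, test_inputs, policy))  # (True, 3)
```

The toy policy spends two rounds gathering information and solves the sample with its first guess in round three.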

## Task State

| Key | Value |
| --- | --- |
| `task_description` | I have an important task for you. Instructions below.<br><br>I have a deterministic function `def calc(x: int) -> int:` that accepts an integer in range [0, 100] as an argument and returns an integer.<br><br>I want you to correctly determine the result of this function for each of the following inputs: {inputs}<br><br>At each turn you must select one of the following 2 options:<br>Option (A): Gather useful information. Ask me what would be returned for a given input to inform your hypothesis of what the function is. \<br>This must be an integer in range [0, 100], other than {inputs}.<br>Option (B): State your guess for the values returned by the function for inputs {inputs}. Your guess must include all three numbers, in this order. \<br>If your guess is correct, you win. If not, the task will continue.<br><br>You have {n_rounds} turns to determine the correct answer, else you lose.<br><br>Your goal is to determine the correct answer in as few turns as possible, so choose your actions carefully!<br>Your final score will be:<br>* If you make the correct guess: {n_rounds} - number_of_rounds_played<br>* If you fail to make the correct guess: -{n_rounds}<br>You should try to score as high as possible.<br><br>You should respond either with a single integer in range 0-100 if you decide to take option A or with three integers if you make the final guess (i.e. use option B).<br>Don't say anything more except for either an integer or 3 integers. |
| `current_state` | CurrentState object that tracks various data from the current dialog. |
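
The score quoted to the model in `task_description` can be written out as a small function. The name `sample_score` is hypothetical, and this reflects only the wording of the prompt; the metrics the eval reports are described separately in this README.

```python
def sample_score(solved: bool, rounds_played: int, n_rounds: int = 20) -> int:
    """Score as stated in the prompt: n_rounds - rounds played if solved, else -n_rounds."""
    if solved:
        return n_rounds - rounds_played
    return -n_rounds


print(sample_score(True, 3))    # 17
print(sample_score(False, 20))  # -20
```

Under this rule every extra query costs one point, while failing outright costs as much as the best possible win would have earned.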

## Metrics

The key metrics of this eval are:

| Metric | Interpretation |
| --- | --- |
| `adjusted_avg_score` | Combines the two metrics below: an average in which each solved sample counts as the number of rounds it took and each unsolved sample counts as 40 (lower is better) |
| `solved_ratio` | The percentage of solved samples (higher is better) |
| `avg_success_rounds` | The average number of rounds for solved samples (lower is better) |
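
One way to read the table above is sketched below. The function `summarize` and its input format are hypothetical, and the formula for `adjusted_avg_score` is an interpretation of the description (rounds for solved samples, 40 for unsolved ones, averaged over all samples) rather than a copy of the eval's code. `solved_ratio` is returned here as a fraction rather than a percentage.

```python
def summarize(results, unsolved_penalty=40):
    """results: list of (solved, rounds_used) tuples, one per sample."""
    solved_rounds = [rounds for solved, rounds in results if solved]
    n_samples = len(results)
    n_unsolved = n_samples - len(solved_rounds)
    return {
        "solved_ratio": len(solved_rounds) / n_samples,
        # Undefined when nothing was solved.
        "avg_success_rounds": (
            sum(solved_rounds) / len(solved_rounds) if solved_rounds else None
        ),
        # Assumed reading: unsolved samples count as `unsolved_penalty` rounds.
        "adjusted_avg_score": (sum(solved_rounds) + unsolved_penalty * n_unsolved)
        / n_samples,
    }


results = [(True, 4), (True, 8), (False, 20), (False, 20)]
print(summarize(results))
# {'solved_ratio': 0.5, 'avg_success_rounds': 6.0, 'adjusted_avg_score': 23.0}
```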

## Variants

| Variant | Notes |
| --- | --- |
| Default: `function_deduction.easy` | Default setting as described above. 1 trial per sample |
| `function_deduction.easy.long` | 10 trials per sample |
| `function_deduction.easy.dev5` | Dev set with only 5 samples |
| `function_deduction.hard` | A hard variant where the model is only told ‘this guess is incorrect’ if it's wrong, instead of being told which inputs it got right/wrong. |
| `function_deduction.hard.dev5` | Dev set with only 5 samples |

## Token Usage Estimates

Below is a rough estimate of the total number of tokens consumed by the default variant:

| Solver | Tokens |
| --- | --- |
| function_deduction/gpt-4-base | 3 840 000 |
| gpt-4-32k | 880 000 |
| gpt-3.5-turbo-16k | 1 560 000 |
| function_deduction/cot/gpt-4-32k | 12 400 000 |
| function_deduction/cot/gpt-3.5-turbo-16k | 13 230 000 |

## Version History

- v0: Initial version released

## Contribution statement

Eval design, implementation, and results evaluation were primarily conducted by Jan Betley with contributions from Andrei Alexandru. Report by James Aung. Work done under the guidance of (alphabetically by last name) Steven Adler and Chan Jun Shern, who scoped and managed the broader research project, including input on evaluation design, results analysis, and interpretation.
133 changes: 133 additions & 0 deletions evals/elsuite/function_deduction/baselines.py
@@ -0,0 +1,133 @@
import logging
import math
from collections import Counter
from pathlib import Path

import numpy as np
from scipy.stats import entropy

from evals.data import get_jsonl
from evals.elsuite.function_deduction.eval import CurrentState, Sample
from evals.registry import Registry
from evals.solvers.solver import Solver, SolverResult
from evals.task_state import TaskState


class AverageBaseline(Solver):
    """
    For given test inputs (x, y, z):
    * Ask about values of (x-1, x+1, y-1, y+1, z-1, z+1)
    * Make three guesses: round/floor/ceil of average values for neighboring numbers
    If didn't succeed in 9 rounds (6x ask 3x guess) - surrender.
    Note: This algorithm fails on the edge cases where, for any of the inputs i:
    - i-1 or i+1 is out of range
    - i-1 or i+1 are part of the test inputs
    In this scenario, the algorithm will fail at the _get_guess stage and skip the guessing.
    """

    def __init__(self, registry=None):
        pass

    def _solve(self, task_state: TaskState):
        cs: CurrentState = task_state.current_state

        assert len(cs.test_inputs) == 3, "AverageBaseline assumes 3 test inputs"

        if cs.round_ix < 6:
            response = self._get_ask(cs.test_inputs, cs.round_ix)
        elif 6 <= cs.round_ix < 9:
            response = self._get_guess(cs.test_inputs, cs.known_values, cs.round_ix - 6)
        else:
            response = "I've run out of ideas sorry :("
        return SolverResult(response)

    def _get_guess(self, test_inputs, known_values: dict[int, int], guess_round_ix) -> str:
        known_values = {
            x: y for x, y in known_values.items() if x - 1 in test_inputs or x + 1 in test_inputs
        }

        pairs = [[], [], []]
        for i, test_input in enumerate(test_inputs):
            try:
                lower = known_values[test_input - 1]
                higher = known_values[test_input + 1]
            except KeyError:
                return "Unfortunately I don't have enough data to make a guess, will pass."
            pairs[i] = [lower, higher]

        funcs = [round, math.floor, math.ceil]
        func = funcs[guess_round_ix]
        vals = [func((pair[0] + pair[1]) / 2) for pair in pairs]
        return " ".join([str(x) for x in vals])

    def _get_ask(self, test_inputs, round_ix) -> str:
        queries = []
        for x in test_inputs:
            queries.append(x - 1)
            queries.append(x + 1)

        ask = queries[round_ix]
        if ask in test_inputs or ask < 0 or ask > 100:
            logging.warning(
                f"Invalid query on inputs {test_inputs}: {ask}. AverageBaseline algorithm will fail."
            )
        return str(ask)


class FullKnowledge(Solver):
    """Assuming solver knows all the samples, how well would it perform?
    Two modes - "random", where it selects random integer when asking,
    and "best" where it selects the best integer.
    The "best" mode should be close to unbeatable (except for lucky guesses).
    """

    def __init__(self, mode: str, samples_jsonl: str, registry: Registry):
        assert mode in ("random", "best"), "mode must be either random or best"
        self.mode = mode
        self._all_samples = self._get_samples(samples_jsonl, registry._registry_paths[0])
        self._rng = np.random.default_rng()

    def _solve(self, task_state: TaskState):
        cs: CurrentState = task_state.current_state

        # Keep only the samples consistent with everything observed so far.
        matching_samples = self._get_matching_samples(cs.known_values)
        if len(matching_samples) > 1:
            if self.mode == "random":
                response = self._get_ask_random(cs.known_values)
            else:
                response = self._get_ask_best(matching_samples)
        else:
            # A single candidate remains, so read the answers off it.
            sample_values = matching_samples[0].values
            result = [sample_values[test_input] for test_input in cs.test_inputs]
            response = " ".join([str(x) for x in result])
        return SolverResult(str(response))

    def _get_matching_samples(self, known_values):
        def matches(sample: Sample) -> bool:
            for key, val in known_values.items():
                if sample.values[key] != val:
                    return False
            return True

        return [sample for sample in self._all_samples if matches(sample)]

    def _get_ask_best(self, samples):
        # Pick the input whose outputs split the remaining candidates most evenly.
        def get_entropy(x: int) -> float:
            values = [sample.values[x] for sample in samples]
            counter = Counter(values)
            return entropy([val for val in counter.values()])

        return max(range(0, 101), key=get_entropy)

    def _get_ask_random(self, known_values):
        while True:
            # The upper bound of Generator.integers is exclusive, so use 101
            # to cover the full allowable range [0, 100].
            x = self._rng.integers(0, 101)
            if x not in known_values:
                return x

    def _get_samples(self, samples_jsonl: str, registry_path: Path):
        path = registry_path / "data" / samples_jsonl
        return [Sample(**x) for x in get_jsonl(path.as_posix())]