Manual tweak of results layout to be more github-md friendly
awwaiid committed Dec 22, 2024
1 parent 7d67438 commit 87805ca
Showing 1 changed file with 16 additions and 76 deletions.
92 changes: 16 additions & 76 deletions evaluation_results/2024-12-21_13-57-31/results.md
@@ -4,9 +4,7 @@ There are 4 scenarios and 4 test cases with 3 attempts (48 total tests).
## Test: blank_math

### claude_sonnet_latest_with_seg
<img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/claude_sonnet_latest_with_seg/1/merged-output.png' border=1 width=200 />

<img src='../../evaluations/blank_math/input.png' border=1 width=200 />
<img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/claude_sonnet_latest_with_seg/1/merged-output.png' border=1 width=200 /> <img src='../../evaluations/blank_math/input.png' border=1 width=200 />

```
10
@@ -15,36 +13,22 @@ There are 4 scenarios and 4 test cases with 3 attempts (48 total tests).
<img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/claude_sonnet_latest_with_seg/3/merged-output.png' border=1 width=200 />

### gpt-4o-mini_no_seg
<img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/gpt-4o-mini_no_seg/1/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/gpt-4o-mini_no_seg/2/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/gpt-4o-mini_no_seg/3/merged-output.png' border=1 width=200 />
<img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/gpt-4o-mini_no_seg/1/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/gpt-4o-mini_no_seg/2/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/gpt-4o-mini_no_seg/3/merged-output.png' border=1 width=200 />

### gpt-4o_with_seg
<img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/gpt-4o_with_seg/1/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/gpt-4o_with_seg/2/merged-output.png' border=1 width=200 />

<img src='../../evaluations/blank_math/input.png' border=1 width=200 />
<img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/gpt-4o_with_seg/1/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/gpt-4o_with_seg/2/merged-output.png' border=1 width=200 /> <img src='../../evaluations/blank_math/input.png' border=1 width=200 />

```
10
```

### claude_sonnet_latest_no_seg
<img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/claude_sonnet_latest_no_seg/1/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/claude_sonnet_latest_no_seg/2/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/claude_sonnet_latest_no_seg/3/merged-output.png' border=1 width=200 />
<img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/claude_sonnet_latest_no_seg/1/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/claude_sonnet_latest_no_seg/2/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/blank_math/claude_sonnet_latest_no_seg/3/merged-output.png' border=1 width=200 />

## Test: tic_tac_toe_1

### claude_sonnet_latest_with_seg
<img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/claude_sonnet_latest_with_seg/1/merged-output.png' border=1 width=200 />

<img src='../../evaluations/tic_tac_toe_1/input.png' border=1 width=200 />
<img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/claude_sonnet_latest_with_seg/1/merged-output.png' border=1 width=200 /> <img src='../../evaluations/tic_tac_toe_1/input.png' border=1 width=200 />

```
Your turn! Place an O anywhere you'd like.
@@ -53,83 +37,39 @@ Your turn! Place an O anywhere you'd like.
<img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/claude_sonnet_latest_with_seg/3/merged-output.png' border=1 width=200 />

### gpt-4o-mini_no_seg
<img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/gpt-4o-mini_no_seg/1/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/gpt-4o-mini_no_seg/2/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/gpt-4o-mini_no_seg/3/merged-output.png' border=1 width=200 />
<img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/gpt-4o-mini_no_seg/1/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/gpt-4o-mini_no_seg/2/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/gpt-4o-mini_no_seg/3/merged-output.png' border=1 width=200 />

### gpt-4o_with_seg
<img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/gpt-4o_with_seg/1/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/gpt-4o_with_seg/2/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/gpt-4o_with_seg/3/merged-output.png' border=1 width=200 />
<img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/gpt-4o_with_seg/1/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/gpt-4o_with_seg/2/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/gpt-4o_with_seg/3/merged-output.png' border=1 width=200 />

### claude_sonnet_latest_no_seg
<img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/claude_sonnet_latest_no_seg/1/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/claude_sonnet_latest_no_seg/2/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/claude_sonnet_latest_no_seg/3/merged-output.png' border=1 width=200 />
<img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/claude_sonnet_latest_no_seg/1/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/claude_sonnet_latest_no_seg/2/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/tic_tac_toe_1/claude_sonnet_latest_no_seg/3/merged-output.png' border=1 width=200 />

## Test: x_in_box

### claude_sonnet_latest_with_seg
<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/claude_sonnet_latest_with_seg/1/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/claude_sonnet_latest_with_seg/2/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/claude_sonnet_latest_with_seg/3/merged-output.png' border=1 width=200 />
<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/claude_sonnet_latest_with_seg/1/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/claude_sonnet_latest_with_seg/2/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/claude_sonnet_latest_with_seg/3/merged-output.png' border=1 width=200 />

### gpt-4o-mini_no_seg
<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/gpt-4o-mini_no_seg/1/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/gpt-4o-mini_no_seg/2/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/gpt-4o-mini_no_seg/3/merged-output.png' border=1 width=200 />
<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/gpt-4o-mini_no_seg/1/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/gpt-4o-mini_no_seg/2/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/gpt-4o-mini_no_seg/3/merged-output.png' border=1 width=200 />

### gpt-4o_with_seg
<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/gpt-4o_with_seg/1/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/gpt-4o_with_seg/2/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/gpt-4o_with_seg/3/merged-output.png' border=1 width=200 />
<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/gpt-4o_with_seg/1/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/gpt-4o_with_seg/2/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/gpt-4o_with_seg/3/merged-output.png' border=1 width=200 />

### claude_sonnet_latest_no_seg
<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/claude_sonnet_latest_no_seg/1/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/claude_sonnet_latest_no_seg/2/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/claude_sonnet_latest_no_seg/3/merged-output.png' border=1 width=200 />
<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/claude_sonnet_latest_no_seg/1/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/claude_sonnet_latest_no_seg/2/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/x_in_box/claude_sonnet_latest_no_seg/3/merged-output.png' border=1 width=200 />

## Test: x_in_boxes

### claude_sonnet_latest_with_seg
<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/claude_sonnet_latest_with_seg/1/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/claude_sonnet_latest_with_seg/2/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/claude_sonnet_latest_with_seg/3/merged-output.png' border=1 width=200 />
<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/claude_sonnet_latest_with_seg/1/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/claude_sonnet_latest_with_seg/2/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/claude_sonnet_latest_with_seg/3/merged-output.png' border=1 width=200 />

### gpt-4o-mini_no_seg
<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/gpt-4o-mini_no_seg/1/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/gpt-4o-mini_no_seg/2/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/gpt-4o-mini_no_seg/3/merged-output.png' border=1 width=200 />
<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/gpt-4o-mini_no_seg/1/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/gpt-4o-mini_no_seg/2/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/gpt-4o-mini_no_seg/3/merged-output.png' border=1 width=200 />

### gpt-4o_with_seg
<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/gpt-4o_with_seg/1/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/gpt-4o_with_seg/2/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/gpt-4o_with_seg/3/merged-output.png' border=1 width=200 />
<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/gpt-4o_with_seg/1/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/gpt-4o_with_seg/2/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/gpt-4o_with_seg/3/merged-output.png' border=1 width=200 />

### claude_sonnet_latest_no_seg
<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/claude_sonnet_latest_no_seg/1/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/claude_sonnet_latest_no_seg/2/merged-output.png' border=1 width=200 />

<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/claude_sonnet_latest_no_seg/3/merged-output.png' border=1 width=200 />
<img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/claude_sonnet_latest_no_seg/1/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/claude_sonnet_latest_no_seg/2/merged-output.png' border=1 width=200 /> <img src='../../evaluation_results/2024-12-21_13-57-31/x_in_boxes/claude_sonnet_latest_no_seg/3/merged-output.png' border=1 width=200 />
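The commit message describes this layout change as a manual tweak. As a rough sketch only, the same transformation (joining `<img>` lines that are separated by a single blank line so they render side by side) could be scripted along the following lines. The function name `join_image_rows` and the regular expression are hypothetical and are not part of the repository; the sketch assumes each image sits alone on its line and ends with `/>`, as in this results file.

```python
import re

# Assumption: one self-closing <img ... /> tag per line, as in results.md.
IMG_LINE = re.compile(r"^\s*<img\b[^>]*/>\s*$")


def join_image_rows(markdown: str) -> str:
    """Join <img> lines separated by single blank lines onto one line."""
    lines = markdown.split("\n")
    out = []
    i = 0
    while i < len(lines):
        if IMG_LINE.match(lines[i]):
            row = [lines[i].strip()]
            j = i + 1
            # Absorb each "blank line followed by another <img> line" pair.
            while (
                j + 1 < len(lines)
                and lines[j].strip() == ""
                and IMG_LINE.match(lines[j + 1])
            ):
                row.append(lines[j + 1].strip())
                j += 2
            out.append(" ".join(row))
            i = j
        else:
            # Headings, code fences, and other text pass through unchanged.
            out.append(lines[i])
            i += 1
    return "\n".join(out)
```

A blank line that is followed by anything other than an image (for example a code fence) is left in place, so the spacing between an image row and the text below it is preserved.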
