Commit

Updates
github-actions-bot committed Feb 11, 2025
1 parent bd1a64f commit 739aa30
Showing 1 changed file with 11 additions and 11 deletions.
22 changes: 11 additions & 11 deletions blogs/ai_space_escape.md
@@ -105,17 +105,17 @@ Our evaluation results show the following findings:

- GameArena’s ranking aligns with other reasoning benchmarks considering both reasoning and language capabilities (LiveBench-Reasoning, LiveBench-Language, GPQA).

| Rank | Model Name | Akinator Score | Bluffing Score | Taboo Score | Average Score |
|-----:|-------------------------------------|---------------:|---------------:|------------:|--------------:|
| #1 | claude-3-5-sonnet...-20240620 | 1004.71 | 1025.65 | 990.47 | 1006.94 |
| #2 | gemini-2.0-flash-thinking-exp-01-21 | N/A | 1002.38 | 1007.45 | 1004.91 |
| #3 | gemini-1.5-pro | 1016.54 | 1026.82 | 955.56 | 999.64 |
| #4 | deepseek-r1 | 1001.00 | 999.03 | 998.52 | 999.51 |
| #5 | o1-mini | 999.08 | 994.05 | 990.21 | 994.45 |
| #6 | qwen-max | 1015.84 | 970.93 | 992.81 | 993.19 |
| #7 | gpt-4o-2024-11-20 | 1003.60 | 996.06 | 973.09 | 990.92 |
| #8 | llama-3-405b | 999.59 | 984.04 | 980.55 | 988.06 |
| #9 | grok-2-beta | 994.91 | 977.67 | 973.24 | 981.94 d|
| Model Name | Akinator Score | Bluffing Score | Taboo Score | Average Score |
|:-------------------------------------|---------------:|---------------:|------------:|--------------:|
| claude-3-5-sonnet-20240620 | 1004.71 | 1025.65 | 990.47 | 1006.94 |
| gemini-2.0-flash-thinking-exp-01-21 | N/A | 1002.38 | 1007.45 | 1004.91 |
| gemini-1.5-pro | 1016.54 | 1026.82 | 955.56 | 999.64 |
| deepseek-r1 | 1001.00 | 999.03 | 998.52 | 999.51 |
| o1-mini | 999.08 | 994.05 | 990.21 | 994.45 |
| qwen-max | 1015.84 | 970.93 | 992.81 | 993.19 |
| gpt-4o-2024-11-20 | 1003.60 | 996.06 | 973.09 | 990.92 |
| llama-3-405b | 999.59 | 984.04 | 980.55 | 988.06 |
| grok-2-beta | 994.91 | 977.67 | 973.24 | 981.94 |


![hidetable](placeholder.jpg "Table 2: Game Arena model ranking (sorted by average score across three games) as of February 2025. For more fine-grained analysis with detailed metrics, please check our paper.")
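
The caption says the table is sorted by average score across the three games. As a rough sanity check, the sketch below recomputes that average from the per-game columns and re-sorts the models. It assumes the average is a plain arithmetic mean over the games a model has a score for, with `N/A` entries skipped; the source does not state this, and the helper name `average_score` is hypothetical. Because the published per-game scores are rounded to two decimals, a recomputed average can differ from the table by about 0.01.

```python
from statistics import mean

# Per-game scores (Akinator, Bluffing, Taboo) copied from the updated table.
# None stands in for the table's "N/A" entry.
scores = {
    "claude-3-5-sonnet-20240620": (1004.71, 1025.65, 990.47),
    "gemini-2.0-flash-thinking-exp-01-21": (None, 1002.38, 1007.45),
    "gemini-1.5-pro": (1016.54, 1026.82, 955.56),
    "deepseek-r1": (1001.00, 999.03, 998.52),
    "o1-mini": (999.08, 994.05, 990.21),
    "qwen-max": (1015.84, 970.93, 992.81),
    "gpt-4o-2024-11-20": (1003.60, 996.06, 973.09),
    "llama-3-405b": (999.59, 984.04, 980.55),
    "grok-2-beta": (994.91, 977.67, 973.24),
}


def average_score(game_scores):
    """Mean over the games with a score (assumption: N/A games are skipped)."""
    available = [s for s in game_scores if s is not None]
    return mean(available)


# Sort models from highest to lowest recomputed average.
ranking = sorted(scores, key=lambda m: average_score(scores[m]), reverse=True)

for position, model in enumerate(ranking, start=1):
    print(f"{position}. {model}: {average_score(scores[model]):.2f}")
```

Under that assumption the recomputed order matches the row order of the table, which is consistent with the Rank column being dropped in this commit without losing information.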