clembench: A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents

The cLLM (chat-optimized Large Language Model, "clem") framework tests such models' ability to engage in games – rule-constituted activities played using language. The framework is a systematic way of probing the situated language understanding of language-using agents.

This repository contains the code for setting up the framework and implements a number of games that are further discussed in:

Chalamalasetti, K., Götze, J., Hakimov, S., Madureira, B., Sadler, P., & Schlangen, D. (2023). clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents (arXiv:2305.13455). arXiv. https://doi.org/10.48550/arXiv.2305.13455

Evaluation Results

On the main project website, under leaderboard.

Game details

A Simple Word Game: taboo
A Word-Guessing Game Based on Clues: wordle
Drawing Instruction Giving and Following: image
An ASCII Picture Reference Game: reference
Scorekeeping: private and shared

Using the benchmark

We welcome you to contribute to or extend the benchmark with your own games and models. Please simply open a pull request. You can find more information on how to use the benchmark in the links below.

How to run the benchmark and evaluation locally
How to run the benchmark and update the leaderboard (workflow)
How to add a new model as a backend
How to add and run your own game
How to integrate with Slurk
