
EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents

🌐 Project Page | 🤗 Dataset | 📃 Paper

EmbodiedEval is a comprehensive and interactive benchmark designed to evaluate the capabilities of MLLMs in embodied tasks.

Installation

Setup Simulation Environment

EmbodiedEval includes a 3D simulator for real-time simulation. There are two options for running the simulator:

Option 1: Run the simulator on your personal computer with a display (Windows/macOS/Linux). No additional configuration is required. The subsequent installation and data download (approximately 20 GB of disk space) will take place on your computer.

Option 2: Run the simulator on a Linux server. This requires sudo access, up-to-date NVIDIA drivers, and running outside a Docker container. The following additional configuration is required:

Additional configurations
  1. Install Xorg:

    sudo apt install -y gcc make pkg-config xorg
    
  2. Generate .conf file:

    sudo nvidia-xconfig --no-xinerama --probe-all-gpus --use-display-device=none
    sudo cp /etc/X11/xorg.conf /etc/X11/xorg-0.conf
    
  3. Edit /etc/X11/xorg-0.conf:

    • Remove the "ServerLayout" and "Screen" sections.
    • Set the BoardName and BusID of the "Device" section to the Name and PCI BusID of the GPU you want to use, as reported by the nvidia-xconfig --query-gpu-info command. For example:
      Section "Device"
          Identifier     "Device0"
          Driver         "nvidia"
          VendorName     "NVIDIA Corporation"
          BusID          "PCI:164:0:0"
          BoardName      "NVIDIA GeForce RTX 3090"
      EndSection
      
  4. Run Xorg:

    sudo nohup Xorg :0 -config /etc/X11/xorg-0.conf &
    
  5. Set the display (Remember to run the following command in every new terminal session before running the evaluation code):

    export DISPLAY=:0
    

Install Dependencies

conda create -n embodiedeval python=3.10
conda activate embodiedeval
pip install -r requirements.txt

Download Dataset

python download.py

Evaluation

Run Baselines

Random baseline

python run_eval.py --agent random

Human baseline

python run_eval.py --agent human

In the human baseline, you can interact with the environment manually.

How to play
  • Press the corresponding number key to choose an option;

  • Pressing W/A/D maps to the forward / turn left / turn right options in the menu;

  • Pressing Enter opens or closes the chat window, where you can enter option numbers greater than 9;

  • Pressing T hides or shows the options panel.

GPT-4o

Edit the api_key and base_url in agent.py and run:

python run_eval.py --agent gpt-4o
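
For reference, this edit usually amounts to pointing the two values at an OpenAI-compatible endpoint. A minimal sketch, assuming agent.py stores them as plain variables (their exact location in the file may differ):

# In agent.py (adjust to wherever api_key and base_url are defined)
api_key = "YOUR_API_KEY"
base_url = "https://api.openai.com/v1"  # or another OpenAI-compatible endpoint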

Evaluate Your Own Model

To evaluate your own model, implement the MyAgent class in agent.py: in the __init__ method, load the model or initialize the API; in the generate method, perform model inference or API calls and return the generated text. See the comments within the class for details.
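
A minimal sketch of what MyAgent might look like, assuming the model is served behind an OpenAI-compatible endpoint (e.g., a local vLLM server) and that generate receives a text prompt plus a list of image paths. The signature and all names other than MyAgent, __init__, and generate are placeholder assumptions; follow the comments in agent.py for the actual interface.

import base64
from openai import OpenAI

class MyAgent:
    def __init__(self):
        # Assumption: the model is served behind an OpenAI-compatible API
        # (e.g., a local vLLM server). Load your model directly here instead
        # if you are not using an API.
        self.client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
        self.model_name = "my-mllm"  # placeholder model identifier

    def generate(self, prompt, image_paths):
        # Assumption: generate receives a text prompt and a list of image paths;
        # check the comments in agent.py for the actual arguments.
        content = [{"type": "text", "text": prompt}]
        for path in image_paths:
            with open(path, "rb") as f:
                b64 = base64.b64encode(f.read()).decode("utf-8")
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            })
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": content}],
        )
        # Return the generated text, as required by the evaluation loop.
        return response.choices[0].message.content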

Then run the following command to evaluate your model:

python run_eval.py --agent myagent

If your server cannot run the simulator (e.g., you do not have sudo access) and your personal computer cannot run the model, you can run the simulator on your computer and the model on the server using the following steps:

Evaluation steps with a remote simulator
  1. Perform the Install Dependencies and Download Dataset steps on both your local computer and the server.

  2. On the server, run:

    python run_eval.py --agent myagent --remote --scene_folder <the absolute path of the scene folder on your local computer>
    

    This command will hang, waiting for the simulator to connect.

  3. On your computer, set up an SSH tunnel between your computer and the server:

    ssh -N -L 50051:localhost:50051 <username>@<host> [-p <ssh_port>]
    
  4. On your computer, launch the simulator:

    python launch.py
    

    Once the simulator starts, the evaluation process on the server will begin.

Compute Metrics

Run metrics.py with the result folder as an argument to compute the performance. total_metrics.json (overall performance) and type_metrics.json (performance per task type) will be saved in the result folder.

python metrics.py --result_folder results/xxx-xxx-xxx
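
To inspect the saved metrics programmatically, a minimal sketch (the keys inside each file depend on the tasks evaluated):

import json
from pathlib import Path

result_folder = Path("results/xxx-xxx-xxx")  # the folder passed to metrics.py

# total_metrics.json holds the overall performance; type_metrics.json breaks it
# down per task type.
for name in ("total_metrics.json", "type_metrics.json"):
    with open(result_folder / name) as f:
        print(name, json.dumps(json.load(f), indent=2))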

Citation
