This folder contains the evaluation harness that we built on top of the original Commit0 benchmark (paper).
The evaluation consists of two steps:
- Environment setup: install the Python environment and configure the LLM config.
- Run Evaluation: generate an edit patch for each Commit0 repo and get the evaluation results.
Please follow the instructions here to set up your local development environment and LLM.
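For reference, a minimal sketch of an LLM config group in `config.toml` is shown below. The group name `llm.eval_sonnet` matches the example commands later in this document; the model identifier and API key are placeholders, not required values.

```bash
# Illustrative only: append an LLM config group to config.toml.
# The section name must match the model_config you pass to run_infer.sh (e.g. llm.eval_sonnet);
# the model name and API key below are placeholders — substitute your own.
cat >> config.toml <<'EOF'
[llm.eval_sonnet]
model = "anthropic/claude-3-5-sonnet-20241022"
api_key = "YOUR_API_KEY"
temperature = 0.0
EOF
```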
OpenHands supports using the Commit0 Docker for inference. This is now the default behavior.
Make sure your Docker daemon is running, and you have ample disk space (at least 200-500GB, depending on the Commit0 split you are running) for the instance-level docker images.
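As a quick pre-flight check (illustrative; these are standard Docker and `df` commands, not part of the harness), you can verify that the daemon is up and see how much space is left under Docker's data directory:

```bash
# Verify the Docker daemon is reachable
docker info > /dev/null 2>&1 && echo "Docker daemon is running" || echo "Docker daemon is NOT running"

# Check free space on the default Docker data root (adjust the path if you use a custom data-root)
df -h /var/lib/docker
```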
When the `run_infer.sh` script starts, it will automatically pull the `lite` split of Commit0. For example, for instance ID `commit-0/minitorch`, it will try to pull our pre-built docker image `wentingzhao/minitorch` from DockerHub. This image is then used to build an OpenHands runtime image in which the agent will operate.
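If you want to warm the cache before starting a run, you can pre-pull an instance image manually (optional; shown here for the `minitorch` example above):

```bash
# Optional: pre-pull the per-instance image so run_infer.sh does not have to download it during inference
docker pull wentingzhao/minitorch
```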
```bash
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh [repo_split] [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

# Example
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 16 100 8 wentingzhao/commit0_combined test
```
where `model_config` is mandatory, and the rest are optional.

- `repo_split`, e.g. `lite`, is the split of the Commit0 dataset you would like to evaluate on. Available options are `lite`, `all`, and each individual repo.
- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenHands version you would like to evaluate. It could also be a release tag like `0.6.2`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By default, the script evaluates the `lite` split of the Commit0 dataset (16 repos). Note: in order to use `eval_limit`, you must also set `agent`.
- `max_iter`, e.g. `20`, is the maximum number of iterations for the agent to run. By default, it is set to 30.
- `num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By default, it is set to 1.
- `dataset`, a Hugging Face dataset name, e.g. `wentingzhao/commit0_combined`, specifies which dataset to evaluate on.
- `dataset_split`, the split of the Hugging Face dataset. Note that only `test` is supported for Commit0.
Note that the `USE_INSTANCE_IMAGE` environment variable is always set to `true` for Commit0.
Let's say you'd like to run 10 instances using `llm.eval_sonnet` and `CodeActAgent`; then your command would be:

```bash
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 10 30 1 wentingzhao/commit0_combined test
```
Evaluation can also be run on the remote runtime. This is in limited beta; contact Xingyao over Slack if you want to try it out!
```bash
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh [repo_split] [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

# Example - This runs evaluation on CodeActAgent for 10 instances of the "wentingzhao/commit0_combined" test set, with at most 30 iterations per instance and 1 worker running in parallel
ALLHANDS_API_KEY="YOUR-API-KEY" RUNTIME=remote SANDBOX_REMOTE_RUNTIME_API_URL="https://runtime.eval.all-hands.dev" EVAL_DOCKER_IMAGE_PREFIX="docker.io/wentingzhao" \
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh lite llm.eval_sonnet HEAD CodeActAgent 10 30 1 wentingzhao/commit0_combined test
```
To clean up all existing runtimes you've already started, run:

```bash
ALLHANDS_API_KEY="YOUR-API-KEY" ./evaluation/utils/scripts/cleanup_remote_runtime.sh
```
If you would like to benchmark only specific tasks, simply pass the selected repo through the `repo_split` option.
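For example, assuming the bare repo name (e.g. `minitorch`, the repo used in the instance above) is an accepted `repo_split` value, a single-repo run could look like this sketch:

```bash
# Illustrative: benchmark only the minitorch repo by passing it as repo_split
./evaluation/benchmarks/commit0_bench/scripts/run_infer.sh minitorch llm.eval_sonnet HEAD CodeActAgent 1 30 1 wentingzhao/commit0_combined test
```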