[Evaluation] Use the latest official SWE-Bench Dockerization for evaluation (#2728)

* add newline after patch to fix patch apply

* new swebench wip

* add newline after patch to fix patch apply

* only add newline if not empty

* update swebench source and update

* update gitignore for swebench eval

* update old prep_eval

* update gitignore

* add scripts for push and pull swebench images

* update eval_infer.sh

* update eval_infer for new docker workflow

* update script to create markdown report based on report.json

* update eval infer to use updated output

* update readme

* only move result to folder if running whole file

* remove set-x

* update conversion script

* Update evaluation/swe_bench/README.md

* Update evaluation/swe_bench/README.md

* Update evaluation/swe_bench/README.md

* make sure last line ends with newline

* switch to a fix attempt branch of swebench

* Update evaluation/swe_bench/README.md

* Update evaluation/swe_bench/README.md

---------

Co-authored-by: Engel Nyst <[email protected]>
xingyaoww and enyst authored Jul 1, 2024
1 parent 6246cb8 commit 6a0ffc5
Showing 11 changed files with 809 additions and 313 deletions.
6 changes: 5 additions & 1 deletion .gitignore
@@ -212,4 +212,8 @@ cache
config.toml
config.toml.bak

containers/agnostic_sandbox
containers/agnostic_sandbox

# swe-bench-eval
image_build_logs
run_instance_logs
22 changes: 15 additions & 7 deletions evaluation/swe_bench/README.md
@@ -2,6 +2,8 @@

This folder contains the evaluation harness that we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)). We created [a fork of SWE-Bench](https://github.com/OpenDevin/OD-SWE-bench.git) mostly built on top of [the original repo](https://github.com/princeton-nlp/SWE-bench) and [containerized](#opendevin-swe-bench-docker-image) it for easy evaluation.

**UPDATE (7/1/2024): We now support the official SWE-Bench dockerized evaluation as announced [here](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).**

## Setup Environment

Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.
@@ -10,7 +12,7 @@ Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/D

In the [OpenDevin-SWE-Bench fork](https://github.com/OpenDevin/OD-SWE-bench.git) (mostly from the [original repo](https://github.com/princeton-nlp/SWE-bench) with some fixes), we try to pre-build the **testbed** (i.e., the code of the repository we want the agent to edit) AND the **conda environment**, so that at evaluation (inference) time, we can directly leverage the existing environments for efficient evaluation.

**We pack everything you need for SWE-Bench evaluation into one, gigantic, docker image.** To use it:
**We pack everything you need for SWE-Bench inference into one, gigantic, docker image.** To use it:

```bash
docker pull ghcr.io/opendevin/eval-swe-bench:full-v1.2.1
@@ -124,16 +126,23 @@ After running the inference, you will obtain a `output.jsonl` (by default it will

With the `output.jsonl` file, you can run `eval_infer.sh` to evaluate the generated patches and produce a fine-grained report.

**This evaluation is performed using the official dockerized evaluation announced [here](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).**
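
For example, assuming the default output location, the invocation looks roughly like the sketch below (the path is illustrative; substitute the location of your own `output.jsonl`):

```bash
# Sketch: run the dockerized evaluation on an inference output file.
# The path below is illustrative; point it at your own output.jsonl.
./evaluation/swe_bench/scripts/eval_infer.sh \
  evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
```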

If you want to evaluate existing results, you should first run this to clone the existing outputs:

```bash
git clone https://huggingface.co/spaces/OpenDevin/evaluation evaluation/evaluation_outputs
```
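
The cloned space contains previous runs, so if you just want something to try the evaluation on, you can locate one of their `output.jsonl` files, for example:

```bash
# Sketch: list a few output.jsonl files from the cloned evaluation outputs.
find evaluation/evaluation_outputs -name output.jsonl | head -n 5
```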

To prepare for swe-bench evaluation, you should pull evaluation docker from [OpenDevin/SWE-bench-docker](https://github.com/OpenDevin/SWE-bench-docker) and download swe-bench data by running:
If you have extra local disk space (e.g., 500GB), you can try pulling the [instance-level docker images](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md#choosing-the-right-cache_level) we've prepared to speed up the evaluation by running:

```bash
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh instance
```

If you want to save a bit of disk space (e.g., with ~50GB of free disk space) while still speeding up the image pre-build process, you can pull the environment-level docker images instead:
```bash
evaluation/swe_bench/scripts/eval/prep_eval.sh
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh env
```

Then you can run the following:
@@ -146,12 +155,11 @@

PS: You can also pass in a JSONL with [SWE-Bench format](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md#-creating-predictions) to `./evaluation/swe_bench/scripts/eval_infer.sh`, where each line is a JSON of `{"model_patch": "XXX", "model_name_or_path": "YYY", "instance_id": "ZZZ"}`.
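
For illustration only, a minimal predictions file in that format (with a placeholder patch and instance ID) could be created and evaluated roughly as follows:

```bash
# Sketch: write a one-line predictions file in SWE-Bench format and evaluate it.
# The patch and instance_id below are placeholders, not real predictions.
cat > my_predictions.jsonl << 'EOF'
{"model_patch": "diff --git a/foo.py b/foo.py\n...", "model_name_or_path": "my-model", "instance_id": "django__django-11099"}
EOF

./evaluation/swe_bench/scripts/eval_infer.sh my_predictions.jsonl
```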

The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/` with the following files/directory (following format of [SWE-bench-docker](https://github.com/aorwall/SWE-bench-docker/tree/main/evaluations/SWE-bench_Lite_golden)):
The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/` with the following files/directory:

- `README.md`: a report showing which instances passed, failed, etc.
- `logs/`: a directory of test logs
- `report.json`: a JSON file that contains keys like `"resolved"` pointing to instance IDs that are resolved by the agent.
- `summary.json`: a JSON file that contains more fine-grained information for each test instance.
- `report.json`: a JSON file that contains keys like `"resolved_ids"` pointing to instance IDs that are resolved by the agent.
- `eval_outputs/`: a directory of test logs
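
As a quick sanity check, you can inspect `report.json` directly. Assuming `jq` is installed and `"resolved_ids"` is a JSON array of instance IDs (as described above), something like the following counts and lists the resolved instances:

```bash
# Sketch: summarize resolved instances from report.json.
# RESULTS_DIR is the results directory mentioned above; adjust as needed.
RESULTS_DIR=evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0
jq '.resolved_ids | length' "$RESULTS_DIR/report.json"
jq -r '.resolved_ids[]' "$RESULTS_DIR/report.json"
```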

## Visualize Results

Expand Down
