[Evaluation] Use the latest official SWE-Bench Dockerization for evaluation (#2728)

* add newline after patch to fix patch apply

* new swebench wip

* add newline after patch to fix patch apply

* only add newline if not empty

* update swebench source and update

* update gitignore for swebench eval

* update old prep_eval

* update gitignore

* add scripts for push and pull swebench images

* update eval_infer.sh

* update eval_infer for new docker workflow

* update script to create markdown report based on report.json

* update eval infer to use updated output

* update readme

* only move result to folder if running whole file

* remove set-x

* update conversion script

* Update evaluation/swe_bench/README.md

* Update evaluation/swe_bench/README.md

* Update evaluation/swe_bench/README.md

* make sure last line ends with newline

* switch to a fix attempt branch of swebench

* Update evaluation/swe_bench/README.md

* Update evaluation/swe_bench/README.md

---------

Co-authored-by: Engel Nyst <[email protected]>
xingyaoww and enyst authored Jul 1, 2024
1 parent 6246cb8 commit 6a0ffc5
Showing 11 changed files with 809 additions and 313 deletions.
6 changes: 5 additions & 1 deletion .gitignore
@@ -212,4 +212,8 @@ cache
config.toml
config.toml.bak

containers/agnostic_sandbox
containers/agnostic_sandbox

# swe-bench-eval
image_build_logs
run_instance_logs
22 changes: 15 additions & 7 deletions evaluation/swe_bench/README.md
@@ -2,6 +2,8 @@

This folder contains the evaluation harness that we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)). We created [a fork of SWE-Bench](https://github.com/OpenDevin/OD-SWE-bench.git) mostly built on top of [the original repo](https://github.com/princeton-nlp/SWE-bench) and [containerized](#opendevin-swe-bench-docker-image) it for easy evaluation.

**UPDATE (7/1/2024): We now support the official SWE-Bench dockerized evaluation as announced [here](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).**

## Setup Environment

Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.
@@ -10,7 +12,7 @@ Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/D

In the [OpenDevin-SWE-Bench fork](https://github.com/OpenDevin/OD-SWE-bench.git) (mostly from the [original repo](https://github.com/princeton-nlp/SWE-bench) with some fixes), we try to pre-build the **testbed** (i.e., the code of the repository we want the agent to edit) AND the **conda environment**, so that at evaluation (inference) time, we can directly leverage the existing environments for efficient evaluation.

**We pack everything you need for SWE-Bench evaluation into one, gigantic, docker image.** To use it:
**We pack everything you need for SWE-Bench inference into one, gigantic, docker image.** To use it:

```bash
docker pull ghcr.io/opendevin/eval-swe-bench:full-v1.2.1
@@ -124,16 +126,23 @@ After running the inference, you will obtain a `output.jsonl` (by default it will

With the `output.jsonl` file, you can run `eval_infer.sh` to evaluate the generated patches and produce a fine-grained report.

**This evaluation is performed using the official dockerized evaluation announced [here](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).**
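
For example, assuming the default output location, the invocation looks roughly like the sketch below (the path is illustrative; substitute the location of your own `output.jsonl`):

```bash
# Sketch: run the dockerized evaluation on an inference output file.
# The path below is illustrative; point it at your own output.jsonl.
./evaluation/swe_bench/scripts/eval_infer.sh \
  evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
```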

If you want to evaluate existing results, you should first run this to clone the existing outputs:

```bash
git clone https://huggingface.co/spaces/OpenDevin/evaluation evaluation/evaluation_outputs
```
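
The cloned space contains previous runs, so if you just want something to try the evaluation on, you can locate one of their `output.jsonl` files, for example:

```bash
# Sketch: list a few output.jsonl files from the cloned evaluation outputs.
find evaluation/evaluation_outputs -name output.jsonl | head -n 5
```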

To prepare for swe-bench evaluation, you should pull evaluation docker from [OpenDevin/SWE-bench-docker](https://github.com/OpenDevin/SWE-bench-docker) and download swe-bench data by running:
If you have extra local disk space (e.g., 500GB), you can try pulling the [instance-level docker images](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md#choosing-the-right-cache_level) we've prepared to speed up the evaluation by running:

```bash
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh instance
```

If you want to save a bit of disk space (e.g., with ~50GB of free disk space) while still speeding up the image pre-build process, you can pull the environment-level docker images instead:
```bash
evaluation/swe_bench/scripts/eval/prep_eval.sh
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh env
```

Then you can run the following:
@@ -146,12 +155,11 @@

PS: You can also pass in a JSONL with [SWE-Bench format](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md#-creating-predictions) to `./evaluation/swe_bench/scripts/eval_infer.sh`, where each line is a JSON of `{"model_patch": "XXX", "model_name_or_path": "YYY", "instance_id": "ZZZ"}`.
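
For illustration only, a minimal predictions file in that format (with a placeholder patch and instance ID) could be created and evaluated roughly as follows:

```bash
# Sketch: write a one-line predictions file in SWE-Bench format and evaluate it.
# The patch and instance_id below are placeholders, not real predictions.
cat > my_predictions.jsonl << 'EOF'
{"model_patch": "diff --git a/foo.py b/foo.py\n...", "model_name_or_path": "my-model", "instance_id": "django__django-11099"}
EOF

./evaluation/swe_bench/scripts/eval_infer.sh my_predictions.jsonl
```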

The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/` with the following files/directory (following format of [SWE-bench-docker](https://github.com/aorwall/SWE-bench-docker/tree/main/evaluations/SWE-bench_Lite_golden)):
The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/` with the following files/directory:

- `README.md`: a report showing which instances passed, failed, etc.
- `logs/`: a directory of test logs
- `report.json`: a JSON file that contains keys like `"resolved"` pointing to instance IDs that are resolved by the agent.
- `summary.json`: a JSON file that contains more fine-grained information for each test instance.
- `report.json`: a JSON file that contains keys like `"resolved_ids"` pointing to instance IDs that are resolved by the agent.
- `eval_outputs/`: a directory of test logs
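
As a quick sanity check, you can inspect `report.json` directly. Assuming `jq` is installed and `"resolved_ids"` is a JSON array of instance IDs (as described above), something like the following counts and lists the resolved instances:

```bash
# Sketch: summarize resolved instances from report.json.
# RESULTS_DIR is the results directory mentioned above; adjust as needed.
RESULTS_DIR=evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0
jq '.resolved_ids | length' "$RESULTS_DIR/report.json"
jq -r '.resolved_ids[]' "$RESULTS_DIR/report.json"
```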

## Visualize Results

Expand Down
