Follow the guidance in SWE-bench to set up the Docker and Conda environments, then install the following packages:
pip install libcst
pip install vertexai
Download all repository files and repository structures, and put them under the inference folder.
The Gemini calling function request_gemini_engine is in inference/utils.py. We currently call Gemini through the Vertex AI API; change this function if your setup differs.
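For reference, a Gemini call through the Vertex AI SDK typically looks like the sketch below. This is not the code in inference/utils.py, just a minimal stand-in showing the shape to adapt; the default model name, temperature, and the lazy imports are assumptions.

```python
def request_gemini_engine(prompt, model_name="gemini-1.5-pro", temperature=0.8):
    """Send a prompt to Gemini via Vertex AI and return the text response.

    Sketch only -- the repo's inference/utils.py is authoritative.
    """
    # Imported lazily so this module can be loaded (e.g., for testing)
    # without the vertexai package or GCP credentials being configured.
    import vertexai
    from vertexai.generative_models import GenerativeModel

    # vertexai.init() picks up project/location from the environment;
    # pass project=... and location=... explicitly otherwise.
    vertexai.init()
    model = GenerativeModel(model_name)
    response = model.generate_content(
        prompt, generation_config={"temperature": temperature}
    )
    return response.text
```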
Inference is run under the inference folder:
cd inference
python localize_file.py --pred_dir {result_folder}
python localize_function.py --pred_dir {result_folder}
python edit.py --pred_dir {result_folder} --gemini_version {trained_checkpoint_id}
python post_process.py --pred_dir {result_folder} --output_path {output_path}
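The evaluation harness later consumes the file written by post_process.py. Its exact logic lives in the repo, but the target format is the standard SWE-bench predictions file: a list of records with instance_id, model_name_or_path, and model_patch fields. A minimal sketch, with a hypothetical write_predictions helper:

```python
import json

def write_predictions(patches, model_name, output_path):
    """Write patches in the predictions format the SWE-bench harness reads.

    patches: dict mapping instance_id -> unified-diff patch string.
    Hypothetical helper; the repo's post_process.py is authoritative.
    """
    records = [
        {
            "instance_id": instance_id,
            "model_name_or_path": model_name,
            "model_patch": patch,
        }
        for instance_id, patch in sorted(patches.items())
    ]
    with open(output_path, "w") as f:
        json.dump(records, f, indent=2)
```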
The trained checkpoint id can be chosen from the following:
- projects/618488765595/locations/us-central1/tuningJobs/5189087506007588864: Trained with 3k synthesized data for 1 epoch.
- projects/618488765595/locations/us-central1/tuningJobs/5248990823634173952: Trained with 3k synthesized data for 2 epochs.
- projects/618488765595/locations/us-central1/tuningJobs/4976151686125977600: Trained with 9k synthesized data for 1 epoch.
- projects/618488765595/locations/us-central1/tuningJobs/4724000409650200576: Trained with 20k synthesized data for 1 epoch.
- projects/618488765595/locations/us-central1/tuningJobs/7671606365764190208: Trained with 24k synthesized data for 1 epoch.
- projects/618488765595/locations/us-central1/tuningJobs/6343387523317760000: Trained with 28k synthesized data for 1 epoch.
- projects/618488765595/locations/us-central1/tuningJobs/6869814998999236608: Trained with 1k training data for 1 epoch.
- projects/618488765595/locations/us-central1/tuningJobs/889879118781349888: Trained with 2k training data for 1 epoch.
- projects/618488765595/locations/us-central1/tuningJobs/7311378868714078208: Trained with 4k training data for 1 epoch.
- projects/618488765595/locations/us-central1/tuningJobs/2758872964140105728: Trained with 4k training data (different sampling) for 1 epoch.
- projects/618488765595/locations/us-central1/tuningJobs/7768621774241005568: Trained with 12k synthesized data on Django for 1 epoch.
- projects/618488765595/locations/us-central1/tuningJobs/6564682105471631360: Trained with 2k synthesized data on Matplotlib for 1 epoch.
cd evaluation
python -m swebench.harness.run_evaluation --predictions_path {output_path} --max_workers 24 --run_id {experiment_id}
Known issue: in many cases, the evaluation seems to never finish. We check the output files and terminate the process if no new instance has been evaluated for a long period of time, e.g., 20 minutes.
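One way to automate that check is a small watchdog that watches the evaluation log directory and flags a stall when no file has been modified within the timeout. The helper names below (newest_mtime, is_stalled) are hypothetical, a sketch of the idea rather than code from this repo or the harness; point it at your run's log directory (e.g., under evaluation/logs/run_evaluation) and kill the evaluation process when it reports a stall.

```python
import os
import time

def newest_mtime(log_dir):
    """Most recent modification time of any file under log_dir (0.0 if none)."""
    latest = 0.0
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            try:
                latest = max(latest, os.path.getmtime(os.path.join(root, name)))
            except OSError:
                pass  # file vanished between listing and stat; ignore it
    return latest

def is_stalled(log_dir, timeout_s=20 * 60):
    """True if files exist but none has changed within timeout_s seconds."""
    latest = newest_mtime(log_dir)
    return latest > 0 and (time.time() - latest) > timeout_s
```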
python view_results.py --run_id {experiment_id}
python majority_vote.py
It looks for sub-directories under evaluation/logs/run_evaluation and performs a majority vote based on the generated patches.
In the current experiments, we sample each checkpoint 10 times (a full pass from step 1 to step 4), which results in 120 generations per example.
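The vote itself can be as simple as counting identical patch strings among the generations and keeping the most frequent one. A sketch with a hypothetical majority_vote helper (the repo's majority_vote.py may normalize patches before comparing; this version drops empty candidates and breaks ties by first occurrence):

```python
from collections import Counter

def majority_vote(patches):
    """Return the most common non-empty patch string, or None if there is none.

    Hypothetical helper illustrating the voting step; Counter.most_common
    breaks ties by insertion order, i.e., the earliest-seen patch wins.
    """
    counts = Counter(p for p in patches if p)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```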
To run all inference and evaluation steps end-to-end, you may use run.sh inside a tmux session:
tmux new -s mysession
./run.sh