Version 1.3
This repository contains datasets for the 29 problems described in the paper General Program Synthesis Benchmark Suite. These problems come from introductory programming textbooks, and require a range of programming constructs and datatypes to solve. These datasets are designed to be usable for any method of performing general program synthesis, including but not limited to inductive program synthesis and evolutionary methods such as genetic programming.
Each problem in the benchmark suite is located in a separate directory in the datasets directory. All data files are compressed with gzip. To unzip the datasets, you can unzip individual files, or run the decompress script to unzip them all at once.
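If running the decompress script is not convenient, the same result can be approximated from Python. The following is a minimal sketch, not the repository's own script: the function name `decompress_all` is hypothetical, and it assumes that every compressed file ends in `.gz` and that the decompressed copy should be written next to the original.

```python
import gzip
import shutil
from pathlib import Path


def decompress_all(root):
    """Decompress every *.gz file under root and return the paths written.

    Assumption: compressed files end in ".gz". The original archives are
    left in place; each output file sits next to its source.
    """
    written = []
    for gz_path in sorted(Path(root).rglob("*.gz")):
        out_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)  # streams, so large files are fine
        written.append(out_path)
    return written
```

Calling `decompress_all("datasets")` from the repository root would process every problem directory in one pass.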
For each problem, we provide a set of edge cases and a set of random cases. The edge cases are those given in the technical report describing the benchmark problems. The random cases are all generated from the recommended data domains from the technical report, and appear in the same proportions as recommended. For each problem, we included up to 1 million random cases, unless the range of problem inputs is smaller than 1 million or the resulting file size was over 100 MB, in which case we included as many as we could.
A typical use of these datasets for a set of runs of program synthesis would be:
- For each run, use every edge case in the training set.
- For each run, use a different, randomly-sampled set of random cases in the training set.
- Use a larger set of random cases as an unseen test set.
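The procedure above could be sketched in Python as follows. The function name `build_sets` and its parameters are hypothetical, the set sizes are left to the caller, and drawing the training and test cases from a single sample (so that the two never overlap within a run) is an assumption of this sketch rather than a stated requirement.

```python
import random


def build_sets(edge_cases, random_cases, n_train_random, n_test, seed):
    """Return (training, test) case lists for one run.

    Every edge case goes into the training set. The random cases for
    training and testing are drawn without replacement in one sample,
    so they are disjoint (an assumption of this sketch).
    """
    rng = random.Random(seed)  # one seed per run gives a different sample per run
    drawn = rng.sample(random_cases, n_train_random + n_test)
    training = list(edge_cases) + drawn[:n_train_random]
    test = drawn[n_train_random:]
    return training, test
```

Passing a different `seed` for each run gives each run its own random training cases while the edge cases stay fixed.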
We have provided typical sizes of training and test sets in each problem's README.md.
Each edge and random dataset is provided in three formats: CSV, JSON, and EDN, with all three formats containing identical data.
The CSV files are formatted as follows:
- The first row of the file is the column names.
- Each following row corresponds to one set of program inputs and expected outputs.
- Input columns are labeled input1, input2, etc., and output columns are labeled output1, output2, etc.
- In CSVs, string inputs and outputs are double-quoted only when necessary. Newlines within strings are escaped.
- Columns in CSV files are comma-separated.
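A decompressed CSV file in this layout can be read with Python's standard `csv` module. This is a minimal sketch under a few assumptions: the function name `read_cases` is hypothetical, the file is assumed to be UTF-8, values are returned as raw strings with no type conversion, and escaped newlines are left exactly as they appear in the file.

```python
import csv


def read_cases(path):
    """Yield (inputs, outputs) pairs of raw string values from one CSV file.

    Columns are split by their header prefix ("input" or "output") and
    keep the order in which they appear in the file.
    """
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # the first row supplies the column names
            inputs = [v for k, v in row.items() if k.startswith("input")]
            outputs = [v for k, v in row.items() if k.startswith("output")]
            yield inputs, outputs
```

Converting the strings to integers, floats, or other types depends on the problem and is left to the caller.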
The JSON and EDN files are formatted using the JSON Lines standard (adapted for EDN): each case is on its own line of the data file. The files should be read line by line, with each line parsed into an object/map using a JSON/EDN parser.
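For the JSON files, line-by-line reading might look like the sketch below. The function name `read_json_lines` is hypothetical, UTF-8 encoding is assumed, and skipping blank lines is a defensive choice rather than something the format requires. Reading the EDN files would need a separate EDN parser, which is not shown here.

```python
import json


def read_json_lines(path):
    """Yield one parsed object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate a trailing blank line
                yield json.loads(line)
```

Because the function is a generator, a file with many cases can be processed without loading all of it into memory.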
If you use these datasets in a publication, please cite the paper General Program Synthesis Benchmark Suite and include a link to this repository.
BibTeX entry for paper:
@InProceedings{Helmuth:2015:GECCO,
author = "Thomas Helmuth and Lee Spector",
title = "General Program Synthesis Benchmark Suite",
booktitle = "GECCO '15: Proceedings of the 2015 Annual Conference
on Genetic and Evolutionary Computation",
year = "2015",
isbn13 = "978-1-4503-3472-3",
pages = "1039--1046",
organisation = "SIGEVO",
address = "Madrid, Spain",
URL = "http://doi.acm.org/10.1145/2739480.2754769",
DOI = "10.1145/2739480.2754769",
publisher = "ACM",
publisher_address = "New York, NY, USA",
}
The datasets here were generated using this function call in Clojush, the Clojure version of PushGP. This code is admittedly messy, and should only be used if absolutely necessary.