GHPR contains data about pull requests that have fixed one or more issues on GitHub.
The dataset can be found at /ghpr.csv.
A small sample of the dataset is also available at /ghpr-sample.csv.
Each instance of GHPR contains data about an issue and a pull request, where the pull request has fixed the issue. Note that in some cases, a single pull request is linked to multiple issues or vice versa.
The dataset is a CSV file with the following columns:
- `repo_id`: Integer
- `issue_number`: Integer
- `issue_title`: Text
- `issue_body_md`: Text, in Markdown format, can be empty
- `issue_body_plain`: Text, in plain text, can be empty
- `issue_created_at`: Integer, in Unix time
- `issue_author_id`: Integer
- `issue_author_association`: Integer enum (see values below)
- `issue_label_ids`: Comma-separated integers, can be empty
- `pull_number`: Integer
- `pull_created_at`: Integer, in Unix time
- `pull_merged_at`: Integer, in Unix time
- `pull_comments`: Integer
- `pull_review_comments`: Integer
- `pull_commits`: Integer
- `pull_additions`: Integer
- `pull_deletions`: Integer
- `pull_changed_files`: Integer
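As a minimal sketch of how these columns might be read, the following uses only the Python standard library. It assumes the first row of the CSV is a header carrying exactly the column names listed above, which this document does not state explicitly; the helper names `parse_row`, `read_ghpr`, and `unix_to_datetime` are hypothetical and not part of GHPR Tools.

```python
import csv
from datetime import datetime, timezone

# Column names in the order listed above.
# Assumption: the CSV header row uses exactly these names.
COLUMNS = [
    "repo_id", "issue_number", "issue_title", "issue_body_md",
    "issue_body_plain", "issue_created_at", "issue_author_id",
    "issue_author_association", "issue_label_ids", "pull_number",
    "pull_created_at", "pull_merged_at", "pull_comments",
    "pull_review_comments", "pull_commits", "pull_additions",
    "pull_deletions", "pull_changed_files",
]

# Every other column is documented as an integer.
NON_INTEGER_COLUMNS = {
    "issue_title", "issue_body_md", "issue_body_plain", "issue_label_ids",
}


def parse_row(row):
    """Convert one csv.DictReader row (all strings) into typed values."""
    parsed = {}
    for name in COLUMNS:
        value = row[name]
        if name == "issue_label_ids":
            # Comma-separated integers; an empty field means no labels.
            parsed[name] = [int(x) for x in value.split(",")] if value else []
        elif name in NON_INTEGER_COLUMNS:
            parsed[name] = value  # text columns, possibly empty
        else:
            parsed[name] = int(value)
    return parsed


def read_ghpr(lines):
    """Parse an open file (or any iterable of CSV lines) into typed rows."""
    return [parse_row(row) for row in csv.DictReader(lines)]


def unix_to_datetime(seconds):
    """Interpret a Unix-time column value as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

To use it on a downloaded copy, open the file with `newline=""` and `encoding="utf-8"` and pass the file object to `read_ghpr`; opening with `newline=""` lets the `csv` module handle line breaks inside quoted issue bodies.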
The value of `issue_body_plain` is converted from `issue_body_md`.
The conversion is not always perfect.
In some cases, `issue_body_plain` still contains some Markdown tags.
The value of `issue_author_association` can be one of the following:
- `0`: Collaborator
- `1`: Contributor
- `2`: First-timer
- `3`: First-time contributor
- `4`: Mannequin
- `5`: Member
- `6`: None
- `7`: Owner
See GitHub docs for more details on author association.
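For convenience, the integer codes can be mapped back to readable names. The sketch below mirrors the value list above; the class name `AuthorAssociation` and the helper `association_name` are illustrative choices, not something the dataset or GHPR Tools is known to provide.

```python
from enum import IntEnum


class AuthorAssociation(IntEnum):
    """Integer codes used in the issue_author_association column."""
    COLLABORATOR = 0
    CONTRIBUTOR = 1
    FIRST_TIMER = 2
    FIRST_TIME_CONTRIBUTOR = 3
    MANNEQUIN = 4
    MEMBER = 5
    NONE = 6
    OWNER = 7


def association_name(code):
    """Return the enum name for a raw column value (int or numeric string)."""
    return AuthorAssociation(int(code)).name
```

Because `IntEnum` members compare equal to plain integers, raw column values can be compared directly, for example `value == AuthorAssociation.MEMBER`.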
Rows of the dataset are sorted by repository owner username, repository name, pull request number, and then issue number.
The data is collected using the GitHub REST API. The data collection flow is as follows:
- For repository R:
  - For each merged pull request P in R:
    - For each issue I that is linked by P using a GitHub keyword and is in R:
      - (I, P) is a member of GHPR.
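The flow above can be sketched offline as follows. This is not the GHPR Tools implementation: it assumes links are found by scanning the pull request body for GitHub's closing keywords followed by `#<number>`, treats only same-repository `#N` references as links, and takes the set of known issue numbers as input. The function name `collect_pairs` and the input shapes are hypothetical; `number`, `body`, and `merged_at` are fields of pull request objects returned by the GitHub REST API.

```python
import re

# GitHub's documented closing keywords.
CLOSING_KEYWORDS = (
    "close", "closes", "closed",
    "fix", "fixes", "fixed",
    "resolve", "resolves", "resolved",
)

# Matches e.g. "Fixes #123" or "closes: #45" (case-insensitive).
LINK_PATTERN = re.compile(
    r"\b(?:%s):?\s+#(\d+)" % "|".join(CLOSING_KEYWORDS),
    re.IGNORECASE,
)


def collect_pairs(pulls, issue_numbers):
    """Return (issue, pull) number pairs for one repository.

    pulls: iterable of dicts with "number", "body", and "merged_at".
    issue_numbers: set of numbers that are issues (not pull requests) in
    the same repository.
    """
    pairs = []
    for pull in pulls:
        if pull.get("merged_at") is None:
            continue  # only merged pull requests are considered
        body = pull.get("body") or ""
        linked = sorted({int(n) for n in LINK_PATTERN.findall(body)})
        for issue in linked:
            if issue in issue_numbers:
                pairs.append((issue, pull["number"]))
    return pairs
```

A pull request that links several issues yields several pairs, which matches the note that a single pull request can be linked to multiple issues.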
The dataset is created using GHPR Tools.
The raw data for this dataset is also available, where you can find a JSON file for each issue and pull request present in the dataset.
This version of GHPR contains 14,384 instances. The data was collected in October 2020 from CNCF graduated projects, specifically, the following repositories:
| Repository | # instances |
|---|---|
| containerd/containerd | 351 |
| coredns/coredns | 227 |
| envoyproxy/envoy | 1,229 |
| fluent/fluentd | 161 |
| goharbor/harbor | 598 |
| helm/helm | 859 |
| jaegertracing/jaeger | 455 |
| kubernetes/kubernetes | 8,323 |
| prometheus/prometheus | 543 |
| rook/rook | 946 |
| theupdateframework/specification | 13 |
| tikv/tikv | 469 |
| vitessio/vitess | 210 |
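As a quick consistency check, the per-repository counts in the table can be summed and compared with the stated total of 14,384 instances. The snippet below only restates the table's numbers; the variable names are illustrative.

```python
# Instance counts copied from the table above.
INSTANCES_PER_REPO = {
    "containerd/containerd": 351,
    "coredns/coredns": 227,
    "envoyproxy/envoy": 1229,
    "fluent/fluentd": 161,
    "goharbor/harbor": 598,
    "helm/helm": 859,
    "jaegertracing/jaeger": 455,
    "kubernetes/kubernetes": 8323,
    "prometheus/prometheus": 543,
    "rook/rook": 946,
    "theupdateframework/specification": 13,
    "tikv/tikv": 469,
    "vitessio/vitess": 210,
}

total = sum(INSTANCES_PER_REPO.values())

# Repository contributing the most instances, and its share of the total.
largest_repo = max(INSTANCES_PER_REPO, key=INSTANCES_PER_REPO.get)
largest_share = INSTANCES_PER_REPO[largest_repo] / total

print(f"total instances: {total}")
print(f"largest: {largest_repo} ({largest_share:.1%} of all instances)")
```

The dataset is heavily skewed toward kubernetes/kubernetes, which accounts for more than half of all instances, so per-repository stratification may be worth considering when sampling.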