RareArena

A comprehensive rare disease diagnostic dataset with nearly 50,000 patients covering more than 4,000 diseases.

Data Collection

We build our work upon PMC-Patients, a large-scale patient summary dataset sourced from PMC case reports, and we use GPT-4o for all data processing.

Specifically, we first filter PMC-Patients for cases that focus on a rare disease diagnosis and extract their ground-truth diagnoses. We then map each diagnosis to the Orphanet database using CODER term embeddings and discard the cases whose diagnosis fails to map. Next, we truncate and rephrase the cases to avoid diagnosis leakage. We consider two task settings:

Rare Disease Screening (RDS), where the cases are truncated up to, but not including, any diagnostic tests, such as whole-genome sequencing for genetic diseases and pathogen detection for rare infections.
Rare Disease Confirmation (RDC), where the cases are truncated up to, but not including, the final diagnosis.

Finally, we remove any cases with potential diagnosis leakage.
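
The Orphanet mapping step can be pictured as a nearest-neighbor search over term embeddings. The sketch below is illustrative only and is not the pipeline's code: the toy 3-dimensional vectors stand in for CODER embeddings, and the function `map_to_orphanet`, the example terms, and the 0.8 similarity threshold are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def map_to_orphanet(query_vec, orphanet_vecs, threshold=0.8):
    """Return (best term, similarity), or (None, similarity) if nothing passes the threshold."""
    best_term, best_sim = None, -1.0
    for term, vec in orphanet_vecs.items():
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_term, best_sim = term, sim
    if best_sim < threshold:
        return None, best_sim  # the case would be filtered out as unmapped
    return best_term, best_sim

# Toy vectors standing in for CODER embeddings of Orphanet terms (hypothetical).
ORPHANET = {
    "Fabry disease": [0.9, 0.1, 0.0],
    "Gaucher disease": [0.1, 0.9, 0.1],
}

mapped, sim = map_to_orphanet([0.85, 0.15, 0.05], ORPHANET)
unmapped, _ = map_to_orphanet([0.0, 0.1, 0.9], ORPHANET)
print(mapped, unmapped)
```

A case whose extracted diagnosis comes back as `None` corresponds to a diagnosis that fails to map and is dropped.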

Evaluation

To evaluate a model on RareArena, take the following three steps:

Generate the top 5 diagnoses using the model. We provide an OpenAI-style script and the naive prompt used in our paper in eval/run.py.
Evaluate the top 5 diagnoses using GPT-4o, since synonyms and hypernyms make it nontrivial to determine whether the true diagnosis was retrieved. The script and prompt for GPT-4o are given in eval/eval.py.
Parse the evaluation output and calculate top-1 and top-5 recall using eval/metric.py.
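
The final step can be sketched as follows. This is not the code in eval/metric.py: the input format (one list of five integer scores per case, ordered by rank) and the rule that a case's top-5 score is the best score among its five candidates are assumptions made for illustration.

```python
from collections import Counter

def score_distribution(case_scores, k):
    """Percentage of cases whose best score within the top k is 0, 1, or 2."""
    if not case_scores:
        raise ValueError("no cases to evaluate")
    best = [max(scores[:k]) for scores in case_scores]
    counts = Counter(best)
    n = len(best)
    dist = {s: 100.0 * counts.get(s, 0) / n for s in (0, 1, 2)}
    dist["total"] = dist[1] + dist[2]  # total recall = hypernym + synonym matches
    return dist

# Hypothetical parsed scores: 0 = missing, 1 = hypernym, 2 = synonym.
cases = [
    [2, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 2],
    [0, 0, 0, 0, 0],
]

top1 = score_distribution(cases, k=1)
top5 = score_distribution(cases, k=5)
print(top1["total"], top5["total"])
```

Under these assumptions, top-1 recall only looks at the first-ranked diagnosis, while top-5 recall credits a case if any of the five candidates matches.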

Model Performances

Rare Disease Screening Task

Model	Top 1 (%): Score = 0 (missing)	Top 1 (%): Score = 1 (hypernyms)	Top 1 (%): Score = 2 (synonyms)	Top 1 (%): Total*	Top 5 (%): Score = 0 (missing)	Top 5 (%): Score = 1 (hypernyms)	Top 5 (%): Score = 2 (synonyms)	Top 5 (%): Total
GPT-4o	66.95	9.93	23.13	33.05	43.14	20.26	36.61	56.86
Llama3.1-70B	74.44	8.29	17.27	25.56	52.00	17.56	30.45	48.00
Qwen2.5-72B	75.44	10.14	14.42	24.56	49.87	23.79	26.34	50.13
Gemma2-9B	82.09	9.75	8.16	17.91	56.01	22.90	21.09	43.99
Phi3-7B	84.31	6.11	9.58	15.69	57.61	21.15	21.24	42.39
Llama3.1-8B	86.03	6.21	7.76	13.97	58.09	19.07	22.84	41.91
Qwen2.5-7B	86.80	7.46	5.74	13.20	55.80	29.20	15.00	44.20

* Total recall is defined as the sum of score 2 and score 1 matches.
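
As a worked check of this footnote, take the GPT-4o top-1 values from the screening table: the sum of the score 1 and score 2 columns should match the Total column, and should equal 100 minus the score 0 column, up to rounding of the reported percentages. The variable names below are illustrative.

```python
# GPT-4o, top-1 columns from the screening table (percentages).
missing, hypernym, synonym, reported_total = 66.95, 9.93, 23.13, 33.05

total_from_matches = hypernym + synonym   # score 1 + score 2
total_from_missing = 100.0 - missing      # complement of score 0

# The reported percentages are rounded to two decimals, so allow a small tolerance.
assert abs(total_from_matches - reported_total) < 0.02
assert abs(total_from_missing - reported_total) < 0.02
print(round(total_from_matches, 2), round(total_from_missing, 2))
```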

Rare Disease Confirmation Task

Model	Top 1 (%): Score = 0 (missing)	Top 1 (%): Score = 1 (hypernyms)	Top 1 (%): Score = 2 (synonyms)	Top 1 (%): Total	Top 5 (%): Score = 0 (missing)	Top 5 (%): Score = 1 (hypernyms)	Top 5 (%): Score = 2 (synonyms)	Top 5 (%): Total
GPT-4o	35.76	14.51	49.72	64.24	14.08	20.23	65.69	85.92
Llama3.1-70B	43.94	14.41	41.66	56.06	18.43	21.12	60.45	81.57
Qwen2.5-72B	49.46	15.46	35.09	50.54	22.98	25.93	51.09	77.02
Gemma2-9B	60.22	16.09	23.69	39.78	29.44	29.70	40.86	70.56
Phi3-7B	68.82	9.15	22.03	31.18	37.68	23.48	38.84	62.32
Llama3.1-8B	64.14	11.17	24.69	35.86	31.13	23.84	45.03	68.87
Qwen2.5-7B	71.78	12.68	15.54	28.22	35.08	34.08	30.85	64.92

License

RareArena is released under the CC BY-NC-SA 4.0 license.

Acknowledgements

We would like to acknowledge that the RareArena dataset was created and provided by Tsinghua Medicine, Peking Union Medical College, and the Department of Statistics and Data Science at Tsinghua University.

Citation

Our paper is currently under review at Lancet Digital Health.
