-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add notebook to compare INSPIRE and HEPData IDs
* Closes #2.
- Loading branch information
1 parent
7ef1fa9
commit 5f6f7f2
Showing
2 changed files
with
339 additions
and
1 deletion.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,332 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "9246d05b", | ||
"metadata": {}, | ||
"source": [ | ||
"# Compare INSPIRE record numbers obtained from INSPIRE and HEPData" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "a4188fa9", | ||
"metadata": {}, | ||
"source": [ | ||
"This Jupyter notebook investigates the consistency between the INSPIRE record numbers obtained from either [INSPIRE](https://inspirehep.net/) or [HEPData](https://www.hepdata.net/).\n", | ||
"\n", | ||
"Discrepancies usually occur because an INSPIRE record has changed its record number." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"id": "0fe87f59", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import requests\n", | ||
"from time import sleep\n", | ||
"import math" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "a022d3e6", | ||
"metadata": {}, | ||
"source": [ | ||
"## Get records from INSPIRE" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "173511f7", | ||
"metadata": {}, | ||
"source": [ | ||
" Get a list of all INSPIRE record numbers that have a HEPData record attached using the [INSPIRE API](https://github.com/inspirehep/rest-api-doc)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "7082011c", | ||
"metadata": {}, | ||
"source": [ | ||
"First define query and get number of INSPIRE records." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"id": "e73afd3e", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"query = 'external_system_identifiers.schema:HEPData'" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"id": "a4e5a0a1", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"num_results = 9924\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"url = 'https://inspirehep.net/api/literature'\n", | ||
"payload = {'q': query, 'size': 1, 'fields': 'control_number'}\n", | ||
"response = requests.get(url, params=payload)\n", | ||
"num_results = response.json()['hits']['total']\n", | ||
"print('num_results = {}'.format(num_results))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "688eb8b5", | ||
"metadata": {}, | ||
"source": [ | ||
"Now get IDs of INSPIRE records in chunks." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"id": "cb8bf99b", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"pages = 10\n", | ||
"page = 1\n", | ||
"page = 2\n", | ||
"page = 3\n", | ||
"page = 4\n", | ||
"page = 5\n", | ||
"page = 6\n", | ||
"page = 7\n", | ||
"page = 8\n", | ||
"page = 9\n", | ||
"page = 10\n", | ||
"num_results = 9924\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"inspire_ids = []\n", | ||
"size = 1000\n", | ||
"pages = math.ceil(num_results/float(size))\n", | ||
"print('pages = {}'.format(pages))\n", | ||
"page = 1\n", | ||
"while page <= pages:\n", | ||
" print('page = {}'.format(page))\n", | ||
" payload = {'q': query, 'size': size, 'fields': 'control_number', 'page': page, 'sort': 'mostrecent'}\n", | ||
" response = requests.get(url, params=payload)\n", | ||
" data = response.json()['hits']['hits']\n", | ||
" for hit in data:\n", | ||
" inspire_ids.append(int(hit['metadata']['control_number']))\n", | ||
" page += 1\n", | ||
" sleep(1)\n", | ||
"inspire_ids.sort()\n", | ||
"print('num_results = {}'.format(len(inspire_ids)))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "2a77936a", | ||
"metadata": {}, | ||
"source": [ | ||
"## Get records from HEPData" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "730fdfb5", | ||
"metadata": {}, | ||
"source": [ | ||
"Get a list of the INSPIRE record numbers of all HEPData records using the PostgreSQL database." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"id": "6d5ff69b", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"9924\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"url = 'https://hepdata.net/search/ids'\n", | ||
"payload = {'inspire_ids': 'true'}\n", | ||
"response = requests.get(url, params=payload)\n", | ||
"hepdata_ids = response.json()\n", | ||
"hepdata_ids.sort()\n", | ||
"print(len(hepdata_ids))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "39c99547", | ||
"metadata": {}, | ||
"source": [ | ||
"Get a list of the INSPIRE record numbers of all HEPData records using the OpenSearch index." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"id": "7e24b553", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"9924\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"payload = {'inspire_ids': 'true', 'use_es': 'true'}\n", | ||
"response = requests.get(url, params=payload)\n", | ||
"hepdata_ids_es = response.json()\n", | ||
"hepdata_ids_es.sort()\n", | ||
"print(len(hepdata_ids_es))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "c5cc724b", | ||
"metadata": {}, | ||
"source": [ | ||
"Check equality of results obtained using the two methods." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"id": "6ba9c4b5", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"True" | ||
] | ||
}, | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"hepdata_ids == hepdata_ids_es" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "01ac3aa7", | ||
"metadata": {}, | ||
"source": [ | ||
"## Compare lists of record numbers and identify differences" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"id": "77d88aa2", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"True" | ||
] | ||
}, | ||
"execution_count": 8, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"inspire_ids == hepdata_ids" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "246f9d7a", | ||
"metadata": {}, | ||
"source": [ | ||
"Print out IDs for HEPData records not in INSPIRE." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"id": "a93f2acf", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"for hepdata_id in hepdata_ids:\n", | ||
" if hepdata_id not in inspire_ids:\n", | ||
" print('The HEPData record https://hepdata.net/record/ins{}'.format(hepdata_id))\n", | ||
" print('is not linked from https://inspirehep.net/record/{}'.format(hepdata_id))\n", | ||
" print()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "8f91ae58", | ||
"metadata": {}, | ||
"source": [ | ||
"Print out IDs for INSPIRE records not in HEPData." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 10, | ||
"id": "2fd1def5", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"for inspire_id in inspire_ids:\n", | ||
" if inspire_id not in hepdata_ids:\n", | ||
" print('https://inspirehep.net/record/{}'.format(inspire_id))" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.9.15" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |