Skip to content

Commit

Permalink
Add notebook to compare INSPIRE and HEPData IDs
Browse files Browse the repository at this point in the history
* Closes #2.
  • Loading branch information
GraemeWatt committed Apr 24, 2023
1 parent 7ef1fa9 commit 5f6f7f2
Show file tree
Hide file tree
Showing 2 changed files with 339 additions and 1 deletion.
8 changes: 7 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,4 +41,10 @@ an [INSPIRE](https://inspirehep.net) record, and release of the first version of
[HEPData](https://www.hepdata.net) record. A histogram is plotted of these delays for a given
time period and experimental collaboration.

Written by Graeme Watt on 25th November 2022.
Written by Graeme Watt on 25th November 2022.

### [compare_inspire_and_hepdata.ipynb](notebooks/compare_inspire_and_hepdata.ipyb)

This Jupyter notebook investigates the consistency between the INSPIRE record numbers obtained from
either [INSPIRE](https://inspirehep.net) or [HEPData](https://www.hepdata.net). Discrepancies usually
occur because an INSPIRE record has changed its record number.
332 changes: 332 additions & 0 deletions notebooks/compare_inspire_and_hepdata.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,332 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "9246d05b",
"metadata": {},
"source": [
"# Compare INSPIRE record numbers obtained from INSPIRE and HEPData"
]
},
{
"cell_type": "markdown",
"id": "a4188fa9",
"metadata": {},
"source": [
"This Jupyter notebook investigates the consistency between the INSPIRE record numbers obtained from either [INSPIRE](https://inspirehep.net/) or [HEPData](https://www.hepdata.net/).\n",
"\n",
"Discrepancies usually occur because an INSPIRE record has changed its record number."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "0fe87f59",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from time import sleep\n",
"import math"
]
},
{
"cell_type": "markdown",
"id": "a022d3e6",
"metadata": {},
"source": [
"## Get records from INSPIRE"
]
},
{
"cell_type": "markdown",
"id": "173511f7",
"metadata": {},
"source": [
" Get a list of all INSPIRE record numbers that have a HEPData record attached using the [INSPIRE API](https://github.com/inspirehep/rest-api-doc)."
]
},
{
"cell_type": "markdown",
"id": "7082011c",
"metadata": {},
"source": [
"First define query and get number of INSPIRE records."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "e73afd3e",
"metadata": {},
"outputs": [],
"source": [
"query = 'external_system_identifiers.schema:HEPData'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a4e5a0a1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"num_results = 9924\n"
]
}
],
"source": [
"url = 'https://inspirehep.net/api/literature'\n",
"payload = {'q': query, 'size': 1, 'fields': 'control_number'}\n",
"response = requests.get(url, params=payload)\n",
"num_results = response.json()['hits']['total']\n",
"print('num_results = {}'.format(num_results))"
]
},
{
"cell_type": "markdown",
"id": "688eb8b5",
"metadata": {},
"source": [
"Now get IDs of INSPIRE records in chunks."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "cb8bf99b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"pages = 10\n",
"page = 1\n",
"page = 2\n",
"page = 3\n",
"page = 4\n",
"page = 5\n",
"page = 6\n",
"page = 7\n",
"page = 8\n",
"page = 9\n",
"page = 10\n",
"num_results = 9924\n"
]
}
],
"source": [
"inspire_ids = []\n",
"size = 1000\n",
"pages = math.ceil(num_results/float(size))\n",
"print('pages = {}'.format(pages))\n",
"page = 1\n",
"while page <= pages:\n",
" print('page = {}'.format(page))\n",
" payload = {'q': query, 'size': size, 'fields': 'control_number', 'page': page, 'sort': 'mostrecent'}\n",
" response = requests.get(url, params=payload)\n",
" data = response.json()['hits']['hits']\n",
" for hit in data:\n",
" inspire_ids.append(int(hit['metadata']['control_number']))\n",
" page += 1\n",
" sleep(1)\n",
"inspire_ids.sort()\n",
"print('num_results = {}'.format(len(inspire_ids)))"
]
},
{
"cell_type": "markdown",
"id": "2a77936a",
"metadata": {},
"source": [
"## Get records from HEPData"
]
},
{
"cell_type": "markdown",
"id": "730fdfb5",
"metadata": {},
"source": [
"Get a list of the INSPIRE record numbers of all HEPData records using the PostgreSQL database."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "6d5ff69b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"9924\n"
]
}
],
"source": [
"url = 'https://hepdata.net/search/ids'\n",
"payload = {'inspire_ids': 'true'}\n",
"response = requests.get(url, params=payload)\n",
"hepdata_ids = response.json()\n",
"hepdata_ids.sort()\n",
"print(len(hepdata_ids))"
]
},
{
"cell_type": "markdown",
"id": "39c99547",
"metadata": {},
"source": [
"Get a list of the INSPIRE record numbers of all HEPData records using the OpenSearch index."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "7e24b553",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"9924\n"
]
}
],
"source": [
"payload = {'inspire_ids': 'true', 'use_es': 'true'}\n",
"response = requests.get(url, params=payload)\n",
"hepdata_ids_es = response.json()\n",
"hepdata_ids_es.sort()\n",
"print(len(hepdata_ids_es))"
]
},
{
"cell_type": "markdown",
"id": "c5cc724b",
"metadata": {},
"source": [
"Check equality of results obtained using the two methods."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "6ba9c4b5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hepdata_ids == hepdata_ids_es"
]
},
{
"cell_type": "markdown",
"id": "01ac3aa7",
"metadata": {},
"source": [
"## Compare lists of record numbers and identify differences"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "77d88aa2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inspire_ids == hepdata_ids"
]
},
{
"cell_type": "markdown",
"id": "246f9d7a",
"metadata": {},
"source": [
"Print out IDs for HEPData records not in INSPIRE."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "a93f2acf",
"metadata": {},
"outputs": [],
"source": [
"for hepdata_id in hepdata_ids:\n",
" if hepdata_id not in inspire_ids:\n",
" print('The HEPData record https://hepdata.net/record/ins{}'.format(hepdata_id))\n",
" print('is not linked from https://inspirehep.net/record/{}'.format(hepdata_id))\n",
" print()"
]
},
{
"cell_type": "markdown",
"id": "8f91ae58",
"metadata": {},
"source": [
"Print out IDs for INSPIRE records not in HEPData."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "2fd1def5",
"metadata": {},
"outputs": [],
"source": [
"for inspire_id in inspire_ids:\n",
" if inspire_id not in hepdata_ids:\n",
" print('https://inspirehep.net/record/{}'.format(inspire_id))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

0 comments on commit 5f6f7f2

Please sign in to comment.