Commands to load record-level access Health-related outcomes tables into an SQLite database. Data must be requested from UK Biobank and downloaded as database tables from the Data Portal.
The health outcomes data tables in UKB can each be upwards of 4GB in size and therefore are memory-intensive to work with. By storing the data tables in a database, it is possible to query them without loading all of the data into memory.
-
Nov 23 2021: Update parsing of dates in the
hesin
table which used to be stored codedDDMMYYYY
but are now coded asDD/MM/YYYY
. -
Aug 11 2020: Added uniqueness checks during data import and transactional commits to main relational tables to handle the database creation being interrupted and restarted.
-
Aug 10 2020: Database schema has been normalised which should make many queries a lot faster, particularly those that involve searching text fields. It also decreases the size of the database by about 7GB.
hesin
: Master table of administrative recordshesin_diag
: Diagnosis codeshesin_oper
: Operations and procedural codeshesin_psych
: Administrative records relating to psychiatryhesin_maternity
: Maternity records of carehesin_delivery
: Children born as the result of a maternity record.
gp_clinical
: GP clinical event recordsgp_scripts
: GP prescription recordsgp_registrations
: GP registration records
- Install SQLite version > 3.9.0
- Clone the repository
git clone [email protected]:ccbs-stradl/ukb_healthoutcomes_db.git cd ukb_healthoutcomes_db
Assuming you have requested the relevant fields for the record-level health outcomes data as part of an approved UKB application, the full database tables can be downloaded (in tab-separated text format) from the UKB Data Showcase
- Log in to the AMS Portal
- Select the Projects tab
- Click the "View/Update" button for the Application you are downloading data for
- Select the Data tab.
- Click the "Go to Showcase to refresh or download data" button.
- Select the Data Portal tab.
- Click the "Connect" button.
- Select the Table Download tab.
- Enter the name of a table (listed above) to download and click the "Fetch Table" button.
- Use the listed
wget
command with the specified unique URL key to download the table, or click the download link (wget
is preferred as it names the file correctly. The download link may open the table directly in a browser window, in which case you have to use Save as... to save it).
Example wget
commands:
wget -nd -Ohesin.txt https://biota.ndph.ox.ac.uk/...
wget -nd -Ohesin_diag.txt https://biota.ndph.ox.ac.uk/...
wget -nd -Ohesin_oper.txt https://biota.ndph.ox.ac.uk/...
wget -nd -Ohesin_psych.txt https://biota.ndph.ox.ac.uk/...
wget -nd -Ohesin_maternity.txt https://biota.ndph.ox.ac.uk/...
wget -nd -Ohesin_delivery.txt https://biota.ndph.ox.ac.uk/...
wget -nd -Ogp_clinical.txt https://biota.ndph.ox.ac.uk/...
wget -nd -Ogp_scripts.txt https://biota.ndph.ox.ac.uk/...
wget -nd -Ogp_registrations.txt https://biota.ndph.ox.ac.uk/...
Move the downloaded text files into the repository directory. They are expected to have the names given to them in the wget
command (of the form table_name.txt
). If the tables have different names, are in different locations, or are not available, modify the import.sql
file as appropriate.
Create the database with
sh db.sh
The database creation script reads the raw data into the database and then normalises it into a more compact set of tables to speed up querying.
The database is called healthoutcomes.db
and can be opened using SQLite:
sqlite3 healthoutcomes.db
The database has end-user views of the data that conform to the table names and columns of the original data. Dates in the tables are standardised to the format YYYY-MM-DD
.
The total size of the SQL database file is approximately 19GB.
The data can be manipulated in R without loading the entire dataset into memory. There are several R libraries for working with databases such as RSQLite and dplyr.
# install required packages
install.packages(c('dplyr', 'RSQLite', 'dbplyr'))
library(dplyr)
# make connection to database
con <- DBI::dbConnect(RSQLite::SQLite(), 'healthoutcomes.db')
# load hesin table
hesin <- tbl(con, 'hesin')
The hesin
table can then be worked on using dplyr commands like any other tibble
. Use select()
, filter()
, and summarize()
commands to identify the subset of the data, or transform it by passing expressions with SQL functions to mutate()
. Once your query is finalized together, use collect()
to import the query into the R workspace for further manipulation or modeling.
The health outcomes data tables have date information formatted as either YYYYMMDD
or DD-MM-YYYY
so these have been normalized to YYYY-MM-DD
so that they can be passed to SQLite's date functions.
gp_registrations <- tbl(con, 'gp_registrations')
gp_clinical <- tbl(con, 'gp_clinical')
# Return all registrations from October 2016
gp_registrations %>% filter(reg_date >= '2016-10-1' & reg_date <= '2016-10-31') %>% arrange(reg_date)
# count how many clinical records are available for each year
gp_clinical %>% group_by(date(event_dt, 'start of year')) %>% tally()
Prescriptions are referred to in gp_scripts
table are variously coded with Read v2, BNF, or DMD codes depending on the data source. Many entries also have a plain text name of the drug in the drug_name
column (along with dosage size information). The drug name can be searched using the %LIKE%
operator and with the name of the drug surrounded by "%%"
.
gp_scripts <- tbl(con, 'gp_scripts')
gp_scripts %>% filter(drug_name %LIKE% "%Amitriptyline%")
Many columns are stored as coded integers rather than strings. The HES Data Dictionary lists the structure of each table and the data coding for each column. Data codings can be searched for on the UKB Showcase and inspected or downloaded as a text file. For example, the source
column of the hesin
table has Data-Coding 263.
For each table TABLE
there is an underlying data representation called TABLE_data
with foreign key links between them using the eids
table. Most TEXT
columns (excepting date fields) are normalized to separate tables called TABLE_FIELD
linked with a foreign key FIELD_id
.