Overview | Quickstart | Installation | Datasets | Examples | How You Can Help
BioGrakn Covid is an open source project to build a knowledge graph to enable research in COVID-19 and related disease areas.
We're excited to release an open source knowledge graph to speed up the research into Covid-19. Our goal is to provide a way for researchers to easily analyse and query large amounts of data and papers related to the virus.
BioGrakn Covid makes it easy to quickly trace information sources and identify articles and the information therein. This first release includes entities extracted from Covid-19 papers and from additional datasets, including proteins, genes, disease-gene associations, coronavirus proteins, protein expression, biological pathways, and drugs.
For example, by querying for the virus SARS-CoV-2, we can find the associated human protein, proteasome subunit alpha type-2 (PSMA2), a component of the proteasome, implicated in SARS-CoV-2 replication, and its encoding gene (PSMA2). Additionally, we can identify the drug carfilzomib, a known inhibitor of the proteasome that could therefore be researched as a potential treatment for patients with Covid-19. To support the plausibility of this association and its implications, we can easily identify papers in the Covid-19 literature where this protein has been mentioned.
By examining these specific relationships and their attributes, we are directed to the data sources, including publications. This will help researchers to efficiently study the mechanisms of coronaviral infection and the immune response, and to find targets for the development of treatments or vaccines.
Our team is currently a partnership between GSK, Oxford PharmaGenesis and Grakn Labs.
The schema that models the underlying knowledge graph, together with the descriptive query language Graql, makes writing complex queries straightforward and intuitive. Furthermore, Grakn's automated reasoning allows BioGrakn to become an intelligent database of biomedical data for the Covid research field that infers implicit knowledge based on the explicitly stored data. BioGrakn Covid can understand biological facts, infer based on new findings and enforce research constraints, all at query (run) time.
BioGrakn Covid is free to access via an Azure VM. You can query it using Workbase:
- Download and run Workbase (download)
- Make sure Grakn isn’t running on your local machine
- On the main Workbase screen, change the host to the IP address shown on this page (link) with port 48555
- Click connect, select the keyspace biograkn_covid and start exploring the data!
You can also connect programmatically using one of the Grakn clients (link). Use the IP address, port and keyspace as specified above.
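As a rough illustration of the programmatic route, the sketch below uses the Grakn Python client in the style of its 1.8 release (`GraknClient`, `session`, `transaction().read()`). The helper names `grakn_uri` and `fetch_virus_names` are hypothetical, the `virus` and `virus-name` types are taken from the example query in this README, and the code has not been checked against the hosted instance, so treat it as a starting point rather than a reference implementation.

```python
"""Sketch: read a few virus names from the hosted BioGrakn Covid keyspace."""

GRAKN_PORT = 48555            # port given in the Workbase steps above
KEYSPACE = "biograkn_covid"   # keyspace given in the Workbase steps above


def grakn_uri(host, port=GRAKN_PORT):
    """Build the 'host:port' address string passed to the client."""
    return "{}:{}".format(host.strip(), port)


def fetch_virus_names(host, limit=5):
    """Return up to `limit` virus-name values (hypothetical helper)."""
    # Imported lazily so grakn_uri() can be used without the client installed.
    # Assumes the Grakn Python client for Grakn Core 1.8 is installed.
    from grakn.client import GraknClient

    query = "match $v isa virus, has virus-name $n; get $n; limit {};".format(int(limit))
    with GraknClient(uri=grakn_uri(host)) as client:
        with client.session(keyspace=KEYSPACE) as session:
            with session.transaction().read() as tx:
                # Each answer maps variable names to concepts.
                return [answer.map().get("n").value() for answer in tx.query(query)]
```

Call `fetch_virus_names(host)` with the IP address from the page linked above; the read transaction is closed automatically when the `with` block exits.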
Prerequisites: Python >3.6, Grakn Core 1.8.0, Grakn Python Client API, Grakn Workbase 1.3.4.
cd <path/to/biograkn-covid>/
python migrator.py
Before running the migrator, make sure to download all source datasets and put them in the Datasets
folder. You can find the links below. Then, grab a coffee while the migrator builds the database and schema for you!
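Because the migrator can take a while, it may be worth checking that the downloads are in place before starting it. The sketch below is one way to do that; `missing_datasets` is a hypothetical helper, and the file names you pass in are whatever you downloaded, since this README does not list the names the migrator expects.

```python
from pathlib import Path


def missing_datasets(folder, expected_files):
    """Return the names from `expected_files` that are not present in `folder`.

    `expected_files` is supplied by the caller; the names are assumptions,
    not a list taken from the migrator itself.
    """
    root = Path(folder)
    return [name for name in expected_files if not (root / name).is_file()]
```

For example, `missing_datasets("Datasets", ["my-download.tsv"])` returns an empty list once that (placeholder) file has been copied into the folder.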
Graql queries can be run in the Grakn console, in Workbase or through the client APIs. However, we encourage running the queries in Grakn Workbase for the best visual experience. Please follow this tutorial on how to run queries on Workbase.
# Return drugs associated with genes that are mentioned in the same
# paper as a gene associated with SARS.
match
$v isa virus, has virus-name "SARS";
$g isa gene;
$1 ($g, $v) isa gene-virus-association;
$2 ($g, $pu) isa mention;
$3 ($pu, $g2) isa mention;
$g2 isa gene;
$g2 != $g;
$4 ($g2, $dr); $dr isa drug;
get; offset 0; limit 10;
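To reuse this query for another virus name or to page through results, the query text can be generated from a small template. The sketch below only builds the string; `drugs_near_virus_query` is a hypothetical helper name, and the resulting text can be pasted into Workbase or passed to a client transaction.

```python
def drugs_near_virus_query(virus_name, limit=10, offset=0):
    """Return the Graql query above for a given virus name and result page."""
    # Reject characters that would break out of the quoted Graql string.
    if '"' in virus_name or "\\" in virus_name:
        raise ValueError("virus_name must not contain quotes or backslashes")
    lines = [
        "match",
        '$v isa virus, has virus-name "{}";'.format(virus_name),
        "$g isa gene;",
        "$1 ($g, $v) isa gene-virus-association;",
        "$2 ($g, $pu) isa mention;",
        "$3 ($pu, $g2) isa mention;",
        "$g2 isa gene;",
        "$g2 != $g;",
        "$4 ($g2, $dr); $dr isa drug;",
        "get; offset {}; limit {};".format(int(offset), int(limit)),
    ]
    return "\n".join(lines)
```

Calling `drugs_near_virus_query("SARS")` reproduces the query shown above; passing `offset=10` requests the next ten answers.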
Currently the datasets we've integrated include:
- CORD-19: We incorporate the original corpus which includes peer-reviewed publications from bioRxiv, medRxiv and others.
- CORD-NER: The CORD-19 dataset that the White House released has been annotated and made publicly available. It uses various NER methods to recognise named entities on CORD-19 with distant or weak supervision.
- Uniprot: We’ve downloaded the reviewed human subset, and ingested genes, transcripts and protein identifiers.
- Coronaviruses: This is an annotated dataset of coronaviruses and their potential drug targets put together by Oxford PharmaGenesis based on literature review.
- DGIdb: We’ve taken the Interactions TSV which includes all drug-gene interactions.
- Human Protein Atlas: The Normal Tissue Data includes the expression profiles for proteins in human tissues.
- Reactome: This dataset connects pathways and their participating proteins.
- DisGeNet: We’ve taken the curated gene-disease-associations dataset, which contains associations from Uniprot, CGI, ClinGen, Genomics England, CTD, PsyGeNET, and Orphanet.
We plan to add many more datasets!
This is an ongoing project and we need your help! If you want to contribute, here are some ways to help:
- Migrate more data sources (e.g. clinical trials, DrugBank, Excelra)
- Extend the schema by adding relevant rules
- Create a website
- Write tutorials and articles for researchers to get started
If you wish to get in touch, please talk to us on the #biograkn channel on our Discord (link here).