Variation Normalizer Manuscript

This repo contains the analysis notebooks used in The Clinical Genomic Variation Landscape manuscript.

Small output files can be found in this repo. Larger files can be found in our public s3 bucket: s3://nch-igm-wagner-lab-public/variation-normalizer-manuscript/. There are notebooks that provide functions for programmatically downloading files from the s3 bucket.
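As a rough illustration of what a programmatic download could look like, here is a minimal sketch. It assumes the bucket allows anonymous reads and that boto3 is installed. The helper names `split_s3_uri` and `download_public_file` are hypothetical and are not the functions defined in the notebooks.

```python
from pathlib import Path


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/key' into (bucket, key)."""
    if not uri.startswith("s3://"):
        raise ValueError(f"Not an s3 URI: {uri}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key


def download_public_file(uri: str, dest_dir: str = "analysis/data") -> Path:
    """Download one object from a public bucket without AWS credentials."""
    # Imported lazily so split_s3_uri works even if boto3 is not installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    bucket, key = split_s3_uri(uri)
    dest = Path(dest_dir) / Path(key).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Unsigned requests: this assumes the bucket permits anonymous reads.
    client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    client.download_file(bucket, key, str(dest))
    return dest
```

To use it, pass the full s3 URI of a single object under the prefix above. The notebooks in this repo remain the reference for which files to download.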

After running the notebooks, you will be able to create figures that demonstrate the results of the analysis, such as the figure below.

Variant normalization allows patient data from AACR Project GENIE to be matched to normalized variants in the CIViC, MOAlmanac, and ClinVar knowledgebases.

Patient Matching with GENIE

Set Up

Before running the notebooks, you must set up your environment.

Install Python 3.11

Python 3.11 was used for this analysis. We recommend using Pyenv to install.

Creating the virtual environment

First, create your virtual environment. requirements.txt is a lockfile containing the exact versions used. requirements-dev.txt contains the main third-party packages.

From the root directory, run the following to create the venv and install exact packages:

make devready
source .venv/bin/activate

Environment Variables

We use python-dotenv to load environment variables needed for analysis notebooks that run the Variation Normalizer.

If you are running any of the following notebooks, this section is required:

In the analysis notebooks, you will see:

from dotenv import load_dotenv

load_dotenv()

This will load environment variables from the .env file in the root directory. You will need to create this file yourself. The structure will look like:

.
├── analysis
├── .env
└── README.md

The environment variables that will need to be set inside the .env file:

GENE_NORM_DB_URL=http://localhost:8000
UTA_DB_URL=driver://user:password@host:port/database/schema  # replace with actual values
AWS_ACCESS_KEY_ID=dummy
AWS_SECRET_ACCESS_KEY=dummy
AWS_SESSION_TOKEN=dummy
TRANSCRIPT_MAPPINGS_PATH=variation-normalizer-manuscript/analysis/data/transcript_mapping.tsv  # Should be absolute path. For cool-seq-tool
MANE_SUMMARY_PATH=variation-normalizer-manuscript/analysis/data/MANE.GRCh38.v1.3.summary.txt  # Should be absolute path. For cool-seq-tool
LRG_REFSEQGENE_PATH=variation-normalizer-manuscript/analysis/data/LRG_RefSeqGene_20231114  # Should be absolute path. For cool-seq-tool
SEQREPO_ROOT_DIR=/usr/local/share/seqrepo/latest  # replace if using different path

Running analysis/download_s3_files.ipynb downloads transcript_mapping.tsv, MANE.GRCh38.v1.3.summary.txt, and LRG_RefSeqGene_20231114 to the ./analysis/data directory. You must then update the associated environment variables (TRANSCRIPT_MAPPINGS_PATH, MANE_SUMMARY_PATH, LRG_REFSEQGENE_PATH) to use absolute paths.
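Because relative paths are an easy mistake here, a quick check after `load_dotenv()` can help. The following is a minimal sketch using only the standard library. The function name `check_path_vars` is hypothetical and not part of this repo or of cool-seq-tool.

```python
import os
from collections.abc import Mapping

PATH_VARS = ("TRANSCRIPT_MAPPINGS_PATH", "MANE_SUMMARY_PATH", "LRG_REFSEQGENE_PATH")


def check_path_vars(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return {variable: problem}; an empty dict means all three look fine."""
    problems = {}
    for name in PATH_VARS:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not os.path.isabs(value):
            problems[name] = "not an absolute path"
        elif not os.path.exists(value):
            problems[name] = "path does not exist"
    return problems
```

Calling `check_path_vars()` with no arguments inspects the current process environment, so it should be run after the .env file has been loaded.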

Set Up Backend Services

This analysis relies on several backend services, which you must set up yourself.

Biocommons SeqRepo

Biocommons SeqRepo is used for fast access to sequence data. This analysis used 2021-01-29 SeqRepo data.

Follow the Quick Start Documentation for setting up SeqRepo. The VICC Gene Normalizer also provides some additional setup help here.

Update the SEQREPO_ROOT_DIR in the .env file with the path to SeqRepo. The default path is /usr/local/share/seqrepo/latest.

SeqRepo Verification

To verify, run the following inside your virtual environment:

$ python3
Python 3.11.5 (main, Aug 24 2023, 15:18:16) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from dotenv import load_dotenv
>>> load_dotenv()
True
>>> from os import environ
>>> from cool_seq_tool.data_sources import SeqRepoAccess
.venv/lib/python3.11/site-packages/python_jsonschema_objects/__init__.py:46: UserWarning: Schema version http://json-schema.org/draft-07/schema not recognized. Some keywords and features may not be supported.
  warnings.warn(
>>> from biocommons.seqrepo import SeqRepo
>>> sr = SeqRepo(root_dir=environ["SEQREPO_ROOT_DIR"])
>>> seqrepo_access = SeqRepoAccess(sr)
>>> seqrepo_access.get_reference_sequence("NP_004324.2", 600, 600)
('V', None)

SeqRepo Issues

If you have trouble using the default path, try creating a symlink, by running the following:

seqrepo update-latest

Or set SEQREPO_ROOT_DIR in the .env file to the versioned SeqRepo path, e.g. SEQREPO_ROOT_DIR=/usr/local/share/seqrepo/2021-01-29.

Verify that this works by repeating the steps in SeqRepo Verification.
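If the verification still fails, it can help to inspect the directory itself before involving SeqRepo. This is a minimal sketch under the assumption that a SeqRepo instance directory contains an aliases.sqlite3 file and a sequences/ directory. The function name `check_seqrepo_root` is hypothetical.

```python
from pathlib import Path


def check_seqrepo_root(root_dir: str) -> list[str]:
    """Return a list of problems found with a SeqRepo root directory."""
    root = Path(root_dir)
    # exists() is False for a missing path and for a broken symlink.
    if not root.exists():
        return [f"{root} does not exist (missing or broken 'latest' symlink?)"]
    resolved = root.resolve()  # follow the 'latest' symlink, if any
    problems = []
    # Assumed layout of a SeqRepo instance; adjust if yours differs.
    if not (resolved / "aliases.sqlite3").is_file():
        problems.append(f"{resolved} has no aliases.sqlite3")
    if not (resolved / "sequences").is_dir():
        problems.append(f"{resolved} has no sequences/ directory")
    return problems
```

An empty list suggests the path points at a plausible instance. It does not confirm that the data version matches the one used in this analysis.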

Gene Normalizer DynamoDB

The VICC Gene Normalizer is used to normalize genes and get gene concept data. You must set up the Gene Normalizer's database. The DynamoDB instance was used in this analysis. The PostgreSQL instance is not supported for running notebooks. We provide a quick way to get the DynamoDB instance running in Using Gene Normalizer DynamoDB in s3.

AWS Environment Variables for DynamoDB

If you do not have an AWS account, you can keep AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN as is. Local DynamoDB instances will allow dummy credentials.

Using Gene Normalizer DynamoDB in s3

To immediately connect to the DynamoDB instance used in this analysis, download the instance and extract it. Then download the DynamoDB local archive (DynamoDB local v1.x was used), extract the contents, and move shared-local-instance.db into the dynamodb_local_latest directory (the relative path should be dynamodb_local_latest/shared-local-instance.db). Follow the documentation on how to start the database (you can skip steps 4 and 5).

When starting the DynamoDB database using the default configs, you should see the following:

Initializing DynamoDB Local with the following configuration:
Port:   8000
InMemory:       false
DbPath: null
SharedDb:       true
shouldDelayTransientStatuses:   false
CorsParams:     *

ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...

If your output looks a little different, you can verify the installation here.

Keep the database connected when running the notebooks.
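Before starting a long notebook run, you may want to confirm that something is still listening at GENE_NORM_DB_URL. The sketch below uses only the standard library and checks that the TCP port accepts connections. It does not confirm that the listener is DynamoDB. The function name `endpoint_is_listening` is hypothetical.

```python
import socket
from urllib.parse import urlsplit


def endpoint_is_listening(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(url)
    host = parts.hostname or "localhost"
    # Fall back to the scheme's usual port when the URL omits one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `endpoint_is_listening("http://localhost:8000")` should return True while the local database from this section is running.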

Gene Normalizer ETL files

The source files used during ETL methods have been uploaded to the public s3 bucket:

DynamoDB Verification

To verify, run the following inside your virtual environment:

$ python3
Python 3.11.5 (main, Aug 24 2023, 15:18:16) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from dotenv import load_dotenv
>>> load_dotenv()
True
>>> from gene.query import QueryHandler
.venv/lib/python3.11/site-packages/python_jsonschema_objects/__init__.py:46: UserWarning: Schema version http://json-schema.org/draft-07/schema not recognized. Some keywords and features may not be supported.
  warnings.warn(
>>> from gene.database import create_db
>>> q = QueryHandler(create_db())
***Using Gene Database Endpoint: http://localhost:8000***
>>> result = q.normalize("BRAF")
>>> result.gene_descriptor.gene_id
'hgnc:1097'

Biocommons UTA Database

Biocommons UTA is used to get transcript alignment data. You must set up the UTA database for Cool-Seq-Tool. This analysis used the uta_20210129 version.

More information for a local UTA database installation can be found here. A local installation was used when running this analysis.

You can also install with Docker (faster setup) as described here. The uta_20210129b image should be used. This option was not used when running this analysis.

Once set up, you must update the UTA_DB_URL environment variable in the .env file with your credentials. If following the Local Installation README, your UTA_DB_URL would be set to postgresql://uta_admin@localhost:5432/uta/uta_20210129.
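To make the expected shape of UTA_DB_URL concrete (driver://user:password@host:port/database/schema), the following sketch splits a URL into its parts with the standard library. The function name `parse_uta_db_url` is hypothetical, and this is not how Cool-Seq-Tool itself parses the value.

```python
from urllib.parse import urlsplit


def parse_uta_db_url(url: str) -> dict:
    """Split a UTA_DB_URL into its components, checking the path shape."""
    parts = urlsplit(url)
    path_parts = parts.path.strip("/").split("/")
    # The path must name both the database and the schema.
    if len(path_parts) != 2 or not all(path_parts):
        raise ValueError("Expected a path of the form /database/schema")
    database, schema = path_parts
    return {
        "driver": parts.scheme,
        "user": parts.username,
        "password": parts.password,  # None when the URL has no password
        "host": parts.hostname,
        "port": parts.port,
        "database": database,
        "schema": schema,
    }
```

With the Local Installation value above, the database is `uta` and the schema is `uta_20210129`. A URL that omits the schema segment raises a ValueError.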

Note: Cool-Seq-Tool creates a new genomic table. To create it, follow all of the steps in the UTA Verification section.

UTA Verification

To verify, run the following inside your virtual environment:

$ python3 -m asyncio
asyncio REPL 3.11.5 (main, Aug 24 2023, 15:18:16) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Use "await" directly instead of "asyncio.run()".
Type "help", "copyright", "credits" or "license" for more information.
>>> import asyncio
>>> from dotenv import load_dotenv
>>> load_dotenv()
True
>>> from cool_seq_tool.data_sources import UTADatabase
.venv/lib/python3.11/site-packages/python_jsonschema_objects/__init__.py:46: UserWarning: Schema version http://json-schema.org/draft-07/schema not recognized. Some keywords and features may not be supported.
  warnings.warn(
>>> uta_db = await UTADatabase.create()
>>> await uta_db.get_ac_from_gene("BRAF")
['NC_000007.14', 'NC_000007.13']

Running Notebooks

This section provides information about the notebooks and the order that they should be run in.

  1. Run the following notebook:
  2. Run the following notebooks (order does not matter):
  3. Run the following notebooks (order does not matter):
  4. Run the following notebook:
  5. Run the following notebook:

Running Notebooks in Visual Studio Code (VS Code)

VS Code is a lightweight source code editor for Windows, Linux, and macOS.

  1. Download VS Code here
  2. Open a notebook and click Select Kernel at the top right. Select the option where the path is .venv/bin/python. See here for more information on managing Jupyter Kernels in VS Code.
  3. Run the notebooks

Analysis with macOS Environments

These notebooks were run using these macOS specs:

| Model Year | CPU Architecture | Total RAM | Hard drive capacity |
| ---------- | ---------------- | --------- | ------------------- |
| 2019 | 2.6 GHz 6-Core Intel Core i7 | 32 GB | 1 TB |
| 2021 | M1 Pro | 32 GB | 1 TB |

Help

If you have any questions or problems, please open an issue in the repo and our team will be happy to assist.