The Cancer Genome Atlas (TCGA), a cancer genomics reference program, has molecularly characterized more than 20,000 primary cancer samples and paired normal samples covering 33 types of cancer. This joint effort between the NCI and the National Human Genome Research Institute began in 2006.

Alberto-00/Detection-signature-cancer

Leveraging gene expression and genomic variation for cancer prediction using one-shot learning

The Cancer Genome Atlas (TCGA), a cancer genomics reference program, has molecularly characterized more than 20,000 primary cancer samples and paired normal samples covering 33 types of cancer. This joint effort between the NCI and the National Human Genome Research Institute began in 2006. In the twelve years since, TCGA has generated more than 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data. These data have improved the ability to diagnose, treat, and prevent cancer by helping to establish the importance of cancer genomics.

Contribution of this work

During the experiments, the dataset was significantly enlarged to improve the diversity and representativeness of the data. This allowed the model to learn from a wider variety of examples and to generalize better. Both models in this study, the classification model and the Siamese model, were also adjusted. A key part of the optimization was the use of custom weights, which assign different weights to instances of the dataset according to the number of samples present. Finally, the types of mutations were specified explicitly, which allows a more precise analysis of the genetic information. Numerous current studies aim to identify a distinctive genomic signature for different types of cancer.
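The custom-weight idea described above can be illustrated with inverse-frequency class weights. This is only a sketch under stated assumptions, not the project's actual code: the function name `inverse_frequency_weights`, the example labels, and the "balanced" formula `n_samples / (n_classes * class_count)` are all hypothetical, since the text only says that weights depend on the number of samples.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class_label: weight}, giving rarer classes larger weights.

    Assumed formula (a common "balanced" heuristic, not taken from the project):
        weight = n_samples / (n_classes * class_count)
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 6 samples of one cancer type and 2 of another.
labels = ["BRCA"] * 6 + ["LUAD"] * 2
weights = inverse_frequency_weights(labels)
print(weights)
```

If the class labels are integer indices, a dictionary of this shape is the kind of mapping that Keras `Model.fit` accepts through its `class_weight` argument; whether the project passes its weights that way is not stated here.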

Requirements (tested)

Module Version
tensorflow 2.15.0
torch 2.1.2
cuda 12.2
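As a convenience, the installed package versions can be compared against the tested ones. The sketch below is an assumption-laden helper, not part of the project: `version_mismatches` and `TESTED` are hypothetical names, and CUDA is left out because it is not a Python package that `importlib.metadata` can inspect.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed as tested in the table above (CUDA is checked separately).
TESTED = {"tensorflow": "2.15.0", "torch": "2.1.2"}


def version_mismatches(tested, get_version=version):
    """Return {package: installed_version_or_None} for packages that differ."""
    mismatches = {}
    for package, wanted in tested.items():
        try:
            found = get_version(package)
        except PackageNotFoundError:
            found = None  # package is not installed
        # Ignore local build suffixes such as "+cu121" when comparing.
        base = found.split("+")[0] if found is not None else None
        if base != wanted:
            mismatches[package] = found
    return mismatches


print(version_mismatches(TESTED))
```

An empty dictionary means both packages match the tested versions; a different version is not necessarily a problem, only untested.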

To install tensorflow follow this guide: link
To install and set up cuda and cudnn follow this guide:

Related work

For more information about our research project access the paper here: our paper
To view the other papers that have contributed to the cancer research study and on which we have commented follow this link: other papers

Technical information - main.py

This section covers technical details and setup instructions.

Download Dataset

  • Download all the files in the Dataset folder from Google Drive: LINK;
  • Place the downloaded files in a folder named dataset;
  • Copy the dataset folder into the project so that it is located at /Detection-signature-cancer/code/dataset
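After copying the folder, a quick check can confirm that the expected files are in place. This is a minimal sketch: `find_missing` is a hypothetical helper, and it assumes the script is run from the code directory so that `dataset` resolves as a relative path. The file name is the one used in the dataset_path example in this README.

```python
from pathlib import Path


def find_missing(dataset_dir, expected_files):
    """Return the expected files that are not present under dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in expected_files if not (root / name).is_file()]


# Relative path taken from the dataset_path example in this README.
expected = [
    "data_mrna/SNP_DEL_INS_CNA_mutations_and_variants/"
    "data_mrna_v2_seq_rsem_trasposto_normalizzato_deviazione_0030"
    "_dataPatient_mutations_and_variants.csv",
]

missing = find_missing("dataset", expected)
if missing:
    print("Missing dataset files:")
    for name in missing:
        print("  " + name)
else:
    print("All expected dataset files were found.")
```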

Config Path

The script defines several paths, described below:

Dataset

  1. dataset_path: the dataset to use (the SNP_DEL_INS_CNA_mutations_and_variants folder contains two datasets);
  2. encoded_path: the encoded version of the dataset;

Classification

  1. model_path: where the classification model is saved to or loaded from;
  2. risultati_classification: where the classification results are stored;

Siamese

  1. siamese_path: where the Siamese network model is saved to or loaded from;
  2. risultati_siamese: where the Siamese network results are stored;

To switch between the 0030 and 0005 datasets (see the paper for their meanings), replace the 0030 or 0005 substring in each path with the other value.

For example:

dataset_path = ("dataset/data_mrna/SNP_DEL_INS_CNA_mutations_and_variants/"
                    "data_mrna_v2_seq_rsem_trasposto_normalizzato_deviazione_0030_dataPatient_mutations_and_variants.csv")      

Becomes

dataset_path = ("dataset/data_mrna/SNP_DEL_INS_CNA_mutations_and_variants/"
                    "data_mrna_v2_seq_rsem_trasposto_normalizzato_deviazione_0005_dataPatient_mutations_and_variants.csv")      

Or

model_path = "models/0030/classification/espressione_genomica_con_varianti_2LAYER/"

Becomes

model_path = "models/0005/classification/espressione_genomica_con_varianti_2LAYER/"
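Instead of editing each string by hand, the variant could be kept in a single variable and the paths derived from it. This is a sketch under assumptions: `VARIANT` and `build_paths` are hypothetical names that do not appear in main.py, and only the two paths shown in the examples above are covered.

```python
VARIANT = "0030"  # hypothetical switch; the other supported value is "0005"


def build_paths(variant):
    """Return (dataset_path, model_path) for the given dataset variant."""
    if variant not in {"0030", "0005"}:
        raise ValueError(f"unknown dataset variant: {variant!r}")
    dataset_path = (
        "dataset/data_mrna/SNP_DEL_INS_CNA_mutations_and_variants/"
        f"data_mrna_v2_seq_rsem_trasposto_normalizzato_deviazione_{variant}"
        "_dataPatient_mutations_and_variants.csv"
    )
    model_path = (
        f"models/{variant}/classification/"
        "espressione_genomica_con_varianti_2LAYER/"
    )
    return dataset_path, model_path


dataset_path, model_path = build_paths(VARIANT)
print(dataset_path)
print(model_path)
```

Deriving every path from one value avoids the situation where the dataset points at 0005 while a model path still points at 0030.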

Boolean Variables

Also in the main.py script, you can set the following variables:

  • only_variant = False: set this to True if you use the dataset that contains only variations in gene mutations;
  • data_encoded = False: controls whether the encoded dataset is generated or loaded (leave the default value the first time you run the code)
    • False: generate the encoded dataset;
    • True: load an existing encoded dataset;
  • classification = True: run the classification;
  • siamese_net = True: run the Siamese network;
  • siamese_variants = True: set this to True if you use the dataset that contains the variations in gene mutations;

The Siamese network can be launched only if a classification model has already been trained and saved. The project already includes a trained classification model. To reuse the models in this project instead of running the experiments again, set the parameters as follows (example for the 0030 dataset):

only_variant = False
data_encoded = True
classification = False
siamese_net = True
siamese_variants = True
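The dependencies between these flags can be checked before a long run starts. The sketch below is hypothetical and not part of main.py: `check_run_config` is an invented helper, the encoded path in the usage example is a placeholder, and it assumes that running with classification = True saves the model before the Siamese network needs it.

```python
from pathlib import Path


def check_run_config(data_encoded, classification, siamese_net,
                     encoded_path, model_path):
    """Return a list of problems with the chosen flags (empty if none)."""
    problems = []
    # Loading an encoded dataset only works if one was generated earlier.
    if data_encoded and not Path(encoded_path).exists():
        problems.append(f"data_encoded=True but nothing found at {encoded_path}")
    # The Siamese network needs a saved classification model when
    # classification is not being run in the same execution.
    if siamese_net and not classification and not Path(model_path).exists():
        problems.append(f"siamese_net=True but no saved model at {model_path}")
    return problems


problems = check_run_config(
    data_encoded=True,
    classification=False,
    siamese_net=True,
    encoded_path="path/to/encoded",  # placeholder, not the project's real path
    model_path="models/0030/classification/espressione_genomica_con_varianti_2LAYER/",
)
for problem in problems:
    print(problem)
```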

To run the project run the main.py script.

Author & Contacts

Name Description

Alberto Montefusco


Developer - Alberto-00

Email - [email protected]

LinkedIn - Alberto Montefusco

My WebSite - alberto-00.github.io


Alessandro Macaro


Developer - mtolkien

Email - [email protected]

LinkedIn - Alessandro Macaro

