The Cancer Genome Atlas (TCGA), a cancer genomics reference program, has molecularly characterized more than 20,000 primary cancer samples and paired normal samples covering 33 types of cancer. This joint effort between the NCI and the National Human Genome Research Institute began in 2006. In the twelve years since, TCGA has generated more than 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data. These data have led to improvements in the ability to diagnose, treat, and prevent cancer by helping to establish the importance of cancer genomics.
During the experiments, the dataset was enlarged significantly to improve the diversity and representativeness of the data. This let the model learn from a wider variety of examples and generalize better. Both models involved in this study, the classification model and the Siamese model, were also adjusted. A key element of the optimization was the use of custom weights: different weights were assigned to different instances of the dataset based on the number of samples present. Finally, the types of mutations were specified explicitly, which allowed for greater precision in the analysis of genetic information. This work sits within a research landscape in which numerous studies aim to identify a distinctive genomic signature for different types of cancer.
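The custom-weight idea can be sketched as follows. This is only an illustration: the exact weighting formula used in the project is not stated here, so the sketch assumes a common inverse-frequency scheme (`n_samples / (n_classes * class_count)`), and the function name and labels are hypothetical placeholders.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class, larger for classes with fewer samples."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Assumed formula (not confirmed by the project): n_samples / (n_classes * count).
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels used only for illustration.
labels = ["BRCA"] * 6 + ["LUAD"] * 3 + ["SKCM"] * 1
class_weights = inverse_frequency_weights(labels)

# Each sample inherits the weight of its class.
sample_weights = [class_weights[label] for label in labels]
print(class_weights)
```

With this scheme the rarest class receives the largest weight, so under-represented cancer types contribute more per sample to the loss.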
Module | Version |
---|---|
tensorflow | 2.15.0 |
torch | 2.1.2 |
cuda | 12.2 |
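
One possible way to check an environment against the versions in the table is sketched below. It assumes the packages are installed under the distribution names `tensorflow` and `torch`; CUDA is not a Python package, so it is not checked. The helper name `check_versions` is hypothetical.

```python
from importlib import metadata

# Versions taken from the table; CUDA must be checked separately.
EXPECTED = {"tensorflow": "2.15.0", "torch": "2.1.2"}


def check_versions(expected):
    """Map each package to (installed_version_or_None, matches_expected)."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore a local build suffix such as "+cu121" when comparing.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, matches)
    return report


report = check_versions(EXPECTED)
for name, (installed, ok) in report.items():
    print(f"{name}: installed={installed} expected={EXPECTED[name]} ok={ok}")
```
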
To install tensorflow follow this guide: link
To install and set up cuda and cudnn follow this guide:
For more information about our research project, see the paper here: our paper
To view the other papers that contributed to this cancer research study, and on which we have commented, follow this link: other papers
This section provides technical information and installation guides.
- Download all the files in the `Dataset` folder from Google Drive: LINK;
- Download the files into a folder named `dataset`;
- Copy the `dataset` folder and paste it inside the project like this: `/Detection-signature-cancer/code/dataset`
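
As a quick sanity check, the placement of the folder can be verified with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and it assumes the project was checked out into a folder called `Detection-signature-cancer`.

```python
from pathlib import Path


def dataset_dir(project_root):
    """Return the expected dataset folder for a given project checkout."""
    return Path(project_root) / "code" / "dataset"


def dataset_is_in_place(project_root):
    """True if the dataset folder exists and contains at least one entry."""
    folder = dataset_dir(project_root)
    return folder.is_dir() and any(folder.iterdir())


# "Detection-signature-cancer" is assumed to be the checkout folder name.
print(dataset_is_in_place("Detection-signature-cancer"))
```
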
The `main.py` script defines several paths, described below:
- Dataset
  - `dataset_path`: the dataset to use (the `SNP_DEL_INS_CNA_mutations_and_variants` folder contains two);
  - `encoded_path`: the encoded version of the dataset;
- Classification
  - `model_path`: where the classification model is saved to or loaded from;
  - `risultati_classification`: results of the classification;
- Siamese
  - `siamese_path`: where the Siamese network model is saved to or loaded from;
  - `risultati_siamese`: results of the Siamese network;
To switch between the `0030` and `0005` datasets (read the paper for their meanings), you only need to edit the strings containing `0030` or `0005` and replace the value with the other one.
For example:
dataset_path = ("dataset/data_mrna/SNP_DEL_INS_CNA_mutations_and_variants/"
"data_mrna_v2_seq_rsem_trasposto_normalizzato_deviazione_0030_dataPatient_mutations_and_variants.csv")
Becomes
dataset_path = ("dataset/data_mrna/SNP_DEL_INS_CNA_mutations_and_variants/"
"data_mrna_v2_seq_rsem_trasposto_normalizzato_deviazione_0005_dataPatient_mutations_and_variants.csv")
Or
model_path = "models/0030/classification/espressione_genomica_con_varianti_2LAYER/"
Becomes
model_path = "models/0005/classification/espressione_genomica_con_varianti_2LAYER/"
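
To avoid editing each string by hand, the two paths shown above could be derived from a single tag. This is a sketch, not how `main.py` is actually written; `build_paths` and the template constants are hypothetical names.

```python
DATASET_TEMPLATE = (
    "dataset/data_mrna/SNP_DEL_INS_CNA_mutations_and_variants/"
    "data_mrna_v2_seq_rsem_trasposto_normalizzato_deviazione_{tag}"
    "_dataPatient_mutations_and_variants.csv"
)
MODEL_TEMPLATE = "models/{tag}/classification/espressione_genomica_con_varianti_2LAYER/"


def build_paths(tag):
    """Return (dataset_path, model_path) for the '0030' or '0005' dataset."""
    if tag not in ("0030", "0005"):
        raise ValueError(f"unknown dataset tag: {tag!r}")
    return DATASET_TEMPLATE.format(tag=tag), MODEL_TEMPLATE.format(tag=tag)


dataset_path, model_path = build_paths("0005")
print(dataset_path)
print(model_path)
```

Changing the single argument passed to `build_paths` then keeps the dataset and model paths in sync.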
Also in the `main.py` script, you can set the following variables:
- `only_variant = False`: set this to `True` if you use the dataset that contains only variations in gene mutations;
- `data_encoded = False`: controls whether the encoded dataset is generated or loaded (if this is the first time you run the code, leave the default value):
  - `False`: the encoded dataset is generated;
  - `True`: an existing encoded dataset is loaded;
- `classification = True`: run the classification;
- `siamese_net = True`: run the Siamese network;
- `siamese_variants = True`: set this to `True` if you use the dataset that contains the variations in gene mutations.
The Siamese network can only be launched if a classification model has already been trained and saved. In this project, the classification model has already been trained.
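
This dependency can be expressed as a small guard before launching the Siamese step. The sketch below rests on two assumptions that are not confirmed by the project: that a non-empty folder at `model_path` means a saved model is available, and that when `classification` is `True` the classification step runs first and saves its model. The function name is hypothetical.

```python
from pathlib import Path


def siamese_can_run(classification, model_path):
    """Return True if the Siamese step has a classification model to rely on."""
    if classification:
        # Assumption: classification runs first in the same execution and saves its model.
        return True
    # Assumption: a non-empty model folder means a trained model was saved earlier.
    folder = Path(model_path)
    return folder.is_dir() and any(folder.iterdir())


model_path = "models/0030/classification/espressione_genomica_con_varianti_2LAYER/"
if not siamese_can_run(classification=False, model_path=model_path):
    print("Train the classification model first, or set classification = True.")
```
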
If you want to use the models included in this project instead of running the experiments again, set the parameters as follows (example for the `0030` dataset):
only_variant = False
data_encoded = True
classification = False
siamese_net = True
siamese_variants = True
To run the project, run the `main.py` script.
Name | Description |
---|---|
Alberto Montefusco | Developer - Alberto-00; Email - [email protected]; LinkedIn - Alberto Montefusco; Website - alberto-00.github.io |
Alessandro Macaro | Developer - mtolkien; Email - [email protected]; LinkedIn - Alessandro Macaro |