Welcome to the GenNet-multi-omic wiki!
Since there is no one-size-fits-all solution for most multi-omics data, you should preprocess the data yourself. This multi-omic network needs two inputs: methylation and gene-expression data. However, any type of input can be used as long as you can group your inputs per gene.
The information used to group your omics (e.g., how to group your SNPs, CpGs, genes, and/or pathways) is also unique to your data. This framework allows you to shape your network the way you want it. You shape your network by providing sparse matrices of zeros and ones, where each input is a row and each output is a column. Simply put a one wherever an input (row) should be connected to an output neuron (column).
The custom connections are defined by the mask input, a sparse (COO) connectivity matrix with shape (N_nodes_layer_1, N_nodes_layer_2). It contains a zero where there is no connection and a one where there is a connection. For example:
```
                 output
          1   2   3   4   5
input 1 | 1   0   0   0   0 |
input 2 | 1   1   0   0   0 |
input 3 | 0   1   0   0   0 |
input 4 | 0   1   0   0   0 |
input 5 | 0   0   1   0   0 |
input 6 | 0   0   0   1   0 |
input 7 | 0   0   0   1   0 |
```
This connects the first two inputs (1, 2) to the first neuron in the second layer, inputs 2, 3, and 4 to output neuron 2, input 5 to output neuron 3, inputs 6 and 7 to the 4th neuron in the subsequent layer, and nothing to the 5th neuron.
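As a sketch, the example matrix above can be built as a SciPy COO matrix and saved in the .npz format that the models load (the file name mask.npz is a placeholder; use the path your model expects):

```python
import numpy as np
import scipy.sparse

# Rows are inputs, columns are output neurons (0-based indices here).
rows = np.array([0, 1, 1, 2, 3, 4, 5, 6])  # inputs 1-7
cols = np.array([0, 0, 1, 1, 1, 2, 3, 3])  # output neurons 1-4; neuron 5 stays unconnected
data = np.ones(len(rows), dtype=np.int8)

# 7 inputs x 5 output neurons: the sparse COO connectivity matrix
mask = scipy.sparse.coo_matrix((data, (rows, cols)), shape=(7, 5))
scipy.sparse.save_npz('mask.npz', mask)
```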
Select an appropriate model from the classification and regression models in the models folder. Replace the link to the masks (the COO matrices described above). Then proceed to running the model.
Note
For a cohort-wise cross-validation, normalize per cohort. Take care to standardize/normalize the training data only, not the training and test data together!
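For example, with scikit-learn's StandardScaler (a sketch with random stand-in data), fit on the training split only and reuse the fitted scaler for the test split:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
x_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Fit the scaler on the training data only...
scaler = StandardScaler().fit(x_train)
# ...then apply the same fitted transform to both splits, so no
# information from the test set leaks into the normalization.
x_train_std = scaler.transform(x_train)
x_test_std = scaler.transform(x_test)
```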
Prepare the following files:
- Methylation.h5 : Each column is an input feature and each row is an individual. This should be preprocessed data.
- GeneExpression.h5 : Each column is an input feature and each row is an individual. This should be preprocessed data.
- ytrain_<phenotype>_<fold>.csv : comma-separated file with patient information
- yval_<phenotype>_<fold>.csv : comma-separated file with patient information
- ytest_<phenotype>_<fold>.csv : comma-separated file with patient information
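A hedged sketch of writing such an .h5 file with h5py (random stand-in data, 100 individuals by 1000 features). The dataset key used here ('data') is an assumption; check the data-loading code of the training scripts for the exact key and dtype they expect:

```python
import h5py
import numpy as np

# Stand-in preprocessed methylation matrix:
# rows = individuals, columns = input features (CpGs)
rng = np.random.default_rng(0)
meth = rng.random((100, 1000), dtype=np.float64)

with h5py.File('Methylation.h5', 'w') as f:
    f.create_dataset('data', data=meth)
```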
For ytrain, yval, and ytest, prepare a training/validation/test split of the data for each fold. Fill in <phenotype> with your phenotype name. These .csv files should contain a column labels with the ground-truth annotations and a column row with the indices of the individuals in the .h5 files. Other columns should contain covariates.
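As a minimal sketch of such a file with pandas (the phenotype name bmi, the fold number 0, the example values, and the age covariate are made-up placeholders; the labels and row columns follow the description above):

```python
import pandas as pd

# 'labels' holds the ground-truth annotation and 'row' the index of
# each individual in the .h5 files; extra columns are covariates.
ytrain = pd.DataFrame({
    'row': [0, 1, 2, 3],
    'labels': [1, 0, 0, 1],
    'age': [54, 61, 47, 58],  # example covariate
})
ytrain.to_csv('ytrain_bmi_0.csv', index=False)
```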
For classification tasks use GenNet_ME_GE_classification.py; it includes the standard networks and the workflow for a classification task.
usage: GenNet_ME_GE_classification.py [-h] -j J [-lr LR] [-bs BS] [-l1 L1] [-mt MT] [-pn PN] [-fold FOLD] [-omic_l1 OMIC_L1] [-datapath DATAPATH]
optional arguments:
-h, --help show this help message and exit
-j J jobid: identifier for the experiment (experiment number, must be int)
-lr LR learning rate : float
-bs BS batch size: integer
-l1 L1 L1 penalty, must be a float
-mt MT Network name, select a name from models
-pn PN phenotype name
-fold FOLD fold number, must be integer
-omic_l1 OMIC_L1 Omic-specific L1 penalty
-datapath DATAPATH Path to processed data
GenNet_ME_GE_Regression.py contains the workflow for regression tasks. Running GenNet_ME_GE_Regression.py --help provides more information about the inputs.
usage: GenNet_ME_GE_Regression.py [-h] [-j J] [-lr LR] [-bs BS] [-l1 L1] [-mt MT] [-pn PN] [-fold FOLD] [-omic_l1 OMIC_L1] [-datapath DATAPATH]
Argument parser for the experiment
optional arguments:
-h, --help show this help message and exit
-j J jobid: identifier for the experiment (experiment number, must be int)
-lr LR learning rate : float
-bs BS batch size: integer
-l1 L1 L1 penalty, must be a float
-mt MT Network name, select a name from models
-pn PN phenotype name
-fold FOLD fold number, must be integer
-omic_l1 OMIC_L1 Omic-specific L1 penalty
-datapath DATAPATH Path to processed data
If you have methylation and gene expression inputs, as in our study, you can select a model from ClassicationModels or RegressionModels. To create an interpretable neural network with your own multi-omics data, you can modify the given neural networks to fit your needs. Below you will find an annotated network to help you get started:
```python
# Imports needed for this model. LocallyDirectedConnected is the sparse
# layer shipped with GenNet; adjust the import to where it lives in your
# copy of the repository.
import scipy.sparse
import tensorflow as tf
from tensorflow import keras as K
import LocallyDirectedConnected


def GenNet_classification_model_two_omics(inputsize_ME, inputsize_GE, l1_value):
    """
    Generates a classification model that integrates two different omics data:
    methylation (ME) and gene expression (GE).

    Parameters:
        inputsize_ME (int): The input size for the methylation data.
        inputsize_GE (int): The input size for the gene expression data.
        l1_value (float): The L1 penalty on the weights, used to make the
            model more interpretable.

    Note:
        - Add an input_size_omic_3 if you have three different omics.
        - This network has a methylation input (ME) and a gene expression input (GE).
        - datapath is assumed to be defined globally (see the -datapath argument).
    """
    # Load the mask that describes how methylation CpGs are grouped into genes
    mask_meth = scipy.sparse.load_npz(datapath + '/ME_gene.npz')
    # Load the mask combining the methylation genes and the gene expression genes
    combine_mask = scipy.sparse.load_npz(datapath + '/ME_GE_gene.npz')
    # Define the input for GE
    input_GE = K.Input((inputsize_GE,))
    # Define the input for ME
    input_ME = K.Input((inputsize_ME,))
    # Each input needs shape (size, 1) for the LocallyDirectedConnected layer
    gene_layer_ME = K.layers.Reshape(input_shape=(inputsize_ME,), target_shape=(inputsize_ME, 1))(input_ME)
    # Normalization layer for the methylation input
    gene_layer_ME = K.layers.BatchNormalization(center=False, scale=False, name="inter_out_gene_layer_me")(gene_layer_ME)
    # The sparse layer that connects input CpGs to genes (methylation-only representation)
    gene_layer_ME = LocallyDirectedConnected.LocallyDirected1D(mask=mask_meth, filters=1, input_shape=(inputsize_ME, 1), name="gene_layer_me")(gene_layer_ME)
    # Activation function for the methylation branch
    gene_layer_ME = K.layers.Activation("tanh", name="activation_ME")(gene_layer_ME)
    # Reshape the gene expression input
    input_GE_reshape = K.layers.Reshape(input_shape=(inputsize_GE,), target_shape=(inputsize_GE, 1))(input_GE)
    # Combine the methylation and gene expression branches
    combined = K.layers.concatenate([gene_layer_ME, input_GE_reshape], axis=1)
    # Normalization layer for the combined inputs
    combined = K.layers.BatchNormalization(center=False, scale=False, name="inter_out_combined")(combined)
    # The sparse layer that connects the combined inputs to genes
    gene_layer = LocallyDirectedConnected.LocallyDirected1D(mask=combine_mask, filters=1, name="gene_layer")(combined)
    # Flatten the layer output
    gene_layer = K.layers.Flatten()(gene_layer)
    # Activation function for the combined gene layer
    gene_layer = K.layers.Activation("tanh", name="activation_ME_GE")(gene_layer)
    # Final normalization layer
    gene_layer = K.layers.BatchNormalization(center=False, scale=False, name="inter_out_gene")(gene_layer)
    # Output node with L1 regularization
    end_node = K.layers.Dense(units=1, name="end_node", kernel_regularizer=tf.keras.regularizers.l1(l1_value))(gene_layer)
    # Sigmoid activation for the output node
    end_node = K.layers.Activation("sigmoid", name="activation_end")(end_node)
    # Flatten the final output
    end_node = K.layers.Flatten()(end_node)
    # Define the model with the ME and GE inputs
    model = K.Model(inputs=[input_GE, input_ME], outputs=end_node)
    return model
```