Skip to content

A Biostatistical tool for Transcriptomics Analysis

Notifications You must be signed in to change notification settings

cbio-astar-tools/GeneCloudOmics

 
 

Repository files navigation

GeneCloudOmics

GeneCloudOmics is a web server for transcriptome data analysis and visualization. It supports the analysis of microarray and RNASeq data and performs ten different bio-statistical analyses that cover the common analytics for gene expression data. Furthermore, it gives the users access to several bioinformatics tools to perform 12 different bioinformatics analyses on gene/protein datasets. GeneCloudOmics is designed as a one-stop server that helps the users perform all tasks through an intuitive graphical user interface (GUI) that waves the hassle of coding, installing tools, packages or libraries and dealing with operating systems compatibility and versioning issues, some of the complications that make data analysis tasks more challenging for biologists. GeneCloudOmics is an open-source tool and the website is free and open to all users and there is no login requirement.

Transcriptomic analysis

Import data and pre-processing

GeneCloudOmics supports two types of common data formats for gene expression analysis: RNA-Seq count matrix and Microarray CEL files.

RNA-Seq gene expression matrix format

Import data

GeneCloudOmics requires all input files in comma-separated value (.csv) format. The data file in .csv should contain the gene names in rows and genotypes (e.g. wild type – mutants or control - treatments, …) in columns, which is similar to the standard transcriptome data file format in GEO database. Supporting files (if applicable) include gene length, list of negative control genes, and metadata file. A number of normalization options are provided in the preprocessing tab depending on the availability of supporting files: RPKM, FPKM, TPM(requiring gene length), RUV (requiring negative control genes), and Upper Quartile (no supporting file needed). The metadata file is required for differential expression analysis, and should specify experimental conditions (e.g. Control/Treated, time 1/time 2/ time3,.) for each genotype listed in the data file. Otherwise, the user can move to the next option to perform analyses (e.g. scatter plot, distribution fitting, Pearson correlation, …) once the data file is loaded (whether normalized or in raw count). Once data files and supporting files are loaded into ABioTrans, user can press Submit button and ABioTrans will automatically proceed to the next tab Preprocessing. In case user accidentally upload wrong data file or supporting files, user can overwrite each of them by uploading new files, or press the browser Reset button (shortcut: F5 key on windows) to erase all uploaded files and reset the whole program.

  • If you input raw data (read counts), please make sure that the first column contains gene names, and the read counts of each genotype (conditions: wildtype, mutants, replicates, etc.) are in the following columns. Each genotype column should have a column name.

  • Along with raw read counts, you can provide gene length (base pair) information in two-column .csv file, with the first column specifying gene names, which must match the gene names in raw data file, and the second column specifying gene length in base pair. Gene length file is required for normalization methods for sequencing depth and gene length: RPKM, FPKM, TPM

  • Alternatively, if you input a normalized data file, it should have gene names in rows and genotypes in columns, following the usual format of files deposited in the GEO database.

    Figure 1a: Required format for raw counts data file

    Figure 1b: Gene length file format

List of negative control genes (spike-in or stably expressed genes accross all samples), if available, should be contained in one-column .csv file. Negative control genes are required for Remove Unwated Variation (RUV) normalziation.

Figure 1c:Negative control gene file format

Finally, a metadata table matching column names of data file to experimental conditions should be given in two-column .csv format. Metadata table is required for differential expression analysis

Figure 1d: Metadata file format

Pre-processing

Preprocessing involves two steps: removing lowly expressed genes and normalizing the remaining gene expression. First, the user need to specify threshold expression values (which must be in same unit to the input data file - either raw read counts or normalized expression), and the minimum number of data columns that must exceed the threshold value. Normalization methods are available upon the availability of supporting data files: normalization for sequencing depth, including TPM and RPKM, requires gene length and normalization for sample variation, including RUV, requires negative control genes. User can download the filtered, normalized data in the Data tab. Relative Log Expression (RLE) plots of raw and processed data are displayed to visualize the effects of normalization. Distribution of gene expression in each data column is visualized by violin plot.

![Figure 2: Preprocessing panel with RLE plots of raw data (upper figure) and filtered, RUV-normalized data (lower figure). Gene expression with minimum five counts in at least two columns are retained.](1b_preprocess(2).PNG

Microarray Preprocessing (new feature)

Microarrays can be used in many types of experiments including genotyping, epigenetics, translation profiling and gene expression profiling

Raw Count data & Metadata can be extracted from the Microarray which can further be used to perform all the different analysis.
The user needs to zip all the CEL files along with the SDRF(Sample and Data Relationship Format) file ( e.g. E-MTAB-2967.sdrf.txt) & upload it.

Once the processing is complete, the user can download the output in csv format. alt text

Scatter plot

Scatter plot compares any 2 samples (or 2 replicates) by displaying the respective expression of all genes in 2D space. It is recommended to preform normalization for sequencing depth (TPM, RPKM, FPKM) for this step (and so does distribution fitting, correlation, hierarchical clustering, noise and entropy). As gene expression data is naturally skewed towards very high expression level region, we recommend applying log-transformation to capture the whole data range. User can choose between log base 2, natural log, or log base 10. An option to add linear regression line is also available. Gene expression data is densely distributed in the lowly expressed region, making the dots usually indistinguishable in regular scatter plot. GeneCloudOmics overlay a 2D kernel density estimation on the scatter plot to visualize the density of expression level. The user can choose to download each single scatter plot, or to download all pairs of samples scatter plot in one PDF file, which may take some time to run.

Figure 3: Scatter plot

Distribution fitting

Distribution fitting compares the gene expression to a number of statistical continuous distributions, which can be used to validate the data. To visualize the comparison, GeneCloudOmics displays the Cumulative Distribution Function of the preprocessed gene expression data with the user-selected theoretical distributions. Once it is confirmed that the gene set follow a distribution, it would be safe conclude the validity of the gene expression data. AIC table is also provided in AIC table tab to show the best fitted distribution in each sample

Figure 3: Distribution fitting - Cumulative Distribution Function

Pearson and Spearman correlations

The Pearson correlation evaluates the linear relationship between two continuous variables. A relationship is linear when a change in one variable is associated with a proportional change in the other variable.

The Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate. The Spearman correlation coefficient is based on the ranked values for each variable rather than the raw data.

User shall choose a correlation method and view it in either correlation heatmap, correlation plot in circle or correlation matrix.

Figure 4: Correlation heatmap

PCA and K-means clustering

Principal Components Analysis (PCA) is an multivariate statistical technique for simplifying high-dimensional data sets (Basilevsky 1994). Given m observations on n variables, the goal of PCA is to reduce the dimensionality of the data matrix by finding r new variables, where r is less than n. Termed principal components, these r new variables together account for as much of the variance in the original n variables as possible while remaining mutually uncorrelated and orthogonal. Each principal component is a linear combination of the original variables, and so it is often possible to ascribe meaning to what the components represent. A PCA analysis of transcriptomic data consider the genes as variables, creating a set of “principal gene components” that indicate the features of genes that best explain the experimental responses they produce.

Figure 5: Principal Component plot

Differential Expression Analysis

DE analysis identifies the genes that are statistically different in expression levels between the 2 selected conditions. Two important threshold are:

  • The lower bound of expression fold change between the 2 selected conditions

  • The upper bound of hypothesis test p-value

GeneCloudOmics implements 3 popular methods to identify DE genes:

For data with single replicate in all experiment condition, NOISeq method can simulate technical replicates to carry out DE test. Metadata file is required for DE Analysis. Please make sure metadata contains all column names from input data file and match them with experimental condition

For edgeR and DESeq2, raw read counts data file must be provided. For NOISeq, gene expression should be normalized for sequencing depths (by select normalization method in Preprocessing tab if raw counts file is inputted, or by directly providing normalized gene expression)

To carry out the analysis, first the user needs to specify DE methods, two conditions to compare (condition 2 is compared against condition 1), fold change cut-off value and False Discovery Rate (FDR or adjusted p-value) threshold, then hit the “Submit” button. By convention, DE genes are filtered at FDR = 0.05 and 2-fold change.

When the computation finishes, table of DE genes, volcano plot of DE result and dispersion plot of input data are displayed in their respective tabs. Please note that volcano plot and dispersion plot are only available for edgeR and DESeq2 methods.

Figure 6: Volcano plot summarizing differential expression analysis result

Heatmap of gene expression and Hierarchical clustering

Hierarchical clustering is used to find the groups of co-expressed genes. The clustering is performed on normalized expressions of differentially expressed genes using Ward clustering method.

Heat map of gene expression and hierarchical clustering can be carried out on DE genes resulted from DE analysis tab by selecting DE result. Otherwise, you can specify the minimum fold change and minimum number of samples (columns in input data file) satisfying the fold change. Finally, you need to specify the number of clusters on rows (genes), then hit Submit.

The gene names in each cluster are displayed in the Gene clusters panel, corresponding to the heatmap you just generated.

Figure 7: Heatmap of gene expression and hierarchical clustering for co-expressed genes

Transcriptome-wide average noise

Transcriptome-wide average noise is used to quantify between gene expressions scatter of all replicates in one experimental condition. Formula is adapted from Kumar et. al 's paper

  • User to select a desired noise plot according to your data. By default the name of the first replicate in each genotype is used as the name of the genotype. You can specify the names of the genotypes, only make sure that all names are different (due to the mapping mechanism of Plotly). Please be reminded that the consecutive columns will be regarded as of the same genotype.
  • Noise here refers to the average transcriptome noise — the squared coefficient of variation — defined as the variance (σ2) of expression divided by the square mean expression (μ2).
  • If replicates is selected, the noise within one genotype will be computed.
  • If genotypes (average of replicates) is selected, the gene expression data within one genotype will first be taken average of, and the average value will be used as the expression value of that certain genotype. Then noise will be computed between the anchor genotype and all the other genotypes.
  • If genotypes (no replicate) is selected, each sample column is treated as a different genotype. Noise will directly be computed between the anchor genotype and the rest of the genotypes.

Entropy

Shannon entropy (Shannon, 1948) measures the disorder of a high-dimensional system, where higher values indicate increasing disorder. ABioTransPlus adapts the Shannon entropy formula to quantify the variability of one sample of gene expression.

  • First specify whether your data is a time series data.
  • If the data is not in time series, entropy values of all samples will directly be displayed.
  • Otherwise, if it is a time series data, specify the number of time points. For example, if the number of time points is 6, then sample 1 to 6 will be regarded as genotype A time 1 - time 6, sample 7 to 12 will be genotype B time 1- time 6, and so on and so forth. For a line chart, the entropy of each genotype will be shown in one line.

Figure 8: Entropy

t-distributed stochastic neighbor embedding (t-SNE) (new feature)

t-SNE is a dimensionality-reduction approach that reduces the complexity of highly complex data such as the transcriptomic data. It visualizes the sample interrelations in a 2- or 3-dimensional visualization. This allows the identification of the close similarities between samples through the relative location of mapped points. Since t-SNE is nonlinear and able to control the trade-off between local and global relationships among points, its visualization of the clusters is usually more compelling when compared with the other methods [Cieslak 2020].

GeneCloudOmics introduces an intuitive interface that allows performing t-SNE analysis on the processed untransformed transcriptomic data through entering three inputs perplexity value, the number of principal components (PC) and the number of clusters. The user can also choose to log transform the data before submission.

Clustering with random forest (new feature)

Random forest clustering is an unsupervised learning approach, where each sample is clustered into different classes, based on their similarity (usually based on Euclidean distance).

GeneCloudOmics uses the random forest algorithm to generate a proximity matrix, a rough estimate of the distance between samples based on the proportion of times the samples end up in the same leaf node of the decision tree. The proximity matrix is converted to a dist matrix which is then inputted to the hierarchical clustering algorithm [Chen 2012].

The diff_genes.csv file will be generated after each run, containing names of differential genes. alt text

Self-Organizing Map (SOM) (new feature)

SOM is a dimensionality reduction technique that produces a two-dimensional, discretized representation of the high-dimensional gene expression matrix. It uses a neighbourhood function to preserve the topological properties of the input gene expression matrix. Each data point (e.g. 1 sample) in the input gene expression matrix recognizes itself by competing for representation. SOM mapping steps start from initializing the weight vectors. From there, a sample vector is selected randomly and the map of weight vectors is searched to find which weight best represents that sample. Each weight vector has neighbouring weights that are close to it. The weight that is chosen is rewarded by being able to become more like that randomly selected sample vector. The neighbours of that weight are also rewarded by being able to become more like the chosen sample vector. This allows the map to grow and form different shapes. Most generally, they form square/rectangular/hexagonal/L shapes in 2D feature space [Yin 2020].

GeneCloudOmics provides 5 types of SOM plots:
Property: Uses values of codebook vectors (weight of gene vectors) and output as coloured nodes
Count: Shows how many genes are mapped to each node
Codes: Displays codebook vectors
Distance: Shows how close genes are from each other when they are mapped
Cluster: Uses hierarchical clustering to cluster the SOM

There are 4 parameters to control the SOM plots. Samples used determines the sample chosen for the plots, either all samples or individual ones. No. of horizontal grids and No. of vertical grids changes the number of nodes used in the SOM. No. of clusters classifies the SOM nodes into the specified cluster size (for cluster plot). alt text

Gene/Protein Set Analysis

Differentially expressed gene (DGE) analysis usually outputs a list of genes that are statistically determined as differentially expressed. Then, the list of DEGs is analyzed, interpreting and annotated to learn more about the functions, pathways and cellular processes that these genes are involved in. Most of the gene differential expression analysis tools do not include bioinformatics features for gene set analysis or include few basic analyses such as GO and pathways enrichment.

In GeneCloudOmics, we designed the GO feature to be dynamic by reading the GO terms associated with the genes/proteins directly from UniProt Knowledgebase [UniProt Consortium 2019] then visualize each of the three GO domains (cellular component, molecular function and biological process) in an independent tab in a bar chart and on a downloadable tabular format (Figure YY). Furthermore, we introduced 11 new bioinformatics analyses that can be performed on a given gene/protein dataset.

Pathways Enrichment Analysis

For a given gene or protein set, GeneCloudOmics uses g:Profiler [Raudvere 2019] to perform a pathway enrichment analysis and displays the results as a network where the nodes are the pathways and the edges are the overlap between the pathways (Figure X). We use Cytoscape.JS for the network visualization [Franz 2016] and, therefore, the network properties such as the colour and layout can be changed from the left panel.

The users can choose from nine different network layouts and five different network colour schemes. The overlap between the pathways can be also changed from the provided controllers so that the user can choose an overlap cutoff or overlap range. The enrichment results can also be downloaded as a CSV file.

Protein-Protein Interaction

Investigating PPIs is one of the essential steps in systems biology studies. PPI databases are growing in size with improved accuracy of PPI curations. Furthermore, several recent large-scale projects provide experimentally-confirmed interaction such as the reference map of the human binary protein interactome (HuRI) project [Luck 2020]. The PPI feature in ABioTrans Plus provides the users with an interface where they can upload a set of proteins (UniProt accessions) and get all the interactions associated with them. The interactions are visualized as a network where the nodes are the proteins and the edges are the interactions and the node size corresponds to the number of interactors of the protein. We use Cytoscape.JS for PPI visualization [Franz 2016]. We provide users with five network styles and nine network layouts to customize their results. The results are also displayed as an interactions table and can be downloaded as a network or an interaction table.

Complex Enrichment

The identification of the subunits of the protein complexes is important to understand the functions and the formation of these macromolecular machines. We provide the user with a complex enrichment feature that allows the identification of proteins in the provided dataset that are part of a known protein complex. This feature uses CORUM databases, which contain curated complex information for mammalian proteins [Giurgiu 2019]. This feature provides the user with complex-forming proteins and the complex information in the submitted dataset.

Protein Function

UniProt Knowledgebase provides a detailed function for thousands of proteins. The protein function feature retrieves the protein function information from UniProt of a given protein set (UniProt accessions). The retrieved protein functions are displayed in a downloadable tabular format.

Protein Subcellular Localization

The protein localization critically affects the protein function. The change of the protein localization is a cellular approach to regulate biological processes. UniProt Knowledgebase provides curated subcellular localization information for the proteins. The protein subcellular localization feature provides the user with an interface to get the subcellular localization information for a given list of proteins (UniProt accessions) and display the results in a downloadable tabular format.

Protein Domains

The protein domains are the functional subunits of the proteins that contribute to their overall function. Each domain is responsible for a certain function or interaction. ABioTrans Plus provides the users with a protein domain feature that connects to UniProt Knowledgebase and retrieves the domain information associated with each protein in a given list of UniProt accession.

Tissue Expression

The distinct expression profile of genes and proteins per tissue is what gives the different tissues the suitability for their functions. The tissue expression feature in ABioTrans Plus provides the user with the tissue expression for each protein in a given protein list (UniProt accessions) through retrieving this information from UniProt Knowledgebase. The result is displayed in a downloadable tabular format. 8) Gene Co-expression: The co-expression analysis is a common analysis that assesses the expression level of different genes to identify simultaneously expressed genes. The resultant co-expression networks are used to identify functionally related genes or genes being controlled by the same transcriptional mechanism [Vella 2017]. ABioTrans Plus provides the users with an interface where they can submit a co-expression query to GeneMANIA [Franz 2018] then shows the results at GeneMANIA’s website in a new tab. Currently, we support queries for nine model organisms including human, yeast, E. coli, C. elegans, Arabidopsis, Drosophila, zebrafish, mouse and rat.9) Protein Physicochemical Properties: For a given set of proteins (UniProt accessions), this feature provides the user with the complete sequences of them in a single FASTA file and allows the user to investigate their physicochemical properties. The physicochemical analysis includes sequence charge, GRAVY index [Kyte 1982] and hydrophobicity (Figure YA).

Protein Evolutionary Analysis

For a given set of proteins (UniProt accessions), this feature provides the user with a phylogenetic and evolutionary analysis that include multiple sequence alignment (MSA) of the protein sequences, clustering based on the amino acid sequences, chromosomal location or gene tree (Figure YB).

Protein Pathological Analysis

Several diseases are associated with the malfunction of certain genes or proteins. The disease-protein association is collected in different online resources such as OMIM databases [Amberger 2019], DisProt [Hatos 2020] and DisGeNET [Pinero 2019]. ABioTrans Plus provides the users with an interface that retrieves the disease-protein association from online databases for a given list of proteins (UniProt accessions). The disease-protein association is visualized as bubble chats that shows the distribution of the proteins among the disease (each bubble is protein and the bubble size is the number of associated disease) or the distribution of proteins among the diseases (each bubble is a disease and the bubble size is the number of associated proteins) (Figure YC).

The features that communicate with UniProt Knowledgebase use UniProtR, an R package for data retrieval and visualization from UniProt [Soudy 2020]. Since all the bioinformatics features only accept gene names (gene symbol) or UniProt accessions, we provide the users on each page with links to two ID converters UniProt ID mapping [UniProt Consortium 2019] and g:Convert [Raudvere 2019] to convert their identifiers to gene names or UniProt accessions.

Download instructions

  1. Scatter plot, distribution fit, correlation plot and heatmap can be saved as PDF by clicking Download as PDF. You can name the file before saving it. Also, you can directly drag the plot from the GUI to a folder on the computer.

  2. PCA, noise, entropy and gene ontology pie chart are supported by Plotly. Please put your mouse over the graph region and click the left most camera sign to save the plot. It will be saved in the download folder on your computer. You are suggested to use Chrome because it saves you trouble from creating an account and changing the code. More importantly, you can download the PCA-3D plot from any angle.

Docker Support

Now we also have a docker image containing all the dependencies needed to run ABioTrans Plus. It also support shiny-server opensource which can be exposed in port 3838.

It can be pulled using docker pull jaktab/genecloudomics-webserver:latest

About

A Biostatistical tool for Transcriptomics Analysis

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • R 86.2%
  • Python 12.6%
  • JavaScript 1.2%