04.04-unsupervised_exploration.Rmd

# Unsupervised exploration

Unsupervised methods in multi-omic data analysis involve techniques for exploring the structure and patterns of the data without prior knowledge of the experimental design. These include cluster analysis and visualisation methods based on ordination. These procedures can reveal meaningful groupings among observations or allow for reducing the complexity of the data by reducing the number of dimensions. Researchers can then use the results from these exploratory techniques in further multi-omic data integration. It's crucial to properly pre-process the data and choose the appropriate association coefficient when computing these methods, as this has a significant impact on the final outcome.

* [Cluster analysis](#cluster-analysis)
* [Dimension reduction and ordination](#dimension-reduction-ordination)

## Cluster analysis {#cluster-analysis}

Clustering procedures group features or observations into homogeneous sets by minimising within-group and maximising among-group distances

### Hierarchical clustering {#hierarchical-clustering}

Hierarchical clustering produces a stratified organisation of features or observations where relatively similar objects are grouped together. The clustering can be performed using different criteria to measure the distance between clusters, which will affect the final outcome of the analysis (e.g., single linkage, complete linkage, average linkage and Ward’s minimum variance).

\small
```{r eval=FALSE}
# Load the dataset
data <- read.csv("mydata.csv")

# Perform hierarchical clustering
dist_matrix <- dist(data)  # calculate distance matrix
hc <- hclust(dist_matrix)  # perform hierarchical clustering

# Plot dendrogram of clustering
plot(hc, hang=-1)
```
\normalsize

A useful exploratory analysis to reveal general patterns in an omic layer can be obtained by simultaneous application of hierarchical clustering to the rows and columns of the data matrix, and visualising the results in a heatmap.

\small
```{r eval=FALSE}
# Load the dataset
data <- read.csv("mydata.csv", row.names=1)

# Perform hierarchical clustering of rows and columns
row_clusters <- hclust(dist(data))
col_clusters <- hclust(dist(t(data)))

# Plot heatmap with row and column dendrograms
library(gplots)
heatmap.2(as.matrix(data),
          Rowv=row_clusters,
          Colv=col_clusters,
          scale="row",
          dendrogram="both",
          key=TRUE,
          keysize=1.5,
          col=redgreen(75))
```
\normalsize

### Disjoint clustering {#disjoint-clustering}

Disjoint clustering techniques aim at separating the objects into individual, usually mutually exclusive, and in most cases, unconnected clusters. K-means clustering is one of the most typical algorithms where objects are assigned to k clusters using an iterative procedure that minimises the within-clusters sums of squares. Other available clustering methods include twinspan, self-organising maps, dbscan and Dirichlet multinomial mixtures (DMM). DMM were specifically developed to analyse MG data but can be equally useful for other sequencing-based omic datasets.

\small
```{r eval=FALSE}
# Load the dataset
data <- read.csv("mydata.csv")

# Perform K-means clustering
k <- 3  # number of clusters
km <- kmeans(data, k)

# View the cluster assignments
head(km$cluster)
```
\normalsize

\small
```{r eval=FALSE}
# Load the package
library(DirichletMultinomial)

# Load the dataset
data <- read.csv("mydata.csv")

# Fit Dirichlet multinomial mixture model
model <- DMM(data, K=3, alpha=1, beta=1)

# View the cluster assignments
head(model$Z)
```
\normalsize

## Dimension reduction and ordination {#dimension-reduction-ordination}

Ordination is a method complementary to data clustering, which enables displaying differences among samples graphically through reducing the dimensions of the original data set, so that similar objects are near and dissimilar objects are farther from each other.

### Principal Component Analysis (PCA) {#pca}

Principal component analysis (PCA) is one of the most widely applied methods for ordination. PCA generates new synthetic variables (principal components) that are linear combinations of the original variables and capture as much variance of the original data as possible. The principal components are orthogonal to each other and correspond to the successive dimensions of maximum variance of the scatter of points. The distance preserved among objects is euclidean and the relationships among variables are linear, thus PCA should generally be applied after appropriate transformations.

\small
```{r eval=FALSE}
# Load the dataset
data <- read.csv("mydata.csv")

# Perform PCA
pca <- prcomp(data, scale = TRUE)

# View the results
summary(pca)

# Plot the results
plot(pca, type = "l")
```
\normalsize

### Principal Coordinate Analysis (PCoA) {#pcoa}

Principal Coordinate Analysis (PCoA) is a multivariate analysis technique used to visualise and explore the patterns of variation in multivariate data. It is similar to Principal Component Analysis (PCA) but is specifically designed for distance-based data. PCoA transforms a distance matrix into a set of coordinates that can be plotted in two or three dimensions, allowing for visualisation of the relationships between samples based on their dissimilarity.

In multi-omics research, PCoA can be used to analyse and visualise the relationships between samples based on their similarity or dissimilarity in multiple omics data types, such as gene expression, metabolomics, or proteomics. By performing PCoA on these data types separately and then comparing the results, researchers can gain insight into how different omics layers contribute to the overall variation between samples. Additionally, PCoA can be used to identify groups or clusters of samples with similar omics profiles, which can provide insight into underlying biological processes or disease states. Overall, PCoA is a powerful tool for exploring and visualising the complex relationships between multiple omics data types in multi-omics research.

\small
```{r eval=FALSE}
# Load the distance matrix
dist_mat <- read.csv("mydistances.csv", row.names = 1)

# Perform PCoA
pcoa <- cmdscale(dist_mat, k = 2, eig = TRUE, add = TRUE)

# View the results
summary(pcoa)

# Plot the results
plot(pcoa$points, type = "n", xlab = "PCo1", ylab = "PCo2")
text(pcoa$points, labels = rownames(pcoa$points))
```
\normalsize

### Non-metric Multidimensional Scaling (NMDS) {#nmds}

Non-metric Multidimensional Scaling (NMDS) is a multivariate analysis technique used to visualize and explore the patterns of variation in multivariate data. It is similar to Principal Coordinate Analysis (PCoA) but is more flexible in that it can handle non-linear relationships between variables. NMDS transforms a distance matrix into a set of coordinates that can be plotted in two or three dimensions, allowing for visualisation of the relationships between samples based on their dissimilarity. Unlike PCoA, NMDS does not assume a linear relationship between the distance matrix and the coordinates, making it a more powerful tool for analysing complex and non-linear relationships in multivariate data.

In multi-omics research, NMDS can be used to analyse and visualise the relationships between samples based on their similarity or dissimilarity in multiple omics data types, such as gene expression, metabolomics, or proteomics. By performing NMDS on these data types separately and then comparing the results, researchers can gain insight into how different omics layers contribute to the overall variation between samples. Additionally, NMDS can be used to identify groups or clusters of samples with similar omics profiles, which can provide insight into underlying biological processes or disease states. Overall, NMDS is a powerful tool for exploring and visualising the complex relationships between multiple omics data types in multi-omics research, particularly when the relationships between variables are non-linear.

\small
```{r eval=FALSE}
# Load the dataset
data <- read.csv("mydata.csv", row.names = 1)

# Perform NMDS
library(vegan)
nmds <- metaMDS(data, distance = "bray")

# View the results
summary(nmds)

# Plot the results
plot(nmds$points, type = "n", xlab = "NMDS1", ylab = "NMDS2")
text(nmds$points, labels = rownames(nmds$points))
```
\normalsize

### t-Distributed Stochastic Neighbour Embedding (t-SNE) {#t-sne}

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique used to visualise high-dimensional data in a low-dimensional space. t-SNE is particularly useful when exploring complex and nonlinear relationships between variables, and can be applied to various types of data including gene expression, proteomics, and metabolomics data. t-SNE works by first constructing a probability distribution over pairs of high-dimensional objects, such as genes or proteins, and then constructing a similar probability distribution over pairs of low-dimensional points. The technique then optimizes these probability distributions to minimise the divergence between them, resulting in a low-dimensional representation of the high-dimensional data.

In multi-omics research, t-SNE can be used to analyse and visualise the relationships between samples based on their omics profiles. By performing t-SNE on multiple omics data types separately and then comparing the results, researchers can gain insight into how different omics layers contribute to the overall variation between samples. Additionally, t-SNE can be used to identify clusters or groups of samples with similar omics profiles, which can provide insight into underlying biological processes or disease states. Overall, t-SNE is a powerful tool for visualising high-dimensional data in a low-dimensional space, allowing researchers to explore and analyse complex relationships in multi-omics research.

\small
```{r eval=FALSE}
# Load the dataset
data <- read.csv("mydata.csv", row.names = 1)

# Perform t-SNE
library(Rtsne)
tsne <- Rtsne(data, dims = 2, perplexity = 30, verbose = TRUE)

# View the results
summary(tsne)

# Plot the results
plot(tsne$Y, col = "blue", pch = 19, xlab = "t-SNE1", ylab = "t-SNE2")
```
\normalsize

### Uniform manifold approximation and projection (UMAP) {#umap}

Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimension reduction technique used to visualise high-dimensional data in a low-dimensional space. It is similar to t-Distributed Stochastic Neighbour Embedding (t-SNE) but is faster and more scalable, making it useful for larger datasets. UMAP works by constructing a fuzzy topological representation of the high-dimensional data and then optimising a low-dimensional representation that preserves the structure of this topological representation. This results in a low-dimensional representation of the high-dimensional data that preserves complex relationships between variables.

In multi-omics research, UMAP can be used to analyse and visualise the relationships between samples based on their omics profiles. By performing UMAP on multiple omics data types separately and then comparing the results, researchers can gain insight into how different omics layers contribute to the overall variation between samples. Additionally, UMAP can be used to identify clusters or groups of samples with similar omics profiles, which can provide insight into underlying biological processes or disease states. Overall, UMAP is a powerful tool for visualising high-dimensional data, particularly for large and complex datasets.

\small
```{r eval=FALSE}
# Load the dataset
data <- read.csv("mydata.csv", row.names = 1)

# Perform UMAP
library(umap)
umap_result <- umap(data, n_components = 2, n_neighbors = 30)

# View the results
summary(umap_result)

# Plot the results
plot(umap_result$layout, col = "blue", pch = 19, xlab = "UMAP1", ylab = "UMAP2")
```
\normalsize