hicPCA and compartmentalization. Setting interaction range. #647

kimj50 · 2020-12-11T16:55:26Z

Hi,
Thank you for the amazing tool.
I was wondering if it is possible to compute hicPCA or hicCompartmentalization on matrix with limited interaction range (ex. use interactions between loci less than 5mb).
This largely comes from problem that is specific our system: the segments of the compartments are super tiny, and the checker-board pattern falls off quickly with increasing genomic distance. So we have to use fairly small bin size (no greater than 10kb) for PCA to work, but this also increases noise in long-range interactions.
Heres a snap shot of 2mb region.

We've dealt with this issue by combination of sequencing more reads and subseting matrix using hicAdjustMatrix, but we don't like that we are computing PCA for different regions of chromosomes separately. So, I was hoping if you guys had some insights on this.
Thank you,
Jun

joachimwolff · 2020-12-14T08:55:47Z

Hi Jun,

so you basically need a --maximalRegion parameter that considers only the interactions for this genomic distance?

I need to check the source and the mathematical properties to see if this is possible.

Best,

Joachim

kimj50 · 2020-12-14T14:14:48Z

Yes! thank you - Jun

gtrichard · 2021-01-05T06:55:36Z

I had the same issue here: https://www.cell.com/cell/pdf/S0092-8674(20)30621-8.pdf (early drosophila embryos with low compartmentalization). We ended-up computing PCA for submatrices of pre-defined size (near the diagonal, tried different sizes) with a sliding window approach and stitched the resulting eigen vectors. The advantage of sliding window is that you can compare if the eigenvector "signal" is similar or not between overlapping regions, and they most likely are very similar, so it's safe to stitch the eigenvector values of adjacent windows. The method is described in the supplementary of that paper. I might have a jupyternotebook about this lying around.

kimj50 · 2021-01-05T17:52:42Z

Oh man. This is exactly what we were looking for! Finding the proper values for sliding window (a,a+w) will be important for our system, c elegans. Conceptually I can understand the method section, but I come from biology background, so I will definitely have a lot of trouble executing the method section. Could you share with us if you find that notebook? Thanks! - Jun

mxw010 · 2021-12-13T00:04:45Z

This is a very, very bad idea.

PCA is not numerical addition/multiplication. It is a matrix transformation, so when you do a subset of your data, you change the space that you are operating in, and therefore the decomposition (which is what the PCA is, essentially) get modified in unpredictable ways. A Compartment analysis of a subset of a chromosome does not necessarily retain the Compartment done on the whole chromosome arm.

There are algorithm to approximate PCAs for large matrix. This is not the way at all. See for yourself. I did a 8x8 vs 4x4 example.

import random
import numpy as np
from sklearn.decomposition import PCA

#set seeds for reproducible results
random.seed(16447)
numbers = []
#size of full dim matrix
n=8
for x in range(0,n*n):
    numbers.append(random.gauss(0,3))

y = np.array(numbers).reshape((n,n))
# for simplicity, do Y^T * Y for a symmetric matrix
full_mat = np.dot(y.T, y)

# chose the first 4 column and rows 
partial_mat = full_mat[0:4,0:4]

#fit PCA for both matrices
pca_partial = PCA(n_components=4)
pca_full = PCA(n_components=n)

pca_partial.fit(partial_mat)
#check variance, singular value, PC
print(pca_partial.explained_variance_)
print(pca_partial.explained_variance_ratio_)
print(pca_partial.singular_values_)
print(pca_partial.components_)

#none of them match
pca_full.fit(full_mat)
print(pca_full.explained_variance_)
print(pca_full.explained_variance_ratio_)
print(pca_full.singular_values_)

LeilyR · 2021-12-13T08:32:12Z

I am not sure if this about our tool or the latest comment on this issue. If you are talking about how we handle hicPCA after hicAdjustmatrix , this is not about the size of matrix but about dealing with the biological question we are trying to answer. The reason to remove a region of chromosome is to remove a bias from the not very well assembled parts or very repetitive centromeric regions. If those regions well be kept, we will introduce variation which are nothing but artifact. Of course is up to the user to be careful about what part of genome they are removing.

joachimwolff · 2021-12-13T13:17:10Z

Thanks for your contribution, but I don't get your point. A different input will lead to a different result, that's quite clear. However, even if the result is not 1:1 comparable to a PCA on the full matrix, and also if the result is something else in mathematical terms, it might still be a valid result in terms of biological interpretation. It's then just a different algorithm. You always need to verify your results with orthogonal data, and if that shows it's valid, the approach can be used. Therefore I am not sure what kind of issue you are raising here.

joachimwolff added the PCA issue label Jul 9, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hicPCA and compartmentalization. Setting interaction range. #647

hicPCA and compartmentalization. Setting interaction range. #647

kimj50 commented Dec 11, 2020

joachimwolff commented Dec 14, 2020

kimj50 commented Dec 14, 2020

gtrichard commented Jan 5, 2021

kimj50 commented Jan 5, 2021

mxw010 commented Dec 13, 2021

LeilyR commented Dec 13, 2021 •

edited

Loading

joachimwolff commented Dec 13, 2021 •

edited

Loading

hicPCA and compartmentalization. Setting interaction range. #647

hicPCA and compartmentalization. Setting interaction range. #647

Comments

kimj50 commented Dec 11, 2020

joachimwolff commented Dec 14, 2020

kimj50 commented Dec 14, 2020

gtrichard commented Jan 5, 2021

kimj50 commented Jan 5, 2021

mxw010 commented Dec 13, 2021

LeilyR commented Dec 13, 2021 • edited Loading

joachimwolff commented Dec 13, 2021 • edited Loading

LeilyR commented Dec 13, 2021 •

edited

Loading

joachimwolff commented Dec 13, 2021 •

edited

Loading