Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hicPCA and compartmentalization. Setting interaction range. #647

Open
kimj50 opened this issue Dec 11, 2020 · 7 comments
Open

hicPCA and compartmentalization. Setting interaction range. #647

kimj50 opened this issue Dec 11, 2020 · 7 comments

Comments

@kimj50
Copy link

kimj50 commented Dec 11, 2020

Hi,
Thank you for the amazing tool.
I was wondering if it is possible to compute hicPCA or hicCompartmentalization on matrix with limited interaction range (ex. use interactions between loci less than 5mb).
This largely comes from problem that is specific our system: the segments of the compartments are super tiny, and the checker-board pattern falls off quickly with increasing genomic distance. So we have to use fairly small bin size (no greater than 10kb) for PCA to work, but this also increases noise in long-range interactions.
Heres a snap shot of 2mb region.
image
We've dealt with this issue by combination of sequencing more reads and subseting matrix using hicAdjustMatrix, but we don't like that we are computing PCA for different regions of chromosomes separately. So, I was hoping if you guys had some insights on this.
Thank you,
Jun

@joachimwolff
Copy link
Collaborator

Hi Jun,

so you basically need a --maximalRegion parameter that considers only the interactions for this genomic distance?

I need to check the source and the mathematical properties to see if this is possible.

Best,

Joachim

@kimj50
Copy link
Author

kimj50 commented Dec 14, 2020

Yes! thank you - Jun

@gtrichard
Copy link
Collaborator

I had the same issue here: https://www.cell.com/cell/pdf/S0092-8674(20)30621-8.pdf (early drosophila embryos with low compartmentalization). We ended-up computing PCA for submatrices of pre-defined size (near the diagonal, tried different sizes) with a sliding window approach and stitched the resulting eigen vectors. The advantage of sliding window is that you can compare if the eigenvector "signal" is similar or not between overlapping regions, and they most likely are very similar, so it's safe to stitch the eigenvector values of adjacent windows. The method is described in the supplementary of that paper. I might have a jupyternotebook about this lying around.

@kimj50
Copy link
Author

kimj50 commented Jan 5, 2021

Oh man. This is exactly what we were looking for! Finding the proper values for sliding window (a,a+w) will be important for our system, c elegans. Conceptually I can understand the method section, but I come from biology background, so I will definitely have a lot of trouble executing the method section. Could you share with us if you find that notebook? Thanks! - Jun

@mxw010
Copy link

mxw010 commented Dec 13, 2021

This is a very, very bad idea.

PCA is not numerical addition/multiplication. It is a matrix transformation, so when you do a subset of your data, you change the space that you are operating in, and therefore the decomposition (which is what the PCA is, essentially) get modified in unpredictable ways. A Compartment analysis of a subset of a chromosome does not necessarily retain the Compartment done on the whole chromosome arm.

There are algorithm to approximate PCAs for large matrix. This is not the way at all. See for yourself. I did a 8x8 vs 4x4 example.

import random
import numpy as np
from sklearn.decomposition import PCA

#set seeds for reproducible results
random.seed(16447)
numbers = []
#size of full dim matrix
n=8
for x in range(0,n*n):
    numbers.append(random.gauss(0,3))

y = np.array(numbers).reshape((n,n))
# for simplicity, do Y^T * Y for a symmetric matrix
full_mat = np.dot(y.T, y)

# chose the first 4 column and rows 
partial_mat = full_mat[0:4,0:4]

#fit PCA for both matrices
pca_partial = PCA(n_components=4)
pca_full = PCA(n_components=n)

pca_partial.fit(partial_mat)
#check variance, singular value, PC
print(pca_partial.explained_variance_)
print(pca_partial.explained_variance_ratio_)
print(pca_partial.singular_values_)
print(pca_partial.components_)

#none of them match
pca_full.fit(full_mat)
print(pca_full.explained_variance_)
print(pca_full.explained_variance_ratio_)
print(pca_full.singular_values_)

@LeilyR
Copy link
Collaborator

LeilyR commented Dec 13, 2021

I am not sure if this about our tool or the latest comment on this issue. If you are talking about how we handle hicPCA after hicAdjustmatrix , this is not about the size of matrix but about dealing with the biological question we are trying to answer. The reason to remove a region of chromosome is to remove a bias from the not very well assembled parts or very repetitive centromeric regions. If those regions well be kept, we will introduce variation which are nothing but artifact. Of course is up to the user to be careful about what part of genome they are removing.

@joachimwolff
Copy link
Collaborator

joachimwolff commented Dec 13, 2021

Thanks for your contribution, but I don't get your point. A different input will lead to a different result, that's quite clear. However, even if the result is not 1:1 comparable to a PCA on the full matrix, and also if the result is something else in mathematical terms, it might still be a valid result in terms of biological interpretation. It's then just a different algorithm. You always need to verify your results with orthogonal data, and if that shows it's valid, the approach can be used. Therefore I am not sure what kind of issue you are raising here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants