parallel support for prepScores2 #25

Open
DavisBrian opened this issue Dec 10, 2015 · 9 comments
@DavisBrian (Owner)

Add parallel support for creating seqMeta objects.

@vsvinti commented Feb 20, 2017

How about parallel support for skatOMeta? It can take up to two weeks to run on large datasets with the saddlepoint method, which I know is slow, but it would be great if it could be improved...

@DavisBrian (Owner, Author)

@vsvinti that seems excessive. I was running a genotype matrix of ~7,800 x 2,000,000 in under 24 hours (which I already think is an excessive run time). From your other posts here, a couple of suggestions:

  1. The biggest time killer is the number of splits you have to make on the genotype matrix based on the SNPInfo data frame. (Taking a large number of subsets from a genotype matrix is a surprisingly slow operation in R. See http://stackoverflow.com/questions/39670743/is-there-a-faster-way-to-subset-a-sparse-matrix-than#)

If you have a large number of SNPs that are not in a gene (aggregateBy unit), you can combine them into a largish pseudo-gene. This still lets you get single-SNP results, and you can remove the pseudo-genes from the output of the gene-based meta-analysis. I typically make each pseudo-gene as large as my largest real gene. This cuts down the number of splits tremendously (in my data sets) and reduced the 24-hour run time above to less than 6 hours.
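A minimal sketch of the pseudo-gene idea in R (the column names `Name` and `gene`, the helper name, and the assumption that intergenic SNPs carry `NA` in the gene column are all illustrative, not a seqMeta requirement):

```r
## Collapse intergenic SNPs (gene == NA) into pseudo-genes of about
## `chunk_size` SNPs each, so the genotype matrix needs far fewer splits.
## Column names "Name"/"gene" are illustrative defaults.
make_pseudo_genes <- function(snpinfo, chunk_size = 50) {
  orphan <- is.na(snpinfo$gene)
  n <- sum(orphan)
  if (n == 0) return(snpinfo)
  # Assign orphans to pseudo-genes of roughly chunk_size SNPs each
  snpinfo$gene[orphan] <- paste0("pseudo_", ceiling(seq_len(n) / chunk_size))
  snpinfo
}
```

After the gene-based meta-analysis, rows whose gene name starts with `pseudo_` can simply be dropped; the single-SNP results are unaffected.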

  2. Break your SNPInfo file and genotype matrix up by chromosome (or some smaller unit). R can be memory-inefficient and make copies of the entire data set at times. I've tried to reduce these as much as I can, but there are a few I can't get rid of. Reducing the size of what gets copied can save quite a bit of time.

  3. With suggestion 2 you can parallelize your code to run the chromosomes concurrently.
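The split-by-chromosome pattern in suggestions 2 and 3 can be sketched with base R's parallel package. This is a sketch of the pattern only: `score_fun` is a placeholder for the real per-chunk call (e.g. a `prepScores2` invocation), and note that `mclapply` relies on forking, so it falls back to serial on Windows:

```r
library(parallel)

## Split SNPInfo by chromosome, subset the genotype matrix to match,
## and run the scoring step for each chromosome in parallel.
## `score_fun` stands in for the real per-chunk call, e.g.
##   function(Z, si) prepScores2(Z, y ~ 1, SNPInfo = si, data = pheno)
run_by_chrom <- function(geno, snpinfo, score_fun, cores = 2) {
  mclapply(split(snpinfo, snpinfo$chr), function(si) {
    Z <- geno[, colnames(geno) %in% si$Name, drop = FALSE]
    score_fun(Z, si)
  }, mc.cores = cores)
}
```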

Good Luck,

Brian

@vsvinti commented Feb 20, 2017

Brian, thanks for your suggestions. My matrix doesn't have all that many variants (< 1 million), but the run time seems to increase significantly with the number of samples (I am running on datasets between ~1,000 and 20,000 samples, and I notice the slower performance on the latter). I am also running the saddlepoint method, which is probably slower. That said, I have split my setID into n chunks based on the gene list and am parallelising it that way. So if the calculations of p-values etc. are purely at gene level (i.e. they do not borrow regional information from nearby/overlapping genes), then this will do. Can you please confirm that this is the case?

@DavisBrian (Owner, Author)

Is it prepScores2 taking the most time or skatOMeta?

All the prep* functions are linear in length(unique(snpinfo[ , aggregateBy])). skatOMeta should be relatively fast by comparison. While the saddlepoint method is the slowest method, I don't think it should be slower than prepScores2.

@vsvinti commented Feb 20, 2017

skatOMeta is the slower one. I run prepScores2 once and save the objects, then load these to run my metas between the various datasets. I can let one job run without parallelisation for the ~20k samples and let you know if/when it completes, if that's helpful.

So just to clarify re the above: if I split my setID based on x number of genes and run in parallel, this is OK (i.e. the p-values are not affected by some regional component or overlapping genes etc.)? For example, if I have 8 variants in gene A and run skatOMeta only on this gene, would I get the same results as if I ran it for all the genes/variants in my dataset?

@DavisBrian (Owner, Author)

Yes. The unit of analysis is the gene. As long as all the variants of a gene are included, you will get the same answer. One thing to remember when you do this is to also subset your SNPInfo file (i.e. if you run only geneA, make sure your SNPInfo file contains only geneA and your genotype matrix contains only geneA's variants).
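A sketch of keeping the two objects consistent for a chosen gene set (the helper name and the column names `Name`/`gene` are illustrative; the resulting pieces would then be passed on to prepScores2/skatOMeta):

```r
## Subset SNPInfo and the genotype matrix together so both contain
## exactly the variants of the requested genes (e.g. genes = "geneA").
subset_for_genes <- function(geno, snpinfo, genes) {
  si <- snpinfo[snpinfo$gene %in% genes, , drop = FALSE]
  Z  <- geno[, colnames(geno) %in% si$Name, drop = FALSE]
  list(Z = Z, SNPInfo = si)
}
```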

Two side notes...

  1. Typically I subset on multiple genes (chromosome level). I find the single-gene level is too much upfront cost, but I've never tried to optimize this.
  2. skatOMeta - See Note 4 in the help file. It's probably better to run a subset of your results if you are using saddlepoint.

@vsvinti commented Feb 20, 2017

Thanks Brian. Yes, I split my setID into, say, 10 chunks based on genes (though chromosome level would be more sensible). However, I am using the same un-split prepScores object. Apart from some warnings, are there any disadvantages to not subsetting the prepScores object as well? It still runs much faster than in one go.

@DavisBrian (Owner, Author)

The main disadvantage is that you are passing, and making multiple copies of, large objects, which is an inefficient use of memory and causes some slowdown. The results should be the same, though.

@vsvinti commented Feb 20, 2017

Cheers! Thanks so much for all your help.
