parallel support for prepScores2 #25

Open
DavisBrian opened this issue Dec 10, 2015 · 9 comments
@DavisBrian (Owner)

Add parallel support for creating seqMeta objects.

@vsvinti commented Feb 20, 2017

How about parallel support for skatOMeta? It can take up to two weeks to run on large datasets with the saddlepoint method, which I know is slow, but it would be great if it could be improved...

@DavisBrian (Owner, Author)

@vsvinti that seems excessive. I was running a genotype matrix of ~7,800 x 2,000,000 in under 24 hours (which I already think is an excessive run time). From your other posts here, a couple of suggestions:

  1. The biggest time killer is the number of splits you have to make on the genotype matrix based on the SNPInfo data frame. (Taking a large number of subsets from a genotype matrix is a surprisingly slow operation in R. See http://stackoverflow.com/questions/39670743/is-there-a-faster-way-to-subset-a-sparse-matrix-than#)

If you have a large number of SNPs that are not in a gene (aggregateBy unit), you can combine them into a largish pseudo-gene. This still lets you get single-SNP results, and you can remove the pseudo-genes from the output of the gene-based meta-analysis. I typically make each pseudo-gene as large as my largest real gene. This cuts down the number of splits tremendously (in my data sets) and reduced the 24-hour run time above to less than 6 hours.
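A minimal sketch of the pseudo-gene idea in R (the column names `Name` and `gene`, the helper name, and the assumption that intergenic SNPs carry `NA` in the gene column are all illustrative, not a seqMeta requirement):

```r
## Collapse intergenic SNPs (gene == NA) into pseudo-genes of about
## `chunk_size` SNPs each, so the genotype matrix needs far fewer splits.
## Column names "Name"/"gene" are illustrative defaults.
make_pseudo_genes <- function(snpinfo, chunk_size = 50) {
  orphan <- is.na(snpinfo$gene)
  n <- sum(orphan)
  if (n == 0) return(snpinfo)
  # Assign orphans to pseudo-genes of roughly chunk_size SNPs each
  snpinfo$gene[orphan] <- paste0("pseudo_", ceiling(seq_len(n) / chunk_size))
  snpinfo
}
```

After the gene-based meta-analysis, rows whose gene name starts with `pseudo_` can simply be dropped; the single-SNP results are unaffected.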

  2. Break your SNPInfo file and genotype matrix up by chromosome (or some smaller unit). R can be memory-inefficient and make copies of the entire data set at times. I've tried to reduce these as much as I can, but there are a few I can't get rid of. Reducing the size of what gets copied can save quite a bit of time.

  3. With suggestion 2 you can parallelize your code to run the chromosomes concurrently.
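The split-by-chromosome pattern in suggestions 2 and 3 can be sketched with base R's parallel package. This is a sketch of the pattern only: `score_fun` is a placeholder for the real per-chunk call (e.g. a `prepScores2` invocation), and note that `mclapply` relies on forking, so it falls back to serial on Windows:

```r
library(parallel)

## Split SNPInfo by chromosome, subset the genotype matrix to match,
## and run the scoring step for each chromosome in parallel.
## `score_fun` stands in for the real per-chunk call, e.g.
##   function(Z, si) prepScores2(Z, y ~ 1, SNPInfo = si, data = pheno)
run_by_chrom <- function(geno, snpinfo, score_fun, cores = 2) {
  mclapply(split(snpinfo, snpinfo$chr), function(si) {
    Z <- geno[, colnames(geno) %in% si$Name, drop = FALSE]
    score_fun(Z, si)
  }, mc.cores = cores)
}
```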

Good Luck,

Brian

@vsvinti commented Feb 20, 2017

Brian, thanks for your suggestions. My matrix doesn't have all that many variants (< 1 million), but the run time seems to increase significantly with the number of samples (I am running on datasets between ~1,000 and 20,000 samples, and I notice the slower performance on the latter). I am also running the saddlepoint method, which is probably slower. That said, I have split my setID into n chunks based on the gene list and am parallelising it that way. So if the calculations of p-values etc. are purely at gene level (i.e. they do not borrow regional information from nearby/overlapping genes), then this will do. Can you please confirm that this is the case?

@DavisBrian (Owner, Author)

Is it prepScores2 taking the most time or skatOMeta?

All the prep* functions are linear in length(unique(snpinfo[ , aggregateBy])). skatOMeta should be relatively fast by comparison. While the saddlepoint method is the slowest method, I don't think it should be slower than prepScores2.

@vsvinti commented Feb 20, 2017

skatOMeta is the slower one. I run prepScores2 once and save the objects, then load these to run my metas between the various datasets. I can let one job run without parallelisation for the ~20k samples and let you know if/when it completes, if that's helpful.

So just to clarify re the above: if I split my setID based on x number of genes and run in parallel, this is OK (i.e. the p-values are not affected by some regional component or overlapping genes etc.)? For example, if I have 8 variants in gene A and run skatOMeta only on this gene, would I get the same results as if I ran it for all the genes/variants in my dataset?

@DavisBrian (Owner, Author)

Yes. The unit of analysis is the gene. As long as all the variants of a gene are included, you will get the same answer. One thing to remember when you do this is to also subset your SNPInfo file (i.e. if you run only geneA, make sure your SNPInfo file contains only geneA and your genotype matrix contains only geneA's variants).
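A sketch of keeping the two objects consistent for a chosen gene set (the helper name and the column names `Name`/`gene` are illustrative; the resulting pieces would then be passed on to prepScores2/skatOMeta):

```r
## Subset SNPInfo and the genotype matrix together so both contain
## exactly the variants of the requested genes (e.g. genes = "geneA").
subset_for_genes <- function(geno, snpinfo, genes) {
  si <- snpinfo[snpinfo$gene %in% genes, , drop = FALSE]
  Z  <- geno[, colnames(geno) %in% si$Name, drop = FALSE]
  list(Z = Z, SNPInfo = si)
}
```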

Two side notes...

  1. Typically I subset on multiple genes (chromosome level). I find the single-gene level is too much upfront cost, but I've never tried to optimize this.
  2. skatOMeta - See Note 4 in the help file. It's probably better to run a subset of your results if you are using saddlepoint.

@vsvinti commented Feb 20, 2017

Thanks Brian. Yes, I split my setID into, say, 10 chunks based on genes (though chromosome level would be more sensible). However, I am using the same un-split prepScores object. Apart from some warnings, are there any disadvantages to not subsetting the prepScores object as well? It still runs much faster than in one go.

@DavisBrian (Owner, Author)

The main disadvantage is that you are passing, and making multiple copies of, large objects, which is an inefficient use of memory and causes some slowdown. The results should be the same, though.

@vsvinti commented Feb 20, 2017

Cheers! Thanks so much for all your help.
