parallel support for prepScores2 #25
Comments
How about parallel support for skatOMeta? It can take up to 2 weeks to run on large datasets. That is with the saddlepoint method, which I know is slow, but it would be great if it could be improved...
@vsvinti that seems excessive. I was running a genotype matrix of ~7800 x 2,000,000 in under 24 hours (which I still think is too long). Based on your other posts here, a couple of suggestions:
If you have a large number of SNPs that are not in a gene (the aggregateBy unit), you can combine them into largish pseudo genes. You still get single-SNP results, and you can remove the pseudo genes from the output of the gene-based meta-analysis. I typically use a pseudo gene size as large as my largest real gene. This cuts down the number of splits (in my data sets) tremendously and reduced the 24-hour run time above to less than 6 hours.
Good luck, Brian
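The pseudo-gene grouping described above can be sketched in a few lines. This is only an illustration of the bookkeeping, not seqMeta code: the function names, the `PSEUDO_` prefix, and the list-of-pairs layout for the SNP info are all assumptions made for the example.

```python
from collections import Counter


def assign_pseudo_genes(snp_info, prefix="PSEUDO_"):
    """Give every SNP without a gene a pseudo-gene label.

    snp_info is a list of (snp_name, gene_or_None) pairs (hypothetical layout).
    Pseudo genes are filled up to the size of the largest real gene.
    """
    real_sizes = Counter(gene for _, gene in snp_info if gene is not None)
    chunk_size = max(real_sizes.values(), default=1)

    labelled = []
    n_unassigned = 0
    for snp, gene in snp_info:
        if gene is None:
            # Start a new pseudo gene every chunk_size unassigned SNPs.
            gene = f"{prefix}{n_unassigned // chunk_size + 1}"
            n_unassigned += 1
        labelled.append((snp, gene))
    return labelled


def drop_pseudo_genes(gene_results, prefix="PSEUDO_"):
    """Remove pseudo-gene rows from a {gene: result} mapping."""
    return {g: r for g, r in gene_results.items() if not g.startswith(prefix)}
```

The single-SNP results are unaffected by the labels; only the gene-level output needs the pseudo genes filtered out afterwards.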
Brian, thanks for your suggestions. My matrix doesn't have all that many variants (< 1 million), but the run time increases significantly with the number of samples (I am running on datasets of between ~1000 and 20,000 samples, and I notice the slower performance on the larger ones). I am running the saddlepoint method, which is probably also slower. That said, I have split my setID into n chunks based on the gene list and am parallelising it that way. So if the calculation of P-values etc. is purely at gene level (i.e. it does not borrow regional information from nearby/overlapping genes), then this will do. Can you please confirm that this is the case?
Is it prepScores2 taking the most time, or skatOMeta? All the prep* functions are linear in length(unique(snpinfo[ , aggregateBy])). skatOMeta should be relatively fast by comparison. While saddlepoint is the slowest method, I don't think it should be slower than prepScores2.
skatOMeta is the slower one. I run prepScores2 once and save the objects, then load them and use them to run my metas between the various datasets. I can let one job run without the parallelisation for the ~20k samples and let you know if/when it completes, if that's helpful. So just to clarify re the above: if I split up my setID based on x number of genes and run in parallel, this is ok (i.e. the P-values are not affected by some regional component or overlapping genes etc.)? For example, if I have 8 variants in gene A and run skatOMeta only on this gene, would I get the same results as if I were to run it for all the genes/variants in my dataset?
Yes. The unit of analysis is the gene. As long as all the variants of the gene are included, you will get the same answer. One thing to remember when you do this is to also subset your SNPInfo file (i.e. if you run only geneA, make sure your SNPInfo file only contains geneA and your genotype matrix only contains geneA). Two side notes...
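As a rough sketch of the splitting described above: chunk the gene list, subset both the SNP info and the genotype data to each chunk, run the chunks concurrently, and merge the results. Everything here is hypothetical scaffolding; in particular `gene_level_stat` is a stand-in for the real gene-based test (it just sums allele counts), and a thread pool is used only to keep the example short, where separate jobs or processes would be the more likely choice in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_genes(genes, n_chunks):
    """Split the unique gene names into at most n_chunks non-empty lists."""
    genes = sorted(set(genes))
    chunks = [genes[i::n_chunks] for i in range(n_chunks)]
    return [c for c in chunks if c]


def subset_for_chunk(snp_info, genotypes, chunk):
    """Keep only the SNP-info rows and genotype columns for genes in chunk."""
    keep = set(chunk)
    info = [(snp, gene) for snp, gene in snp_info if gene in keep]
    geno = {snp: genotypes[snp] for snp, _ in info}
    return info, geno


def gene_level_stat(snp_info, genotypes):
    """Placeholder for the real per-gene test: sums allele counts per gene."""
    stats = {}
    for snp, gene in snp_info:
        stats[gene] = stats.get(gene, 0) + sum(genotypes[snp])
    return stats


def run_chunked(snp_info, genotypes, n_chunks, workers=2):
    """Run gene_level_stat on each gene chunk concurrently and merge results."""
    chunks = chunk_genes([gene for _, gene in snp_info], n_chunks)
    subsets = [subset_for_chunk(snp_info, genotypes, c) for c in chunks]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda s: gene_level_stat(*s), subsets))
    merged = {}
    for part in parts:
        merged.update(part)  # genes never span chunks, so keys do not collide
    return merged
```

Because every variant of a gene stays in the same chunk, the merged output matches a single unchunked run of the same per-gene function.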
Thanks Brian. Yes, I split my setID into, say, 10 chunks based on genes (though splitting at chromosome level would be more sensible). However, I am using the same un-split prepScores object. Apart from some warnings, are there any disadvantages to not subsetting the prepScores object as well? It still runs much faster than in one go.
The main disadvantage is that you are passing, and making multiple copies of, large objects, which is an inefficient use of memory and can cause some slowdown. The results should be the same though.
Cheers! Thanks so much for all your help.
Add parallel support for creating seqMeta objects.