Create tabix index failed #117
Comments
Hard to say what has happened, have you tried to inspect the saved file at the returned path? I believe the sumstats isn't ordered correctly given below:
If you could provide a small sample dataset that gives the same error I will be happy to debug. Cheers, |
Hi Alan, I was not able to open the file and I don't know if that's because it's damaged. Here is the link of the GWAS dataset I used: Many thanks! Best, |
Hey! So firstly, processing this sumstats took 23.17 minutes to run on my machine. Given it took days for you to run it I think you should look to get access to a more powerful machine for future runs. Secondly, the issue with tabix is frustrating since it is due to an issue with seqminer::tabix.createIndex which we call to make the tabix file. The error you see is:
However, I manually checked the rows around 769472 and they are correctly sorted by CHR and BP:
We have noted this issue before; however, it only seems to happen with some sumstats, seemingly at random. We have created an issue about it here: zhanxw/seqminer#25. @bschilder have we heard anything back on this? My advice is to run MSS with Thanks, |
Hi Alan, I tried running tabix_index=FALSE and got a new error: Could you please help me fix it? Thanks! Best, |
Hi Phoebe, So the vector memory exhausted is telling you that you don't have enough RAM to run the sumstats through MSS. You will need to get access to a machine with more RAM (perhaps if you are part of a university, on their HPC). Cheers, |
Hello, I'm wondering how to convert formatted summary statistics to a tabix file if I have to run MSS with tabix_index=FALSE. Thank you! |
Hey, have a look at seqminer::tabix.createIndex() which is what MSS uses but there are other options out there for R which should be easily findable with a quick google |
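As one example of the "other options" mentioned above, here is a minimal sketch using Rsamtools (a swapped-in alternative, not what MSS calls internally). It assumes Rsamtools is installed, that the file is an uncompressed, tab-delimited table already sorted by chromosome and position, and that it has a single header line containing CHR and BP columns. The function name `index_sumstats_tabix` and its arguments are hypothetical.

```r
# Sketch: bgzip-compress and tabix-index a sorted, tab-delimited sumstats file.
# Assumption: Rsamtools is installed; the file has one header line.
index_sumstats_tabix <- function(tsv_path, chrom_col = "CHR", pos_col = "BP") {
  # Look up the 1-based column numbers of the chromosome and position columns.
  header <- strsplit(readLines(tsv_path, n = 1L), "\t", fixed = TRUE)[[1]]
  chrom_i <- match(chrom_col, header)
  pos_i <- match(pos_col, header)
  stopifnot(!is.na(chrom_i), !is.na(pos_i))

  # Block-gzip the file; tabix cannot index plain gzip or uncompressed text.
  bgz_path <- Rsamtools::bgzip(tsv_path,
                               dest = paste0(tsv_path, ".bgz"),
                               overwrite = TRUE)

  # Build the index; skip = 1L tells tabix to ignore the header line.
  # Start and end point at the same column because each row is one SNP.
  Rsamtools::indexTabix(bgz_path,
                        seq = chrom_i,
                        start = pos_i,
                        end = pos_i,
                        skip = 1L)
}
```

Calling `index_sumstats_tabix("my_sumstats.tsv")` (a hypothetical path) would be expected to write `my_sumstats.tsv.bgz` and a `.tbi` index next to it. If the rows are not sorted, the indexing step should fail with an ordering error similar to the one reported in this issue.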
Hey @PhoebeGuo97, this should now be resolved and the dev version of MSS should allow tabix indexing https://github.com/neurogenomics/MungeSumstats/issues/122 |
Hey @Al-Murphy, Thanks for the update. I tried re-running the MSS and got different errors :( Tabix-indexing file. |
Hey @PhoebeGuo97, Is this with the same sumstats you linked above? - https://ctg.cncr.nl/documents/p1651/AD_sumstats_Jansenetal_2019sept.txt.gz Can you confirm the version of MSS you are using to get this error? Alan. |
Hi @Al-Murphy, Yes. The version of MSS is 1.5.14. Thanks! Phoebe |
@bschilder any clue why this error is occurring with the changed tabix code? Thanks! |
Hey @Al-Murphy and @PhoebeGuo97 , sorry it took me so long to get to this. Managed to replicate the issue, just trying to figure out the source of the issue.
This error usually occurs when:
Here's the reprex:
res <- MungeSumstats::format_sumstats(path = "https://ctg.cncr.nl/documents/p1651/AD_sumstats_Jansenetal_2019sept.txt.gz",
                                      ref_genome = "GRCH37",
                                      tabix_index = FALSE)
# Read the formatted file back in so it can be re-written with indexing
sumstats_dt <- data.table::fread(res)
### Fails ###
MungeSumstats::write_sumstats(sumstats_dt = sumstats_dt,
                              save_path = "~/Downloads/sumstats_dt.tsv.gz",
                              tabix_index = TRUE)
# Error: [E::hts_idx_push] Unsorted positions on sequence #2: 71999998 followed by 7
### Find offending rows ###
i <- which(sumstats_dt$CHR==2 & sumstats_dt$BP==71999998)
sumstats_dt[seq(i-2,i+2)]
I believe "sequence" refers to CHR, and numbers refer to the BP. But these positions don't seem to be out of order. |
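Rather than inspecting rows by hand, a whole table can be scanned for the condition the indexer complains about. The sketch below is a base R illustration on made-up data, not part of MSS: it flags every row whose BP is lower than the previous row's BP on the same chromosome. The helper name `find_unsorted_rows` is hypothetical.

```r
# Toy data standing in for a sumstats table (values are made up).
sumstats <- data.frame(
  CHR = c(1, 1, 2, 2, 2, 3),
  BP  = c(100, 250, 500, 480, 900, 50)
)

# Return the indices of rows whose position is lower than the previous
# row's position on the same chromosome.
find_unsorted_rows <- function(dt, chrom_col = "CHR", pos_col = "BP") {
  chrom <- as.character(dt[[chrom_col]])
  pos <- dt[[pos_col]]
  n <- nrow(dt)
  if (n < 2L) return(integer(0))
  same_chrom <- chrom[-1] == chrom[-n]  # compare each row with the one before
  decreasing <- pos[-1] < pos[-n]
  which(same_chrom & decreasing) + 1L
}

bad <- find_unsorted_rows(sumstats)
bad  # row 4: BP 480 follows BP 500 on chromosome 2
```

Note that this only checks the in-memory values. If it returns nothing while indexing still fails, the written file may differ from the table in memory, so it can be worth re-reading the saved file and running the same check on it.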
I also checked the
cdict <- MungeSumstats:::column_dictionary(file_path = res)
cdict
|
Ok, so sorting using a bash command line wrapper in R (instead of
res_sort <- echotabix::sort_coordinates(target_path = res,
                                        chrom_col = "CHR",
                                        start_col = "BP",
                                        save_path = "tmp.tsv.gz")
out <- MungeSumstats::index_tabular(path = res_sort$path)
But I still don't really know why sorting using |
Just found a third option via
gr <- to_granges(sumstats_dt)
gr <- GenomicRanges::sort.GenomicRanges(gr)
sumstats_dt <- granges_to_dt(gr)
This isn't nearly as efficient as the |
Something else to consider in for the future is dissecting the |
I also wonder if maybe |
Could be ignoring factors since factors aren't supported in data.table. What column has factors? There shouldn't be any I don't think? Might be a simple fix to just change these from factors before sorting? |
Didn't know factors weren't supported at all! |
I think it's more that they advise against them is all. Maybe try moving that check to after sorting the order? Should cover this? Not sure of any downstream issues so worth checking? Also just as a side note for when you are working on this, the push you made to use Rworkflows now makes checks run on Windows again which I had removed. Could you change it so it won't run on Windows again (comment out the
Could you also add back in |
Which check do you mean? Writing/indexing always occurs after sorting.
Sure, will do.
This would now be done at the level of the rworkflows action itself. Might be worth me adding this as an extra arg users can control tho. |
Apologies I misunderstood what the factor call was for - I implemented it so X,Y,MT chromosomes would come after the others. The column is changed back to a factor before returning. Could you share a version of the sumstats that is passed into the sort_coordinates function in the version with the error you can replicate? Obviously, it would be best to debug what's causing the issue to keep the commands within data.table without changing to genomic ranges to help with speed. If it is actually a bug with data.table I think we should raise it with them to fix rather than using the genomic ranges approach.
Thanks for sorting those two side things!! |
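The factor-based ordering described above (numbered chromosomes first, then X, Y and MT) can be illustrated with a small base R sketch. This is an illustration on made-up data under assumed chromosome labels, not the actual MSS implementation.

```r
# Toy table with chromosomes stored as character (values are made up).
sumstats <- data.frame(
  CHR = c("X", "10", "2", "MT", "1", "2", "Y"),
  BP  = c(500, 100, 300, 50, 700, 200, 10),
  stringsAsFactors = FALSE
)

# Explicit level order: 1..22, then X, Y, MT. Sorting the raw character
# column would instead put "10" before "2".
chrom_levels <- c(as.character(1:22), "X", "Y", "MT")
chrom_factor <- factor(sumstats$CHR, levels = chrom_levels)

# Order by chromosome level first, then by position within chromosome.
ord <- order(chrom_factor, sumstats$BP)
sorted <- sumstats[ord, ]
rownames(sorted) <- NULL
sorted
```

The factor here is only used as a sort key and the CHR column itself stays character. Any chromosome label missing from `chrom_levels` becomes NA in the factor and is placed last by `order()`, so it is worth checking for NAs before relying on the result.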
Cool, thanks for clarifying @Al-Murphy. I'll work on this a bit more and see if I can figure out a |
Ok, so I think I've fixed this in a way that keeps the speed benefits of There were two main issues I identified:
I've added a new subfunction Will push these changes today. |
GHA is failing due to https://bioconductor.org/packages/release/bioc/html/BSgenome.html Will keep an eye on |
Great thanks Brian! Let me know how that goes with BSgenome |
Found out the full story about these deps not being available from Herve, who kindly took the time to explain it! They should all be back working v soon. |
@bschilder tabix indexing still seems to be failing in 1.7.13 (dev bioconductor):
|
@Al-Murphy this now seems to be working consistently for me. Some of the dep source code may have been fixed since you last checked. |
Great closing for now, we can reopen if it happens again |
I have the same issue with version 1.2.4, using the example data that you provide here: https://bioconductor.org/packages/release/bioc/vignettes/MungeSumstats/inst/doc/MungeSumstats.html
Output is here:
Do we have a solution for this? I need to convert my tab-delimited GWAS summary statistics into a tabix-indexed file. My setup: Bioconductor version 3.14 (BiocManager 1.30.21), R 4.1.2 (2021-11-01). |
Hey @anbai106, MSS version 1.2 is now very old and there have been a lot of changes between that and the current release version 1.6.0. Could you please try updating to the current release and see if this resolves your issue? Cheers, |
Thanks for your reply. Somehow, I followed your tutorial to install the package, but the version that I got is 1.2.4: I don't know if this is because of my R and Bioconductor versions, but it seems ok |
Yep it's because of your R version, you need to install at least R v4.3.0 |
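A quick way to confirm which R and MSS versions a session is actually using is sketched below. It relies only on base R and utils, and the 4.3.0 threshold is taken from the comment above.

```r
# Compare the running R version with the minimum mentioned in this thread.
r_ok <- getRversion() >= "4.3.0"
message("R ", as.character(getRversion()),
        if (r_ok) " meets" else " is below",
        " the 4.3.0 minimum")

# Report the installed MungeSumstats version, if the package is present.
if (requireNamespace("MungeSumstats", quietly = TRUE)) {
  message("MungeSumstats ",
          as.character(utils::packageVersion("MungeSumstats")))
} else {
  message("MungeSumstats is not installed in this library")
}
```

Bioconductor releases are tied to specific R versions, which is why an older R ends up with an older package version even when following the current install instructions.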
Thanks! I will try this wonderful package later because I don't want to upgrade my R version for now. A side point: R and RStudio should allow users to easily switch between different versions of R on Linux machines, like Python/PyCharm, so that we do not need to worry much about incompatibility across different R packages due to version issues. Best regards |
Hi @anbai106, while it does take a couple extra steps, you can have multiple versions of R/Rstudio installed on your computer at once by making containers: |
I tried munging my summary statistics using fullSS_path <- MungeSumstats::format_sumstats(path = fullSS_path, ref_genome = "GRCH37",tabix_index=TRUE). It took several days to finish. Here is what I saw in R Console:
Tabix-indexing file.
[ti_index_core] the file out of order at line 1253647
Create tabix index failed for [ /var/folders/g9/f5z6vsbx7q3_wmk06jz0j9zm0000gn/T//RtmpUeD8qB/filec8a3e5725dc.tsv.bgz ]!
Summary statistics report:
Successfully finished preparing sumstats file, preview:
Reading header.
SNP CHR BP A1 A2 UNIQID.A1A2 Z P NSUM N DIRECTION FRQ BETA SE
1: rs12184267 1 715265 C T 1:715265_T_C -2.121973 0.03384 359856 359856 ??+? 0.9591931 -0.01264265 0.005957967
2: rs12184277 1 715367 A G 1:715367_G_A -1.957915 0.05024 360230 360230 ??+? 0.9589313 -0.01162351 0.005936678
3: rs12184279 1 717485 C A 1:717485_A_C -1.912438 0.05582 360257 360257 ??+? 0.9594241 -0.01141891 0.005970864
4: rs116801199 1 720381 G T 1:720381_T_G -2.295404 0.02171 360980 360980 ??+? 0.9578380 -0.01344289 0.005856439
5: rs12565286 1 721290 G C 1:721290_C_G -2.315602 0.02058 360823 360823 ??+? 0.9576224 -0.01353111 0.005843450
Returning path to saved data.
Warning message:
In `[.data.table`(sumstats_dt, , `:=`((names(empty_cols)), NULL)) :
  Column 'NA' does not exist to remove
I'm wondering how to successfully create the tabix index. Thanks!