Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in check_no_rs_snp() when joining with duplicate key values #176

Closed
cfbeuchel opened this issue Feb 9, 2024 · 3 comments
Closed
Labels
bug Something isn't working

Comments

@cfbeuchel
Copy link
Contributor

cfbeuchel commented Feb 9, 2024

Hi,

alright, I've been struggling with an edge-case again for a few days and I think I have found the reason for it. I hope at least...

If you think this should be implemented differently I'd gladly submit a PR in case we figure something out. Like simply matching against the index or something.

Edit: I guess this is a duplicate of #164

1. Bug description

Whenever many entries in the SNP column are NA, using the joining by key feature from data.table, e.g. in check_no_rs_snp() will result in data.table matching all NA entries in sumstats_dt to all NAs in miss_rs, leading to a quickly inflated number of rows.

For example, when imputation_ind = TRUE, the following segment will cause a crash (line 415):

setkey(miss_rs, SNP_old_temp)
setkey(sumstats_dt, SNP_old_temp)
sumstats_dt[miss_rs, IMPUTATION_SNP := TRUE]

Console output

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

Expected behaviour

I think having NA as SNP entry is rare and often CHR:BP:REF:ALT or something is used (I guess GWAS Catalog and openGWAS do this?). I am currently munging data from various other sources and my current problems come from the GBMI summary statistics, a large and prominent GWAS meta analysis consortium that sadly decided against using one of the suggested standard formats.

Still, it might be desirable to handle this case more robustly.

2. Reproducible example

Code

library(data.table)
dat <- data.table(snp = paste0("rs", 1:10))
dat[1:5, snp := NA]
miss_dat <- dat[!grep("^rs", snp), ]
setkey(miss_dat, snp)
setkey(dat, snp)
dat[miss_dat, ]
dat[miss_dat, IMPUTATION_SNP := TRUE]

Data

A summary statistics file that leads to errors can be downloaded from here: https://docs.google.com/spreadsheets/d/1sSU_JfPKs6EZLcY9t3gXsSGPZA1LsP_z/edit

3. Session info

(Add output of the R function utils::sessionInfo() below. This helps us assess version/OS conflicts which could be causing bugs.)

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] glue_1.7.0        ggthemes_5.0.0    ggplot2_3.4.4     here_1.0.1        data.table_1.15.0

loaded via a namespace (and not attached):
  [1] tidyselect_1.2.0                         gwasvcf_0.1.2                           
  [3] dplyr_1.1.4                              blob_1.2.4                              
  [5] filelock_1.0.3                           R.utils_2.12.3                          
  [7] Biostrings_2.70.2                        bitops_1.0-7                            
  [9] fastmap_1.1.1                            RCurl_1.98-1.14                         
 [11] BiocFileCache_2.10.1                     VariantAnnotation_1.48.1                
 [13] GenomicAlignments_1.38.2                 XML_3.99-0.16.1                         
 [15] digest_0.6.34                            lifecycle_1.0.4                         
 [17] KEGGREST_1.42.0                          RSQLite_2.3.5                           
 [19] googleAuthR_2.0.1                        magrittr_2.0.3                          
 [21] compiler_4.3.1                           rlang_1.1.3                             
 [23] progress_1.2.3                           tools_4.3.1                             
 [25] utf8_1.2.4                               yaml_2.3.8                              
 [27] knitr_1.45                               rtracklayer_1.62.0                      
 [29] prettyunits_1.2.0                        S4Arrays_1.2.0                          
 [31] bit_4.0.5                                curl_5.2.0                              
 [33] DelayedArray_0.28.0                      xml2_1.3.6                              
 [35] abind_1.4-5                              BiocParallel_1.36.0                     
 [37] purrr_1.0.2                              withr_3.0.0                             
 [39] BiocGenerics_0.48.1                      R.oo_1.26.0                             
 [41] grid_4.3.1                               stats4_4.3.1                            
 [43] fansi_1.0.6                              MungeSumstats_1.11.5                    
 [45] colorspace_2.1-0                         scales_1.3.0                            
 [47] biomaRt_2.58.2                           SummarizedExperiment_1.32.0             
 [49] cli_3.6.2                                crayon_1.5.2                            
 [51] generics_0.1.3                           rstudioapi_0.15.0                       
 [53] httr_1.4.7                               rjson_0.2.21                            
 [55] sessioninfo_1.2.2                        DBI_1.2.1                               
 [57] cachem_1.0.8                             stringr_1.5.1                           
 [59] zlibbioc_1.48.0                          assertthat_0.2.1                        
 [61] parallel_4.3.1                           AnnotationDbi_1.64.1                    
 [63] XVector_0.42.0                           restfulr_0.0.15                         
 [65] matrixStats_1.2.0                        vctrs_0.6.5                             
 [67] Matrix_1.6-5                             jsonlite_1.8.8                          
 [69] IRanges_2.36.0                           hms_1.1.3                               
 [71] S4Vectors_0.40.2                         bit64_4.0.5                             
 [73] GenomicFeatures_1.54.3                   codetools_0.2-19                        
 [75] stringi_1.8.3                            gtable_0.3.4                            
 [77] GenomeInfoDb_1.38.5                      BiocIO_1.12.0                           
 [79] GenomicRanges_1.54.1                     munsell_0.5.0                           
 [81] tibble_3.2.1                             pillar_1.9.0                            
 [83] rappdirs_0.3.3                           SNPlocs.Hsapiens.dbSNP155.GRCh38_0.99.24
 [85] GenomeInfoDbData_1.2.11                  BSgenome_1.70.1                         
 [87] R6_2.5.1                                 dbplyr_2.4.0                            
 [89] rprojroot_2.0.4                          lattice_0.21-8                          
 [91] Biobase_2.62.0                           R.methodsS3_1.8.2                       
 [93] png_0.1-8                                Rsamtools_2.18.0                        
 [95] gargle_1.5.2                             memoise_2.0.1                           
 [97] SparseArray_1.2.3                        xfun_0.41                               
 [99] fs_1.6.3                                 MatrixGenerics_1.14.0                   
[101] pkgconfig_2.0.3                         
@cfbeuchel cfbeuchel added the bug Something isn't working label Feb 9, 2024
@cfbeuchel
Copy link
Contributor Author

cfbeuchel commented Feb 9, 2024

Just for reference, creating a temporary index column gets around this, but depending on the file size, this might cause some overhead, but would be a pragmatic solution that might work for most, if not all, cases.

dat <- data.table(snp = paste0("rs", 1:10))
dat[1:5, snp := NA]
dat[, i := .I]
miss_dat <- dat[!grep("^rs", snp), ]
setkey(miss_dat, i)
setkey(dat, i)
dat[miss_dat, IMPUTATION_SNP := TRUE]
dat
dat$i <- NULL

Edit: Since this becomes an issue already when only a few entries are NA (expanding to n=sum(is.na(miss_rs$SNP_old_temp)^2 additional rows, a simple check to avoid creating the additional column could be useful. E.g.:

if(any(is.na(miss_rs$SNP_old_temp)) {
  [...]
}

@Al-Murphy
Copy link
Owner

I'm actually not too sure if what you identified is causing the ongoing issue you referenced but nevertheless, it is something that should be fixed. I've made a fix in push of v1.11.6. Can you pull from Github and let me know if this addresses your issue?

@cfbeuchel
Copy link
Contributor Author

Hi! I've just successfully formatted a large file that previously produced the error with the current version. That would resolve this issue.
Thanks for the patch!
Cheers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants