Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in cDict[[chrom_col]] : subscript out of bounds #83

Closed
mkoromina opened this issue Mar 8, 2022 · 5 comments
Closed

Error in cDict[[chrom_col]] : subscript out of bounds #83

mkoromina opened this issue Mar 8, 2022 · 5 comments
Assignees
Labels
bug Something isn't working

Comments

@mkoromina
Copy link

mkoromina commented Mar 8, 2022

## 1. Bug description

Upon running the finemap_loci() function, the above mentioned error message occurs.
(this is somehow a continuation from a thread in issue #74.)

### Console output

Full error message from console:
+ Extracting relevant variants from fullSS...
+ Query Method: tabix
+ QUERY: Chromosome = 22 ; Min position = 41143879 ; Max position = 41163879
echotabix:: Converting full summary stats file to tabix format for fast querying...
echotabix:: Ensuring file is bgzipped.
echotabix:: Tabix-indexing file.
Error in cDict[[chrom_col]] : subscript out of bounds
In addition: Warning messages:
1: In readLines(con = large_file, n = n) :
  line 1 appears to contain an embedded nul
2: In readLines(con = large_file, n = n) :
  line 2 appears to contain an embedded nul
Fine-mapping complete in:
Time difference of 5.5 secs
Returning results as nested list.

Expected behaviour

I'd expect the finemap_loci() function to run efficiently.
To note also, that upon running the script below, the munged sumstats (.parquet) are further zipped (i.e., .parquet.bgz), which I am not sure if it is somehow related to the error message.

##update: same error message produced when using .tsv files with MungeSumStats package.

## 2. Reproducible example
##Code

##all echoverse sub packages are loaded 

##import top SNPs (to be fine-mapped)
top_SNPs <- import_topSNPs(topSS = file.path("/path/to/my/top-snps"),
show_table = T, 
chrom_col = "CHR",
position_col = "BP", 
snp_col= "SNP",
pval_col= "P",
effect_col= "OR",
locus_col = "SNP",
grouping_vars = c("SNP"))
      
##path to GWAS summary stats
fullSS_path <- file.path("/path/to/my/parquet/file")

##run the fine-mapping function
my.results<- finemap_loci(top_SNPs = top_SNPs, 
                                          results_dir = file.path("path/to/results/folder"),
                                          loci = c("rs13044225","rs5758064"),#top_SNPs$Locus,
                                          dataset_name = "mydataset",
                                          dataset_type = "GWAS",  
                                          force_new_subset = TRUE,
                                          force_new_LD = FALSE,
                                          force_new_finemap = TRUE,
                                          remove_tmps = FALSE,
                                          
                                          # Munge full sumstats first
                                          munged = FALSE,
                                          
                                          # SUMMARY STATS ARGUMENTS
                                          fullSS_path = fullSS_path,
                                          fullSS_genome_build = "hg19",
                                          query_by = "tabix",  
                                          MAF_col = "calculate",   
                                         
                                          # FILTERING ARGUMENTS
                                          bp_distance = 10000,
                                          min_MAF = 0.001,  
                                          trim_gene_limits = FALSE,
                                         
                                          # FINE-MAPPING ARGUMENTS
                                          ## General
                                          finemap_methods = c("ABF","FINEMAP","SUSIE","POLYFUN_SUSIE"), 
                                          n_causal = 5,
                                          PP_threshold = .95, 
                                         
                                          # LD ARGUMENTS 
                                          LD_reference = "1KGphase1",#"UKB",
                                          superpopulation = "EUR",
                                          download_method = "axel",
                                          verbose = TRUE)

### Data
Unfortunately, I cannot upload the data, but here is a short description:
-My loci to be fine mapped are saved in a standard excel format with the following header:
SNP | CHR | BP | P | OR | SE.

-My GWAS sumstats have been munged with the munge_polyfun_sumstats.py and they are saved in a .parquet format.

## 3. Session info

R version 4.1.2 (2021-11-01)

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] echotabix_0.99.2 echoplot_0.99.1 echoLD_0.99.1 echofinemap_0.99.0 echoconda_0.99.4 echodata_0.99.6 echoannot_0.99.3 echolocatoR_2.0.0

loaded via a namespace (and not attached):
[1] utf8_1.2.2 reticulate_1.24 R.utils_2.11.0 tidyselect_1.1.2 RSQLite_2.2.10
[6] AnnotationDbi_1.56.2 htmlwidgets_1.5.4 grid_4.1.2 BiocParallel_1.28.3 XGR_1.1.7
[11] munsell_0.5.0 DT_0.21 colorspace_2.0-3 Biobase_2.54.0 filelock_1.0.2
[16] OrganismDbi_1.36.0 knitr_1.37 supraHex_1.32.0 rstudioapi_0.13 stats4_4.1.2
[21] DescTools_0.99.44 MatrixGenerics_1.6.0 GenomeInfoDbData_1.2.7 mixsqp_0.3-43 bit64_4.0.5
[26] rprojroot_2.0.2 vctrs_0.3.8 generics_0.1.2 xfun_0.30 biovizBase_1.42.0
[31] BiocFileCache_2.2.1 R6_2.5.1 GenomeInfoDb_1.30.1 AnnotationFilter_1.18.0 bitops_1.0-7
[36] cachem_1.0.6 reshape_0.8.8 DelayedArray_0.20.0 assertthat_0.2.1 BiocIO_1.4.0
[41] scales_1.1.1 nnet_7.3-17 rootSolve_1.8.2.3 gtable_0.3.0 lmom_2.8
[46] ggbio_1.42.0 ensembldb_2.18.3 rlang_1.0.2 clisymbols_1.2.0 splines_4.1.2
[51] rtracklayer_1.54.0 lazyeval_0.2.2 dichromat_2.0-0 hexbin_1.28.2 checkmate_2.0.0
[56] BiocManager_1.30.16 yaml_2.3.5 reshape2_1.4.4 crosstalk_1.2.0 snpStats_1.44.0
[61] GenomicFeatures_1.46.5 ggnetwork_0.5.10 backports_1.4.1 Hmisc_4.6-0 RBGL_1.70.0
[66] tools_4.1.2 ggplot2_3.3.5 ellipsis_0.3.2 jquerylib_0.1.4 RColorBrewer_1.1-2
[71] proxy_0.4-26 BiocGenerics_0.40.0 coloc_5.1.1 Rcpp_1.0.8 plyr_1.8.6
[76] base64enc_0.1-3 progress_1.2.2 zlibbioc_1.40.0 purrr_0.3.4 RCurl_1.98-1.6
[81] prettyunits_1.1.1 rpart_4.1.16 viridis_0.6.2 S4Vectors_0.32.3 SummarizedExperiment_1.24.0
[86] ggrepel_0.9.1 cluster_2.1.2 here_1.0.1 fs_1.5.2 magrittr_2.0.2
[91] data.table_1.14.2 dnet_1.1.7 openxlsx_4.2.5 gh_1.3.0 mvtnorm_1.1-3
[96] ProtGenerics_1.26.0 matrixStats_0.61.0 hms_1.1.1 patchwork_1.1.1 XML_3.99-0.9
[101] jpeg_0.1-9 IRanges_2.28.0 gridExtra_2.3 compiler_4.1.2 biomaRt_2.50.3
[106] tibble_3.1.6 crayon_1.5.0 R.oo_1.24.0 htmltools_0.5.2 tzdb_0.2.0
[111] Formula_1.2-4 tidyr_1.2.0 expm_0.999-6 Exact_3.1 lubridate_1.8.0
[116] DBI_1.1.2 dbplyr_2.1.1 MASS_7.3-55 rappdirs_0.3.3 boot_1.3-28
[121] Matrix_1.4-0 readr_2.1.2 piggyback_0.1.1 cli_3.2.0 R.methodsS3_1.8.1
[126] parallel_4.1.2 igraph_1.2.11 GenomicRanges_1.46.1 pkgconfig_2.0.3 GenomicAlignments_1.30.0
[131] RCircos_1.2.2 foreign_0.8-82 xml2_1.3.3 bslib_0.3.1 XVector_0.34.0
[136] stringr_1.4.0 VariantAnnotation_1.40.0 digest_0.6.29 graph_1.72.0 Biostrings_2.62.0
[141] htmlTable_2.4.0 gld_2.6.4 restfulr_0.0.13 curl_4.3.2 Rsamtools_2.10.0
[146] rjson_0.2.21 seqminer_8.4 lifecycle_1.0.1 nlme_3.1-155 jsonlite_1.8.0
[151] viridisLite_0.4.0 BSgenome_1.62.0 fansi_1.0.2 susieR_0.11.92 pillar_1.7.0
[156] lattice_0.20-45 GGally_2.1.2 KEGGREST_1.34.0 fastmap_1.1.0 httr_1.4.2
[161] survival_3.3-1 glue_1.6.2 zip_2.2.0 png_0.1-7 bit_4.0.4
[166] Rgraphviz_2.38.0 sass_0.4.0 class_7.3-20 stringi_1.7.6 blob_1.2.2
[171] latticeExtra_0.6-29 memoise_2.0.1 dplyr_1.0.8 irlba_2.3.5 e1071_1.7-9
[176] ape_5.6-2

@mkoromina mkoromina added the bug Something isn't working label Mar 8, 2022
@mkoromina
Copy link
Author

mkoromina commented Mar 8, 2022

May I just say that I am not sure how the bug label was added in there, but please feel free to remove it and leave it as an issue to be solved! Thanks a lot!

@bschilder
Copy link
Member

bschilder commented Mar 16, 2022

Seems like there's at least several things going on here.

Most importantly, echolocatoR doesn't support sumstats provided in parquet file format (yet). When you write "/path/to/my/parquet/file" do you literally mean a parquet file? Because that won't work. As I mentioned here, you need to convert it to a format that data.table can recognize (e.g. tsv.gz). There's a number of ways to do this, but if you prefer to stay in R, you can use arrow.

dat <- arrow::read_parquet(filepath)
data.table::fwrite(dat, newfilepath, sep="\t")

PS - I set the bug label to be automatic, so don't worry about it!

@mkoromina
Copy link
Author

Hi @bschilder, thank you very much for this! I experienced though the same issue when using .tsv files created by MungeSumstats. I will try though your way with arrow package and come back to you if needed. Thanks a lot once again!

@bschilder
Copy link
Member

Ok, keep me posted!
I should also mention I'm looking into some issues with munging and tabix indexing PGC files such as the BD2021 dataset.
https://github.com/neurogenomics/MungeSumstats/issues/91

@mkoromina
Copy link
Author

Hey, Just an update in here, so as to let you know that .parquet format won't work (even when reformatted via arrow::read_parquet to a .tsv.gz file). The main issue is actually error messages occurring and which have to do with its appropriate tabix indexing in downstream analysis.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Development

No branches or pull requests

2 participants