Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_genome_build: Can't find SNP, CHR, BP in PGC VCFs #91

Closed
bschilder opened this issue Mar 16, 2022 · 5 comments
Closed

get_genome_build: Can't find SNP, CHR, BP in PGC VCFs #91

bschilder opened this issue Mar 16, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@bschilder
Copy link
Collaborator

bschilder commented Mar 16, 2022

1. Bug description

There seems to be some issues when trying to munge the Psychiatric Genomics Consortium (PGC) sumstats format, which is a bit different from the OpenGWAS format.

FIrst of all, the file names end in ".vcf.tsv.gz" which is confusing and might be tripping up our code that infers file type by extension names. Wondering if this is happening due to a slight discrepancy between how read_sumstats and read_header are inferring file type, because format_sumstats does manage to get partway through before hitting an error (it even correctly counts the number of rows!).

Here's part of the header from one of these files:
Screenshot 2022-03-16 at 22 48 08

Console output

Error from the first reprex below.

Error in get_genome_build(sumstats = sumstats_return$sumstats_dt, standardise_headers = FALSE,  : 
  SNP ID column (RS ID), CHR and BP (POSITION) columns are needed to infer the genome build. These could not be
found in your dataset. Please specify the genome build manually to run format_sumstats().

Also including the full message output.
MungeSumstats_log_msg.txt

Expected behaviour

format_sumstats is able to run the full pipeline and produce a munged tsv.

2. Reproducible example

Code

This produces an error

fullSS_path_vcf <- "~/Downloads/pgc-bip2021-all.vcf.tsv.gz" 
fullSS_path <- MungeSumstats::format_sumstats(path = fullSS_path_vcf, 
                                              sort_coordinates = TRUE,
                                              log_folder = "~/Downloads/logs", 
                                              log_mungesumstats_msgs = TRUE, 
                                              log_folder_ind = TRUE)

But this works ok

I tried reformatting the file manually so it was undoubtedly a regular tsv file.

#### Set up paths #### 
fullSS_path_tsv <- gsub("\\.vcf","",fullSS_path_vcf) 
fullSS_path_munged <- gsub("-all","-all_munged",fullSS_path_tsv)
#### Edit ####
dat <- data.table::fread(fullSS_path_vcf, 
                         skip = "#CHROM")
colnames(dat) <- gsub("#","",colnames(dat))
#### Save ####
data.table::fwrite(x = dat, 
                   file = fullSS_path_tsv, 
                   sep="\t")
#### Munge ####
fullSS_path <- MungeSumstats::format_sumstats(path = fullSS_path_tsv, 
                                              save_path = fullSS_path_munged, 
                                              sort_coordinates = TRUE,
                                              log_folder = "~/Downloads/logs", 
                                              log_mungesumstats_msgs = TRUE, 
                                              log_folder_ind = TRUE) 

Data

The data can be downloaded here.

Session info

R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] arrow_7.0.0      ggimage_0.3.0    ggplot2_3.3.5    dplyr_1.0.8      hexSticker_0.4.9
[6] echotabix_0.99.3

loaded via a namespace (and not attached):
  [1] AnnotationHub_3.2.2           BiocFileCache_2.2.1           systemfonts_1.0.4            
  [4] igraph_1.2.11                 BiocParallel_1.28.3           GenomeInfoDb_1.30.1          
  [7] digest_0.6.29                 yulab.utils_0.0.4             htmltools_0.5.2              
 [10] magick_2.7.3                  fansi_1.0.2                   magrittr_2.0.2               
 [13] memoise_2.0.1                 BSgenome_1.62.0               echoverseTemplate_0.99.0     
 [16] ontologyPlot_1.6              openxlsx_4.2.5                Biostrings_2.62.0            
 [19] matrixStats_0.61.0            R.utils_2.11.0                sysfonts_0.8.5               
 [22] prettyunits_1.1.1             colorspace_2.0-3              blob_1.2.2                   
 [25] rappdirs_0.3.3                textshaping_0.3.6             xfun_0.30                    
 [28] crayon_1.5.0                  RCurl_1.98-1.6                echodata_0.99.6              
 [31] jsonlite_1.8.0                hexbin_1.28.2                 graph_1.72.0                 
 [34] VariantAnnotation_1.40.0      glue_1.6.2                    gtable_0.3.0                 
 [37] zlibbioc_1.40.0               XVector_0.34.0                DelayedArray_0.20.0          
 [40] Rgraphviz_2.38.0              BiocGenerics_0.40.0           scales_1.1.1                 
 [43] DBI_1.1.2                     Rcpp_1.0.8.2                  showtextdb_3.0               
 [46] xtable_1.8-4                  progress_1.2.2                gridGraphics_0.5-1           
 [49] bit_4.0.4                     clisymbols_1.2.0              stats4_4.1.0                 
 [52] DT_0.21                       htmlwidgets_1.5.4             httr_1.4.2                   
 [55] ontologyIndex_2.7             ellipsis_0.3.2                pkgconfig_2.0.3              
 [58] XML_3.99-0.9                  R.methodsS3_1.8.1             farver_2.1.0                 
 [61] seqminer_8.4                  dbplyr_2.1.1                  utf8_1.2.2                   
 [64] here_1.0.1                    ggplotify_0.1.0               tidyselect_1.1.2             
 [67] labeling_0.4.2                rlang_1.0.2                   later_1.3.0                  
 [70] AnnotationDbi_1.56.2          BiocVersion_3.14.0            munsell_0.5.0                
 [73] tools_4.1.0                   cachem_1.0.6                  cli_3.2.0                    
 [76] generics_0.1.2                RSQLite_2.2.10                evaluate_0.15                
 [79] stringr_1.4.0                 fastmap_1.1.0                 yaml_2.3.5                   
 [82] ragg_1.2.2                    knitr_1.37                    bit64_4.0.5                  
 [85] fs_1.5.2                      zip_2.2.0                     purrr_0.3.4                  
 [88] KEGGREST_1.34.0               gh_1.3.0                      showtext_0.9-5               
 [91] mime_0.12                     R.oo_1.24.0                   xml2_1.3.3                   
 [94] biomaRt_2.50.3                brio_1.1.3                    compiler_4.1.0               
 [97] rstudioapi_0.13               interactiveDisplayBase_1.32.0 filelock_1.0.2               
[100] curl_4.3.2                    png_0.1-7                     testthat_3.1.2               
[103] paintmap_1.0                  tibble_3.1.6                  stringi_1.7.6                
[106] GenomicFeatures_1.46.5        desc_1.4.1                    lattice_0.20-45              
[109] Matrix_1.4-0                  vctrs_0.3.8                   pillar_1.7.0                 
[112] lifecycle_1.0.1               BiocManager_1.30.16           data.table_1.14.2            
[115] bitops_1.0-7                  httpuv_1.6.5                  rtracklayer_1.54.0           
[118] GenomicRanges_1.46.1          R6_2.5.1                      BiocIO_1.4.0                 
[121] promises_1.2.0.1              IRanges_2.28.0                ontoProc_1.16.0              
[124] assertthat_0.2.1              pkgload_1.2.4                 SummarizedExperiment_1.24.0  
[127] rprojroot_2.0.2               rjson_0.2.21                  withr_2.5.0                  
[130] GenomicAlignments_1.30.0      Rsamtools_2.10.0              S4Vectors_0.32.3             
[133] GenomeInfoDbData_1.2.7        parallel_4.1.0                hms_1.1.1                    
[136] grid_4.1.0                    ggfun_0.0.5                   tidyr_1.2.0                  
[139] rmarkdown_2.13                MatrixGenerics_1.6.0          piggyback_0.1.1              
[142] Biobase_2.54.0                shiny_1.7.1                   lubridate_1.8.0              
[145] restfulr_0.0.13          
@bschilder bschilder added the bug Something isn't working label Mar 16, 2022
@bschilder
Copy link
Collaborator Author

bschilder commented Mar 16, 2022

One place to start might be adding ".vcf.tsv", ".vcf.tsv.gz", and ".vcf.tsv.bgz" as an additional possibilities in MungeSumstats:::supported_suffixes():

Screenshot 2022-03-16 at 22 57 53

@Al-Murphy
Copy link
Owner

Seems to be an issue reading the header of this VCF (probably to do with the format?) with variant annotation:

VariantAnnotation::scanVcfHeader(path)
Error in scanBcfHeader(bf) : [internal] _hts_rewind() failed

@Al-Murphy
Copy link
Owner

Other than that it will run once I add #CHROM to the mapping file. So my solution is to add an error catch to VariantAnnotation::scanVcfHeader(path) in read_header() and use your other approach:

header <- readLines(path, n = 100)
            i <- which(startsWith(header, "#CHR"))
            header <- data.table::fread(text = header[seq(i, i + n)],
                                        nThread = 1)

For vcfs if this fails. Let me know if there is any downstream issue of not using variantAnnotation approach? I'll add these fixes into the solution I have for Indels to be pushed to the master branch (not current) so use the github version for this fix (until late April when it's released)

@bschilder
Copy link
Collaborator Author

bschilder commented Mar 17, 2022

Excellent, I think this sounds like a solid solution to me. Thanks, Alan!

I'll keep you posted about any downstream issues. Currently dealing with one from my more manual solution above, with some weird errors about the rows being out of order (when they don't seem to be) during tabix indexing:

@Al-Murphy
Copy link
Owner

Added to master branch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants