Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error: scanTabix: '4' not present in tabix index #33

Closed
bschilder opened this issue Mar 19, 2022 · 22 comments
Closed

Error: scanTabix: '4' not present in tabix index #33

bschilder opened this issue Mar 19, 2022 · 22 comments

Comments

@bschilder
Copy link

bschilder commented Mar 19, 2022

Hello,

So I seem to be having some issues with querying remote tabix files (e.g from ENCODE). Though I'm not sure if this is strictly related to the file being remote, or some other difference in how the file is formatted.

Reprex

Main example

 #### Setup ####
    
    #### Get example data (Parkinson's Disease GWAS) ####
    if(!require("echodata")) remotes::install_github("RajLabMSSM/echodata") 
    #### Construct query as granges ####
    dat <- echodata::BST1
    gr <- GenomicRanges::GRanges(
        seqnames = as.integer(dat$CHR),
        ranges = IRanges::IRanges(
            start = as.integer(dat$POS),
            end = as.integer(dat$POS)
        )
    )
    #### For reconstructing the data as a data.table with colnames ####
    scanTabix_to_dt <- function(bgz, query){
        ### Add missing header back in 
        header <- Rsamtools::headerTabix(bgz) 
        query_dt <- data.table::fread(paste(c(header$header, query),
                                            collapse = "\n"), fill=TRUE)
        return(query_dt)
    } 
    
    
    
    #### ------- Example 1: Remote file ------- ####
    ### Currently produces error
    
    ## This is an tabix-indexed and bgzip-compressed ENCODE file.
    bgz1 <- file.path(
        "https://egg2.wustl.edu/roadmap/data/byFileType",
        "chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final",
        "E099_15_coreMarks_dense.bed.bgz"
    ) 
    #### Query ####
    query1 <- Rsamtools::scanTabix(file = bgz1, param = gr) # <-- error occurs here
    #  "Error: scanTabix: '4' not present in tabix index"
    
    #### Running without the gr query returns the while file ####
    ## However, because it's the whole file, this takes a while.
    query1 <- Rsamtools::scanTabix(file = bgz1)
    ## This part tends to crash R (memory overload?)
    # query_dt1 <- scanTabix_to_dt(bgz1, query1) 

Extended examples

    #### ------- Example 2: Local file ------- #### 
   
 ### Currently working
    fullSS_path <- echodata::example_fullSS()
    bgz2 <- Rsamtools::bgzip(file = fullSS_path, 
                            dest = paste0(fullSS_path,".bgz"), 
                            overwrite = TRUE)
    tbi2 <- Rsamtools::indexTabix(file = bgz2, 
                                 seq = 2, 
                                 start = 3,
                                 end = 3, 
                                 comment = "SNP")  
    #### Query ####
    query2 <- Rsamtools::scanTabix(file = bgz2, param = gr) 
    query_dt2 <- scanTabix_to_dt(bgz2, query2)
     
    
    
    
    #### ------- Example 3: Local file prepared without Rsamtools ------- #### 
    ### Currently working 
    ## Tho not when bgzipping with CLI. may be a version conflict issue
    ## (unrelated to Rsamtools per se).
    
    fullSS_path <- echodata::example_fullSS()
    bgz3 <- paste0(fullSS_path,".bgz") 
    #### Compress the file via CLI ####
    bgzip_method <- "rsamtools"
    if(bgzip_method == "cli"){
        system(paste("bgzip -f", fullSS_path))
        ## However, this causes an error with tabix:
        ## "[tabix] the compression of '..._subset.tsv.bgz' is not BGZF". weird!
        #### Check version of bgzip ####
        help <- system("bgzip -h", intern = TRUE)
        cat(help, sep = "\n")
    } else {
        ## Compressing with Rsamtools seems to work fine with tabix CLI 
        bgz3 <- Rsamtools::bgzip(file = fullSS_path,
                                 dest = bgz3, 
                                 overwrite = TRUE)
    }
    #### Tabix-index ####
    system(paste("tabix",
                 "-f",
                 "-h",
                 "-s",2,
                 "-b",3,
                 "-e",3,
                 "-c","SNP",
                 bgz3
    )) 
    query3 <- Rsamtools::scanTabix(file = bgz3, param = gr)
    query_dt3 <- scanTabix_to_dt(bgz3, query3)

Session info

R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
  [1] bitops_1.0-7                matrixStats_0.61.0          fs_1.5.2                   
  [4] lubridate_1.8.0             bit64_4.0.5                 filelock_1.0.2             
  [7] progress_1.2.2              httr_1.4.2                  GenomeInfoDb_1.30.1        
 [10] gh_1.3.0                    tools_4.1.0                 utf8_1.2.2                 
 [13] R6_2.5.1                    DT_0.21                     DBI_1.1.2                  
 [16] BiocGenerics_0.40.0         tidyselect_1.1.2            prettyunits_1.1.1          
 [19] bit_4.0.4                   curl_4.3.2                  compiler_4.1.0             
 [22] cli_3.2.0                   Biobase_2.54.0              xml2_1.3.3                 
 [25] DelayedArray_0.20.0         rtracklayer_1.54.0          rappdirs_0.3.3             
 [28] stringr_1.4.0               digest_0.6.29               Rsamtools_2.10.0           
 [31] piggyback_0.1.1             R.utils_2.11.0              XVector_0.34.0             
 [34] pkgconfig_2.0.3             htmltools_0.5.2             echodata_0.99.7            
 [37] MatrixGenerics_1.6.0        BSgenome_1.62.0             dbplyr_2.1.1               
 [40] fastmap_1.1.0               htmlwidgets_1.5.4           rlang_1.0.2                
 [43] rstudioapi_0.13             RSQLite_2.2.10              BiocIO_1.4.0               
 [46] generics_0.1.2              jsonlite_1.8.0              echoconda_0.99.5           
 [49] BiocParallel_1.28.3         dplyr_1.0.8                 zip_2.2.0                  
 [52] R.oo_1.24.0                 VariantAnnotation_1.40.0    RCurl_1.98-1.6             
 [55] magrittr_2.0.2              GenomeInfoDbData_1.2.7      Matrix_1.4-0               
 [58] Rcpp_1.0.8.3                S4Vectors_0.32.3            fansi_1.0.2                
 [61] reticulate_1.24-9000        lifecycle_1.0.1             R.methodsS3_1.8.1          
 [64] yaml_2.3.5                  stringi_1.7.6               brio_1.1.3                 
 [67] SummarizedExperiment_1.24.0 zlibbioc_1.40.0             BiocFileCache_2.2.1        
 [70] grid_4.1.0                  blob_1.2.2                  parallel_4.1.0             
 [73] crayon_1.5.0                lattice_0.20-45             Biostrings_2.62.0          
 [76] GenomicFeatures_1.46.5      hms_1.1.1                   KEGGREST_1.34.0            
 [79] pillar_1.7.0                GenomicRanges_1.46.1        rjson_0.2.21               
 [82] biomaRt_2.50.3              clisymbols_1.2.0            stats4_4.1.0               
 [85] XML_3.99-0.9                glue_1.6.2                  data.table_1.14.2          
 [88] png_0.1-7                   vctrs_0.3.8                 testthat_3.1.2             
 [91] purrr_0.3.4                 tidyr_1.2.0                 assertthat_0.2.1           
 [94] cachem_1.0.6                openxlsx_4.2.5              echotabix_0.99.3           
 [97] restfulr_0.0.13             gitcreds_0.1.1              tibble_3.1.6               
[100] GenomicAlignments_1.30.0    AnnotationDbi_1.56.2        memoise_2.0.1              
[103] IRanges_2.28.0              ellipsis_0.3.2             
@bschilder
Copy link
Author

bschilder commented Mar 19, 2022

i think there might actually be a couple things going on here:

  • I need to reformat my query so that it uses the "chr1" seqnames style instead of "1" (to match the ENCODE file's style).
  • The remote ENCODE file doesn't contain any column names, which might cause problems for Rsamtools.

Screenshot 2022-03-19 at 18 05 56

@vjcitn
Copy link

vjcitn commented Mar 27, 2022

I find that rtracklayer::import can import your data. As you note, the seqlevelsStyle is UCSC. A GRanges without this style can never succeed. After seqlevelsStyle(gr) = "UCSC",

> head(gr)
GRanges object with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr4  15712787      *
  [2]     chr4  15730146      *
  [3]     chr4  15730398      *
  [4]     chr4  15710330      *
  [5]     chr4  15706790      *
  [6]     chr4  15737348      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> rtracklayer::import("E099_15_coreMarks_dense.bed.bgz", which=gr)
GRanges object with 163 ranges and 4 metadata columns:
        seqnames            ranges strand |        name     score     itemRgb
           <Rle>         <IRanges>  <Rle> | <character> <numeric> <character>
    [1]     chr4 14108401-14843200      * |    15_Quies         0     #FFFFFF
    [2]     chr4 14843201-14843600      * |       7_Enh         0     #FFFF00
    [3]     chr4 14843601-14844400      * |      5_TxWk         0     #006400
    [4]     chr4 14844401-14844600      * |       7_Enh         0     #FFFF00
    [5]     chr4 14844601-14847000      * |      5_TxWk         0     #006400
    ...      ...               ...    ... .         ...       ...         ...
  [159]     chr4 16723401-16724400      * |      5_TxWk         0     #006400
  [160]     chr4 16724401-16724600      * |       7_Enh         0     #FFFF00
  [161]     chr4 16724601-16725000      * |      5_TxWk         0     #006400
  [162]     chr4 16725001-16725600      * |       7_Enh         0     #FFFF00
  [163]     chr4 16725601-16805200      * |    15_Quies         0     #FFFFFF
                    thick
                <IRanges>
    [1] 14108401-14843200
    [2] 14843201-14843600
    [3] 14843601-14844400
    [4] 14844401-14844600
    [5] 14844601-14847000
    ...               ...
  [159] 16723401-16724400
  [160] 16724401-16724600
  [161] 16724601-16725000
  [162] 16725001-16725600
  [163] 16725601-16805200
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths

@vjcitn
Copy link

vjcitn commented Mar 27, 2022

For scanTabix, with a correct seqlevelsStyle in the query,

> scanTabix(file=TabixFile(bgz1), param=gr) -> ii
[E::bgzf_read] Read block operation failed with error -1 after 0 of 8 bytes

I don't understand that.

@mtmorgan
Copy link
Collaborator

This is from samtools / htslib with an update to the Bioconductor version required; see #32 (comment)

@hpages
Copy link
Contributor

hpages commented Mar 27, 2022

A long due update! See Bioconductor/Rhtslib#4 and #8 (comment). Should we make this a priority for BioC 3.16?

@bschilder
Copy link
Author

bschilder commented Apr 4, 2022

For scanTabix, with a correct seqlevelsStyle in the query,

> scanTabix(file=TabixFile(bgz1), param=gr) -> ii
[E::bgzf_read] Read block operation failed with error -1 after 0 of 8 bytes

I don't understand that.

Getting this error as well now with VCFs from the 1000 Genomes Project, which I previously didn't have any issues with.`

This seems to have the effect of breaking VariantAnnotation::readVcf as well. @vjcitn have you noticed this with other files?

Here are several examples that currently work to varying degrees.
I can confirm this because I have these numbers recorded in my unit tests.
https://github.com/RajLabMSSM/echotabix/blob/main/tests/testthat/test-query_vcf.R

target_path <- file.path(
"ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/",
"ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz")
param <- GenomicRanges::GRanges("4:14737349-16737284")

## This produces the error message but successfully returns the header
header <- Rsamtools::headerTabix(file = target_path) 

## This does NOT produce the error message and successfully returns the header
header <- VariantAnnotation::scanVcfHeader(file = target_path) 

## These both  produce the error message but return only 468 variants from all of chrom 4
## Was previously returning many many thousands of variants.
vcf <-  VariantAnnotation::readVcf(target_path)
vcf <- Rsamtools::scanTabix(file = target_path)
## Produces the same "Read block operation failed" error message as the other two methods,
## but then fails with an error in R, thus returning no output:
### Error in read.table(con, sep = "\t", ...) : 
###   incomplete final line found by readTableHeader on 'gzcon(ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521//ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz)'
tbx <- Rsamtools::TabixFile(target_path)
out <- rtracklayer::import(tbx) 

## These both produce the error message but return 0 variants
## Was previously returning 24,376 variants.  
vcf <-  VariantAnnotation::readVcf(target_path, param=param)
## Warning: Take a very long time
vcf <- Rsamtools::scanTabix(file = target_path, param=param) 
## Produces the same error message (3 times) but return 0 variants, 
## and then throws an error indicating that there's no input to parse.
tbx <- Rsamtools::TabixFile(target_path)
out <- rtracklayer::import(tbx, which=param) 

Screenshot 2022-04-10 at 11 17 34

Session info

R version 4.1.3 (2022-03-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.3.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dplyr_1.0.8                 echotabix_0.99.5            VariantAnnotation_1.40.0   
 [4] Rsamtools_2.10.0            Biostrings_2.62.0           XVector_0.34.0             
 [7] SummarizedExperiment_1.24.0 Biobase_2.54.0              GenomicRanges_1.46.1       
[10] GenomeInfoDb_1.30.1         IRanges_2.28.0              S4Vectors_0.32.4           
[13] MatrixGenerics_1.6.0        matrixStats_0.61.0          BiocGenerics_0.40.0        

loaded via a namespace (and not attached):
  [1] colorspace_2.0-3         rjson_0.2.21             class_7.3-20            
  [4] ellipsis_0.3.2           rprojroot_2.0.3          fs_1.5.2                
  [7] gld_2.6.4                proxy_0.4-26             rstudioapi_0.13         
 [10] DT_0.22                  gh_1.3.0                 bit64_4.0.5             
 [13] AnnotationDbi_1.56.2     fansi_1.0.3              mvtnorm_1.1-3           
 [16] lubridate_1.8.0          xml2_1.3.3               R.methodsS3_1.8.1       
 [19] rootSolve_1.8.2.3        cachem_1.0.6             pkgload_1.2.4           
 [22] jsonlite_1.8.0           dbplyr_2.1.1             png_0.1-7               
 [25] R.oo_1.24.0              echoconda_0.99.5         echodata_0.99.7         
 [28] readr_2.1.2              compiler_4.1.3           httr_1.4.2              
 [31] assertthat_0.2.1         Matrix_1.4-0             fastmap_1.1.0           
 [34] gargle_1.2.0             cli_3.2.0                htmltools_0.5.2         
 [37] prettyunits_1.1.1        tools_4.1.3              gtable_0.3.0            
 [40] glue_1.6.2               lmom_2.8                 GenomeInfoDbData_1.2.7  
 [43] rappdirs_0.3.3           Rcpp_1.0.8.3             vctrs_0.4.0             
 [46] rtracklayer_1.54.0       stringr_1.4.0            MungeSumstats_1.3.16    
 [49] brio_1.1.3               openxlsx_4.2.5           testthat_3.1.3          
 [52] lifecycle_1.0.1          restfulr_0.0.13          XML_3.99-0.9            
 [55] googleAuthR_2.0.0        scales_1.1.1             MASS_7.3-55             
 [58] zlibbioc_1.40.0          BSgenome_1.62.0          clisymbols_1.2.0        
 [61] hms_1.1.1                parallel_4.1.3           expm_0.999-6            
 [64] Exact_3.1                yaml_2.3.5               curl_4.3.2              
 [67] memoise_2.0.1            reticulate_1.24-9000     ggplot2_3.3.5           
 [70] biomaRt_2.50.3           stringi_1.7.6            RSQLite_2.2.11          
 [73] BiocIO_1.4.0             desc_1.4.1               e1071_1.7-9             
 [76] GenomicFeatures_1.46.5   filelock_1.0.2           boot_1.3-28             
 [79] zip_2.2.0                BiocParallel_1.28.3      rlang_1.0.2             
 [82] pkgconfig_2.0.3          bitops_1.0-7             lattice_0.20-45         
 [85] scKirby_0.1.0            purrr_0.3.4              GenomicAlignments_1.30.0
 [88] htmlwidgets_1.5.4        bit_4.0.4                tidyselect_1.1.2        
 [91] magrittr_2.0.3           R6_2.5.1                 DescTools_0.99.44       
 [94] generics_0.1.2           piggyback_0.1.1          DelayedArray_0.20.0     
 [97] DBI_1.1.2                pillar_1.7.0             withr_2.5.0             
[100] KEGGREST_1.34.0          RCurl_1.98-1.6           tibble_3.1.6            
[103] crayon_1.5.1             utf8_1.2.2               BiocFileCache_2.2.1     
[106] tzdb_0.3.0               progress_1.2.2           grid_4.1.3              
[109] data.table_1.14.2        blob_1.2.2               digest_0.6.29           
[112] tidyr_1.2.0              R.utils_2.11.0           munsell_0.5.0  

@mtmorgan
Copy link
Collaborator

mtmorgan commented Apr 5, 2022

I'll look into this more closely; I also remember the 1000 genomes VCFs working.

Unrelated, but file.path() is not the right way to assemble URLs, because it uses as separator the default for the platform -- on Windows that's \ rather than /, at least in principle. Also philosophically those things at the end of a URL are not paths, just identifiers, as in object stores such as Amazon S3 where the 'object' just happens to have an identifier that looks like a file path.

@bschilder
Copy link
Author

thanks @mtmorgan. ah, hadn't thought of that with URLs, i'll be more careful with those in the future.

@bschilder
Copy link
Author

Just a heads up, this also affects rtracklayer, which also has some functionality for importing VCFs (which I never knew it could do!). Tagging the rtracklayer maintainer as well. @sanchit-saini

I've updated the reprex above to demonstrate that the same error occurs with this method.

@bschilder
Copy link
Author

Interestingly, seqminer is still able to successfully perform these VCF queries. I believe this is because it does not rely on Rhtslib or Rsamtools. I may need to switch to this method for my packages while the issues with Rsamtools and/or Rhtslib are being resolved, since querying VCFs is pretty crucial to my packages. That said, I'd prefer to use VariantAnnotation once it is working again since it has the added benefit of subsetting VCFs by sample IDs. @vjcitn

Tagging the seqminer maintainers here: @zhanxw @yang-lina

seqminer reprex

 target_path <- paste(
        "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/",
        "ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz",
        sep="/")

out <- seqminer::tabix.read.table(target_path, tabixRange = "4:14737349-16737284")
dim(out)
## [1] 28228  1101

Session info

R version 4.1.3 (2022-03-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.3.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] echotabix_0.99.5            dplyr_1.0.8                 VariantAnnotation_1.40.0    Rsamtools_2.10.0           
 [5] Biostrings_2.62.0           XVector_0.34.0              SummarizedExperiment_1.24.0 Biobase_2.54.0             
 [9] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1         IRanges_2.28.0              S4Vectors_0.32.4           
[13] MatrixGenerics_1.6.0        matrixStats_0.61.0          BiocGenerics_0.40.0        

loaded via a namespace (and not attached):
 [1] fs_1.5.2                 bitops_1.0-7             lubridate_1.8.0          bit64_4.0.5             
 [5] filelock_1.0.2           progress_1.2.2           httr_1.4.2               rprojroot_2.0.3         
 [9] gh_1.3.0                 tools_4.1.3              utf8_1.2.2               R6_2.5.1                
[13] DT_0.22                  DBI_1.1.2                withr_2.5.0              tidyselect_1.1.2        
[17] prettyunits_1.1.1        bit_4.0.4                curl_4.3.2               compiler_4.1.3          
[21] cli_3.2.0                xml2_1.3.3               desc_1.4.1               DelayedArray_0.20.0     
[25] rtracklayer_1.54.0       readr_2.1.2              rappdirs_0.3.3           stringr_1.4.0           
[29] digest_0.6.29            piggyback_0.1.1          R.utils_2.11.0           htmltools_0.5.2         
[33] pkgconfig_2.0.3          echodata_0.99.7          dbplyr_2.1.1             fastmap_1.1.0           
[37] BSgenome_1.62.0          htmlwidgets_1.5.4        rlang_1.0.2              rstudioapi_0.13         
[41] RSQLite_2.2.12           BiocIO_1.4.0             generics_0.1.2           jsonlite_1.8.0          
[45] echoconda_0.99.5         BiocParallel_1.28.3      zip_2.2.0                R.oo_1.24.0             
[49] RCurl_1.98-1.6           magrittr_2.0.3           GenomeInfoDbData_1.2.7   Matrix_1.4-1            
[53] Rcpp_1.0.8.3             fansi_1.0.3              reticulate_1.24-9000     lifecycle_1.0.1         
[57] R.methodsS3_1.8.1        stringi_1.7.6            yaml_2.3.5               zlibbioc_1.40.0         
[61] brio_1.1.3               BiocFileCache_2.2.1      grid_4.1.3               blob_1.2.2              
[65] parallel_4.1.3           crayon_1.5.1             lattice_0.20-45          GenomicFeatures_1.46.5  
[69] hms_1.1.1                KEGGREST_1.34.0          seqminer_8.4             pillar_1.7.0            
[73] rjson_0.2.21             clisymbols_1.2.0         biomaRt_2.50.3           pkgload_1.2.4           
[77] XML_3.99-0.9             glue_1.6.2               data.table_1.14.2        tzdb_0.3.0              
[81] png_0.1-7                vctrs_0.4.0              testthat_3.1.3           tidyr_1.2.0             
[85] purrr_0.3.4              assertthat_0.2.1         cachem_1.0.6             openxlsx_4.2.5          
[89] restfulr_0.0.13          tibble_3.1.6             GenomicAlignments_1.30.0 AnnotationDbi_1.56.2    
[93] memoise_2.0.1            ellipsis_0.3.2          

@vjcitn
Copy link

vjcitn commented Apr 10, 2022

Thanks for these reports. I am having a look at migrating Rsamtools/Rhtslib to more current htslib. I do not know how long it will take. @hpages @mtmorgan -- Herve has a number of fixes to htslib 1.7 sources that may or may not need to be migrated to an updated Rhtslib.

@vjcitn
Copy link

vjcitn commented Apr 10, 2022

@bschilder would you consider forking the relevant Bioconductor packages and adding unit tests that exhibit the problems you have identified, and adding these unit tests that will fail until these conditions are fixed?

@sanchit-saini
Copy link

Just a heads up, this also affects rtracklayer, which also has some functionality for importing VCFs (which I never knew it could do!). Tagging the rtracklayer maintainer as well. @sanchit-saini

I've updated the reprex above to demonstrate that the same error occurs with this method.

@lawremi
I'm just a collaborator tagging the actual maintainer.

@bschilder
Copy link
Author

bschilder commented Apr 25, 2022

@bschilder would you consider forking the relevant Bioconductor packages and adding unit tests that exhibit the problems you have identified, and adding these unit tests that will fail until these conditions are fixed?

I'm afraid this is a bit more than I can commit to atm, but please feel free to use the example I provided above. I think that should contain all of the information you need. @vjcitn

@bschilder
Copy link
Author

Thanks for these reports. I am having a look at migrating Rsamtools/Rhtslib to more current htslib. I do not know how long it will take. @hpages @mtmorgan -- Herve has a number of fixes to htslib 1.7 sources that may or may not need to be migrated to an updated Rhtslib.

Just checking in, has there been progress on fixing this? @hpages @mtmorgan

@vjcitn
Copy link

vjcitn commented Apr 25, 2022

Progress is being made but we need to release 3.15 of the whole ecosystem. After that I will try to deal with this.

@hpages
Copy link
Contributor

hpages commented May 5, 2022

An update on this: This is done for Linux and Mac (see Bioconductor/Rhtslib#4 (comment)). We still need to make sure things work properly on Windows.

@bschilder
Copy link
Author

bschilder commented May 7, 2022

thanks for the update @hpages

Updating Rhtslib via GitHub

remotes::install_github('Bioconductor/Rhtslib')

I just tried installing the latest Rhtslib from GitHub (v1.99.2) and rerunning the examples above. Even after restarting R, I still seem to be getting the exact same issues as before.

Updating Rhtslib via Bioc 3.16

BiocManager::install(version='devel')

That said, when I updated from Bioc 3.15 --> 3.16 within a Bioc Docker container, I noticed that the above examples work as expected! As in, they return the correct number of rows back without error.

So i'm wondering if this difference between installation methods might have something to do with the new version of htslib not replacing the old one when I install via GitHub? Or perhaps some other Bioc libraries also need to be upgraded in order for this to work, in which case maybe some minimum versions could be specified in the Rhtslib DESCRIPTION file.

Session info

R version 4.2.0 (2022-04-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.3.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
  [1] bitops_1.0-7                matrixStats_0.62.0          bit64_4.0.5                
  [4] filelock_1.0.2              progress_1.2.2              httr_1.4.3                 
  [7] GenomeInfoDb_1.32.1         tools_4.2.0                 utf8_1.2.2                 
 [10] R6_2.5.1                    DT_0.22                     DBI_1.1.2                  
 [13] BiocGenerics_0.42.0         tidyselect_1.1.2            prettyunits_1.1.1          
 [16] bit_4.0.4                   curl_4.3.2                  compiler_4.2.0             
 [19] cli_3.3.0                   Biobase_2.56.0              basilisk.utils_1.8.0       
 [22] xml2_1.3.3                  DelayedArray_0.22.0         rtracklayer_1.56.0         
 [25] readr_2.1.2                 rappdirs_0.3.3              stringr_1.4.0              
 [28] digest_0.6.29               Rsamtools_2.12.0            piggyback_0.1.2            
 [31] R.utils_2.11.0              basilisk_1.8.0              XVector_0.36.0             
 [34] pkgconfig_2.0.3             htmltools_0.5.2             echodata_0.99.9            
 [37] MatrixGenerics_1.8.0        dbplyr_2.1.1                fastmap_1.1.0              
 [40] BSgenome_1.64.0             htmlwidgets_1.5.4           rlang_1.0.2                
 [43] rstudioapi_0.13             RSQLite_2.2.14              BiocIO_1.6.0               
 [46] generics_0.1.2              jsonlite_1.8.0              echoconda_0.99.6           
 [49] BiocParallel_1.30.0         zip_2.2.0                   dplyr_1.0.9                
 [52] R.oo_1.24.0                 VariantAnnotation_1.42.0    RCurl_1.98-1.6             
 [55] magrittr_2.0.3              GenomeInfoDbData_1.2.8      Matrix_1.4-1               
 [58] Rcpp_1.0.8.3                S4Vectors_0.34.0            fansi_1.0.3                
 [61] reticulate_1.24             lifecycle_1.0.1             R.methodsS3_1.8.1          
 [64] stringi_1.7.6               yaml_2.3.5                  SummarizedExperiment_1.26.1
 [67] zlibbioc_1.42.0             brio_1.1.3                  BiocFileCache_2.4.0        
 [70] grid_4.2.0                  blob_1.2.3                  parallel_4.2.0             
 [73] crayon_1.5.1                dir.expiry_1.4.0            lattice_0.20-45            
 [76] Biostrings_2.64.0           GenomicFeatures_1.48.0      hms_1.1.1                  
 [79] KEGGREST_1.36.0             pillar_1.7.0                GenomicRanges_1.48.0       
 [82] rjson_0.2.21                biomaRt_2.52.0              stats4_4.2.0               
 [85] XML_3.99-0.9                glue_1.6.2                  data.table_1.14.2          
 [88] tzdb_0.3.0                  png_0.1-7                   vctrs_0.4.1                
 [91] testthat_3.1.4              tidyr_1.2.0                 purrr_0.3.4                
 [94] assertthat_0.2.1            cachem_1.0.6                openxlsx_4.2.5             
 [97] echotabix_0.99.6            restfulr_0.0.13             tibble_3.1.7               
[100] GenomicAlignments_1.32.0    AnnotationDbi_1.58.0        memoise_2.0.1              
[103] IRanges_2.30.0              ellipsis_0.3.2   

@mtmorgan
Copy link
Collaborator

mtmorgan commented May 7, 2022

VariantAnnotation needs to be re-installed (it calls Rsamtools' C code from C code). It has had a version bump so should be installable via BiocManager either later today or later on Sunday, all being well...

It looks like Rhtslib is available via BiocManager https://bioconductor.org/packages/3.16/bioc/html/Rhtslib.html.

Packages need to be installed in the correct order (which BiocManager::install() takes care of, once updated versions have successfully propagated...) ... first Rhtslib then Rsamtools then VariantAnnotation. If you installed Rsamtools (using a previous version of Rhtslib), then Rhtslib, Rsamtools will be statically linked to the previous version of Rhtslib, which explains why you see the same behavior. If you install Rhtslib then Rsamtools but don't install VariantAnnotation, the readVcf() etc will result in a segfault because VariantAnnotation is expecting a different version of the Rsamtools C code.

I'm not completely familiar with the macOS build system, but in general it is important that the same compiler and compiler settings are used for each library, so in general one would want to either install all from source, or all as binaries.

@hpages
Copy link
Contributor

hpages commented May 19, 2022

The latest Rsamtools (2.13.2) was updated to work with the new Rhtslib (based on htslib 1.15.1). It is now available in BioC 3.16 (current devel) via BiocManager::install(). This should grab the new Windows or Mac binary for Rsamtools 2.13.2 if you are on these platforms.

Can someone confirm that this issue is gone with Rsamtools 2.13.2? We want to make sure that this is tested on Windows before we close. Thanks!

H.

@vjcitn
Copy link

vjcitn commented May 19, 2022

I confirm that

vcf <- Rsamtools::scanTabix(file = target_path, param=param) 

from #33 (comment) succeeds with

R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    
 
attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] Rsamtools_2.13.2     Biostrings_2.65.0    XVector_0.37.0      
[4] GenomicRanges_1.49.0 GenomeInfoDb_1.33.3  IRanges_2.31.0      
[7] S4Vectors_0.35.0     BiocGenerics_0.43.0 

loaded via a namespace (and not attached):
[1] crayon_1.5.1           bitops_1.0-7           zlibbioc_1.43.0       
[4] BiocParallel_1.31.3    tools_4.2.0            RCurl_1.98-1.6        
[7] parallel_4.2.0         compiler_4.2.0         GenomeInfoDbData_1.2.8

@hpages
Copy link
Contributor

hpages commented May 19, 2022

Excellent. Thanks Vince!

@hpages hpages closed this as completed May 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants