
A resource should not be made available until it's fully downloaded #32

hpages opened this issue Oct 12, 2023 · 3 comments

hpages commented Oct 12, 2023

Reporting this for ExperimentHub but I suspect that AnnotationHub and BiocFileCache have similar issues.

In session 1 do:

library(ExperimentHub)
hub <- ExperimentHub()
removeResources(hub, "EH1039")
fname <- hub[["EH1039"]]  # this will start downloading the resource to the cache 

While the EH1039 resource is being downloaded in session 1, it's immediately made available in session 2.

In session 2 do:

library(rhdf5)
library(ExperimentHub)
hub <- ExperimentHub()

## Do this before the download in session 1 has finished:
fname <- hub[["EH1039"]]
rhdf5::h5ls(fname)
# Error in H5Fopen(file, flags = flags, fapl = fapl, native = native) : 
#   HDF5. File accessibility. Unable to open file.

The classic way around this is a lock mechanism that lets other sessions know the resource is currently being downloaded. The session that starts the download puts a lock on the resource, but only if the resource is not already locked. Other sessions trying to access the resource then wait for the lock to be released before returning the resource to the user.

This would also have the benefit of preventing 2 sessions from starting the same download. Right now this is possible if, for example, session 2 does fname <- hub[["EH1039", force=TRUE]] while session 1 is still downloading the resource. I don't know what the exact consequences of this are, but concurrent downloads of the same resource seem like something that should not be permitted.
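
To make the idea concrete, here is a minimal sketch of such a lock using the filelock package (already loaded in the sessionInfo below). The helpers hub_resource_path() and download_resource() are hypothetical stand-ins for the hub internals:

library(filelock)

download_with_lock <- function(hub, id) {
    dest <- hub_resource_path(hub, id)        # hypothetical: final location in the cache
    lck <- lock(paste0(dest, ".lock"), exclusive = TRUE, timeout = Inf)
    on.exit(unlock(lck), add = TRUE)
    ## Another session may have completed the download while we were
    ## waiting for the lock, so re-check before downloading.
    if (!file.exists(dest))
        download_resource(hub, id, dest)      # hypothetical downloader
    dest
}

With timeout = Inf, a second session simply blocks until the first one releases the lock, so it never sees a half-downloaded file and never starts a duplicate download.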

There's also the question of what happens when the session that started a download dies before the download is complete. Right now it seems that we end up with a corrupted resource in the cache, unless the download was cleanly interrupted at the command line with <CTRL+C> in session 1, in which case ExperimentHub seems able to remove the corrupted resource from the cache. But after a more brutal death (e.g. the user inadvertently kills their RStudio session or the terminal where they were running R, or the server is rebooted), the resource that ends up in the cache will be corrupted. This can be avoided by making the "download + register the resource in the sqlite db" sequence an atomic operation. Note that this is something that was brought up here last year.
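
A hedged sketch of what that atomic sequence could look like: download to a temporary file in the cache directory, rename it into place only once complete, and register it in the db last. add_resource_record() is a hypothetical placeholder; DBI::dbWithTransaction() is the real DBI API:

atomic_download <- function(url, dest, con) {
    tmp <- tempfile(tmpdir = dirname(dest))   # same filesystem, so the rename is atomic
    on.exit(unlink(tmp), add = TRUE)          # a dead or failed download leaves no trace
    download.file(url, destfile = tmp, mode = "wb")
    file.rename(tmp, dest)                    # expose the file only when it's complete
    ## Register in the sqlite db only after the file is in place.
    DBI::dbWithTransaction(con, {
        add_resource_record(con, dest)        # hypothetical: INSERT into the resources table
    })
}

If the session dies mid-download, the cache holds at most an orphaned temporary file and no db record, so the resource never looks available.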

Thanks,
H.

> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 23.04

Matrix products: default
BLAS:   /home/hpages/R/R-4.3.0/lib/libRblas.so 
LAPACK: /home/hpages/R/R-4.3.0/lib/libRlapack.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] TENxBrainData_1.21.0        SingleCellExperiment_1.23.0
 [3] SummarizedExperiment_1.31.1 Biobase_2.61.0             
 [5] ExperimentHub_2.9.1         AnnotationHub_3.9.2        
 [7] BiocFileCache_2.9.1         dbplyr_2.3.4               
 [9] HDF5Array_1.29.3            rhdf5_2.45.1               
[11] DelayedArray_0.27.10        SparseArray_1.1.12         
[13] S4Arrays_1.1.6              abind_1.4-5                
[15] MatrixGenerics_1.13.1       matrixStats_1.0.0          
[17] Matrix_1.6-1.1              GenomicRanges_1.53.2       
[19] GenomeInfoDb_1.37.6         IRanges_2.35.3             
[21] S4Vectors_0.39.3            BiocGenerics_0.47.0        

loaded via a namespace (and not attached):
 [1] KEGGREST_1.41.4               lattice_0.21-9               
 [3] rhdf5filters_1.13.5           vctrs_0.6.3                  
 [5] tools_4.3.0                   bitops_1.0-7                 
 [7] generics_0.1.3                curl_5.1.0                   
 [9] AnnotationDbi_1.63.2          tibble_3.2.1                 
[11] fansi_1.0.5                   RSQLite_2.3.1                
[13] blob_1.2.4                    pkgconfig_2.0.3              
[15] lifecycle_1.0.3               GenomeInfoDbData_1.2.10      
[17] compiler_4.3.0                Biostrings_2.69.2            
[19] httpuv_1.6.11                 htmltools_0.5.6.1            
[21] RCurl_1.98-1.12               yaml_2.3.7                   
[23] interactiveDisplayBase_1.39.0 pillar_1.9.0                 
[25] later_1.3.1                   crayon_1.5.2                 
[27] ellipsis_0.3.2                cachem_1.0.8                 
[29] mime_0.12                     tidyselect_1.2.0             
[31] digest_0.6.33                 purrr_1.0.2                  
[33] dplyr_1.1.3                   BiocVersion_3.18.0           
[35] fastmap_1.1.1                 grid_4.3.0                   
[37] cli_3.6.1                     magrittr_2.0.3               
[39] utf8_1.2.3                    withr_2.5.1                  
[41] filelock_1.0.2                promises_1.2.1               
[43] rappdirs_0.3.3                bit64_4.0.5                  
[45] XVector_0.41.1                httr_1.4.7                   
[47] bit_4.0.5                     png_0.1-8                    
[49] memoise_2.0.1                 shiny_1.7.5                  
[51] rlang_1.1.1                   Rcpp_1.0.11                  
[53] xtable_1.8-4                  glue_1.6.2                   
[55] DBI_1.1.3                     BiocManager_1.30.22          
[57] R6_2.5.1                      Rhdf5lib_1.23.2              
[59] zlibbioc_1.47.0              

lshep commented Oct 13, 2023

We do have a lock in BFC, but apparently it isn't sufficient. We can look into it.


romanzenka commented Oct 13, 2023

I think this is a related, but different, problem. The problem you describe is about resource acquisition going through the stages "we do not have it" -> "downloading" -> "available". A caching mechanism that is not aware that a resource is already being downloaded will waste downloads: multiple processes try, and only one (hopefully) wins and gets the cache updated.

The problem with marking a resource as "downloading" is that if the download fails, the item becomes a deadlock: everyone waits forever for the download to finish. So you would have to be extremely careful and install traps that unlock these resources if your download process dies before the download completes. And even that can fail, so you would also have to implement timeouts that clear downloads after a period of inactivity. All of this is necessary for a truly robust solution, but it is also quite hard to do correctly.
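
For illustration, a rough sketch of the timeout part, assuming the lock is a plain file whose modification time tells us how long the download has been inactive (the function name and the 30-minute threshold are made up):

break_stale_lock <- function(lockfile, max_age_secs = 30 * 60) {
    if (!file.exists(lockfile))
        return(invisible(FALSE))
    age <- as.numeric(difftime(Sys.time(), file.mtime(lockfile), units = "secs"))
    if (age > max_age_secs) {
        ## Assume the downloader died and remove the lock so others can retry.
        ## Note this is itself racy: two sessions can both decide the lock is stale.
        unlink(lockfile)
        return(invisible(TRUE))
    }
    invisible(FALSE)
}

Even the recovery path has races, which is exactly why this is hard to do correctly.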

Lastly, SQLite is not the best mechanism for synchronizing a large number of very active processes. It does support rudimentary locking at the file level, but it is not bulletproof (read more here: https://www.sqlite.org/lockingv3.html). So if you are after a genuinely robust solution, the package would have to support other, more robust database engines - the kind of guarantees you only get with a database server.

@HenrikBengtsson

Without having looked into the internals of how this currently works, a complementary mechanism to file locks, which are never 100% reliable, is to create/write/copy files atomically. When copying or saving a file, there is a time window during which the file is incomplete. The larger the file, or the slower the file system, the longer this window. One way to shorten the time an incomplete file is exposed is to write to a temporary file that is then renamed. For instance, instead of saving an RDS file using:

saveRDS(data, file = filename)

one can rely on file renaming being an atomic operation:

filename_tmp <- sprintf("%s.tmp", filename)
saveRDS(data, file = filename_tmp)    # write the data to a temporary file first
file.rename(filename_tmp, filename)   # the atomic rename exposes only the complete file

One can of course come up with more elaborate file extensions than *.tmp to reduce the risk of filename clashes, e.g. use tempfile() to generate filename_tmp in the same directory. Note that we want it in the same directory to avoid ending up with a cross-device rename, which cannot be atomic.
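
For instance, a small sketch of the tempfile() variant just described:

filename_tmp <- tempfile(pattern = basename(filename),
                         tmpdir  = dirname(filename),  # same directory, hence same device
                         fileext = ".tmp")
saveRDS(data, file = filename_tmp)
file.rename(filename_tmp, filename)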

I have used the above atomic write/copy strategy for a decade or so in other packages, and it has lowered the number of file-corruption errors reported.
