
A resource should not be made available until it's fully downloaded #32

hpages opened this issue Oct 12, 2023 · 3 comments

hpages commented Oct 12, 2023

Reporting this for ExperimentHub but I suspect that AnnotationHub and BiocFileCache have similar issues.

In session 1 do:

library(ExperimentHub)
hub <- ExperimentHub()
removeResources(hub, "EH1039")
fname <- hub[["EH1039"]]  # this will start downloading the resource to the cache 

While the EH1039 resource is being downloaded in session 1, it's immediately made available in session 2.

In session 2 do:

library(rhdf5)
library(ExperimentHub)
hub <- ExperimentHub()

## Do this before the download in session 1 has finished:
fname <- hub[["EH1039"]]
rhdf5::h5ls(fname)
# Error in H5Fopen(file, flags = flags, fapl = fapl, native = native) : 
#   HDF5. File accessibility. Unable to open file.

The classic way around this is a lock mechanism that lets other sessions know the resource is currently being downloaded. The session that starts the download puts a lock on the resource, but only if the resource is not already locked. Other sessions trying to access the resource then wait for the lock to be released before returning the resource to the user.

This would also have the benefit of preventing 2 sessions from starting the same download. Right now this is possible if, for example, session 2 does fname <- hub[["EH1039", force=TRUE]] while session 1 is still downloading the resource. I don't know what the exact consequences of this are, but concurrent downloads of the same resource seem like something that should not be permitted.
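
To make the idea concrete, here is a minimal sketch of such a lock using the filelock package (already loaded in the sessionInfo below). The helpers hub_resource_path() and download_resource() are hypothetical stand-ins for the hub internals:

library(filelock)

download_with_lock <- function(hub, id) {
    dest <- hub_resource_path(hub, id)        # hypothetical: final location in the cache
    lck <- lock(paste0(dest, ".lock"), exclusive = TRUE, timeout = Inf)
    on.exit(unlock(lck), add = TRUE)
    ## Another session may have completed the download while we were
    ## waiting for the lock, so re-check before downloading.
    if (!file.exists(dest))
        download_resource(hub, id, dest)      # hypothetical downloader
    dest
}

With timeout = Inf, a second session simply blocks until the first one releases the lock, so it never sees a half-downloaded file and never starts a duplicate download.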

There's also the question of what happens when the session that started a download dies before the download is complete. Right now it seems that we end up with a corrupted resource in the cache, unless the download was cleanly interrupted at the command line with <CTRL+C> in session 1, in which case ExperimentHub seems able to remove the corrupted resource from the cache. But after a more brutal death (e.g. the user inadvertently kills their RStudio session or the terminal where they were running R, or the server is rebooted), the resource that ends up in the cache will be corrupted. This can be avoided by making the "download + register the resource in the sqlite db" sequence an atomic operation. Note that this is something that was brought up here last year.
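
A hedged sketch of what that atomic sequence could look like: download to a temporary file in the cache directory, rename it into place only once complete, and register it in the db last. add_resource_record() is a hypothetical placeholder; DBI::dbWithTransaction() is the real DBI API:

atomic_download <- function(url, dest, con) {
    tmp <- tempfile(tmpdir = dirname(dest))   # same filesystem, so the rename is atomic
    on.exit(unlink(tmp), add = TRUE)          # a dead or failed download leaves no trace
    download.file(url, destfile = tmp, mode = "wb")
    file.rename(tmp, dest)                    # expose the file only when it's complete
    ## Register in the sqlite db only after the file is in place.
    DBI::dbWithTransaction(con, {
        add_resource_record(con, dest)        # hypothetical: INSERT into the resources table
    })
}

If the session dies mid-download, the cache holds at most an orphaned temporary file and no db record, so the resource never looks available.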

Thanks,
H.

> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 23.04

Matrix products: default
BLAS:   /home/hpages/R/R-4.3.0/lib/libRblas.so 
LAPACK: /home/hpages/R/R-4.3.0/lib/libRlapack.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] TENxBrainData_1.21.0        SingleCellExperiment_1.23.0
 [3] SummarizedExperiment_1.31.1 Biobase_2.61.0             
 [5] ExperimentHub_2.9.1         AnnotationHub_3.9.2        
 [7] BiocFileCache_2.9.1         dbplyr_2.3.4               
 [9] HDF5Array_1.29.3            rhdf5_2.45.1               
[11] DelayedArray_0.27.10        SparseArray_1.1.12         
[13] S4Arrays_1.1.6              abind_1.4-5                
[15] MatrixGenerics_1.13.1       matrixStats_1.0.0          
[17] Matrix_1.6-1.1              GenomicRanges_1.53.2       
[19] GenomeInfoDb_1.37.6         IRanges_2.35.3             
[21] S4Vectors_0.39.3            BiocGenerics_0.47.0        

loaded via a namespace (and not attached):
 [1] KEGGREST_1.41.4               lattice_0.21-9               
 [3] rhdf5filters_1.13.5           vctrs_0.6.3                  
 [5] tools_4.3.0                   bitops_1.0-7                 
 [7] generics_0.1.3                curl_5.1.0                   
 [9] AnnotationDbi_1.63.2          tibble_3.2.1                 
[11] fansi_1.0.5                   RSQLite_2.3.1                
[13] blob_1.2.4                    pkgconfig_2.0.3              
[15] lifecycle_1.0.3               GenomeInfoDbData_1.2.10      
[17] compiler_4.3.0                Biostrings_2.69.2            
[19] httpuv_1.6.11                 htmltools_0.5.6.1            
[21] RCurl_1.98-1.12               yaml_2.3.7                   
[23] interactiveDisplayBase_1.39.0 pillar_1.9.0                 
[25] later_1.3.1                   crayon_1.5.2                 
[27] ellipsis_0.3.2                cachem_1.0.8                 
[29] mime_0.12                     tidyselect_1.2.0             
[31] digest_0.6.33                 purrr_1.0.2                  
[33] dplyr_1.1.3                   BiocVersion_3.18.0           
[35] fastmap_1.1.1                 grid_4.3.0                   
[37] cli_3.6.1                     magrittr_2.0.3               
[39] utf8_1.2.3                    withr_2.5.1                  
[41] filelock_1.0.2                promises_1.2.1               
[43] rappdirs_0.3.3                bit64_4.0.5                  
[45] XVector_0.41.1                httr_1.4.7                   
[47] bit_4.0.5                     png_0.1-8                    
[49] memoise_2.0.1                 shiny_1.7.5                  
[51] rlang_1.1.1                   Rcpp_1.0.11                  
[53] xtable_1.8-4                  glue_1.6.2                   
[55] DBI_1.1.3                     BiocManager_1.30.22          
[57] R6_2.5.1                      Rhdf5lib_1.23.2              
[59] zlibbioc_1.47.0              

lshep commented Oct 13, 2023

We do have a lock in BFC, but apparently it isn't sufficient. We can look into it.


romanzenka commented Oct 13, 2023

I think this is a related, but different, problem. The problem you describe is about resource acquisition going through the stages "we do not have it" -> "downloading" -> "available". A caching mechanism that is not aware that a resource is already being downloaded will waste downloads: multiple processes try, and only one (hopefully) wins and gets the cache updated.

The problem with marking a resource as "downloading" is that if the download fails, the item becomes a deadlock: everyone waits forever for the download to finish. So you would have to be extremely careful and install traps that unlock these resources if your download process dies before the download completes. And even that can fail, so you would also have to implement timeouts that clear downloads after a period of inactivity. All of this is necessary for a truly robust solution, but it is also quite hard to do correctly.
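
For illustration, a rough sketch of the timeout part, assuming the lock is a plain file whose modification time tells us how long the download has been inactive (the function name and the 30-minute threshold are made up):

break_stale_lock <- function(lockfile, max_age_secs = 30 * 60) {
    if (!file.exists(lockfile))
        return(invisible(FALSE))
    age <- as.numeric(difftime(Sys.time(), file.mtime(lockfile), units = "secs"))
    if (age > max_age_secs) {
        ## Assume the downloader died and remove the lock so others can retry.
        ## Note this is itself racy: two sessions can both decide the lock is stale.
        unlink(lockfile)
        return(invisible(TRUE))
    }
    invisible(FALSE)
}

Even the recovery path has races, which is exactly why this is hard to do correctly.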

Lastly, SQLite is not the best mechanism for synchronizing a large number of very active processes. It does support rudimentary locking at the file level, but it is not bulletproof (read more here: https://www.sqlite.org/lockingv3.html). So if you are after a genuinely robust solution, the package would have to support other, more robust database engines - the kind of guarantees you only get with a database server.

@HenrikBengtsson

Without having looked into the internals of how this currently works, a complementary mechanism to file locks, which are never 100% reliable, is to create/write/copy files atomically. When copying or saving a file, there is a time window during which the file is incomplete. The larger the file, or the slower the file system, the longer this window. One way to shorten the time an incomplete file is exposed is to write to a temporary file that is then renamed. For instance, instead of saving an RDS file using:

saveRDS(data, file = filename)

one can rely on file renaming being an atomic operation:

filename_tmp <- sprintf("%s.tmp", filename)
saveRDS(data, file = filename_tmp)    # write the data to a temporary file first
file.rename(filename_tmp, filename)   # the atomic rename exposes only the complete file

One can of course come up with more elaborate file extensions than *.tmp to reduce the risk of filename clashes, e.g. use tempfile() to generate filename_tmp in the same directory. Note that we want it in the same directory to avoid ending up with a cross-device rename, which cannot be atomic.
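
For instance, a small sketch of the tempfile() variant just described:

filename_tmp <- tempfile(pattern = basename(filename),
                         tmpdir  = dirname(filename),  # same directory, hence same device
                         fileext = ".tmp")
saveRDS(data, file = filename_tmp)
file.rename(filename_tmp, filename)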

I have used the above atomic write/copy strategy for a decade or so in other packages, and it has lowered the number of file-corruption errors reported.
