A resource should not be made available until it's fully downloaded #32
Comments
We do have a lock in BFC (BiocFileCache), but apparently that isn't sufficient. We can look into it.
I think this is a related, but different problem. The problem you describe is about resource acquisition going through the stages "we do not have it" -> "downloading" -> "available". A caching mechanism that is not aware that a resource is already being downloaded will waste downloads, as multiple processes try and only one (hopefully) wins and gets the cache updated. The problem with marking a resource as "downloading" is that if the download fails, the item becomes a deadlock and everyone waits forever for the download to finish. So you would have to be extremely careful and develop traps that unlock these resources if your download process dies before the download completes. Even then it can fail, so you would also have to implement timeouts that clear downloads after a period of inactivity. All of this is necessary for a truly robust solution, but I think it is also quite hard to do correctly.

Lastly, SQLite is not the best mechanism for synchronizing a large number of very active processes. It does support rudimentary locking at the file level, but it is not bulletproof (read more here: https://www.sqlite.org/lockingv3.html ). So if you are after a genuinely robust solution, the package would have to support other, more robust database engines, which is something you can only do with a database server.
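To make the timeout idea concrete, here is a minimal sketch in base R, assuming the "downloading" marker is a lock file whose modification time the downloading process refreshes periodically (for example with Sys.setFileTime()). The names lock_is_stale, clear_stale_lock and max_idle_secs are hypothetical and not part of BiocFileCache, AnnotationHub or ExperimentHub. The check itself is not race-free: two waiting processes could both decide to clear the same stale lock.

```r
# Hypothetical helpers, not part of any Bioconductor package.

# TRUE if the lock exists and has not been touched for max_idle_secs seconds.
lock_is_stale <- function(lock_path, max_idle_secs = 600, now = Sys.time()) {
  mtime <- file.info(lock_path)$mtime
  if (is.na(mtime)) {
    return(FALSE)  # no lock present, so nothing to clear
  }
  as.numeric(difftime(now, mtime, units = "secs")) > max_idle_secs
}

# Remove the lock only if it looks abandoned; returns TRUE if it was removed.
clear_stale_lock <- function(lock_path, max_idle_secs = 600) {
  if (lock_is_stale(lock_path, max_idle_secs)) {
    unlink(lock_path, recursive = TRUE)
    return(TRUE)
  }
  FALSE
}
```

A waiting session would call clear_stale_lock() before each retry, so that a download whose owner died cannot block other sessions indefinitely.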
Without having looked into the internals of how it currently works, a mechanism that complements file locks, which are never 100% reliable, is to create, write, and copy files atomically. When a file is being copied or saved, there is a period during which the file is incomplete. The larger the file, or the slower the file system, the longer this period is. To shorten the time during which an incomplete file is exposed, write to a temporary file and then rename it. For instance, instead of saving an RDS file using:

saveRDS(data, file = filename)

one can rely on file renaming being an atomic operation:

filename_tmp <- sprintf("%s.tmp", filename)
saveRDS(data, file = filename_tmp)
file.rename(filename_tmp, filename)

One can of course come up with more elaborate file extensions than .tmp. I have used the above atomic write/copy strategy for a decade or so in other packages, and it has lowered the number of file-corruption errors reported.
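The same write-then-rename strategy can be wrapped in a small helper. This is only a sketch: atomic_save_rds is a hypothetical name, and it assumes the temporary file and the target live in the same directory, since a rename across file systems is generally not atomic.

```r
# Hypothetical helper, not part of any existing package.
atomic_save_rds <- function(object, filename) {
  # Create the temporary file next to the target so that the rename stays
  # within a single file system.
  tmp <- tempfile(pattern = paste0(basename(filename), "."),
                  tmpdir = dirname(filename), fileext = ".tmp")
  # Clean up the temporary file if anything below fails.
  on.exit(if (file.exists(tmp)) unlink(tmp), add = TRUE)

  saveRDS(object, file = tmp)
  if (!file.rename(tmp, filename)) {
    stop("could not rename ", tmp, " to ", filename)
  }
  invisible(filename)
}
```

Using a unique temporary name per call, rather than a fixed .tmp suffix, also keeps two processes that save the same target from writing into each other's temporary file.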
Reporting this for ExperimentHub but I suspect that AnnotationHub and BiocFileCache have similar issues.
In session 1 do:
While the EH1039 resource is being downloaded in session 1, it's immediately made available in session 2.
In session 2 do:
The classic way around this is to have some kind of lock mechanism that lets other sessions know that the resource is currently being downloaded. The session that starts the download puts the lock on the resource, only if the resource is not already locked. Other sessions trying to access the resource then just wait for the lock to be removed before they return the resource to the user.
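As a rough illustration of that lock protocol, here is a sketch in base R that uses a lock directory, relying on directory creation succeeding for only one caller. The names acquire_lock, release_lock and with_download_lock are hypothetical, and this is not how ExperimentHub currently works. It also does not handle a lock left behind by a session that died.

```r
# Hypothetical helpers, not part of ExperimentHub or BiocFileCache.

# dir.create() returns TRUE only for the caller that actually created the
# directory, so at most one session acquires the lock.
acquire_lock <- function(lock_dir) {
  isTRUE(dir.create(lock_dir, showWarnings = FALSE))
}

release_lock <- function(lock_dir) {
  unlink(lock_dir, recursive = TRUE)
}

# Download the resource unless another session already did; sessions that
# find the lock taken poll until it is released.
with_download_lock <- function(resource_path, download, poll_secs = 0.5) {
  lock_dir <- paste0(resource_path, ".lock")
  repeat {
    if (acquire_lock(lock_dir)) {
      on.exit(release_lock(lock_dir), add = TRUE)
      if (!file.exists(resource_path)) {
        download(resource_path)
      }
      return(resource_path)
    }
    Sys.sleep(poll_secs)
  }
}
```

Here download is any function that writes the resource to the given path. A session that had to wait acquires the lock afterwards, finds the file already present, and returns it without downloading again.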
This will also have the benefit of preventing 2 sessions from starting the same download. Right now this is possible if for example session 2 does
fname <- hub[["EH1039", force=TRUE]]
while session 1 is still downloading the resource. I don't know what the exact consequences of this are, but concurrent downloads of the same resource seem like something that should not be permitted.

There's also the question of what happens when the session that started a download dies before the download is complete. Right now it seems that we end up with a corrupted resource in the cache, unless the download was cleanly interrupted at the command line with <CTRL+C> in session 1, in which case it seems that ExperimentHub is able to remove the corrupted resource from the cache. But in case of a more brutal death (e.g. the user inadvertently kills their RStudio session or the terminal where they were running R at the command line, or the server is rebooted), the resource that ends up in the cache will be corrupted. This can be avoided by making the "download + register the resource in the sqlite db" sequence an atomic operation. Note that this is something that was brought up here last year.
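One way to approximate such an atomic sequence is sketched below, assuming the resource is first downloaded under a temporary name and is only registered once it has been renamed into place. fetch_then_register, its register callback, and the .part suffix are hypothetical and do not reflect the actual ExperimentHub or BiocFileCache internals; register stands in for whatever writes the entry to the sqlite db.

```r
# Hypothetical helper, not part of ExperimentHub or BiocFileCache.
fetch_then_register <- function(url, dest, register,
                                fetch = function(url, destfile) {
                                  utils::download.file(url, destfile,
                                                       mode = "wb", quiet = TRUE)
                                }) {
  partial <- paste0(dest, ".part")
  # Remove the partial file on error or clean interruption.
  on.exit(unlink(partial), add = TRUE)

  status <- fetch(url, partial)
  if (!identical(as.integer(status), 0L) || !file.exists(partial)) {
    stop("download failed: ", url)
  }
  # Only a complete file ever appears under its final name ...
  if (!file.rename(partial, dest)) {
    stop("could not rename ", partial, " to ", dest)
  }
  # ... and only a file that is in place ever gets registered.
  register(dest)
  invisible(dest)
}
```

If the session is killed mid-download, the cleanup code never runs, but the leftover file still carries the .part name and was never registered, so other sessions would not be handed a corrupted resource.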
Thanks,
H.