Add ability to turn off caching (or automatically remove it once files are uploaded or loaded into memory) #233
Comments
Thanks for the report @pbridger. Keeping the files in the temp folder is by design, since having the cache means that you won't need to download them if they get used again. See the docs page on caching for more information. We use the OS supplied temp directory by default. It varies from OS to OS, but this should get cleared by the operating system with some frequency. It seems like whatever system your container is running either doesn't clear the cache explicitly or you haven't hit the OS conditions where it empties the temp directory for you (e.g., for some OSes system reboot, which might not happen in a container). Which of these would be better for your use case?
Either way, removing the files from the tmp dir will have no adverse effects once the file is uploaded. The |
Thanks for your great response! I should have read the docs more thoroughly. My current work-around is roughly what you describe. For our use case the ability to turn off the cache is simplest from the API client perspective: write files from many places in the code, and never think about cleaning up cloudpathlib internals. Part of me thinks if a remote path is used in a create, write, close pattern then caching is unlikely to be useful. But OTOH you probably just want a consistent, simple caching behaviour. |
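For readers after a similar work-around, here is a minimal sketch of the create, write, close pattern with explicit cleanup. It uses only the standard library and is not cloudpathlib's API: `write_then_discard` and `fake_upload` are hypothetical names, and `fake_upload` stands in for whatever actually copies the file to remote storage.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Callable, Dict, Iterator


@contextmanager
def write_then_discard(upload: Callable[[Path], None], name: str) -> Iterator[Path]:
    """Yield a scratch file path, upload it on clean exit, then delete the local copy."""
    scratch_dir = Path(tempfile.mkdtemp(prefix="upload-scratch-"))
    local_path = scratch_dir / name
    try:
        yield local_path
        # Only reached if the with-body did not raise.
        upload(local_path)
    finally:
        # Remove the local copy whether or not the write or upload succeeded.
        shutil.rmtree(scratch_dir, ignore_errors=True)


# Hypothetical stand-in for a real upload: records the bytes in a dict.
uploaded: Dict[str, bytes] = {}


def fake_upload(path: Path) -> None:
    uploaded[path.name] = path.read_bytes()


with write_then_discard(fake_upload, "path.pt") as scratch:
    scratch.write_bytes(b"weights")
```

The point of the design is that callers never see or manage the scratch directory, so nothing accumulates locally even if the process lives for months.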
Updated title to reflect this. I think the simplest implementation would be to check an envvar and/or kwarg in
Yeah, something like that is going to be too complex to bookkeep, so not worth the implementation/maintenance burden IMO. It may be the case that when we implement #9 we can have a mode where nothing ever gets written to a cache on disk. |
+1 for this feature! It took me over two hours of debugging to find out why the root volume on my Linux VM was running out of space. Similar to OP, I run cloudpathlib in a Docker container too, which basically means that the temporary directory of the container is never cleared by the system as long as the container lives (many months in my case), not even on VM reboots. Besides, I didn't even expect cloudpathlib to store the files locally. To me it feels like opening a file on the cloud server, so I expected to have to take care of caching myself. |
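For anyone debugging the same symptom, a rough sketch of how one might check which entries in the temp directory are taking up space. This is plain standard-library Python, not part of cloudpathlib; `tree_size` and `largest_entries` are hypothetical helper names.

```python
import os
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple


def tree_size(root: Path) -> int:
    """Total size in bytes of files under root (symlinks are not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for fname in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, fname)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def largest_entries(root: Optional[Path] = None, top: int = 10) -> List[Tuple[int, str]]:
    """Return (size_in_bytes, name) for the biggest direct children of root."""
    if root is None:
        root = Path(tempfile.gettempdir())  # the OS-supplied temp directory
    sizes = []
    for entry in os.scandir(root):
        try:
            if entry.is_dir(follow_symlinks=False):
                size = tree_size(Path(entry.path))
            else:
                size = entry.stat(follow_symlinks=False).st_size
        except OSError:
            continue
        sizes.append((size, entry.name))
    return sorted(sizes, reverse=True)[:top]
```

Calling `largest_entries()` with no arguments inside the container lists the biggest items under the default temp directory, which is where the cache described above ends up by default.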
I would like to fully disable the cache instead of auto-cleaning it. I wasn't expecting there to be a cache (my fault for not fully reading the manual) and this now complicates my testing. I have a large-scale pipeline for the media industry with files usually in the 100s of GBs. We're hybrid on-prem and AWS; our app runs in a container, but when it's on-prem it will also have local network storage mounted, and a locally deployed S3 object store. We will support every combination of s3:// and file:// sources and destinations. I think I can figure out how to hack around the cache system, but it won't be trivial to validate this for production. It's currently looking like it will make more sense to use cloudpathlib for the pydantic integration and bootstrapping S3 clients, but to actually use boto for the file transfers. However if I could do This would be better than auto-deleting the cache because even with that option I still have to make sure my cache is on the external mounted storage. That said, I think the cache is pretty cool and am thinking about personal backup use cases; having an on-off switch would make this a really powerful library. |
Consolidating into #10. |
@pbridger @b-rad-c @david-woelfle We recently updated the file cache clearing options, which may be of interest for your use cases. Documentation here: https://cloudpathlib.drivendata.org/stable/caching/#clearing-the-file-cache |
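As a rough sketch of what using those options might look like, the snippet below selects a mode that removes the cached copy once a file is closed. The environment variable name `CLOUDPATHLIB_FILE_CACHE_MODE`, the `close_file` value, and the `file_cache_mode` keyword are taken from my reading of the linked docs and should be treated as assumptions to verify against your installed version; `make_client` is a hypothetical helper.

```python
import os

# Assumption: cloudpathlib reads this variable when a client is created, and
# "close_file" is one of the accepted values (check the linked caching docs).
os.environ["CLOUDPATHLIB_FILE_CACHE_MODE"] = "close_file"


def make_client():
    """Build an S3 client with the cache mode set explicitly instead of via env var."""
    # Imported lazily so the module can be loaded without cloudpathlib installed.
    from cloudpathlib import S3Client
    from cloudpathlib.enums import FileCacheMode

    # Assumption: file_cache_mode is accepted as a keyword argument here.
    return S3Client(file_cache_mode=FileCacheMode.close_file)
```

Either route should be enough on its own; the environment variable is convenient in containers because it needs no code changes.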
Thanks a lot for implementing this @pjbull! Looks good and I am looking forward to using the new feature :) |
I've been writing large files to S3 something like this:
where dest_path is something like s3://bucket/path.pt. The bug is that I see my Docker container accumulating all files saved to S3 as /tmp/tmp.../bucket/path.pt