Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to turn off caching (or automatically remove it once files are uploaded or loaded into memory) #233

Closed
pbridger opened this issue Jun 2, 2022 · 8 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@pbridger
Copy link

pbridger commented Jun 2, 2022

I've been writing large files to s3 something like this:

with dest_path.open('wb') as f:
    torch.save(model.state_dict(), f)

where dest_path is something like s3://bucket/path.pt.

The bug is that I see my docker container accumulating all files saved to S3 as /tmp/tmp.../bucket/path.pt

@pjbull
Copy link
Member

pjbull commented Jun 2, 2022

Thanks for the report @pbridger.

Keeping the files in the temp folder is by design, since having the cache means that you won't need to download them if they get used again. See the docs page on caching for more information.

We use the OS supplied temp directory by default. It varies from OS to OS, but this should get cleared by the operating system with some frequency. It seems like whatever system your container is running either doesn't clear the cache explicitly or you haven't hit the OS conditions where it empties the temp directory for you (e.g., for some OSes system reboot, which might not happen in a container).

Which of these would be better for your use case?

Either way, removing the files from the tmp dir will have no adverse affects once the file is uploaded. The fspath property will get you the path to the cached version on the local filesystem as a string. After uploading you could explicitly delete the cache file with Path(dest_path.fspath).unlink().

@pbridger
Copy link
Author

pbridger commented Jun 2, 2022

Thanks for your great response! I should have read the docs more thoroughly. My current work-around is roughly what you describe.

For our use case the ability to turn off the cache is simplest from the API client perspective: write files from many places in the code, and never think about cleaning up cloudpathlib internals.

Part of me thinks if a remote path is used in a create, write, close pattern then caching is unlikely to be useful. But OTOH you probably just want a consistent, simple caching behaviour.

@pjbull pjbull changed the title /tmp file remains after writing to AnyPath s3 destination Add ability to turn off caching (or automatically remove it once files are uploaded or loaded into memory) Jun 2, 2022
@pjbull
Copy link
Member

pjbull commented Jun 2, 2022

For our use case the ability to turn off the cache is simplest from the API client perspective: write files from many places in the code, and never think about cleaning up cloudpathlib internals.

Updated title to reflect this. I think the simplest implementation would be to check an envvar and/or kwarg in CloudPath.__init__ and set a ._clean_cache flag on the object. Then we can do Path(self.fspath).unlink() within CloudPath.__del__ so it is removed on GC.

Part of me thinks if a remote path is used in a create, write, close pattern then caching is unlikely to be useful. But OTOH you probably just want a consistent, simple caching behaviour.

Yeah, something like that is going to be too complex to bookkeep, so not worth the implementation/maintenance burden IMO. It may be the case that when we implement #9 we can have a mode where nothing ever gets written to a cache on disk.

@pjbull pjbull added enhancement New feature or request good first issue Good for newcomers labels Jun 2, 2022
@david-woelfle
Copy link

+1 for this feature! Took me over two hours of debugging to find the reason why the root volume on my linux VM is running out of space. Similar to OP I run cloudpathlib in a docker container too, which basically means that the temporary directory of the container is never cleared by the system as long as the container lives (many months in my case), not even on VM reboots.

Besides, I didn't even expect cloudpathlib to store the files locally, for me it feels like I open a file on the cloud server and have thus expected that I have to care for caching myself.

@b-rad-c
Copy link

b-rad-c commented Dec 5, 2022

I think the simplest implementation would be to check an envvar and/or kwarg in CloudPath.init and set a ._clean_cache flag on the object.

I would like to fully disable the cache instead of auto-cleaning it. I wasn't expecting there to be a cache (my fault for not fully reading the manual) and this now complicates my testing. I have a large scale pipeline for the media industry with files usually in the 100s of GBs. We're hybrid on-prem and aws, our app runs in a container but when its on-prem it will also have local network storage mounted, and a locally deployed s3 object store. We will support every combination of s3:// and file:// sources and destinations. I think I can figure out how to hack around the cache system but it won't be trivial to validate this for production. It's currently looking like it will make more sense to use cloudpathlib for the pydantic integration and bootstrapping s3 clients but to actually boto for the file transfers. However if I could do CLOUDPATHLIB_DISABLE_CACHE=true I wouldn't have to think about anything.

This would be better than auto deleting the cache because even with that option I still have to make sure my cache is on the external mounted storage /mnt/file_server/file.txt vs the container's local storage /tmp/file.txt so that I don't get performance hits.

That said, I think the cache is pretty cool and am thinking about personal backup use cases, having an on-off switch would make this a really powerful library.

@pjbull
Copy link
Member

pjbull commented Dec 18, 2022

Consolidating into #10.

@pjbull
Copy link
Member

pjbull commented Feb 16, 2023

@pbridger @b-rad-c @david-woelfle We recently updated the file cache clearing options, which may be of interest for your use cases. Documentation here: https://cloudpathlib.drivendata.org/stable/caching/#clearing-the-file-cache

@david-woelfle
Copy link

@pbridger @b-rad-c @david-woelfle We recently updated the file cache clearing options, which may be of interest for your use cases. Documentation here: https://cloudpathlib.drivendata.org/stable/caching/#clearing-the-file-cache

Thanks a lot for implementing this @pjbull! Looks good and I am looking forward to use the new feature :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

4 participants