Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement smart_open instead of .open() to allow efficient streaming (saving/loading) of large files to cloud bucket. #264

Open
hugolytics opened this issue Aug 19, 2022 · 4 comments

Comments

@hugolytics
Copy link

hugolytics commented Aug 19, 2022

The current implementation of the .open methods consists of a local cache which is then synchronized with the cloud.

This method can be replaced by smart_open, to allow for a more efficient mechanism.

One can take inspiration from aws' S3PathLib, (however, that library handles boto session in a way that is not thread-safe, which has made me switch to this library).

Currently, I subclassed S3Path and implemented the aforementioned S3Pathlib's implementation as follows:

from cloudpathlib import S3Path as BaseS3Path
import smart_open

# replace the .open method of S3Path with smart_open
class S3Path(BaseS3Path):
    def open(
            self,
            mode="r",
            buffering=-1,
            encoding=None,
            errors=None,
            newline=None,
            closefd=True,
            opener=None,
            ignore_ext=False,
            compression=None,
            api_kwargs: dict = None, # type: ignore
        ):
            """
            Open S3Path as a file-liked object.
            :return: a file-like object.
            See https://github.com/RaRe-Technologies/smart_open for more info.
            """
            
            kwargs = dict(
                uri=self.as_uri(),
                mode=mode,
                buffering=buffering,
                encoding=encoding,
                errors=errors,
                newline=newline,
                closefd=closefd,
                opener=opener,
                transport_params={"client": self.client}
            )
            return smart_open.open(**kwargs)

    def read_text(
        self,
        encoding="utf-8",
        errors=None,
    ) -> str:
        with self.open(
            mode="r",
            encoding=encoding,
            errors=errors,

        ) as f:
            return f.read()

    def read_bytes(self, ) -> bytes:
        with self.open(mode="rb") as f:
            return f.read()

    def write_text(
        self,
        data: str,
        encoding="utf-8",
        errors=None,
        newline=None,

    ):
        with self.open(
            mode="w",
            encoding=encoding,
            errors=errors,
            newline=newline,
        ) as f:
            f.write(data)

    def write_bytes(self, data: bytes):
        with self.open(mode="wb") as f:
            f.write(data)

However, I could also open a pull request to merge this with the cloudpath definition, since smart_open is cloud-agnostic.

@pjbull
Copy link
Member

pjbull commented Aug 19, 2022

Thanks @hugolytics for your thoughts here. It's definitely interesting to think about ways to leverage smart_open, especially given the breadth of backends it supports. We have a number of features we are considering that it could help with (#9, #10, #29).

To me, this issue is similar to the discussion in #96 and #109. There are certainly backend packages like this that handle the operational side of things that we should consider, since our primary purpose is supporting the pathlib API.

That said, we won't merge the PR (#265 ) as is for a number of reasons, so I'm going to close it:

  • smart_open is designed explicitly for streaming, but there are workflows that benefit from the local cache architecture instead. Ideally we'd support both.
  • We've been wary of taking additional dependencies beyond the "official" SDKs for cloud providers.
  • I think we'd favor an implementation that leverages smart_open but keep the consistency that our current *Client/*Path APIs support.

Note that it is worth bumping #92 that lists these alternatives

@msmitherdc
Copy link

I’d love to see smart_open added here also, as an alternative to the cache concept. We have to open large zip files (10s-100s of gb) just to read some content in place and this all works well with cloudpathlib and smart-open (in my fork).

@pjbull
Copy link
Member

pjbull commented Apr 22, 2023

@msmitherdc Thanks for the comment. To better understand your use case, what are the specific things that you want smart_open for? Is it streaming/partial reads/writes or something beyond that case?

@msmitherdc
Copy link

we are opening 3dtiles and i3s (slpk) mesh files. These are large zip files that we read json files out of. We use reads of the files out of the zip to get info about the mesh. For serving them out for cesium, we read byte ranges and serve them out to the client. So streaming and partial reads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants