Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add git to filesystem source 301 devel #893

Closed

Conversation

deanja
Copy link
Contributor

@deanja deanja commented Jan 14, 2024

Support git filesystem Resource per dlt-hub/verified-sources#301

May also include general improvements to dlt fsspec handling.

Copy link

netlify bot commented Jan 14, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 801a008
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/65a493fa222c2e0008363747

@deanja
Copy link
Contributor Author

deanja commented Jan 14, 2024

Supporting parameters in fsspec URLs requires some changes in dlt.common.storages . Some comments on those changes below and remaining issues.

class FileItemDict(DictStrAny):
    """A FileItem dictionary with additional methods to get fsspec filesystem, open and read files."""

    def __init__(
        self,
        mapping: FileItem,
        credentials: Optional[Union[FileSystemCredentials, AbstractFileSystem]] = None,
    ):

show code

Points re supporting fsspec url arguments:

FileItem.file_url has conflicting responsibilities:

  1. parameter to instantiate an fsspec client, for which it would need netloc params intact, eg <repo_path>:<ref>@
  2. path for open() . fsspec will strip out the protocol and netloc args.
  3. path for readbytes(). fsspec will strip out the protocol and netloc args.
  4. dlt destination column - where it needs to be nicely formatted and secrets masked.

FileItemDict gets its file_url from the output of case 4 - glob_files() . So if the netloc is stripped there, it will cause unpredictable results for cases 1, 2, 3.

Options for resolving that:

  • don't strip file_url in glob_files(). Will need to redact secrets somehow (more in comment below). Need to address difficulty of building file_url due to unescaped characters in netloc params.
  • for case 1, filesystem Resource must instantiate FileItemDict with an fs client, not just a file_url. That may be needed anyway for new kwargs/client_kwargs to work - need to learn more about 'credentials' object.

I can see the file_url is a lightweight way to pass fsspec context between Source, Resource, transformer etc. But it is fragile.

In general for FileItemDict, not sure about this factory: questions below

    @property
    def fsspec(self) -> AbstractFileSystem:
        """The filesystem client is based on the available instance,  or url and credentials.

        Returns:
            AbstractFileSystem: The fsspec client.
        """
        if isinstance(self.fsspec_instance_or_credentials, AbstractFileSystem):
            return self.credentials
        else:
            return fsspec_filesystem(self["file_url"], self.credentials)[0
  • Are there any current use cases for constructing with url + FileSystemCredentials? Yes, filesystem Resource.
  • If yes, does FileItemDict also need to pass through the [kwargs and client_kwargs recently supported in Enhancements in Filesystem Configuration #869 (This is independent of the fsspec urls requirement).
  • The original bucket_url could help too, to solve url confusion noted above.
  • Then it starts to look very similar to fssspec_filesystem(), so I wonder about refactoring somehow. Perhaps callers (dlt Sources) should be responsible for passing in a working fs client instance, with testing for that happening in the source/resource/transformer.

@deanja
Copy link
Contributor Author

deanja commented Jan 14, 2024

Fsspec urls can contain some strange things that urlparse() can't understand. For example, a path containing / in the username: element. parsing this with a general algorithm is very tricky, even if parse.quote() unquote() is used.

Workaround here is to call fs_client._strip_protocol, which is extended by quite a few fsspec implementations.

def glob_files(
    fs_client: AbstractFileSystem, bucket_url: str, file_glob: str = "**"
) -> Iterator[FileItem]:
    """Get the files from the filesystem client.

    Args:
        fs_client (AbstractFileSystem): The filesystem client.
        bucket_url (str): The url to the bucket.
        file_glob (str): A glob for the filename filter.

    Returns:
        Iterable[FileItem]: The list of files.
    """
    import os

    protocol = urlparse(bucket_url).scheme
    fsspec_path = fs_client._strip_protocol(bucket_url)
    bucket_url = f"{protocol}://{fsspec_path}"

    bucket_url_parsed = urlparse(bucket_url)

show code

But _strip_protocol is an _internal method.

The only public way to get fsspec to return a stripped url is via url_to_fs(), which returns a cleaned path (as well as the fs instance). Using url_to_fs() could be elegant and simplify this glob_files() method, but may require some breaking changes to filesystem Resource. WIP on a branch

There's similar url parsing issue in FileSystemConfiguration class.

   def on_resolved(self) -> None:
        url = urlparse(self.bucket_url)

view code

That url is used for validation rules. It happens to work now, but it could be a problem for and any new rules that might be added to on_resolved(). This one is harder to address with _strip_protocol() because there is no fs client available. We could do something like a class factory for fsspec implementations, to get at the _strip_protocol() @classmethod, without having to instantiate.

@deanja
Copy link
Contributor Author

deanja commented Jan 14, 2024

Final little issue with fsspec url parameters. How to redact sensitive info.

    def __str__(self) -> str:
        """Return displayable destination location"""
        url = urlparse(self.bucket_url)
        # do not show passwords
        if url.password:
            new_netloc = f"{url.username}:****@{url.hostname}"
            if url.port:

show code

An fssepec implementation might:

  1. Put sensitive info in a generally non-sensitive url element. eg a password in the port. Recall we are considering hostname:port as a shorthand to clone/fetch a repo from github, bitbucket etc.
  2. Put non-sensitive info in a generally sensitive element, then user can't see it! eg, git ref in the password element, as currently happens with gitpythonfs.

@deanja deanja closed this Jan 29, 2024
@deanja deanja deleted the add-git-to-filesystem-source-301-devel branch January 29, 2024 01:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant