Minio local endpoint is broken #2303

Open
smasyutin opened this issue Feb 13, 2025 · 8 comments · May be fixed by #2343
Assignees
Labels
bug Something isn't working

Comments

@smasyutin

dlt version

1.6.1

Describe the problem

I want to read a pipeline dataset from my local minio instance. I create the pipeline in code as follows:

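import dlt
from dlt.common.configuration.specs import AwsCredentials

# Pipeline writing to a local minio instance through the filesystem destination
pipeline = dlt.pipeline(
    pipeline_name="mypipeline",
    destination=dlt.destinations.filesystem(
        bucket_url="mybucket",
        credentials=AwsCredentials(
            aws_access_key_id="minioadmin",
            aws_secret_access_key="minioadmin",
            endpoint_url="http://localhost:9000",
        ),
    ),
    dataset_name="written_successfully",
)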

But then when I try to read it as

pipeline.dataset()["written_successfully"].df()

I get an error IO Error: Could not establish connection error for HTTP GET to '//localhost:9000/?encoding-type=url&list-type=2&prefix=...'

The "http:" was cut out of the URL.

Expected behavior

I can work flawlessly with dlthub regardless of whether the storage is local, cloud, or self-hosted like minio.

Operating system

macOS

Runtime environment

Local

Python version

3.11

dlt data source

No response

dlt destination

Filesystem & buckets

Other deployment details

No response

Additional information

No response

@sh-rp
Collaborator

sh-rp commented Feb 17, 2025

minio is an S3 drop-in replacement, so your bucket_url needs to follow S3 notation: "s3://my_bucket"

@sh-rp sh-rp self-assigned this Feb 17, 2025
@sh-rp sh-rp added the wontfix This will not be worked on label Feb 17, 2025
@smasyutin
Author

Sorry @sh-rp, I did not show my actual bucket value. The real value had the "s3://" prefix. DLT was able to write to it, but it failed to read. That is the issue.

@smasyutin
Author

@sh-rp you may be right that the issue is somewhere else. I cannot share all of the code I use to run DLT in a concurrent environment, where I see DLT struggle: it works locally on a Mac but fails to find tables in Kubernetes. And now there is this issue of being able to write but not read.

While debugging the DLT code I saw that you pass all the DuckDB secret settings correctly. But when interacting with DLT, it fails to get the correct endpoint URL. I'm not sure whether that is related to DuckDB thread-safety specifics or something else.

Anyway, I dropped DLT for reading and replaced it with a plain pyarrow dataset. I will likely come back to DLT when I need more complex data relationships than a single table.
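For readers looking for the same workaround, reading with a plain pyarrow dataset against a local minio endpoint might look roughly like the sketch below. This is not the code used in this thread: the helper names `s3_filesystem_kwargs` and `read_table`, the example path, and the assumption that the files are Parquet are all hypothetical. Only `pyarrow.fs.S3FileSystem` and `pyarrow.dataset.dataset` are real pyarrow APIs.

```python
from urllib.parse import urlparse


def s3_filesystem_kwargs(endpoint_url, access_key, secret_key):
    """Translate an endpoint URL into keyword arguments for pyarrow.fs.S3FileSystem.

    pyarrow takes the host[:port] and the scheme as two separate arguments,
    so the "http" part is kept explicitly instead of being dropped.
    """
    parsed = urlparse(endpoint_url)
    return {
        "access_key": access_key,
        "secret_key": secret_key,
        "endpoint_override": parsed.netloc,  # e.g. "localhost:9000"
        "scheme": parsed.scheme,             # "http" for a local minio without TLS
    }


def read_table(path, endpoint_url, access_key, secret_key):
    """Read all files under `path` ("bucket/prefix", no s3:// prefix) as one table.

    Assumption: the files are Parquet; adjust `format` otherwise.
    """
    # Imported lazily so the helper above stays usable without pyarrow installed.
    import pyarrow.dataset as ds
    from pyarrow import fs

    filesystem = fs.S3FileSystem(
        **s3_filesystem_kwargs(endpoint_url, access_key, secret_key)
    )
    return ds.dataset(path, filesystem=filesystem, format="parquet").to_table()
```

A call such as `read_table("mybucket/some/prefix", "http://localhost:9000", "minioadmin", "minioadmin")` would then return a `pyarrow.Table`, assuming a minio server is running locally and the hypothetical prefix contains Parquet files.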

Sorry for so many details. IMO the important bit is that the DLT dataset does not work reliably in a concurrent environment.

@sh-rp
Collaborator

sh-rp commented Feb 17, 2025

OK. I am not quite sure what the problem is, though. Maybe there is something going on with your minio server, and it only accepts one connection at a time or something like that? I would need some code that is locally reproducible.

@smasyutin
Author

@sh-rp the thing is that the protocol is stripped from endpoint_url ("http://localhost:9000") when the request to minio is made. Please see the error "...HTTP GET to '//localhost:9000/ ...": the "http:" piece is missing.

I will do my best to provide a workable code to reproduce when I have some time.

@sh-rp
Collaborator

sh-rp commented Feb 24, 2025

@smasyutin I have attached a PR which should fix this problem for you. Now http (without ssl) urls are interpreted correctly. You may also need to set the url_style to path, depending on whether you have set up your minio to support vhost path styles or not. Can you let me know if this fixes your problem?
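To illustrate what "interpreted correctly" can mean here: DuckDB's S3 secrets take the endpoint as host and port without a scheme, plus separate USE_SSL and URL_STYLE options, so an http endpoint has to be translated rather than simply trimmed. The sketch below shows one way to do that translation. It is not the code from the PR; the function names `duckdb_s3_options` and `create_secret_sql` and the secret name are hypothetical.

```python
from urllib.parse import urlparse


def duckdb_s3_options(endpoint_url, url_style="path"):
    """Split an S3-compatible endpoint URL into DuckDB-style secret options."""
    parsed = urlparse(endpoint_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"expected an http(s) endpoint URL, got {endpoint_url!r}")
    return {
        "ENDPOINT": parsed.netloc,            # host[:port] only, no scheme
        "USE_SSL": parsed.scheme == "https",  # plain http means SSL must be off
        "URL_STYLE": url_style,               # "path" or "vhost"
    }


def create_secret_sql(name, key_id, secret, endpoint_url, url_style="path"):
    """Build a CREATE SECRET statement for DuckDB.

    Illustrative only: values are interpolated without escaping, so this
    must not be used with untrusted input.
    """
    opts = duckdb_s3_options(endpoint_url, url_style)
    return (
        f"CREATE OR REPLACE SECRET {name} ("
        f"TYPE S3, KEY_ID '{key_id}', SECRET '{secret}', "
        f"ENDPOINT '{opts['ENDPOINT']}', "
        f"USE_SSL {str(opts['USE_SSL']).lower()}, "
        f"URL_STYLE '{opts['URL_STYLE']}')"
    )
```

With `endpoint_url="http://localhost:9000"` this yields an endpoint of `localhost:9000` with SSL disabled, which keeps the information carried by the "http:" prefix instead of losing it.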

@sh-rp sh-rp added bug Something isn't working and removed wontfix This will not be worked on labels Feb 24, 2025
@sh-rp sh-rp moved this from Todo to In Progress in dlt core library Feb 24, 2025
@smasyutin
Author

Hi @sh-rp. Thanks for the update and sorry for the delay.
Based on the PR it seems to be doing exactly what was needed.
Is there a pre-built package that I can install and run in my env for the code change?

@smasyutin
Author

@sh-rp I tested it with local dlt from the branch and it works for me with s3_url_style: "path".
Thank you.
