
Filesystem destination does not respect preferred_loader_file_format for dlt metadata #1631

Open
Nintorac opened this issue Jul 24, 2024 · 5 comments
Labels: bug (Something isn't working), community (This issue came from slack community workspace)

Comments

@Nintorac

dlt version

dlt==0.5.1

Describe the problem

The filesystem destination does not respect the preferred_loader_file_format kwarg for dlt metadata files.

Further discussion here

Expected behavior

When configuring preferred_loader_file_format="parquet" I expect the metadata files to be in parquet format, instead they are jsonl.

Steps to reproduce

  1. Run this code

import dlt
from dlt.destinations import filesystem

parquet_file_system = filesystem(
    preferred_loader_file_format="parquet"
)

pipeline = dlt.pipeline(
    pipeline_name='pipeline',
    destination=parquet_file_system,
    dataset_name='dataset',
)

data = [{'b': 2}]
pipeline_run = pipeline.run(
    data,
    table_name='repro',
)

  2. Observe that the metadata files are written as jsonl, rather than the expected parquet
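One way to see the reported split is to group the file extensions under the dataset directory by top-level table. The sketch below simulates the layout with a temporary directory (the file names are made up; the `_dlt_loads` table name is one of dlt's internal metadata tables), since the exact paths depend on your bucket configuration:

```python
import tempfile
from pathlib import Path

# Simulated bucket layout (hypothetical file names): dlt writes internal
# tables such as _dlt_loads next to the data tables in the dataset folder.
root = Path(tempfile.mkdtemp())
(root / "repro").mkdir()
(root / "repro" / "part-0.parquet").touch()
(root / "_dlt_loads").mkdir()
(root / "_dlt_loads" / "1721800000.0.jsonl").touch()

def formats_by_table(bucket: Path) -> dict[str, set[str]]:
    """Group file extensions by top-level table directory."""
    out: dict[str, set[str]] = {}
    for f in bucket.rglob("*"):
        if f.is_file():
            table = f.relative_to(bucket).parts[0]
            out.setdefault(table, set()).add(f.suffix)
    return out

print(formats_by_table(root))
# e.g. {'repro': {'.parquet'}, '_dlt_loads': {'.jsonl'}}
```

Running the same scan against a real dlt output directory should show the data tables in parquet and the `_dlt_*` tables in jsonl, which is the behavior this issue describes.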

Operating system

Linux

Runtime environment

Local

Python version

3.10

dlt data source

No response

dlt destination

Filesystem & buckets

Other deployment details

No response

Additional information

No response

@rudolfix rudolfix moved this from Todo to Planned in dlt core library Jul 29, 2024
@rudolfix rudolfix added bug Something isn't working community This issue came from slack community workspace labels Jul 29, 2024
@rudolfix rudolfix moved this from Planned to In Progress in dlt core library Jul 29, 2024
@sh-rp
Collaborator

sh-rp commented Jul 29, 2024

Hey @Nintorac, this is an implementation decision, not a bug. I agree, though, that we should probably add a note about it in the docs. Is the fact that the metadata tables are stored as jsonl posing a problem for you at this time?

@Nintorac
Author

Mainly my aversion to jsonl for now aha, but here are some issues I foresee:

  • it will make my reads slow in the long run
  • can't do efficient column-level selects
  • higher storage requirements due to the lack of efficient compression
  • I have to treat the metadata tables differently in my code as well

I'd be interested to know why the metadata table writes don't use the same pathway as the data table writes, though. From my limited perspective it seems like this functionality should be implemented at the abstract destination level.
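To make the column-select point above concrete: with jsonl, every record must be fully parsed even when only one field is needed, whereas a columnar format like parquet can read just that column's data. A minimal stdlib sketch, using made-up records shaped loosely like a `_dlt_loads` table:

```python
import io
import json

def read_jsonl_column(fh, column):
    # jsonl forces parsing each complete record to extract a single field;
    # a columnar format like parquet could read only that column instead.
    return [json.loads(line)[column] for line in fh if line.strip()]

# Hypothetical metadata records, for illustration only.
sample = io.StringIO(
    '{"load_id": "1721800000.1", "status": 0}\n'
    '{"load_id": "1721800050.2", "status": 0}\n'
)
print(read_jsonl_column(sample, "load_id"))
# ['1721800000.1', '1721800050.2']
```

The whole-record parse cost grows with record width, which is why wide metadata tables stored as jsonl get slower to query over time.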

@sh-rp
Collaborator

sh-rp commented Jul 30, 2024

@Nintorac ok I understand. So you are actually reading the metadata files in your code? I was more or less working under the assumption that they are for internal dlt use only. But it is a fair point.

@Nintorac
Author

I was intending to use it for change data capture for scd2 type tables (since this isn't supported natively)

But I wasn't aware they were meant for internal use only.

@sh-rp
Collaborator

sh-rp commented Jul 31, 2024

I'd say they are not strictly meant for internal use; I just didn't expect anyone to want to query them in the way you describe. By the way, scd2 tables are currently not supported for the filesystem destination (although with delta tables it should actually work). Could you explain in a bit more detail what you want to do? I'd like to understand the use case and maybe offer some help, or take some inspiration for further work on the filesystem.
