
Filesystem destination does not respect preferred_loader_file_format for dlt metadata #1631

Open
Nintorac opened this issue Jul 24, 2024 · 5 comments
Labels: bug (Something isn't working), community (This issue came from slack community workspace)

Comments

@Nintorac

dlt version

dlt==0.5.1

Describe the problem

The filesystem destination does not respect the preferred_loader_file_format kwarg for dlt metadata files.

Further discussion here

Expected behavior

When configuring preferred_loader_file_format="parquet" I expect the metadata files to be in parquet format, instead they are jsonl.

Steps to reproduce

  1. Run this code

import dlt
from dlt.destinations import filesystem

parquet_file_system = filesystem(
    preferred_loader_file_format="parquet"
)

pipeline = dlt.pipeline(
    pipeline_name='pipeline',
    destination=parquet_file_system,
    dataset_name='dataset',
)

data = [{'b': 2}]
pipeline_run = pipeline.run(
    data,
    table_name='repro',
)

  2. Observe that the metadata files are written as jsonl, rather than the expected parquet
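One way to see the reported split is to group the file extensions under the dataset directory by top-level table. The sketch below simulates the layout with a temporary directory (the file names are made up; the `_dlt_loads` table name is one of dlt's internal metadata tables), since the exact paths depend on your bucket configuration:

```python
import tempfile
from pathlib import Path

# Simulated bucket layout (hypothetical file names): dlt writes internal
# tables such as _dlt_loads next to the data tables in the dataset folder.
root = Path(tempfile.mkdtemp())
(root / "repro").mkdir()
(root / "repro" / "part-0.parquet").touch()
(root / "_dlt_loads").mkdir()
(root / "_dlt_loads" / "1721800000.0.jsonl").touch()

def formats_by_table(bucket: Path) -> dict[str, set[str]]:
    """Group file extensions by top-level table directory."""
    out: dict[str, set[str]] = {}
    for f in bucket.rglob("*"):
        if f.is_file():
            table = f.relative_to(bucket).parts[0]
            out.setdefault(table, set()).add(f.suffix)
    return out

print(formats_by_table(root))
# e.g. {'repro': {'.parquet'}, '_dlt_loads': {'.jsonl'}}
```

Running the same scan against a real dlt output directory should show the data tables in parquet and the `_dlt_*` tables in jsonl, which is the behavior this issue describes.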

Operating system

Linux

Runtime environment

Local

Python version

3.10

dlt data source

No response

dlt destination

Filesystem & buckets

Other deployment details

No response

Additional information

No response

@rudolfix rudolfix moved this from Todo to Planned in dlt core library Jul 29, 2024
@rudolfix rudolfix added bug Something isn't working community This issue came from slack community workspace labels Jul 29, 2024
@rudolfix rudolfix moved this from Planned to In Progress in dlt core library Jul 29, 2024
@sh-rp
Collaborator

sh-rp commented Jul 29, 2024

Hey @Nintorac, this is an implementation decision, not a bug. I agree, though, that we should probably add a note about it in the docs. Is the fact that the metadata tables are stored as jsonl posing a problem for you at this time?

@Nintorac
Author

Mainly my aversion to jsonl for now aha, but here are some issues I foresee:

  • it will make my reads slow in the long run
  • can't do efficient column-level selects
  • higher storage requirements due to the lack of efficient compression
  • I have to treat the metadata tables differently in my code as well

I'd be interested to know why the metadata table writes don't use the same pathway as the data table writes, though. From my limited perspective it seems like this functionality should be implemented at the abstract destination level.
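To make the column-select point above concrete: with jsonl, every record must be fully parsed even when only one field is needed, whereas a columnar format like parquet can read just that column's data. A minimal stdlib sketch, using made-up records shaped loosely like a `_dlt_loads` table:

```python
import io
import json

def read_jsonl_column(fh, column):
    # jsonl forces parsing each complete record to extract a single field;
    # a columnar format like parquet could read only that column instead.
    return [json.loads(line)[column] for line in fh if line.strip()]

# Hypothetical metadata records, for illustration only.
sample = io.StringIO(
    '{"load_id": "1721800000.1", "status": 0}\n'
    '{"load_id": "1721800050.2", "status": 0}\n'
)
print(read_jsonl_column(sample, "load_id"))
# ['1721800000.1', '1721800050.2']
```

The whole-record parse cost grows with record width, which is why wide metadata tables stored as jsonl get slower to query over time.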

@sh-rp
Collaborator

sh-rp commented Jul 30, 2024

@Nintorac ok I understand. So you are actually reading the metadata files in your code? I was more or less working under the assumption that they are for internal dlt use only. But it is a fair point.

@Nintorac
Author

I was intending to use it for change data capture for scd2 type tables (since this isn't supported natively)

But I wasn't aware they were meant for internal use only.

@sh-rp
Collaborator

sh-rp commented Jul 31, 2024

I'd say they are not strictly meant for internal use; I just didn't expect anyone to want to query them in the way you describe. By the way, scd2 tables are currently not supported for the filesystem destination (although with delta tables it should actually work). Could you explain in a bit more detail what you want to do? I'd like to understand the use case and maybe offer some help, or take some inspiration for further work on the filesystem.
