
Save compressed load files with .gz extension #925

Open
steinitzu opened this issue Feb 1, 2024 · 3 comments

Comments

@steinitzu
Collaborator

Problem description

Load files are always saved with the extension of the file format, regardless of whether compression is enabled.
I.e. s3://path/to/load/file.jsonl may or may not be compressed.
This causes issues with, e.g., the databricks loader, which can't parse gzip files without a .gz extension, so compression must be disabled. This possibly affects snowflake and redshift too; I think we assume json files are always compressed.
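For context: with the current naming, the only reliable way to tell whether a load file is compressed is to sniff its magic bytes. A minimal sketch, assuming a locally downloaded copy of the file:

    import gzip

    def is_gzipped(path: str) -> bool:
        # gzip streams always start with the magic bytes 0x1f 0x8b
        with open(path, "rb") as f:
            return f.read(2) == b"\x1f\x8b"

    # "file.jsonl" may or may not actually be gzip-compressed
    path = "file.jsonl"  # hypothetical local copy of the load file
    opener = gzip.open if is_gzipped(path) else open
    with opener(path, "rt") as f:
        first_line = f.readline()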

Solution

The data writer should add a .gz extension when compression is enabled.
There are a few places where we parse filenames to detect the format, usually relying on filename.endswith('.<file_format>') or os.path.splitext. These need to be refactored; we need a clean way to get format/compression from a filename, as sketched below.
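A sketch of what such a helper could look like; parse_load_filename is an illustrative name, not an existing dlt function:

    import os
    from typing import Tuple

    def parse_load_filename(filename: str) -> Tuple[str, bool]:
        """Return (file_format, is_compressed) for a load file name."""
        # "table.jsonl.gz" -> ("jsonl", True); "table.parquet" -> ("parquet", False)
        base, ext = os.path.splitext(filename)
        is_compressed = ext == ".gz"
        if is_compressed:
            base, ext = os.path.splitext(base)
        return ext.lstrip("."), is_compressed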

@elviskahoro

elviskahoro commented Nov 29, 2024

+1 to this!

You can disable compression by setting the env variable:

    os.environ["DATA_WRITER__DISABLE_COMPRESSION"] = str(True)

@rudolfix
Collaborator

rudolfix commented Dec 1, 2024

@elviskahoro there are many ways to disable compression in code. All of them change the configuration though, i.e.:

    os.environ["DATA_WRITER__DISABLE_COMPRESSION"] = "True"
    dlt.config["data_writer.disable_compression"] = True

@kiernan

kiernan commented Feb 10, 2025

I've also noticed this when attempting to write jsonl to cloud storage and then read from it in another pipeline.

Despite using the built-in filesystem destination module to write it, the filesystem source module and read_jsonl() transformer will fail to parse it.

It does work when adding another transformer like this in between:

    from typing import Iterator, Optional

    import dlt
    from dlt.sources.filesystem import FileItemDict  # import path may vary by dlt version

    @dlt.transformer()
    def set_file_type(
        items: Iterator[FileItemDict],
        encoding: Optional[str] = None,
        mime_type: Optional[str] = None,
    ) -> Iterator[FileItemDict]:
        # override the detected encoding/mime type on each file item
        for file_obj in items:
            if encoding is not None:
                file_obj["encoding"] = encoding
            if mime_type is not None:
                file_obj["mime_type"] = mime_type
            # yield each item instead of returning the (possibly consumed) iterator
            yield file_obj
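For reference, a sketch of how this workaround could be wired between the filesystem source and read_jsonl; the bucket URL and the "gzip" encoding value are assumptions on my side:

    import dlt
    from dlt.sources.filesystem import filesystem, read_jsonl

    # hypothetical bucket holding the load files written by the first pipeline
    files = filesystem(bucket_url="s3://my-bucket/loads", file_glob="**/*.jsonl")
    # mark files as gzip-encoded so the reader decompresses them despite the .jsonl extension
    reader = files | set_file_type(encoding="gzip") | read_jsonl()

    pipeline = dlt.pipeline(pipeline_name="reload_jsonl", destination="duckdb")
    pipeline.run(reader)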
