
Save compressed load files with .gz extension #925

Open
steinitzu opened this issue Feb 1, 2024 · 3 comments

Comments

@steinitzu
Collaborator

Problem description

Load files are always saved with the extension of the file format, regardless of whether compression is enabled.
I.e. s3://path/to/load/file.jsonl may or may not be compressed.
This causes issues with, e.g., the databricks loader, which can't parse gzip files without a .gz extension, so compression must be disabled. This possibly affects snowflake and redshift too; I think we assume json files are always compressed.
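For context: with the current naming, the only reliable way to tell whether a load file is compressed is to sniff its magic bytes. A minimal sketch, assuming a locally downloaded copy of the file:

    import gzip

    def is_gzipped(path: str) -> bool:
        # gzip streams always start with the magic bytes 0x1f 0x8b
        with open(path, "rb") as f:
            return f.read(2) == b"\x1f\x8b"

    # "file.jsonl" may or may not actually be gzip-compressed
    path = "file.jsonl"  # hypothetical local copy of the load file
    opener = gzip.open if is_gzipped(path) else open
    with opener(path, "rt") as f:
        first_line = f.readline()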

Solution

The data writer should add a .gz extension when compression is enabled.
There are a few places where we parse filenames to detect the format, usually relying on filename.endswith('.<file_format>') or os.path.splitext. These need to be refactored; we need a clean way to get format/compression from a filename, as sketched below.
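A sketch of what such a helper could look like; parse_load_filename is an illustrative name, not an existing dlt function:

    import os
    from typing import Tuple

    def parse_load_filename(filename: str) -> Tuple[str, bool]:
        """Return (file_format, is_compressed) for a load file name."""
        # "table.jsonl.gz" -> ("jsonl", True); "table.parquet" -> ("parquet", False)
        base, ext = os.path.splitext(filename)
        is_compressed = ext == ".gz"
        if is_compressed:
            base, ext = os.path.splitext(base)
        return ext.lstrip("."), is_compressed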

@elviskahoro

elviskahoro commented Nov 29, 2024

+1 to this!

You can disable compression by setting the env variable:

    os.environ["DATA_WRITER__DISABLE_COMPRESSION"] = str(True)

@rudolfix
Collaborator

rudolfix commented Dec 1, 2024

@elviskahoro there are many ways to disable compression in code. All of them change the configuration though, i.e.:

    os.environ["DATA_WRITER__DISABLE_COMPRESSION"] = "True"
    dlt.config["data_writer.disable_compression"] = True

@kiernan

kiernan commented Feb 10, 2025

I've also noticed this when attempting to write jsonl to cloud storage and then read from it in another pipeline.

Despite using the built-in filesystem destination module to write it, the filesystem source module and read_jsonl() transformer will fail to parse it.

It does work when adding another transformer like this in between:

    from typing import Iterator, Optional

    import dlt
    from dlt.sources.filesystem import FileItemDict  # import path may vary by dlt version

    @dlt.transformer()
    def set_file_type(
        items: Iterator[FileItemDict],
        encoding: Optional[str] = None,
        mime_type: Optional[str] = None,
    ) -> Iterator[FileItemDict]:
        # override the detected encoding/mime type on each file item
        for file_obj in items:
            if encoding is not None:
                file_obj["encoding"] = encoding
            if mime_type is not None:
                file_obj["mime_type"] = mime_type
            # yield each item instead of returning the (possibly consumed) iterator
            yield file_obj
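For reference, a sketch of how this workaround could be wired between the filesystem source and read_jsonl; the bucket URL and the "gzip" encoding value are assumptions on my side:

    import dlt
    from dlt.sources.filesystem import filesystem, read_jsonl

    # hypothetical bucket holding the load files written by the first pipeline
    files = filesystem(bucket_url="s3://my-bucket/loads", file_glob="**/*.jsonl")
    # mark files as gzip-encoded so the reader decompresses them despite the .jsonl extension
    reader = files | set_file_type(encoding="gzip") | read_jsonl()

    pipeline = dlt.pipeline(pipeline_name="reload_jsonl", destination="duckdb")
    pipeline.run(reader)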
