Save compressed load files with .gz extension #925
+1 this! You can disable compression by setting the env variable:
@elviskahoro there are many ways to disable compression in code, but all of them change the configuration, i.e.:
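The exact snippet was lost in the page export. As a rough sketch, dlt lets you disable load-file compression through its normalize configuration; treat the section and option names below as an assumption to verify against the dlt docs:

```toml
# config.toml (or the equivalent env var
# NORMALIZE__DATA_WRITER__DISABLE_COMPRESSION=true) --
# option name assumed, check the dlt docs
[normalize.data_writer]
disable_compression = true
```

With compression disabled, the written `.jsonl` files are plain text and downstream readers parse them without needing a `.gz` hint.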
I've also noticed this when attempting to write jsonl to cloud storage and then read it in another pipeline. Despite using the built-in filesystem destination module to write it, the filesystem source module fails to read it back. It does work when adding another transformer like this in between:
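The actual transformer code was stripped from the page. As a sketch of the general idea (a plain helper, not the dlt transformer API): since the file may be gzip-compressed without carrying a `.gz` extension, sniff the gzip magic bytes instead of trusting the filename, and decompress before parsing:

```python
import gzip
import json

# First two bytes of any gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def read_jsonl_auto(raw: bytes) -> list:
    """Parse jsonl content, transparently decompressing gzip.

    Hypothetical helper, not part of dlt: the load file may lack a
    .gz extension, so we detect compression from the content itself.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return [json.loads(line) for line in raw.splitlines() if line.strip()]
```

Sniffing the content sidesteps the extension problem entirely, which is why an in-between transformer like this works where the extension-based reader fails.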
Problem description

Load files are always saved with the extension of the file format, regardless of whether compression is enabled. I.e. `s3://path/to/load/file.jsonl` may or may not be compressed. This causes issues with e.g. the databricks loader, which can't parse gzip files without the `.gz` extension, so compression must be disabled. Possibly affects snowflake and redshift too; I think we assume json files are always compressed.

Solution

The data writer should add the `.gz` extension when compression is enabled. There are a few places where we parse filenames to detect the format, usually relying on `filename.endswith('.<file_format>')` or `os.path.splitext`. These need to be refactored; we need a clean way to get format/compression from a filename.
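A clean way to derive both pieces of information from a filename could look like the sketch below; the function name and return shape are my own, not the existing codebase's:

```python
import os


def parse_load_filename(filename: str) -> tuple[str, bool]:
    """Return (file_format, is_compressed) for a load file name.

    Strips a trailing .gz before reading the format extension, so
    'file.jsonl.gz' -> ('jsonl', True) and
    'file.jsonl'    -> ('jsonl', False).
    Hypothetical helper, not the current dlt implementation.
    """
    base, ext = os.path.splitext(filename)
    is_compressed = ext == ".gz"
    if is_compressed:
        # Drop the .gz and look at the next extension for the format.
        base, ext = os.path.splitext(base)
    return ext.lstrip("."), is_compressed
```

Centralizing the parsing in one helper like this would let the scattered `endswith`/`splitext` call sites be replaced with a single source of truth for both format and compression.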