Feature request: support for Yaml column renaming #585

adrianbr · 2023-08-24T18:20:36Z

A user has already expressed interest in this feature, for renaming columns that come from APIs as hashes.

This would give users an easy way to rename fields that are already listed for them, instead of having to figure out the names up front in Python.

This overlaps with the previously discussed name mapping via the schema, for long column names on databases that only support short names, such as Postgres.

dapomeranz · 2023-08-24T18:25:28Z

As a novice dlt user, this is a feature I was expecting and how I thought it might be implemented.
I would strongly consider adding this for table names as well. Many services (Airtable, Google Sheets, etc.) identify constructs by unique identifiers, while the actual name is a separate, renamable property.

I think there are other ideas for implementation. Any would be good.

- Pass in a Python dictionary
- Separate YAML file specifically for hard-coded name adjustments

rudolfix · 2023-08-27T18:28:20Z

@dapomeranz @adrianbr
we'll go for Python for now. renaming directly in yaml requires full lineage data to be kept - in essence dlt would need to generate a unique id for every named entity in the schema (tables and columns) based on the names in the source (i.e. API endpoints and field names in json). if we are able to map every single data item to an entity in the schema via a separate automatic id, the names are free to be changed. until now the names themselves function as ids.

we have plenty of tickets requesting the behavior above. this will be quite a big implementation step
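
To make the lineage idea above concrete, here is a minimal sketch in plain Python. It is not dlt code: `entity_id`, `NameRegistry` and the dotted source path format are all hypothetical names invented for illustration. It only shows how deriving a stable id from the source path would let the display name change without losing track of the entity.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


def entity_id(source_path: str) -> str:
    # Hypothetical: derive a stable id from the path in the source,
    # e.g. "issues.user.login" for a nested JSON field.
    return hashlib.sha1(source_path.encode("utf-8")).hexdigest()[:12]


@dataclass
class NameRegistry:
    # Maps stable entity id -> current display name in the schema.
    names: Dict[str, str] = field(default_factory=dict)

    def register(self, source_path: str) -> str:
        eid = entity_id(source_path)
        # The default name is the last path segment; an existing
        # (possibly renamed) entry is left untouched.
        self.names.setdefault(eid, source_path.rsplit(".", 1)[-1])
        return eid

    def rename(self, eid: str, new_name: str) -> None:
        if eid not in self.names:
            raise KeyError(f"unknown entity id: {eid}")
        self.names[eid] = new_name

    def name_for(self, source_path: str) -> str:
        return self.names[entity_id(source_path)]


registry = NameRegistry()
login_id = registry.register("issues.user.login")
registry.rename(login_id, "author")
# Data arriving under the same source path now resolves to the new name.
print(registry.name_for("issues.user.login"))
```

Because the id depends only on the source path, re-registering the same path after a rename keeps the renamed value, which is the property a YAML-level rename would rely on.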

what I plan for now

1. renaming of the table will be released on Monday (resource.table_name = "xxx")
2. renaming of columns: what I'm missing is some really neat interface to do that, i.e. resource.rename_columns(list of mappings)

do you think it makes sense to do (2) in Python? any ideas for a better interface?
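
One possible shape for the column renaming in (2), sketched in plain Python without importing dlt. `make_column_renamer` and `apply_renames` are hypothetical helpers, not an existing dlt API, and the sketch assumes each data item is a flat dict and that the mapping goes from source name to desired name.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Item = Dict[str, Any]


def make_column_renamer(mapping: Dict[str, str]) -> Callable[[Item], Item]:
    # Build a per-item transform that renames top-level keys.
    # Keys absent from the mapping are passed through unchanged.
    def rename(item: Item) -> Item:
        renamed: Item = {}
        for key, value in item.items():
            new_key = mapping.get(key, key)
            if new_key in renamed:
                # Two source columns would end up with the same name.
                raise ValueError(f"column name collision on {new_key!r}")
            renamed[new_key] = value
        return renamed

    return rename


def apply_renames(items: Iterable[Item], mapping: Dict[str, str]) -> Iterator[Item]:
    # Lazily apply the renamer to every item yielded by a resource.
    rename = make_column_renamer(mapping)
    for item in items:
        yield rename(item)


rows = [{"fld8xK2": "Alice", "fldQ91z": 31, "id": 1}]
print(list(apply_renames(rows, {"fld8xK2": "name", "fldQ91z": "age"})))
```

A transform like this runs before the schema is inferred, so the renamed keys are the only names the schema ever sees. That sidesteps the lineage problem but means the mapping has to be known up front.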

dapomeranz · 2023-08-28T19:21:42Z

If it is eventually going to be possible in the schema, then maybe it makes sense to hold off on this effort and wait for that. I can't speak very well to priorities. I do think this is an important feature, but if it were easier to chain a dbt transformation immediately after running a pipeline, this feature might not be as necessary.
If we do want to implement it now, I imagine a parameter for the pipeline could just be a dictionary of table names. Then, when processing the pipeline:
if a table name exists in the dictionary, use the value from the dictionary instead of the table name.
It is a simple implementation but should be effective for the problem at hand.
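
The lookup described above could be as small as the following sketch. `resolve_table_name` and `table_renames` are hypothetical names, not part of dlt, and the Airtable-style id is made up for illustration.

```python
from typing import Dict


def resolve_table_name(source_name: str, renames: Dict[str, str]) -> str:
    # Use the mapped name if one exists, otherwise keep the source name.
    return renames.get(source_name, source_name)


# Hypothetical mapping from a service's opaque table id to a readable name.
table_renames = {"tblA1b2C3": "customers"}

for source_name in ["tblA1b2C3", "orders"]:
    print(source_name, "->", resolve_table_name(source_name, table_renames))
```

Falling back to the source name keeps the dictionary optional: tables without an entry behave exactly as they do today.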

github-project-automation bot added this to dlt core library Aug 24, 2023

github-project-automation bot moved this to Todo in dlt core library Aug 24, 2023

rudolfix moved this from Todo to Planned in dlt core library Aug 26, 2023

rudolfix moved this from Planned to In Progress in dlt core library Aug 27, 2023

rudolfix moved this from In Progress to Planned in dlt core library Aug 28, 2023

rudolfix moved this from Planned to Todo in dlt core library Dec 4, 2023
