Feature request: support for Yaml column renaming #585

Open
adrianbr opened this issue Aug 24, 2023 · 3 comments

Comments

@adrianbr
Contributor

A user has already expressed interest in this feature, for renaming columns that arrive from APIs as hashes.

This would be an easy way for users to rename fields that dlt has already listed for them, instead of having to work out the names upfront in Python.

This overlaps with the previously discussed name mapping via the schema, which maps long column names for databases that only support short names, such as Postgres.

@dapomeranz

image
As a novice dlt user, this is a feature I was expecting, and this is how I thought it might be implemented.
I would strongly consider adding this for table names as well. Many services (Airtable, Google Sheets, etc.) use unique identifiers for constructs whose actual, human-readable names can be changed.

There are other possible implementations; any of them would be good:

  • Pass in a Python dictionary
  • Use a separate YAML file specifically for hard-coded name adjustments

@rudolfix rudolfix moved this from Todo to Planned in dlt core library Aug 26, 2023
@rudolfix rudolfix moved this from Planned to In Progress in dlt core library Aug 27, 2023
@rudolfix
Collaborator

@dapomeranz @adrianbr
We'll go with Python for now. Renaming directly in YAML requires full lineage data to be kept: in essence, dlt would need to generate a unique id for every named entity in the schema (tables and columns) based on the names in the source (i.e. API endpoints and field names in JSON). If we are able to map every data item to an entity in the schema via a separate, automatic id, the names are free to change. Until now, the names themselves function as ids.

We have plenty of tickets requesting the behavior above. It will be quite a big implementation step.
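
To make the lineage idea concrete, here is a minimal sketch in plain Python. It is not how dlt works today and not a proposed design: `entity_id` and `NameRegistry` are hypothetical names invented for illustration. The point is only that once an id is derived from the source-side path, the display name can change without losing track of the entity.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict


def entity_id(*source_path: str) -> str:
    """Derive a stable id from the source-side path (e.g. endpoint, JSON field)."""
    # json.dumps keeps the segments unambiguous even if they contain separators.
    encoded = json.dumps(list(source_path)).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


@dataclass
class NameRegistry:
    """Hypothetical store mapping entity id -> current name in the schema."""

    names: Dict[str, str] = field(default_factory=dict)

    def register(self, *source_path: str) -> str:
        eid = entity_id(*source_path)
        # The schema name defaults to the last segment of the source path.
        self.names.setdefault(eid, source_path[-1])
        return eid

    def rename(self, eid: str, new_name: str) -> None:
        if eid not in self.names:
            raise KeyError(eid)
        self.names[eid] = new_name

    def name_for(self, *source_path: str) -> str:
        return self.names[entity_id(*source_path)]


registry = NameRegistry()
column = registry.register("issues", "fld9XyZ")  # made-up endpoint and field
registry.rename(column, "customer_name")
current_name = registry.name_for("issues", "fld9XyZ")
```

Incoming data still arrives under `fld9XyZ`, but the lookup goes through the id, so the renamed column keeps receiving it.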

What I plan for now:

  1. Renaming of tables will be released on Monday (resource.table_name = "xxx").
  2. Renaming of columns: what I'm missing is a really neat interface for it, e.g.
    resource.rename_columns(list of mappings)

Do you think it makes sense to do (2) in Python? Any ideas for a better interface?
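
One possible shape for the column mapping in (2), sketched in plain Python so it can be tried without dlt. `make_column_renamer` is a hypothetical helper, not an existing dlt function, and the field ids are made up. It treats each data item as a dict and renames its keys from a mapping.

```python
from typing import Any, Callable, Dict, Mapping

Item = Dict[str, Any]


def make_column_renamer(mapping: Mapping[str, str]) -> Callable[[Item], Item]:
    """Build a per-item function that renames keys according to `mapping`."""

    def rename(item: Item) -> Item:
        # Keys missing from the mapping are passed through unchanged.
        renamed = {mapping.get(key, key): value for key, value in item.items()}
        # Fail loudly if two source keys end up with the same target name.
        if len(renamed) != len(item):
            raise ValueError(f"column name collision after renaming: {sorted(item)}")
        return renamed

    return rename


rename = make_column_renamer({"fld9XyZ": "customer_name", "fld2AbC": "signup_date"})
row = rename({"fld9XyZ": "Ada", "fld2AbC": "2023-08-24", "id": 1})
```

If the dlt version in use exposes a per-item transform hook on resources (for example `add_map`), a function like `rename` could presumably be attached there; that hookup is an assumption and is not exercised in this sketch.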

@rudolfix rudolfix moved this from In Progress to Planned in dlt core library Aug 28, 2023
@dapomeranz

  1. If it is eventually going to be possible in the schema, then maybe it makes sense to delay this effort and wait for that. I can't speak very well to priorities. I do think this is an important feature, but if it were easier to chain a dbt transformation immediately after running a pipeline, this feature might not be as necessary.
  2. If we do want to implement it now, I imagine a pipeline parameter could simply be a dictionary of table names. Then, when processing the pipeline:
    if the table name exists in the dictionary, use the value from that dictionary instead of the table name
    It is a simple implementation, but it should be effective for the problem at hand.
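
The lookup described in (2) could be as small as the following sketch. `resolve_table_name` and the `TABLE_NAMES` dictionary are hypothetical, and the Airtable-style id is made up; how such a dictionary would actually be passed to a pipeline is left open.

```python
from typing import Mapping


def resolve_table_name(source_name: str, overrides: Mapping[str, str]) -> str:
    """Return the override if one exists, otherwise the original table name."""
    return overrides.get(source_name, source_name)


# Hypothetical mapping from source-side identifiers to readable table names.
TABLE_NAMES = {"tblA1b2C3": "customers"}

renamed = resolve_table_name("tblA1b2C3", TABLE_NAMES)
unchanged = resolve_table_name("orders", TABLE_NAMES)
```

Tables missing from the dictionary keep their original names, so the mapping only needs to list the exceptions.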

@rudolfix rudolfix moved this from Planned to Todo in dlt core library Dec 4, 2023