-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spike: proposal for better default node names #3575
Comments
Don't node names need to be unique? Even if not ideal, the current approach guarantees uniqueness, while the latter very frequently won't. |
Correct: the current approach guarantees uniqueness under the current assumptions of "no two nodes can output the same dataset". More strawman proposals:
More ideas? |
It may be obvious, but can I ask why node name need to be unique? I feel like they exists for different purposes, I would like to map them out and see if there are opportunities to combine them. There would be at least 2 different purpose, one is for internal working, the other for human-readable. Currently there are different names.
Noted some of this already exist in Node API, though it may not be what you expected.
Re: @deepyaman point, I think is fair for any medium size pipeline using p.s. I am asking this just because I am lazy, I will do the digging at some point, but if someone already know that could save some time. |
Agreed with the sentiment. I imagine a "node ID" should be unique, but not sure why we want names to be unique. |
This is my understanding:
This definition is a bit leaky, because I will think combination of (func, inputs + outputs) is already an unique key. I test this and Kedro thinks they are valid:
I tend to say it's NO, they are the same. Back to the original idea, If we offload this responsibility to What if user provide a name but there are multiple nodes return with the same name? This affect things such as 2 options:
Con:
Pro:
Note that having duplicate names doesn't mean that you need to change the name immediately, you only have to do so if you have to specify it as an argument (smaller chance) |
Maybe this is a good place to start?
Now that we switched to graphlib #3728, would it be interesting to try to do some "tree-shaking" and see if we can remove (after deprecation) some of these methods? |
The comment was mostly based on the investigation that I have done 2 months ago, they are documented a bit more in details here: https://noklam.github.io/blog/posts/default_node_name/2024-02-08-default-node-name.html
We can definitely try this, but this wouldn't address the problem described in this issue, it may helps with removing the API surface. The main proposal of #3575 (comment) is that can we loosen up to make |
Related: one user told me today that he found node names "weird" and "useless", given that they were inferred from function names already. |
When discussing kedro-org/kedro-viz#2036 today, we realised that a few more surprising things about names:
This is on top of what we already knew:
I think we need to plan what to do about this. |
Also adding a similar use where it doesnt work if name is not specified: From slack chat: I noticed something that when i define the following node:
and then run this test:
I get an error saying name doesnt exists but when i specify name parameter in node itself it works. |
Another way this manifests is when using
|
Description
When a node has no name, the default is to use a string representation for it. For example:
This poses a series of problems:
kedro airflow create
produces very long task ids when using unnamed nodes kedro-plugins#397 "and node.name defaults to the signature"func.__name__
for a more convenient node name https://linen-slack.kedro.org/t/16107421/a-tip-i-wanted-to-share-you-can-name-a-node-using-the-dunder#0e36ce67-23d8-4638-9e79-113d66b2229dContext
I contend that
'split_data([model_input_table; params: model_options]) -> [X_train;X _test;y_train;y_test]'
is a bad name because:__repr__
rather than a "name"Plus all the problems discussed above.
Task
Come up with a proposal for a better default node name, taking into account:
Possible Implementation
Use
func.__name__
as default node names.The text was updated successfully, but these errors were encountered: