Add support for multi-node for `%load_node` or `%load_pipeline` #4170

noklam · 2024-09-16T11:13:30Z

Description

`%load_node` assumes the error comes from the node it loads, but this is not always the case.

This happens more often in data engineering pipelines, where you apply a series of transformations and aggregations to a set of tables and pass the result to the next node. For example, you may get a "Column not found" error. Loading the node that threw the error is only the first step in inspecting the data; there are still a couple of manual steps needed to figure out where the error originates.

The process works roughly like a binary search over the upstream nodes.
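To make the binary-search idea concrete, here is a minimal sketch. It does not use Kedro's API: `first_bad_node` and `output_is_ok` are hypothetical names, the upstream nodes are plain strings already listed in execution order, and it assumes that once a node's output is wrong, every later node's output is wrong too.

```python
from typing import Callable, Optional, Sequence


def first_bad_node(
    nodes: Sequence[str], output_is_ok: Callable[[str], bool]
) -> Optional[str]:
    """Return the earliest node whose output fails the check, or None.

    `nodes` must be in execution order. `output_is_ok` is a user-supplied
    check (hypothetical), e.g. "does the expected column still exist?".
    """
    lo, hi = 0, len(nodes)
    while lo < hi:
        mid = (lo + hi) // 2
        if output_is_ok(nodes[mid]):
            lo = mid + 1  # the problem is introduced later
        else:
            hi = mid  # this node or an earlier one introduced it
    return nodes[lo] if lo < len(nodes) else None


# Illustrative data only: the column goes missing at "join".
upstream = ["load", "clean", "join", "aggregate", "report"]
culprit = first_bad_node(upstream, lambda name: upstream.index(name) < 2)
```

In practice each check means loading a node's output and inspecting it by hand, which is why halving the number of nodes to inspect matters.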

Context

This augments Kedro's existing debugging features and makes debugging much easier for data scientists and data engineers.

The runner is a powerful abstraction, but it is not beginner-friendly; bringing the execution explicitly into a notebook cell is helpful.

It's not a trivial task to translate a Kedro pipeline into the correct execution order in an imperative style (i.e. cells run sequentially in a notebook). During debugging, the abstraction is mostly a distraction.
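As an illustration of what "figuring out the execution order" involves, the sketch below derives the order of a target node and its upstream nodes from their inputs and outputs. It is a standalone example, not Kedro's implementation: `upstream_execution_order` is a hypothetical helper, and nodes are described as a plain dict mapping a node name to its `(inputs, outputs)` dataset names.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def upstream_execution_order(nodes, target):
    """List `target` and all of its upstream nodes in a runnable order.

    `nodes` maps node name -> (input dataset names, output dataset names).
    """
    # Which node produces each dataset?
    producer = {
        out: name for name, (_, outputs) in nodes.items() for out in outputs
    }
    # A node depends on the producers of its inputs (raw inputs have none).
    deps = {
        name: {producer[i] for i in inputs if i in producer}
        for name, (inputs, _) in nodes.items()
    }
    # Walk backwards from the target to collect only what it needs.
    needed, stack = set(), [target]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(deps[name])
    # Predecessors are emitted before the nodes that depend on them.
    return list(TopologicalSorter({n: deps[n] for n in needed}).static_order())


# Illustrative pipeline description only.
example_nodes = {
    "aggregate": (["joined"], ["summary"]),
    "clean": (["raw"], ["cleaned"]),
    "join": (["cleaned"], ["joined"]),
    "unrelated": (["other"], ["other_out"]),
}
order = upstream_execution_order(example_nodes, "aggregate")
```

Each name in the resulting list would correspond to one notebook cell, run top to bottom.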

Possible Implementation

Limitation: Creating multiple cells is not easy. I tried it when first implementing `%load_node` but settled on the current solution because IPython has limitations. We may be able to do this in Jupyter Notebook (not VS Code notebooks) because it has better support.

Possible Alternatives


dundermain · 2024-11-15T18:55:19Z

Hey @merelcht and @noklam, I would like to work on this issue. Let me know if that is okay with you.

noklam · 2024-11-16T00:50:19Z

@dundermain awesome, we haven't started on this ticket, but if this is something you would like to take a stab at, go for it.

dundermain · 2024-11-16T11:48:47Z

Thanks a lot @noklam

dundermain · 2024-11-18T15:08:49Z

@noklam I am thinking of some approaches to solve this issue. Let me know if I am making any mistakes.

Approach: Since pipelines are made of nodes, I am planning to write a magic function "load_pipeline" that will take the pipeline name as input and return the contents of all the nodes in that pipeline. The only problem that I can see, and which you have already mentioned, is creating multiple cells. Since the pipeline can have multiple nodes, it might complicate it further.

Another approach that I can think of is instead of giving the pipeline name, the user can pass the node name to the load_pipeline function just like they do for the load_node function. Then, a function like _find_pipeline will search for the pipeline to which the given node belongs and then execute the load_pipeline function.

Let me know your thoughts on this. I am still experimenting with handling that multiple-cell part and will give an update soon.
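The node-to-pipeline lookup described above could be sketched as follows. This is only an assumption about how such a helper might look: `find_pipelines` is a hypothetical name, and the registered pipelines are represented as a plain mapping from pipeline name to node names rather than real Kedro objects.

```python
from typing import List, Mapping, Sequence


def find_pipelines(
    node_name: str, pipelines: Mapping[str, Sequence[str]]
) -> List[str]:
    """Return the names of all pipelines that contain `node_name`."""
    return sorted(
        pipeline_name
        for pipeline_name, node_names in pipelines.items()
        if node_name in node_names
    )


# Illustrative registry only.
registry = {
    "__default__": ["clean", "train"],
    "data_processing": ["clean"],
    "data_science": ["train"],
}
matches = find_pipelines("clean", registry)
```

The helper returns a list rather than a single name because a node can belong to more than one pipeline, so the magic would need a rule for choosing between them.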

noklam added the "Issue: Feature Request" label Sep 16, 2024

noklam added this to Kedro Framework Sep 16, 2024

merelcht added this to the Improve the usability and debugging experience for Jupyter notebooks milestone Oct 22, 2024
