Improve error messages for modular pipelines #2633

Closed
astrojuanlu opened this issue Jun 2, 2023 · 5 comments · Fixed by #3716

@astrojuanlu
Member

Description

Some errors that happen when using modular pipelines could be more helpful.

For example, if a node declares a non-existent dataset as an input, the ModularPipelineError names the dataset that does exist in the catalog rather than the missing one. You can reproduce this by taking this line from our tutorial:

inputs=["model_input_table", "params:model_options"],

Changing one of the inputs to model_input_table_NOT_FOUND raises this error:

ModularPipelineError: Failed to map datasets and/or parameters: model_input_table

(source: https://www.linen.dev/s/kedro/t/12314014/hi-everyone-i-am-trying-to-use-the-modular-pipeline-module-b#27abe516-0718-43d4-8924-2bd965f64d22)

Another message that recently confused an internal user is "Inputs should be free inputs to the pipeline". A free input is "not an output from another node, thus unbound or free" (@idanov).

https://github.com/kedro-org/kedro/blob/f8230cdbc653f4c66194c34b91fd74b919ae7183/kedro/pipeline/modular_pipeline.py#L54C1-L55
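To make the "free input" definition concrete, here is a minimal sketch that computes the free inputs of a small pipeline. It does not use kedro: `SimpleNode` and `free_inputs` are hypothetical names introduced only for illustration, and the node list mirrors the (unmodified) spaceflights example.

```python
from typing import NamedTuple


class SimpleNode(NamedTuple):
    """Hypothetical stand-in for a pipeline node (not kedro's Node class)."""

    name: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def free_inputs(nodes: list[SimpleNode]) -> set[str]:
    """Return the names that are consumed by some node but produced by none."""
    consumed = {name for n in nodes for name in n.inputs}
    produced = {name for n in nodes for name in n.outputs}
    return consumed - produced


nodes = [
    SimpleNode(
        "split_data_node",
        ("model_input_table", "params:model_options"),
        ("X_train", "X_test", "y_train", "y_test"),
    ),
    SimpleNode("train_model_node", ("X_train", "y_train"), ("regressor",)),
    SimpleNode("evaluate_model_node", ("regressor", "X_test", "y_test"), ()),
]

print(sorted(free_inputs(nodes)))
# -> ['model_input_table', 'params:model_options']
```

X_train and regressor are not free inputs because another node produces them, which is the situation the current message is trying to describe.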

Possible Implementation

In the first case, maybe the error-checking code should first check the nodes' inputs, to give a more helpful error message.

In the second case, the text could say for example "Inputs must not be outputs from another node".


@astrojuanlu astrojuanlu added the Issue: Feature Request New feature or improvement to existing feature label Jun 2, 2023
@merelcht
Member

@astrojuanlu Can you add some steps to reproduce these confusing error message(s)?

@astrojuanlu
Member Author

I haven't worked on a reproducer yet but here's another user puzzled by the error message https://www.linen.dev/s/kedro/t/13226590/hi-everyone-i-m-having-a-bit-of-hard-time-understanding-what#055add79-fc39-450f-98eb-b8a8746cd2e7

Could someone kindly unpack / explain it?

@datajoely
Contributor

I have first-hand experience teaching people to use this feature, and it is not straightforward. I think a fuzzy suggestion workflow would go a long way.

@astrojuanlu
Member Author

So, the way to reproduce this error is, starting from the spaceflights tutorial, to create a data_science/pipeline.py as follows:

from kedro.pipeline import Pipeline, node, pipeline

from .nodes import evaluate_model, split_data, train_model


def create_pipeline(**kwargs) -> Pipeline:
    pipeline_instance = pipeline(
        [
            node(
                func=split_data,
                # Note wrong input name
                inputs=["model_input_table_NOT_FOUND", "params:model_options"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
                name="split_data_node",
            ),
            node(
                func=train_model,
                inputs=["X_train", "y_train"],
                outputs="regressor",
                name="train_model_node",
            ),
            node(
                func=evaluate_model,
                inputs=["regressor", "X_test", "y_test"],
                outputs=None,
                name="evaluate_model_node",
            ),
        ]
    )

    ds_pipeline_1 = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace="active_modelling_pipeline",
    )
    ds_pipeline_2 = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace="candidate_modelling_pipeline",
    )

    return ds_pipeline_1 + ds_pipeline_2

And then the error will be

ModularPipelineError: Failed to map datasets and/or parameters: model_input_table

Why is this confusing? Because model_input_table is a well-defined dataset in the catalog. But the error actually means that there's a mismatch between the input declared in the namespaced pipeline and the one originally declared in the pipeline instance.
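The mismatch can be illustrated with a simplified model of the check. This is an assumption about the general shape of the validation, not kedro's actual code: `unmapped_names` is a hypothetical helper that compares the names passed to `pipeline(..., inputs=...)` against the names the wrapped pipeline's nodes really use.

```python
def unmapped_names(requested: set[str], pipeline_datasets: set[str]) -> set[str]:
    """Names the caller asked to map that the wrapped pipeline never references."""
    return requested - pipeline_datasets


# Names referenced by the nodes of pipeline_instance (note the wrong input name).
pipeline_datasets = {
    "model_input_table_NOT_FOUND",
    "params:model_options",
    "X_train",
    "X_test",
    "y_train",
    "y_test",
    "regressor",
}

# Names passed as inputs= when wrapping the pipeline in a namespace.
requested = {"model_input_table"}

missing = unmapped_names(requested, pipeline_datasets)
print(f"Failed to map datasets and/or parameters: {', '.join(sorted(missing))}")
```

Under this model the message reports model_input_table because that is the name the wrapped pipeline does not contain, regardless of whether the catalog defines it.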

@datajoely
Contributor

I think it would be very helpful to suggest:

  • partial matches, which help solve the namespace mismatch issue
  • fuzzy matches for typos
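A rough sketch of what such suggestions could look like, using only the standard library's `difflib`. The helper names `suggest_names` and `build_message`, the message wording, and the 0.6 cutoff are all assumptions made for illustration; nothing here is existing kedro behaviour.

```python
import difflib


def suggest_names(unknown: str, available: list[str], cutoff: float = 0.6) -> list[str]:
    """Suggest available names: substring (partial) matches first, then fuzzy ones."""
    partial = [name for name in available if unknown in name or name in unknown]
    fuzzy = difflib.get_close_matches(unknown, available, n=3, cutoff=cutoff)
    suggestions: list[str] = []
    for name in partial + fuzzy:  # keep order, drop duplicates
        if name not in suggestions:
            suggestions.append(name)
    return suggestions


def build_message(unknown: str, available: list[str]) -> str:
    """Hypothetical error text that appends suggestions when any are found."""
    message = f"Failed to map datasets and/or parameters: {unknown}."
    hints = suggest_names(unknown, available)
    if hints:
        message += f" Did you mean: {', '.join(hints)}?"
    return message


available = [
    "model_input_table_NOT_FOUND",
    "params:model_options",
    "X_train",
    "regressor",
]
print(build_message("model_input_table", available))
```

Putting partial matches first is a design choice aimed at the namespace case, where the intended name is usually contained in (or contains) the one that was typed; the fuzzy pass then catches plain typos.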
