Skip to content

Commit

Permalink
Revise modular pipelines docs (#3948)
Browse files Browse the repository at this point in the history
---------

Signed-off-by: Dmitry Sorokin <[email protected]>
Signed-off-by: Dmitry Sorokin <[email protected]>
Co-authored-by: Merel Theisen <[email protected]>
Co-authored-by: Ankita Katiyar <[email protected]>
Co-authored-by: Nok Lam Chan <[email protected]>
Co-authored-by: Juan Luis Cano Rodríguez <[email protected]>
Co-authored-by: Jo Stichbury <[email protected]>
Co-authored-by: Yury Fedotov <[email protected]>
  • Loading branch information
7 people authored Jul 2, 2024
1 parent 0b9fe32 commit c269cde
Show file tree
Hide file tree
Showing 12 changed files with 313 additions and 308 deletions.
4 changes: 2 additions & 2 deletions docs/source/faq/faq.md
Original file line number Diff line number Diff line change
Expand Up @@ -52,8 +52,8 @@ This is a growing set of technical FAQs. The [product FAQs on the Kedro website]

## Nodes and pipelines

* [How do I create a modular pipeline](../nodes_and_pipelines/modular_pipelines.md#how-do-i-create-a-modular-pipeline)?

* [How can I create a new blank pipeline](../nodes_and_pipelines/modular_pipelines.md#how-to-create-a-new-blank-pipeline-using-the-kedro-pipeline-create-command)?
* [How can I reuse my pipelines](../nodes_and_pipelines/namespaces.md)?
* [Can I use generator functions in a node](../nodes_and_pipelines/nodes.md#how-to-use-generator-functions-in-a-node)?

## What is data engineering convention?
Expand Down
Binary file modified docs/source/meta/images/cook_disjointed.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified docs/source/meta/images/cook_joined.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified docs/source/meta/images/cook_no_namespace.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified docs/source/meta/images/cook_params.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/source/meta/images/namespaces_collapsed.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
1 change: 1 addition & 0 deletions docs/source/nodes_and_pipelines/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
nodes
pipeline_introduction
modular_pipelines
namespaces
pipeline_registry
micro_packaging
run_a_pipeline
Expand Down
335 changes: 84 additions & 251 deletions docs/source/nodes_and_pipelines/modular_pipelines.md

Large diffs are not rendered by default.

166 changes: 166 additions & 0 deletions docs/source/nodes_and_pipelines/namespaces.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,166 @@
# Reuse pipelines with namespaces

## How to reuse your pipelines

If you want to create a new pipeline that performs similar tasks with different inputs/outputs/parameters as your `existing_pipeline`, you can use the same `pipeline()` creation function as described in [How to structure your pipeline creation](modular_pipelines.md#how-to-structure-your-pipeline-creation). This function allows you to overwrite inputs, outputs, and parameters. Your new pipeline creation code should look like this:

```python
def create_pipeline(**kwargs) -> Pipeline:
return pipeline(
existing_pipeline, # Name of the existing Pipeline object
inputs = {"old_input_df_name" : "new_input_df_name"}, # Mapping existing Pipeline input to new input
outputs = {"old_output_df_name" : "new_output_df_name"}, # Mapping existing Pipeline output to new output
parameters = {"params: model_options": "params: new_model_options"}, # Updating parameters
)
```

This means you can create multiple pipelines based on the `existing_pipeline` pipeline to test different approaches with various input datasets and model training parameters. For example, for the `data_science` pipeline from our [Spaceflights tutorial](../tutorial/add_another_pipeline.md#data-science-pipeline), you can restructure the `src/project_name/pipelines/data_science/pipeline.py` file by separating the `data_science` pipeline creation code into a separate `base_data_science` pipeline object, then reusing it inside the `create_pipeline()` function:

```python
#src/project_name/pipelines/data_science/pipeline.py

from kedro.pipeline import Pipeline, node, pipeline
from .nodes import evaluate_model, split_data, train_model

base_data_science = pipeline(
[
node(
func=split_data,
inputs=["model_input_table", "params:model_options"],
outputs=["X_train", "X_test", "y_train", "y_test"],
name="split_data_node",
),
node(
func=train_model,
inputs=["X_train", "y_train"],
outputs="regressor",
name="train_model_node",
),
node(
func=evaluate_model,
inputs=["regressor", "X_test", "y_test"],
outputs=None,
name="evaluate_model_node",
),
]
) # Creating a base data science pipeline that will be reused with different model training parameters

# data_science pipeline creation function
def create_pipeline(**kwargs) -> Pipeline:
return pipeline(
[base_data_science], # Creating a new data_science pipeline based on base_data_science pipeline
parameters={"params:model_options": "params:model_options_1"}, # Using a new set of parameters to train model
)
```

To use a new set of parameters, you should create a second parameters file to ovewrite parameters specified in `conf/base/parameters.yml`. To overwrite the parameter `model_options`, create a file `conf/base/parameters_data_science.yml` and add a parameter called `model_options_1`:

```python
#conf/base/parameters.yml
model_options_1:
test_size: 0.15
random_state: 3
features:
- passenger_capacity
- crew
- d_check_complete
- moon_clearance_complete
- company_rating
```

> In Kedro, you cannot run pipelines with the same node names. In this example, both pipelines have nodes with the same names, so it's impossible to execute them together. However, `base_data_science` is not registered and will not be executed with the `kedro run` command. The `data_science` pipeline, on the other hand, will be executed during `kedro run` because it will be autodiscovered by Kedro, as it was created inside the `create_pipeline()` function.
If you want to execute `base_data_science` and `data_science` pipelines together or reuse `base_data_science` a few more times, you need to modify the node names. The easiest way to do this is by using namespaces.

## What is a namespace

A namespace is a way to isolate nodes, inputs, outputs, and parameters inside your pipeline. If you put `namespace="namespace_name"` attribute inside the `pipeline()` creation function, it will add the `namespace_name.` prefix to all nodes, inputs, outputs, and parameters inside your new pipeline.

> If you don't want to change the names of your inputs, outputs, or parameters with the `namespace_name.` prefix while using a namespace, you should list these objects inside the corresponding parameters of the `pipeline()` creation function like this:
> `inputs={"input_that_should_not_be_prefixed"}`
Let's extend our previous example and try to reuse the `base_data_science` pipeline one more time by creating another pipeline based on it. First, we should use the `kedro pipeline create` command to create a new blank pipeline named `data_science_2`:

```python
kedro pipeline create data_science_2
```
Then, we need to modify the `src/project_name/pipelines/data_science_2/pipeline.py` file to create a pipeline in a similar way to the example above. We will import `base_data_science` from the code above and use a namespace to isolate our nodes:

```python
#src/project_name/pipelines/data_science_2/pipeline.py
from kedro.pipeline import Pipeline, pipeline
from ..data_science.pipeline import base_data_science # Import pipeline to create a new one based on it

def create_pipeline() -> Pipeline:
return pipeline(
base_data_science, # Creating a new data_science_2 pipeline based on base_data_science pipeline
namespace = "ds_2", # With that namespace, "ds_2." prefix will be added to inputs, outputs, params, and node names
parameters={"params:model_options": "params:model_options_2"}, # Using a new set of parameters to train model
inputs={"model_input_table"}, # Inputs remain the same, without namespace prefix
)
```

To use a new set of parameters, copy `model_options` from `conf/base/parameters_data_science.yml` to `conf/base/parameters_data_science_2.yml` and modify it slightly to try new model training parameters, such as test size and a different feature set. Call it `model_options_2`:

```python
#conf/base/parameters.yml
model_options_2:
test_size: 0.3
random_state: 3
features:
- d_check_complete
- moon_clearance_complete
- iata_approved
- company_rating
```

In this example, all nodes inside the `data_science_2` pipeline will be prefixed with `ds_2`: `ds_2.split_data`, `ds_2.train_model`, `ds_2.evaluate_model`. Parameters will be used from `model_options_2` because we overwrite `model_options` with them. The input for that pipeline will be `model_input_table` as it was previously, because we mentioned that in the inputs parameter (without that, the input would be modified to `ds_2.model_input_table`, but we don't have that table in the pipeline).

Since the node names are unique now, we can run the project with:

```python
kedro run
```

Logs show that `data_science` and `data_science_2` pipelines were executed successfully with different R2 results. Now, we can see how Kedro-viz renders namespaced pipelines in collapsible "super nodes":

```python
kedro viz run
```

After running viz, we can see two equal pipelines: `data_science` and `data_science_2`:

![namespaces uncollapsed](../meta/images/namespaces_uncollapsed.png)

We can collapse all namespaced pipelines (in our case, it's only `data_science_2`) with a special button and see that the `data_science_2` pipeline was collapsed into one super node called `Ds 2`:

![namespaces collapsed](../meta/images/namespaces_collapsed.png)

> Tip: You can use `kedro run --namespace=namespace_name` to run only the specific namespace

### How to namespace all pipelines in a project

If we want to make all pipelines in this example fully namespaced, we should:

Modify the `data_processing` pipeline by adding to the `pipeline()` creation function in `src/project_name/pipelines/data_processing/pipeline.py` with the following code:
```python
namespace="data_processing",
inputs={"companies", "shuttles", "reviews"}, # Inputs remain the same, without namespace prefix
outputs={"model_input_table"}, # Outputs remain the same, without namespace prefix
```
Modify the `data_science` pipeline by adding namespace and inputs in the same way as it was done in `data_science_2` pipeline:

```python
def create_pipeline(**kwargs) -> Pipeline:
return pipeline(
base_data_science,
namespace="ds_1",
parameters={"params:model_options": "params:model_options_1"},
inputs={"model_input_table"},
)
```

After executing the pipeline with `kedro run`, the visualisation with `kedro viz run` after collapsing will look like this:

![namespaces collapsed all](../meta/images/namespaces_collapsed_all.png)
Loading

0 comments on commit c269cde

Please sign in to comment.