This repository has been archived by the owner on Mar 3, 2023. It is now read-only.

Merge pull request #18 from AntonyMilneQB/update-for-0.17.2
Update to kedro 0.17.2
antonymilne authored Apr 12, 2021
2 parents d4291d6 + b723533 commit a276935
Showing 18 changed files with 248 additions and 298 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -2,7 +2,7 @@

This repository contains training materials that will teach you how to use [Kedro](https://github.com/quantumblacklabs/kedro/). This content is based on the standard [spaceflights tutorial described in the Kedro documentation](https://kedro.readthedocs.io/en/stable/03_tutorial/01_spaceflights_tutorial.html).

The training documentation was most recently updated against Kedro 0.17.0 in February 2021.
The training documentation was most recently updated against Kedro 0.17.2 in March 2021.

To get started, navigate to the [training_docs](./training_docs/01_welcome.md) to see what is covered in the training, and how to ensure you get the most out of the time you set aside for it.

Binary file removed img/airflow_ui.png
Binary file removed img/context.png
Binary file removed img/context_catalog_load.png
Binary file removed img/custom_command.png
Binary file removed img/enable_tags.png
Binary file removed img/reload_kedro.png
Binary file removed img/tag_nb_cell.png
6 changes: 3 additions & 3 deletions training_docs/01_welcome.md
@@ -3,13 +3,13 @@ Welcome! We are so pleased you are starting your Kedro journey.

## What you'll cover

* [Training prerequisites](./02_prerequisites.md) << Read this before training starts
* :arrow_right: [Training prerequisites](./02_prerequisites.md) :arrow_left: **Read this before training starts**
* [Create a new Kedro project](./03_new_project.md)
* [Project dependencies](./04_dependencies.md)
* [Add a data source](./05_connect_data_sources.md)
* [Jupyter notebook workflow](./06_jupyter_notebook_workflow.md)
* [Kedro pipelines](./07_pipelines.md)
* [Pipeline visualisatopm](./08_visualisation.md)
* [Pipeline visualisation](./08_visualisation.md)
* [Versioning](./09_versioning.md)
* [Package your project](./10_package_project.md)
* [Configuration](./11_configuration.md)
@@ -51,4 +51,4 @@ These training materials assume some level of technical understanding, such as k


_[Go to the next page](./02_prerequisites.md)_
_[Go to the next page](./02_prerequisites.md)_
1 change: 0 additions & 1 deletion training_docs/02_prerequisites.md
@@ -78,4 +78,3 @@ If you are able to complete all of the above, you are ready for the training!

_[Go to the next page](./03_new_project.md)_


14 changes: 6 additions & 8 deletions training_docs/03_new_project.md
@@ -2,16 +2,16 @@

This section mirrors the [spaceflights tutorial in the Kedro documentation](https://kedro.readthedocs.io/en/stable/03_tutorial/01_spaceflights_tutorial.html).

As we work with the spaceflights tutorial, will follow these steps:
As we work with the spaceflights tutorial, we will follow these steps:

### 1. Set up the project template

* Create a new project with `kedro new`
* Install project dependencies with `kedro install`
* Configure the following in the `conf` folder:
* Logging
* Credentials
* Any other sensitive / personal content
* Install project dependencies with `kedro install`

### 2. Set up the data

@@ -52,9 +52,9 @@ Follow one or other of these instructions to create the project:

- Keep the default names for the `repo_name` and `python_package` when prompted.

- The project will be populated with the template code from the [Kedro starter for the spaceflights tutorial](https://github.com/quantumblacklabs/kedro-starters/tree/master/spaceflights). It means that you can follow the tutorial without any of the copy/pasting.
- The project will be populated with the template code from the [Kedro starter for the spaceflights tutorial](https://github.com/quantumblacklabs/kedro-starters/tree/master/spaceflights). This means that you can follow the tutorial without any of the copy/pasting.

* If you prefer to create an empty tutorial and cut and paste the code to follow along with the steps, you should instead run the following to [create a new empty Kedro project](https://kedro.readthedocs.io/en/stable/02_get_started/04_new_project.html#create-a-new-project-interactively) using the default interactive prompts: `kedro new`
* If you prefer to create an empty tutorial and copy and paste the code to follow along with the steps, you should instead run `kedro new` to [create a new empty Kedro project](https://kedro.readthedocs.io/en/stable/02_get_started/04_new_project.html#create-a-new-project-interactively).

- Feel free to name your project as you like, but this guide will assume the project is named **`Kedro Training`**, and that your project is in a sub-folder in your working directory that was created by `kedro new`, named `kedro-training`.

@@ -81,11 +81,9 @@ dev_s3:
```
For security reasons, we strongly recommend not committing any credentials or other secrets to the Version Control System. By default any file inside the `conf` folder (and subfolders) in your Kedro project containing `credentials` word in its name will be ignored and not committed to your repository.

Please bear it in mind when you start working with Kedro project that you have cloned from GitHub, for example, as you might need to configure required credentials first.
For security reasons, we strongly recommend not committing any credentials or other secrets to the Version Control System. By default any file inside the `conf` folder (and subfolders) in your Kedro project containing `credentials` in its name will be ignored and not committed to your repository.

>**Note**: If you maintain a project, you should document how to configure any required credentials in your project's documentation.
> Note: If you maintain a project, you should document how to configure any required credentials in your project's documentation.

The Kedro documentation lists some [best practices to avoid leaking confidential data](https://kedro.readthedocs.io/en/stable/02_get_started/05_example_project.html#what-best-practice-should-i-follow-to-avoid-leaking-confidential-data).
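
If you want to confirm that your local credentials are picked up, here is a minimal sketch (an illustration only: it assumes the `dev_s3` entry above lives in `conf/local/credentials.yml` and that you are in a `kedro ipython` session where `context` is available):

```python
# Load every credentials file found under the conf folder via the project's config loader.
# "dev_s3" is the example entry shown above; print only the key names, never the secret values.
credentials = context.config_loader.get("credentials*", "credentials*/**")
print(list(credentials["dev_s3"].keys()))
```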

8 changes: 4 additions & 4 deletions training_docs/04_dependencies.md
@@ -12,7 +12,7 @@ isort>=4.3.21, <5.0 # Used for linting code with `kedro lint`
jupyter~=1.0 # Used to open a Kedro-session in Jupyter Notebook & Lab
jupyter_client>=5.1.0, <7.0 # Used to open a Kedro-session in Jupyter Notebook & Lab
jupyterlab==0.31.1 # Used to open a Kedro-session in Jupyter Lab
kedro==0.17.0
kedro==0.17.2
nbstripout==0.3.3 # Strips the output of a Jupyter Notebook and writes the outputless version to the original file
pytest-cov~=2.5 # Produces test coverage reports
pytest-mock>=1.7.1,<2.0 # Wrapper around the mock package for easier use with pytest
@@ -30,10 +30,10 @@ The dependencies above may be sufficient for some projects, but for the spacefli
pip install kedro[pandas.CSVDataSet,pandas.ExcelDataSet]
```

Alternatively, if you need to, you can edit `src/requirements.txt` directly to modify your list of dependencies by replacing the requirement `kedro==0.17.0` with the following (your version of Kedro may be different):
Alternatively, if you need to, you can edit `src/requirements.txt` directly to modify your list of dependencies by replacing the requirement `kedro==0.17.2` with the following (your version of Kedro may be different):

```text
kedro[pandas.CSVDataSet,pandas.ExcelDataSet]==0.17.0
kedro[pandas.CSVDataSet,pandas.ExcelDataSet]==0.17.2
```

Then run the following:
@@ -42,7 +42,7 @@
kedro build-reqs
```

[`kedro build-reqs`](https://kedro.readthedocs.io/en/stable/09_development/03_commands_reference.html#build-the-project-s-dependency-tree) takes `requirements.in` file (or `requirements.txt` if it does not yet exist), resolves all package versions and 'freezes' them by putting pinned versions back into `requirements.txt`. It significantly reduces the chances of dependencies issues due to downstream changes as you would always install the same package versions.
[`kedro build-reqs`](https://kedro.readthedocs.io/en/stable/09_development/03_commands_reference.html#build-the-project-s-dependency-tree) takes the `requirements.in` file (or `requirements.txt` if it does not yet exist), resolves all package versions and 'freezes' them by putting pinned versions back into `requirements.txt`. This significantly reduces the chances of dependencies issues due to downstream changes as you would always install the same package versions.


You can find out more about [how to work with project dependencies](https://kedro.readthedocs.io/en/stable/04_kedro_project_setup/01_dependencies.html) in the Kedro project documentation.
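
As a rough illustration of what these extras provide (a sketch only; it assumes the raw spaceflights files described in the next section are already in `data/01_raw/`), the installed dataset classes can be used directly:

```python
# The pandas extras install these dataset classes, which catalog.yml entries of
# type pandas.CSVDataSet / pandas.ExcelDataSet resolve to under the hood.
from kedro.extras.datasets.pandas import CSVDataSet, ExcelDataSet

companies = CSVDataSet(filepath="data/01_raw/companies.csv").load()
shuttles = ExcelDataSet(filepath="data/01_raw/shuttles.xlsx").load()
print(companies.shape, shuttles.shape)
```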
16 changes: 9 additions & 7 deletions training_docs/05_connect_data_sources.md
@@ -3,14 +3,14 @@
In this section, we discuss the data set-up phase. The steps are as follows:

* Add datasets to your `data/` folder, according to [data engineering convention](https://kedro.readthedocs.io/en/stable/12_faq/01_faq.html#what-is-data-engineering-convention)
* Register the datasets with the Data Catalog, which is the registry of all data sources available for use by the project `conf/base/catalog.yml`. This ensures that your code is reproducible when it references datasets in different locations and/or environments.
* Register the datasets with the Data Catalog in `conf/base/catalog.yml`, which is the registry of all data sources available for use by the project. This ensures that your code is reproducible when it references datasets in different locations and/or environments.

You can find further information about [the Data Catalog](https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html) in specific documentation covering advanced usage.


## Add your datasets to `data`

The spaceflights tutorial makes use of fictional datasets of companies shuttling customers to the Moon and back. You will use the data to train a model to predict the price of shuttle hire. However, before you get to train the model, you will need to prepare the data by doing some data engineering, which is the process of preparing data for model building by creating a master table.
The spaceflights tutorial makes use of fictional datasets of companies shuttling customers to the Moon and back. You will use the data to train a model to predict the price of shuttle hire. However, before you get to train the model, you will need to prepare the data for model building by creating a master table.

The spaceflight tutorial has three files and uses two data formats: `.csv` and `.xlsx`. Download and save the files to the `data/01_raw/` folder of your project directory:

@@ -20,7 +20,7 @@ The spaceflight tutorial has three files and uses two data formats: `.csv` and `

Here are some examples of how you can [download the files from GitHub](https://www.quora.com/How-do-I-download-something-from-GitHub) to the `data/01_raw` directory inside your project:

Using [cURL in a Unix terminal](https://curl.haxx.se/download.html):
Using [cURL in a Unix terminal](https://curl.se/download.html):

<details>
<summary><b>Click to expand</b></summary>
@@ -103,10 +103,11 @@ reviews:
To check whether Kedro can load the data correctly, open a `kedro ipython` session and run:

```python
catalog.load("companies").head()
companies = catalog.load("companies")
companies.head()
```

The command loads the dataset named `companies` (as per top-level key in `catalog.yml`), from the underlying filepath `data/01_raw/companies.csv`. It displays the first five rows of the dataset, and is loaded into a `pandas` DataFrame for you to experiment with the data.
The command loads the dataset named `companies` (as per top-level key in `catalog.yml`) from the underlying filepath `data/01_raw/companies.csv` into the variable `companies`, which is of type `pandas.DataFrame`. The `head` method from `pandas` then displays the first five rows of the DataFrame.
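
If you want to double-check which entries the Data Catalog has picked up before loading anything, a small sketch (run in the same `kedro ipython` session) is:

```python
# List the names of all datasets registered in conf/base/catalog.yml;
# "companies" should appear once the entry above has been added.
print(catalog.list())
```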

When you have finished, close the `ipython` session as follows:

Expand All @@ -124,10 +125,11 @@ shuttles:
filepath: data/01_raw/shuttles.xlsx
```

To test that everything works as expected, load the dataset within a _new_ `kedro ipython` session:
To test that everything works as expected, load the dataset within a _new_ `kedro ipython` session and display its first five rows:

```python
catalog.load("shuttles").head()
shuttles = catalog.load("shuttles")
shuttles.head()
```
When you have finished, close the `ipython` session as follows:

31 changes: 17 additions & 14 deletions training_docs/06_jupyter_notebook_workflow.md
@@ -32,8 +32,8 @@ exit()
To test the IPython session, load a dataset defined in your `conf/base/catalog.yml`, by simply executing the following:

```python
df = catalog.load("companies")
df.head()
companies = catalog.load("companies")
companies.head()
```

#### Dataset versioning
@@ -73,9 +73,9 @@ Navigate to the `notebooks` folder of your Kedro project and create a new notebo
Every time you start or restart a Jupyter or IPython session in the CLI using a `kedro` command, a startup script in `.ipython/profile_default/startup/00-kedro-init.py` is executed. It adds the following variables in scope:

* `catalog` (`DataCatalog`) - Data catalog instance that contains all defined datasets; this is a shortcut for `context.catalog`, but it's only created at startup time, whereas `context.catalog` is rebuilt everytime.
* `context` (`KedroContext`) - Kedro project context that provides access to Kedro's library components.
* `session` (`KedroSession`) - Session data (static and dynamic) for the Kedro run.
* `catalog` (`DataCatalog`) - Data catalog instance that contains all defined datasets; this is a shortcut for `context.catalog`
* `session` (`KedroSession`) - Kedro session that orchestrates the run
* `startup_error` (`Exception`) - An error that was raised during the execution of the startup script or `None` if no errors occurred

## How to use `context`
@@ -90,17 +90,14 @@ With `context`, you can access the following variables and methods:
- `context.project_name` (`str`) - Project folder name
- `context.catalog` (`DataCatalog`) - An instance of DataCatalog
- `context.config_loader` (`ConfigLoader`) - An instance of ConfigLoader
- `context.pipeline` (`Pipeline`) - Defined pipeline
- `context.pipeline` (`Pipeline`) - The `__default__` pipeline
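
As a quick illustration (a sketch only; the printed values depend entirely on your own project), you can exercise a few of these attributes in a notebook cell:

```python
# Inspect the project context created by the Kedro startup script.
print(context.project_name)         # project folder name
print(context.catalog.list())       # datasets registered in the Data Catalog
print(len(context.pipeline.nodes))  # number of nodes in the __default__ pipeline
```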

### Run the pipeline

If you wish to run the whole main pipeline within a notebook cell, you can do so by instantiating a `Session`:
If you wish to run the whole main pipeline within a notebook cell, you can do so by running:

```python
from kedro.framework.session import KedroSession

with KedroSession.create("<your-kedro-project-package-name>") as session:
session.run()
session.run()
```

The command runs the nodes from your default project pipeline in a sequential manner.
@@ -169,17 +166,19 @@ You can also specify the following optional arguments for `session.run()`:
+---------------+----------------+-------------------------------------------------------------------------------+
| to_nodes | Iterable[str] | A list of node names which should be used as an end point |
+---------------+----------------+-------------------------------------------------------------------------------+
| from_inputs | Iterable[str] | A list of dataset names which should be used as a starting point |
| from_inputs | Iterable[str] | A list of dataset names which should be used as a starting point |
+---------------+----------------+-------------------------------------------------------------------------------+
| to_outputs | Iterable[str] | A list of dataset names which should be used as an end point |
+---------------+----------------+-------------------------------------------------------------------------------+
| load_versions | Dict[str, str] | A mapping of a dataset name to a specific dataset version (timestamp) |
| | | for loading - this applies to the versioned datasets only |
+---------------+----------------+-------------------------------------------------------------------------------+
| pipeline_name | str | Name of the modular pipeline to run - must be one of those returned |
| | | by register_pipelines function from src/<package_name>/hooks.py |
| | | by register_pipelines function from src/<package_name>/pipeline_registry.py |
+---------------+----------------+-------------------------------------------------------------------------------+
```

This list of options is fully compatible with the list of CLI options for the `kedro run` command. In fact, `kedro run` is calling `context.run()` behind the scenes.
This list of options is fully compatible with the list of CLI options for the `kedro run` command. In fact, `kedro run` is calling `session.run()` behind the scenes.
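
For example, a sketch of a partial run (the dataset and node names below are placeholders; substitute names from your own pipeline) might look like this:

```python
# Run only a slice of the default pipeline, mirroring the CLI options listed above.
session.run(
    from_inputs=["companies"],               # start from this dataset
    to_nodes=["preprocess_companies_node"],  # hypothetical node name used as the end point
)
```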


## Global variables
@@ -199,7 +198,9 @@ def reload_kedro(project_path, line=None):
context = session.load_context()
parameters = context.params
# ...
logging.info("Defined global variable `context`, `session`, `catalog` and `parameters`")
logging.info(
"Defined global variable `context`, `session`, `catalog` and `parameters`"
)
except:
pass
```
@@ -316,6 +317,8 @@ To reload these variables at any point (e.g., if you update `catalog.yml`), use

![reload kedro line magic graphic](./images/jupyter_notebook_loading_context.png)

Note that if you want to pass an argument to the `reload_kedro` line magic function, you should call it like a normal Python function (e.g. `reload_kedro(extra_params=extra_params)`) rather than using the `%reload_kedro` syntax in a notebook cell (e.g. `%reload_kedro(extra_params=extra_params)` would not work).
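
A small sketch (it assumes your project's `reload_kedro` accepts an `extra_params` argument, as in the note above; the parameter key is a placeholder):

```python
# Reload `catalog`, `context` and `session` with runtime parameter overrides;
# the dictionary contents below are placeholders only.
reload_kedro(extra_params={"test_size": 0.2})
```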

If the `KEDRO_ENV` environment variable is specified, the startup script loads that environment, otherwise it defaults to `local`. Instructions for setting the environment variable can be found in the [Kedro configuration documentation](https://kedro.readthedocs.io/en/stable/04_kedro_project_setup/02_configuration.html#additional-configuration-environments).

