This repository has been archived by the owner on Mar 3, 2023. It is now read-only.

Update readme and add new training materials #15

Merged
merged 10 commits into from
Feb 11, 2021
15 changes: 3 additions & 12 deletions README.md
@@ -1,20 +1,11 @@
# Kedro Training

This repository contains training materials that will teach you how to use [Kedro](https://github.com/quantumblacklabs/kedro/). This content is based on the [Spaceflights tutorial](https://kedro.readthedocs.io/en/stable/03_tutorial/02_tutorial_template.html) using Kedro 0.16.5 specified in our documentation.
This repository contains training materials that will teach you how to use [Kedro](https://github.com/quantumblacklabs/kedro/). This content is based on the standard [spaceflights tutorial described in the Kedro documentation](https://kedro.readthedocs.io/en/stable/03_tutorial/01_spaceflights_tutorial.html).

## Scenario
The training documentation was most recently updated against Kedro 0.17.0 in February 2021.

Our project will be based on the following scenario:
> It is 2160 and the space tourism industry is booming. Globally, there are thousands of space shuttle companies taking tourists to the Moon and back. You have been able to source amenities offered in each space shuttle, customer reviews and company information. You want to construct a model for predicting the price for each trip to the Moon and the corresponding return flight. 🚀
To get started, navigate to the [training_docs](./training_docs/01_welcome.md) to see what is covered in the training, and how to ensure you get the most out of the time you set aside for it.

## Agenda

This tutorial covers:
- Project setup
- [Setting up a new Kedro project](docs/04_new_project.md)
- Using the Data Catalog to connect to data sources
- Creating, running and visualising a pipeline
- Advanced functionality in Kedro

## License

56 changes: 56 additions & 0 deletions training_docs/01_welcome.md
@@ -0,0 +1,56 @@
# Welcome to the Kedro training!
Welcome! We are so pleased you are starting your Kedro journey.

## What you'll cover

* [Training prerequisites](./02_prerequisites.md) << Read this before training starts
* [Create a new Kedro project](./03_new_project.md)
* [Project dependencies](./04_dependencies.md)
* [Add a data source](./05_connect_data_sources.md)
* [Jupyter notebook workflow](./06_jupyter_notebook_workflow.md)
* [Kedro pipelines](./07_pipelines.md)
* [Pipeline visualisation](./08_visualisation.md)
* [Versioning](./09_versioning.md)
* [Package your project](./10_package_project.md)
* [Configuration](./11_configuration.md)
* [Transcoding](./12_transcoding.md)
* [Custom datasets](./13_custom_datasets.md)
* [Custom CLI commands](./14_custom_cli_commands.md)
* [Kedro plugins](./15_plugins.md)


## Before your training session

These training materials assume some level of technical understanding.

To optimise your experience and learn the most you can from the Kedro training, please review the following and then take a look at the [prerequisites page](./02_prerequisites.md) to get the necessary software, including Kedro, installed and running before your training session.

## Prerequisite knowledge: Python, YAML and the CLI

- You should be familiar with Python basics. [Take a look at this tutorial](https://docs.python.org/3/tutorial/) to confirm you are comfortable with:

    - Functions, loops, conditional statements and I/O operations
    - Common data structures including lists, dictionaries and tuples

- You should also:
    - Be able to [install Python packages using `pip`](https://pip.pypa.io/en/stable/quickstart/)
    - Understand the basics of [dependency management with `requirements.txt`](https://pip.pypa.io/en/latest/user_guide/#requirements-files)
    - Know about [Python modules](https://docs.python.org/3/tutorial/modules.html) (e.g. how to use `__init__.py` and relative and absolute imports)
    - Have [familiarity with Python data science libraries](https://towardsdatascience.com/top-10-python-libraries-for-data-science-cd82294ec266), especially `pandas` and `scikit-learn`
    - Understand how to use [Jupyter Notebook/Lab](https://www.dataquest.io/blog/jupyter-notebook-tutorial/) and [IPython](https://www.codecademy.com/articles/how-to-use-ipython)
    - Be able to [use a virtual environment](https://docs.python.org/3/tutorial/venv.html) (we recommend using `conda`, but you can also use `venv` or `pipenv`)

- You should know [basic YAML syntax](https://yaml.org/)

- When working with the command line, you should be familiar with:

    - `cd` to navigate directories
    - `ls` to list files and directories
    - [Executing a command and Python program from the command line](https://realpython.com/run-python-scripts/#how-to-run-python-scripts-using-the-command-line)


>**Note**: If you have any problems or questions, please contact an instructor before the training.



_[Go to the next page](./02_prerequisites.md)_
82 changes: 82 additions & 0 deletions training_docs/02_prerequisites.md
@@ -0,0 +1,82 @@
# Training prerequisites
## Operating system
Kedro supports macOS, Linux and Windows (7 / 8 / 10 and Windows Server 2016+).

## Command line
You will need to use the command line interface, or CLI, which is a text-based application to view, navigate and manipulate files on your computer.

[Find out more about working with the command line](https://tutorial.djangogirls.org/en/intro_to_command_line/) (also known as cmd, CLI, prompt, console or terminal).

- [Terminal on macOS](https://support.apple.com/en-gb/guide/terminal/welcome/mac)
- [Windows command line](https://www.computerhope.com/issues/chusedos.htm)

## Python
Kedro supports [Python 3.6, 3.7 or 3.8](https://www.python.org/downloads/), so make sure that one of these versions is installed on your laptop.

We recommend using [Anaconda](https://www.anaconda.com/download) (Python 3.7 version) to install Python packages. You can also use `conda`, which comes with Anaconda, as a virtual environment manager when you install Kedro.

## Kedro
Follow the official [Kedro prerequisites documentation](https://kedro.readthedocs.io/en/latest/02_get_started/01_prerequisites.html) and the page that follows it to [install Kedro](https://kedro.readthedocs.io/en/latest/02_get_started/02_install.html).

A clean install of Kedro is designed to be lightweight. If you need additional tools, you can install them later.

>**Note**: If you encounter any problems, please engage Kedro community support on [Stack Overflow](https://stackoverflow.com/questions/tagged/kedro).


## Code editor
There are many code editors to choose from. Here are some we recommend:

- [PyCharm](https://www.jetbrains.com/pycharm/download/)
- [VS Code IDE](https://code.visualstudio.com/)
- [Atom](https://atom.io/)

## Kedro training code
Download the [Kedro training repository](https://github.com/quantumblacklabs/kedro-training) by following [these instructions](https://stackoverflow.com/questions/2751227/how-to-download-source-in-zip-format-from-github).

## Git (optional)

Git is a version control system that records changes to files as you work on them. Git is especially helpful for software developers as it allows changes to be tracked across a project, including who made each change and when.

When you [download `git`](https://git-scm.com/downloads), be sure to choose the correct version for your operating system:

### Installing Git on Windows
Download the [`git` for Windows installer](https://gitforwindows.org/). Make sure to select **Use `git` from the Windows command prompt**, which ensures that `git` is permanently added to your PATH.

Also select **Checkout Windows-style, commit Unix-style line endings** and click on **Next**.

This will give you both `git` and `git bash`, which you will find useful during the training.

### GitHub
GitHub is a web-based service for version control using Git. To use it, you will need to [set up an account](https://github.com).


## Checklist
Please use this checklist to make sure you have everything necessary to participate in the Kedro training.

- [ ] You have [Python 3 (either 3.6, 3.7 or 3.8)](https://www.python.org/downloads/) installed on your laptop

- [ ] You have Anaconda or an alternative [virtual environment manager](https://kedro.readthedocs.io/en/stable/02_get_started/01_prerequisites.html#virtual-environments)

- [ ] You have installed [Kedro](#kedro)

- [ ] You have [downloaded the `kedro-training` repository](#kedro-training-code)

- [ ] You have [a code editor](#code-editor) installed for writing Python code

- [ ] You have a [command line](#command-line) installed

Having completed the above checklist, make sure that you are able to execute the following commands from your command line interface:

- [ ] `python --version` or `python3 --version` returns a correct Python version (either 3.6, 3.7 or 3.8).

- [ ] `kedro --version` shows [the latest Kedro version](https://pypi.org/project/kedro/).
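
The two command-line checks above can also be scripted. The following is a minimal sketch that uses only the Python standard library. The helper names `python_is_supported` and `kedro_on_path` are made up for this illustration, and the supported range (3.6 to 3.8) is taken from the checklist above. It only checks that a `kedro` executable can be found, not which version it is.

```python
import shutil
import sys

# Supported minor versions of Python 3, as listed in the checklist above.
SUPPORTED_MINORS = (6, 7, 8)


def python_is_supported(version_info=None):
    """Return True if the (major, minor) version is Python 3.6, 3.7 or 3.8."""
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    return major == 3 and minor in SUPPORTED_MINORS


def kedro_on_path():
    """Return True if a `kedro` executable can be found on the PATH."""
    return shutil.which("kedro") is not None


# Hypothetical helpers: report the state of the current environment.
print("Python version supported:", python_is_supported())
print("kedro command found:", kedro_on_path())
```

If either line prints `False`, revisit the relevant section of this page before the training.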


If you are able to complete all of the above, you are ready for the training!

>**Note**: If you have any problems or questions about any item in the above checklist, please contact an instructor and resolve the issues before the training.


_[Go to the next page](./03_new_project.md)_


106 changes: 106 additions & 0 deletions training_docs/03_new_project.md
@@ -0,0 +1,106 @@
# Create a new project

This section mirrors the [spaceflights tutorial in the Kedro documentation](https://kedro.readthedocs.io/en/stable/03_tutorial/01_spaceflights_tutorial.html).

As we work with the spaceflights tutorial, we will follow these steps:

### 1. Set up the project template

* Create a new project with `kedro new`
* Configure the following in the `conf` folder:
    * Logging
    * Credentials
    * Any other sensitive / personal content
* Install project dependencies with `kedro install`

### 2. Set up the data

* Add data to the `data/` folder
* Reference all datasets for the project in `conf/base/catalog.yml`

### 3. Create the pipeline

* Create the data transformation steps as Python functions
* Construct the pipeline by adding your functions as nodes
* Choose how to run the pipeline: sequentially or in parallel

### 4. Package the project

* Build the project documentation
* Package the project for distribution


## Create and set up a new project

In this section, we discuss the project set-up phase, with the following steps:


* Create a new project
* Install dependencies
* Configure the project

----
In the text, we assume that you create an empty project and follow the flow of the tutorial by copying and pasting the example code into the project as we describe. This tutorial will take approximately 2 hours and you will learn each step of the Kedro project development workflow, by working on an example to construct nodes and pipelines for the price-prediction model.

However, you may prefer to get up and running more swiftly so we provide the full spaceflights example project as a [Kedro starter](https://kedro.readthedocs.io/en/stable/02_get_started/06_starters.html).

If you decide to create the example project fully populated with code, navigate to your chosen working directory and run the following:

```bash
kedro new --starter=spaceflights
```

This will generate a project from the [Kedro starter for the spaceflights tutorial](https://github.com/quantumblacklabs/kedro-starters/tree/master/spaceflights) so you can follow the tutorial without any of the copy/pasting.

----

If you prefer to create an empty project and copy and paste the code to follow along with the steps, you should instead run the following to [create a new empty Kedro project](https://kedro.readthedocs.io/en/stable/02_get_started/04_new_project.html#create-a-new-project-interactively) using the default interactive prompts:

```bash
kedro new
```

Feel free to name your project as you like, but this guide will assume the project is named **`Kedro Training`**.

Keep the default names for the `repo_name` and `python_package` when prompted.
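
To see how the default `repo_name` and `python_package` relate to the project name, here is a small sketch. It assumes the defaults are derived by lower-casing the project name and replacing spaces with hyphens (repository) or underscores (package). The function `default_names` is hypothetical and is not Kedro's own code; the prompts in your version of Kedro may behave slightly differently.

```python
def default_names(project_name):
    """Derive a repository name and a Python package name from a project name.

    Assumption: this mimics the defaults offered by the interactive prompts;
    it is an illustration, not Kedro's implementation.
    """
    cleaned = project_name.strip().lower()
    repo_name = cleaned.replace(" ", "-").replace("_", "-")
    python_package = cleaned.replace(" ", "_").replace("-", "_")
    return repo_name, python_package


repo_name, python_package = default_names("Kedro Training")
print(repo_name)       # kedro-training
print(python_package)  # kedro_training
```

The hyphenated form is used for the project folder, while the underscored form is a valid Python identifier and so can be imported as a package.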


### Project structure
Take a few minutes to familiarise yourself with the [Kedro project structure](https://kedro.readthedocs.io/en/stable/02_get_started/05_example_project.html#project-directory-structure) by exploring the contents of the `kedro-training` folder.


## Configure the project

You may optionally add any credentials, such as usernames and passwords, that you need to load specific data sources to `conf/local/credentials.yml`. Some examples are given within the file to illustrate how you store credentials. Additional information can be found in a later page on [advanced configuration](./11_configuration.md).

When it runs, Kedro automatically reads credentials from the `conf` folder and feeds them into the Data Catalog, which is responsible for loading and saving data as inputs and outputs of pipeline nodes. You can configure your credentials once and then reuse them in multiple datasets.

Example of `conf/local/credentials.yml`:

```yaml
dev_s3:
  client_kwargs:
    aws_access_key_id: key
    aws_secret_access_key: secret
```


For security reasons, we strongly recommend that you do not commit any credentials or other secrets to your version control system. By default, any file inside the `conf` folder (and its subfolders) of your Kedro project whose name contains the word `credentials` is ignored and is not committed to your repository.
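
The naming rule can be illustrated with a short sketch. It assumes the rule is "any file under `conf/` whose file name contains `credentials`", as described above. The helper `looks_like_credentials_file` is hypothetical and only approximates the rule; it does not reproduce the matching that your version control system actually performs.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def looks_like_credentials_file(path):
    """Return True for a file under conf/ whose name contains 'credentials'."""
    parts = PurePosixPath(path).parts
    in_conf = len(parts) > 1 and parts[0] == "conf"
    return in_conf and fnmatchcase(parts[-1], "*credentials*")


# Example paths, relative to the project root (made up for illustration).
for path in [
    "conf/local/credentials.yml",
    "conf/base/credentials_template.yml",
    "conf/base/catalog.yml",
    "src/credentials.py",
]:
    print(path, "->", looks_like_credentials_file(path))
```

Note that a secret stored in a file with a different name, or outside `conf/`, is not covered by this rule and could still be committed by accident.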

Bear this in mind when you start working with a Kedro project that you have cloned from GitHub, for example: you may need to configure the required credentials first.

>**Note**: If you maintain a project, you should document how to configure any required credentials in your project's documentation.

The Kedro documentation lists some [best practices to avoid leaking confidential data](https://kedro.readthedocs.io/en/stable/02_get_started/05_example_project.html#what-best-practice-should-i-follow-to-avoid-leaking-confidential-data).


Example of the dataset using those credentials defined in `conf/base/catalog.yml`:

```yaml
cars:
  type: pandas.CSVDataSet
  filepath: s3://my_bucket/data/02_intermediate/company/cars.csv
  credentials: dev_s3
```
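
To make the link between the two files concrete, here is a minimal sketch of how a catalog entry's `credentials: dev_s3` reference could be swapped for the matching block from `credentials.yml`. The dictionaries stand in for the parsed YAML shown above, and `resolve_credentials` is a hypothetical helper written for this page; it illustrates the idea and is not Kedro's implementation.

```python
import copy

# Stand-ins for the parsed contents of the two YAML files shown above.
credentials = {
    "dev_s3": {
        "client_kwargs": {
            "aws_access_key_id": "key",
            "aws_secret_access_key": "secret",
        }
    }
}
catalog = {
    "cars": {
        "type": "pandas.CSVDataSet",
        "filepath": "s3://my_bucket/data/02_intermediate/company/cars.csv",
        "credentials": "dev_s3",
    }
}


def resolve_credentials(catalog, credentials):
    """Replace each named credentials reference with the actual values."""
    resolved = copy.deepcopy(catalog)  # leave the input untouched
    for name, entry in resolved.items():
        key = entry.get("credentials")
        if isinstance(key, str):
            if key not in credentials:
                raise KeyError(
                    f"Dataset '{name}' refers to unknown credentials '{key}'"
                )
            entry["credentials"] = copy.deepcopy(credentials[key])
    return resolved


resolved = resolve_credentials(catalog, credentials)
print(resolved["cars"]["credentials"]["client_kwargs"]["aws_access_key_id"])  # key
```

Because datasets refer to credentials by name, several catalog entries can point at `dev_s3` while the secret values live in a single, uncommitted file.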

_[Go to the next page](./04_dependencies.md)_
60 changes: 60 additions & 0 deletions training_docs/04_dependencies.md
@@ -0,0 +1,60 @@
# Dependencies

Up to this point, we haven't discussed project dependencies, so now is a good time to introduce them. Specifying a project's dependencies in Kedro makes it easier for others to run your project, and it avoids version conflicts because everyone uses the same versions of the Python packages.

The generic project template bundles some typical dependencies, in `src/requirements.txt`. Here's a typical example, although you may find that the version numbers are slightly different depending on the version of Kedro that you are using:

```text
black==v19.10b0 # Used for formatting code with `kedro lint`
flake8>=3.7.9, <4.0 # Used for linting code with `kedro lint`
ipython==7.0 # Used for an IPython session with `kedro ipython`
isort>=4.3.21, <5.0 # Used for linting code with `kedro lint`
jupyter~=1.0 # Used to open a Kedro-session in Jupyter Notebook & Lab
jupyter_client>=5.1.0, <7.0 # Used to open a Kedro-session in Jupyter Notebook & Lab
jupyterlab==0.31.1 # Used to open a Kedro-session in Jupyter Lab
kedro==0.17.0
nbstripout==0.3.3 # Strips the output of a Jupyter Notebook and writes the outputless version to the original file
pytest-cov~=2.5 # Produces test coverage reports
pytest-mock>=1.7.1,<2.0 # Wrapper around the mock package for easier use with pytest
pytest~=6.1.2 # Testing framework for Python code
wheel==0.32.2 # The reference implementation of the Python wheel packaging standard
```

> Note: If your project has `conda` dependencies, you can create a `src/environment.yml` file and list them there.

### Add and remove project-specific dependencies

The dependencies above may be sufficient for some projects, but the spaceflights project works with CSV and Excel files, so you also need to add the `pandas` dataset requirements. You can add the necessary dependencies for these file types as follows:

```bash
pip install "kedro[pandas.CSVDataSet,pandas.ExcelDataSet]"
```

Alternatively, if you need to, you can edit `src/requirements.txt` directly to modify your list of dependencies by replacing the requirement `kedro==0.17.0` with the following (your version of Kedro may be different):

```text
kedro[pandas.CSVDataSet,pandas.ExcelDataSet]==0.17.0
```

Then run the following:

```bash
kedro build-reqs
```

[`kedro build-reqs`](https://kedro.readthedocs.io/en/stable/09_development/03_commands_reference.html#build-the-project-s-dependency-tree) takes the `requirements.in` file (or `requirements.txt` if `requirements.in` does not yet exist), resolves all package versions and 'freezes' them by writing the pinned versions back into `requirements.txt`. This significantly reduces the chance of dependency issues caused by downstream changes, because you always install the same package versions.
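
To illustrate what "pinned" means here, the sketch below scans requirements text and reports every line that is not fixed to an exact version with `==`. It does not resolve or freeze anything, so it is not a substitute for `kedro build-reqs`; the function `unpinned_requirements` and the sample text are made up for this illustration.

```python
import re

# Matches "name==version", optionally with extras such as "kedro[pandas.CSVDataSet]".
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+(\[[^\]]*\])?==[^=<>~!,\s]+$")


def unpinned_requirements(text):
    """Return the requirement lines that are not pinned to an exact version."""
    loose = []
    for line in text.splitlines():
        requirement = line.split("#", 1)[0].strip()  # drop trailing comments
        if requirement and not PINNED.match(requirement):
            loose.append(requirement)
    return loose


example = """\
black==v19.10b0  # Used for formatting code
flake8>=3.7.9, <4.0
jupyter~=1.0
kedro[pandas.CSVDataSet,pandas.ExcelDataSet]==0.17.0
"""
print(unpinned_requirements(example))
# ['flake8>=3.7.9, <4.0', 'jupyter~=1.0']
```

Range specifiers such as `>=` and `~=` allow the installed version to change over time, which is exactly what freezing the requirements is meant to prevent.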


You can find out more about [how to work with project dependencies](https://kedro.readthedocs.io/en/stable/04_kedro_project_setup/01_dependencies.html) in the Kedro project documentation.

## `kedro install`

To install the project-specific dependencies, navigate to the root directory of the project and run:

```bash
kedro install
```

This command is roughly equivalent to `pip install -r src/requirements.txt`. However, `kedro install` is a bit smarter on Windows when it needs to upgrade the installed version of Kedro itself. It also makes sure that the dependencies are always installed in the same virtual environment as Kedro.
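
One way to picture "the same virtual environment" is to invoke `pip` through the Python interpreter that is currently running, rather than through whichever `pip` happens to be first on the PATH. The sketch below only builds and prints such a command. `pip_install_command` is a hypothetical helper, and this is an illustration of the idea under that assumption, not how `kedro install` is implemented.

```python
import sys


def pip_install_command(requirements_path="src/requirements.txt"):
    """Build a pip command tied to the currently running interpreter."""
    # `sys.executable -m pip` targets the environment this interpreter belongs to.
    return [sys.executable, "-m", "pip", "install", "-r", requirements_path]


command = pip_install_command()
print(" ".join(command))
```

To actually run the command, you could pass the list to `subprocess.run(command, check=True)` from the root directory of the project.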

_[Go to the next page](./05_connect_data_sources.md)_