diff --git a/README.md b/README.md index 1b5f219..b982020 100644 --- a/README.md +++ b/README.md @@ -1,20 +1,11 @@ # Kedro Training -This repository contains training materials that will teach you how to use [Kedro](https://github.com/quantumblacklabs/kedro/). This content is based on the [Spaceflights tutorial](https://kedro.readthedocs.io/en/stable/03_tutorial/02_tutorial_template.html) using Kedro 0.16.5 specified in our documentation. +This repository contains training materials that will teach you how to use [Kedro](https://github.com/quantumblacklabs/kedro/). This content is based on the standard [spaceflights tutorial described in the Kedro documentation](https://kedro.readthedocs.io/en/stable/03_tutorial/01_spaceflights_tutorial.html). -## Scenario +The training documentation was most recently updated against Kedro 0.17.0 in February 2021. -Our project will be based on the following scenario: -> It is 2160 and the space tourism industry is booming. Globally, there are thousands of space shuttle companies taking tourists to the Moon and back. You have been able to source amenities offered in each space shuttle, customer reviews and company information. You want to construct a model for predicting the price for each trip to the Moon and the corresponding return flight. πŸš€ +To get started, navigate to the [training_docs](./training_docs/01_welcome.md) to see what is covered in the training, and how to ensure you get the most out of the time you set aside for it. -## Agenda - -This tutorial covers: - - Project setup - - [Setting up a new Kedro project](docs/04_new_project.md) - - Using the Data Catalog to connect to data sources - - Creating, running and visualising a pipeline - - Advanced functionality in Kedro ## License diff --git a/docs/01_prerequisites.md b/docs/01_prerequisites.md deleted file mode 100644 index 5654f3c..0000000 --- a/docs/01_prerequisites.md +++ /dev/null @@ -1,102 +0,0 @@ -# Training pre-requisites -The Kedro training materials assume a pre-requisite level of technical understanding. - -To optimise your experience and learn the most you can from the Kedro training, please review the following before your training session. We provide external resource links for each topic. 
- -- [Introduction to Python](https://docs.python.org/3/tutorial/) - - Functions, loops, conditional statements and IO operation - - Common data structures including lists, dictionaries and tuples -- Intermediate Python - - [Installing Python packages using `pip`](https://pip.pypa.io/en/stable/quickstart/) - - [Dependency management with `requirements.txt`](https://pip.pypa.io/en/latest/user_guide/#requirements-files) - - [Python modules](https://docs.python.org/3/tutorial/modules.html) (e.g how to use `__init__.py` and relative and absolute imports) - - [Familiarity with Python data science libraries](https://towardsdatascience.com/top-10-python-libraries-for-data-science-cd82294ec266), especially `Pandas` and `scikit-learn` - - An understanding of how to use [Jupyter Notebook/Lab](https://www.dataquest.io/blog/jupyter-notebook-tutorial/) and [iPython](https://www.codecademy.com/articles/how-to-use-ipython)) - - [Using a virtual environment](https://docs.python.org/3/tutorial/venv.html) (we recommend using `conda`, but you can also use `venv` or `pipenv`) -- [Basic YAML syntax](https://yaml.org/) -- [Working with the command line](https://tutorial.djangogirls.org/en/intro_to_command_line/) (also known as cmd, CLI, prompt, console or terminal) - - `cd` to navigate directories - - `ls` to list files and directories - - [Executing a command and Python program from a command line](https://realpython.com/run-python-scripts/#how-to-run-python-scripts-using-the-command-line) - -The following lists software that Kedro integrates with; however, during training these are optional requirements. -- [Version Control with Git](https://git-scm.com/doc) -- Cloud storage ([S3](https://aws.amazon.com/s3/), [Azure Blob](https://azure.microsoft.com/en-gb/services/storage/blobs/) and [GCS](https://cloud.google.com/storage)) -- [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) -- [Docker](https://docs.docker.com/) as a deployment option -- [Airflow](https://airflow.apache.org/docs/stable/tutorial.html) for scheduling pipeline execution - -# Checklist -Please use this checklist to make sure you have everything necessary to participate in the Kedro training. - -- [ ] You have [Python 3 (either 3.6, 3.7 or 3.8)](https://www.python.org/downloads/) installed on your laptop -- [ ] You have Anaconda or an [alternative](https://github.com/quantumblacklabs/kedro-training/blob/master/docs/02_virtual-environment.md) virtual environment manager -- [ ] You have [a code editor](#text-editors-and-ides) installed for writing Python code -- [ ] You have a [command line](#command-line) installed - -Having completed the above checklist, make sure that you are able to execute the following commands in your Terminal: -- [ ] `python --version` or `python3 --version` returns a correct Python version (either 3.6, 3.7 or 3.8). -- [ ] Download the [Kedro training repository](https://github.com/quantumblacklabs/kedro-training) by following [these instructions](https://stackoverflow.com/questions/2751227/how-to-download-source-in-zip-format-from-github). - -- [ ] `conda create --name=kedro-environment python=3.6 -y && conda activate kedro-environment` creates an virtual environment called `kedro-environment` and activates the environment (you can find how to do this with `venv` or `pipenv` in [here](https://github.com/quantumblacklabs/kedro-training/blob/master/docs/02_virtual-environment.md)). - -- [ ] `pip install kedro` installs Kedro in your conda environment. 
- -- [ ] `kedro --version` shows [the latest Kedro version](https://pypi.org/project/kedro/). - - -If you are able to complete all of the above, you are ready for the training! In case you have any problems or questions in any of the above checklist, please contact an instructor and resolve the issues before the training. - -# Installation prerequisites - -Kedro supports macOS, Linux and Windows (7 / 8 / 10 and Windows Server 2016+). If you encounter any problems on these platforms please engage Kedro community support on [Stack Overflow](https://stackoverflow.com/questions/tagged/kedro). - -## Python - -Kedro supports Python 3.6, 3.7 and 3.8. We recommend using [Anaconda](https://www.anaconda.com/download) (Python 3.7 version) to install Python packages. However, if Anaconda is not preferred then this [tutorial](https://realpython.com/installing-python/) can help you with installation and setup. - -## Command line -The command line works as a text-based application for viewing, navigating and manipulating files on your computer. You can find more information for each OS platform: -- [Terminal in macOS](https://support.apple.com/en-gb/guide/terminal/welcome/mac) -- [Windows command line](https://www.computerhope.com/issues/chusedos.htm) - -### Text editors and IDEs -Text editors are tools with powerful features designed to optimize writing code. There are many text editors that you can choose from. Here are some we recommend: - -- [PyCharm](https://www.jetbrains.com/pycharm/download/) -- [VS Code](https://code.visualstudio.com/) -- [Atom](https://atom.io/) - -## Optional tools - -### `PySpark` - -[Java 8](https://www.oracle.com/technetwork/java/javase/downloads/index.html) will need to be installed if `PySpark` is a workflow requirement. - -> _Note:_ Windows users will require admin rights to install [Anaconda](https://www.anaconda.com/download) and [Java](https://www.oracle.com/technetwork/java/javase/downloads/index.html). - - -### Git -Git is a version control software that records changes to a file or set of files. Git is especially helpful for software developers as it allows changes to be tracked (including who and when) when working on a project. - -To download `git`, go to the following link and choose the correct version for your operating system: https://git-scm.com/downloads. - -#### Installing Git on Windows -Download the [`git` for Windows installer](https://gitforwindows.org/). Make sure to select **Use `git` from the Windows command prompt** this will ensure that `git` is permanently added to your PATH. - -Also select **Checkout Windows-style, commit Unix-style line endings** selected and click on **Next**. - -This will provide you both `git` and `git bash`. We might have a few exercises using the command line quite a lot during the workshop so using `git bash` is a good option. - -### GitHub -GitHub is a web-based service for version control using Git. You will need to set up an account at: https://github.com. - -Basic GitHub accounts are free and you can now also have private repositories. - -### Docker -Docker is a tool that makes it easier to create, deploy and run applications. It uses containers to package an application along with its dependencies and then runs the application in an isolated virtualised environment. - -If Docker is a tool that you use internally then make sure to carefully read the prerequisites and instructions here: https://docs.docker.com/install/. 
- -### Next section -[Go to the next section](./02_virtual-environment.md) diff --git a/docs/02_virtual-environment.md b/docs/02_virtual-environment.md deleted file mode 100644 index 6b7da40..0000000 --- a/docs/02_virtual-environment.md +++ /dev/null @@ -1,84 +0,0 @@ -# Working with virtual environments - -> The main purpose of Python virtual environments is to create an isolated environment for Python projects. This means that each project can have its own dependencies, regardless of what dependencies every other project has. Read more about Python Virtual Environments [**here**](https://realpython.com/python-virtual-environments-a-primer/). - -Follow the instructions that best suit your Python installation preference. Virtual environment setups for `conda`, `venv` and `pipenv` are presented here: - - `conda` used with an Anaconda (Python 3.7 version) installation - - `venv` or `pipenv` used when Anaconda is not preferred - -## Anaconda - -Let us create a new Python virtual environment using `conda`: - -```bash -conda create --name kedro-environment python=3.7 -y -``` - -This will create an isolated environment with Python 3.7. - -To activate it, run: - -```bash -conda activate kedro-environment -``` - -To exit the environment you can run: - -```bash -deactivate kedro-environment -``` - -### `venv` - -If you are using Python 3.0+, then you should already have the `venv` module from the standard library installed. However, for completeness you can install `venv` with: - -```bash -pip install virtualenv -``` - -Create a directory for your virtual environment with: - -```bash -mkdir kedro-environment && cd kedro-environment -``` - -This will create a `kedro-environment` directory for your `virtualenv` in your current working directory. - -Create a new virtual environment in this directory by running: - -```bash -python -m venv env/kedro-environment # macOS / Linux -python -m venv env\kedro-environment # Windows -``` - -We can activate this virtual environment with: - -```bash -source env/bin/activate # macOS / Linux -.\env\Scripts\activate # Windows -``` - -To exit the environment you can run: - -```bash -deactivate -``` - -### `pipenv` - -You will need to install `pipenv` with: - -```bash -pip install pipenv -``` - -Then create a directory for the virtual environment and change to this working directory with: - -```bash -mkdir kedro-environment && cd kedro-environment -``` - -Once all the dependencies are installed you can run `pipenv shell` which will start a session with the correct virtual environment activated. To exit the shell session using `exit`. - -### Next section -[Go to the next section](./03_install_kedro.md) diff --git a/docs/03_install_kedro.md b/docs/03_install_kedro.md deleted file mode 100644 index ae168fd..0000000 --- a/docs/03_install_kedro.md +++ /dev/null @@ -1,30 +0,0 @@ -# Install Kedro - -Your installation instructions will be virtual environment dependent. 
- -## `venv` - -Install Kedro using `pip`: - -```bash -pip install kedro -``` - -## Using `conda` - -Install Kedro using `conda` from the `conda-forge` channel: - -```bash -conda install -c conda-forge kedro -``` - -## `pipenv` - -Install Kedro using `pipenv`: - -```bash -pipenv install kedro -``` - -### Next section -[Go to the next section](./04_new_project.md) diff --git a/docs/04_new_project.md b/docs/04_new_project.md deleted file mode 100644 index 2ee9abc..0000000 --- a/docs/04_new_project.md +++ /dev/null @@ -1,77 +0,0 @@ -# Create a new project - -Create a new project in your current working directory: - -```bash -kedro new -``` - -This will ask you to specify: -1. Project name - you can call it `Kedro Training` -2. Repository name - accept the default by pressing the `Enter` key -3. Python package name - accept the default by pressing the `Enter` key - -Change your working directory so that you are in your newly created project folder with: - -```bash -cd kedro-training -``` - -## Project structure - -The [project structure](https://kedro.readthedocs.io/en/stable/02_get_started/05_example_project.html#project-directory-structure) is explained in the Kedro documentation. - -## Running Kedro commands - -The list and the behaviour of Kedro CLI commands may vary depending of the working directory where Kedro command is executed. Kedro has 2 command types: - -* global commands (e.g., `kedro new`, `kedro info`) which work regardless of the current working directory -* local or project-specific commands (e.g., `kedro run`, `kedro install`) that require the current working directory to be the root of your Kedro project - -To see the full list of available commands, you can always run `kedro --help`. - -### `kedro install` - -This command allows you to easily install or update all your project third-party Python package dependencies. This is roughly equivalent to `pip install -r src/requirements.txt`, however `kedro install` is a bit smarter on Windows when it needs to upgrade its version. It also makes sure that the dependencies are always installed in the same virtual environment as Kedro. - -One more very useful command is `kedro build-reqs`, which takes `requirements.in` file (or `requirements.txt` if the first one does not exist), resolves all package versions and 'freezes' them by putting pinned versions back into `requirements.txt`. It significantly reduces the chances of dependencies issues due to downstream changes as you would always install the same package versions. - -#### Example - -Let's install and try the [Kedro Viz](https://github.com/quantumblacklabs/kedro-viz) - the plugin that helps a lot visualising your Kedro pipelines. You can do this by running the following commands from the terminal: - -```bash -echo "kedro-viz>=3.0" >> src/requirements.txt # src\requirements.txt on Windows -kedro build-reqs # creates src/requirements.in and pins package versions in src/requirements.txt -kedro install # installs packages from src/requirements.txt -kedro viz # start Kedro Viz server -``` - -## Credentials management - -For security reasons, we strongly recommend not committing any credentials or other secrets to the Version Control System. By default any file inside the `conf` folder (and subfolders) in your Kedro project containing `credentials` word in its name will be gitignored and not committed to your repository. 
- -Please bear it in mind when you start working with Kedro project that you have cloned from GitHub, for example, as you might need to configure required credentials first. If you are a project maintainer, you should document it in project prerequisites. - -On run Kedro automatically reads the credentials from the `conf` folder and feeds them into the DataCatalog - a Kedro component responsible for loading and saving of the data that comes to and out of pipeline nodes. Shortly, you just need to configure your credentials once and then you can reuse them in multiple datasets. - -Example of `conf/local/credentials.yml`: - -```yaml -dev_s3: - client_kwargs: - aws_access_key_id: key - aws_secret_access_key: secret -``` - -Example of the dataset using those credentials defined in `conf/base/catalog.yml`: - -```yaml -cars: - type: pandas.CSVDataSet - filepath: s3://my_bucket/data/02_intermediate/company/cars.csv - credentials: dev_s3 -``` - -### Next section -[Go to the next section](./05_connecting-data-sources.md) diff --git a/docs/05_connecting-data-sources.md b/docs/05_connecting-data-sources.md deleted file mode 100644 index 9ace07a..0000000 --- a/docs/05_connecting-data-sources.md +++ /dev/null @@ -1,115 +0,0 @@ -## Adding your datasets to `data` - -Before you start a Kedro project, you need to prepare the datasets. This tutorial will make use of fictional datasets for spaceflight companies shuttling customers to the Moon and back. You will use the data to train a model to predict the price of shuttle hire. - -The spaceflight tutorial has three files and uses two data formats: `.csv` and `.xlsx`. Download and save the files to the `data/01_raw/` folder of your project directory: - -* [reviews.csv](https://quantumblacklabs.github.io/kedro/reviews.csv) -* [companies.csv](https://quantumblacklabs.github.io/kedro/companies.csv) -* [shuttles.xlsx](https://quantumblacklabs.github.io/kedro/shuttles.xlsx) - -Here is an example of how you can [download the files from GitHub](https://www.quora.com/How-do-I-download-something-from-GitHub) to `data/01_raw` directory inside your project using [cURL](https://curl.haxx.se/download.html) in a Unix terminal: - -
-CLICK TO EXPAND - -```bash -# reviews -curl -o data/01_raw/reviews.csv https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/reviews.csv -# companies -curl -o data/01_raw/companies.csv https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/companies.csv -# shuttles -curl -o data/01_raw/shuttles.xlsx https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/shuttles.xlsx -``` -
- -Or through using [Wget](https://www.gnu.org/software/wget/): - -
-CLICK TO EXPAND - -```bash -# reviews -wget -O data/01_raw/reviews.csv https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/reviews.csv -# companies -wget -O data/01_raw/companies.csv https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/companies.csv -# shuttles -wget -O data/01_raw/shuttles.xlsx https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/shuttles.xlsx -``` -
- -Alternatively, if you are a Windows user, try [Wget for Windows](https://eternallybored.org/misc/wget/) and the following commands instead: - -
-CLICK TO EXPAND - -```bat -# reviews -wget -O data\01_raw\reviews.csv https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/reviews.csv -# companies -wget -O data\01_raw\companies.csv https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/companies.csv -# shuttles -wget -O data\01_raw\shuttles.xlsx https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/shuttles.xlsx -``` -
- -or [cURL for Windows](https://curl.haxx.se/windows/): - -
-CLICK TO EXPAND - -```bat -# reviews -curl -o data\01_raw\reviews.csv https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/reviews.csv -# companies -curl -o data\01_raw\companies.csv https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/companies.csv -# shuttles -curl -o data\01_raw\shuttles.xlsx https://raw.githubusercontent.com/quantumblacklabs/kedro-examples/master/kedro-tutorial/data/01_raw/shuttles.xlsx -``` -
- -## Using the Data Catalog with `catalog.yml` - -To work with the datasets provided you need to make sure they can be loaded by Kedro. - -All Kedro projects have a `conf/base/catalog.yml` file where users register the datasets they use. Registering a dataset is as simple as adding a named entry into the `.yml` file, which includes: - -* File location (path) -* Type of data -* Versioning (optional) -* Any additional arguments (optional) - -Kedro supports a number of different data types, such as `csv`, which is implemented by `pandas.CSVDataSet`. You can find all supported datasets in the API docs for [datasets](https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.html). - -Let’s start this process by registering the `csv` datasets by copying the following to the end of the `conf/base/catalog.yml` file: - -```yaml -companies: - type: pandas.CSVDataSet - filepath: data/01_raw/companies.csv - -reviews: - type: pandas.CSVDataSet - filepath: data/01_raw/reviews.csv - -shuttles: - type: pandas.ExcelDataSet - filepath: data/01_raw/shuttles.xlsx -``` - -If you want to check whether Kedro loads the data correctly, open a `kedro ipython` session and run: - -```python -catalog.load('companies').head() -``` - -This will load the dataset named `companies` (as per top-level key in `catalog.yml`), from the underlying filepath `data/01_raw/companies.csv`, and show you the first five rows of the dataset. It is loaded into a `pandas` DataFrame and you can play with it as you wish. - -When you have finished, simply close `ipython` session by typing the following: - -```python -exit() -``` - -### Next section -[Go to the next section](./06_jupyter-notebook-workflow.md) diff --git a/docs/06_jupyter-notebook-workflow.md b/docs/06_jupyter-notebook-workflow.md deleted file mode 100644 index 2f5aa95..0000000 --- a/docs/06_jupyter-notebook-workflow.md +++ /dev/null @@ -1,142 +0,0 @@ -# Jupyter Notebook Workflow - -In order to experiment with the code interactively, you may want to use a Python kernel inside a Jupyter Notebook. To start, run this in your terminal from the root of your Kedro project: - -```bash -kedro jupyter notebook -``` - -This will start a Jupyter server and navigate you to `http://127.0.0.1:8888/tree` in your default browser. - -> Note: If you want Jupyter to listen to a different port number, then run `kedro jupyter notebook --port ` - -## Startup script - -Every time you start/restart a Jupyter or IPython session using Kedro command, a startup script in `.ipython/profile_default/startup/00-kedro-init.py` is being executed. It adds the following variables in scope: - -* `context` (`KedroContext`) - Kedro project context which holds the configuration -* `catalog` (`DataCatalog`) - Data catalog instance which contains all defined datasets; this is a shortcut for `context.catalog` -* `startup_error` (`Exception`) - An error that was raised during the execution of the startup script or `None` if no errors occurred - -To reload these at any point use the line magic `%reload_kedro`. This magic can also be used to see the error message if any of the variables above are undefined. - -![](../img/reload_kedro.png) - -## What if I cannot run `kedro jupyter notebook`? - -In certain cases, you may not be able to run `kedro jupyter notebook` and have to work in a standard Jupyter session. An example of this may be because you don't have a CLI access to the machine where the Jupyter server is running. 
In that case, you can create a `context` variable yourself by running the following block of code at the top of your notebook: - -```python -from pathlib import Path -from kedro.context import load_context - -current_dir = Path.cwd() # this points to 'notebooks/' folder -proj_path = current_dir.parent # point back to the root of the project -context = load_context(proj_path) -``` - -## Using `context` variable - -![](../img/context.png) - -As mentioned earlier in the project overview section, `KedroContext` represents the main application entry point, so having `context` variable available in Jupyter Notebook gives a lot of flexibility in interaction with your project components. - -### Loading a dataset - -You can load a dataset defined in your `conf/base/catalog.yml`, by simply executing the following: - -```python -df = catalog.load("companies") -df.head() -``` - -![](../img/context_catalog_load.png) - -### Saving a data - -Saving operation in the example below is analogous to the load. - -Let's put the following dataset entry in `conf/base/catalog.yml`: - -```yaml -my_dataset: - type: pandas.JSONDataSet - filepath: data/01_raw/my_dataset.json -``` - -Next, you need to reload Kedro variables by calling `%reload_kedro` line magic in your Jupyter Notebook. - -Finally, you can save the data by executing the following command: - -```python -import pandas -my_dict = {"key1": "some_value", "key2": None} -df = pandas.DataFrame([my_dict]) -context.catalog.save("my_dataset", df) -``` - -### Using parameters - -`context` object also exposes `params` property, which allows you to easily access all project parameters: - -```python -parameters = context.params # type: Dict -parameters["test_size"] # returns the value of 'test_size' key from 'parameters.yml' -``` - -> Note: You need to reload Kedro variables by calling `%reload_kedro` and re-run the code snippet from above if you change the contents of `parameters.yml`. - -### Running the pipeline - -As already mentioned, `KedroContext` represents the main application entry point, which in practice means that you can use `context` object to run your Kedro project pipelines. - -If you wish to run the whole 'master' pipeline within a notebook cell, you can do it by just calling - -```python -context.run() -``` - -which will run all the nodes from your default project pipeline in a sequential manner. - -If you, however, want to parameterize the run, you can also specify the following optional arguments for `context.run()`: - -| Argument name | Accepted types | Description | -| :-------------: | :--------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `tags` | `Iterable[str]` | Construct the pipeline using only nodes which have this tag attached. 
A node is included in the resulting pipeline if it contains _any_ of those tags | -| `runner` | `AbstractRunner` | An instance of Kedro [AbstractRunner](https://kedro.readthedocs.io/en/stable/kedro.runner.AbstractRunner.html); for example, can be an instance of a [ParallelRunner](https://kedro.readthedocs.io/en/stable/kedro.runner.ParallelRunner.html) | -| `node_names` | `Iterable[str]` | Run only nodes with specified names | -| `from_nodes` | `Iterable[str]` | A list of node names which should be used as a starting point | -| `to_nodes` | `Iterable[str]` | A list of node names which should be used as an end point | -| `from_inputs` | `Iterable[str]` | A list of dataset names which should be used as a starting point | -| `load_versions` | `Dict[str, str]` | A mapping of a dataset name to a specific dataset version (timestamp) for loading - this applies to the versioned datasets only | -| `pipeline_name` | `str` | Name of the modular pipeline to run - must be one of those returned by `create_pipelines` function from `src//pipeline.py` | - -This list of options is fully compatible with the list of CLI options for `kedro run` command. In fact, `kedro run` is calling `context.run()` behind the scenes. - -### Converting functions from Jupyter Notebooks into Kedro nodes - -Another useful built-in feature in Kedro Jupyter workflow is the ability to convert multiple functions defined in the Jupyter Notebook(s) into Kedro nodes using a single CLI command. - -Here is how it works: - -* start a Jupyter server, if you haven't done so already, by running `kedro jupyter notebook` -* create a new notebook and paste the following code into the first cell: - -```python -def convert_me(): - print("This function came from `notebooks/my_notebook.ipynb`") -``` - -* enable tags toolbar: `View` menu -> `Cell Toolbar` -> `Tags` -![](../img/enable_tags.png) -* add the `node` tag to the cell containing your function -![](../img/tag_nb_cell.png) -> Tip: The notebook can contain multiple functions tagged as `node`, each of them will be exported into the resulting Python file - -* save your Jupyter Notebook to `notebooks/my_notebook.ipynb` -* run from your terminal: `kedro jupyter convert notebooks/my_notebook.ipynb` - this will create a Python file `src//nodes/my_notebook.py` containing `convert_me` function definition -> Tip: You can also convert all your notebooks at once by calling `kedro jupyter convert --all` -* now `convert_me` function can be used in your Kedro pipelines - -### Next section -[Go to the next section](./07_pipelines.md) diff --git a/docs/08_transformers.md b/docs/08_transformers.md deleted file mode 100644 index a5d40ee..0000000 --- a/docs/08_transformers.md +++ /dev/null @@ -1,180 +0,0 @@ -# Kedro transformers - -[Transformers](https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html#transforming-datasets) intercept the load and save operations on Kedro `DataSet`s. Some use cases that transformers enable include: performing data validation, tracking operation performance and converting a data format (although we would recommend [Transcoding](https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html#transcoding-datasets) for this). We will cover _tracking operation performance_ with the following: -1. Applying built-in transformers for monitoring load and save operation latency -2. Developing our own transformer for tracking memory consumption - -## Applying built-in transformers - -Transformers are applied at the `DataCatalog` level. 
To apply the built-in `ProfileTimeTransformer`, you need to: -1. Navigate to `src/kedro_training/run.py` -2. Override `_create_catalog` method for your `ProjectContext` class using the following: - -```python -from typing import Dict, Any - -from kedro.context import KedroContext -from kedro.extras.transformers import ProfileTimeTransformer # new import -from kedro.io import DataCatalog -from kedro.versioning import Journal - - -class ProjectContext(KedroContext): - - ... - - def _create_catalog( - self, - conf_catalog: Dict[str, Any], - conf_creds: Dict[str, Any], - save_version: str = None, - journal: Journal = None, - load_versions: Dict[str, str] = None, - ) -> DataCatalog: - catalog = DataCatalog.from_config( - conf_catalog, - conf_creds, - save_version=save_version, - journal=journal, - load_versions=load_versions, - ) - profile_time = ProfileTimeTransformer() # instantiate a build-in transformer - catalog.add_transformer(profile_time) # apply it to the catalog - return catalog -``` - -Once complete, rerun the pipeline from the terminal and you should see the following logging output: - -```bash -$ kedro run - -... -2019-11-13 15:09:01,784 - kedro.io.data_catalog - INFO - Loading data from `companies` (CSVDataSet)... -2019-11-13 15:09:01,827 - ProfileTimeTransformer - INFO - Loading companies took 0.043 seconds -2019-11-13 15:09:01,828 - kedro.pipeline.node - INFO - Running node: preprocess1: preprocess_companies([companies]) -> [preprocessed_companies] -2019-11-13 15:09:01,880 - kedro_tutorial.nodes.data_engineering - INFO - Running 'preprocess_companies' took 0.05 seconds -2019-11-13 15:09:01,880 - kedro_tutorial.nodes.data_engineering - INFO - Running 'preprocess_companies' took 0.05 seconds -2019-11-13 15:09:01,880 - kedro.io.data_catalog - INFO - Saving data to `preprocessed_companies` (CSVDataSet)... -2019-11-13 15:09:02,112 - ProfileTimeTransformer - INFO - Saving preprocessed_companies took 0.232 seconds -2019-11-13 15:09:02,113 - kedro.runner.sequential_runner - INFO - Completed 1 out of 6 tasks -... -``` - -You can notice 2 new `INFO` level log messages from `ProfileTimeTransformer`, which report the corresponding dataset load and save operation latency. - -> Pro Tip: You can narrow down the application of the transformer by specifying an optional list of the datasets in `add_transformer`. For example, the command `catalog.add_transformer(profile_time, ["dataset1", "dataset2"])` will apply `profile_time` transformer _only_ to the datasets named `dataset1` and `dataset2`. This may be useful when you need to apply a transformer only to a subset of datasets, rather than all of them. - -## Developing your own transformer - -Let's create our own transformer using [memory-profiler](https://github.com/pythonprofilers/memory_profiler). Custom transformer should: -1. Inherit the `kedro.io.AbstractTransformer` base class -2. 
Implement the `load` and `save` method (as show in the example below) - -Now please create `src/kedro_training/memory_profile.py` and then paste the following code into it: - -```python -import logging -from typing import Callable, Any - -from kedro.io import AbstractTransformer -from memory_profiler import memory_usage - - -def _normalise_mem_usage(mem_usage): - # memory_profiler < 0.56.0 returns list instead of float - return mem_usage[0] if isinstance(mem_usage, (list, tuple)) else mem_usage - - -class ProfileMemoryTransformer(AbstractTransformer): - """ A transformer that logs the maximum memory consumption during load and save calls """ - - @property - def _logger(self): - return logging.getLogger(self.__class__.__name__) - - def load(self, data_set_name: str, load: Callable[[], Any]) -> Any: - mem_usage, data = memory_usage( - (load, [], {}), - interval=0.1, - max_usage=True, - retval=True, - include_children=True, - ) - # memory_profiler < 0.56.0 returns list instead of float - mem_usage = _normalise_mem_usage(mem_usage) - - self._logger.info( - "Loading %s consumed %2.2fMiB memory at peak time", data_set_name, mem_usage - ) - return data - - def save(self, data_set_name: str, save: Callable[[Any], None], data: Any) -> None: - mem_usage = memory_usage( - (save, [data], {}), - interval=0.1, - max_usage=True, - retval=False, - include_children=True, - ) - mem_usage = _normalise_mem_usage(mem_usage) - - self._logger.info( - "Saving %s consumed %2.2fMiB memory at peak time", data_set_name, mem_usage - ) -``` - -Finally, you need to update `ProjectContext._create_catalog` method definition to apply your custom transformer: - -```python - -... -from .memory_profile import ProfileMemoryTransformer # new import - - -class ProjectContext(KedroContext): - - ... - - def _create_catalog( - self, - conf_catalog: Dict[str, Any], - conf_creds: Dict[str, Any], - save_version: str = None, - journal: Journal = None, - load_versions: Dict[str, str] = None, - ) -> DataCatalog: - catalog = DataCatalog.from_config( - conf_catalog, - conf_creds, - save_version=save_version, - journal=journal, - load_versions=load_versions, - ) - profile_time = ProfileTimeTransformer() - catalog.add_transformer(profile_time) - - profile_memory = ProfileMemoryTransformer() # instantiate our custom transformer - # as memory tracking is quite time-consuming, for the demonstration purposes - # let's apply profile_memory only to the master_table - catalog.add_transformer(profile_memory, "master_table") - return catalog -``` - -And rerun the pipeline: - -```bash -$ kedro run - -... -2019-11-13 15:55:01,674 - kedro.io.data_catalog - INFO - Saving data to `master_table` (CSVDataSet)... -2019-11-13 15:55:12,322 - ProfileMemoryTransformer - INFO - Saving master_table consumed 606.98MiB memory at peak time -2019-11-13 15:55:12,322 - ProfileTimeTransformer - INFO - Saving master_table took 10.648 seconds -2019-11-13 15:55:12,357 - kedro.runner.sequential_runner - INFO - Completed 3 out of 6 tasks -2019-11-13 15:55:12,358 - kedro.io.data_catalog - INFO - Loading data from `master_table` (CSVDataSet)... -2019-11-13 15:55:13,933 - ProfileMemoryTransformer - INFO - Loading master_table consumed 533.05MiB memory at peak time -2019-11-13 15:55:13,933 - ProfileTimeTransformer - INFO - Loading master_table took 1.576 seconds -... 
-``` - -### Next section -[Go to the next section](./09_versioning.md) diff --git a/docs/09_versioning.md b/docs/09_versioning.md deleted file mode 100644 index 992008d..0000000 --- a/docs/09_versioning.md +++ /dev/null @@ -1,111 +0,0 @@ - -# Versioning - -## Data versioning -Making a simple addition to your Data Catalog allows you to perform versioning of datasets and machine learning models. - -Suppose you want to version `master_table`. To enable versioning, simply add a `versioned` entry in `catalog.yml` as follows: - -```yaml -master_table: - type: pandas.CSVDataSet - filepath: data/03_primary/master_table.csv - versioned: true -``` - -The `DataCatalog` will create a versioned `CSVDataSet` called `master_table`. The actual csv file location will look like `data/03_primary/master_table.csv//master_table.csv`, where the first `/master_table.csv/` is a directory and `` corresponds to a global save version string formatted as `YYYY-MM-DDThh.mm.ss.sssZ`. - -With the similar way, you can version your machine learning model. Enable versioning for `regressor` as follow: - -```yaml -regressor: - type: pickle.PickleDataSet - filepath: data/06_models/regressor.pickle - versioned: true -``` - -This will save versioned pickle models everytime you run the pipeline. - -> *Note:* The list of the datasets supporting versioning can be find in [the documentation](https://kedro.readthedocs.io/en/stable/05_data/02_kedro_io.html#supported-datasets). - -## Loading a versioned dataset -By default, the `DataCatalog` will load the latest version of the dataset. However, you can run the pipeline with a particular versioned data set with `--load-version` flag as follows: - -```bash -kedro run --load-version="master_table:YYYY-MM-DDThh.mm.ss.sssZ" -``` -where `--load-version` contains a dataset name and a version timestamp separated by `:`. - - -## Journal (code versioning) - -Journal in Kedro allows you to save the history of pipeline runs. This functionality helps you reproduce results and gives you an ability to investigate failures in your workflow. -Each pipeline run creates a log file formatted as `journal_YYYY-MM-DDThh.mm.ss.sssZ.log`, which is saved in the `logs/journals` directory. The log file contains two kinds of journal records. - -### Context journal record - -A context journal record captures all the necessary information to reproduce the pipeline run and has the following JSON format: - -```json -{ - "type": "ContextJournalRecord", - "run_id": "2019-10-01T09.15.57.289Z", - "project_path": "/src/kedro-tutorial", - "env": "local", - "kedro_version": "0.15.4", - "tags": [], - "from_nodes": [], - "to_nodes": [], - "node_names": [], - "from_inputs": [], - "load_versions": {}, - "pipeline_name": null, - "git_sha": "48dd0d3" -} -``` - -You will observe `run_id`, a unique timestamp used to identify a pipeline run, in the context journal record, as well as a `git_sha`, that corresponds to the current git commit hash when your project is tracked by `git`. If your project is not tracked by `git`, then the `git_sha` will be `null`, and you'll see a warning message in your `logs/info.log` as follows: - -```bash -2019-10-01 10:31:13,352 - kedro.versioning.journal - WARNING - Unable to git describe //src/kedro-tutorial -``` - -### Dataset journal record - -A dataset journal record tracks versioned dataset `load` and `save` operations, it is tied to the dataset name and `run_id`. The `version` attribute stores the exact timestamp used by the `load` or `save` operation. 
Dataset journal currently records `load` and `save` operations only for the datasets with enabled versioning. - -The dataset journal record has the following JSON format: - -```json -{ - "type": "DatasetJournalRecord", - "run_id": "2019-10-01T09.15.57.289Z", - "name": "example_train_x", - "operation": "load", - "version": "2019-10-01T09.15.57.289Z" -} -``` - -> ❗While the context journal record is always logged at every run time of your pipeline, dataset journal record is only logged when `load` or `save` method is invoked for versioned dataset in `DataCatalog`. - -## Steps to manually reproduce your code and run the previous pipeline - -Journals must be persisted to manually reproduce your specific pipeline run. You can keep journals corresponding to checkpoints in your development workflow in your source control repo. Once you have found a version you would like to revert to, follow the below steps: - -1. Checkout a commit from `git_sha` in the context journal record by running the following `git` command in your terminal: -```bash -git checkout -``` -> *Note:* If you want to go back to the latest commit in the current branch, you can run `git checkout `. - -2. Verify that the installed Kedro version is the same as the `project_version` in `src//run.py` by running `kedro --version`. - - If the installed Kedro version does not match the `project_version`, verify that there are no changes that affect your project between the different Kedro versions by looking at [`RELEASE.md`](https://github.com/quantumblacklabs/kedro/blob/master/RELEASE.md), then update the Kedro version by pinning the `kedro==project_version` in `requirements.txt` and run `kedro install` in your terminal. - -3. Run the pipeline with the corresponding versioned datasets' load versions fixed. Open the corresponding journal log file found in `logs/journals`, find dataset journal record, list all the dataset load versions and run the following command in your terminal: -```bash -kedro run --load-version="dataset1:YYYY-MM-DDThh.mm.ss.sssZ" --load-version="dataset2:YYYY-MM-DDThh.mm.ss.sssZ" -``` -where `--load-version` should contain a dataset name and load version timestamp separated by `:`. - -### Next section -[Go to the next section](./10_package-project.md) diff --git a/docs/10_package-project.md b/docs/10_package-project.md deleted file mode 100644 index b8ab18a..0000000 --- a/docs/10_package-project.md +++ /dev/null @@ -1,32 +0,0 @@ -# Distributing a project - -In this part of the training, you will learn how to distribute your data project. - -## Add documentation to your project - -While Kedro documentation can be found by running `kedro docs` from the command line, project-specific documentation can be generated by running `kedro build-docs` in the project's root directory. - -This will create documentation based on the code structure of your project. Documentation will also include the [`docstrings`](https://www.datacamp.com/community/tutorials/docstrings-python) defined in the project code. The resulting HTML files can be found in `docs/build/html/`. - -`kedro build-docs` uses the [Sphinx](https://www.sphinx-doc.org) framework to build your project documentation, so if you want to customise it, please refer to `docs/source/conf.py` and the [corresponding section](http://www.sphinx-doc.org/en/master/usage/configuration.html) of the Sphinx documentation. - ->Note: If you would like to open your documentation from the CLI then you need to run `kedro build-docs --open`. 
- -## Package your project - -Once a project is ready, most people just use the Python package without further editing it. In this case, the package can be conveniently installed in a single line of code. Kedro can automatically generate egg (`.egg`) and wheel (`.whl`) files for your package. - -To do this, run `kedro package` from the command line, which will create the wheel and egg artefacts and save them to `src/dist/`. For further information about packaging for Python, documentation is provided [here](https://packaging.python.org/overview/). - -## Manage dependencies - -Ensuring that you have accounted for all Python package versions that your project relies on encourages reproducibility of your Kedro project. Use the `kedro build-reqs` CLI command to pin package versions. It works by taking a `requirements.in` file (or `requirements.txt` if the first one does not exist), resolving all package versions using [pip compile](https://github.com/jazzband/pip-tools#example-usage-for-pip-compile) and _freezing_ them by putting pinned versions back into `requirements.txt`. It significantly reduces the chances of dependencies issues due to downstream changes as you would always install the same package versions using `kedro install`. - -## Extend your project - -- You can also check out [Kedro-Docker](https://github.com/quantumblacklabs/kedro-docker), an officially supported Kedro plugin for packaging and shipping Kedro projects within [Docker](https://www.docker.com/) containers. - -- We also support converting your Kedro project into an Airflow project with the [Kedro-Airflow](https://github.com/quantumblacklabs/kedro-airflow) plugin. - -### Next section -[Go to the next section](./11_configuration.md) diff --git a/docs/11_configuration.md b/docs/11_configuration.md deleted file mode 100644 index 5b62ddc..0000000 --- a/docs/11_configuration.md +++ /dev/null @@ -1,128 +0,0 @@ -# Configuration - -We recommend that you keep all configuration files in the `conf/` directory of a Kedro project. However, if you prefer, you may point Kedro to any other directory and change the configuration paths by overriding `CONF_ROOT` variable from the derived ProjectContext class in `src/kedro_training/run.py` as follows: - -```python -class ProjectContext(KedroContext): - CONF_ROOT = "new_conf_root" - - ... -``` - -## Loading configuration - -Kedro ships a purpose-built `ConfigLoader` class that helps you load configuration from various file formats including: YAML, JSON, INI, Pickle, XML and more. - -When searching for the configs, `ConfigLoader` does that in the specified config environments, `base` and `local` by default, which represent the directories inside your root config directory. - -Here is an example of how to load configuration for your `DataCatalog`: - -```python -from kedro.config import ConfigLoader - -conf_envs = ["conf/base", "conf/local"] # ConfigLoader will search for configs in these directories -conf_loader = ConfigLoader(conf_envs) -conf_catalog = conf_loader.get("catalog*", "catalog*/**") # returns a dictionary -``` - -`ConfigLoader` will recursively scan for configuration files firstly in `conf/base/` and then in `conf/local/` directory according to the following rules: -1. The filename starts with `catalog` OR the file is located in a sub-directory which has a name that is prefixed with `catalog` -2. 
AND the file extension is one of the following: `yaml`, `yml`, `json`, `ini`, `pickle`, `xml`, `properties` or `shellvars` - -Configuration data from the files that match these rules will be merged at runtime and returned in the form of a Python dictionary. - -> Note: Any top-level keys that start with `_` character are considered hidden (or reserved) and therefore are ignored right after the config load. Those keys will neither trigger a key duplication error mentioned above, nor will they appear in the resulting configuration dictionary. However, you may still use such keys for various purposes. For example, as [YAML anchors and aliases](https://confluence.atlassian.com/bitbucket/yaml-anchors-960154027.html) - -* If any 2 different config files located inside the _same_ environment (`base` or `local` here) contain the same top-level key, load_config will raise a `ValueError` indicating that the duplicates are not allowed. -* If 2 different config files have duplicate top-level keys, but are located in _different_ environments (one in `base`, another in `local`, for example) then the last loaded path (`local` in this case) takes precedence and _overrides_ that key value. No errors are raised in this case, however a DEBUG level log message will be emitted with the information on the over-ridden keys. -* If the same environment path is passed multiple times, a `UserWarning` will be emitted to draw attention to the duplicate loading attempt, and any subsequent loading after the first one will be skipped. - -## Additional config environments - -In addition to the 2 built-in configuration environments, it is possible to create your own. Your project loads `base` as the bottom-level configuration environment but allows you to overwrite it with any other environments that you create. Any additional configuration environments can be created inside `conf` folder and applied to your pipeline run as follows: - -```bash -kedro run --env -``` - -If no `env` option is specified, this will default to `local` environment to overwrite `base`. - -## Templating configuration - -Kedro also provides an extension `kedro.config.TemplatedConfigLoader` class that allows to template values in your configuration files. To apply `TemplatedConfigLoader` to your `ProjectContext` in `src/kedro_training/run.py`, you will need to overwrite the `_create_config_loader` method as follows: - -```python -... -from kedro.config import TemplatedConfigLoader # new import - - -class ProjectContext(KedroContext): - ... - - def _create_config_loader(self, conf_paths: Iterable[str]) -> TemplatedConfigLoader: - return TemplatedConfigLoader( - conf_paths, - globals_pattern="*globals.yml", # read the globals dictionary from project config - globals_dict={ # extra keys to add to the globals dictionary, take precedence over globals_pattern - "bucket_name": "another_bucket_name", - "non_string_key": 10 - } - ) -``` - -Let's assume the project contains a `conf/base/globals.yml` file with the following contents: - -```yaml -bucket_name: "my_s3_bucket" -key_prefix: "my/key/prefix/" - -datasets: - csv: "pandas.CSVDataSet" - spark: "spark.SparkDataSet" - -folders: - raw: "01_raw" - int: "02_intermediate" - pri: "03_primary" - fea: "04_features" -``` - -The contents of the dictionary resulting from the `globals_pattern` get merged with the `globals_dict`. In case of conflicts, the keys from the `globals_dict` take precedence. 
The resulting global dictionary prepared by `TemplatedConfigLoader` will look like this: - -```python -{ - "bucket_name": "another_bucket_name", - "non_string_key": 10, - "key_prefix": "my/key/prefix", - "datasets": { - "csv": "pandas.CSVDataSet", - "spark": "spark.SparkDataSet" - }, - "folders": { - "raw": "01_raw", - "int": "02_intermediate", - "pri": "03_primary", - "fea": "04_features" - } -} -``` - -Now the templating can be applied to the configs. Here is an example of templated `catalog.yml`: - -```yaml -raw_boat_data: - type: "${datasets.spark}" # nested paths into global dict are allowed - filepath: "s3a://${bucket_name}/${key_prefix}/${folders.raw}/boats.csv" - file_format: parquet - -raw_car_data: - type: "${datasets.csv}" - filepath: "data/${key_prefix}/${folders.raw}/cars.csv" - bucket_name: "${bucket_name}" - file_format: "${non.existent.key|parquet}" # default to 'parquet' if the key is not found -``` - -> Note: `TemplatedConfigLoader` uses `jmespath` package in the background to extract elements from global dictionary. For more information about JMESPath syntax please see: https://github.com/jmespath/jmespath.py. - -### Next section -[Go to the next section](./12_transcoding.md) diff --git a/docs/12_transcoding.md b/docs/12_transcoding.md deleted file mode 100644 index e33a6a3..0000000 --- a/docs/12_transcoding.md +++ /dev/null @@ -1,46 +0,0 @@ -# Transcoding - -## What is this? - -You may come across a situation where you would like to read the same file using two different dataset implementations. Use transcoding when you want to load and save the same file, via its specified `filepath`, using different `DataSet` implementations. - -## A typical example of transcoding - -For instance, parquet files can not only be loaded via the `ParquetLocalDataSet`, but also directly by `SparkDataSet` using `pandas`. This conversion is typical when coordinating a `Spark` to `pandas` workflow. - -To enable transcoding, you will need to define two `DataCatalog` entries for the same dataset in a common format (Parquet, JSON, CSV, etc.) in your `conf/base/catalog.yml`: - -```yaml -my_dataframe@spark: - type: kedro.extras.datasets.spark.SparkDataSet - filepath: data/02_intermediate/data.parquet - -my_dataframe@pandas: - type: pandas.ParquetDataSet - filepath: data/02_intermediate/data.parquet -``` - -These entries will be used in the pipeline like this: - -```python -Pipeline([ - node(func=my_func1, - inputs="spark_input", - outputs="my_dataframe@spark" - ), - node(func=my_func2, - inputs="my_dataframe@pandas", - outputs="pipeline_output" - ), -]) -``` - -## How does it work? - -In this example, Kedro understands that `my_dataframe` is the same dataset in its `SparkDataSet` and `ParquetDataSet` formats and helps resolve the node execution order. - -In the pipeline, Kedro uses the `SparkDataSet` implementation for saving and `ParquetDataSet` -for loading, so the first node should output a `pyspark.sql.DataFrame`, while the second node would receive a `pandas.Dataframe`. - -### Next section -[Go to the next section](./13_custom-datasets.md) diff --git a/docs/13_custom-datasets.md b/docs/13_custom-datasets.md deleted file mode 100644 index 15e4a8d..0000000 --- a/docs/13_custom-datasets.md +++ /dev/null @@ -1,150 +0,0 @@ -# Creating custom datasets - -Often, real world data is stored in formats that are not supported by Kedro. We will illustrate this with `shuttles.xlsx`. In fact, Kedro has built-in support for Microsoft Excel files that includes support for versioning and writer arguments. 
This example explains a simplified implementation. - -Let’s create a custom dataset class which will allow you to load and save `.xlsx` files. - -To keep your code well-structured you should create a Python sub-package called **`kedro_tutorial.io`**. You can do that by running this in your Unix terminal: - -```bash -mkdir -p src/kedro_tutorial/io && touch src/kedro_tutorial/io/__init__.py -``` - -Or, if you are a Windows user: - -```bat -mkdir src\kedro_tutorial\io && type nul > src\kedro_tutorial\io\__init__.py -``` - -Creating new custom dataset implementations is done by creating a class that extends and implements all abstract methods from `AbstractDataSet`. To implement a class that will allow you to load and save Excel files locally, you need to create the file `src/kedro_tutorial/io/xls_local.py` by running in your Unix terminal: - -```bash -touch src/kedro_tutorial/io/xls_local.py -``` -For Windows, try: -```bat -type nul > src\kedro_tutorial\io\xls_local.py -``` - -and paste the following into the newly created file: - -```python -"""ExcelLocalDataSet loads and saves data to a local Excel file. The -underlying functionality is supported by pandas, so it supports all -allowed pandas options for loading and saving Excel files. -""" -from os.path import isfile -from typing import Any, Union, Dict - -import pandas as pd - -from kedro.io import AbstractDataSet - - -class ExcelLocalDataSet(AbstractDataSet): - """``ExcelLocalDataSet`` loads and saves data to a local Excel file. The - underlying functionality is supported by pandas, so it supports all - allowed pandas options for loading and saving Excel files. - - Example: - :: - - >>> import pandas as pd - >>> - >>> data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5], - >>> 'col3': [5, 6]}) - >>> data_set = ExcelLocalDataSet(filepath="test.xlsx", - >>> load_args={'sheet_name':"Sheet1"}, - >>> save_args=None) - >>> data_set.save(data) - >>> reloaded = data_set.load() - >>> - >>> assert data.equals(reloaded) - - """ - - def _describe(self) -> Dict[str, Any]: - return dict( - filepath=self._filepath, - engine=self._engine, - load_args=self._load_args, - save_args=self._save_args, - ) - - def __init__( - self, - filepath: str, - engine: str = "xlsxwriter", - load_args: Dict[str, Any] = None, - save_args: Dict[str, Any] = None, - ) -> None: - """Creates a new instance of ``ExcelLocalDataSet`` pointing to a concrete - filepath. - - Args: - engine: The engine used to write to excel files. The default - engine is 'xlswriter'. - - filepath: path to an Excel file. - - load_args: Pandas options for loading Excel files. - Here you can find all available arguments: - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html - The default_load_arg engine is 'xlrd', all others preserved. - - save_args: Pandas options for saving Excel files. - Here you can find all available arguments: - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html - All defaults are preserved. 
- - """ - self._filepath = filepath - default_save_args = {} - default_load_args = {"engine": "xlrd"} - - self._load_args = ( - {**default_load_args, **load_args} - if load_args is not None - else default_load_args - ) - self._save_args = ( - {**default_save_args, **save_args} - if save_args is not None - else default_save_args - ) - self._engine = engine - - def _load(self) -> Union[pd.DataFrame, Dict[str, pd.DataFrame]]: - return pd.read_excel(self._filepath, **self._load_args) - - def _save(self, data: pd.DataFrame) -> None: - writer = pd.ExcelWriter(self._filepath, engine=self._engine) - data.to_excel(writer, **self._save_args) - writer.save() - - def _exists(self) -> bool: - return isfile(self._filepath) -``` - -And update the `conf/base/catalog.yml` file by adding the following: - -```yaml -shuttles: - type: kedro_tutorial.io.xls_local.ExcelLocalDataSet - filepath: data/01_raw/shuttles.xlsx -``` - -> *Note:* The `type` specified is `kedro_tutorial.io.xls_local.ExcelLocalDataSet` which points Kedro to use the custom dataset implementation. To use Kedro's internal support for reading Excel datasets, you can simply specify `pandas.ExcelDataSet`, which is implemented similar to the code above. - -A good way to test that everything works as expected is by trying to load the dataset within a new `kedro ipython` session: - -```python -context.catalog.load('shuttles').head() -``` - -### Contributing a custom dataset implementation - -Kedro users create many custom dataset implementations while working on real-world projects, and it makes sense that they should be able to share their work with each other. Sharing your custom datasets implementations is possibly the easiest way to contribute back to Kedro and if you are interested in doing so, you can check out the [Kedro contribution guide](https://github.com/quantumblacklabs/kedro/blob/develop/CONTRIBUTING.md) in the GitHub. - -### Next section -[Go to the next section](./14_custom-cli-commands.md) diff --git a/training_docs/01_welcome.md b/training_docs/01_welcome.md new file mode 100644 index 0000000..eebcda9 --- /dev/null +++ b/training_docs/01_welcome.md @@ -0,0 +1,54 @@ +# Welcome to the Kedro training! +Welcome! We are so pleased you are starting your Kedro journey. + +## What you'll cover + +* [Training prerequisites](./02_prerequisites.md) << Read this before training starts +* [Create a new Kedro project](./03_new_project.md) +* [Project dependencies](./04_dependencies.md) +* [Add a data source](./05_connect_data_sources.md) +* [Jupyter notebook workflow](./06_jupyter_notebook_workflow.md) +* [Kedro pipelines](./07_pipelines.md) +* [Pipeline visualisatopm](./08_visualisation.md) +* [Versioning](./09_versioning.md) +* [Package your project](./10_package_project.md) +* [Configuration](./11_configuration.md) +* [Transcoding](./12_transcoding.md) +* [Custom datasets](./13_custom_datasets.md) +* [Custom CLI commands](./14_custom_cli_commands.md) +* [Kedro plugins](./15_plugins.md) + + +## Before your training session + +These training materials assume some level of technical understanding, such as knowledge of Python and use of a command line interface. To optimise your experience and learn the most you can from the Kedro training, please review the following and then take a look at the [prerequisites page](./02_prerequisites.md) to get the necessary software, including Kedro, installed and running before your training session. + +## Prerequisite knowledge: Python, YAML and the CLI + +- You should be familiar with Python basics. 
[Take a look at this tutorial to confirm you are comfortable with](https://docs.python.org/3/tutorial/): + + - Functions, loops, conditional statements and IO operation + - Common data structures including lists, dictionaries and tuples + +- You should also: + - Be able to [install Python packages using `pip`](https://pip.pypa.io/en/stable/quickstart/) + - Understand the basics of [dependency management with `requirements.txt`](https://pip.pypa.io/en/latest/user_guide/#requirements-files) + - Know about [Python modules](https://docs.python.org/3/tutorial/modules.html) (e.g how to use `__init__.py` and relative and absolute imports) + - Have [familiarity with Python data science libraries](https://towardsdatascience.com/top-10-python-libraries-for-data-science-cd82294ec266), especially `Pandas` and `scikit-learn` + - Understand how to use [Jupyter Notebook/Lab](https://www.dataquest.io/blog/jupyter-notebook-tutorial/) and [iPython](https://www.codecademy.com/articles/how-to-use-ipython)) + - Be able to [use a virtual environment](https://docs.python.org/3/tutorial/venv.html) (we recommend using `conda`, but you can also use `venv` or `pipenv`) + +- You should know [basic YAML syntax](https://yaml.org/) + +- When working with the command line, you should be familiar with: + + - `cd` to navigate directories + - `ls` to list files and directories + - [Executing a command and Python program from the command line](https://realpython.com/run-python-scripts/#how-to-run-python-scripts-using-the-command-line) + + +>**Note**: If you have any problems or questions, please contact an instructor before the training. + + + +_[Go to the next page](./02_prerequisites.md)_ diff --git a/training_docs/02_prerequisites.md b/training_docs/02_prerequisites.md new file mode 100644 index 0000000..4f214ae --- /dev/null +++ b/training_docs/02_prerequisites.md @@ -0,0 +1,81 @@ +# Training prerequisites +## Operating system +Kedro supports macOS, Linux and Windows (7 / 8 / 10 and Windows Server 2016+). + +## Command line +You will need to use the command line interface, or CLI, which is a text-based application to view, navigate and manipulate files on your computer. + +[Find out more about working with the command line](https://tutorial.djangogirls.org/en/intro_to_command_line/) (also known as cmd, CLI, prompt, console or terminal). + +- [Terminal on macOS](https://support.apple.com/en-gb/guide/terminal/welcome/mac) +- [Windows command line](https://www.computerhope.com/issues/chusedos.htm) + +## Python +Kedro supports [Python 3.6, 3.7 or 3.8](https://www.python.org/downloads/) so you should make sure that it is installed on your laptop. + +We recommend using [Anaconda](https://www.anaconda.com/download) (Python 3.7 version) to install Python packages. You can also use `conda`, which comes with Anaconda, as a virtual environment manager when you install Kedro. + +## Kedro +Follow the official [Kedro prerequisites documentation](https://kedro.readthedocs.io/en/latest/02_get_started/01_prerequisites.html) and the page that follows it to [install Kedro](https://kedro.readthedocs.io/en/latest/02_get_started/02_install.html). + +A clean install of Kedro is designed to be lightweight. If you later need additional tools, it is possible to install them later. 
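Once the installation has finished, a quick check from your terminal confirms that Kedro is available inside your active virtual environment (the exact version reported will depend on the release you installed):

```bash
# Confirm the `kedro` command is on your PATH and report its version
kedro --version

# Optionally, print the Kedro banner with version and plugin information
kedro info
```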
>**Note**: If you encounter any problems, please engage Kedro community support on [Stack Overflow](https://stackoverflow.com/questions/tagged/kedro), or refer to the [Kedro.Community Discourse channel](https://discourse.kedro.community/) that is managed by Kedroids all over the world.

## Code editor
There are many code editors to choose from. Here are some we recommend:

- [PyCharm](https://www.jetbrains.com/pycharm/download/)
- [VS Code IDE](https://code.visualstudio.com/)
- [Atom](https://atom.io/)

## Kedro training code
Download the [Kedro training repository](https://github.com/quantumblacklabs/kedro-training) by following [these instructions](https://stackoverflow.com/questions/2751227/how-to-download-source-in-zip-format-from-github).

## Git (optional)

Git is a version control system that records changes to files as you work on them. Git is especially helpful for software developers because it allows changes to be tracked (including who made them and when) across a project.

When you [download `git`](https://git-scm.com/downloads), be sure to choose the correct version for your operating system:

### Installing Git on Windows
Download the [`git` for Windows installer](https://gitforwindows.org/). Make sure to select **Use `git` from the Windows command prompt**; this ensures that `git` is permanently added to your PATH.

Also select **Checkout Windows-style, commit Unix-style line endings** and click **Next**.

This will provide you with both `git` and `git bash`, which you will find useful during the training.

### GitHub
GitHub is a web-based service for version control using Git. To use it, you will need to [set up an account](https://github.com).


## Checklist
Please use this checklist to make sure you have everything necessary to participate in the Kedro training.

- [ ] You have [Python 3 (either 3.6, 3.7 or 3.8)](https://www.python.org/downloads/) installed on your laptop

- [ ] You have Anaconda or an alternative [virtual environment manager](https://kedro.readthedocs.io/en/stable/02_get_started/01_prerequisites.html#virtual-environments)

- [ ] You have installed [Kedro](#kedro)

- [ ] You have [downloaded the `kedro-training` repository](#kedro-training-code)

- [ ] You have [a code editor](#code-editor) installed for writing Python code

- [ ] You have a [command line](#command-line) installed

Having completed the above checklist, make sure that you are able to execute the following commands from your command line interface:

- [ ] `python --version` or `python3 --version` returns a correct Python version (either 3.6, 3.7 or 3.8).

- [ ] `kedro --version` shows [the latest Kedro version](https://pypi.org/project/kedro/).


If you are able to complete all of the above, you are ready for the training!

>**Note**: If you have any problems or questions with any of the checklist items above, please contact an instructor and resolve them before the training.


_[Go to the next page](./03_new_project.md)_


diff --git a/training_docs/03_new_project.md b/training_docs/03_new_project.md
new file mode 100644
index 0000000..72afff0
--- /dev/null
+++ b/training_docs/03_new_project.md
@@ -0,0 +1,102 @@
# Create a new project

This section mirrors the [spaceflights tutorial in the Kedro documentation](https://kedro.readthedocs.io/en/stable/03_tutorial/01_spaceflights_tutorial.html).

As we work through the spaceflights tutorial, we will follow these steps:

### 1.
Set up the project template + +* Create a new project with `kedro new` +* Configure the following in the `conf` folder: + * Logging + * Credentials + * Any other sensitive / personal content +* Install project dependencies with `kedro install` + +### 2. Set up the data + +* Add data to the `data/` folder +* Reference all datasets for the project in `conf/base/catalog.yml` + +### 3. Create the pipeline + +* Create the data transformation steps as Python functions +* Construct the pipeline by adding your functions as nodes +* Choose how to run the pipeline: sequentially or in parallel + +### 4. Package the project + + * Build the project documentation + * Package the project for distribution + + +## Create and set up a new project + +In this section, we discuss the project set-up phase, with the following steps: + + +* Create a new project +* Install dependencies +* Configure the project + +---- +In the text, we assume that you create an empty project and follow the flow of the tutorial by copying and pasting the example code into the project as we describe. This tutorial will take approximately 2 hours and you will learn each step of the Kedro project development workflow, by working on an example to construct nodes and pipelines for the price-prediction model. + +However, you may prefer to get up and running more swiftly so we provide the full spaceflights example project as a [Kedro starter](https://kedro.readthedocs.io/en/stable/02_get_started/06_starters.html). + +Follow one or other of these instructions to create the project: + +* If you decide to create the example project fully populated with code, navigate to your chosen working directory and run the following: `kedro new --starter=spaceflights` + + - Feel free to name your project as you like, but this guide will assume the project is named **`Kedro Training`**, and that your project is in a sub-folder in your working directory that was created by `kedro new`, named `kedro-training`. + + - Keep the default names for the `repo_name` and `python_package` when prompted. + + - The project will be populated with the template code from the [Kedro starter for the spaceflights tutorial](https://github.com/quantumblacklabs/kedro-starters/tree/master/spaceflights). It means that you can follow the tutorial without any of the copy/pasting. + +* If you prefer to create an empty tutorial and cut and paste the code to follow along with the steps, you should instead run the following to [create a new empty Kedro project](https://kedro.readthedocs.io/en/stable/02_get_started/04_new_project.html#create-a-new-project-interactively) using the default interactive prompts: `kedro new` + + - Feel free to name your project as you like, but this guide will assume the project is named **`Kedro Training`**, and that your project is in a sub-folder in your working directory that was created by `kedro new`, named `kedro-training`. + + - Keep the default names for the `repo_name` and `python_package` when prompted. + + +### Project structure +Take a few minutes to familiarise yourself with the [Kedro project structure](https://kedro.readthedocs.io/en/stable/02_get_started/05_example_project.html#project-directory-structure) by exploring the contents of the `kedro-training` folder that you created from either of the steps you chose above. + + +## Configure the project + +You may optionally add in any credentials to `conf/local/credentials.yml` that you would need to load specific data sources like usernames and passwords. 
Some examples are given within the file to illustrate how you store credentials. Additional information can be found in a later page on [advanced configuration](./11_configuration.md). + +When it runs, Kedro automatically reads credentials from the `conf` folder and feeds them into the Data Catalog, which is responsible for loading and saving data as inputs and outputs of pipeline nodes. You can configure your credentials once and then reuse them in multiple datasets. + +Example of `conf/local/credentials.yml`: + +```yaml +dev_s3: + client_kwargs: + aws_access_key_id: key + aws_secret_access_key: secret +``` + + +For security reasons, we strongly recommend not committing any credentials or other secrets to the Version Control System. By default any file inside the `conf` folder (and subfolders) in your Kedro project containing `credentials` word in its name will be ignored and not committed to your repository. + +Please bear it in mind when you start working with Kedro project that you have cloned from GitHub, for example, as you might need to configure required credentials first. + +>**Note**: If you maintain a project, you should document how to configure any required credentials in your project's documentation. + +The Kedro documentation lists some [best practices to avoid leaking confidential data](https://kedro.readthedocs.io/en/stable/02_get_started/05_example_project.html#what-best-practice-should-i-follow-to-avoid-leaking-confidential-data). + + +Example of the dataset using those credentials defined in `conf/base/catalog.yml`: + +```yaml +cars: + type: pandas.CSVDataSet + filepath: s3://my_bucket/data/02_intermediate/company/cars.csv + credentials: dev_s3 +``` + +_[Go to the next page](./04_dependencies.md)_ diff --git a/training_docs/04_dependencies.md b/training_docs/04_dependencies.md new file mode 100644 index 0000000..8ca4b1e --- /dev/null +++ b/training_docs/04_dependencies.md @@ -0,0 +1,60 @@ +# Dependencies + +Up to this point, we haven't discussed project dependencies, so now is a good time to introduce them. Specifying a project's dependencies in Kedro makes it easier for others to run your project; it avoids version conflicts by use of the same Python packages. + +The generic project template bundles some typical dependencies, in `src/requirements.txt`. Here's a typical example, although you may find that the version numbers are slightly different depending on the version of Kedro that you are using: + +```text +black==v19.10b0 # Used for formatting code with `kedro lint` +flake8>=3.7.9, <4.0 # Used for linting code with `kedro lint` +ipython==7.0 # Used for an IPython session with `kedro ipython` +isort>=4.3.21, <5.0 # Used for linting code with `kedro lint` +jupyter~=1.0 # Used to open a Kedro-session in Jupyter Notebook & Lab +jupyter_client>=5.1.0, <7.0 # Used to open a Kedro-session in Jupyter Notebook & Lab +jupyterlab==0.31.1 # Used to open a Kedro-session in Jupyter Lab +kedro==0.17.0 +nbstripout==0.3.3 # Strips the output of a Jupyter Notebook and writes the outputless version to the original file +pytest-cov~=2.5 # Produces test coverage reports +pytest-mock>=1.7.1,<2.0 # Wrapper around the mock package for easier use with pytest +pytest~=6.1.2 # Testing framework for Python code +wheel==0.32.2 # The reference implementation of the Python wheel packaging standard +``` + +> Note: If your project has `conda` dependencies, you can create a `src/environment.yml` file and list them there. 
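As a rough sketch, such a file follows the standard `conda` environment format. The environment name, channel and packages below are purely illustrative and are not part of the project template:

```yaml
# src/environment.yml -- illustrative example only
name: kedro-training
channels:
  - conda-forge
dependencies:
  - python=3.7
  - openpyxl  # an example of a package you may prefer to manage with conda
```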
+ +### Add and remove project-specific dependencies + +The dependencies above may be sufficient for some projects, but for the spaceflights project, you need to add a requirement for the `pandas` project because you are working with CSV and Excel files. You can add the necessary dependencies for these files types as follows: + +```bash +pip install kedro[pandas.CSVDataSet,pandas.ExcelDataSet] +``` + +Alternatively, if you need to, you can edit `src/requirements.txt` directly to modify your list of dependencies by replacing the requirement `kedro==0.17.0` with the following (your version of Kedro may be different): + +```text +kedro[pandas.CSVDataSet,pandas.ExcelDataSet]==0.17.0 +``` + +Then run the following: + +```bash +kedro build-reqs +``` + +[`kedro build-reqs`](https://kedro.readthedocs.io/en/stable/09_development/03_commands_reference.html#build-the-project-s-dependency-tree) takes `requirements.in` file (or `requirements.txt` if it does not yet exist), resolves all package versions and 'freezes' them by putting pinned versions back into `requirements.txt`. It significantly reduces the chances of dependencies issues due to downstream changes as you would always install the same package versions. + + +You can find out more about [how to work with project dependencies](https://kedro.readthedocs.io/en/stable/04_kedro_project_setup/01_dependencies.html) in the Kedro project documentation. + +## `kedro install` + +To install the project-specific dependencies, navigate to the root directory of the project and run: + +```bash +kedro install +``` + +This command is roughly equivalent to `pip install -r src/requirements.txt`, however `kedro install` is a bit smarter on Windows when it needs to upgrade its version. It also makes sure that the dependencies are always installed in the same virtual environment as Kedro. + +_[Go to the next page](./05_connect_data_sources.md)_ \ No newline at end of file diff --git a/training_docs/05_connect_data_sources.md b/training_docs/05_connect_data_sources.md new file mode 100644 index 0000000..a991d9c --- /dev/null +++ b/training_docs/05_connect_data_sources.md @@ -0,0 +1,145 @@ +# Add your datasets to `data` + +In this section, we discuss the data set-up phase. The steps are as follows: + +* Add datasets to your `data/` folder, according to [data engineering convention](https://kedro.readthedocs.io/en/stable/12_faq/01_faq.html#what-is-data-engineering-convention) +* Register the datasets with the Data Catalog, which is the registry of all data sources available for use by the project `conf/base/catalog.yml`. This ensures that your code is reproducible when it references datasets in different locations and/or environments. + +You can find further information about [the Data Catalog](https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html) in specific documentation covering advanced usage. + + +## Add your datasets to `data` + +The spaceflights tutorial makes use of fictional datasets of companies shuttling customers to the Moon and back. You will use the data to train a model to predict the price of shuttle hire. However, before you get to train the model, you will need to prepare the data by doing some data engineering, which is the process of preparing data for model building by creating a master table. + +The spaceflight tutorial has three files and uses two data formats: `.csv` and `.xlsx`. 
Download and save the files to the `data/01_raw/` folder of your project directory:

* [reviews.csv](https://quantumblacklabs.github.io/kedro/reviews.csv)
* [companies.csv](https://quantumblacklabs.github.io/kedro/companies.csv)
* [shuttles.xlsx](https://quantumblacklabs.github.io/kedro/shuttles.xlsx)

Here are some examples of how you can [download the files from GitHub](https://www.quora.com/How-do-I-download-something-from-GitHub) to the `data/01_raw` directory inside your project:

Using [cURL in a Unix terminal](https://curl.haxx.se/download.html):

```bash
# reviews
curl -o data/01_raw/reviews.csv https://quantumblacklabs.github.io/kedro/reviews.csv
# companies
curl -o data/01_raw/companies.csv https://quantumblacklabs.github.io/kedro/companies.csv
# shuttles
curl -o data/01_raw/shuttles.xlsx https://quantumblacklabs.github.io/kedro/shuttles.xlsx
```

Using [cURL for Windows](https://curl.se/windows/):

```bat
curl -o data\01_raw\reviews.csv https://quantumblacklabs.github.io/kedro/reviews.csv
curl -o data\01_raw\companies.csv https://quantumblacklabs.github.io/kedro/companies.csv
curl -o data\01_raw\shuttles.xlsx https://quantumblacklabs.github.io/kedro/shuttles.xlsx
```

Using [Wget in a Unix terminal](https://www.gnu.org/software/wget/):

```bash
# reviews
wget -O data/01_raw/reviews.csv https://quantumblacklabs.github.io/kedro/reviews.csv
# companies
wget -O data/01_raw/companies.csv https://quantumblacklabs.github.io/kedro/companies.csv
# shuttles
wget -O data/01_raw/shuttles.xlsx https://quantumblacklabs.github.io/kedro/shuttles.xlsx
```

Using [Wget for Windows](https://eternallybored.org/misc/wget/):

```bat
wget -O data\01_raw\reviews.csv https://quantumblacklabs.github.io/kedro/reviews.csv
wget -O data\01_raw\companies.csv https://quantumblacklabs.github.io/kedro/companies.csv
wget -O data\01_raw\shuttles.xlsx https://quantumblacklabs.github.io/kedro/shuttles.xlsx
```
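Whichever method you use, it is worth confirming that all three files are in place before you move on (on Windows, use `dir data\01_raw` instead):

```bash
ls data/01_raw
# expected output: companies.csv  reviews.csv  shuttles.xlsx
```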
+ +## Register the datasets + +You now need to register the datasets so they can be loaded by Kedro. All Kedro projects have a `conf/base/catalog.yml` file, and you register each dataset by adding a named entry into the `.yml` file. The entry should include the following: + +* File location (path) +* Parameters for the given dataset +* Type of data +* Versioning + +Kedro supports a number of different data types, and those supported can be found in the API documentation. Kedro uses [`fssspec`](https://filesystem-spec.readthedocs.io/en/latest/) to read data from a variety of data stores including local file systems, network file systems, cloud object stores and HDFS. + + +### `csv` + +For the spaceflights data, first register the `csv` datasets by adding this snippet to the end of the `conf/base/catalog.yml` file: + +```yaml +companies: + type: pandas.CSVDataSet + filepath: data/01_raw/companies.csv + +reviews: + type: pandas.CSVDataSet + filepath: data/01_raw/reviews.csv +``` + +To check whether Kedro can load the data correctly, open a `kedro ipython` session and run: + +```python +catalog.load("companies").head() +``` + +The command loads the dataset named `companies` (as per top-level key in `catalog.yml`), from the underlying filepath `data/01_raw/companies.csv`. It displays the first five rows of the dataset, and is loaded into a `pandas` DataFrame for you to experiment with the data. + +When you have finished, close `ipython` session as follows: + +```python +exit() +``` + +### `xlsx` + +Now register the `xlsx` dataset by adding this snippet to the end of the `conf/base/catalog.yml` file: + +```yaml +shuttles: + type: pandas.ExcelDataSet + filepath: data/01_raw/shuttles.xlsx +``` + +To test that everything works as expected, load the dataset within a _new_ `kedro ipython` session: + +```python +catalog.load("shuttles").head() +``` +When you have finished, close `ipython` session as follows: + +```python +exit() +``` + +## Custom data + +Kedro supports a number of datasets out of the box, but you can also add support for any proprietary data format or filesystem in your pipeline. + +You can find further information about [how to add support for custom datasets](https://kedro.readthedocs.io/en/stable/07_extend_kedro/03_custom_datasets.html) in specific documentation covering advanced usage. + + +_[Go to the next page](./06_jupyter_notebook_workflow.md)_ diff --git a/training_docs/06_jupyter_notebook_workflow.md b/training_docs/06_jupyter_notebook_workflow.md new file mode 100644 index 0000000..aeacf81 --- /dev/null +++ b/training_docs/06_jupyter_notebook_workflow.md @@ -0,0 +1,322 @@ +# Use Kedro with IPython and Jupyter Notebooks/Lab + +This section demonstrates how to use Kedro with IPython and Jupyter Notebooks / Lab. We also recommend a video that explains the transition from the use of vanilla Jupyter Notebooks to using Kedro, from [Data Engineer One](https://www.youtube.com/watch?v=dRnCovp1GRQ&t=50s&ab_channel=DataEngineerOne). + + + + +## Why use a Notebook? +There are reasons why you may want to use a Notebook, although in general, the principles behind Kedro would discourage their use because they have some [drawbacks when they are used to create production or reproducible code](https://towardsdatascience.com/5-reasons-why-you-should-switch-from-jupyter-notebook-to-scripts-cb3535ba9c95). 
However, there are occasions when you'd want to put some code into a Notebook, for example: + +* To conduct exploratory data analysis +* For experimentation as you create new Python functions (nodes) +* As a tool for reporting and presentations + + +## Kedro and IPython + +You may want to use a Python kernel inside a Jupyter notebook (formerly known as IPython) to experiment with your Kedro code. + +To start a standalone IPython session, run the following command in the root directory of your Kedro project: + +```bash +kedro ipython +``` +This opens an iPython session in your shell, which you can terminate, when you have finished, by typing: + +```python +exit() +``` +### Load `DataCatalog` in IPython + +To test the IPython session, load a dataset defined in your `conf/base/catalog.yml`, by simply executing the following: + +```python +df = catalog.load("companies") +df.head() +``` + +#### Dataset versioning + +If you enable [versioning](./versioning.md), you can load a particular version of a dataset. Given a catalog entry: + +```yaml +example_train_x: + type: pandas.CSVDataSet + filepath: data/02_intermediate/example_train_x.csv + versioned: true +``` + +and having run your pipeline at least once, you may specify which version to load: + +```python +catalog.load("example_train_x", version="2019-12-13T15.08.09.255Z") +``` + +## Kedro and Jupyter + +You may want to use Jupyter notebooks to experiment with your code as you develop new nodes for a pipeline, although you can write them as regular Python functions without a notebook. To use Kedro's Jupyter session: + +```bash +kedro jupyter notebook +``` + +This starts a Jupyter server and opens a window in your default browser. + +> Note: If you want Jupyter to listen to a different port number, then run `kedro jupyter notebook --port ` + +Navigate to the `notebooks` folder of your Kedro project and create a new notebook. + +![](./images/jupyter_create_new_notebook.png) + +> *Note:* The only kernel available by default has a name of the current project. If you need to access all available kernels, add `--all-kernels` to the command above. + +Every time you start or restart a Jupyter or IPython session in the CLI using a `kedro` command, a startup script in `.ipython/profile_default/startup/00-kedro-init.py` is executed. It adds the following variables in scope: + +* `context` (`KedroContext`) - Kedro project context that provides access to Kedro's library components. +* `session` (`KedroSession`) - Session data (static and dynamic) for the Kedro run. +* `catalog` (`DataCatalog`) - Data catalog instance that contains all defined datasets; this is a shortcut for `context.catalog` +* `startup_error` (`Exception`) - An error that was raised during the execution of the startup script or `None` if no errors occurred + +## How to use `context` + +The `context` variable allows you to interact with Kedro library components from within the Kedro Jupyter notebook. 
+ +![context input graphic](./images/jupyter_notebook_showing_context.png) + +With `context`, you can access the following variables and methods: + +- `context.project_path` (`Path`) - Root directory of the project +- `context.project_name` (`str`) - Project folder name +- `context.catalog` (`DataCatalog`) - An instance of DataCatalog +- `context.config_loader` (`ConfigLoader`) - An instance of ConfigLoader +- `context.pipeline` (`Pipeline`) - Defined pipeline + +### Run the pipeline + +If you wish to run the whole main pipeline within a notebook cell, you can do so by instantiating a `Session`: + +```python +from kedro.framework.session import KedroSession + +with KedroSession.create("") as session: + session.run() +``` + +The command runs the nodes from your default project pipeline in a sequential manner. + +To parameterise your pipeline run, refer to [a later section on this page on run parameters](#additional-parameters-for-session-run) which lists all available options. + + +### Parameters + +The `context` object exposes the `params` property, which allows you to access all project parameters: + +```python +parameters = context.params # type: Dict +parameters["example_test_data_ratio"] +# returns the value of 'example_test_data_ratio' key from 'conf/base/parameters.yml' +``` + +> Note: You need to reload Kedro variables by calling `%reload_kedro` and re-run the code snippet above if you change the contents of `parameters.yml`. + +### Load/Save `DataCatalog` in Jupyter + +You can load a dataset defined in your `conf/base/catalog.yml`: + +```python +df = catalog.load("example_iris_data") +df.head() +``` + +![load the catalog and output head graphic](./images/jupyter_notebook_workflow_loading_data.png) + +The save operation in the example below is analogous to the load. + +Put the following dataset entry in `conf/base/catalog.yml`: + +```yaml +my_dataset: + type: pandas.JSONDataSet + filepath: data/01_raw/my_dataset.json +``` + +Next, you need to reload Kedro variables by calling `%reload_kedro` line magic in your Jupyter notebook. + +Finally, you can save the data by executing the following command: + +```python +my_dict = {"key1": "some_value", "key2": None} +catalog.save("my_dataset", my_dict) +``` + +### Additional parameters for `session.run()` +You can also specify the following optional arguments for `session.run()`: + +```eval_rst ++---------------+----------------+-------------------------------------------------------------------------------+ +| Argument name | Accepted types | Description | ++===============+================+===============================================================================+ +| tags | Iterable[str] | Construct the pipeline using only nodes which have this tag attached. 
| +| | | A node is included in the resulting pipeline if it contains any of those tags | ++---------------+----------------+-------------------------------------------------------------------------------+ +| runner | AbstractRunner | An instance of Kedro [AbstractRunner](/kedro.runner.AbstractRunner); | +| | | can be an instance of a [ParallelRunner](/kedro.runner.ParallelRunner) | ++---------------+----------------+-------------------------------------------------------------------------------+ +| node_names | Iterable[str] | Run only nodes with specified names | ++---------------+----------------+-------------------------------------------------------------------------------+ +| from_nodes | Iterable[str] | A list of node names which should be used as a starting point | ++---------------+----------------+-------------------------------------------------------------------------------+ +| to_nodes | Iterable[str] | A list of node names which should be used as an end point | ++---------------+----------------+-------------------------------------------------------------------------------+ +| from_inputs | Iterable[str] | A list of dataset names which should be used as a starting point | ++---------------+----------------+-------------------------------------------------------------------------------+ +| load_versions | Dict[str, str] | A mapping of a dataset name to a specific dataset version (timestamp) | +| | | for loading - this applies to the versioned datasets only | ++---------------+----------------+-------------------------------------------------------------------------------+ +| pipeline_name | str | Name of the modular pipeline to run - must be one of those returned | +| | | by register_pipelines function from src//hooks.py | ++---------------+----------------+-------------------------------------------------------------------------------+ +``` + +This list of options is fully compatible with the list of CLI options for the `kedro run` command. In fact, `kedro run` is calling `context.run()` behind the scenes. + + +## Global variables + +Add customised global variables to `.ipython/profile_default/startup/00-kedro-init.py`. For example, if you want to add a global variable for `parameters` from `parameters.yml`, update `reload_kedro()` as follows: + +```python +@register_line_magic +def reload_kedro(project_path, line=None): + """"Line magic which reloads all Kedro default variables.""" + # ... + global parameters + try: + # ... + session = KedroSession.create("", project_path) + _activate_session(session) + context = session.load_context() + parameters = context.params + # ... + logging.info("Defined global variable `context`, `session`, `catalog` and `parameters`") + except: + pass +``` + + +## Convert functions from Jupyter Notebooks into Kedro nodes + +Built into the Kedro Jupyter workflow is the ability to convert multiple functions defined in the Jupyter notebook(s) into Kedro nodes. You need a single CLI command. 
+ +Here is how it works: + +* Start a Jupyter notebook session: `kedro jupyter notebook` +* Create a new notebook and paste the following code into the first cell: + +```python +def some_action(): + print("This function came from `notebooks/my_notebook.ipynb`") +``` + +* Enable tags toolbar: `View` menu -> `Cell Toolbar` -> `Tags` +![Enable the tags toolbar graphic](./images/jupyter_notebook_workflow_activating_tags.png) + +* Add the `node` tag to the cell containing your function +![Add the node tag graphic](./images/jupyter_notebook_workflow_tagging_nodes.png) + +> Tip: The notebook can contain multiple functions tagged as `node`, each of them will be exported into the resulting Python file + +* Save your Jupyter notebook to `notebooks/my_notebook.ipynb` +* Run `kedro jupyter convert notebooks/my_notebook.ipynb` from the terminal to create a Python file `src//nodes/my_notebook.py` containing `some_action` function definition + +> Tip: You can also convert all your notebooks at once by calling `kedro jupyter convert --all` + +* The `some_action` function can now be used in your Kedro pipelines + +## IPython loader + +The script `tools/ipython/ipython_loader.py` helps to locate IPython startup directory and run all Python scripts in it when working with Jupyter notebooks and IPython sessions. It should work identically not just within a Kedro project, but also with any project that contains IPython startup scripts. + +The script automatically locates the `.ipython/profile_default/startup` directory by starting from the current working directory and going up the directory tree. If the directory is found, all Python scripts in it are executed. + +> *Note:* This script will only run startup scripts from the first encountered `.ipython/profile_default/startup` directory. All consecutive `.ipython` directories higher up in the directory tree will be disregarded. + +### Installation + +To install this script simply download it into your default IPython config directory: + +```bash +mkdir -p ~/.ipython/profile_default/startup +wget -O ~/.ipython/profile_default/startup/ipython_loader.py https://raw.githubusercontent.com/quantumblacklabs/kedro/master/tools/ipython/ipython_loader.py +``` + +### Prerequisites + +For this script to work, the following conditions must be met: + +* Your project must contain the `.ipython/profile_default/startup` folder in its root directory. +* The Jupyter notebook should be saved inside the project root directory or within any nested subfolder of the project directory. +* An IPython interactive session should be started with the working directory pointing to the project root directory or any nested subdirectory. + +For example, given the following project structure: + +```console +new-kedro-project/ +β”œβ”€β”€ .ipython +β”‚Β Β  └── profile_default +β”‚Β Β  └── startup +β”‚Β Β  └── 00-kedro-init.py +β”œβ”€β”€ conf/ +β”œβ”€β”€ data/ +β”œβ”€β”€ docs/ +β”œβ”€β”€ logs/ +β”œβ”€β”€ notebooks +β”‚Β Β  └── subdir1 +β”‚Β Β  └── subdir2 +└── src/ +``` + +If your `Notebook.ipynb` is placed anywhere in the following, `.ipython/profile_default/startup/00-kedro-init.py` will automatically be executed on every notebook startup: + +* `new-kedro-project/notebooks/` +* `new-kedro-project/notebooks/subdir1/` +* `new-kedro-project/notebooks/subdir1/subdir2/` +* or even `new-kedro-project/` (although this is strongly discouraged). 
+ +> *Note:* Given the example structure above, this script *will not* load your IPython startup scripts if the notebook is saved anywhere *outside* `new-kedro-project` directory. + +### Troubleshooting and FAQs + +#### How can I stop my notebook terminating? + +If you close the notebook and its kernel is idle, it will be automatically terminated by the Jupyter server after 30 seconds of inactivity. However, if the notebook kernel is busy, it won't be automatically terminated by the server. + +You can change the timeout by passing `--idle-timeout=` option to `kedro jupyter notebook` or `kedro jupyter lab` call. If you set `--idle-timeout=0`, this will disable automatic termination of idle notebook kernels. + +#### Why can't I run `kedro jupyter notebook`? + +In certain cases, you may not be able to run `kedro jupyter notebook`, which means that you have to work in a standard Jupyter session. This may be because you don't have a CLI access to the machine where the Jupyter server is running or you've opened a Jupyter notebook by running `jupyter notebook` from the terminal. In that case, you can create a `context` variable yourself by running the following block of code at the top of your notebook: + +```python +from pathlib import Path +from kedro.framework.session import KedroSession +from kedro.framework.session.session import _activate_session + +current_dir = Path.cwd() # this points to 'notebooks/' folder +project_path = current_dir.parent # point back to the root of the project +session = KedroSession.create("", project_path) +_activate_session(session) +context = session.load_context() +``` + +#### How can I reload the `session`, `context`, `catalog` and `startup_error` variables? + +To reload these variables at any point (e.g., if you update `catalog.yml`), use the [line magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html) `%reload_kedro`. This magic can also be used to see the error message if any of the variables above are undefined. + +![reload kedro line magic graphic](./images/jupyter_notebook_loading_context.png) + +If the `KEDRO_ENV` environment variable is specified, the startup script loads that environment, otherwise it defaults to `local`. Instructions for setting the environment variable can be found in the [Kedro configuration documentation](https://kedro.readthedocs.io/en/stable/04_kedro_project_setup/02_configuration.html#additional-configuration-environments). + + +_[Go to the next page](./07_pipelines.md)_ diff --git a/docs/07_pipelines.md b/training_docs/07_pipelines.md similarity index 64% rename from docs/07_pipelines.md rename to training_docs/07_pipelines.md index 1df5c49..82c269a 100644 --- a/docs/07_pipelines.md +++ b/training_docs/07_pipelines.md @@ -1,8 +1,16 @@ # Kedro pipelines -## Node basics +It is time to introduce the most basic elements of Kedro before we dive into the spaceflights pipelines. -A `Node` in Kedro represents a class that facilitates the operations required to run user-provided functions as part of Kedro pipelines. +## Introduction to nodes and pipelines + +A `node` is a Kedro concept. It is a wrapper for a Python function that names the inputs and outputs of that function. It is the building block of a pipeline. Nodes can be linked when the output of one node is the input of another. + +A pipeline organises the dependencies and execution order of a collection of nodes, and connects inputs and outputs while keeping your code modular. 
The pipeline determines the node execution order by resolving dependencies and does *not* necessarily run the nodes in the order in which they are passed in. + +A Runner is an object that runs the pipeline once Kedro resolves the order in which the nodes are executed. + +## Spaceflights nodes Let's create a file `src/kedro_training/pipelines/data_engineering/nodes.py` and add the following functions: @@ -23,6 +31,10 @@ def _parse_percentage(x): def _parse_money(x): return float(x.replace("$", "").replace(",", "")) +``` +You should also add these empty functions and follow the instructions to complete them: + +```python def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame: """Preprocess the data for companies. @@ -50,7 +62,7 @@ def preprocess_shuttles(shuttles: pd.DataFrame) -> pd.DataFrame: Preprocessed data. """ - + # This function should preprocess the 'shuttles' DataFrame by doing the following: # 1. Convert 'd_check_complete' and 'moon_clearance_complete' columns to boolean # by applying _is_true function inplace @@ -58,17 +70,18 @@ def preprocess_shuttles(shuttles: pd.DataFrame) -> pd.DataFrame: return shuttles - def create_master_table( shuttles: pd.DataFrame, companies: pd.DataFrame, reviews: pd.DataFrame ) -> pd.DataFrame: """Combines all data to create a master table. + Args: shuttles: Preprocessed data for shuttles. companies: Preprocessed data for companies. reviews: Source data for reviews. Returns: Master table. + """ # This function should prepare the master table by doing the following: @@ -86,7 +99,6 @@ def create_master_table( ```python import pandas as pd - def _is_true(x): return x == "t" @@ -101,6 +113,7 @@ def _parse_money(x): return float(x.replace("$", "").replace(",", "")) + def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame: """Preprocess the data for companies. @@ -142,12 +155,14 @@ def create_master_table( shuttles: pd.DataFrame, companies: pd.DataFrame, reviews: pd.DataFrame ) -> pd.DataFrame: """Combines all data to create a master table. + Args: shuttles: Preprocessed data for shuttles. companies: Preprocessed data for companies. reviews: Source data for reviews. Returns: Master table. + """ rated_shuttles = shuttles.merge(reviews, left_on="id", right_on="shuttle_id") @@ -159,22 +174,23 @@ def create_master_table( master_table = master_table.dropna() return master_table ``` + + ## Assemble nodes into a modular pipeline -### Creating the data engineering pipeline +### Create the data engineering pipeline You have utility functions and two processing functions, `preprocess_companies` and `preprocess_shuttles`, which take Pandas dataframes for `companies` and `shuttles` respectively and output preprocessed versions of those dataframes. -Next you should create the Data Engineering pipeline, which represents a collection of `Node` objects. To do so, add the following code to `src/kedro_training/pipelines/data_engineering/pipeline.py`: +Next you should create the data engineering pipeline, which represents a collection of `Node` objects. To do so, add the following code to `src/kedro_training/pipelines/data_engineering/pipeline.py` and follow the instructions to complete it: ```python from kedro.pipeline import Pipeline, node from .nodes import preprocess_companies, preprocess_shuttles, create_master_table - def create_pipeline(**kwargs) -> Pipeline: """Create the project's pipeline. 
@@ -186,7 +202,7 @@ def create_pipeline(**kwargs) -> Pipeline: """ - # Here you need to construct a Data Engineering ('de_pipeline') object, which + # Here you need to construct a data engineering ('de_pipeline') object, which # satisfies the following requirements: # 1. Is an instance of a Pipeline class # 2. Contains 3 pipeline nodes: @@ -209,8 +225,7 @@ from kedro.pipeline import Pipeline, node from .nodes import preprocess_companies, preprocess_shuttles, create_master_table - -def create_pipeline(**kwargs) -> Pipeline: +def create_pipeline(**kwargs): """Create the project's pipeline. Args: @@ -220,19 +235,20 @@ def create_pipeline(**kwargs) -> Pipeline: Pipeline object. """ - de_pipeline = Pipeline( + + return Pipeline( [ node( func=preprocess_companies, inputs="companies", outputs="preprocessed_companies", - name="preprocessing_companies" + name="preprocessing_companies", ), node( func=preprocess_shuttles, inputs="shuttles", outputs="preprocessed_shuttles", - name="preprocessing_shuttles" + name="preprocessing_shuttles", ), node( func=create_master_table, @@ -241,8 +257,6 @@ def create_pipeline(**kwargs) -> Pipeline: ), ] ) - - return de_pipeline ``` @@ -250,40 +264,12 @@ def create_pipeline(**kwargs) -> Pipeline: To turn it into a Python package, create an empty file `src/kedro_training/pipelines/data_engineering/__init__.py`. -Finally, we need to register the newly created modular pipeline in `src/kedro_training/pipeline.py`: - -```python -from typing import Dict - -from kedro.pipeline import Pipeline - -from kedro_training.pipelines import data_engineering as de - - -def create_pipelines(**kwargs) -> Dict[str, Pipeline]: - """Create the project's pipeline. - - Args: - kwargs: Ignore any additional arguments added in the future. - - Returns: - A mapping from a pipeline name to a ``Pipeline`` object. - - """ - data_engineering_pipeline = de.create_pipeline() - - return { - "de": data_engineering_pipeline, - "__default__": data_engineering_pipeline, - } -``` - -### Creating the data science pipeline +## Create the data science pipeline The data science pipeline is similar conceptually; it requires nodes and the pipeline definition. -Let's create `src/kedro_training/pipelines/data_science/nodes.py` and put the following code in it to create our nodes: +Let's create `src/kedro_training/pipelines/data_science/nodes.py`, put the following code in it to create the nodes, and follow the instructions to complete it: ```python import logging @@ -306,7 +292,7 @@ def split_data(data: pd.DataFrame, parameters: Dict) -> List: A list containing split data. """ - + # 1. Create X object that contains the following subset of the columns from 'data': # engines, passenger_capacity, crew, d_check_complete, moon_clearance_complete # 2. Take the values of 'price' column and put them into 'y' object @@ -314,7 +300,7 @@ def split_data(data: pd.DataFrame, parameters: Dict) -> List: # using 'train_test_split' function and 'test_size' and 'random_state' parameters return [X_train, X_test, y_train, y_test] - + def train_model(X_train: np.ndarray, y_train: np.ndarray) -> LinearRegression: """Train the linear regression model. @@ -331,7 +317,6 @@ def train_model(X_train: np.ndarray, y_train: np.ndarray) -> LinearRegression: regressor.fit(X_train, y_train) return regressor - def evaluate_model(regressor: LinearRegression, X_test: np.ndarray, y_test: np.ndarray): """Calculate the coefficient of determination and log the result. 
@@ -340,6 +325,8 @@ def evaluate_model(regressor: LinearRegression, X_test: np.ndarray, y_test: np.n X_test: Testing data of independent features. y_test: Testing data for price. + """ + """ # 1. Calculate predictions for 'X_test' using 'regressor' object # 2. Calculate R^2 score for the calculated predictions @@ -353,6 +340,7 @@ def evaluate_model(regressor: LinearRegression, X_test: np.ndarray, y_test: np.n CLICK TO SEE THE ANSWER ```python + import logging from typing import Dict, List @@ -373,7 +361,6 @@ def split_data(data: pd.DataFrame, parameters: Dict) -> List: A list containing split data. """ - X = data[ [ "engines", @@ -420,17 +407,17 @@ def evaluate_model(regressor: LinearRegression, X_test: np.ndarray, y_test: np.n score = r2_score(y_test, y_pred) logger = logging.getLogger(__name__) logger.info("Model has a coefficient R^2 of %.3f.", score) + ``` -Then we have to build the data science pipeline definition in `src/kedro_training/pipelines/data_science/pipeline.py`: +Then we have to build the data science pipeline definition in `src/kedro_training/pipelines/data_science/pipeline.py` with the following code. Follow the instructions to complete it: ```python from kedro.pipeline import Pipeline, node from .nodes import split_data, train_model, evaluate_model - def create_pipeline(**kwargs) -> Pipeline: """Create the project's pipeline. @@ -442,7 +429,7 @@ def create_pipeline(**kwargs) -> Pipeline: """ - # Here you need to construct a Data Science ('ds_pipeline') object, which satisfies + # Here you need to construct a data science ('ds_pipeline') object, which satisfies # the following requirements: # 1. Is an instance of a Pipeline class # 2. Contains 3 pipeline nodes: @@ -462,39 +449,31 @@ def create_pipeline(**kwargs) -> Pipeline: CLICK TO SEE THE ANSWER ```python -from kedro.pipeline import Pipeline, node - -from .nodes import split_data, train_model, evaluate_model - - -def create_pipeline(**kwargs) -> Pipeline: - """Create the project's pipeline. - - Args: - kwargs: Ignore any additional arguments added in the future. - Returns: - Pipeline object. +from kedro.pipeline import Pipeline, node - """ +from .nodes import evaluate_model, split_data, train_model - ds_pipeline = Pipeline( +def create_pipeline(**kwargs): + return Pipeline( [ node( - split_data, - ["master_table", "parameters"], - ["X_train", "X_test", "y_train", "y_test"], + func=split_data, + inputs=["master_table", "parameters"], + outputs=["X_train", "X_test", "y_train", "y_test"], ), - node(train_model, ["X_train", "y_train"], "regressor"), - node(evaluate_model, ["regressor", "X_test", "y_test"], None), - ], - name="ds", + node(func=train_model, inputs=["X_train", "y_train"], outputs="regressor"), + node( + func=evaluate_model, + inputs=["regressor", "X_test", "y_test"], + outputs=None, + ), + ] ) - - return ds_pipeline ``` + We also need to modify `conf/base/parameters.yml` by replacing its contents with the following: ```yaml @@ -504,72 +483,46 @@ random_state: 3 Don't forget to create an empty file `src/kedro_training/pipelines/data_science/__init__.py`. -Finally, let's add Data Science pipeline to `src/kedro_training/pipeline.py`: - -```python -from typing import Dict - -from kedro.pipeline import Pipeline - -from kedro_training.pipelines import data_engineering as de, data_science as ds +## Register the pipelines - -def create_pipelines(**kwargs) -> Dict[str, Pipeline]: - """Create the project's pipeline. - - Args: - kwargs: Ignore any additional arguments added in the future. 
- - Returns: - A mapping from a pipeline name to a ``Pipeline`` object. - - """ - - # Modify this function such that: - # 1. The Data Science pipeline object is created using 'create_pipeline' function - # from the above - # 2. The Data Science pipeline object is added to '__default__' pipeline - - return { - ... - } -``` - -
-CLICK TO SEE THE ANSWER +Finally, let's look at where to register both the data engineering and data science pipelines, in `src/kedro_training/hooks.py`: ```python -from typing import Dict +from typing import Any, Dict, Iterable, Optional +from kedro.config import ConfigLoader +from kedro.framework.hooks import hook_impl +from kedro.io import DataCatalog from kedro.pipeline import Pipeline +from kedro.versioning import Journal -from kedro_training.pipelines import data_engineering as de, data_science as ds - - -def create_pipelines(**kwargs) -> Dict[str, Pipeline]: - """Create the project's pipeline. +from kedro_training.pipelines import data_engineering as de +from kedro_training.pipelines import data_science as ds - Args: - kwargs: Ignore any additional arguments added in the future. - Returns: - A mapping from a pipeline name to a ``Pipeline`` object. +class ProjectHooks: + @hook_impl + def register_pipelines(self) -> Dict[str, Pipeline]: + """Register the project's pipeline. - """ + Returns: + A mapping from a pipeline name to a ``Pipeline`` object. - data_engineering_pipeline = de.create_pipeline() - data_science_pipeline = ds.create_pipeline() + """ + data_engineering_pipeline = de.create_pipeline() + data_science_pipeline = ds.create_pipeline() - return { - "de": data_engineering_pipeline, - "__default__": data_engineering_pipeline + data_science_pipeline, - } + return { + "__default__": data_engineering_pipeline + data_science_pipeline, + "de": data_engineering_pipeline, + "ds": data_science_pipeline, + } ``` -
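Registering the pipelines in `ProjectHooks` is only half of the wiring: Kedro also needs to know about the hooks class itself. In a Kedro 0.17.0 project template this registration typically lives in `src/kedro_training/settings.py`; the snippet below is a sketch that assumes your generated project contains such a file, so check your own template rather than copying it verbatim:

```python
# src/kedro_training/settings.py -- sketch only; your generated file may differ
from kedro_training.hooks import ProjectHooks

# Instantiate and register the project hooks so that Kedro can discover
# the register_pipelines() implementation shown above
HOOKS = (ProjectHooks(),)
```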
+ ## Run a modular pipeline -Run your Data Engineering pipeline from the terminal: +To run your data engineering pipeline from the terminal: ```bash kedro run --pipeline de @@ -585,41 +538,13 @@ context.run(pipeline_name="de") ## Modular pipeline structure -Modular pipelines are intended to be reusable across various projects. Therefore, it is crucial that you, as a pipeline developer, document how it should be used. We would suggest to follow this structure: - -```console -src/kedro_training/pipelines/ -β”œβ”€β”€ __init__.py -β”œβ”€β”€ nodes.py -β”œβ”€β”€ pipeline.py -└── README.md -``` - -where -* `__init__.py` - indicates that modular pipeline is a Python package -* `nodes.py` - contains all the node definitions -* `pipeline.py` - contains `create_pipeline()` function similar to the above example -* `README.md` - main documentation source for the end users with all the information regarding the execution of the pipeline - -### Unsupported components - -Kedro _does not_ automatically handle the following components of modular pipelines: -* external package dependencies defined in, say, `src/kedro_training/pipelines//requirements.txt`, those are _not_ currently installed by `kedro install` command -* YAML configuration files - for example, `src/kedro_training/pipelines//conf/base/catalog.yml`, these config files are _not_ discoverage by Kedro `ConfigLoader` by default - -If your modular pipeline requires installation of some third-party Python packages (e.g., `pandas`, `numpy`, `pyspark`, etc.), you need to explicitly document this in `README.md` and, ideally, provide the relevant installation command, for example: - -```bash -pip install -r src/kedro_training/pipelines//requirements.txt -``` - -> Note: Modular pipelines should not depend on the main Python package (`kedro_training` in the example above) as it would break the portability to another project. +Modular pipelines are intended to be reusable across various projects. As a pipeline developer, you should follow convention and document how your pipeline should be used. Consult the extensive [Kedro documentation about modular pipelines](https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/03_modular_pipelines.html) for further information ## Persisting the intermediate data It is important to emphasise that the Kedro pipeline is runnable only if _all_ free inputs, i.e. the datasets that are not produced by any of the nodes, are defined in `catalog.yml`. In the Spaceflights project those free inputs are: `companies`, `shuttles`, `reviews`. -All intermediary datasets, however, can be missing from the `catalog.yml`, and your pipeline will still run without errors. This is because Kedro automatically creates a `MemoryDataSet` for each intermediary dataset that is not defined in the `DataCatalog`. Intermediary datasets in the Spaceflights project are: `preprocessed_companies`, `preprocessed_shuttles`, `master_table`, `X_train`, `X_test`, `y_train`, `y_test`, `regressor`. +All intermediate datasets, however, can be missing from the `catalog.yml`, and your pipeline will still run without errors. This is because Kedro automatically creates a `MemoryDataSet` for each intermediary dataset that is not defined in the `DataCatalog`. Intermediary datasets in the Spaceflights project are: `preprocessed_companies`, `preprocessed_shuttles`, `master_table`, `X_train`, `X_test`, `y_train`, `y_test`, `regressor`. 
These `MemoryDataSet`s pass data across nodes during the run, but are automatically deleted after the run finishes, therefore if you want to have an access to those intermediary datasets after the run, you need to define them in `catalog.yml`. @@ -636,28 +561,14 @@ As you can see, dataset configuration contains `versioned: true` flag, which ena ## How to filter Kedro pipelines -Kedro has a flexible mechanism to filter the pipeline that you intend to run. Here is a list of CLI options supported out of the box: - -| CLI command | Description | Multiple options allowed? | -| ----------------------------------------------------- | ------------------------------------------------------------------------------- | ------------------------- | -| `kedro run --pipeline de` | Run the whole pipeline by its name | No | -| `kedro run --node debug_me --node debug_me_too` | Run only nodes with specified names | Yes | -| `kedro run --from-nodes node1,node2` | A list of node names which should be used as a starting point | No | -| `kedro run --to-nodes node3,node4` | A list of node names which should be used as an end point | No | -| `kedro run --from-inputs dataset1,dataset2` | A list of dataset names which should be used as a starting point | No | -| `kedro run --tag some_tag1 --tag some_tag2` | Run only nodes which have any of these tags attached | Yes | -| `kedro run --params param_key1:value1,param_key2:2.0` | Does a parametrised kedro run with `{"param_key1": "value1", "param_key2": 2}` | Yes | -| `kedro run --env env_name` | Run the pipeline in the env_name environment. Defaults to local if not provided | No | -| `kedro run --config config.yml` | Specify all command line options in a configuration file called `config.yml` | No | - -You can also combine these options together, so the command `kedro run --from-nodes split --to-nodes predict,report` will run all the nodes from `split` to `predict` and `report`. +Kedro has a flexible mechanism to filter the pipeline that you intend to run, which are listed in the [Kedro CLI documentation](https://kedro.readthedocs.io/en/stable/09_development/03_commands_reference.html?highlight=filter#modifying-a-kedro-run). ## Choosing a sequential or parallel runner -Having specified the data catalog and the pipeline, you are now ready to run the pipeline. There are two different runners you can specify: +Having specified the data catalog and the pipeline, you are now ready to run the pipeline. There are two different [Kedro runners](https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/04_run_a_pipeline.html) you can specify: * `SequentialRunner` - runs your nodes sequentially; once a node has completed its task then the next one starts. -* `ParallelRunner` - runs your nodes in parallel; independent nodes can run at the same time, allowing you to take advantage of multiple CPU cores. +* `ParallelRunner` - runs your nodes in parallel; independent nodes can run at the same time, allowing you to take advantage of multiple CPU cores or multiple threads. By default, Kedro uses a `SequentialRunner`, which is instantiated when you execute `kedro run` from the command line. Switching to use `ParallelRunner` is as simple as providing an additional flag when running the pipeline from the command line as follows: @@ -667,33 +578,17 @@ kedro run --parallel `ParallelRunner` executes the pipeline nodes in parallel, and is more efficient when there are independent branches in your pipeline. 
-> *Note:* `ParallelRunner` performs task parallelisation, which is different from data parallelisation as seen in PySpark.
+`ParallelRunner` performs task parallelisation, which is different from data parallelisation as seen in PySpark. You can also run the pipeline with multithreading for concurrent execution by specifying `ThreadRunner` as follows:

-## Visualising a pipeline
-
-Kedro-Viz shows you how your Kedro data pipelines are structured. With Kedro-Viz you can:
- - See how your datasets and Python functions (nodes) are resolved in Kedro so that you can understand how your data pipeline is built
- - Get a clear picture when you have lots of datasets and nodes by using tags to visualise sub-pipelines
- - Search for nodes and datasets
-
- You should already have `kedro-viz` installed according to these instructions [**here**](https://kedro.readthedocs.io/en/stable/03_tutorial/06_visualise_pipeline.html).
-
-### Using `kedro-viz`
-
-From your terminal, run:
-
-```
-kedro viz
+```bash
+kedro run --runner=ThreadRunner
```

-This command will run a server on http://127.0.0.1:4141 and will open up your visualisation on a browser.
-> *Note:* If port `4141` is already occupied, you can run Kedro-Viz server on a different port by executing `kedro viz --port `.
+> *Note:* `SparkDataSet` doesn't work correctly with `ParallelRunner`. To add concurrency to the pipeline with `SparkDataSet`, you must use `ThreadRunner`.
+>
+> For more information on how to maximise concurrency when using Kedro with PySpark, please visit our guide on [how to build a Kedro pipeline with PySpark](https://kedro.readthedocs.io/en/stable/11_tools_integration/01_pyspark.html).

-### Examples of `kedro-viz`
- - You can have a look at a retail ML use case [**here**](https://quantumblacklabs.github.io/kedro-viz/)
- - And an example of this Spaceflights pipeline [**here**](https://medium.com/@QuantumBlack/demystifying-machine-learning-complexity-through-visualisation-11a9d73db3c5)

-### Next section
-[Go to the next section](./08_transformers.md)
+_[Go to the next page](./08_visualisation.md)_
diff --git a/training_docs/08_visualisation.md b/training_docs/08_visualisation.md
new file mode 100644
index 0000000..ca4503b
--- /dev/null
+++ b/training_docs/08_visualisation.md
@@ -0,0 +1,28 @@
+
+## Visualise your pipeline
+
+[Kedro Viz](https://github.com/quantumblacklabs/kedro-viz) shows you how your Kedro data pipelines are structured. You can use it to:
+
+ - See how your datasets and Python functions (nodes) are resolved in Kedro so that you can understand how your data pipeline is built
+ - Get a clear picture when you have lots of datasets and nodes by using tags to visualise sub-pipelines
+ - Search for nodes and datasets
+
+Follow the [Kedro-Viz installation instructions](https://kedro.readthedocs.io/en/stable/03_tutorial/06_visualise_pipeline.html) to get set up.
+### Using `kedro-viz`
+
+From your terminal, run:
+
+```
+kedro viz
+```
+
+This command will run a server on http://127.0.0.1:4141 and will open up your visualisation in a browser.
+
+> *Note:* If port `4141` is already occupied, you can run the Kedro-Viz server on a different port by executing `kedro viz --port <port-number>`.
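For example, to serve the visualisation on port 8001 instead (the port number here is arbitrary):

```bash
kedro viz --port 8001
```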
+
+### Examples of `kedro-viz`
+
+ - You can have a look at a retail ML use case [**here**](https://quantumblacklabs.github.io/kedro-viz/)
+ - And an example of this Spaceflights pipeline [**here**](https://medium.com/@QuantumBlack/demystifying-machine-learning-complexity-through-visualisation-11a9d73db3c5)
+
+_[Go to the next page](./09_versioning.md)_
diff --git a/training_docs/09_versioning.md b/training_docs/09_versioning.md
new file mode 100644
index 0000000..01437d6
--- /dev/null
+++ b/training_docs/09_versioning.md
@@ -0,0 +1,40 @@
+
+# Versioning
+
+## Data versioning
+Making a simple addition to your Data Catalog allows you to perform versioning of datasets and machine learning models.
+
+Suppose you want to version `master_table`. To enable versioning, simply add a `versioned` entry in `catalog.yml` as follows:
+
+```yaml
+master_table:
+  type: pandas.CSVDataSet
+  filepath: data/03_primary/master_table.csv
+  versioned: true
+```
+
+The `DataCatalog` will create a versioned `CSVDataSet` called `master_table`. The actual CSV file location will look like `data/03_primary/master_table.csv/<version>/master_table.csv`, where the first `/master_table.csv/` is a directory and `<version>` corresponds to a global save version string formatted as `YYYY-MM-DDThh.mm.ss.sssZ`.
+
+In a similar way, you can version your machine learning model. Enable versioning for `regressor` as follows:
+
+```yaml
+regressor:
+  type: pickle.PickleDataSet
+  filepath: data/06_models/regressor.pickle
+  versioned: true
+```
+
+This will save versioned pickle models every time you run the pipeline.
+
+> *Note:* The list of datasets that support versioning can be found in [the documentation](https://kedro.readthedocs.io/en/stable/05_data/02_kedro_io.html#supported-datasets).
+
+## Loading a versioned dataset
+By default, the `DataCatalog` will load the latest version of the dataset. However, you can run the pipeline with a particular version of a dataset by using the `--load-version` flag as follows:
+
+```bash
+kedro run --load-version="master_table:YYYY-MM-DDThh.mm.ss.sssZ"
+```
+where `--load-version` contains a dataset name and a version timestamp separated by `:`.
+
+
+_[Go to the next page](./10_package_project.md)_
diff --git a/training_docs/10_package_project.md b/training_docs/10_package_project.md
new file mode 100644
index 0000000..a473342
--- /dev/null
+++ b/training_docs/10_package_project.md
@@ -0,0 +1,53 @@
+# Distribute a project
+
+In this part of the training, you will learn how to distribute your data project.
+
+# Package a project
+
+This section explains how to build your project documentation, and how to bundle your project into a Python package.
+
+## Add documentation to your project
+
+You can generate project-specific documentation by running `kedro build-docs` in the project's root directory. Kedro builds the resulting HTML files in `docs/build/html/`.
+
+The `build-docs` command creates documentation based on the code structure of your project. Documentation includes any [`docstrings`](https://www.datacamp.com/community/tutorials/docstrings-python) defined in your code.
+
+Kedro uses the [Sphinx](https://www.sphinx-doc.org) framework, so if you want to customise your documentation, please refer to `docs/source/conf.py` and the [corresponding section of the Sphinx documentation](https://www.sphinx-doc.org/en/master/usage/configuration.html).
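As a sketch of what that documentation picks up, here is a simplified stand-in for a spaceflights preprocessing node with a docstring that `kedro build-docs` (via Sphinx) would render; the function body is illustrative only:

```python
import pandas as pd


def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Preprocess the raw companies data.

    Args:
        companies: Raw companies data.

    Returns:
        A cleaned DataFrame, ready to be joined into the master table.
    """
    # Illustrative cleaning step only.
    return companies.dropna()
```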
+
+
+## Package your project
+
+To package your project, run the following in your project's root directory:
+
+```bash
+kedro package
+```
+
+Kedro builds the package into the `src/dist/` folder of your project, and creates one `.egg` file and one `.whl` file, which are [Python packaging formats for binary distribution](https://packaging.python.org/overview/).
+
+The resulting package only contains the Python source code of your Kedro pipeline, not any of the `conf/`, `data/` and `logs/` subfolders. This means that you can distribute the project to run elsewhere, such as on a separate computer with different configuration, data and logging. When distributed, the packaged project must be run from within a directory that contains the `conf/` subfolder (and `data/` and `logs/` if your pipeline loads/saves local data or uses logging).
+
+Recipients of the `.egg` and `.whl` files need to have Python and `pip` on their machines, but do not need to have Kedro installed. To install the project, a recipient navigates to the root of a folder that contains the relevant `conf/`, `data/` and `logs/` subfolders and calls:
+
+```bash
+pip install <path-to-wheel-file>
+```
+
+For example, having installed project `kedro-spaceflights` and package `kedro_spaceflights`, a recipient can run the Kedro project as follows from the root of the project:
+
+```bash
+python -m kedro_spaceflights.run
+```
+
+An executable, `kedro-spaceflights`, is also placed in the `bin/` subfolder of the Python installation location.
+
+
+### Docker and Airflow
+
+We support the [Kedro-Docker](https://github.com/quantumblacklabs/kedro-docker) plugin for packaging and shipping Kedro projects within [Docker](https://www.docker.com/) containers.
+
+We also support [Kedro-Airflow](https://github.com/quantumblacklabs/kedro-airflow) to convert your Kedro project into an [Airflow](https://airflow.apache.org/) project.
+
+
+### Next section
+_[Go to the next section](./11_configuration.md)_
diff --git a/training_docs/11_configuration.md b/training_docs/11_configuration.md
new file mode 100644
index 0000000..b04de7d
--- /dev/null
+++ b/training_docs/11_configuration.md
@@ -0,0 +1,131 @@
+# Configuration
+
+We recommend that you keep all configuration files in the `conf` directory of a Kedro project. However, if you prefer, you may point Kedro to any other directory and change the configuration paths by setting the `CONF_ROOT` variable in `src/kedro_training/settings.py` as follows:
+
+```python
+# ...
+CONF_ROOT = "new_conf"
+```
+
+## Loading configuration
+Kedro-specific configuration (e.g., `DataCatalog` configuration for IO) is loaded using the `ConfigLoader` class:
+
+```python
+from kedro.config import ConfigLoader
+
+conf_paths = ["conf/base", "conf/local"]
+conf_loader = ConfigLoader(conf_paths)
+conf_catalog = conf_loader.get("catalog*", "catalog*/**")
+```
+
+This will recursively scan for configuration files, first in the `conf/base/` directory and then in the `conf/local/` directory, according to the following rules:
+
+* ANY of the following is true:
+  * filename starts with `catalog` OR
+  * file is located in a sub-directory whose name is prefixed with `catalog`
+* AND file extension is one of the following: `yaml`, `yml`, `json`, `ini`, `pickle`, `xml` or `properties`
+
+Configuration information from files stored in `base` or `local` that match these rules is merged at runtime and returned in the form of a config dictionary:
+
+* If any 2 configuration files located inside the same environment path (`conf/base/` or `conf/local/` in this example) contain the same top-level key, `load_config` will raise a `ValueError` indicating that duplicate keys are not allowed.
+
+> *Note:* Any top-level keys that start with the `_` character are considered hidden (or reserved) and therefore are ignored right after the config load. Those keys will neither trigger the key duplication error mentioned above, nor will they appear in the resulting configuration dictionary. However, you may still use such keys for various purposes. For example, as [YAML anchors and aliases](https://support.atlassian.com/bitbucket-cloud/docs/yaml-anchors/).
+
+* If 2 configuration files have duplicate top-level keys, but are placed into different environment paths (one in `conf/base/`, another in `conf/local/`, for example) then the last loaded path (`conf/local/` in this case) takes precedence and overrides that key value. `ConfigLoader.get(<pattern>, ...)` will not raise any errors; however, a `DEBUG`-level log message will be emitted with information on the overridden keys.
+* If the same environment path is passed multiple times, a `UserWarning` will be emitted to draw attention to the duplicate loading attempt, and any subsequent loading after the first one will be skipped.
+
+
+## Additional config environments
+
+In addition to the 2 built-in configuration environments, it is possible to create your own. Your project loads `conf/base/` as the bottom-level configuration environment but allows you to overwrite it with any other environments that you create. You are able to create environments like `conf/server/`, `conf/test/`, etc. Any additional configuration environments can be created inside the `conf` folder and loaded by running the following command:
+
+```bash
+kedro run --env=test
+```
+
+If no `env` option is specified, this will default to using the `local` environment to overwrite `conf/base`.
+
+> *Note*: If, for some reason, your project does not have any other environments apart from `base`, i.e. no `local` environment to default to, you will need to customise `KedroContext` to take `env="base"` in the constructor and then specify your custom `KedroContext` subclass in `src/<package_name>/settings.py` under the `CONTEXT_CLASS` key.
+
+If you set the `KEDRO_ENV` environment variable to the name of your environment, Kedro will load that environment for your `kedro run`, `kedro ipython`, `kedro jupyter notebook` and `kedro jupyter lab` sessions.
+
+```bash
+export KEDRO_ENV=test
+```
+
+> *Note*: If you specify both the `KEDRO_ENV` environment variable and provide the `--env` argument to a CLI command, the CLI argument takes precedence.
+
+## Templating configuration
+Kedro also provides an extension class, `TemplatedConfigLoader`, which allows you to template values in your configuration files. `TemplatedConfigLoader` is available in `kedro.config`. To apply templating to your project, you will need to update the `register_config_loader` hook implementation in your `src/<package_name>/hooks.py`:
+
+```python
+from typing import Iterable
+
+from kedro.framework.hooks import hook_impl
+
+from kedro.config import ConfigLoader, TemplatedConfigLoader  # TemplatedConfigLoader is the new import
+
+
+class ProjectHooks:
+    @hook_impl
+    def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
+        return TemplatedConfigLoader(
+            conf_paths,
+            globals_pattern="*globals.yml",  # read the globals dictionary from project config
+            globals_dict={  # extra keys to add to the globals dictionary, take precedence over globals_pattern
+                "bucket_name": "another_bucket_name",
+                "non_string_key": 10,
+            },
+        )
+```
+
+Let's assume the project contains a `conf/base/globals.yml` file with the following contents:
+
+```yaml
+bucket_name: "my_s3_bucket"
+key_prefix: "my/key/prefix/"
+
+datasets:
+  csv: "pandas.CSVDataSet"
+  spark: "spark.SparkDataSet"
+
+folders:
+  raw: "01_raw"
+  int: "02_intermediate"
+  pri: "03_primary"
+  fea: "04_feature"
+```
+
+The contents of the dictionary resulting from `globals_pattern` get merged with the `globals_dict` dictionary. In case of conflicts, the keys from the `globals_dict` dictionary take precedence. The resulting global dictionary prepared by `TemplatedConfigLoader` will look like this:
+
+```python
+{
+    "bucket_name": "another_bucket_name",
+    "non_string_key": 10,
+    "key_prefix": "my/key/prefix",
+    "datasets": {
+        "csv": "pandas.CSVDataSet",
+        "spark": "spark.SparkDataSet"
+    },
+    "folders": {
+        "raw": "01_raw",
+        "int": "02_intermediate",
+        "pri": "03_primary",
+        "fea": "04_feature",
+    },
+}
```
+
+Now the templating can be applied to the configs. Here is an example of a templated `conf/base/catalog.yml`:
+
+```yaml
+raw_boat_data:
+  type: "${datasets.spark}"  # nested paths into global dict are allowed
+  filepath: "s3a://${bucket_name}/${key_prefix}/${folders.raw}/boats.csv"
+  file_format: parquet
+
+raw_car_data:
+  type: "${datasets.csv}"
+  filepath: "s3://${bucket_name}/data/${key_prefix}/${folders.raw}/${filename|cars.csv}"  # default to 'cars.csv' if the 'filename' key is not found in the global dict
+```
+
+> Note: `TemplatedConfigLoader` uses the `jmespath` package in the background to extract elements from the global dictionary. For more information about JMESPath syntax, please see: https://github.com/jmespath/jmespath.py.
+
+
+_[Go to the next page](./12_transcoding.md)_
diff --git a/training_docs/12_transcoding.md b/training_docs/12_transcoding.md
new file mode 100644
index 0000000..7d71755
--- /dev/null
+++ b/training_docs/12_transcoding.md
@@ -0,0 +1,41 @@
+# Transcoding
+
+You may come across a situation where you would like to read the same file using two different dataset implementations. Use transcoding when you want to load and save the same file, via its specified `filepath`, using different `DataSet` implementations.
+
+## A typical example of transcoding
+
+For instance, Parquet files can be loaded not only via the `ParquetDataSet` (using `pandas`), but also directly by `SparkDataSet`. This conversion is typical when coordinating a `Spark` to `pandas` workflow.
+
+To enable transcoding, define two `DataCatalog` entries for the same dataset in a common format (Parquet, JSON, CSV, etc.) in your `conf/base/catalog.yml`:
+
+```yaml
+my_dataframe@spark:
+  type: spark.SparkDataSet
+  filepath: data/02_intermediate/data.parquet
+  file_format: parquet
+
+my_dataframe@pandas:
+  type: pandas.ParquetDataSet
+  filepath: data/02_intermediate/data.parquet
+```
+
+These entries are used in the pipeline like this:
+
+```python
+Pipeline(
+    [
+        node(func=my_func1, inputs="spark_input", outputs="my_dataframe@spark"),
+        node(func=my_func2, inputs="my_dataframe@pandas", outputs="pipeline_output"),
+    ]
+)
+```
+
+## How does transcoding work?
+
+In this example, Kedro understands that `my_dataframe` is the same dataset in its `spark.SparkDataSet` and `pandas.ParquetDataSet` formats and helps resolve the node execution order.
+
+In the pipeline, Kedro uses the `spark.SparkDataSet` implementation for saving and `pandas.ParquetDataSet`
+for loading, so the first node should output a `pyspark.sql.DataFrame`, while the second node would receive a `pandas.DataFrame`.
+
+
+_[Go to the next page](./13_custom_datasets.md)_
diff --git a/training_docs/13_custom_datasets.md b/training_docs/13_custom_datasets.md
new file mode 100644
index 0000000..33f210a
--- /dev/null
+++ b/training_docs/13_custom_datasets.md
@@ -0,0 +1,31 @@
+# Custom datasets
+
+Kedro supports a number of datasets out of the box, but you can also add support for any proprietary data format or filesystem in your pipeline.
+
+You can find further information about [how to add support for custom datasets](https://kedro.readthedocs.io/en/stable/07_extend_kedro/03_custom_datasets.html) in specific documentation covering advanced usage.
+
+## Contributing a custom dataset implementation
+
+One of the easiest ways to contribute back to Kedro is to share a custom dataset. Kedro has a `kedro.extras.datasets` sub-package where you can add a new custom dataset implementation to share it with others. You can find out more in the [Kedro contribution guide](https://github.com/quantumblacklabs/kedro/blob/master/CONTRIBUTING.md) on GitHub.
+
+To contribute your custom dataset:
+
+1. Add your dataset package to `kedro/extras/datasets/`.
+
+For example, for an `ImageDataSet`, the directory structure would be:
+
+```
+kedro/extras/datasets/image
+β”œβ”€β”€ __init__.py
+└── image_dataset.py
+```
+
+2. If the dataset is complex, create a `README.md` file to explain how it works and document its API.
+
+3. The dataset should be accompanied by full test coverage in `tests/extras/datasets`.
+
+4. Make a pull request against the `master` branch of [Kedro's GitHub repository](https://github.com/quantumblacklabs/kedro).
+
+
+_[Go to the next page](./14_custom_cli_commands.md)_
+
diff --git a/docs/14_custom-cli-commands.md b/training_docs/14_custom_cli_commands.md
similarity index 92%
rename from docs/14_custom-cli-commands.md
rename to training_docs/14_custom_cli_commands.md
index 336964f..b47da3d 100644
--- a/docs/14_custom-cli-commands.md
+++ b/training_docs/14_custom_cli_commands.md
@@ -26,7 +26,7 @@ cli.add_command(custom)
Once you have made the modification, if you run `kedro -h` in your terminal, you see the new `custom` group.

-![](../img/custom_command.png)
+![](./images/custom_command.png)

Run the custom command:

@@ -40,5 +40,5 @@ $ kedro custom to-json
The only possible way to extend an existing command is to create a new custom command as described in the previous section.
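For reference, a minimal sketch of what the `custom` group referred to above might look like in the project's `cli.py`; the command body is invented, and only the Click wiring is the point:

```python
import click


@click.group(name="custom")
def custom():
    """Project-specific commands."""


@custom.command(name="to-json")
def to_json():
    """Hypothetical command that prints a placeholder JSON payload."""
    click.echo('{"status": "ok"}')


# In cli.py, the group is then registered on the main `cli` group:
# cli.add_command(custom)
```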
-### Next section -[Go to the next section](./15_plugins.md) +_[Go to the next page](./15_plugins.md)_ + diff --git a/docs/15_plugins.md b/training_docs/15_plugins.md similarity index 97% rename from docs/15_plugins.md rename to training_docs/15_plugins.md index b51e982..562f1e8 100644 --- a/docs/15_plugins.md +++ b/training_docs/15_plugins.md @@ -141,7 +141,7 @@ The following conditions must be true for Airflow to run your pipeline: ## Process 1. Run `kedro airflow create` to generate a DAG file for your project. -2. If needed, customize the DAG file as described [below](https://github.com/quantumblacklabs/kedro-airflow/blob/master/README.md#customization). +2. If needed, customize the DAG file. 3. Run `kedro airflow deploy` which will copy the DAG file from the `airflow_dags` folder in your Kedro project into the `dags` folder in the Airflow home directory. > *Note:* The generated DAG file will be placed in `$AIRFLOW_HOME/dags/` when `kedro airflow deploy` is run, where `AIRFLOW_HOME` is an environment variable. If the environment variable is not defined, Kedro-Airflow will create `~/airflow` and `~/airflow/dags` (if required) and copy the DAG file into it. @@ -150,6 +150,6 @@ If you need more customisation for Airflow, you can find more information in the Once `dags` folder is created, you can perform airflow commands. For example, you could run `airflow initdb` to initialise the Airflow SQLite database `airflow.db` under `$AIRFLOW_HOME/dags/`, or `airflow webserver` to start Flask server for Airflow UI as follows: -![Airflow UI](../img/airflow_ui.png) +![Airflow UI](./images/airflow_ui.png) You can find more details about Airflow command in their [documentation](https://airflow.apache.org/howto/index.html). diff --git a/training_docs/images/airflow_ui.png b/training_docs/images/airflow_ui.png new file mode 100644 index 0000000..234aaf9 Binary files /dev/null and b/training_docs/images/airflow_ui.png differ diff --git a/training_docs/images/custom_command.png b/training_docs/images/custom_command.png new file mode 100644 index 0000000..bc16ef9 Binary files /dev/null and b/training_docs/images/custom_command.png differ diff --git a/training_docs/images/jupyter_create_new_notebook.png b/training_docs/images/jupyter_create_new_notebook.png new file mode 100644 index 0000000..476d301 Binary files /dev/null and b/training_docs/images/jupyter_create_new_notebook.png differ diff --git a/training_docs/images/jupyter_notebook_loading_context.png b/training_docs/images/jupyter_notebook_loading_context.png new file mode 100644 index 0000000..b2d95f3 Binary files /dev/null and b/training_docs/images/jupyter_notebook_loading_context.png differ diff --git a/training_docs/images/jupyter_notebook_showing_context.png b/training_docs/images/jupyter_notebook_showing_context.png new file mode 100644 index 0000000..b8707ea Binary files /dev/null and b/training_docs/images/jupyter_notebook_showing_context.png differ diff --git a/training_docs/images/jupyter_notebook_workflow_activating_tags.png b/training_docs/images/jupyter_notebook_workflow_activating_tags.png new file mode 100644 index 0000000..d104567 Binary files /dev/null and b/training_docs/images/jupyter_notebook_workflow_activating_tags.png differ diff --git a/training_docs/images/jupyter_notebook_workflow_loading_data.png b/training_docs/images/jupyter_notebook_workflow_loading_data.png new file mode 100644 index 0000000..944fddc Binary files /dev/null and b/training_docs/images/jupyter_notebook_workflow_loading_data.png differ diff --git 
a/training_docs/images/jupyter_notebook_workflow_tagging_nodes.png b/training_docs/images/jupyter_notebook_workflow_tagging_nodes.png new file mode 100644 index 0000000..cc5a691 Binary files /dev/null and b/training_docs/images/jupyter_notebook_workflow_tagging_nodes.png differ