The main branch of this project contains the content of the finished workshop. If we want to start the workshop from scratch, we first need to switch to the `start_from_scratch` git branch.
git switch start_from_scratch
We will be using a pipenv virtual environment. Let's install it first.
pip install pipenv
Then we install all of the required dependencies.
pipenv install
This command will install `kedro`, `kedro-viz` and other required packages.
Finally, we can enter the virtual env shell.
pipenv shell
During the workshop we will work on an example project provided by Kedro. We can initialize it by executing the following command.
kedro new --starter=spaceflights
Let's start by visualizing our pipeline to investigate what it actually does. The following command starts a Kedro-Viz instance and opens it in the browser.
kedro viz
As a next step, let's run our full model training pipeline and create the artifacts.
kedro run
The `kedro run` command is highly customizable and we can choose what to run. For example, the `--pipeline` argument runs only a selected pipeline.
kedro run --pipeline data_processing
The `--to-outputs` option runs the whole path of operations required to produce the given output.
kedro run --to-outputs="evaluation_plot"
The `--from-inputs` option runs the whole path starting from the given dataset.
kedro run --from-inputs=model_input_table
We can also modify parameters configured in the `parameters.yml` file.
kedro run --params model_options.test_size=0.1
- Add a new parameter to the linear regression model in the data science pipeline (e.g. the `n_jobs` parameter of `LinearRegression`). Check how the Kedro-Viz diagram has changed and that you can specify the parameter via the command line. (10 min.)
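As a sketch of where this exercise could start (the `n_jobs` value and its placement are assumptions, not the workshop solution), the new option can live alongside the existing `model_options` in `conf/base/parameters.yml` and be forwarded to the model inside the training node:

```yaml
# conf/base/parameters.yml -- sketch; n_jobs is the hypothetical new entry
model_options:
  test_size: 0.2
  random_state: 3
  n_jobs: 2  # forwarded to LinearRegression(n_jobs=...) in the training node
```

Once defined there, it can be overridden on the command line just like `test_size`, e.g. `kedro run --params model_options.n_jobs=4`.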
Starting experiment tracking in Kedro requires modifying the project. First, we need to set up the store for our experiments.
- Paste this snippet into `settings.py`
from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore
from pathlib import Path
SESSION_STORE_CLASS = SQLiteStore
SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / "data")}
- Create a directory for tracking artifacts
mkdir -p data/09_tracking
- Add metric artifacts to the catalog
metrics:
  type: tracking.MetricsDataSet
  filepath: data/09_tracking/metrics.json

companies_columns:
  type: tracking.JSONDataSet
  filepath: data/09_tracking/companies_columns.json
- Modify the pipeline and nodes. See https://docs.kedro.org/en/stable/experiment_tracking/index.html#modify-your-nodes-and-pipelines-to-log-metrics for more.
- Register more parameters into tracking
- Save all the parameters defined in catalog into model tracking.
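To make the bullets above concrete, here is a minimal sketch, under the assumption that metrics and parameter snapshots are returned as plain dictionaries from nodes. The function names, metric formulas, and the `snapshot_params` helper are illustrative, not the workshop's actual code; Kedro persists what such a node returns through the `tracking.MetricsDataSet` / `tracking.JSONDataSet` entries above when the node's output name matches the catalog entry.

```python
# Illustrative tracking nodes (names and formulas are assumptions,
# not the spaceflights implementation).

def evaluate_model(y_true: list, y_pred: list) -> dict:
    """Return metrics as a dict so Kedro can save them to the
    tracking.MetricsDataSet catalog entry named "metrics"."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {"mae": mae, "r2_score": 1 - ss_res / ss_tot}


def snapshot_params(parameters: dict) -> dict:
    """Hypothetical helper: copy the parameters a run used so they can
    be written to a tracking.JSONDataSet entry alongside the metrics."""
    return {"model_options": parameters.get("model_options", {})}
```

In the pipeline definition, the output name must match the catalog entry, e.g. `node(evaluate_model, inputs=["y_test", "y_pred"], outputs="metrics")`.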
In its basic version, Kedro offers building Python `.whl` and `.egg` packages.
kedro package
However, we can leverage container support by using the kedro-docker plugin. The following command generates a `Dockerfile` for us.
pipenv install kedro-docker && kedro docker init
Then we can create an image that we can distribute.
kedro docker build
It is highly likely that the build will fail due to missing OS dependencies. Adding the following command to the `Dockerfile` should help.
RUN apt-get update && apt-get -y install python3-dev \
gcc \
libc-dev