Toy project using Kedro and MLflow to demonstrate how to track data science experiments and make them easier to collaborate on.

bkolasa/kedro-workshop


Kedro project including experiment tracking

Prerequisites

The main branch of this project contains the content of a finished workshop. To start the workshop from scratch, first switch to the start_from_scratch git branch.

git switch start_from_scratch

We will be using a pipenv virtual environment. Let's install pipenv first.

pip install pipenv

Then we install all of the required dependencies.

pipenv install

This command will install kedro, kedro-viz, and other required packages. Finally, we can enter the virtual environment shell.

pipenv shell

Bootstrapping a new Kedro project

During the workshop we will work on an example project provided by Kedro. We can initialize it by executing the following command.

kedro new --starter=spaceflights

Working with Kedro

Let's start by visualizing our pipeline in order to investigate what it actually does. The following command starts a Kedro-Viz instance and opens it in the browser.

kedro viz

As a next step, let's run our full model training pipeline and create artifacts.

kedro run

The kedro run command is highly customizable and lets us choose what to run. For example, the --pipeline argument runs only the selected pipeline.

kedro run --pipeline data_processing

--to-outputs runs the whole chain of operations required to produce the given output.

kedro run --to-outputs="evaluation_plot"

--from-inputs runs the whole path starting from the given dataset.

kedro run --from-inputs=model_input_table

We can also override parameters configured in the parameters.yml file.

kedro run --params model_options.test_size=0.1
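The overridden model_options.test_size value lives in conf/base/parameters.yml. As a rough sketch (based on the spaceflights starter; your file may differ):

```yaml
model_options:
  test_size: 0.2      # fraction of the data held out for testing
  random_state: 3     # seed for the train/test split
```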

Tasks

  1. Add a new parameter to the linear regression model in the data science pipeline (e.g. the n_jobs parameter of LinearRegression). Check how the Kedro-Viz diagram has changed and that you can set the parameter via the command line. (10 min.)
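As a rough sketch of what the modified node could look like (the function signature and parameter names are assumptions based on the spaceflights starter, not the exact workshop code):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def train_model(X_train: pd.DataFrame, y_train: pd.Series, parameters: dict) -> LinearRegression:
    """Train a linear regression model, reading n_jobs from parameters.yml."""
    # "model_options.n_jobs" is the newly added parameter; default to 1 if unset.
    regressor = LinearRegression(n_jobs=parameters["model_options"].get("n_jobs", 1))
    regressor.fit(X_train, y_train)
    return regressor
```

With the parameter wired into the node like this, kedro run --params model_options.n_jobs=2 would override it at runtime.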

Tracking experiments

Starting experiment tracking in Kedro requires modifying the project. First, we need to set up the store for our experiments.

  1. Paste this snippet into settings.py:

from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore
from pathlib import Path

SESSION_STORE_CLASS = SQLiteStore
SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / "data")}

  2. Create a directory for tracking artifacts:

mkdir -p data/09_tracking

  3. Add metric datasets to the catalog:

metrics:
  type: tracking.MetricsDataSet
  filepath: data/09_tracking/metrics.json

companies_columns:
  type: tracking.JSONDataSet
  filepath: data/09_tracking/companies_columns.json

  4. Modify the pipeline and nodes. See https://docs.kedro.org/en/stable/experiment_tracking/index.html#modify-your-nodes-and-pipelines-to-log-metrics for more details.
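For the last step, a node that logs metrics simply returns a dictionary that the pipeline maps to the tracked metrics dataset. A minimal sketch (assuming a scikit-learn regressor, in the spirit of the spaceflights evaluation node; names are assumptions):

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score


def evaluate_model(regressor, X_test: pd.DataFrame, y_test: pd.Series) -> dict:
    """Return metrics as a dict; Kedro saves it to the tracked metrics dataset."""
    y_pred = regressor.predict(X_test)
    return {
        "r2_score": r2_score(y_test, y_pred),
        "mae": mean_absolute_error(y_test, y_pred),
    }
```

In the pipeline definition, map this node's output to the "metrics" catalog entry so each run records the values.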

Tasks

  1. Register more parameters for tracking.
  2. Save all the parameters defined in the catalog into experiment tracking.
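For the second task, one possible approach (the entry name and path below are assumptions, not prescribed by the workshop) is to add another tracking dataset to the catalog and have a node return the parameters it received:

```yaml
model_options_tracking:
  type: tracking.JSONDataSet
  filepath: data/09_tracking/model_options.json
```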

Deploying your model

Out of the box, Kedro offers building Python .whl and .egg packages.

kedro package

However, we can leverage container support by using the kedro-docker plugin. The following command generates a Dockerfile for us.

pipenv install kedro-docker && kedro docker init

Then, we can build an image that we can distribute.

kedro docker build

It is highly likely that the build will fail due to missing OS dependencies. Adding the following command to the Dockerfile should help.

RUN apt-get update && apt-get -y install python3-dev \
                        gcc \
                        libc-dev
