Release 0.18.0
TL;DR ✨
Kedro 0.18.0 strives to reduce the complexity of the project template and get us closer to a stable release of the framework. We've introduced the full micro-packaging workflow 📦, which allows you to import packages, utility functions and existing pipelines into your Kedro project. Integration with IPython and Jupyter has been streamlined in preparation for enhancements to Kedro's interactive workflow. Additionally, the release comes with long-awaited Python 3.9 and 3.10 support 🐍.
Major features and improvements
Framework
- Added `kedro.config.abstract_config.AbstractConfigLoader` as an abstract base class for all `ConfigLoader` implementations. `ConfigLoader` and `TemplatedConfigLoader` now inherit directly from this base class.
- Streamlined the `ConfigLoader.get` and `TemplatedConfigLoader.get` API and delegated the actual `get` method functional implementation to the `kedro.config.common` module.
- The `hook_manager` is no longer a global singleton. The `hook_manager` lifecycle is now managed by the `KedroSession`, and a new `hook_manager` will be created every time a `session` is instantiated.
- Added support for specifying parameters mapping in `pipeline()` without the `params:` prefix.
- Added new API `Pipeline.filter()` (previously in `KedroContext._filter_pipeline()`) to filter parts of a pipeline.
- Added `username` to Session store for logging during Experiment Tracking.
- A packaged Kedro project can now be imported and run from another Python project as follows:
```python
from my_package.__main__ import main

main(
    ["--pipeline", "my_pipeline"]
)  # or just main() if no parameters are needed for the run
```
Project template
- Removed `cli.py` from the Kedro project template. By default, all CLI commands, including `kedro run`, are now defined on the Kedro framework side. You can still define custom CLI commands by creating your own `cli.py`.
- Removed `hooks.py` from the Kedro project template. Registration hooks have been removed in favour of `settings.py` configuration, but you can still define execution timeline hooks by creating your own `hooks.py`.
- Removed `.ipython` directory from the Kedro project template. The IPython/Jupyter workflow no longer uses IPython profiles; it now uses an IPython extension.
- The default `kedro` run configuration environment names can now be set in `settings.py` using the `CONFIG_LOADER_ARGS` variable. The relevant keyword arguments to supply are `base_env` and `default_run_env`, which are set to `base` and `local` respectively by default.
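As a minimal sketch of the last point, a project's `settings.py` could spell out both environment names explicitly. The values shown are simply the defaults stated above; a project with differently named environment folders would substitute its own names.

```python
# settings.py (fragment): environment names used when loading configuration.
# Both values are the defaults described above, written out explicitly.
CONFIG_LOADER_ARGS = {
    "base_env": "base",
    "default_run_env": "local",
}
```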
DataSets
- Added the following new datasets:
| Type | Description | Location |
|---|---|---|
| `pandas.XMLDataSet` | Read XML into Pandas DataFrame. Write Pandas DataFrame to XML | `kedro.extras.datasets.pandas` |
| `networkx.GraphMLDataSet` | Work with NetworkX using GraphML files | `kedro.extras.datasets.networkx` |
| `networkx.GMLDataSet` | Work with NetworkX using Graph Modelling Language files | `kedro.extras.datasets.networkx` |
| `redis.PickleDataSet` | Loads/saves data from/to a Redis database | `kedro.extras.datasets.redis` |
- Added `partitionBy` support and exposed `save_args` for `SparkHiveDataSet`.
- Exposed `open_args_save` in `fs_args` for `pandas.ParquetDataSet`.
- Refactored the `load` and `save` operations for `pandas` datasets in order to leverage `pandas` own API and delegate `fsspec` operations to them. This reduces the need to have our own `fsspec` wrappers.
- Merged `pandas.AppendableExcelDataSet` into `pandas.ExcelDataSet`.
- Added `save_args` to `feather.FeatherDataSet`.
Jupyter and IPython integration
- The only recommended way to work with Kedro in Jupyter or IPython is now the Kedro IPython extension. Managed Jupyter instances should load this via `%load_ext kedro.extras.extensions.ipython` and use the line magic `%reload_kedro`.
- `kedro ipython` launches an IPython session that preloads the Kedro IPython extension.
- `kedro jupyter notebook/lab` creates a custom Jupyter kernel that preloads the Kedro IPython extension and launches a notebook with that kernel selected. There is no longer a need to specify `--all-kernels` to show all available kernels.
Dependencies
- Bumped the minimum version of `pandas` to 1.3. Any `storage_options` should continue to be specified under `fs_args` and/or `credentials`.
- Added support for Python 3.9 and 3.10, dropped support for Python 3.6.
- Updated `black` dependency in the project template to a non pre-release version.
Other
- Documented distribution of Kedro pipelines with Dask.
Breaking changes to the API
Framework
- Removed `RegistrationSpecs` and its associated `register_config_loader` and `register_catalog` hook specifications in favour of `CONFIG_LOADER_CLASS`/`CONFIG_LOADER_ARGS` and `DATA_CATALOG_CLASS` in `settings.py`.
- Removed deprecated functions `load_context` and `get_project_context`.
- Removed deprecated `CONF_SOURCE`, `package_name`, `pipeline`, `pipelines`, `config_loader` and `io` attributes from `KedroContext` as well as the deprecated `KedroContext.run` method.
- Added the `PluginManager` `hook_manager` argument to `KedroContext` and the `Runner.run()` method, which will be provided by the `KedroSession`.
- Removed the public method `get_hook_manager()` and replaced its functionality by `_create_hook_manager()`.
- Enforced that only one run can be successfully executed as part of a `KedroSession`. `run_id` has been renamed to `session_id` as a result.
Configuration loaders
- The `settings.py` setting `CONF_ROOT` has been renamed to `CONF_SOURCE`. Default value of `conf` remains unchanged.
- `ConfigLoader` and `TemplatedConfigLoader` argument `conf_root` has been renamed to `conf_source`.
- `extra_params` has been renamed to `runtime_params` in `kedro.config.config.ConfigLoader` and `kedro.config.templated_config.TemplatedConfigLoader`.
- The environment defaulting behaviour has been removed from `KedroContext` and is now implemented in a `ConfigLoader` class (or equivalent) with the `base_env` and `default_run_env` attributes.
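The two keyword renames above are mechanical, so a small helper can translate old-style keyword dictionaries before they are passed on. This is only a sketch: `migrate_config_loader_kwargs` and `RENAMED_KWARGS` are hypothetical names, not part of Kedro, and the sample values are made up for illustration.

```python
from typing import Any, Dict

# Old keyword name -> new keyword name, as listed in the renames above.
RENAMED_KWARGS = {
    "conf_root": "conf_source",
    "extra_params": "runtime_params",
}


def migrate_config_loader_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of ``kwargs`` with the renamed keywords updated.

    Keys that were not renamed are passed through unchanged.
    """
    return {RENAMED_KWARGS.get(name, name): value for name, value in kwargs.items()}


# Illustrative 0.17-style keyword arguments (values are placeholders).
old_kwargs = {"conf_root": "conf", "extra_params": {"test_size": 0.3}, "env": "local"}
new_kwargs = migrate_config_loader_kwargs(old_kwargs)
print(new_kwargs)
```

The helper returns a new dictionary and leaves the input untouched, so it can be applied safely to shared configuration objects.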
DataSets
- `pandas.ExcelDataSet` now uses `openpyxl` engine instead of `xlrd`.
- `pandas.ParquetDataSet` now calls `pd.to_parquet()` upon saving. Note that the argument `partition_cols` is not supported.
- `spark.SparkHiveDataSet` API has been updated to reflect `spark.SparkDataSet`. The `write_mode=insert` option has also been replaced with `write_mode=append` as per Spark styleguide. This change addresses Issue 725 and Issue 745. Additionally, `upsert` mode now leverages `checkpoint` functionality and requires a valid `checkpointDir` be set for current `SparkContext`.
- `yaml.YAMLDataSet` can no longer save a `pandas.DataFrame` directly, but it can save a dictionary. Use `pandas.DataFrame.to_dict()` to convert your `pandas.DataFrame` to a dictionary before you attempt to save it to YAML.
- Removed `open_args_load` and `open_args_save` from the following datasets:
  - `pandas.CSVDataSet`
  - `pandas.ExcelDataSet`
  - `pandas.FeatherDataSet`
  - `pandas.JSONDataSet`
  - `pandas.ParquetDataSet`
- `storage_options` are now dropped if they are specified under `load_args` or `save_args` for the following datasets:
  - `pandas.CSVDataSet`
  - `pandas.ExcelDataSet`
  - `pandas.FeatherDataSet`
  - `pandas.JSONDataSet`
  - `pandas.ParquetDataSet`
- Renamed `lambda_data_set`, `memory_data_set`, and `partitioned_data_set` to `lambda_dataset`, `memory_dataset`, and `partitioned_dataset`, respectively, in `kedro.io`.
- The dataset `networkx.NetworkXDataSet` has been renamed to `networkx.JSONDataSet`.
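The dataset renames mentioned in these notes (`networkx.NetworkXDataSet` to `networkx.JSONDataSet`, and `pandas.AppendableExcelDataSet` merged into `pandas.ExcelDataSet`) can be applied to a catalog that has already been loaded into a plain dictionary. The sketch below assumes that shape; `migrate_catalog_types` is a hypothetical helper and the catalog entries and file paths are invented for illustration.

```python
from typing import Any, Dict

# Old dataset type -> replacement. Only the renames named in these release
# notes are listed; every other type is left alone.
RENAMED_TYPES = {
    "pandas.AppendableExcelDataSet": "pandas.ExcelDataSet",
    "networkx.NetworkXDataSet": "networkx.JSONDataSet",
}


def migrate_catalog_types(catalog: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Return a copy of a catalog mapping with renamed dataset types updated."""
    migrated = {}
    for name, entry in catalog.items():
        new_entry = dict(entry)  # shallow copy so the input is not modified
        if entry.get("type") in RENAMED_TYPES:
            new_entry["type"] = RENAMED_TYPES[entry["type"]]
        migrated[name] = new_entry
    return migrated


# Hypothetical catalog entries, as they might look after parsing catalog.yml.
old_catalog = {
    "graph": {"type": "networkx.NetworkXDataSet", "filepath": "data/graph.json"},
    "report": {"type": "pandas.AppendableExcelDataSet", "filepath": "data/report.xlsx"},
    "cars": {"type": "pandas.CSVDataSet", "filepath": "data/cars.csv"},
}
new_catalog = migrate_catalog_types(old_catalog)
```

Only the `type` key is rewritten; any `load_args` or `save_args` on the entry still need to be reviewed by hand against the other changes in this list.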
CLI
- Removed `kedro install` in favour of `pip install -r src/requirements.txt` to install project dependencies.
- Removed `--parallel` flag from `kedro run` in favour of `--runner=ParallelRunner`. The `-p` flag is now an alias for `--pipeline`.
- `kedro pipeline package` has been replaced by `kedro micropkg package` and, in addition to the `--alias` flag used to rename the package, now accepts a module name and path to the pipeline or utility module to package, relative to `src/<package_name>/`. The `--version` CLI option has been removed in favour of setting a `__version__` variable in the micro-package's `__init__.py` file.
- `kedro pipeline pull` has been replaced by `kedro micropkg pull` and now also supports `--destination` to provide a location for pulling the package.
- Removed `kedro pipeline list` and `kedro pipeline describe` in favour of `kedro registry list` and `kedro registry describe`.
- `kedro package` and `kedro micropkg package` now save `egg` and `whl` or `tar` files in the `<project_root>/dist` folder (previously `<project_root>/src/dist`).
- Changed the behaviour of `kedro build-reqs` to compile requirements from `requirements.txt` instead of `requirements.in` and save them to `requirements.lock` instead of `requirements.txt`.
- `kedro jupyter notebook/lab` no longer accept `--all-kernels` or `--idle-timeout` flags. `--all-kernels` is now the default behaviour.
- `KedroSession.run` now raises `ValueError` rather than `KedroContextError` when the pipeline contains no nodes. The same `ValueError` is raised when there are no matching tags.
- `KedroSession.run` now raises `ValueError` rather than `KedroContextError` when the pipeline name doesn't exist in the pipeline registry.
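Because `-p` changes meaning (it used to be shorthand for `--parallel` and is now an alias for `--pipeline`), old automation scripts can silently break. The following sketch rewrites a 0.17-style `kedro run` argument list; `migrate_run_args` is a hypothetical helper, and it should only be applied to commands written before the upgrade, since it treats every `-p` as the old parallel flag.

```python
from typing import List


def migrate_run_args(args: List[str]) -> List[str]:
    """Rewrite 0.17-style ``kedro run`` arguments for 0.18.

    ``--parallel`` and its old short form ``-p`` both become
    ``--runner=ParallelRunner``; all other arguments are kept as they are.
    """
    migrated = []
    for arg in args:
        if arg in ("--parallel", "-p"):
            migrated.append("--runner=ParallelRunner")
        else:
            migrated.append(arg)
    return migrated


# An illustrative pre-0.18 command line, split into arguments.
old_command = ["kedro", "run", "--parallel", "--env", "local"]
new_command = migrate_run_args(old_command)
print(" ".join(new_command))
```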
Other
- Added namespace to parameters in a modular pipeline, which addresses Issue 399.
- Switched from packaging pipelines as wheel files to tar archive files compressed with gzip (`.tar.gz`).
- Removed decorator API from `Node` and `Pipeline`, as well as the modules `kedro.extras.decorators` and `kedro.pipeline.decorators`.
- Removed transformer API from `DataCatalog`, as well as the modules `kedro.extras.transformers` and `kedro.io.transformers`.
- Removed the `Journal` and `DataCatalogWithDefault`.
- Removed `%init_kedro` IPython line magic, with its functionality incorporated into `%reload_kedro`. This means that if `%reload_kedro` is called with a filepath, that will be set as default for subsequent calls.
Migration guide from Kedro 0.17.* to 0.18.*
Hooks
- Remove any existing `hook_impl` of the `register_config_loader` and `register_catalog` methods from `ProjectHooks` in `hooks.py` (or custom alternatives).
- If you use `run_id` in the `after_catalog_created` hook, replace it with `save_version` instead.
- If you use `run_id` in any of the `before_node_run`, `after_node_run`, `on_node_error`, `before_pipeline_run`, `after_pipeline_run` or `on_pipeline_error` hooks, replace it with `session_id` instead.
`settings.py` file
- If you use a custom config loader class such as `kedro.config.TemplatedConfigLoader`, alter `CONFIG_LOADER_CLASS` to specify the class and `CONFIG_LOADER_ARGS` to specify keyword arguments. If not set, these default to `kedro.config.ConfigLoader` and an empty dictionary respectively.
- If you use a custom data catalog class, alter `DATA_CATALOG_CLASS` to specify the class. If not set, this defaults to `kedro.io.DataCatalog`.
- If you have a custom config location (i.e. not `conf`), update `CONF_ROOT` to `CONF_SOURCE` and set it to a string with the expected configuration location. If not set, this defaults to `"conf"`.
Modular pipelines
- If you use any modular pipelines with parameters, make sure they are declared with the correct namespace. See example below:
For a given pipeline:

```python
active_pipeline = pipeline(
    pipe=[
        node(
            func=some_func,
            inputs=["model_input_table", "params:model_options"],
            outputs=["**my_output"],
        ),
        ...,
    ],
    inputs="model_input_table",
    namespace="candidate_modelling_pipeline",
)
```

The parameters should look like this:

```diff
-model_options:
-  test_size: 0.2
-  random_state: 8
-  features:
-    - engines
-    - passenger_capacity
-    - crew
+candidate_modelling_pipeline:
+  model_options:
+    test_size: 0.2
+    random_state: 8
+    features:
+      - engines
+      - passenger_capacity
+      - crew
```
- Optional: You can now remove the `params:` prefix when supplying values to the `parameters` argument in a `pipeline()` call.
- If you pull modular pipelines with `kedro pipeline pull my_pipeline --alias other_pipeline`, now use `kedro micropkg pull my_pipeline --alias pipelines.other_pipeline` instead.
- If you package modular pipelines with `kedro pipeline package my_pipeline`, now use `kedro micropkg package pipelines.my_pipeline` instead.
- Similarly, if you package any modular pipelines using `pyproject.toml`, you should modify the keys to include the full module path and wrap them in double quotes, e.g.:
```diff
[tool.kedro.micropkg.package]
-data_engineering = {destination = "path/to/here"}
-data_science = {alias = "ds", env = "local"}
+"pipelines.data_engineering" = {destination = "path/to/here"}
+"pipelines.data_science" = {alias = "ds", env = "local"}

[tool.kedro.micropkg.pull]
-"s3://my_bucket/my_pipeline" = {alias = "aliased_pipeline"}
+"s3://my_bucket/my_pipeline" = {alias = "pipelines.aliased_pipeline"}
```
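Two of the steps in this section are simple dictionary transformations: nesting existing parameters under the pipeline namespace, and prefixing the keys of the `[tool.kedro.micropkg.package]` table with the module path. The sketch below illustrates both on plain dictionaries. `namespace_parameters` and `prefix_micropkg_keys` are hypothetical helpers, not Kedro APIs, and the second assumes every packaged pipeline lives under a `pipelines` module, as in the examples above.

```python
from typing import Any, Dict


def namespace_parameters(parameters: Dict[str, Any], namespace: str) -> Dict[str, Any]:
    """Nest existing parameters under the given pipeline namespace."""
    return {namespace: parameters}


def prefix_micropkg_keys(section: Dict[str, Any], prefix: str = "pipelines") -> Dict[str, Any]:
    """Turn short pipeline names into module paths such as ``pipelines.<name>``.

    Keys that already start with the prefix are left unchanged.
    """
    migrated = {}
    for key, value in section.items():
        new_key = key if key.startswith(prefix + ".") else prefix + "." + key
        migrated[new_key] = value
    return migrated


# Parameters from the example above, before and after namespacing.
old_params = {
    "model_options": {
        "test_size": 0.2,
        "random_state": 8,
        "features": ["engines", "passenger_capacity", "crew"],
    }
}
new_params = namespace_parameters(old_params, "candidate_modelling_pipeline")

# The [tool.kedro.micropkg.package] table from the example above, as a dict.
old_package = {
    "data_engineering": {"destination": "path/to/here"},
    "data_science": {"alias": "ds", "env": "local"},
}
new_package = prefix_micropkg_keys(old_package)
```

The `[tool.kedro.micropkg.pull]` table is different: there the key (the package location) stays the same and only the `alias` value gains the module prefix, so it is not handled by this helper.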
DataSets
- If you use `pandas.ExcelDataSet`, make sure you have `openpyxl` installed in your environment. This is automatically installed if you specify `kedro[pandas.ExcelDataSet]==0.18.0` in your `requirements.txt`. You can uninstall `xlrd` if you were only using it for this dataset.
- If you use `pandas.ParquetDataSet`, pass pandas saving arguments directly to `save_args` instead of nested in `from_pandas` (e.g. `save_args = {"preserve_index": False}` instead of `save_args = {"from_pandas": {"preserve_index": False}}`).
- If you use `spark.SparkHiveDataSet` with `write_mode` option set to `insert`, change this to `append` in line with the Spark styleguide. If you use `spark.SparkHiveDataSet` with `write_mode` option set to `upsert`, make sure that your `SparkContext` has a valid `checkpointDir` set either by `SparkContext.setCheckpointDir` method or directly in the `conf` folder.
- If you use `pandas~=1.2.0` and pass `storage_options` through `load_args` or `save_args`, specify them under `fs_args` or via `credentials` instead.
- If you import from `kedro.io.lambda_data_set`, `kedro.io.memory_data_set`, or `kedro.io.partitioned_data_set`, change the import to `kedro.io.lambda_dataset`, `kedro.io.memory_dataset`, or `kedro.io.partitioned_dataset`, respectively (or import the dataset directly from `kedro.io`).
- If you have any `pandas.AppendableExcelDataSet` entries in your catalog, replace them with `pandas.ExcelDataSet`.
- If you have any `networkx.NetworkXDataSet` entries in your catalog, replace them with `networkx.JSONDataSet`.
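The `pandas.ParquetDataSet` change above amounts to lifting whatever was nested under `from_pandas` to the top level of `save_args`. A minimal sketch of that transformation follows; `flatten_parquet_save_args` is a hypothetical helper working on a plain dictionary, and it does not check whether each lifted option is accepted by the pandas version in use.

```python
from typing import Any, Dict


def flatten_parquet_save_args(save_args: Dict[str, Any]) -> Dict[str, Any]:
    """Lift options nested under ``from_pandas`` to the top level.

    Returns a new dictionary; other top-level keys are kept, and the
    ``from_pandas`` key itself is removed.
    """
    flattened = {key: value for key, value in save_args.items() if key != "from_pandas"}
    flattened.update(save_args.get("from_pandas", {}))
    return flattened


# The example from the migration note above.
old_save_args = {"from_pandas": {"preserve_index": False}}
new_save_args = flatten_parquet_save_args(old_save_args)
print(new_save_args)
```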
Other
- Edit any scripts containing `kedro pipeline package --version` to use `kedro micropkg package` instead. If you wish to set a specific pipeline package version, set the `__version__` variable in the pipeline package's `__init__.py` file.
- To run a pipeline in parallel, use `kedro run --runner=ParallelRunner` rather than `--parallel` or `-p`.
- If you call `ConfigLoader` or `TemplatedConfigLoader` directly, update the keyword arguments `conf_root` to `conf_source` and `extra_params` to `runtime_params`.
- If you use `KedroContext` to access `ConfigLoader`, use `settings.CONFIG_LOADER_CLASS` to access the currently used `ConfigLoader` instead.