
0.18.0

@idanov idanov released this 31 Mar 16:06
· 1529 commits to main since this release
35e4ac5

Release 0.18.0

TL;DR ✨

Kedro 0.18.0 strives to reduce the complexity of the project template and get us closer to a stable release of the framework. We've introduced the full micro-packaging workflow 📦, which allows you to import packages, utility functions and existing pipelines into your Kedro project. Integration with IPython and Jupyter has been streamlined in preparation for enhancements to Kedro's interactive workflow. Additionally, the release comes with long-awaited Python 3.9 and 3.10 support 🐍.

Major features and improvements

Framework

  • Added kedro.config.abstract_config.AbstractConfigLoader as an abstract base class for all ConfigLoader implementations. ConfigLoader and TemplatedConfigLoader now inherit directly from this base class.
  • Streamlined the ConfigLoader.get and TemplatedConfigLoader.get API and delegated the actual get method functional implementation to the kedro.config.common module.
  • The hook_manager is no longer a global singleton. The hook_manager lifecycle is now managed by the KedroSession, and a new hook_manager will be created every time a session is instantiated.
  • Added support for specifying parameters mapping in pipeline() without the params: prefix.
  • Added new API Pipeline.filter() (previously in KedroContext._filter_pipeline()) to filter parts of a pipeline.
  • Added username to Session store for logging during Experiment Tracking.
  • A packaged Kedro project can now be imported and run from another Python project as follows:
from my_package.__main__ import main

main(
    ["--pipeline", "my_pipeline"]
)  # or just main() if no parameters are needed for the run

Project template

  • Removed cli.py from the Kedro project template. By default, all CLI commands, including kedro run, are now defined on the Kedro framework side. You can still define custom CLI commands by creating your own cli.py.
  • Removed hooks.py from the Kedro project template. Registration hooks have been removed in favour of settings.py configuration, but you can still define execution timeline hooks by creating your own hooks.py.
  • Removed .ipython directory from the Kedro project template. The IPython/Jupyter workflow no longer uses IPython profiles; it now uses an IPython extension.
  • The default kedro run configuration environment names can now be set in settings.py using the CONFIG_LOADER_ARGS variable. The relevant keyword arguments to supply are base_env and default_run_env, which are set to base and local respectively by default.
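
As a minimal sketch of the last point, a settings.py could spell out the two keyword arguments explicitly. The values shown are simply the defaults named above; a real project would substitute its own environment names.

```python
# settings.py (sketch): both values below are the documented defaults,
# written out explicitly. Replace them with your own environment names.
CONFIG_LOADER_ARGS = {
    "base_env": "base",          # environment that is always loaded
    "default_run_env": "local",  # environment used when no env is given to `kedro run`
}
```

Leaving CONFIG_LOADER_ARGS unset should give the same behaviour, since these are the default values.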

DataSets

  • Added the following new datasets:
| Type | Description | Location |
| --- | --- | --- |
| pandas.XMLDataSet | Read XML into Pandas DataFrame. Write Pandas DataFrame to XML | kedro.extras.datasets.pandas |
| networkx.GraphMLDataSet | Work with NetworkX using GraphML files | kedro.extras.datasets.networkx |
| networkx.GMLDataSet | Work with NetworkX using Graph Modelling Language files | kedro.extras.datasets.networkx |
| redis.PickleDataSet | Loads/saves data from/to a Redis database | kedro.extras.datasets.redis |
  • Added partitionBy support and exposed save_args for SparkHiveDataSet.
  • Exposed open_args_save in fs_args for pandas.ParquetDataSet.
  • Refactored the load and save operations for pandas datasets to leverage pandas' own API and delegate fsspec operations to it. This reduces the need for our own fsspec wrappers.
  • Merged pandas.AppendableExcelDataSet into pandas.ExcelDataSet.
  • Added save_args to feather.FeatherDataSet.

Jupyter and IPython integration

  • The only recommended way to work with Kedro in Jupyter or IPython is now the Kedro IPython extension. Managed Jupyter instances should load this via %load_ext kedro.extras.extensions.ipython and use the line magic %reload_kedro.
  • kedro ipython launches an IPython session that preloads the Kedro IPython extension.
  • kedro jupyter notebook/lab creates a custom Jupyter kernel that preloads the Kedro IPython extension and launches a notebook with that kernel selected. There is no longer a need to specify --all-kernels to show all available kernels.

Dependencies

  • Bumped the minimum version of pandas to 1.3. Any storage_options should continue to be specified under fs_args and/or credentials.
  • Added support for Python 3.9 and 3.10, dropped support for Python 3.6.
  • Updated black dependency in the project template to a non pre-release version.

Other

  • Documented distribution of Kedro pipelines with Dask.

Breaking changes to the API

Framework

  • Removed RegistrationSpecs and its associated register_config_loader and register_catalog hook specifications in favour of CONFIG_LOADER_CLASS/CONFIG_LOADER_ARGS and DATA_CATALOG_CLASS in settings.py.
  • Removed deprecated functions load_context and get_project_context.
  • Removed deprecated CONF_SOURCE, package_name, pipeline, pipelines, config_loader and io attributes from KedroContext as well as the deprecated KedroContext.run method.
  • Added the PluginManager hook_manager argument to KedroContext and the Runner.run() method, which will be provided by the KedroSession.
  • Removed the public method get_hook_manager() and replaced its functionality with _create_hook_manager().
  • Enforced that only one run can be successfully executed as part of a KedroSession. run_id has been renamed to session_id as a result.

Configuration loaders

  • The settings.py setting CONF_ROOT has been renamed to CONF_SOURCE. The default value, conf, remains unchanged.
  • ConfigLoader and TemplatedConfigLoader argument conf_root has been renamed to conf_source.
  • extra_params has been renamed to runtime_params in kedro.config.config.ConfigLoader and kedro.config.templated_config.TemplatedConfigLoader.
  • The environment defaulting behaviour has been removed from KedroContext and is now implemented in a ConfigLoader class (or equivalent) with the base_env and default_run_env attributes.

DataSets

  • pandas.ExcelDataSet now uses openpyxl engine instead of xlrd.
  • pandas.ParquetDataSet now calls pd.to_parquet() upon saving. Note that the argument partition_cols is not supported.
  • spark.SparkHiveDataSet API has been updated to reflect spark.SparkDataSet. The write_mode=insert option has also been replaced with write_mode=append as per the Spark styleguide. This change addresses Issue 725 and Issue 745. Additionally, upsert mode now leverages checkpoint functionality and requires a valid checkpointDir to be set for the current SparkContext.
  • yaml.YAMLDataSet can no longer save a pandas.DataFrame directly, but it can save a dictionary. Use pandas.DataFrame.to_dict() to convert your pandas.DataFrame to a dictionary before you attempt to save it to YAML.
  • Removed open_args_load and open_args_save from the following datasets:
    • pandas.CSVDataSet
    • pandas.ExcelDataSet
    • pandas.FeatherDataSet
    • pandas.JSONDataSet
    • pandas.ParquetDataSet
  • storage_options are now dropped if they are specified under load_args or save_args for the following datasets:
    • pandas.CSVDataSet
    • pandas.ExcelDataSet
    • pandas.FeatherDataSet
    • pandas.JSONDataSet
    • pandas.ParquetDataSet
  • Renamed lambda_data_set, memory_data_set, and partitioned_data_set to lambda_dataset, memory_dataset, and partitioned_dataset, respectively, in kedro.io.
  • The dataset networkx.NetworkXDataSet has been renamed to networkx.JSONDataSet.

CLI

  • Removed kedro install in favour of pip install -r src/requirements.txt to install project dependencies.
  • Removed --parallel flag from kedro run in favour of --runner=ParallelRunner. The -p flag is now an alias for --pipeline.
  • kedro pipeline package has been replaced by kedro micropkg package and, in addition to the --alias flag used to rename the package, now accepts a module name and path to the pipeline or utility module to package, relative to src/<package_name>/. The --version CLI option has been removed in favour of setting a __version__ variable in the micro-package's __init__.py file.
  • kedro pipeline pull has been replaced by kedro micropkg pull and now also supports --destination to provide a location for pulling the package.
  • Removed kedro pipeline list and kedro pipeline describe in favour of kedro registry list and kedro registry describe.
  • kedro package and kedro micropkg package now save egg and whl or tar files in the <project_root>/dist folder (previously <project_root>/src/dist).
  • Changed the behaviour of kedro build-reqs to compile requirements from requirements.txt instead of requirements.in and save them to requirements.lock instead of requirements.txt.
  • kedro jupyter notebook/lab no longer accept --all-kernels or --idle-timeout flags. --all-kernels is now the default behaviour.
  • KedroSession.run now raises ValueError rather than KedroContextError when the pipeline contains no nodes. The same ValueError is raised when there are no matching tags.
  • KedroSession.run now raises ValueError rather than KedroContextError when the pipeline name doesn't exist in the pipeline registry.

Other

  • Added namespace to parameters in a modular pipeline, which addresses Issue 399.
  • Switched from packaging pipelines as wheel files to tar archive files compressed with gzip (.tar.gz).
  • Removed decorator API from Node and Pipeline, as well as the modules kedro.extras.decorators and kedro.pipeline.decorators.
  • Removed transformer API from DataCatalog, as well as the modules kedro.extras.transformers and kedro.io.transformers.
  • Removed the Journal and DataCatalogWithDefault.
  • Removed %init_kedro IPython line magic, with its functionality incorporated into %reload_kedro. This means that if %reload_kedro is called with a filepath, that will be set as default for subsequent calls.

Migration guide from Kedro 0.17.* to 0.18.*

Hooks

  • Remove any existing hook_impl of the register_config_loader and register_catalog methods from ProjectHooks in hooks.py (or custom alternatives).
  • If you use run_id in the after_catalog_created hook, replace it with save_version instead.
  • If you use run_id in any of the before_node_run, after_node_run, on_node_error, before_pipeline_run, after_pipeline_run or on_pipeline_error hooks, replace it with session_id instead.

settings.py file

  • If you use a custom config loader class such as kedro.config.TemplatedConfigLoader, alter CONFIG_LOADER_CLASS to specify the class and CONFIG_LOADER_ARGS to specify keyword arguments. If not set, these default to kedro.config.ConfigLoader and an empty dictionary respectively.
  • If you use a custom data catalog class, alter DATA_CATALOG_CLASS to specify the class. If not set, this defaults to kedro.io.DataCatalog.
  • If you have a custom config location (i.e. not conf), update CONF_ROOT to CONF_SOURCE and set it to a string with the expected configuration location. If not set, this defaults to "conf".

Modular pipelines

  • If you use any modular pipelines with parameters, make sure they are declared with the correct namespace. See example below:

For a given pipeline:

active_pipeline = pipeline(
    pipe=[
        node(
            func=some_func,
            inputs=["model_input_table", "params:model_options"],
            outputs=["**my_output"],
        ),
        ...,
    ],
    inputs="model_input_table",
    namespace="candidate_modelling_pipeline",
)

The parameters should look like this:

-model_options:
-    test_size: 0.2
-    random_state: 8
-    features:
-    - engines
-    - passenger_capacity
-    - crew
+candidate_modelling_pipeline:
+    model_options:
+      test_size: 0.2
+      random_state: 8
+      features:
+        - engines
+        - passenger_capacity
+        - crew
  • Optional: You can now remove all params: prefix when supplying values to parameters argument in a pipeline() call.
  • If you pull modular pipelines with kedro pipeline pull my_pipeline --alias other_pipeline, now use kedro micropkg pull my_pipeline --alias pipelines.other_pipeline instead.
  • If you package modular pipelines with kedro pipeline package my_pipeline, now use kedro micropkg package pipelines.my_pipeline instead.
  • Similarly, if you package any modular pipelines using pyproject.toml, modify the keys to include the full module path and wrap them in double quotes, e.g.:
[tool.kedro.micropkg.package]
-data_engineering = {destination = "path/to/here"}
-data_science = {alias = "ds", env = "local"}
+"pipelines.data_engineering" = {destination = "path/to/here"}
+"pipelines.data_science" = {alias = "ds", env = "local"}

[tool.kedro.micropkg.pull]
-"s3://my_bucket/my_pipeline" = {alias = "aliased_pipeline"}
+"s3://my_bucket/my_pipeline" = {alias = "pipelines.aliased_pipeline"}
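
The parameter namespacing shown in the YAML diff earlier in this section amounts to nesting the existing keys under the pipeline's namespace. As a rough illustration on plain Python dictionaries (namespace_parameters is a hypothetical helper written for this sketch, not part of Kedro):

```python
from typing import Any, Dict


def namespace_parameters(params: Dict[str, Any], namespace: str) -> Dict[str, Any]:
    """Return a new mapping with all of `params` nested under `namespace`."""
    # Hypothetical helper: mirrors the manual edit to the parameters file.
    return {namespace: dict(params)}


# Parameters as they looked before the migration.
old_parameters = {
    "model_options": {
        "test_size": 0.2,
        "random_state": 8,
        "features": ["engines", "passenger_capacity", "crew"],
    }
}

# Parameters as the namespaced pipeline expects them.
new_parameters = namespace_parameters(old_parameters, "candidate_modelling_pipeline")
```

In practice the same change is usually made by hand in the parameters file; the helper only makes the before/after shape explicit.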

DataSets

  • If you use pandas.ExcelDataSet, make sure you have openpyxl installed in your environment. This is automatically installed if you specify kedro[pandas.ExcelDataSet]==0.18.0 in your requirements.txt. You can uninstall xlrd if you were only using it for this dataset.
  • If you use pandas.ParquetDataSet, pass pandas saving arguments directly to save_args instead of nesting them in from_pandas (e.g. save_args = {"preserve_index": False} instead of save_args = {"from_pandas": {"preserve_index": False}}).
  • If you use spark.SparkHiveDataSet with the write_mode option set to insert, change this to append in line with the Spark styleguide. If you use spark.SparkHiveDataSet with the write_mode option set to upsert, make sure that your SparkContext has a valid checkpointDir set, either via the SparkContext.setCheckpointDir method or directly in the conf folder.
  • If you use pandas~=1.2.0 and pass storage_options through load_args or save_args, specify them under fs_args or via credentials instead.
  • If you import from kedro.io.lambda_data_set, kedro.io.memory_data_set, or kedro.io.partitioned_data_set, change the import to kedro.io.lambda_dataset, kedro.io.memory_dataset, or kedro.io.partitioned_dataset, respectively (or import the dataset directly from kedro.io).
  • If you have any pandas.AppendableExcelDataSet entries in your catalog, replace them with pandas.ExcelDataSet.
  • If you have any networkx.NetworkXDataSet entries in your catalog, replace them with networkx.JSONDataSet.
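
Several of the catalog changes above are mechanical, so they can be sketched as a small transformation over catalog entries. The helper below is hypothetical (not a Kedro API) and assumes each entry is a plain dictionary, as it would look after loading a catalog file; the filepath in the example is made up. It covers only the type renames, the from_pandas flattening and the insert-to-append change, so other arguments still need a manual review.

```python
import copy
from typing import Any, Dict

# Old -> new dataset type names, taken from the renames listed above.
RENAMED_TYPES = {
    "pandas.AppendableExcelDataSet": "pandas.ExcelDataSet",
    "networkx.NetworkXDataSet": "networkx.JSONDataSet",
}


def migrate_catalog_entry(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Return a migrated copy of one catalog entry; the input is left untouched."""
    migrated = copy.deepcopy(entry)
    dataset_type = migrated.get("type")

    # Apply the dataset renames.
    if dataset_type in RENAMED_TYPES:
        migrated["type"] = RENAMED_TYPES[dataset_type]

    # Lift pandas saving arguments out of the nested "from_pandas" key.
    save_args = migrated.get("save_args")
    if dataset_type == "pandas.ParquetDataSet" and isinstance(save_args, dict):
        nested = save_args.pop("from_pandas", None)
        if isinstance(nested, dict):
            save_args.update(nested)

    # write_mode=insert has been replaced by write_mode=append.
    if dataset_type == "spark.SparkHiveDataSet" and migrated.get("write_mode") == "insert":
        migrated["write_mode"] = "append"

    return migrated


old_entry = {
    "type": "pandas.ParquetDataSet",
    "filepath": "data/01_raw/cars.parquet",  # hypothetical path
    "save_args": {"from_pandas": {"preserve_index": False}},
}
new_entry = migrate_catalog_entry(old_entry)
```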

Other

  • Edit any scripts containing kedro pipeline package --version to use kedro micropkg package instead. If you wish to set a specific pipeline package version, set the __version__ variable in the pipeline package's __init__.py file.
  • To run a pipeline in parallel, use kedro run --runner=ParallelRunner rather than --parallel or -p.
  • If you call ConfigLoader or TemplatedConfigLoader directly, update the keyword arguments conf_root to conf_source and extra_params to runtime_params.
  • If you use KedroContext to access ConfigLoader, use settings.CONFIG_LOADER_CLASS to access the currently used ConfigLoader instead.
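
For code that builds the config loader's keyword arguments dynamically, the two renames can be applied in one place. This is a minimal sketch: rename_config_loader_kwargs is a hypothetical helper, and the example values are illustrative only.

```python
from typing import Any, Dict

# Old -> new keyword argument names, as listed in the migration notes above.
RENAMED_KWARGS = {
    "conf_root": "conf_source",
    "extra_params": "runtime_params",
}


def rename_config_loader_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `kwargs` with the 0.17 names replaced by the 0.18 names."""
    return {RENAMED_KWARGS.get(key, key): value for key, value in kwargs.items()}


old_kwargs = {"conf_root": "conf", "env": "local", "extra_params": {"test_size": 0.3}}
new_kwargs = rename_config_loader_kwargs(old_kwargs)
```

The resulting dictionary can then be unpacked into the loader call, for example ConfigLoader(**new_kwargs), assuming the remaining keys are ones the loader accepts.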