Update partitioned dataset lazy saving docs (#4402)
* Updated Partitioned dataset lazy saving docs

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated release notes

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed typo

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated docs based on new solution

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Applied review comments

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

---------

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
ElenaKhaustova authored Jan 22, 2025
1 parent fba7c53 commit 9ee181f
Showing 2 changed files with 20 additions and 0 deletions.
1 change: 1 addition & 0 deletions RELEASE.md
@@ -14,6 +14,7 @@
* Safeguard hooks when user incorrectly registers a hook class in settings.py.
* Fixed parsing paths with query and fragment.
* Remove lowercase transformation in regex validation.
* Updated `Partitioned dataset lazy saving` docs page.

## Breaking changes to the API
## Documentation changes
19 changes: 19 additions & 0 deletions docs/source/data/partitioned_and_incremental_datasets.md
@@ -175,6 +175,7 @@ new_partitioned_dataset:
path: s3://my-bucket-name
dataset: pandas.CSVDataset
filename_suffix: ".csv"
save_lazily: True
```
Here is the node definition:
@@ -238,6 +239,24 @@ def create_partitions() -> Dict[str, Callable[[], Any]]:
When using lazy saving, the dataset will be written _after_ the `after_node_run` [hook](../hooks/introduction).
```

```{note}
Lazy saving is the default behaviour, meaning that if a `Callable` type is provided, the dataset will be written _after_ the `after_node_run` hook is executed.
```

In certain cases, it might be useful to disable lazy saving, such as when your object is already a `Callable` (e.g., a TensorFlow model) and you do not intend to save it lazily.
To disable lazy saving, set the `save_lazily` parameter to `False`:

```yaml
# conf/base/catalog.yml

new_partitioned_dataset:
type: partitions.PartitionedDataset
path: s3://my-bucket-name
dataset: pandas.CSVDataset
filename_suffix: ".csv"
save_lazily: False
```
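To illustrate the default lazy-saving behaviour described above, here is a minimal sketch of a node that returns a dictionary of callables. The partition names and values are hypothetical; what matters is the shape of the return type, which matches the `create_partitions() -> Dict[str, Callable[[], Any]]` signature shown earlier:

```python
from typing import Any, Callable, Dict


def create_partitions() -> Dict[str, Callable[[], Any]]:
    """Map partition names to factories that build each partition lazily."""

    def make_partition(i: int) -> Callable[[], Any]:
        # With `save_lazily: True` (the default), each callable is only
        # invoked at save time, after the `after_node_run` hook runs.
        return lambda: {"value": i * 10}

    return {f"part_{i}": make_partition(i) for i in range(3)}
```

With `save_lazily: False`, a `Callable` return value (such as a TensorFlow model) would instead be written as-is rather than being invoked to produce the partition data.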
## Incremental datasets
{class}`IncrementalDataset<kedro-datasets:kedro_datasets.partitions.IncrementalDataset>` is a subclass of `PartitionedDataset`, which stores the information about the last processed partition in the so-called `checkpoint`. `IncrementalDataset` addresses the use case when partitions have to be processed incrementally, that is, each subsequent pipeline run should process just the partitions which were not processed by the previous runs.
