
Commit

Update the instructions for creating a new dataset
Signed-off-by: Deepyaman Datta <[email protected]>
deepyaman committed Jun 17, 2024
1 parent efdc0f9 commit c25bab3
Showing 3 changed files with 22 additions and 21 deletions.
1 change: 1 addition & 0 deletions RELEASE.md
@@ -1,6 +1,7 @@
# Upcoming Release 0.19.7

## Major features and improvements
* Exposed `load` and `save` publicly for each dataset in the core `kedro` library, and enabled other datasets to do the same. If a dataset doesn't expose `load` or `save` publicly, Kedro will fall back to using `_load` or `_save`, respectively.
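
As a rough, hedged illustration of that fallback (not part of the release notes; `LegacyDataset` below is a hypothetical example), an old-style dataset that only defines `_load`/`_save` still exposes the same public interface:

```python
# Illustrative sketch only (assumes kedro>=0.19.7); ``LegacyDataset`` is hypothetical.
from kedro.io import AbstractDataset


class LegacyDataset(AbstractDataset[int, int]):
    """Old-style dataset that only defines the private ``_load``/``_save``."""

    def __init__(self, value: int = 0):
        self._value = value

    def _load(self) -> int:
        return self._value

    def _save(self, data: int) -> None:
        self._value = data

    def _describe(self) -> dict:
        return {"value": self._value}


dataset = LegacyDataset(value=1)
dataset.save(2)             # falls back to ``_save``
assert dataset.load() == 2  # falls back to ``_load``
```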

## Bug fixes and other changes
* Updated error message for invalid catalog entries.
34 changes: 17 additions & 17 deletions docs/source/data/how_to_create_a_custom_dataset.md
@@ -4,7 +4,7 @@

## AbstractDataset

If you are a contributor and would like to submit a new dataset, you must extend the {py:class}`~kedro.io.AbstractDataset` interface or {py:class}`~kedro.io.AbstractVersionedDataset` interface if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataset` implementation.
If you are a contributor and would like to submit a new dataset, you must extend the {py:class}`~kedro.io.AbstractDataset` interface or {py:class}`~kedro.io.AbstractVersionedDataset` interface if you plan to support versioning. It requires subclasses to implement the `load` and `save` methods while providing wrappers that enrich the corresponding methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataset` implementation.
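
A minimal sketch of that uniform error handling, assuming only the behaviour described above (`FailingDataset` is a hypothetical example, not part of the guide):

```python
# Minimal sketch, not the kedro source: errors raised inside a subclass's
# ``load`` surface to callers as a ``DatasetError`` with dataset context.
from kedro.io import AbstractDataset, DatasetError


class FailingDataset(AbstractDataset[str, str]):
    """Hypothetical dataset whose ``load`` always fails."""

    def load(self) -> str:
        raise FileNotFoundError("no such file")

    def save(self, data: str) -> None:
        ...

    def _describe(self) -> dict:
        return {}


try:
    FailingDataset().load()
except DatasetError as exc:
    print(exc)  # message includes the dataset description and the original error
```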

Check warning (GitHub Actions / vale) on line 7 in docs/source/data/how_to_create_a_custom_dataset.md: [Kedro.toowordy] 'implement' is too wordy


## Scenario
@@ -31,8 +31,8 @@ Consult the [Pillow documentation](https://pillow.readthedocs.io/en/stable/insta

At the minimum, a valid Kedro dataset needs to subclass the base {py:class}`~kedro.io.AbstractDataset` and provide an implementation for the following abstract methods:

* `_load`
* `_save`
* `load`
* `save`
* `_describe`

`AbstractDataset` is generically typed with an input data type for saving data, and an output data type for loading data.
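
For example (an illustrative sketch rather than part of the guide; `DataFrameDataset` is a hypothetical name), a dataset that saves and loads pandas DataFrames would parameterise the base class with the type `save` accepts first and the type `load` returns second:

```python
# Illustrative only; ``DataFrameDataset`` is a hypothetical name.
import pandas as pd

from kedro.io import AbstractDataset


class DataFrameDataset(AbstractDataset[pd.DataFrame, pd.DataFrame]):
    def load(self) -> pd.DataFrame:  # returns the second type parameter
        ...

    def save(self, data: pd.DataFrame) -> None:  # accepts the first type parameter
        ...

    def _describe(self) -> dict:
        return {}
```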
@@ -70,15 +70,15 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
"""
self._filepath = filepath

def _load(self) -> np.ndarray:
def load(self) -> np.ndarray:
"""Loads data from the image file.
Returns:
Data from the image file as a numpy array.
"""
...

def _save(self, data: np.ndarray) -> None:
def save(self, data: np.ndarray) -> None:
"""Saves image data to the specified filepath"""
...

@@ -96,11 +96,11 @@ src/kedro_pokemon/datasets
└── image_dataset.py
```

## Implement the `_load` method with `fsspec`
## Implement the `load` method with `fsspec`

Check warning (GitHub Actions / vale) on line 99 in docs/source/data/how_to_create_a_custom_dataset.md: [Kedro.toowordy] 'Implement' is too wordy

Many of the built-in Kedro datasets rely on [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) as a consistent interface to different data sources, as described earlier in the section about the [Data Catalog](../data/data_catalog.md#dataset-filepath). In this example, it's particularly convenient to use `fsspec` in conjunction with `Pillow` to read image data, since it allows the dataset to work flexibly with different image locations and formats.

Here is the implementation of the `_load` method using `fsspec` and `Pillow` to read the data of a single image into a `numpy` array:
Here is the implementation of the `load` method using `fsspec` and `Pillow` to read the data of a single image into a `numpy` array:

<details>
<summary><b>Click to expand</b></summary>
@@ -130,7 +130,7 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
self._filepath = PurePosixPath(path)
self._fs = fsspec.filesystem(self._protocol)

def _load(self) -> np.ndarray:
def load(self) -> np.ndarray:
"""Loads data from the image file.
Returns:
@@ -168,14 +168,14 @@ In [2]: from PIL import Image
In [3]: Image.fromarray(image).show()
```

## Implement the `_save` method with `fsspec`
## Implement the `save` method with `fsspec`

Check warning (GitHub Actions / vale) on line 171 in docs/source/data/how_to_create_a_custom_dataset.md: [Kedro.toowordy] 'Implement' is too wordy

Similarly, we can implement the `save` method as follows:


```python
class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
def _save(self, data: np.ndarray) -> None:
def save(self, data: np.ndarray) -> None:
"""Saves image data to the specified filepath."""
# using get_filepath_str ensures that the protocol and path are appended correctly for different filesystems
save_path = get_filepath_str(self._filepath, self._protocol)
@@ -243,7 +243,7 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
self._filepath = PurePosixPath(path)
self._fs = fsspec.filesystem(self._protocol)

def _load(self) -> np.ndarray:
def load(self) -> np.ndarray:
"""Loads data from the image file.
Returns:
@@ -254,7 +254,7 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
image = Image.open(f).convert("RGBA")
return np.asarray(image)

def _save(self, data: np.ndarray) -> None:
def save(self, data: np.ndarray) -> None:
"""Saves image data to the specified filepath."""
save_path = get_filepath_str(self._filepath, self._protocol)
with self._fs.open(save_path, mode="wb") as f:
@@ -312,7 +312,7 @@ To add versioning support to the new dataset we need to extend the
{py:class}`~kedro.io.AbstractVersionedDataset` to:

* Accept a `version` keyword argument as part of the constructor
* Adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively
* Adapt the `load` and `save` methods to use the versioned data path obtained from `_get_load_path` and `_get_save_path`, respectively

The following amends the full implementation of our basic `ImageDataset`. It now loads and saves data to and from a versioned subfolder (`data/01_raw/pokemon-images-and-types/images/images/pikachu.png/<version>/pikachu.png` with `version` being a datetime-formatted string `YYYY-MM-DDThh.mm.ss.sssZ` by default):
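
Before the full implementation, here is a rough usage sketch (illustrative only: the filepath and import path follow the project layout shown earlier, and it assumes at least one version of the image already exists under the versioned path):

```python
# Illustrative sketch only; the filepath and import path follow the project
# layout shown earlier, and at least one saved version of the image is assumed
# to exist under the versioned path already.
import numpy as np

from kedro.io import Version
from kedro_pokemon.datasets.image_dataset import ImageDataset

dataset = ImageDataset(
    filepath="data/01_raw/pokemon-images-and-types/images/images/pikachu.png",
    version=Version(load=None, save=None),  # latest on load, new timestamp on save
)
image: np.ndarray = dataset.load()  # reads .../pikachu.png/<latest version>/pikachu.png
dataset.save(image)                 # writes .../pikachu.png/<new version>/pikachu.png
```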

@@ -359,7 +359,7 @@ class ImageDataset(AbstractVersionedDataset[np.ndarray, np.ndarray]):
glob_function=self._fs.glob,
)

def _load(self) -> np.ndarray:
def load(self) -> np.ndarray:
"""Loads data from the image file.
Returns:
Expand All @@ -370,7 +370,7 @@ class ImageDataset(AbstractVersionedDataset[np.ndarray, np.ndarray]):
image = Image.open(f).convert("RGBA")
return np.asarray(image)

def _save(self, data: np.ndarray) -> None:
def save(self, data: np.ndarray) -> None:
"""Saves image data to the specified filepath."""
save_path = get_filepath_str(self._get_save_path(), self._protocol)
with self._fs.open(save_path, mode="wb") as f:
@@ -435,7 +435,7 @@ The difference between the original `ImageDataset` and the versioned `ImageDatas
+ glob_function=self._fs.glob,
+ )
+
def _load(self) -> np.ndarray:
def load(self) -> np.ndarray:
"""Loads data from the image file.

Returns:
@@ -447,7 +447,7 @@
image = Image.open(f).convert("RGBA")
return np.asarray(image)

def _save(self, data: np.ndarray) -> None:
def save(self, data: np.ndarray) -> None:
"""Saves image data to the specified filepath."""
- save_path = get_filepath_str(self._filepath, self._protocol)
+ save_path = get_filepath_str(self._get_save_path(), self._protocol)
8 changes: 4 additions & 4 deletions kedro/io/core.py
@@ -93,10 +93,10 @@ class AbstractDataset(abc.ABC, Generic[_DI, _DO]):
>>> self._param1 = param1
>>> self._param2 = param2
>>>
>>> def _load(self) -> pd.DataFrame:
>>> def load(self) -> pd.DataFrame:
>>> return pd.read_csv(self._filepath)
>>>
>>> def _save(self, df: pd.DataFrame) -> None:
>>> def save(self, df: pd.DataFrame) -> None:
>>> df.to_csv(str(self._filepath))
>>>
>>> def _exists(self) -> bool:
@@ -555,11 +555,11 @@ class AbstractVersionedDataset(AbstractDataset[_DI, _DO], abc.ABC):
>>> self._param1 = param1
>>> self._param2 = param2
>>>
>>> def _load(self) -> pd.DataFrame:
>>> def load(self) -> pd.DataFrame:
>>> load_path = self._get_load_path()
>>> return pd.read_csv(load_path)
>>>
>>> def _save(self, df: pd.DataFrame) -> None:
>>> def save(self, df: pd.DataFrame) -> None:
>>> save_path = self._get_save_path()
>>> df.to_csv(str(save_path))
>>>
