add dataset guide
tianweidut committed Dec 12, 2023
1 parent 873065a commit a5a6874
Showing 12 changed files with 401 additions and 4 deletions.
Empty file added docs/dataset/build.md
Empty file.
Empty file added docs/dataset/integration.md
Empty file.
Empty file added docs/dataset/load.md
Empty file.
Empty file added docs/dataset/version.md
Empty file.
Empty file added docs/dataset/view.md
Empty file.
206 changes: 206 additions & 0 deletions i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md
@@ -0,0 +1,206 @@
---
title: Building Datasets
---

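Starwhale datasets can be built in several ways: directly from image, audio, video, csv, json, or jsonl files, from Python scripts, or by importing a dataset from the Huggingface Hub.

## Build directly from data files

### Images

Starwhale can recursively traverse a directory of image files and build a Starwhale dataset without any code:

- Supported image formats: `png/jpg/jpeg/webp/svg/apng`.
- Images are converted to the Starwhale.Image type and can be viewed on the Starwhale Server web page.
- Two build methods are supported: the `swcli dataset build --image` command and the Python SDK `starwhale.Dataset.from_folder`.
- Label mechanism: when `auto_label=True` is set in the SDK, or `--auto-label` is set on the command line, the name of the parent directory is used as the `label`.
- Metadata mechanism: place a `metadata.csv` or `metadata.jsonl` file in the root directory to add columns to the dataset.
- Caption mechanism: when a `{image-name}.txt` file is found in the same directory, its content is imported automatically into the `caption` column.

Suppose the `folder` directory contains the following four files: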

```console
folder/dog/1.png
folder/cat/2.png
folder/dog/3.png
folder/cat/4.png
```

Build from the command line:

```console
swcli dataset build --image folder --name image-folder
🚧 start to build dataset bundle...
👷 uri local/project/self/dataset/image-folder/version/latest
🌊 creating dataset local/project/self/dataset/image-folder/version/uw6mdisnf7alg4t4fs2myfug4ie4636w3x4jqcu2...
🦋 update 4 records into dataset
🌺 congratulation! you can run swcli dataset info image-folder/version/uw6mdisnf7al
```

```console
swcli dataset head image-folder -n 2
row ───────────────────────────────────────
🌳 id: cat/2.png
🌀 features:
🔅 file_name : cat/2.png
🔅 label : cat
🔅 file : ArtifactType.Image, display:2.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding:
row ───────────────────────────────────────
🌳 id: cat/4.png
🌀 features:
🔅 file_name : cat/4.png
🔅 label : cat
🔅 file : ArtifactType.Image, display:4.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding:
```
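
The `label` values in the output above come from the auto-label rule. As a minimal sketch of that rule in plain Python (not Starwhale's actual implementation; the helper name `auto_label` is hypothetical), the label is simply the name of each file's parent directory:

```python
from pathlib import PurePosixPath


def auto_label(relative_path: str) -> str:
    """Return the parent directory name of a path relative to the dataset folder.

    Hypothetical helper that mirrors the auto-label rule described in this guide.
    """
    return PurePosixPath(relative_path).parent.name


# The four files from the example folder, relative to `folder/`.
files = ["dog/1.png", "cat/2.png", "dog/3.png", "cat/4.png"]

# One row per file, with the label derived from the directory name.
rows = [{"file_name": f, "label": auto_label(f)} for f in sorted(files)]
for row in rows:
    print(row)
```

Under this rule, changing the directory layout changes the labels, so it is worth checking the folder structure before building.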


Build with the Python SDK:

```python
from starwhale import Dataset
ds = Dataset.from_folder("folder", kind="image")
print(ds)
print(ds.fetch_one().features)
```

```console
🌊 creating dataset local/project/self/dataset/folder/version/nyc2ay4gnyayv4zqalovdgakl3k2esvrne42cjna...
🦋 update 4 records into dataset
Dataset: folder, stash version: d22hdiwyakdfh5xitcpn2s32gblfbhrerzczkb63, loading version: nyc2ay4gnyayv4zqalovdgakl3k2esvrne42cjna
{'file_name': 'cat/2.png', 'label': 'cat', 'file': ArtifactType.Image, display:2.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding: }
```


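### Videos

Starwhale can recursively traverse a directory of video files and build a Starwhale dataset without any code:

- Supported video formats: `mp4/webm/avi`.
- Videos are converted to the Starwhale.Video type and can be viewed on the Starwhale Server web page.
- Two build methods are supported: the `swcli dataset build --video` command and the Python SDK `starwhale.Dataset.from_folder`.
- The label, caption, and metadata mechanisms work the same way as for images.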

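### Audio

Starwhale can recursively traverse a directory of audio files and build a Starwhale dataset without any code:

- Supported audio formats: `mp3/wav`.
- Audio files are converted to the Starwhale.Audio type and can be viewed on the Starwhale Server web page.
- Two build methods are supported: the `swcli dataset build --audio` command and the Python SDK `starwhale.Dataset.from_folder`.
- The label, caption, and metadata mechanisms work the same way as for images.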
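
The recursive traversal described for images, videos, and audio can be pictured with a small standard-library sketch. The extension sets are the ones listed in this guide; the `SUPPORTED` mapping, the `find_media` helper, and the case-insensitive suffix match are illustrative assumptions rather than Starwhale's own code.

```python
import tempfile
from pathlib import Path
from typing import List

# Extensions taken from this guide; the mapping itself is an assumption.
SUPPORTED = {
    "image": {".png", ".jpg", ".jpeg", ".webp", ".svg", ".apng"},
    "video": {".mp4", ".webm", ".avi"},
    "audio": {".mp3", ".wav"},
}


def find_media(root: Path, kind: str) -> List[str]:
    """Recursively collect files under `root` whose suffix matches `kind`."""
    extensions = SUPPORTED[kind]
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )


# Build a throwaway directory tree with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("dog/1.mp3", "cat/2.wav", "cat/notes.txt", "clip.mp4"):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    audio_files = find_media(root, "audio")
    video_files = find_media(root, "video")

print(audio_files)
print(video_files)
```

Files with other extensions, such as `cat/notes.txt` here, are simply skipped by this sketch.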

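### csv files

A local or remote csv file can be converted directly into a Starwhale dataset from the command line or the Python SDK:

- One or more local csv files are supported.
- A local directory can be searched recursively for csv files.
- One or more remote csv files, specified as http urls, are supported.

Build from the command line: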

```console
swcli dataset build --name product-desc-modelscope --csv https://modelscope.cn/api/v1/datasets/lcl193798/product_description_generation/repo\?Revision\=master\&FilePath\=test.csv --encoding=utf-8-sig
🚧 start to build dataset bundle...
👷 uri local/project/self/dataset/product-desc-modelscope/version/latest
🌊 creating dataset local/project/self/dataset/product-desc-modelscope/version/wzaz4ccodpyj4jelgupljreyida2bleg5xp7viwe...
🦋 update 3848 records into dataset
🌺 congratulation! dataset build from csv files(('https://modelscope.cn/api/v1/datasets/lcl193798/product_description_generation/repo?Revision=master&FilePath=test.csv',)) has been built. You can run swcli dataset info product-desc-modelscope/version/wzaz4ccodpyj
```
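
The command above passes `--encoding=utf-8-sig`. A likely reason, stated here as an assumption, is that the source file starts with a UTF-8 byte order mark (BOM). The sketch below uses only Python's standard `csv` module and made-up sample data to show how the BOM leaks into the first column name under `utf-8` but is stripped under `utf-8-sig`:

```python
import csv
import io

# Hypothetical CSV content that begins with a UTF-8 BOM.
raw = "\ufefftitle,description\nmug,ceramic mug\n".encode("utf-8")


def read_rows(data: bytes, encoding: str):
    """Decode CSV bytes with the given encoding and return a list of row dicts."""
    text = io.StringIO(data.decode(encoding), newline="")
    return list(csv.DictReader(text))


plain = read_rows(raw, "utf-8")
bom_aware = read_rows(raw, "utf-8-sig")

# With plain utf-8 the first header keeps the invisible BOM character.
print([repr(key) for key in plain[0].keys()])
# With utf-8-sig the header names are clean.
print([repr(key) for key in bom_aware[0].keys()])
```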

Build with the Python SDK:

```python
from starwhale import Dataset
ds = Dataset.from_csv(path="http://example.com/data.csv", name="my-csv-dataset")
```

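### json/jsonl files

A local or remote json/jsonl file can be converted directly into a Starwhale dataset from the command line or the Python SDK:

- One or more local json/jsonl files are supported.
- A local directory can be searched recursively for json/jsonl files.
- One or more remote json/jsonl files, specified as http urls, are supported.

For json files:

- By default, the parsed json object is assumed to be a list in which each element is a dict; each dict maps to one row of the Starwhale dataset.
- Use the `--field-selector` option or the `field_selector` parameter to locate a specific list.

For example, given the following json file: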

```json
{
    "p1": {
        "p2": {
            "p3": [
                {"a": 1, "b": 2},
                {"a": 10, "b": 20}
            ]
        }
    }
}
```

Then setting `--field-selector=p1.p2.p3` adds exactly two rows to the dataset.
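
To make the selector semantics concrete, here is a minimal sketch of how a dotted path such as `p1.p2.p3` can be resolved against parsed json. The `select_field` helper is hypothetical and only approximates the behavior described here; it is not Starwhale's implementation.

```python
import json
from typing import Any, List


def select_field(obj: Any, selector: str) -> List[dict]:
    """Walk a dotted selector through nested dicts and return the list found there."""
    current = obj
    for key in selector.split("."):
        current = current[key]  # raises KeyError if the path does not exist
    if not isinstance(current, list):
        raise ValueError(f"selector {selector!r} does not point to a list")
    return current


doc = json.loads('{"p1": {"p2": {"p3": [{"a": 1, "b": 2}, {"a": 10, "b": 20}]}}}')
rows = select_field(doc, "p1.p2.p3")

# Each dict in the selected list corresponds to one dataset row.
for index, row in enumerate(rows):
    print(index, row)
```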

Build from the command line:

```console
swcli dataset build --json https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo\?Revision\=master\&FilePath\=train.jsonl
🚧 start to build dataset bundle...
👷 uri local/project/self/dataset/json-b0o2zsvg/version/latest
🌊 creating dataset local/project/self/dataset/json-b0o2zsvg/version/q3uoziwqligxdggncqywpund75jz55h3bne6a5la...
🦋 update 906 records into dataset
🌺 congratulation! dataset build from ('https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo?Revision=master&FilePath=train.jsonl',) has been built. You can run swcli dataset info json-b0o2zsvg/version/q3uoziwqligx
```
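
The command above points `--json` at a `.jsonl` file. In the JSON Lines format each non-empty line is a standalone json object, so a reasonable mental model (an assumption about the mapping, not a description of Starwhale internals) is one dataset row per line. The sample text below is made up for illustration:

```python
import json

# Two hypothetical records in JSON Lines form: one json object per line.
jsonl_text = (
    '{"en": "hello", "zh-cn": "你好"}\n'
    '{"en": "how are you", "zh-cn": "最近怎么样"}\n'
)

# Parse line by line, skipping blank lines.
rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

print(len(rows))
print(rows[0]["en"])
```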

Build with the Python SDK:

```python
from starwhale import Dataset
myds = Dataset.from_json(
    name="translation",
    text='{"content":{"child_content":[{"en":"hello","zh-cn":"你好"},{"en":"how are you","zh-cn":"最近怎么样"}]}}',
    field_selector="content.child_content"
)
print(myds[0].features["zh-cn"])
```

```console
🌊 creating dataset local/project/self/dataset/translation/version/kblfn5zh4cpoqxqbhgdfbvonulr2zefp6lojq44y...
🦋 update 2 records into dataset
你好
```

## Build from the Huggingface Datasets Hub

The Huggingface Hub hosts a large number of datasets, and each can be converted into a Starwhale dataset with one line of code or one command.

:::tip
Converting Huggingface Datasets requires the [datasets](https://pypi.org/project/datasets/) library.
:::

Command line:

```console
swcli dataset build -hf lambdalabs/pokemon-blip-captions --name pokemon
```

Python SDK:

```python
from starwhale import Dataset

# You only need to specify the expected Starwhale dataset name and the Huggingface repo name
# example: https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions
ds = Dataset.from_huggingface("pokemon", "lambdalabs/pokemon-blip-captions")
print(ds)
print(len(ds))
print(repr(ds.fetch_one()))
```

## Build with a Python script using the Starwhale SDK

## Build with swcli dataset build and a Python handler
178 changes: 178 additions & 0 deletions i18n/zh/docusaurus-plugin-content-docs/current/dataset/integration.md
@@ -0,0 +1,178 @@
---
title: Dataset Integration with Other ML Libraries
---

Starwhale datasets integrate well with popular ML libraries such as Pillow, Numpy, Huggingface Datasets, Pytorch, and Tensorflow, which makes data conversion convenient.

## Pillow

The [Starwhale Image](../reference/sdk/type#image) type and the [Pillow Image](https://pillow.readthedocs.io/en/stable/reference/Image.html) object can be converted in both directions.

### Convert a Starwhale Image to a Pillow Image

```python
from starwhale import dataset

# Log in to the cloud instance in advance with the `swcli instance login` command or the `starwhale.login` SDK function
# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files
ds = dataset("https://cloud.starwhale.cn/project/starwhale:object-detection/dataset/coco128/v2")
img = ds.head(n=1)[0].features.image

pil = img.to_pil()
print(pil)
print(pil.size)
```

```console
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x480 at 0x7F77FBA98250>
(640, 480)
```

### Initialize a Starwhale Image from a Pillow Image

```python
import numpy
from PIL import Image as PILImage
from starwhale import Image

# generate a random image
random_array = numpy.random.randint(low=0, high=256, size=(100, 100, 3), dtype=numpy.uint8)
pil = PILImage.fromarray(random_array, mode="RGB")

img = Image(pil)
print(img)
```

```console
ArtifactType.Image, display:, mime_type:MIMEType.UNDEFINED, shape:[None, None, 3], encoding:
```

## Numpy

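### Convert to numpy.ndarray

The following Starwhale data types can be converted to numpy.ndarray objects:

* Image: converted to a Pillow Image first, then to a numpy.ndarray object.
* Video: the video bytes are converted directly to a numpy.ndarray object.
* Audio: the soundfile library is called to convert the audio bytes to a numpy.ndarray object.
* BoundingBox: converted to a numpy.ndarray object in xywh format.
* Binary: the bytes are converted directly to a numpy.ndarray object.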

```python
from starwhale import dataset

# Log in to the cloud instance in advance with the `swcli instance login` command or the `starwhale.login` SDK function
# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files
ds = dataset("https://cloud.starwhale.cn/project/starwhale:object-detection/dataset/coco128/v2")

item = ds.head(n=1)[0]

img = item.features.image
img_array = img.to_numpy()
print(type(img_array))
print(img_array.shape)

bbox = item.features.annotations[0]["bbox"]
print(bbox)
print(bbox.to_numpy())
```

```console
<class 'numpy.ndarray'>
(480, 640, 3)
BoundingBox[XYWH]- x:1.0799999999999699, y:187.69008, width:611.5897600000001, height:285.84000000000003
array([ 1.08 , 187.69008, 611.58976, 285.84 ])
```

### Initialize a Starwhale Image from a numpy.ndarray

When an image is represented as a numpy.ndarray object, it can be used to initialize a Starwhale Image object.

```python
import numpy
from starwhale import Image

# generate a random image numpy.ndarray
random_array = numpy.random.randint(low=0, high=256, size=(100, 100, 3), dtype=numpy.uint8)
img = Image(random_array)
print(img)
```

```console
ArtifactType.Image, display:, mime_type:MIMEType.UNDEFINED, shape:[None, None, 3], encoding:
```

## Huggingface Datasets

The Huggingface Hub hosts a large number of datasets, and each can be converted into a Starwhale dataset with one line of code.

:::tip
Converting Huggingface Datasets requires the [datasets](https://pypi.org/project/datasets/) library.
:::

```python
from starwhale import Dataset

# You only need to specify the expected Starwhale dataset name and the Huggingface repo name
# example: https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions
ds = Dataset.from_huggingface("pokemon", "lambdalabs/pokemon-blip-captions")
print(ds)
print(len(ds))
print(repr(ds.fetch_one()))
```

```console
🌊 creating dataset local/project/self/dataset/pokemon/version/r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise...
🦋 update 833 records into dataset
Dataset: pokemon, stash version: r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise, loading version: r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise
833
index:default/train/0, features:{'image': ArtifactType.Image, display:, mime_type:MIMEType.JPEG, shape:[1280, 1280, 3], encoding: , 'text': 'a drawing of a green pokemon with red eyes', '_hf_subset': 'default', '_hf_split': 'train'}, shadow dataset: None
```

## Pytorch

A Starwhale Dataset can be converted into a Pytorch [torch.utils.data.IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) object, and the conversion accepts a transform. The converted Pytorch dataset object can then be passed to a Pytorch dataloader, a Huggingface Trainer, and so on.

```python
from starwhale import dataset
import torch.utils.data as tdata

def custom_transform(data):
    data["label"] = data["label"] + 100
    return data

with dataset("simple", create="empty") as ds:
    for i in range(0, 10):
        ds[i] = {"text": f"{i}-text", "label": i}
    ds.commit()

torch_ds = ds.to_pytorch(transform=custom_transform)
torch_loader = tdata.DataLoader(torch_ds, batch_size=1)
item = next(iter(torch_loader))
print(item)
print(item["label"])
```

```console
{'text': ['0-text'], 'label': tensor([100])}
tensor([100])
```
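
The output above shows label `0` arriving as `100` because the transform runs on every row before batching. A minimal stand-in without Pytorch or Starwhale illustrates that per-row flow; `iter_with_transform` is a hypothetical helper and only approximates what an iterable dataset with a transform does.

```python
def custom_transform(data):
    # Same transform as in the example above: shift every label by 100.
    data["label"] = data["label"] + 100
    return data


def iter_with_transform(rows, transform):
    """Yield each row after applying `transform` to a copy of it."""
    for row in rows:
        yield transform(dict(row))  # copy so the source rows stay unchanged


# Rows shaped like the ones written to the "simple" dataset.
rows = [{"text": f"{i}-text", "label": i} for i in range(10)]
transformed = list(iter_with_transform(rows, custom_transform))

print(transformed[0])
print(rows[0])
```

Copying each row before transforming it keeps the stored data untouched, which matters when the same rows are iterated more than once.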

## Tensorflow

A Starwhale Dataset can be converted into a Tensorflow [tensorflow.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) object. A transform function is also supported for transforming the data.

```python
from starwhale import dataset

# Log in to the cloud instance in advance with the `swcli instance login` command or the `starwhale.login` SDK function
# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files
ds = dataset("https://cloud.starwhale.cn/project/starwhale:helloworld/dataset/mnist64")
tf_ds = ds.to_tensorflow()
print(tf_ds)
```

```console
<_FlatMapDataset element_spec={'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'img': TensorSpec(shape=(8, 8, 1), dtype=tf.uint8, name=None)}>
```
@@ -0,0 +1,3 @@
---
title: Loading Datasets
---
@@ -0,0 +1,3 @@
---
title: Dataset Version Control
---