add dataset guide

star-whale · Dec 12, 2023 · a5a6874
1 parent 873065a
commit a5a6874
Showing 12 changed files with 401 additions and 4 deletions.
diff --git a/docs/dataset/build.md b/docs/dataset/build.md
diff --git a/docs/dataset/integration.md b/docs/dataset/integration.md
diff --git a/docs/dataset/load.md b/docs/dataset/load.md
diff --git a/docs/dataset/version.md b/docs/dataset/version.md
diff --git a/docs/dataset/view.md b/docs/dataset/view.md
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md
@@ -0,0 +1,206 @@
+---
+title: Dataset Build
+---
+
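+Starwhale datasets can be built in several ways: directly from image, audio, video, csv, json, or jsonl files, with a Python script, or by importing a dataset from Huggingface Hub.
+
+## Build directly from data files
+
+### Images
+
+Starwhale can recursively traverse the image files in a directory and build a Starwhale dataset from them without any code:
+
+- Supported image file formats: `png/jpg/jpeg/webp/svg/apng`
+- Images are converted to the Starwhale.Image type and can be viewed in the Starwhale Server web UI.
+- Two methods are supported: the command line `swcli dataset build --image` and the Python SDK `starwhale.Dataset.from_folder`.
+- label mechanism: when the SDK sets `auto_label=True` or the command line sets `--auto-label`, the parent directory name is used as the `label`.
+- metadata mechanism: place a `metadata.csv` or `metadata.jsonl` file in the root directory to add extra columns to the dataset.
+- caption mechanism: when a `{image-name}.txt` file is found in the same directory, its content is imported automatically into the `caption` column.
+
+Suppose the folder directory contains the following four files: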
+
+```console
+folder/dog/1.png
+folder/cat/2.png
+folder/dog/3.png
+folder/cat/4.png
+```
+
+Build with the command line:
+
+```console
+❯ swcli dataset build --image folder --name image-folder
+🚧 start to build dataset bundle...
+👷 uri local/project/self/dataset/image-folder/version/latest
+🌊 creating dataset local/project/self/dataset/image-folder/version/uw6mdisnf7alg4t4fs2myfug4ie4636w3x4jqcu2...
+🦋 update 4 records into dataset
+🌺 congratulation! you can run  swcli dataset info image-folder/version/uw6mdisnf7al
+```
+
+```console
+❯ swcli dataset head image-folder -n 2
+row  ───────────────────────────────────────
+🌳 id: cat/2.png 
+🌀 features:
+         🔅 file_name : cat/2.png 
+         🔅 label : cat 
+         🔅 file : ArtifactType.Image, display:2.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding:  
+row  ───────────────────────────────────────
+🌳 id: cat/4.png 
+🌀 features:
+         🔅 file_name : cat/4.png 
+         🔅 label : cat 
+         🔅 file : ArtifactType.Image, display:4.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding:  
+```
+
+
+Build with the Python SDK:
+
+```python
+from starwhale import Dataset
+ds = Dataset.from_folder("folder", kind="image")
+print(ds)
+print(ds.fetch_one().features)
+```
+
+```console
+🌊 creating dataset local/project/self/dataset/folder/version/nyc2ay4gnyayv4zqalovdgakl3k2esvrne42cjna...
+🦋 update 4 records into dataset
+Dataset: folder, stash version: d22hdiwyakdfh5xitcpn2s32gblfbhrerzczkb63, loading version: nyc2ay4gnyayv4zqalovdgakl3k2esvrne42cjna
+{'file_name': 'cat/2.png', 'label': 'cat', 'file': ArtifactType.Image, display:2.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding: }
+```
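The label mechanism described above can be illustrated with a minimal standard-library sketch. It is not Starwhale's implementation; the helper name `auto_label_rows` is hypothetical, and the sketch only assumes that the label is the name of each file's parent directory, as the `file_name` and `label` values in the output above suggest.

```python
from pathlib import PurePosixPath


def auto_label_rows(relative_paths):
    """Build one row per file, using the parent directory name as the label.

    Hypothetical helper for illustration only; not part of the Starwhale SDK.
    """
    rows = []
    for rel in sorted(relative_paths):
        path = PurePosixPath(rel)
        rows.append({"file_name": str(path), "label": path.parent.name})
    return rows


# The four files from the example folder, given relative to the folder root.
rows = auto_label_rows(["dog/1.png", "cat/2.png", "dog/3.png", "cat/4.png"])
for row in rows:
    print(row)
```

Sorting the paths is a choice made for this sketch so that the result is deterministic; the ordering Starwhale itself uses is not stated in this guide.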
+
+
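+### Videos
+
+Starwhale can recursively traverse the video files in a directory and build a Starwhale dataset from them without any code:
+
+- Supported video file formats: `mp4/webm/avi`
+- Videos are converted to the Starwhale.Video type and can be viewed in the Starwhale Server web UI.
+- Two methods are supported: the command line `swcli dataset build --video` and the Python SDK `starwhale.Dataset.from_folder`.
+- The label, caption, and metadata mechanisms work the same way as for images.
+
+### Audio
+
+Starwhale can recursively traverse the audio files in a directory and build a Starwhale dataset from them without any code:
+
+- Supported audio file formats: `mp3/wav`
+- Audio files are converted to the Starwhale.Audio type and can be viewed in the Starwhale Server web UI.
+- Two methods are supported: the command line `swcli dataset build --audio` and the Python SDK `starwhale.Dataset.from_folder`.
+- The label, caption, and metadata mechanisms work the same way as for images.
+
+### csv files
+
+The command line or the Python SDK can convert local or remote csv files directly into a Starwhale dataset:
+
+- One or more local csv files are supported.
+- A local directory can be searched recursively for csv files.
+- One or more remote csv files specified by http url are supported.
+
+Build with the command line: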
+
+```console
+❯ swcli dataset build --name product-desc-modelscope --csv https://modelscope.cn/api/v1/datasets/lcl193798/product_description_generation/repo\?Revision\=master\&FilePath\=test.csv --encoding=utf-8-sig
+🚧 start to build dataset bundle...
+👷 uri local/project/self/dataset/product-desc-modelscope/version/latest
+🌊 creating dataset local/project/self/dataset/product-desc-modelscope/version/wzaz4ccodpyj4jelgupljreyida2bleg5xp7viwe...
+🦋 update 3848 records into dataset
+🌺 congratulation! dataset build from csv files(('https://modelscope.cn/api/v1/datasets/lcl193798/product_description_generation/repo?Revision=master&FilePath=test.csv',)) has been built. You can run  swcli dataset info product-desc-modelscope/version/wzaz4ccodpyj
+```
+
+Build with the Python SDK:
+
+```python
+from starwhale import Dataset
+ds = Dataset.from_csv(path="http://example.com/data.csv", name="my-csv-dataset")
+```
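The command-line example above passes `--encoding=utf-8-sig`. The sketch below uses only the Python standard library to show what that codec changes when a csv file begins with a byte order mark (BOM). The sample bytes and the helper `read_header` are made up for illustration, and whether the ModelScope file actually starts with a BOM is an assumption.

```python
import csv
import io

# Made-up sample: a UTF-8 encoded csv whose first bytes are a BOM.
raw = "\ufeffid,title\n1,red shoes\n".encode("utf-8")


def read_header(data, encoding):
    """Decode the bytes with the given codec and return the csv header row."""
    text = io.StringIO(data.decode(encoding))
    return next(csv.reader(text))


plain_header = read_header(raw, "utf-8")         # BOM stays glued to the first column name
bom_safe_header = read_header(raw, "utf-8-sig")  # BOM is stripped during decoding

print(plain_header)
print(bom_safe_header)
```

With plain `utf-8` the first column name carries an invisible `\ufeff` prefix, which would make later lookups by column name fail; `utf-8-sig` avoids that.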
+
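+### json/jsonl files
+
+The command line or the Python SDK can convert local or remote json/jsonl files directly into a Starwhale dataset:
+
+- One or more local json/jsonl files are supported.
+- A local directory can be searched recursively for json/jsonl files.
+- One or more remote json/jsonl files specified by http url are supported.
+
+For json files:
+
+- By default, the parsed json object is expected to be a list whose elements are dicts; each dict maps to one row of the Starwhale dataset.
+- Use the `--field-selector` option or the `field_selector` parameter to locate a specific list.
+
+For example, given the following json file: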
+
+```json
+{
+    "p1": {
+        "p2":{
+            "p3": [
+                {"a": 1, "b": 2},
+                {"a": 10, "b": 20}
+            ]
+        }
+    }
+}
+```
+
+Setting `--field-selector=p1.p2.p3` then adds exactly these two rows to the dataset.
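How a dotted selector such as `p1.p2.p3` can locate the list is sketched below with the standard library. This is an illustration of the idea, not Starwhale's parser; the function name `select_field` and the error handling are assumptions.

```python
import json


def select_field(document, selector):
    """Walk a dotted selector through nested dicts and return the list found there.

    Hypothetical helper for illustration only; not part of the Starwhale SDK.
    """
    node = document
    for key in selector.split("."):
        node = node[key]
    if not isinstance(node, list):
        raise ValueError(f"selector {selector!r} does not point to a list")
    return node


doc = json.loads('{"p1": {"p2": {"p3": [{"a": 1, "b": 2}, {"a": 10, "b": 20}]}}}')
rows = select_field(doc, "p1.p2.p3")
for row in rows:
    print(row)
```

Each dict in the selected list corresponds to one dataset row, which is why the example yields two rows.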
+
+Build with the command line:
+
+```console
+❯ swcli dataset build --json https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo\?Revision\=master\&FilePath\=train.jsonl
+🚧 start to build dataset bundle...
+👷 uri local/project/self/dataset/json-b0o2zsvg/version/latest
+🌊 creating dataset local/project/self/dataset/json-b0o2zsvg/version/q3uoziwqligxdggncqywpund75jz55h3bne6a5la...
+🦋 update 906 records into dataset
+🌺 congratulation! dataset build from ('https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo?Revision=master&FilePath=train.jsonl',) has been built. You can run  swcli dataset info json-b0o2zsvg/version/q3uoziwqligx
+```
+
+Build with the Python SDK:
+
+```python
+from starwhale import Dataset
+myds = Dataset.from_json(
+    name="translation",
+    text='{"content":{"child_content":[{"en":"hello","zh-cn":"你好"},{"en":"how are you","zh-cn":"最近怎么样"}]}}',
+    field_selector="content.child_content"
+)
+print(myds[0].features["zh-cn"])
+```
+
+```console
+🌊 creating dataset local/project/self/dataset/translation/version/kblfn5zh4cpoqxqbhgdfbvonulr2zefp6lojq44y...
+🦋 update 2 records into dataset
+你好
+```
+
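+## Build from Huggingface Datasets Hub
+
+Huggingface Hub hosts a large number of datasets, and a single line of code or a single command converts one of them into a Starwhale dataset.
+
+:::tip
+Converting Huggingface Datasets requires the [datasets](https://pypi.org/project/datasets/) library.
+:::
+
+Command line: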
+
+```console
+swcli dataset build -hf lambdalabs/pokemon-blip-captions --name pokemon
+```
+
+Python SDK:
+
+```python
+from starwhale import Dataset
+
+# Only the target Starwhale dataset name and the Huggingface repo name are required
+# example: https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions
+ds = Dataset.from_huggingface("pokemon", "lambdalabs/pokemon-blip-captions")
+print(ds)
+print(len(ds))
+print(repr(ds.fetch_one()))
+```
+
+## Build with a Python script using the Starwhale SDK
+
+## Build with swcli dataset build and a Python handler
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/integration.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/integration.md
@@ -0,0 +1,178 @@
+---
+title: Integrating Datasets with Other ML Libraries
+---
+
+Starwhale datasets integrate well with popular ML libraries such as Pillow, Numpy, Huggingface Datasets, Pytorch, and Tensorflow, which makes data conversion convenient.
+
+## Pillow
+
+The [Starwhale Image](../reference/sdk/type#image) type and [Pillow Image](https://pillow.readthedocs.io/en/stable/reference/Image.html) objects can be converted in both directions.
+
+### Convert a Starwhale Image to a Pillow Image
+
+```python
+from starwhale import dataset
+
+# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk 
+# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files
+ds = dataset("https://cloud.starwhale.cn/project/starwhale:object-detection/dataset/coco128/v2")
+img = ds.head(n=1)[0].features.image
+
+pil = img.to_pil()
+print(pil)
+print(pil.size)
+```
+
+```console
+<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x480 at 0x7F77FBA98250>
+(640, 480)
+```
+
+### Initialize a Starwhale Image from a Pillow Image
+
+```python
+import numpy
+from PIL import Image as PILImage
+from starwhale import Image
+
+# generate a random image
+random_array = numpy.random.randint(low=0, high=256, size=(100, 100, 3), dtype=numpy.uint8)
+pil = PILImage.fromarray(random_array, mode="RGB")
+
+img = Image(pil)
+print(img)
+```
+
+```console
+ArtifactType.Image, display:, mime_type:MIMEType.UNDEFINED, shape:[None, None, 3], encoding: 
+```
+
+## Numpy
+
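+### Convert to numpy.ndarray
+
+The following Starwhale data types can be converted to numpy.ndarray objects:
+
+* Image: converted to a Pillow Image first, then to a numpy.ndarray object.
+* Video: the video bytes are converted directly to a numpy.ndarray object.
+* Audio: the soundfile library converts the audio bytes to a numpy.ndarray object.
+* BoundingBox: converted to a numpy.ndarray object in xywh format.
+* Binary: the bytes are converted directly to a numpy.ndarray object.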
+
+```python
+from starwhale import dataset
+
+# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk 
+# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files
+ds = dataset("https://cloud.starwhale.cn/project/starwhale:object-detection/dataset/coco128/v2")
+
+item = ds.head(n=1)[0]
+
+img = item.features.image
+img_array = img.to_numpy()
+print(type(img_array))
+print(img_array.shape)
+
+bbox = item.features.annotations[0]["bbox"]
+print(bbox)
+print(repr(bbox.to_numpy()))
+```
+
+```console
+<class 'numpy.ndarray'>
+(480, 640, 3)
+BoundingBox[XYWH]- x:1.0799999999999699, y:187.69008, width:611.5897600000001, height:285.84000000000003
+array([  1.08   , 187.69008, 611.58976, 285.84   ])
+```
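The xywh layout printed above stores the box as x, y, width, height. Many detection tools expect corner coordinates instead, so a small conversion sketch may help. It is plain Python, independent of the Starwhale SDK, and assumes that x and y denote the top-left corner; `xywh_to_xyxy` is a hypothetical helper name.

```python
def xywh_to_xyxy(box):
    """Convert [x, y, width, height] to [x_min, y_min, x_max, y_max].

    Assumes (x, y) is the top-left corner of the box.
    """
    x, y, w, h = box
    return [x, y, x + w, y + h]


# Values taken from the bounding box printed in the example output.
corners = xywh_to_xyxy([1.08, 187.69008, 611.58976, 285.84])
print(corners)
```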
+
+### Initialize a Starwhale Image from numpy.ndarray
+
+When an image is represented as a numpy.ndarray object, that object can be used to initialize a Starwhale Image object.
+
+```python
+import numpy
+from starwhale import Image
+
+# generate a random image numpy.ndarray
+random_array = numpy.random.randint(low=0, high=256, size=(100, 100, 3), dtype=numpy.uint8)
+img = Image(random_array)
+print(img)
+```
+
+```console
+ArtifactType.Image, display:, mime_type:MIMEType.UNDEFINED, shape:[None, None, 3], encoding: 
+```
+
+## Huggingface Datasets
+
+Huggingface Hub hosts a large number of datasets, and a single line of code converts one of them into a Starwhale dataset.
+
+:::tip
+Converting Huggingface Datasets requires the [datasets](https://pypi.org/project/datasets/) library.
+:::
+
+```python
+from starwhale import Dataset
+
+# Only the target Starwhale dataset name and the Huggingface repo name are required
+# example: https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions
+ds = Dataset.from_huggingface("pokemon", "lambdalabs/pokemon-blip-captions")
+print(ds)
+print(len(ds))
+print(repr(ds.fetch_one()))
+```
+
+```console
+🌊 creating dataset local/project/self/dataset/pokemon/version/r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise...
+🦋 update 833 records into dataset
+Dataset: pokemon, stash version: r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise, loading version: r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise
+833
+index:default/train/0, features:{'image': ArtifactType.Image, display:, mime_type:MIMEType.JPEG, shape:[1280, 1280, 3], encoding: , 'text': 'a drawing of a green pokemon with red eyes', '_hf_subset': 'default', '_hf_split': 'train'}, shadow dataset: None
+```
+
+## Pytorch
+
+A Starwhale Dataset can be converted to a Pytorch [torch.utils.data.IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) object, and the conversion accepts a transform function. The resulting Pytorch dataset object can then be passed to a Pytorch dataloader, the Huggingface Trainer, and similar consumers.
+
+```python
+from starwhale import dataset
+import torch.utils.data as tdata
+
+def custom_transform(data):
+    data["label"] = data["label"] + 100
+    return data
+
+with dataset("simple", create="empty") as ds:
+    for i in range(0, 10):
+        ds[i] = {"text": f"{i}-text", "label": i}
+    ds.commit()
+
+    torch_ds = ds.to_pytorch(transform=custom_transform)
+    torch_loader = tdata.DataLoader(torch_ds, batch_size=1)
+    item = next(iter(torch_loader))
+    print(item)
+    print(item["label"])
+```
+
+```console
+{'text': ['0-text'], 'label': tensor([100])}
+tensor([100])
+```
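The output shows the label already shifted by 100, which is what a per-row transform on an iterable dataset produces. The sketch below reproduces that idea with a plain generator and no Pytorch dependency. It is an assumption about the general pattern, not the code behind `to_pytorch`; `iter_with_transform` is a hypothetical name.

```python
def iter_with_transform(rows, transform=None):
    """Yield rows lazily, applying the optional transform to a copy of each row."""
    for row in rows:
        item = dict(row)  # copy so the stored row is left untouched
        yield transform(item) if transform else item


def custom_transform(data):
    data["label"] = data["label"] + 100
    return data


rows = [{"text": f"{i}-text", "label": i} for i in range(10)]
first = next(iter_with_transform(rows, transform=custom_transform))
print(first)
```

Because the transform runs only when a row is requested, nothing is computed for rows that are never iterated, which is the main appeal of the iterable style for large datasets.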
+
+## Tensorflow
+
+A Starwhale Dataset can be converted to a Tensorflow [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) object. The conversion also supports a transform function for modifying the data.
+
+```python
+from starwhale import dataset
+
+# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk 
+# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files
+ds = dataset("https://cloud.starwhale.cn/project/starwhale:helloworld/dataset/mnist64")
+tf_ds = ds.to_tensorflow()
+print(tf_ds)
+```
+
+```console
+<_FlatMapDataset element_spec={'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'img': TensorSpec(shape=(8, 8, 1), dtype=tf.uint8, name=None)}>
+```
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md
@@ -0,0 +1,3 @@
+---
+title: Dataset Loading
+---
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md
@@ -0,0 +1,3 @@
+---
+title: Dataset Version Control
+---