diff --git a/docs/dataset/build.md b/docs/dataset/build.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/dataset/integration.md b/docs/dataset/integration.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/dataset/load.md b/docs/dataset/load.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/dataset/version.md b/docs/dataset/version.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/dataset/view.md b/docs/dataset/view.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md
new file mode 100644
index 000000000..a5e92b76a
--- /dev/null
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md
@@ -0,0 +1,206 @@
+---
+title: Dataset Building
+---
+
+Starwhale dataset building is very flexible: datasets can be built from image/audio/video/csv/json/jsonl files, from Python scripts, or imported from the Huggingface Hub.
+
+## Building directly from data files
+
+### Images
+
+Image files in a directory can be scanned recursively to build a Starwhale dataset without writing any code:
+
+- Supported image formats: `png/jpg/jpeg/webp/svg/apng`.
+- Images are converted to the Starwhale.Image type and can be viewed in the Starwhale Server web UI.
+- Both the `swcli dataset build --image` command and the Python SDK `starwhale.Dataset.from_folder` are supported.
+- Label mechanism: when `auto_label=True` is set in the SDK, or `--auto-label` is passed on the command line, the parent directory name is used as the `label`.
+- Metadata mechanism: a `metadata.csv` or `metadata.jsonl` file in the root directory can add extra columns to the dataset (see the sketch at the end of this section).
+- Caption mechanism: when a `{image-name}.txt` file is found in the same directory, its content is automatically imported into the `caption` column.
+
+Assume the `folder` directory contains the following four files:
+
+```console
+folder/dog/1.png
+folder/cat/2.png
+folder/dog/3.png
+folder/cat/4.png
+```
+
+Building from the command line:
+
+```console
+❯ swcli dataset build --image folder --name image-folder
+🚧 start to build dataset bundle...
+👷 uri local/project/self/dataset/image-folder/version/latest
+🌊 creating dataset local/project/self/dataset/image-folder/version/uw6mdisnf7alg4t4fs2myfug4ie4636w3x4jqcu2...
+🦋 update 4 records into dataset
+🌺 congratulation! you can run swcli dataset info image-folder/version/uw6mdisnf7al
+```
+
+```console
+❯ swcli dataset head image-folder -n 2
+row ───────────────────────────────────────
+🌳 id: cat/2.png
+🌀 features:
+       🔅 file_name : cat/2.png
+       🔅 label : cat
+       🔅 file : ArtifactType.Image, display:2.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding:
+row ───────────────────────────────────────
+🌳 id: cat/4.png
+🌀 features:
+       🔅 file_name : cat/4.png
+       🔅 label : cat
+       🔅 file : ArtifactType.Image, display:4.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding:
+```
+
+Building with the Python SDK:
+
+```python
+from starwhale import Dataset
+ds = Dataset.from_folder("folder", kind="image")
+print(ds)
+print(ds.fetch_one().features)
+```
+
+```console
+🌊 creating dataset local/project/self/dataset/folder/version/nyc2ay4gnyayv4zqalovdgakl3k2esvrne42cjna...
+🦋 update 4 records into dataset
+Dataset: folder, stash version: d22hdiwyakdfh5xitcpn2s32gblfbhrerzczkb63, loading version: nyc2ay4gnyayv4zqalovdgakl3k2esvrne42cjna
+{'file_name': 'cat/2.png', 'label': 'cat', 'file': ArtifactType.Image, display:2.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding: }
+```
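+
+A hypothetical sketch of the metadata mechanism mentioned above: a `folder/metadata.jsonl` placed in the root directory could add an extra column to every row. The `photographer` column and its values are invented for illustration, and it is assumed here that entries are matched to image files through the `file_name` field shown in the rows above (JSON lines cannot carry comments, so these assumptions are stated only in this paragraph):
+
+```json
+{"file_name": "dog/1.png", "photographer": "alice"}
+{"file_name": "cat/2.png", "photographer": "bob"}
+{"file_name": "dog/3.png", "photographer": "alice"}
+{"file_name": "cat/4.png", "photographer": "bob"}
+```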
+
+### Videos
+
+Video files in a directory can be scanned recursively to build a Starwhale dataset without writing any code:
+
+- Supported video formats: `mp4/webm/avi`.
+- Videos are converted to the Starwhale.Video type and can be viewed in the Starwhale Server web UI.
+- Both the `swcli dataset build --video` command and the Python SDK `starwhale.Dataset.from_folder` are supported.
+- The label, caption, and metadata mechanisms are the same as for images.
+
+### Audio
+
+Audio files in a directory can be scanned recursively to build a Starwhale dataset without writing any code:
+
+- Supported audio formats: `mp3/wav`.
+- Audio files are converted to the Starwhale.Audio type and can be viewed in the Starwhale Server web UI.
+- Both the `swcli dataset build --audio` command and the Python SDK `starwhale.Dataset.from_folder` are supported.
+- The label, caption, and metadata mechanisms are the same as for images.
+
+### CSV files
+
+Local or remote CSV files can be converted into a Starwhale dataset directly, either from the command line or with the Python SDK:
+
+- One or more local CSV files are supported.
+- Recursively searching a local directory for CSV files is supported.
+- One or more remote CSV files specified by HTTP URLs are supported.
+
+Building from the command line:
+
+```console
+❯ swcli dataset build --name product-desc-modelscope --csv https://modelscope.cn/api/v1/datasets/lcl193798/product_description_generation/repo\?Revision\=master\&FilePath\=test.csv --encoding=utf-8-sig
+🚧 start to build dataset bundle...
+👷 uri local/project/self/dataset/product-desc-modelscope/version/latest
+🌊 creating dataset local/project/self/dataset/product-desc-modelscope/version/wzaz4ccodpyj4jelgupljreyida2bleg5xp7viwe...
+🦋 update 3848 records into dataset
+🌺 congratulation! dataset build from csv files(('https://modelscope.cn/api/v1/datasets/lcl193798/product_description_generation/repo?Revision=master&FilePath=test.csv',)) has been built. You can run swcli dataset info product-desc-modelscope/version/wzaz4ccodpyj
+```
+
+Building with the Python SDK:
+
+```python
+from starwhale import Dataset
+ds = Dataset.from_csv(path="http://example.com/data.csv", name="my-csv-dataset")
+```
+
+### JSON/JSONL files
+
+Local or remote JSON/JSONL files can be converted into a Starwhale dataset directly, either from the command line or with the Python SDK:
+
+- One or more local JSON/JSONL files are supported.
+- Recursively searching a local directory for JSON/JSONL files is supported.
+- One or more remote JSON/JSONL files specified by HTTP URLs are supported.
+
+For JSON files:
+
+- By default, the parsed JSON object is expected to be a list whose elements are dicts; each dict is mapped to one row of the Starwhale dataset.
+- The `--field-selector` option or the `field_selector` argument can be used to locate a specific list inside the document.
+
+For example, given the following JSON file:
+
+```json
+{
+    "p1": {
+        "p2":{
+            "p3": [
+                {"a": 1, "b": 2},
+                {"a": 10, "b": 20}
+            ]
+        }
+    }
+}
+```
+
+setting `--field-selector=p1.p2.p3` adds exactly those two rows to the dataset.
+
+Building from the command line:
+
+```console
+❯ swcli dataset build --json https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo\?Revision\=master\&FilePath\=train.jsonl
+🚧 start to build dataset bundle...
+👷 uri local/project/self/dataset/json-b0o2zsvg/version/latest
+🌊 creating dataset local/project/self/dataset/json-b0o2zsvg/version/q3uoziwqligxdggncqywpund75jz55h3bne6a5la...
+🦋 update 906 records into dataset
+🌺 congratulation! dataset build from ('https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo?Revision=master&FilePath=train.jsonl',) has been built. You can run swcli dataset info json-b0o2zsvg/version/q3uoziwqligx
+```
+
+Building with the Python SDK:
+
+```python
+from starwhale import Dataset
+myds = Dataset.from_json(
+    name="translation",
+    text='{"content":{"child_content":[{"en":"hello","zh-cn":"你好"},{"en":"how are you","zh-cn":"最近怎么样"}]}}',
+    field_selector="content.child_content"
+)
+print(myds[0].features["zh-cn"])
+```
+
+```console
+🌊 creating dataset local/project/self/dataset/translation/version/kblfn5zh4cpoqxqbhgdfbvonulr2zefp6lojq44y...
+🦋 update 2 records into dataset
+你好
+```
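+
+Local files work the same way as remote URLs. A minimal sketch, assuming a hypothetical local file `data/train.jsonl`; whether a directory can be passed to `--json` directly is an assumption based on the recursive-search behavior listed above, and no output is shown because it mirrors the examples earlier in this section:
+
+```console
+❯ swcli dataset build --json data/train.jsonl --name my-json-dataset
+# assumption: a directory can also be passed and is searched recursively for json/jsonl files
+❯ swcli dataset build --json data/ --name my-json-dir-dataset
+```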
+
+## Building from the Huggingface Datasets Hub
+
+The Huggingface Hub hosts a large number of datasets, which can be converted into Starwhale datasets with a single line of code or a single command.
+
+:::tip
+Huggingface Datasets conversion depends on the [datasets](https://pypi.org/project/datasets/) library.
+:::
+
+Command line:
+
+```console
+swcli dataset build -hf lambdalabs/pokemon-blip-captions --name pokemon
+```
+
+Python SDK:
+
+```python
+from starwhale import Dataset
+
+# You only specify starwhale dataset expected name and huggingface repo name
+# example: https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions
+ds = Dataset.from_huggingface("pokemon", "lambdalabs/pokemon-blip-captions")
+print(ds)
+print(len(ds))
+print(repr(ds.fetch_one()))
+```
+
+## Building by writing a Python script with the Starwhale SDK
+
+## Building with swcli dataset build and a Python handler
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/integration.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/integration.md
new file mode 100644
index 000000000..e0a97604b
--- /dev/null
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/integration.md
@@ -0,0 +1,178 @@
+---
+title: Integrating Datasets with Other ML Libraries
+---
+
+Starwhale datasets integrate well with popular ML libraries such as Pillow, Numpy, Huggingface Datasets, Pytorch, and Tensorflow, making data conversion convenient.
+
+## Pillow
+
+The [Starwhale Image](../reference/sdk/type#image) type converts to and from the [Pillow Image](https://pillow.readthedocs.io/en/stable/reference/Image.html) object in both directions.
+
+### Converting a Starwhale Image to a Pillow Image
+
+```python
+from starwhale import dataset
+
+# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk
+# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files
+ds = dataset("https://cloud.starwhale.cn/project/starwhale:object-detection/dataset/coco128/v2")
+img = ds.head(n=1)[0].features.image
+
+pil = img.to_pil()
+print(pil)
+print(pil.size)
+```
+
+```console
+
+(640, 480)
+```
+
+### Initializing a Starwhale Image from a Pillow Image
+
+```python
+import numpy
+from PIL import Image as PILImage
+from starwhale import Image
+
+# generate a random image
+random_array = numpy.random.randint(low=0, high=256, size=(100, 100, 3), dtype=numpy.uint8)
+pil = PILImage.fromarray(random_array, mode="RGB")
+
+img = Image(pil)
+print(img)
+```
+
+```console
+ArtifactType.Image, display:, mime_type:MIMEType.UNDEFINED, shape:[None, None, 3], encoding: 
+```
+
+## Numpy
+
+### Converting to numpy.ndarray
+
+The following Starwhale data types can be converted into numpy.ndarray objects:
+
+* Image: converted to a Pillow Image first, then to a numpy.ndarray object.
+* Video: the video bytes are converted to a numpy.ndarray object directly.
+* Audio: the audio bytes are converted to a numpy.ndarray object with the soundfile library.
+* BoundingBox: converted to a numpy.ndarray object in xywh format.
+* Binary: the bytes are converted to a numpy.ndarray object directly.
+
+```python
+from starwhale import dataset
+
+# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk
+# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files
+ds = dataset("https://cloud.starwhale.cn/project/starwhale:object-detection/dataset/coco128/v2")
+
+item = ds.head(n=1)[0]
+
+img = item.features.image
+img_array = img.to_numpy()
+print(img_array)
+print(img_array.shape)
+
+bbox = item.features.annotations[0]["bbox"]
+print(bbox)
+print(bbox.to_numpy())
+```
+
+```console
+
+(480, 640, 3)
+BoundingBox[XYWH]- x:1.0799999999999699, y:187.69008, width:611.5897600000001, height:285.84000000000003
+array([ 1.08 , 187.69008, 611.58976, 285.84 ])
+```
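+
+For the simpler types the conversion is direct. A minimal sketch for Binary, assuming it exposes the same `to_numpy()` method used by Image and BoundingBox above; the payload bytes are invented for illustration:
+
+```python
+from starwhale import Binary
+
+# assumption: Binary offers to_numpy() like the other types above; the bytes value is made up
+b = Binary(b"starwhale")
+print(b.to_numpy())  # the raw bytes as a numpy array, per the list above
+```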
+### Initializing a Starwhale Image from numpy.ndarray
+
+When an image is represented as a numpy.ndarray object, it can be used to initialize a Starwhale Image object.
+
+```python
+import numpy
+from starwhale import Image
+
+# generate a random image numpy.ndarray
+random_array = numpy.random.randint(low=0, high=256, size=(100, 100, 3), dtype=numpy.uint8)
+img = Image(random_array)
+print(img)
+```
+
+```console
+ArtifactType.Image, display:, mime_type:MIMEType.UNDEFINED, shape:[None, None, 3], encoding: 
+```
+
+## Huggingface Datasets
+
+The Huggingface Hub hosts a large number of datasets, which can be converted into Starwhale datasets with a single line of code.
+
+:::tip
+Huggingface Datasets conversion depends on the [datasets](https://pypi.org/project/datasets/) library.
+:::
+
+```python
+from starwhale import Dataset
+
+# You only specify starwhale dataset expected name and huggingface repo name
+# example: https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions
+ds = Dataset.from_huggingface("pokemon", "lambdalabs/pokemon-blip-captions")
+print(ds)
+print(len(ds))
+print(repr(ds.fetch_one()))
+```
+
+```console
+🌊 creating dataset local/project/self/dataset/pokemon/version/r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise...
+🦋 update 833 records into dataset
+Dataset: pokemon, stash version: r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise, loading version: r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise
+833
+index:default/train/0, features:{'image': ArtifactType.Image, display:, mime_type:MIMEType.JPEG, shape:[1280, 1280, 3], encoding: , 'text': 'a drawing of a green pokemon with red eyes', '_hf_subset': 'default', '_hf_split': 'train'}, shadow dataset: None
+```
+
+## Pytorch
+
+A Starwhale Dataset can be converted into a Pytorch [torch.utils.data.IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) object and accepts a transform function. The converted Pytorch dataset object can then be passed to a Pytorch dataloader, a Huggingface Trainer, and so on.
+
+```python
+from starwhale import dataset
+import torch.utils.data as tdata
+
+def custom_transform(data):
+    data["label"] = data["label"] + 100
+    return data
+
+with dataset("simple", create="empty") as ds:
+    for i in range(0, 10):
+        ds[i] = {"text": f"{i}-text", "label": i}
+    ds.commit()
+
+    torch_ds = ds.to_pytorch(transform=custom_transform)
+    torch_loader = tdata.DataLoader(torch_ds, batch_size=1)
+    item = next(iter(torch_loader))
+    print(item)
+    print(item["label"])
+```
+
+```console
+{'text': ['0-text'], 'label': tensor([100])}
+tensor([100])
+```
+
+## Tensorflow
+
+A Starwhale Dataset can be converted into a Tensorflow [tensorflow.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) object; a transform function is also supported for modifying the data.
+
+```python
+from starwhale import dataset
+
+# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk
+# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files
+ds = dataset("https://cloud.starwhale.cn/project/starwhale:helloworld/dataset/mnist64")
+tf_ds = ds.to_tensorflow()
+print(tf_ds)
+```
+
+```console
+<_FlatMapDataset element_spec={'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'img': TensorSpec(shape=(8, 8, 1), dtype=tf.uint8, name=None)}>
+```
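+
+Since the result is a regular tf.data.Dataset, the usual tf.data operations apply. A minimal sketch continuing from the `tf_ds` object above; the batch size is illustrative only, and only standard tf.data calls are used:
+
+```python
+# batch the converted dataset and inspect one batch; shapes follow the element_spec printed above
+for batch in tf_ds.batch(2).take(1):
+    print(batch["img"].shape)    # expected (2, 8, 8, 1) for the mnist64 example
+    print(batch["label"].shape)  # expected (2,)
+```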
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md
new file mode 100644
index 000000000..bcbc362c6
--- /dev/null
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md
@@ -0,0 +1,3 @@
+---
+title: Dataset Loading
+---
\ No newline at end of file
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md
new file mode 100644
index 000000000..b76f9028e
--- /dev/null
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md
@@ -0,0 +1,3 @@
+---
+title: Dataset Versioning
+---
\ No newline at end of file
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/view.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/view.md
new file mode 100644
index 000000000..de090010a
--- /dev/null
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/view.md
@@ -0,0 +1,3 @@
+---
+title: Dataset Visualization
+---
\ No newline at end of file
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sdk/type.md b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sdk/type.md
index 3291e9598..7506bfe4e 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sdk/type.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sdk/type.md
@@ -158,7 +158,7 @@ ClassLabel(
 )
 ```
 
-## Image
+## Image {#image}
 
 The image type.
 
@@ -175,7 +175,7 @@ Image(
 
 |Parameter|Description|
 |---|---|
-|`fp`|the path, IO object, or bytes of the image file content|
+|`fp`|the path, IO object, numpy object, Pillow Image object, or bytes of the image file content|
 |`display_name`|the name shown in the Dataset Viewer|
 |`shape`|the Width, Height, and channel of the image|
 |`mime_type`|a type supported by MIMEType|
diff --git a/sidebars.js b/sidebars.js
index 1fa49d9eb..32d1ca1f9 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -157,7 +157,12 @@ module.exports = {
       },
       collapsed: true,
      items: [
-        "dataset/yaml"
+        "dataset/yaml",
+        "dataset/build",
+        "dataset/load",
+        "dataset/view",
+        "dataset/version",
+        "dataset/integration"
       ]
     },
     {
@@ -219,7 +224,6 @@ module.exports = {
         "reference/sdk/evaluation",
         "reference/sdk/model",
         "reference/sdk/job",
-        "reference/swcli/server",
         "reference/sdk/other",
       ]
     }