Commit
1 parent 873065a, commit a5a6874
Showing 12 changed files with 401 additions and 4 deletions.
206 changes: 206 additions & 0 deletions
i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,206 @@ | ||
--- | ||
title: Building Datasets | ||
--- | ||
|
||
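Starwhale offers several flexible ways to build a dataset: directly from image, audio, video, csv, json, or jsonl files, with a Python script, or by importing a dataset from the Huggingface Hub. | ||
|
||
## Build directly from data files | ||
|
||
### Images | ||
|
||
Starwhale can recursively traverse a directory of image files and build a Starwhale dataset from them without any code: | ||
|
||
- Supported image formats: `png/jpg/jpeg/webp/svg/apng`. | ||
- Images are converted to the Starwhale.Image type and can be viewed in the Starwhale Server web UI. | ||
- Two build methods are supported: the `swcli dataset build --image` command and the Python SDK `starwhale.Dataset.from_folder`. | ||
- Label mechanism: when the SDK sets `auto_label=True` or the command line sets `--auto-label`, the name of the parent directory is used as the `label`. | ||
- Metadata mechanism: place a `metadata.csv` or `metadata.jsonl` file in the root directory to add extra columns to the dataset. | ||
- Caption mechanism: when a `{image-name}.txt` file is found in the same directory as an image, its content is imported automatically into the `caption` column. | ||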
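
The label and caption conventions above can be sketched with the standard library alone. This only illustrates the naming rules and is not Starwhale's implementation: the helper `describe_image` is hypothetical, and it assumes that `{image-name}.txt` means a text file sharing the image's stem (for example `1.txt` next to `1.png`).

```python
import tempfile
from pathlib import Path


def describe_image(root: Path, image: Path) -> dict:
    """Hypothetical helper: derive the columns described above for one image."""
    record = {"file_name": image.relative_to(root).as_posix()}
    # Label convention: the parent directory name becomes the label.
    if image.parent != root:
        record["label"] = image.parent.name
    # Caption convention (assumed): a sibling .txt file with the same stem.
    caption_file = image.with_suffix(".txt")
    if caption_file.exists():
        record["caption"] = caption_file.read_text(encoding="utf-8").strip()
    return record


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "folder"
    (root / "dog").mkdir(parents=True)
    (root / "dog" / "1.png").write_bytes(b"")  # placeholder, not a real PNG
    (root / "dog" / "1.txt").write_text("a brown dog", encoding="utf-8")
    record = describe_image(root, root / "dog" / "1.png")
    print(record)
```

The record mirrors the `file_name`, `label`, and `caption` columns named in the list; the real build also stores the image itself in a `file` column.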
|
||
Suppose the `folder` directory contains the following four files: | ||
|
||
```console | ||
folder/dog/1.png | ||
folder/cat/2.png | ||
folder/dog/3.png | ||
folder/cat/4.png | ||
``` | ||
|
||
Build with the command line: | ||
|
||
```console | ||
❯ swcli dataset build --image folder --name image-folder | ||
🚧 start to build dataset bundle... | ||
👷 uri local/project/self/dataset/image-folder/version/latest | ||
🌊 creating dataset local/project/self/dataset/image-folder/version/uw6mdisnf7alg4t4fs2myfug4ie4636w3x4jqcu2... | ||
🦋 update 4 records into dataset | ||
🌺 congratulation! you can run swcli dataset info image-folder/version/uw6mdisnf7al | ||
``` | ||
|
||
```console | ||
❯ swcli dataset head image-folder -n 2 | ||
row ─────────────────────────────────────── | ||
🌳 id: cat/2.png | ||
🌀 features: | ||
🔅 file_name : cat/2.png | ||
🔅 label : cat | ||
🔅 file : ArtifactType.Image, display:2.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding: | ||
row ─────────────────────────────────────── | ||
🌳 id: cat/4.png | ||
🌀 features: | ||
🔅 file_name : cat/4.png | ||
🔅 label : cat | ||
🔅 file : ArtifactType.Image, display:4.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding: | ||
``` | ||
|
||
|
||
Build with the Python SDK: | ||
|
||
```python | ||
from starwhale import Dataset | ||
ds = Dataset.from_folder("folder", kind="image") | ||
print(ds) | ||
print(ds.fetch_one().features) | ||
``` | ||
|
||
```console | ||
🌊 creating dataset local/project/self/dataset/folder/version/nyc2ay4gnyayv4zqalovdgakl3k2esvrne42cjna... | ||
🦋 update 4 records into dataset | ||
Dataset: folder, stash version: d22hdiwyakdfh5xitcpn2s32gblfbhrerzczkb63, loading version: nyc2ay4gnyayv4zqalovdgakl3k2esvrne42cjna | ||
{'file_name': 'cat/2.png', 'label': 'cat', 'file': ArtifactType.Image, display:2.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding: } | ||
``` | ||
|
||
|
||
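### Videos | ||
|
||
Starwhale can recursively traverse a directory of video files and build a Starwhale dataset from them without any code: | ||
|
||
- Supported video formats: `mp4/webm/avi`. | ||
- Videos are converted to the Starwhale.Video type and can be viewed in the Starwhale Server web UI. | ||
- Two build methods are supported: the `swcli dataset build --video` command and the Python SDK `starwhale.Dataset.from_folder`. | ||
- The label, caption, and metadata mechanisms work the same way as for images. | ||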
|
||
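### Audio | ||
|
||
Starwhale can recursively traverse a directory of audio files and build a Starwhale dataset from them without any code: | ||
|
||
- Supported audio formats: `mp3/wav`. | ||
- Audio files are converted to the Starwhale.Audio type and can be viewed in the Starwhale Server web UI. | ||
- Two build methods are supported: the `swcli dataset build --audio` command and the Python SDK `starwhale.Dataset.from_folder`. | ||
- The label, caption, and metadata mechanisms work the same way as for images. | ||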
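
Before building from a mixed directory, it can help to check which files fall under which kind. The sketch below classifies paths using only the extension lists given in this section. The helper `guess_kind` is hypothetical, and lowercasing the extension is an assumption; the source does not say whether Starwhale matches extensions case-insensitively.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the format lists in this section.
KIND_BY_EXTENSION = {
    **dict.fromkeys(("png", "jpg", "jpeg", "webp", "svg", "apng"), "image"),
    **dict.fromkeys(("mp4", "webm", "avi"), "video"),
    **dict.fromkeys(("mp3", "wav"), "audio"),
}


def guess_kind(path: str) -> Optional[str]:
    """Hypothetical helper: return "image", "video", "audio", or None."""
    suffix = Path(path).suffix.lower().lstrip(".")
    return KIND_BY_EXTENSION.get(suffix)


for name in ("folder/dog/1.png", "clips/intro.MP4", "speech/a.wav", "notes.txt"):
    print(name, "->", guess_kind(name))
```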
|
||
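### csv files | ||
|
||
Local or remote csv files can be converted directly into a Starwhale dataset with the command line or the Python SDK: | ||
|
||
- One or more local csv files are supported. | ||
- A local directory can be searched recursively for csv files. | ||
- One or more remote csv files specified by http url are supported. | ||
|
||
Build with the command line: | ||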
|
||
```console | ||
❯ swcli dataset build --name product-desc-modelscope --csv https://modelscope.cn/api/v1/datasets/lcl193798/product_description_generation/repo\?Revision\=master\&FilePath\=test.csv --encoding=utf-8-sig | ||
🚧 start to build dataset bundle... | ||
👷 uri local/project/self/dataset/product-desc-modelscope/version/latest | ||
🌊 creating dataset local/project/self/dataset/product-desc-modelscope/version/wzaz4ccodpyj4jelgupljreyida2bleg5xp7viwe... | ||
🦋 update 3848 records into dataset | ||
🌺 congratulation! dataset build from csv files(('https://modelscope.cn/api/v1/datasets/lcl193798/product_description_generation/repo?Revision=master&FilePath=test.csv',)) has been built. You can run swcli dataset info product-desc-modelscope/version/wzaz4ccodpyj | ||
``` | ||
|
||
Build with the Python SDK: | ||
|
||
```python | ||
from starwhale import Dataset | ||
ds = Dataset.from_csv(path="http://example.com/data.csv", name="my-csv-dataset") | ||
``` | ||
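
The command-line example above passes `--encoding=utf-8-sig`. The source does not explain why, but a common reason is that the csv file starts with a UTF-8 byte order mark (BOM). The sketch below uses only Python's `csv` module, not Starwhale, to show what the two encodings do to the first column name; the sample data and the `read_rows` helper are made up for illustration.

```python
import csv
import io

# Bytes of a small CSV file that starts with a UTF-8 byte order mark (BOM).
raw = "\ufeffname,price\nwidget,3\n".encode("utf-8")


def read_rows(data: bytes, encoding: str) -> list:
    """Decode CSV bytes and return one dict per data row."""
    text = io.StringIO(data.decode(encoding), newline="")
    return list(csv.DictReader(text))


print(read_rows(raw, "utf-8"))      # the first header keeps the BOM character
print(read_rows(raw, "utf-8-sig"))  # the BOM is stripped from the header
```

With plain `utf-8`, the first column would be named `\ufeffname` rather than `name`, which is easy to miss when the column is later accessed by name.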
|
||
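### json/jsonl files | ||
|
||
Local or remote json/jsonl files can be converted directly into a Starwhale dataset with the command line or the Python SDK: | ||
|
||
- One or more local json/jsonl files are supported. | ||
- A local directory can be searched recursively for json/jsonl files. | ||
- One or more remote json/jsonl files specified by http url are supported. | ||
|
||
For json files: | ||
|
||
- By default, the parsed json object is expected to be a list whose elements are dicts; each dict maps to one row of the Starwhale dataset. | ||
- Use the `--field-selector` option or the `field_selector` argument to locate a specific list inside the document. | ||
|
||
For example, given the following json file: | ||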
|
||
```json | ||
{ | ||
  "p1": { | ||
    "p2": { | ||
      "p3": [ | ||
        {"a": 1, "b": 2}, | ||
        {"a": 10, "b": 20} | ||
      ] | ||
    } | ||
  } | ||
} | ||
``` | ||
|
||
In this case, set `--field-selector=p1.p2.p3` to add exactly these two rows to the dataset. | ||
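
A dotted selector such as `p1.p2.p3` can be read as a path through nested dicts. The sketch below shows that reading with the standard library; `select_field` is a hypothetical helper written for illustration, not the function Starwhale uses internally.

```python
import json
from typing import Any


def select_field(document: Any, selector: str) -> Any:
    """Hypothetical helper: follow a dotted selector through nested dicts."""
    current = document
    for key in selector.split("."):
        current = current[key]
    return current


doc = json.loads('{"p1": {"p2": {"p3": [{"a": 1, "b": 2}, {"a": 10, "b": 20}]}}}')
rows = select_field(doc, "p1.p2.p3")
print(len(rows))
print(rows[1])
```

Each dict in the selected list then corresponds to one dataset row.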
|
||
Build with the command line: | ||
|
||
```console | ||
❯ swcli dataset build --json https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo\?Revision\=master\&FilePath\=train.jsonl | ||
🚧 start to build dataset bundle... | ||
👷 uri local/project/self/dataset/json-b0o2zsvg/version/latest | ||
🌊 creating dataset local/project/self/dataset/json-b0o2zsvg/version/q3uoziwqligxdggncqywpund75jz55h3bne6a5la... | ||
🦋 update 906 records into dataset | ||
🌺 congratulation! dataset build from ('https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo?Revision=master&FilePath=train.jsonl',) has been built. You can run swcli dataset info json-b0o2zsvg/version/q3uoziwqligx | ||
``` | ||
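
A jsonl file holds one JSON value per line, so a natural reading of the build above is one dataset row per line. The sketch below parses jsonl text that way with the standard library. The sample content, the field names, and the `iter_jsonl` helper are made up for illustration, and the one-row-per-line mapping is an assumption about Starwhale's behavior.

```python
import json
from typing import Dict, Iterator

# Two made-up records; real files would be read from disk or a url.
jsonl_text = '{"prompt": "q1", "answer": "a1"}\n{"prompt": "q2", "answer": "a2"}\n'


def iter_jsonl(text: str) -> Iterator[Dict]:
    """Yield one parsed object per non-empty line."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


records = list(iter_jsonl(jsonl_text))
print(len(records))
print(records[0])
```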
|
||
Build with the Python SDK: | ||
|
||
```python | ||
from starwhale import Dataset | ||
myds = Dataset.from_json( | ||
    name="translation", | ||
    text='{"content":{"child_content":[{"en":"hello","zh-cn":"你好"},{"en":"how are you","zh-cn":"最近怎么样"}]}}', | ||
    field_selector="content.child_content" | ||
) | ||
print(myds[0].features["zh-cn"]) | ||
``` | ||
|
||
```console | ||
🌊 creating dataset local/project/self/dataset/translation/version/kblfn5zh4cpoqxqbhgdfbvonulr2zefp6lojq44y... | ||
🦋 update 2 records into dataset | ||
你好 | ||
``` | ||
|
||
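## Build from the Huggingface Datasets Hub | ||
|
||
The Huggingface Hub hosts a large number of datasets, which can be converted into Starwhale datasets with a single line of code or a single command. | ||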
|
||
:::tip | ||
Converting Huggingface datasets depends on the [datasets](https://pypi.org/project/datasets/) library. | ||
::: | ||
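
Since the conversion depends on an optional package, a script can check for it up front instead of failing partway through. This is a minimal sketch using the standard library; the helper name `has_module` and the printed hint are illustrative.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(name) is not None


if not has_module("datasets"):
    print("Install the dependency first: python -m pip install datasets")
```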
|
||
Command line: | ||
|
||
```console | ||
swcli dataset build -hf lambdalabs/pokemon-blip-captions --name pokemon | ||
``` | ||
|
||
Python SDK: | ||
|
||
```python | ||
from starwhale import Dataset | ||
|
||
# You only need to specify the expected Starwhale dataset name and the Huggingface repo name | ||
# example: https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions | ||
ds = Dataset.from_huggingface("pokemon", "lambdalabs/pokemon-blip-captions") | ||
print(ds) | ||
print(len(ds)) | ||
print(repr(ds.fetch_one())) | ||
``` | ||
|
||
## Build with a Python script using the Starwhale SDK | ||
|
||
## Build with swcli dataset build + Python Handler |
178 changes: 178 additions & 0 deletions
i18n/zh/docusaurus-plugin-content-docs/current/dataset/integration.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,178 @@ | ||
--- | ||
title: Integrating Datasets with Other ML Libraries | ||
--- | ||
|
||
Starwhale datasets integrate well with popular ML libraries such as Pillow, Numpy, Huggingface Datasets, Pytorch, and Tensorflow, which makes data conversion convenient. | ||
|
||
## Pillow | ||
|
||
The [Starwhale Image](../reference/sdk/type#image) type can be converted to and from a [Pillow Image](https://pillow.readthedocs.io/en/stable/reference/Image.html) object. | ||
|
||
### Convert a Starwhale Image to a Pillow Image | ||
|
||
```python | ||
from starwhale import dataset | ||
|
||
# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk | ||
# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files | ||
ds = dataset("https://cloud.starwhale.cn/project/starwhale:object-detection/dataset/coco128/v2") | ||
img = ds.head(n=1)[0].features.image | ||
|
||
pil = img.to_pil() | ||
print(pil) | ||
print(pil.size) | ||
``` | ||
|
||
```console | ||
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x480 at 0x7F77FBA98250> | ||
(640, 480) | ||
``` | ||
|
||
### Initialize a Starwhale Image from a Pillow Image | ||
|
||
```python | ||
import numpy | ||
from PIL import Image as PILImage | ||
from starwhale import Image | ||
|
||
# generate a random image | ||
random_array = numpy.random.randint(low=0, high=256, size=(100, 100, 3), dtype=numpy.uint8) | ||
pil = PILImage.fromarray(random_array, mode="RGB") | ||
|
||
img = Image(pil) | ||
print(img) | ||
``` | ||
|
||
```console | ||
ArtifactType.Image, display:, mime_type:MIMEType.UNDEFINED, shape:[None, None, 3], encoding: | ||
``` | ||
|
||
## Numpy | ||
|
||
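### Convert to numpy.ndarray | ||
|
||
The following Starwhale data types can be converted to numpy.ndarray objects: | ||
|
||
* Image: converted to a Pillow Image first, then to a numpy.ndarray object. | ||
* Video: the video bytes are converted directly to a numpy.ndarray object. | ||
* Audio: the audio bytes are converted to a numpy.ndarray object with the soundfile library. | ||
* BoundingBox: converted to a numpy.ndarray object in xywh format. | ||
* Binary: the bytes are converted directly to a numpy.ndarray object. | ||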
|
||
```python | ||
from starwhale import dataset | ||
|
||
# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk | ||
# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files | ||
ds = dataset("https://cloud.starwhale.cn/project/starwhale:object-detection/dataset/coco128/v2") | ||
|
||
item = ds.head(n=1)[0] | ||
|
||
img = item.features.image | ||
img_array = img.to_numpy() | ||
print(type(img_array)) | ||
print(img_array.shape) | ||
|
||
bbox = item.features.annotations[0]["bbox"] | ||
print(bbox) | ||
print(bbox.to_numpy()) | ||
``` | ||
|
||
```console | ||
<class 'numpy.ndarray'> | ||
(480, 640, 3) | ||
BoundingBox[XYWH]- x:1.0799999999999699, y:187.69008, width:611.5897600000001, height:285.84000000000003 | ||
array([ 1.08 , 187.69008, 611.58976, 285.84 ]) | ||
``` | ||
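
The array printed above holds the box as `[x, y, width, height]`. Many tools expect corner coordinates instead, and the conversion is simple arithmetic. The sketch below is plain Python and assumes that `(x, y)` is the top-left corner of the box; `xywh_to_xyxy` is a hypothetical helper, not part of the Starwhale SDK.

```python
from typing import Tuple


def xywh_to_xyxy(x: float, y: float, width: float, height: float) -> Tuple[float, float, float, float]:
    """Convert (x, y, width, height) to (x_min, y_min, x_max, y_max)."""
    return (x, y, x + width, y + height)


# Values taken from the bounding box printed in the output above.
corners = xywh_to_xyxy(1.08, 187.69008, 611.58976, 285.84)
print(corners)
```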
|
||
### Initialize a Starwhale Image from a numpy.ndarray | ||
|
||
An image represented as a numpy.ndarray object can be used to initialize a Starwhale Image object. | ||
|
||
```python | ||
import numpy | ||
from starwhale import Image | ||
|
||
# generate a random image numpy.ndarray | ||
random_array = numpy.random.randint(low=0, high=256, size=(100, 100, 3), dtype=numpy.uint8) | ||
img = Image(random_array) | ||
print(img) | ||
``` | ||
|
||
```console | ||
ArtifactType.Image, display:, mime_type:MIMEType.UNDEFINED, shape:[None, None, 3], encoding: | ||
``` | ||
|
||
## Huggingface Datasets | ||
|
||
The Huggingface Hub hosts a large number of datasets, which can be converted into Starwhale datasets with a single line of code. | ||
|
||
:::tip | ||
Converting Huggingface datasets depends on the [datasets](https://pypi.org/project/datasets/) library. | ||
::: | ||
|
||
```python | ||
from starwhale import Dataset | ||
|
||
# You only need to specify the expected Starwhale dataset name and the Huggingface repo name | ||
# example: https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions | ||
ds = Dataset.from_huggingface("pokemon", "lambdalabs/pokemon-blip-captions") | ||
print(ds) | ||
print(len(ds)) | ||
print(repr(ds.fetch_one())) | ||
``` | ||
|
||
```console | ||
🌊 creating dataset local/project/self/dataset/pokemon/version/r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise... | ||
🦋 update 833 records into dataset | ||
Dataset: pokemon, stash version: r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise, loading version: r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise | ||
833 | ||
index:default/train/0, features:{'image': ArtifactType.Image, display:, mime_type:MIMEType.JPEG, shape:[1280, 1280, 3], encoding: , 'text': 'a drawing of a green pokemon with red eyes', '_hf_subset': 'default', '_hf_split': 'train'}, shadow dataset: None | ||
``` | ||
|
||
## Pytorch | ||
|
||
A Starwhale Dataset can be converted to a Pytorch [torch.utils.data.IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) object and accepts a transform function. The converted Pytorch dataset object can then be passed to a Pytorch dataloader, a Huggingface Trainer, and so on. | ||
|
||
```python | ||
from starwhale import dataset | ||
import torch.utils.data as tdata | ||
|
||
def custom_transform(data): | ||
    data["label"] = data["label"] + 100 | ||
    return data | ||
|
||
with dataset("simple", create="empty") as ds: | ||
    for i in range(0, 10): | ||
        ds[i] = {"text": f"{i}-text", "label": i} | ||
    ds.commit() | ||
|
||
torch_ds = ds.to_pytorch(transform=custom_transform) | ||
torch_loader = tdata.DataLoader(torch_ds, batch_size=1) | ||
item = next(iter(torch_loader)) | ||
print(item) | ||
print(item["label"]) | ||
``` | ||
|
||
```console | ||
{'text': ['0-text'], 'label': tensor([100])} | ||
tensor([100]) | ||
``` | ||
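
Conceptually, the transform is applied to each row as the iterable dataset is consumed, and the loader groups the results into batches. The sketch below imitates that flow in plain Python without torch. It is not how `to_pytorch` or `DataLoader` are implemented, `iter_batches` is a hypothetical helper, and a real `DataLoader` would also collate each batch into tensors.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def iter_batches(
    rows: Iterable[Dict], transform: Callable[[Dict], Dict], batch_size: int
) -> Iterator[List[Dict]]:
    """Apply transform to each row lazily and yield lists of batch_size rows."""
    batch: List[Dict] = []
    for row in rows:
        batch.append(transform(dict(row)))  # copy so the source row is untouched
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


def custom_transform(data: Dict) -> Dict:
    data["label"] = data["label"] + 100
    return data


rows = [{"text": f"{i}-text", "label": i} for i in range(10)]
first = next(iter_batches(rows, custom_transform, batch_size=1))
print(first)
```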
|
||
## Tensorflow | ||
|
||
A Starwhale Dataset can be converted to a Tensorflow [tensorflow.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) object, and a transform function is also supported for transforming the data. | ||
|
||
```python | ||
from starwhale import dataset | ||
|
||
# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk | ||
# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files | ||
ds = dataset("https://cloud.starwhale.cn/project/starwhale:helloworld/dataset/mnist64") | ||
tf_ds = ds.to_tensorflow() | ||
print(tf_ds) | ||
``` | ||
|
||
```console | ||
<_FlatMapDataset element_spec={'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'img': TensorSpec(shape=(8, 8, 1), dtype=tf.uint8, name=None)}> | ||
``` |
3 changes: 3 additions & 0 deletions
i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
--- | ||
title: Loading Datasets | ||
--- |
3 changes: 3 additions & 0 deletions
i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
--- | ||
title: Dataset Version Control | ||
--- |