Add image documentation #238

Merged Oct 24, 2024 (62 commits; changes shown from 56 commits)
Commits (all by ryantwolf):

- 2d6ad87: Add partial image implementation (Aug 19, 2024)
- 4b32c1e: Refactor requirements (Aug 19, 2024)
- 601bf5c: Fix bugs (Aug 19, 2024)
- 6c8ecd6: Change from_map to map_partitions (Aug 19, 2024)
- 0856a65: Add super constructor (Aug 19, 2024)
- 4dbef42: Add kwargs for load_object_on_worker (Aug 19, 2024)
- abb6b13: Get proper epoch size (Aug 19, 2024)
- 61eb1e3: Complete embedding creation loop (Aug 19, 2024)
- f8d692f: Change devices (Aug 19, 2024)
- 5752562: Add device (Aug 19, 2024)
- 4a4b356: Refactor embedding creation and add classifier (Aug 27, 2024)
- bfde960: Fix bugs in classifiers (Aug 27, 2024)
- e421a36: Refactor model names (Aug 27, 2024)
- b09892e: Add model name (Aug 27, 2024)
- 8d43f9a: Fix classifier bugs (Aug 27, 2024)
- 49a21ef: Allow postprocessing for classifiers (Aug 27, 2024)
- edf2905: Fix name and add print (Aug 27, 2024)
- eaef49a: Fix variable name (Aug 27, 2024)
- 7ba7c34: Add NSFW (Aug 28, 2024)
- c1c1b1a: Update init for import (Aug 28, 2024)
- 88032c8: Fix embedding size (Aug 28, 2024)
- b4c5cd5: Add fused classifiers (Aug 28, 2024)
- 8d93913: Fix missing index (Aug 28, 2024)
- 873b410: Update metdata for fused classifiers (Aug 28, 2024)
- c73e292: Add export to webdataset (Sep 4, 2024)
- 361e0d1: Fix missing id col (Sep 4, 2024)
- ce91626: Sort embeddings by id (Sep 4, 2024)
- d338943: Add timm (Sep 5, 2024)
- fc5fefb: Update init file (Sep 5, 2024)
- 29eb2ba: Add autocast to timm (Sep 5, 2024)
- 09ed9d6: Update requirements and transform (Sep 5, 2024)
- 2a6b510: Add additional interpolation support (Sep 5, 2024)
- b6bda19: Fix transform normalization (Sep 5, 2024)
- d57462e: Remove open_clip (Sep 5, 2024)
- bacd1c0: Add index path support to wds (Sep 6, 2024)
- 35ae97c: Merge branch 'main' into rywolf/images (Sep 6, 2024)
- 8e66c8f: Address Vibhu's feedback (Sep 6, 2024)
- 946053e: Add import guard for image dataset (Sep 6, 2024)
- 015d40c: Change default device (Sep 7, 2024)
- 852863d: Remove commented code (Sep 7, 2024)
- e7e320f: Remove device id (Sep 8, 2024)
- 92f47a0: Fix index issue (Sep 9, 2024)
- 37ee892: Merge branch 'main' into rywolf/images (Sep 9, 2024)
- 0eca48f: Add docstrings and standardize variable names (Sep 9, 2024)
- 1e95d91: Merge branch 'main' into rywolf/image-docs (Sep 10, 2024)
- 59763a1: Add image curation tutorial (Sep 10, 2024)
- 40e1549: Add initial image docs (Sep 16, 2024)
- 0d857b4: Remove tutorial (Sep 17, 2024)
- b4e474a: Add dataset docs (Sep 19, 2024)
- 91a083d: Merge branch 'main' into rywolf/image-docs (Sep 19, 2024)
- 4b1f008: Add embedder documentation (Sep 19, 2024)
- e350ab0: Revert embedding column name change (Oct 15, 2024)
- 31dbfde: Update user guide for images (Oct 18, 2024)
- 30e004a: Update README (Oct 18, 2024)
- 9c81b6e: Update README with RAPIDS nightly instructions (Oct 18, 2024)
- f9f47ed: Fix formatting issues in image documentation (Oct 18, 2024)
- c090ab6: Remove extra newline in README (Oct 18, 2024)
- e91e70d: Address most of Sarah's feedback (Oct 22, 2024)
- 9746a66: Add section summary (Oct 22, 2024)
- 2a069b6: Fix errors and REWORD GPU bullets in README (Oct 24, 2024)
- 26479ce: Merge branch 'main' into rywolf/image-docs (Oct 24, 2024)
- 34318e7: Fix how table of contents displays with new sections (Oct 24, 2024)
147 changes: 60 additions & 87 deletions README.md
@@ -9,51 +9,42 @@
</div>

# NeMo Curator
🚀 **The GPU-Accelerated Open Source Framework for Efficient Large Language Model Data Curation** 🚀
🚀 **The GPU-Accelerated Open Source Framework for Efficient Generative AI Model Data Curation** 🚀

<p align="center">
<img src="./docs/user-guide/images/diagram.png" alt="diagram"/>
</p>

NeMo Curator is a Python library specifically designed for fast and scalable dataset preparation and curation for [large language model (LLM)](https://www.nvidia.com/en-us/glossary/large-language-models/) use-cases such as foundation model pretraining, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT). It greatly accelerates data curation by leveraging GPUs with [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids), resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model convergence through the preparation of high-quality tokens.

At the core of NeMo Curator is the `DocumentDataset`, which serves as the main dataset class. It acts as a straightforward wrapper around a Dask `DataFrame`. The Python library offers easy-to-use methods for expanding the functionality of your curation pipeline while eliminating scalability concerns.
NeMo Curator is a Python library specifically designed for fast and scalable dataset preparation and curation for generative AI use-cases such as foundation language model pretraining, text-to-image model training, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT). It greatly accelerates data curation by leveraging GPUs with [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids), resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model convergence through the preparation of high-quality tokens.
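The `DocumentDataset` wrapper mentioned above is, in spirit, a thin shell around a DataFrame so that curation steps compose cleanly. Here is a toy pure-pandas sketch of that idea; the class and method names are illustrative, not NeMo Curator's actual API:

```python
import pandas as pd

class ToyDocumentDataset:
    """Illustrative stand-in for a DocumentDataset-style wrapper:
    it holds a DataFrame and lets curation steps return new wrappers."""

    def __init__(self, df: pd.DataFrame):
        self.df = df

    def filter_documents(self, predicate):
        # Keep only rows whose "text" column satisfies the predicate.
        mask = self.df["text"].map(predicate)
        return ToyDocumentDataset(self.df[mask])

docs = ToyDocumentDataset(pd.DataFrame({"text": ["a real document", ""]}))
kept = docs.filter_documents(lambda text: len(text) > 0)
print(len(kept.df))  # 1
```

In the real library, the wrapped object is a Dask `DataFrame` rather than a pandas one, which is what lets the same compositional style scale across workers and GPUs.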

## Key Features

NeMo Curator provides a collection of scalable data-mining modules. Some of the key features include:

- [Data download and text extraction](docs/user-guide/download.rst)
NeMo Curator provides a collection of scalable data curation modules for text and image curation.

- Default implementations for downloading and extracting Common Crawl, Wikipedia, and ArXiv data
- Easily customize the download and extraction process and extend to other datasets
### Text
All of our text pipelines have great multilingual support.

- [Language identification and separation](docs/user-guide/languageidentificationunicodeformatting.rst) with [fastText](https://fasttext.cc/docs/en/language-identification.html) and [pycld2](https://pypi.org/project/pycld2/)
- [Download and Extraction](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/download.html)
- Common Crawl, Wikipedia, and ArXiv sources
- Easily customize and extend to other sources

- [Text reformatting and cleaning](docs/user-guide/languageidentificationunicodeformatting.rst) to fix unicode decoding errors via [ftfy](https://ftfy.readthedocs.io/en/latest/)

- [Quality filtering](docs/user-guide/qualityfiltering.rst)

- Multilingual heuristic-based filtering
- [Language Identification](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentificationunicodeformatting.html)
- [Unicode Fixing](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentificationunicodeformatting.html)
- [Heuristic Filtering](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/qualityfiltering.html)
- Classifier-based filtering via [fastText](https://fasttext.cc/)

- [Document-level deduplication](docs/user-guide/gpudeduplication.rst)

- exact and fuzzy (near-identical) deduplication are accelerated using cuDF and Dask
- For fuzzy deduplication, our implementation follows the method described in [Microsoft Turing NLG 530B](https://arxiv.org/abs/2201.11990)
- For semantic deduplication, our implementation follows the method described in [SemDeDup](https://arxiv.org/pdf/2303.09540) by Meta AI (FAIR) [facebookresearch/SemDeDup](https://github.com/facebookresearch/SemDeDup)

- [Multilingual downstream-task decontamination](docs/user-guide/taskdecontamination.rst) following the approach of [OpenAI GPT3](https://arxiv.org/pdf/2005.14165.pdf) and [Microsoft Turing NLG 530B](https://arxiv.org/abs/2201.11990)

- [Distributed data classification](docs/user-guide/distributeddataclassification.rst)

- Multi-node, multi-GPU classifier inference
- Provides sophisticated domain and quality classification
- Flexible interface for extending to your own classifier network

- [Personal identifiable information (PII) redaction](docs/user-guide/personalidentifiableinformationidentificationandremoval.rst) for removing addresses, credit card numbers, social security numbers, and more

These modules offer flexibility and permit reordering, with only a few exceptions. In addition, the [NeMo Framework Launcher](https://github.com/NVIDIA/NeMo-Megatron-Launcher) provides pre-built pipelines that can serve as a foundation for your customization use cases.
- Classifier Filtering
- [fastText]((https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/qualityfiltering.html))
Review thread (Collaborator suggested change):

Suggested change:
  - [fastText]((https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/qualityfiltering.html))
  + Quality filtering with [fastText](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/qualityfiltering.html)

Author: Same concern with "quality" as above.

Collaborator: Up to you. Although, I think the link is still broken from the extra set of parentheses here.

- GPU-based: [Domain, Quality, Safety](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html)
- **GPU Deduplication**
Review thread (Collaborator suggested change):

Suggested change:
  - **GPU Deduplication**
  + **Document-level Deduplication**

Author: I would like to keep a mention of the fact that GPUs are used in deduplication. Not sure how to best combine it. Upon rereading my original bullet though, I wonder if it sounds too much like we are deduplicating GPUs haha.

Collaborator: Yeah that makes sense. In that case maybe "GPU-Accelerated Deduplication" or "GPU-Accelerated Document Deduplication"?

- [Exact](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html)
- [Fuzzy](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html) (Minhash LSH)
- [Semantic](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/semdedup.html)
- [Downstream-task Decontamination](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/taskdecontamination.html)
- [Personal Identifiable Information (PII) Redaction](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/personalidentifiableinformationidentificationandremoval.html)

### Image

- [Embedding Creation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/image/classifiers/embedders.html)
- Classifier Filtering
- [Aesthetic](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/image/classifiers/aesthetic.html), [NSFW](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/image/classifiers/nsfw.html)
- GPU Deduplication
- [Semantic](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/semdedup.html)

## Resources

@@ -83,58 +74,51 @@ Before installing NeMo Curator, ensure that the following requirements are met:
- Volta™ or higher ([compute capability 7.0+](https://developer.nvidia.com/cuda-gpus))
- CUDA 12 (or above)

You can install NeMo-Curator
1. from PyPi
2. from source
3. get it through the [NeMo Framework container](https://github.com/NVIDIA/NeMo?tab=readme-ov-file#docker-containers).

You can get NeMo Curator in three ways:
1. PyPI
2. Source
3. NeMo Framework Container


#### From PyPi

To install the CPU-only modules:
#### PyPI

```bash
pip install cython
pip install nemo-curator
pip install --extra-index-url https://pypi.nvidia.com nemo-curator[all]
```

To install the CPU and CUDA-accelerated modules:

#### Source
```bash
git clone https://github.com/NVIDIA/NeMo-Curator.git
pip install cython
pip install --extra-index-url https://pypi.nvidia.com nemo-curator[cuda12x]
pip install ./NeMo-Curator[all]
```

#### From Source

1. Clone the NeMo Curator repository in GitHub.

```bash
git clone https://github.com/NVIDIA/NeMo-Curator.git
cd NeMo-Curator
```

2. Install the modules that you need.
#### From the NeMo Framework Container

To install the CPU-only modules:
The latest release of NeMo Curator comes preinstalled in the [NeMo Framework Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags). If you want the latest commit inside the container, you can reinstall NeMo Curator using:

```bash
pip install cython
pip install .
```
```bash
pip uninstall nemo-curator
rm -r /opt/NeMo-Curator
git clone https://github.com/NVIDIA/NeMo-Curator.git /opt/NeMo-Curator
pip install --extra-index-url https://pypi.nvidia.com /opt/NeMo-Curator[all]
```

To install the CPU and CUDA-accelerated modules:
#### Extras
NeMo Curator has a set of extras you can use to only install the necessary modules for your workload.
These extras are available for all installation methods provided.

```bash
pip install cython
pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"
```
```bash
pip install nemo-curator # Installs CPU-only text curation modules
pip install --extra-index-url https://pypi.nvidia.com nemo-curator[cuda12x] # Installs CPU + GPU text curation modules
pip install --extra-index-url https://pypi.nvidia.com nemo-curator[image] # Installs CPU + GPU text and image curation modules
pip install --extra-index-url https://pypi.nvidia.com nemo-curator[all] # Installs all of the above
```

#### Using Nightly Dependencies for Rapids

You can also install NeMo Curator using the Rapids nightly, to do so you can set the environment variable `RAPIDS_NIGHTLY=1`.
#### Using Nightly Dependencies for RAPIDS

You can also install NeMo Curator using the [RAPIDS Nightly Builds](https://docs.rapids.ai/install). To do so, you can set the environment variable `RAPIDS_NIGHTLY=1`.

```bash
# installing from pypi
@@ -146,18 +130,6 @@ RAPIDS_NIGHTLY=1 pip install --extra-index-url=https://pypi.anaconda.org/rapidsa

When the environment variable is set to 0 or not set (the default behavior), the stable version of RAPIDS is used.

#### From the NeMo Framework Container

The latest release of NeMo Curator comes preinstalled in the [NeMo Framework Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags). If you want the latest commit inside the container, you can reinstall NeMo Curator using:

```bash
pip uninstall nemo-curator
rm -r /opt/NeMo-Curator
git clone https://github.com/NVIDIA/NeMo-Curator.git /opt/NeMo-Curator
pip install --extra-index-url https://pypi.nvidia.com /opt/NeMo-Curator[cuda12x]
```
And follow the instructions for installing from source from [above](#from-source).

## Use NeMo Curator
### Python API Quick Example

@@ -189,6 +161,7 @@ To get started with NeMo Curator, you can follow the tutorials [available here](
- [`peft-curation`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/peft-curation) which focuses on data curation for LLM parameter-efficient fine-tuning (PEFT) use-cases.
- [`distributed_data_classification`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) which focuses on using the quality and domain classifiers to help with data annotation.
- [`single_node_tutorial`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/single_node_tutorial) which demonstrates an end-to-end data curation pipeline for curating Wikipedia data in Thai.
- [`image-curation`](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/image-curation/image-curation.ipynb) which explores the scalable image curation modules.


### Access Python Modules
@@ -201,9 +174,9 @@ NeMo Curator also offers CLI scripts for you to use. The scripts in `nemo_curato

### Use NeMo Framework Launcher

As an alternative method for interfacing with NeMo Curator, you can use the [NeMo Framework Launcher](https://github.com/NVIDIA/NeMo-Megatron-Launcher). The launcher enables you to easily configure the parameters and cluster. It can also automatically generate the SLURM batch scripts that wrap around the CLI scripts required to run your pipeline.
As an alternative method for interfacing with NeMo Curator, you can use the [NeMo Framework Launcher](https://github.com/NVIDIA/NeMo-Megatron-Launcher). The launcher enables you to easily configure the parameters and cluster. It can also automatically generate the Slurm batch scripts that wrap around the CLI scripts required to run your pipeline.

In addition, other methods are available to run NeMo Curator on SLURM. For example, refer to the example scripts in [`examples/slurm`](examples/slurm/) for information on how to run NeMo Curator on SLURM without the NeMo Framework Launcher.
In addition, other methods are available to run NeMo Curator on Slurm. For example, refer to the example scripts in [`examples/slurm`](examples/slurm/) for information on how to run NeMo Curator on Slurm without the NeMo Framework Launcher.

## Module Ablation and Compute Performance

@@ -212,7 +185,7 @@ The modules within NeMo Curator were primarily designed to curate high-quality d
The following figure shows that the use of different data curation modules implemented in NeMo Curator led to improved model zero-shot downstream task performance.

<p align="center">
<img src="./docs/user-guide/images/zeroshot_ablations.png" alt="drawing" width="700"/>
<img src="./docs/user-guide/assets/zeroshot_ablations.png" alt="drawing" width="700"/>
</p>

In terms of scalability and compute performance, using the combination of RAPIDS and Dask fuzzy deduplication enabled us to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours with 64 NVIDIA A100 Tensor Core GPUs.
8 changes: 8 additions & 0 deletions docs/user-guide/api/datasets.rst
@@ -7,4 +7,12 @@ DocumentDataset
-------------------

.. autoclass:: nemo_curator.datasets.DocumentDataset
:members:


-------------------------------
ImageTextPairDataset
-------------------------------

.. autoclass:: nemo_curator.datasets.ImageTextPairDataset
:members:
21 changes: 21 additions & 0 deletions docs/user-guide/api/image/classifiers.rst
@@ -0,0 +1,21 @@
======================================
Classifiers
======================================

------------------------------
Base Class
------------------------------

.. autoclass:: nemo_curator.image.classifiers.ImageClassifier
:members:


------------------------------
Image Classifiers
------------------------------

.. autoclass:: nemo_curator.image.classifiers.AestheticClassifier
:members:

.. autoclass:: nemo_curator.image.classifiers.NsfwClassifier
:members:
18 changes: 18 additions & 0 deletions docs/user-guide/api/image/embedders.rst
@@ -0,0 +1,18 @@
======================================
Embedders
======================================

------------------------------
Base Class
------------------------------

.. autoclass:: nemo_curator.image.embedders.ImageEmbedder
:members:


------------------------------
Timm
------------------------------

.. autoclass:: nemo_curator.image.embedders.TimmImageEmbedder
:members:
10 changes: 10 additions & 0 deletions docs/user-guide/api/image/index.rst
@@ -0,0 +1,10 @@
======================================
Image Curation
======================================

.. toctree::
:maxdepth: 4
:titlesonly:

embedders.rst
classifiers.rst
1 change: 1 addition & 0 deletions docs/user-guide/api/index.rst
@@ -18,4 +18,5 @@ API Reference
decontamination.rst
services.rst
synthetic.rst
image/index.rst
misc.rst
2 changes: 1 addition & 1 deletion docs/user-guide/distributeddataclassification.rst
@@ -126,7 +126,7 @@ The key feature of CrossFit used in NeMo Curator is the sorted sequence data loa
- Groups sorted sequences into optimized batches.
- Efficiently allocates batches to the provided GPU memory by estimating the memory footprint for each sequence length and batch size.

.. image:: images/sorted_sequence_dataloader.png
.. image:: assets/sorted_sequence_dataloader.png
:alt: Sorted Sequence Data Loader

Check out the `rapidsai/crossfit`_ repository for more information.
97 changes: 97 additions & 0 deletions docs/user-guide/image/classifiers/aesthetic.rst
@@ -0,0 +1,97 @@
=========================
Aesthetic Classifier
=========================

--------------------
Overview
--------------------
Aesthetic classifiers assess the subjective visual quality of an image.
NeMo Curator integrates the `improved aesthetic predictor <https://github.com/christophschuhmann/improved-aesthetic-predictor>`_, which outputs a score from 0 to 10, where 10 indicates high aesthetic quality.
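Conceptually, this predictor is a small head applied on top of a CLIP image embedding. A toy NumPy sketch of that idea, using random stand-in weights (the real predictor loads trained parameters and its own architecture, so every name and number here is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 768  # CLIP ViT-L/14 image embedding size

# Toy stand-in weights; the real predictor uses trained parameters.
weights = rng.normal(size=EMBED_DIM)
bias = 5.0

def toy_aesthetic_score(embedding: np.ndarray) -> float:
    """Score = linear head over an L2-normalized embedding, clipped to [0, 10]."""
    e = embedding / np.linalg.norm(embedding)
    return float(np.clip(e @ weights + bias, 0.0, 10.0))

score = toy_aesthetic_score(rng.normal(size=EMBED_DIM))
```

Because the head is this small relative to the embedding model, scoring is cheap once embeddings exist, which is why the usage below computes embeddings first and classifies afterwards.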

--------------------
Use Cases
--------------------
Filtering by aesthetic quality is common in generative image pipelines.
For example, `Stable Diffusion <https://github.com/CompVis/stable-diffusion?tab=readme-ov-file#weights>`_ progressively filtered by aesthetic score during training.


--------------------
Prerequisites
--------------------
Make sure you check out the `image curation getting started page <https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/image/gettingstarted.html>`_ to install everything you will need.

--------------------
Usage
--------------------

The aesthetic classifier is a linear model that takes OpenAI CLIP ViT-L/14 image embeddings as input.
This model is available through the ``vit_large_patch14_clip_quickgelu_224.openai`` identifier in ``TimmImageEmbedder``.
First, we can compute these embeddings, then we can perform the classification.

.. code-block:: python

from nemo_curator import get_client
from nemo_curator.datasets import ImageTextPairDataset
from nemo_curator.image.embedders import TimmImageEmbedder
from nemo_curator.image.classifiers import AestheticClassifier

client = get_client(cluster_type="gpu")

dataset = ImageTextPairDataset.from_webdataset(path="/path/to/dataset", id_col="key")

embedding_model = TimmImageEmbedder(
"vit_large_patch14_clip_quickgelu_224.openai",
pretrained=True,
batch_size=1024,
num_threads_per_worker=16,
normalize_embeddings=True,
)
aesthetic_classifier = AestheticClassifier()

dataset_with_embeddings = embedding_model(dataset)
dataset_with_aesthetic_scores = aesthetic_classifier(dataset_with_embeddings)

# Metadata will have a new column named "aesthetic_score"
dataset_with_aesthetic_scores.save_metadata()

--------------------
Key Parameters
--------------------
* ``batch_size=-1`` is the optional batch size parameter. The default of ``-1`` processes all the embeddings in a shard at once. Since the aesthetic classifier is a linear model, this is usually fine.

---------------------------
Performance Considerations
---------------------------
Since the aesthetic model is so small, you can load it onto the GPU at the same time as the embedding model and perform inference directly after computing the embeddings.
Check out this example:

.. code-block:: python

from nemo_curator import get_client
from nemo_curator.datasets import ImageTextPairDataset
from nemo_curator.image.embedders import TimmImageEmbedder
from nemo_curator.image.classifiers import AestheticClassifier

client = get_client(cluster_type="gpu")

dataset = ImageTextPairDataset.from_webdataset(path="/path/to/dataset", id_col="key")

embedding_model = TimmImageEmbedder(
"vit_large_patch14_clip_quickgelu_224.openai",
pretrained=True,
batch_size=1024,
num_threads_per_worker=16,
normalize_embeddings=True,
classifiers=[AestheticClassifier()],
)

dataset_with_aesthetic_scores = embedding_model(dataset)

# Metadata will have a new column named "aesthetic_score"
dataset_with_aesthetic_scores.save_metadata()

---------------------------
Additional Resources
---------------------------
* `Image Curation Tutorial <https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/image-curation/image-curation.ipynb>`_
* `API Reference <https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/api/image/classifiers.html>`_