LLM fine-tuning tutorial #26940

Merged · 22 commits · Jan 22, 2025

Changes from all commits

docs/docs-beta/docs/tutorials/category-two/llm-fine-tuning/feature-engineering.md
@@ -0,0 +1,81 @@
---
title: Feature engineering
description: Feature Engineering Book Categories
last_update:
  author: Dennis Hume
sidebar_position: 30
---

With the data loaded, we can think about how we might want to train our model. One possible use case is a model that can categorize books based on their details.

The Goodreads data does not include categories exactly, but it has something similar in `popular_shelves`. These are free-text tags that users can associate with books. Looking at a book, you can see how often certain shelves are used:

```sql
select popular_shelves from graphic_novels limit 5;
```

```
[{'count': 228, 'name': to-read}, {'count': 2, 'name': graphic-novels}, {'count': 1, 'name': ff-re-…
[{'count': 2, 'name': bd}, {'count': 2, 'name': to-read}, {'count': 1, 'name': french-author}, {'co…
[{'count': 493, 'name': to-read}, {'count': 113, 'name': graphic-novels}, {'count': 102, 'name': co…
[{'count': 222, 'name': to-read}, {'count': 9, 'name': currently-reading}, {'count': 3, 'name': mil…
[{'count': 20, 'name': to-read}, {'count': 8, 'name': comics}, {'count': 4, 'name': graphic-novel},…
```

By unpacking and aggregating this field, we can see the most popular shelves:

```sql
select
    shelf.name as category,
    sum(cast(shelf.count as integer)) as category_count
from (
    select
        unnest(popular_shelves) as shelf
    from graphic_novels
)
group by 1
order by 2 desc
limit 15;
```

| category | category_count |
| --- | --- |
| to-read | 87252 |
| comics | 76283 |
| graphic-novels | 67923 |
| graphic-novel | 58219 |
| currently-reading | 57252 |
| fiction | 50014 |
| owned | 48936 |
| favorites | 47256 |
| comic | 46948 |
| comics-graphic-novels | 38433 |
| fantasy | 37003 |
| comic-books | 36638 |
| default | 35292 |
| books-i-own | 34620 |
| library | 31378 |

A lot of these shelves would be hard to use for modeling (such as `owned` or `default`), but genres such as `fantasy` could be interesting. If we continue looking through the shelves, these are the most popular genres:

```python
CATEGORIES = [
    "fantasy", "horror", "humor", "adventure",
    "action", "romance", "ya", "superheroes",
    "comedy", "mystery", "supernatural", "drama",
]
```

Using these categories, we can construct a table of the most common genres and select the single best genre for each book (assuming it was shelved that way at least three times). We can then wrap that query in an asset and materialize it as a table alongside our other DuckDB tables:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="64" lineEnd="105"/>
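
As a minimal sketch of how that asset might look, assuming the `graphic_novels` table from the ingestion step and a `book_id` column in the source data (the actual query lives in `assets.py` above):

```python
# Hypothetical sketch, not the exact asset from the project.
from dagster import asset
from dagster_duckdb import DuckDBResource


@asset(deps=["graphic_novels"])
def book_category(duckdb: DuckDBResource) -> None:
    # Rank each book's genre shelves by usage and keep the single most
    # popular one, provided it was applied at least three times.
    with duckdb.get_connection() as conn:
        conn.execute(
            """
            create or replace table book_category as
            select book_id, category
            from (
                select
                    book_id,
                    shelf.name as category,
                    cast(shelf.count as integer) as shelf_count,
                    row_number() over (
                        partition by book_id
                        order by cast(shelf.count as integer) desc
                    ) as shelf_rank
                from (
                    select book_id, unnest(popular_shelves) as shelf
                    from graphic_novels
                )
                where shelf.name in (
                    'fantasy', 'horror', 'humor', 'adventure',
                    'action', 'romance', 'ya', 'superheroes',
                    'comedy', 'mystery', 'supernatural', 'drama'
                )
            )
            where shelf_rank = 1 and shelf_count >= 3
            """
        )
```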

## Enrichment table

With our `book_category` asset created, we can combine it with the `author` and `graphic_novel` assets to create the final dataset we will use for modeling. Here we will both create the table within DuckDB and select its contents into a DataFrame, which we can pass to our next series of assets:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="107" lineEnd="134"/>
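
A rough sketch of that asset, with assumed join keys, column names, and dependency names for the Goodreads schema (the real version is in `assets.py` above):

```python
# Hypothetical sketch; join keys, column names, and asset names are assumptions.
import pandas as pd
from dagster import asset
from dagster_duckdb import DuckDBResource


@asset(deps=["graphic_novels", "authors", "book_category"])
def enriched_graphic_novels(duckdb: DuckDBResource) -> pd.DataFrame:
    with duckdb.get_connection() as conn:
        # Materialize the enrichment table in DuckDB...
        conn.execute(
            """
            create or replace table enriched_graphic_novels as
            select
                books.title,
                authors.name as author,
                books.description,
                book_category.category
            from graphic_novels as books
            join authors on books.author_id = authors.author_id
            join book_category on books.book_id = book_category.book_id
            """
        )
        # ...and also hand its contents to downstream assets as a DataFrame.
        return conn.execute("select * from enriched_graphic_novels").fetch_df()
```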

## Next steps

- Continue this tutorial with [file creation](file-creation)

docs/docs-beta/docs/tutorials/category-two/llm-fine-tuning/file-creation.md
@@ -0,0 +1,41 @@
---
title: File creation
description: File Creation and File Validation
last_update:
  author: Dennis Hume
sidebar_position: 40
---

Using the data we prepared in the [previous step](feature-engineering), we will create two files: a training file and a validation file. A training file provides the model with labeled data to learn patterns from, while a validation file evaluates the model's performance on unseen data to prevent overfitting. These will be used in our OpenAI fine-tuning job to create our model. The columnar data from our DuckDB assets needs to be formatted into messages that resemble the conversation a user would have with a chatbot. Here we inject the values of those fields into conversations:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="136" lineEnd="154"/>
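
For illustration, one training record might be built like this; the prompt wording and column names are assumptions rather than the exact text in `assets.py`:

```python
# Hypothetical sketch of a single chat-formatted training record.
def create_record(row: dict) -> dict:
    return {
        "messages": [
            {
                "role": "system",
                "content": "You categorize graphic novels into a single genre.",
            },
            {
                "role": "user",
                "content": (
                    f"Title: {row['title']}\n"
                    f"Author: {row['author']}\n"
                    f"Description: {row['description']}"
                ),
            },
            # The label the fine-tuned model should learn to produce.
            {"role": "assistant", "content": row["category"]},
        ]
    }
```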

The fine-tuning process does not need all the data prepared in `enriched_graphic_novels`. We will take a sample of the DataFrame and write it to a `.jsonl` file. The assets that create the training and validation sets are nearly identical (only the filename differs). They take in the `enriched_graphic_novels` asset, generate the prompts, and write the outputs to a locally stored file:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="156" lineEnd="172"/>
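
A minimal sketch of the sampling-and-writing step, reusing the hypothetical `create_record` helper above (the sample size is also an assumption):

```python
# Hypothetical sketch; the project's actual sampling logic may differ.
import json

import pandas as pd


def write_jsonl(df: pd.DataFrame, path: str, n: int = 500) -> None:
    sample = df.sample(n=n)
    with open(path, "w") as f:
        for _, row in sample.iterrows():
            f.write(json.dumps(create_record(row.to_dict())) + "\n")
```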

:::note

Because these assets save data to the local filesystem, this approach may not work with every type of deployment.

:::

## Validation

The files are ready, but before we send the data to OpenAI for the training job, we should do some validation. It is always a good idea to put checkpoints in place as your workflows become more involved. Taking the time to ensure our data is formatted correctly can save debugging time before other APIs get involved.

Luckily, OpenAI provides a cookbook specifically about [format validation](https://cookbook.openai.com/examples/chat_finetuning_data_prep#format-validation). This contains a series of checks we can perform to ensure our data meets the requirements for OpenAI training jobs.

Looking at this notebook, these checks would make a great asset check. Asset checks help ensure that the assets in our DAG meet criteria we define. Asset checks look similar to assets, but they are attached directly to an asset and do not appear as separate nodes within the DAG.

Since we want an asset check for both the training and validation files, we will write a general function that contains the logic from the cookbook:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="192" lineEnd="237"/>

This looks like any other Python function, except that it returns an `AssetCheckResult`, which is what Dagster uses to store the output of an asset check. Now we can use that function to create asset checks tied directly to our file assets. Again, they look similar to assets, except that they use the `asset_check` decorator:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="239" lineEnd="249"/>
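
As a sketch, an asset check of this shape might look like the following; the asset name, file name, and the simplified inline validation (standing in for the full cookbook checks) are all assumptions:

```python
# Hypothetical sketch of wiring format validation into an asset check.
import json

from dagster import AssetCheckResult, asset_check


@asset_check(asset="goodreads_training_file")
def training_file_is_valid() -> AssetCheckResult:
    # Minimal stand-in for the cookbook checks: every line must be valid
    # JSON with a non-empty "messages" list.
    errors = 0
    with open("goodreads-training.jsonl") as f:
        for line in f:
            try:
                record = json.loads(line)
                if not record.get("messages"):
                    errors += 1
            except json.JSONDecodeError:
                errors += 1
    return AssetCheckResult(passed=errors == 0, metadata={"errors": errors})
```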

## Next steps

- Continue this tutorial with [OpenAI job](open-ai-job)

docs/docs-beta/docs/tutorials/category-two/llm-fine-tuning/index.md
@@ -0,0 +1,75 @@
---
title: LLM fine-tuning
description: Learn how to fine-tune an LLM
last_update:
  author: Dennis Hume
sidebar_position: 10
---

# Fine-tune an LLM

In this tutorial, you'll build a pipeline with Dagster that:

- Loads a public Goodreads JSON dataset into DuckDB
- Performs feature engineering to enhance the data
- Creates and validates the data files needed for an OpenAI fine-tuning job
- Generates a custom model and validates it

<details>
<summary>Prerequisites</summary>

To follow the steps in this guide, you'll need:

- Basic Python knowledge
- Python 3.9+ installed on your system. Refer to the [Installation guide](/getting-started/installation) for information.
- Familiarity with SQL and Python data manipulation libraries, such as [Pandas](https://pandas.pydata.org/).
- Understanding of data pipelines and the extract, transform, and load (ETL) process.
</details>


## Step 1: Set up your Dagster environment

First, set up a new Dagster project.

1. Within the Dagster repo, navigate to the project:

```bash
cd examples/project_llm_fine_tune
```

2. Create and activate a virtual environment:

<Tabs>
<TabItem value="macos" label="MacOS">
```bash
uv venv dagster_tutorial
source dagster_tutorial/bin/activate
```
</TabItem>
<TabItem value="windows" label="Windows">
```bash
uv venv dagster_tutorial
dagster_tutorial\Scripts\activate
```
</TabItem>
</Tabs>

3. Install Dagster and the required dependencies:

```bash
uv pip install -e ".[dev]"
```

## Step 2: Launch the Dagster webserver

To make sure Dagster and its dependencies were installed correctly, navigate to the project root directory and start the Dagster webserver:

```bash
dagster dev
```

## Next steps

- Continue this tutorial with [ingestion](ingestion)

docs/docs-beta/docs/tutorials/category-two/llm-fine-tuning/ingestion.md
@@ -0,0 +1,21 @@
---
title: Ingestion
description: Ingest Data from Goodreads
last_update:
  author: Dennis Hume
sidebar_position: 20
---

We will be working with a [Goodreads dataset](https://mengtingwan.github.io/data/goodreads#datasets) that consists of JSON files containing different genres of books. We will focus on graphic novels to limit the amount of data we need to process. Within this domain, the files we need are `goodreads_books_comics_graphic.json.gz` and `goodreads_book_authors.json.gz`.

Since the data is normalized across these two files, we will want to combine them in some way to produce a single dataset. This is a great use case for [DuckDB](https://duckdb.org/). DuckDB is an in-process database, similar to SQLite, optimized for analytical workloads. Using DuckDB, we can directly load the semi-structured data and work on it using SQL.

We will start by creating two Dagster assets to load the data. Each asset will load one of the files and create a DuckDB table (`graphic_novels` and `authors`). The assets use the Dagster `DuckDBResource`, which gives us an easy way to interact with and run queries in DuckDB. Each asset creates its table from the corresponding JSON file:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="22" lineEnd="41"/>
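
A minimal sketch of one such asset, using DuckDB's `read_json_auto` to load the gzipped JSON directly (the real code is in `assets.py` above):

```python
# Hypothetical sketch; the real asset is in assets.py.
from dagster import asset
from dagster_duckdb import DuckDBResource


@asset
def graphic_novels(duckdb: DuckDBResource) -> None:
    with duckdb.get_connection() as conn:
        # DuckDB reads the gzipped JSON directly and infers the schema.
        conn.execute(
            """
            create or replace table graphic_novels as
            select * from read_json_auto('goodreads_books_comics_graphic.json.gz')
            """
        )
```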

Now that the base tables are loaded, we can move on to working with the data.

## Next steps

- Continue this tutorial with [feature engineering](feature-engineering)

docs/docs-beta/docs/tutorials/category-two/llm-fine-tuning/model-validation.md
@@ -0,0 +1,25 @@
---
title: Model validation
description: Validate Fine-Tuned Model
last_update:
  author: Dennis Hume
sidebar_position: 60
---

We are going to use another asset check and tie this to our `fine_tuned_model` asset. This will be slightly more sophisticated than our file validation asset check, since it will need to use both the OpenAI resource and the output of the `enriched_graphic_novels` asset.

We will take another sample of data (100 records) from `enriched_graphic_novels`. Even though our asset check is for the `fine_tuned_model` asset, we can still use the `enriched_graphic_novels` asset by including it as an `additional_ins`. With this sample, we can use OpenAI to try to determine the category. We will run the same sample records against the base model (`gpt-4o-mini-2024-07-18`) and our fine-tuned model (`ft:gpt-4o-mini-2024-07-18:test:goodreads:AoAYW0x3`), and then compare the number of correct answers for both models:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="382" lineEnd="425"/>
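
The core comparison might look something like this sketch, assuming an OpenAI client and records shaped like the chat-formatted training data from the file creation step:

```python
# Hypothetical sketch of scoring one model against the labeled sample.
def count_correct(client, model: str, records: list[dict]) -> int:
    correct = 0
    for record in records:
        messages = record["messages"]
        response = client.chat.completions.create(
            model=model,
            messages=messages[:-1],  # everything except the assistant label
        )
        prediction = response.choices[0].message.content.strip().lower()
        label = messages[-1]["content"].strip().lower()
        if prediction == label:
            correct += 1
    return correct
```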

We will store the accuracy of both models as metadata on the check. Because this is an asset check, it will run automatically every time we run our fine-tuning asset. When we execute the pipeline, you will see that our check passed: the fine-tuned model correctly identified 76 of the genres in our sample, while the base model was correct in only 44 instances.

![2048 resolution](/images/tutorial/llm-fine-tuning/model_accuracy_1.png)

We can also execute this asset check separately from the fine-tuning job whenever we want to compare accuracy. Running it a few more times, we can see the accuracy plotted over time:

![2048 resolution](/images/tutorial/llm-fine-tuning/model_accuracy_2.png)

## Summary

This should give you a good sense of how to fine-tune a model end to end: ingesting the files, creating features, and generating and validating the model.

docs/docs-beta/docs/tutorials/category-two/llm-fine-tuning/open-ai-job.md
@@ -0,0 +1,27 @@
---
title: OpenAI Job
description: Execute the OpenAI Fine-Tuning Job
last_update:
  author: Dennis Hume
sidebar_position: 50
---

Now that we are confident in the files we have generated, we can kick off our OpenAI fine-tuning job. The first step is uploading the files to the OpenAI storage endpoint. As with DuckDB, Dagster offers a resource to interact with OpenAI that provides a client we can use. After the files have been uploaded, OpenAI returns a file ID for each, which we will need for the fine-tuning job.

We have an asset for each file to upload; as in the file creation step, the two look similar:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="263" lineEnd="277"/>
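
As a rough sketch, one upload asset might look like this, with assumed asset and file names:

```python
# Hypothetical sketch; asset and file names are assumptions.
from dagster import AssetExecutionContext, asset
from dagster_openai import OpenAIResource


@asset(deps=["goodreads_training_file"])
def uploaded_training_file(
    context: AssetExecutionContext, openai: OpenAIResource
) -> str:
    with openai.get_client(context) as client:
        with open("goodreads-training.jsonl", "rb") as f:
            uploaded = client.files.create(file=f, purpose="fine-tune")
    # OpenAI returns a file ID that the fine-tuning job will reference.
    return uploaded.id
```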

## Fine-tuning job

We can now fine-tune our model. Using the OpenAI resource again, we will use the fine-tuning endpoint and submit a job with our two files. Executing a fine-tuning job may take a while, so after submitting it, we want the asset to poll and wait for the job to succeed:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="295" lineEnd="328"/>
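
The submit-and-poll logic might look something like this sketch; the base model name comes from the tutorial, while the poll interval is an assumption:

```python
# Hypothetical sketch of submitting a fine-tuning job and waiting for it.
import time


def run_fine_tuning(client, training_file_id: str, validation_file_id: str) -> str:
    job = client.fine_tuning.jobs.create(
        training_file=training_file_id,
        validation_file=validation_file_id,
        model="gpt-4o-mini-2024-07-18",
    )
    while True:
        job = client.fine_tuning.jobs.retrieve(job.id)
        if job.status == "succeeded":
            return job.fine_tuned_model  # e.g. "ft:gpt-4o-mini-2024-07-18:..."
        if job.status in ("failed", "cancelled"):
            raise RuntimeError(f"fine-tuning job ended with status {job.status}")
        time.sleep(30)
```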

After the fine-tuning job has succeeded, we are given the unique name of our new model (in this case `ft:gpt-4o-mini-2024-07-18:test:goodreads:AoAYW0x3`). Note that we used `context.add_output_metadata` to record this as metadata, since it is useful to track all the fine-tuned models this job creates over time.

Now that we have a model, we should test whether it improves on the base model.

## Next steps

- Continue this tutorial with [model validation](model-validation)
1 change: 1 addition & 0 deletions examples/project_llm_fine_tune/.env.example
@@ -0,0 +1 @@
OPENAI_API_KEY=
12 changes: 12 additions & 0 deletions examples/project_llm_fine_tune/.gitignore
@@ -0,0 +1,12 @@
*.egg-info/*
*.mp3
*.pyc
.DS_Store
.direnv/
.env
.envrc
__pycache__/
tmp*/
data/data.duckdb
goodreads-training.jsonl
goodreads-validation.jsonl
9 changes: 9 additions & 0 deletions examples/project_llm_fine_tune/Makefile
@@ -0,0 +1,9 @@
ruff:
	ruff check --select I --fix .

clean:
	find . -name "__pycache__" -exec rm -rf {} +
	find . -name "*.pyc" -exec rm -f {} +

install:
	uv pip install -e .[dev]
46 changes: 46 additions & 0 deletions examples/project_llm_fine_tune/README.md
@@ -0,0 +1,46 @@
## Dagster × OpenAI Fine-Tune Demo

Fine-tune a custom model to detect specific features from Goodreads data.

In this example project we show how you can write a pipeline that ingests data from Goodreads
with DuckDB and then generates features for modeling. You can then use this data to fine-tune
a model in OpenAI to identify that feature, while also validating the model against the base
model it was built from.

### Example Asset Lineage

![Screenshot Dagster Lineage](_static/screenshot_dagster_lineage.png)

## Getting started

Install the project dependencies:

```sh
pip install -e ".[dev]"
```

Run Dagster:

```sh
dagster dev
```

Open http://localhost:3000 in your browser.

## References

Dagster

- [Dagster Docs](https://docs.dagster.io/)
- [Dagster Docs: DuckDB](https://docs.dagster.io/_apidocs/libraries/dagster-duckdb)
- [Dagster Docs: OpenAI Integration](https://docs.dagster.io/integrations/openai)

DuckDB

- [DuckDB Docs](https://duckdb.org/docs/)

OpenAI

- [OpenAI Fine-Tuning](https://platform.openai.com/docs/guides/fine-tuning)
- [OpenAI Cookbook: How to fine-tune chat models](https://cookbook.openai.com/examples/how_to_finetune_chat_models)
- [OpenAI Cookbook: Data preparation and analysis for chat model fine-tuning](https://cookbook.openai.com/examples/chat_finetuning_data_prep)