LLM fine-tuning tutorial #26940

Merged · 22 commits · Jan 22, 2025

Changes from all commits

docs/docs-beta/docs/tutorials/category-two/llm-fine-tuning/feature-engineering.md
@@ -0,0 +1,81 @@
---
title: Feature engineering
description: Feature Engineering Book Categories
last_update:
  author: Dennis Hume
sidebar_position: 30
---

With the data loaded, we can think about how we might want to train our model. One possible use case is a model that can categorize books based on their details.

The Goodreads data does not include categories exactly, but it has something similar in `popular_shelves`. These are free-text tags that users can associate with books. Looking at a book, you can see how often certain shelves are used:

```sql
select popular_shelves from graphic_novels limit 5;
```

```
[{'count': 228, 'name': to-read}, {'count': 2, 'name': graphic-novels}, {'count': 1, 'name': ff-re-…
[{'count': 2, 'name': bd}, {'count': 2, 'name': to-read}, {'count': 1, 'name': french-author}, {'co…
[{'count': 493, 'name': to-read}, {'count': 113, 'name': graphic-novels}, {'count': 102, 'name': co…
[{'count': 222, 'name': to-read}, {'count': 9, 'name': currently-reading}, {'count': 3, 'name': mil…
[{'count': 20, 'name': to-read}, {'count': 8, 'name': comics}, {'count': 4, 'name': graphic-novel},…
```

By unpacking and aggregating this field, we can see the most popular shelves:

```sql
select
    shelf.name as category,
    sum(cast(shelf.count as integer)) as category_count
from (
    select
        unnest(popular_shelves) as shelf
    from graphic_novels
)
group by 1
order by 2 desc
limit 15;
```

| category | category_count |
| --- | --- |
| to-read | 87252 |
| comics | 76283 |
| graphic-novels | 67923 |
| graphic-novel | 58219 |
| currently-reading | 57252 |
| fiction | 50014 |
| owned | 48936 |
| favorites | 47256 |
| comic | 46948 |
| comics-graphic-novels | 38433 |
| fantasy | 37003 |
| comic-books | 36638 |
| default | 35292 |
| books-i-own | 34620 |
| library | 31378 |

A lot of these shelves would be hard to use for modeling (such as `owned` or `default`), but genres such as `fantasy` could be interesting. If we continue looking through the shelves, these are the most popular genres:

```python
CATEGORIES = [
    "fantasy", "horror", "humor", "adventure",
    "action", "romance", "ya", "superheroes",
    "comedy", "mystery", "supernatural", "drama",
]
```

Using these categories, we can construct a table of the most common genres and select the single best genre for each book (assuming it was shelved that way at least three times). We can then wrap that query in an asset and materialize it as a table alongside our other DuckDB tables:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="64" lineEnd="105"/>
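
As a minimal sketch of how that asset might look, assuming the `graphic_novels` table from the ingestion step and a `book_id` column in the source data (the actual query lives in `assets.py` above):

```python
# Hypothetical sketch, not the exact asset from the project.
from dagster import asset
from dagster_duckdb import DuckDBResource


@asset(deps=["graphic_novels"])
def book_category(duckdb: DuckDBResource) -> None:
    # Rank each book's genre shelves by usage and keep the single most
    # popular one, provided it was applied at least three times.
    with duckdb.get_connection() as conn:
        conn.execute(
            """
            create or replace table book_category as
            select book_id, category
            from (
                select
                    book_id,
                    shelf.name as category,
                    cast(shelf.count as integer) as shelf_count,
                    row_number() over (
                        partition by book_id
                        order by cast(shelf.count as integer) desc
                    ) as shelf_rank
                from (
                    select book_id, unnest(popular_shelves) as shelf
                    from graphic_novels
                )
                where shelf.name in (
                    'fantasy', 'horror', 'humor', 'adventure',
                    'action', 'romance', 'ya', 'superheroes',
                    'comedy', 'mystery', 'supernatural', 'drama'
                )
            )
            where shelf_rank = 1 and shelf_count >= 3
            """
        )
```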

## Enrichment table

With our `book_category` asset created, we can combine it with the `author` and `graphic_novel` assets to create the final dataset we will use for modeling. Here we will both create the table within DuckDB and select its contents into a DataFrame, which we can pass to our next series of assets:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="107" lineEnd="134"/>
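
A rough sketch of that asset, with assumed join keys, column names, and dependency names for the Goodreads schema (the real version is in `assets.py` above):

```python
# Hypothetical sketch; join keys, column names, and asset names are assumptions.
import pandas as pd
from dagster import asset
from dagster_duckdb import DuckDBResource


@asset(deps=["graphic_novels", "authors", "book_category"])
def enriched_graphic_novels(duckdb: DuckDBResource) -> pd.DataFrame:
    with duckdb.get_connection() as conn:
        # Materialize the enrichment table in DuckDB...
        conn.execute(
            """
            create or replace table enriched_graphic_novels as
            select
                books.title,
                authors.name as author,
                books.description,
                book_category.category
            from graphic_novels as books
            join authors on books.author_id = authors.author_id
            join book_category on books.book_id = book_category.book_id
            """
        )
        # ...and also hand its contents to downstream assets as a DataFrame.
        return conn.execute("select * from enriched_graphic_novels").fetch_df()
```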

## Next steps

- Continue this tutorial with [file creation](file-creation)

docs/docs-beta/docs/tutorials/category-two/llm-fine-tuning/file-creation.md
@@ -0,0 +1,41 @@
---
title: File creation
description: File Creation and File Validation
last_update:
  author: Dennis Hume
sidebar_position: 40
---

Using the data we prepared in the [previous step](feature-engineering), we will create two files: a training file and a validation file. A training file provides the model with labeled data to learn patterns from, while a validation file evaluates the model's performance on unseen data to prevent overfitting. These will be used in our OpenAI fine-tuning job to create our model. The columnar data from our DuckDB assets needs to be formatted into messages that resemble the conversation a user would have with a chatbot. Here we inject the values of those fields into conversations:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="136" lineEnd="154"/>
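
For illustration, one training record might be built like this; the prompt wording and column names are assumptions rather than the exact text in `assets.py`:

```python
# Hypothetical sketch of a single chat-formatted training record.
def create_record(row: dict) -> dict:
    return {
        "messages": [
            {
                "role": "system",
                "content": "You categorize graphic novels into a single genre.",
            },
            {
                "role": "user",
                "content": (
                    f"Title: {row['title']}\n"
                    f"Author: {row['author']}\n"
                    f"Description: {row['description']}"
                ),
            },
            # The label the fine-tuned model should learn to produce.
            {"role": "assistant", "content": row["category"]},
        ]
    }
```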

The fine-tuning process does not need all the data prepared in `enriched_graphic_novels`. We will take a sample of the DataFrame and write it to a `.jsonl` file. The assets that create the training and validation sets are nearly identical (only the filename differs). They take in the `enriched_graphic_novels` asset, generate the prompts, and write the outputs to a locally stored file:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="156" lineEnd="172"/>
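
A minimal sketch of the sampling-and-writing step, reusing the hypothetical `create_record` helper above (the sample size is also an assumption):

```python
# Hypothetical sketch; the project's actual sampling logic may differ.
import json

import pandas as pd


def write_jsonl(df: pd.DataFrame, path: str, n: int = 500) -> None:
    sample = df.sample(n=n)
    with open(path, "w") as f:
        for _, row in sample.iterrows():
            f.write(json.dumps(create_record(row.to_dict())) + "\n")
```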

:::note

Because these assets save data to the local filesystem, this approach may not work with every type of deployment.

:::

## Validation

The files are ready, but before we send the data to OpenAI for the training job, we should do some validation. It is always a good idea to put checkpoints in place as your workflows become more involved. Taking the time to ensure our data is formatted correctly can save debugging time before other APIs get involved.

Luckily, OpenAI provides a cookbook specifically about [format validation](https://cookbook.openai.com/examples/chat_finetuning_data_prep#format-validation). This contains a series of checks we can perform to ensure our data meets the requirements for OpenAI training jobs.

Looking at this notebook, these checks would make a great asset check. Asset checks help ensure that the assets in our DAG meet criteria we define. Asset checks look similar to assets, but they are attached directly to an asset and do not appear as separate nodes within the DAG.

Since we want an asset check for both the training and validation files, we will write a general function that contains the logic from the cookbook:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="192" lineEnd="237"/>

This looks like any other Python function, except that it returns an `AssetCheckResult`, which is what Dagster uses to store the output of an asset check. Now we can use that function to create asset checks tied directly to our file assets. Again, they look similar to assets, except that they use the `asset_check` decorator:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="239" lineEnd="249"/>
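
As a sketch, an asset check of this shape might look like the following; the asset name, file name, and the simplified inline validation (standing in for the full cookbook checks) are all assumptions:

```python
# Hypothetical sketch of wiring format validation into an asset check.
import json

from dagster import AssetCheckResult, asset_check


@asset_check(asset="goodreads_training_file")
def training_file_is_valid() -> AssetCheckResult:
    # Minimal stand-in for the cookbook checks: every line must be valid
    # JSON with a non-empty "messages" list.
    errors = 0
    with open("goodreads-training.jsonl") as f:
        for line in f:
            try:
                record = json.loads(line)
                if not record.get("messages"):
                    errors += 1
            except json.JSONDecodeError:
                errors += 1
    return AssetCheckResult(passed=errors == 0, metadata={"errors": errors})
```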

## Next steps

- Continue this tutorial with [OpenAI job](open-ai-job)

docs/docs-beta/docs/tutorials/category-two/llm-fine-tuning/index.md
@@ -0,0 +1,75 @@
---
title: LLM fine-tuning
description: Learn how to fine-tune an LLM
last_update:
  author: Dennis Hume
sidebar_position: 10
---

# Fine-tune an LLM

In this tutorial, you'll build a pipeline with Dagster that:

- Loads a public Goodreads JSON dataset into DuckDB
- Performs feature engineering to enhance the data
- Creates and validates the data files needed for an OpenAI fine-tuning job
- Generates a custom model and validates it

<details>
<summary>Prerequisites</summary>

To follow the steps in this guide, you'll need:

- Basic Python knowledge
- Python 3.9+ installed on your system. Refer to the [Installation guide](/getting-started/installation) for information.
- Familiarity with SQL and Python data manipulation libraries, such as [Pandas](https://pandas.pydata.org/).
- Understanding of data pipelines and the extract, transform, and load (ETL) process.
</details>


## Step 1: Set up your Dagster environment

First, set up a new Dagster project.

1. Within the Dagster repo, navigate to the project:

```bash
cd examples/project_llm_fine_tune
```

2. Create and activate a virtual environment:

<Tabs>
<TabItem value="macos" label="MacOS">
```bash
uv venv dagster_tutorial
source dagster_tutorial/bin/activate
```
</TabItem>
<TabItem value="windows" label="Windows">
```bash
uv venv dagster_tutorial
dagster_tutorial\Scripts\activate
```
</TabItem>
</Tabs>

3. Install Dagster and the required dependencies:

```bash
uv pip install -e ".[dev]"
```

## Step 2: Launch the Dagster webserver

To make sure Dagster and its dependencies were installed correctly, navigate to the project root directory and start the Dagster webserver:

```bash
dagster dev
```

## Next steps

- Continue this tutorial with [ingestion](ingestion)

docs/docs-beta/docs/tutorials/category-two/llm-fine-tuning/ingestion.md
@@ -0,0 +1,21 @@
---
title: Ingestion
description: Ingest Data from Goodreads
last_update:
  author: Dennis Hume
sidebar_position: 20
---

We will be working with a [Goodreads dataset](https://mengtingwan.github.io/data/goodreads#datasets) that consists of JSON files containing different genres of books. We will focus on graphic novels to limit the amount of data we need to process. Within this domain, the files we need are `goodreads_books_comics_graphic.json.gz` and `goodreads_book_authors.json.gz`.

Since the data is normalized across these two files, we will want to combine them in some way to produce a single dataset. This is a great use case for [DuckDB](https://duckdb.org/). DuckDB is an in-process database, similar to SQLite, optimized for analytical workloads. Using DuckDB, we can directly load the semi-structured data and work on it using SQL.

We will start by creating two Dagster assets to load the data. Each asset will load one of the files and create a DuckDB table (`graphic_novels` and `authors`). The assets use the Dagster `DuckDBResource`, which gives us an easy way to interact with and run queries in DuckDB. Each asset creates its table from the corresponding JSON file:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="22" lineEnd="41"/>
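
A minimal sketch of one such asset, using DuckDB's `read_json_auto` to load the gzipped JSON directly (the real code is in `assets.py` above):

```python
# Hypothetical sketch; the real asset is in assets.py.
from dagster import asset
from dagster_duckdb import DuckDBResource


@asset
def graphic_novels(duckdb: DuckDBResource) -> None:
    with duckdb.get_connection() as conn:
        # DuckDB reads the gzipped JSON directly and infers the schema.
        conn.execute(
            """
            create or replace table graphic_novels as
            select * from read_json_auto('goodreads_books_comics_graphic.json.gz')
            """
        )
```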

Now that the base tables are loaded, we can move on to working with the data.

## Next steps

- Continue this tutorial with [feature engineering](feature-engineering)

docs/docs-beta/docs/tutorials/category-two/llm-fine-tuning/model-validation.md
@@ -0,0 +1,25 @@
---
title: Model validation
description: Validate Fine-Tuned Model
last_update:
  author: Dennis Hume
sidebar_position: 60
---

We are going to use another asset check and tie this to our `fine_tuned_model` asset. This will be slightly more sophisticated than our file validation asset check, since it will need to use both the OpenAI resource and the output of the `enriched_graphic_novels` asset.

We will take another sample of data (100 records) from `enriched_graphic_novels`. Even though our asset check is for the `fine_tuned_model` asset, we can still use the `enriched_graphic_novels` asset by including it as an `additional_ins`. With this sample, we can use OpenAI to try to determine the category. We will run the same sample records against the base model (`gpt-4o-mini-2024-07-18`) and our fine-tuned model (`ft:gpt-4o-mini-2024-07-18:test:goodreads:AoAYW0x3`), and then compare the number of correct answers for both models:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="382" lineEnd="425"/>
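
The core comparison might look something like this sketch, assuming an OpenAI client and records shaped like the chat-formatted training data from the file creation step:

```python
# Hypothetical sketch of scoring one model against the labeled sample.
def count_correct(client, model: str, records: list[dict]) -> int:
    correct = 0
    for record in records:
        messages = record["messages"]
        response = client.chat.completions.create(
            model=model,
            messages=messages[:-1],  # everything except the assistant label
        )
        prediction = response.choices[0].message.content.strip().lower()
        label = messages[-1]["content"].strip().lower()
        if prediction == label:
            correct += 1
    return correct
```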

We will store the accuracy of both models as metadata on the check. Because this is an asset check, it will run automatically every time we run our fine-tuning asset. When we execute the pipeline, you will see that our check passed: the fine-tuned model correctly identified 76 of the genres in our sample, while the base model was correct in only 44 instances.

![2048 resolution](/images/tutorial/llm-fine-tuning/model_accuracy_1.png)

We can also execute this asset check separately from the fine-tuning job whenever we want to compare accuracy. Running it a few more times, we can see the accuracy plotted over time:

![2048 resolution](/images/tutorial/llm-fine-tuning/model_accuracy_2.png)

## Summary

This should give you a good sense of how to fine-tune a model end to end: ingesting the files, creating features, and generating and validating the model.

docs/docs-beta/docs/tutorials/category-two/llm-fine-tuning/open-ai-job.md
@@ -0,0 +1,27 @@
---
title: OpenAI Job
description: Execute the OpenAI Fine-Tuning Job
last_update:
  author: Dennis Hume
sidebar_position: 50
---

Now that we are confident in the files we have generated, we can kick off our OpenAI fine-tuning job. The first step is uploading the files to the OpenAI storage endpoint. As with DuckDB, Dagster offers a resource to interact with OpenAI that provides a client we can use. After the files have been uploaded, OpenAI returns a file ID for each, which we will need for the fine-tuning job.

We have an asset for each file to upload; as in the file creation step, the two look similar:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="263" lineEnd="277"/>
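
As a rough sketch, one upload asset might look like this, with assumed asset and file names:

```python
# Hypothetical sketch; asset and file names are assumptions.
from dagster import AssetExecutionContext, asset
from dagster_openai import OpenAIResource


@asset(deps=["goodreads_training_file"])
def uploaded_training_file(
    context: AssetExecutionContext, openai: OpenAIResource
) -> str:
    with openai.get_client(context) as client:
        with open("goodreads-training.jsonl", "rb") as f:
            uploaded = client.files.create(file=f, purpose="fine-tune")
    # OpenAI returns a file ID that the fine-tuning job will reference.
    return uploaded.id
```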

## Fine-tuning job

We can now fine-tune our model. Using the OpenAI resource again, we will use the fine-tuning endpoint and submit a job with our two files. Executing a fine-tuning job may take a while, so after submitting it, we want the asset to poll and wait for the job to succeed:

<CodeExample path="project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="295" lineEnd="328"/>
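
The submit-and-poll logic might look something like this sketch; the base model name comes from the tutorial, while the poll interval is an assumption:

```python
# Hypothetical sketch of submitting a fine-tuning job and waiting for it.
import time


def run_fine_tuning(client, training_file_id: str, validation_file_id: str) -> str:
    job = client.fine_tuning.jobs.create(
        training_file=training_file_id,
        validation_file=validation_file_id,
        model="gpt-4o-mini-2024-07-18",
    )
    while True:
        job = client.fine_tuning.jobs.retrieve(job.id)
        if job.status == "succeeded":
            return job.fine_tuned_model  # e.g. "ft:gpt-4o-mini-2024-07-18:..."
        if job.status in ("failed", "cancelled"):
            raise RuntimeError(f"fine-tuning job ended with status {job.status}")
        time.sleep(30)
```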

After the fine-tuning job has succeeded, we are given the unique name of our new model (in this case `ft:gpt-4o-mini-2024-07-18:test:goodreads:AoAYW0x3`). Note that we used `context.add_output_metadata` to record this as metadata, since it is useful to track all the fine-tuned models this job creates over time.

Now that we have a model, we should test whether it improves on the base model.

## Next steps

- Continue this tutorial with [model validation](model-validation)
1 change: 1 addition & 0 deletions examples/project_llm_fine_tune/.env.example
@@ -0,0 +1 @@
OPENAI_API_KEY=
12 changes: 12 additions & 0 deletions examples/project_llm_fine_tune/.gitignore
@@ -0,0 +1,12 @@
*.egg-info/*
*.mp3
*.pyc
.DS_Store
.direnv/
.env
.envrc
__pycache__/
tmp*/
data/data.duckdb
goodreads-training.jsonl
goodreads-validation.jsonl
9 changes: 9 additions & 0 deletions examples/project_llm_fine_tune/Makefile
@@ -0,0 +1,9 @@
ruff:
	ruff check --select I --fix .

clean:
	find . -name "__pycache__" -exec rm -rf {} +
	find . -name "*.pyc" -exec rm -f {} +

install:
	uv pip install -e .[dev]
46 changes: 46 additions & 0 deletions examples/project_llm_fine_tune/README.md
@@ -0,0 +1,46 @@
## Dagster × OpenAI Fine-Tune Demo

Fine-tune a custom model to detect specific features from Goodreads data.

In this example project we show how you can write a pipeline that ingests data from Goodreads
with DuckDB and then generates features for modeling. You can then use this data to fine-tune
a model in OpenAI to identify that feature, while also validating the model against the base
model it was built from.

### Example Asset Lineage

![Screenshot Dagster Lineage](_static/screenshot_dagster_lineage.png)

## Getting started

Install the project dependencies:

```sh
pip install -e ".[dev]"
```

Run Dagster:

```sh
dagster dev
```

Open http://localhost:3000 in your browser.

## References

Dagster

- [Dagster Docs](https://docs.dagster.io/)
- [Dagster Docs: DuckDB](https://docs.dagster.io/_apidocs/libraries/dagster-duckdb)
- [Dagster Docs: OpenAI Integration](https://docs.dagster.io/integrations/openai)

DuckDB

- [DuckDB Docs](https://duckdb.org/docs/)

OpenAI

- [OpenAI Fine-Tuning](https://platform.openai.com/docs/guides/fine-tuning)
- [OpenAI Cookbook: How to fine-tune chat models](https://cookbook.openai.com/examples/how_to_finetune_chat_models)
- [OpenAI Cookbook: Data preparation and analysis for chat model fine-tuning](https://cookbook.openai.com/examples/chat_finetuning_data_prep)