-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LLM fine-tuning tutorial #26940
base: master
Are you sure you want to change the base?
LLM fine-tuning tutorial #26940
Conversation
@cmpadden Putting up the code first if you want to review that. Still discussing how we want to structure tutorial content |
Just passing by: consider mentioning dynamic partitions to represent separate model checkpoints (experiments). I understand it will complicate the tutorial but training & using multiple model checkpoints at a time is what ML users tend to do a lot in practice. They'll probably need a ways for these checkpoints to a least co-exist in Dagster. |
query = f""" | ||
select * from enriched_graphic_novels | ||
""" | ||
with duckdb_resource.get_connection() as conn: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I couldn't figure out with an asset check, if it's possible to use the output of an asset that is not part of the check. Having to reread with DuckDB but if you could just take in enriched_graphic_novels
it would save some lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you give additional_deps
a shot?
https://docs.dagster.io/_apidocs/asset-checks#asset-checks
Assets that are upstream dependencies, but do not correspond to a parameter of the decorated function. These dependencies will apply to the underlying op that executes the check. These should not include the asset parameter, which is always included as a dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I think the implementation and demo are great.
I would like to know if we can have a more general prompt than is_manga
, for example assigning into a predefined set of categories, but would like to know more if there was a limitation that you were running into.
Overall awesome work, thanks.
|
||
# Fine-Tune an LLM | ||
|
||
In this tutorial, you'll build an pipeline with Dagster that: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this tutorial, you'll build an pipeline with Dagster that: | |
In this tutorial, you'll build a pipeline with Dagster that: |
|
||
In this tutorial, you'll build an pipeline with Dagster that: | ||
|
||
- Imports data from Goodreads DuckDB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Imports data from Goodreads DuckDB | |
- Loads a public Goodreads JSON dataset into DuckDB |
|
||
- Imports data from Goodreads DuckDB | ||
- Does feature engineering to enhance the data | ||
- Creates and validates a training and validation file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dehume I'll likely know what this means once I go through the code, but at this point I'm not entirely sure what this means.
python -m venv dagster_tutorial | ||
source dagster_tutorial/bin/activate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an aside, I wonder if we can take the docs-revamp as an opportunity to use uv venv
here.
In the `dagster-llm-fine-tune` root directory, there are three configuration files that are common in Python package management. These files manage dependencies and identify the Dagster modules in the project. | ||
| File | Purpose | | ||
|------|---------| | ||
| pyproject.toml | This file is used to specify build system requirements and package metadata for Python projects. It is part of the Python packaging ecosystem. | | ||
| setup.cfg | This file is used for configuration of your Python package. It can include metadata about the package, dependencies, and other configuration options. | | ||
| setup.py | This script is used to build and distribute your Python package. It is a standard file in Python projects for specifying package details. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@neverett can provide more feedback here, but this general information might be better suited for another section of the docs, and then referenced across the tutorials.
|
||
|
||
@dg.asset( | ||
kinds={"OpenAI"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kinds={"OpenAI"}, | |
kinds={"openai"}, |
|
||
|
||
@dg.asset( | ||
kinds={"OpenAI"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kinds={"OpenAI"}, | |
kinds={"openai"}, |
query = f""" | ||
select * from enriched_graphic_novels | ||
""" | ||
with duckdb_resource.get_connection() as conn: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you give additional_deps
a shot?
https://docs.dagster.io/_apidocs/asset-checks#asset-checks
Assets that are upstream dependencies, but do not correspond to a parameter of the decorated function. These dependencies will apply to the underlying op that executes the check. These should not include the asset parameter, which is always included as a dependency.
""" | ||
with open(file_name, "r") as training_file: | ||
for line in training_file: | ||
yield eval(line.strip()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eval
from a file makes me nervous, and has security implications.
if we do go forward with it, we might need to add a warning.
Returns: | ||
Generator: transcript for podcast if present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect return description
So the main problem I was having was 2 fold.
|
Summary & Motivation
Create a tutorial for fine tuning an LLM with OpenAI and DuckDB
How I Tested These Changes
Changelog