Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LLM fine-tuning tutorial #26940

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from
Draft

Conversation

dehume
Copy link
Contributor

@dehume dehume commented Jan 8, 2025

Summary & Motivation

Create a tutorial for fine tuning an LLM with OpenAI and DuckDB

How I Tested These Changes

Changelog

Insert changelog entry or delete this section.

@dehume
Copy link
Contributor Author

dehume commented Jan 8, 2025

@cmpadden Putting up the code first if you want to review that. Still discussing how we want to structure tutorial content

@danielgafni
Copy link
Contributor

Just passing by: consider mentioning dynamic partitions to represent separate model checkpoints (experiments).

I understand it will complicate the tutorial but training & using multiple model checkpoints at a time is what ML users tend to do a lot in practice.

They'll probably need a ways for these checkpoints to a least co-exist in Dagster.

query = f"""
select * from enriched_graphic_novels
"""
with duckdb_resource.get_connection() as conn:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I couldn't figure out with an asset check, if it's possible to use the output of an asset that is not part of the check. Having to reread with DuckDB but if you could just take in enriched_graphic_novels it would save some lines.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you give additional_deps a shot?

https://docs.dagster.io/_apidocs/asset-checks#asset-checks

Assets that are upstream dependencies, but do not correspond to a parameter of the decorated function. These dependencies will apply to the underlying op that executes the check. These should not include the asset parameter, which is always included as a dependency.

Copy link
Contributor

@cmpadden cmpadden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think the implementation and demo are great.

I would like to know if we can have a more general prompt than is_manga, for example assigning into a predefined set of categories, but would like to know more if there was a limitation that you were running into.

Overall awesome work, thanks.


# Fine-Tune an LLM

In this tutorial, you'll build an pipeline with Dagster that:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In this tutorial, you'll build an pipeline with Dagster that:
In this tutorial, you'll build a pipeline with Dagster that:


In this tutorial, you'll build an pipeline with Dagster that:

- Imports data from Goodreads DuckDB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Imports data from Goodreads DuckDB
- Loads a public Goodreads JSON dataset into DuckDB


- Imports data from Goodreads DuckDB
- Does feature engineering to enhance the data
- Creates and validates a training and validation file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dehume I'll likely know what this means once I go through the code, but at this point I'm not entirely sure what this means.

Comment on lines +54 to +55
python -m venv dagster_tutorial
source dagster_tutorial/bin/activate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an aside, I wonder if we can take the docs-revamp as an opportunity to use uv venv here.

Comment on lines +106 to +111
In the `dagster-llm-fine-tune` root directory, there are three configuration files that are common in Python package management. These files manage dependencies and identify the Dagster modules in the project.
| File | Purpose |
|------|---------|
| pyproject.toml | This file is used to specify build system requirements and package metadata for Python projects. It is part of the Python packaging ecosystem. |
| setup.cfg | This file is used for configuration of your Python package. It can include metadata about the package, dependencies, and other configuration options. |
| setup.py | This script is used to build and distribute your Python package. It is a standard file in Python projects for specifying package details. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neverett can provide more feedback here, but this general information might be better suited for another section of the docs, and then referenced across the tutorials.



@dg.asset(
kinds={"OpenAI"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kinds={"OpenAI"},
kinds={"openai"},



@dg.asset(
kinds={"OpenAI"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kinds={"OpenAI"},
kinds={"openai"},

query = f"""
select * from enriched_graphic_novels
"""
with duckdb_resource.get_connection() as conn:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you give additional_deps a shot?

https://docs.dagster.io/_apidocs/asset-checks#asset-checks

Assets that are upstream dependencies, but do not correspond to a parameter of the decorated function. These dependencies will apply to the underlying op that executes the check. These should not include the asset parameter, which is always included as a dependency.

"""
with open(file_name, "r") as training_file:
for line in training_file:
yield eval(line.strip())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eval from a file makes me nervous, and has security implications.

if we do go forward with it, we might need to add a warning.

Comment on lines +26 to +27
Returns:
Generator: transcript for podcast if present
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect return description

@dehume
Copy link
Contributor Author

dehume commented Jan 9, 2025

I would like to know if we can have a more general prompt than is_manga, for example assigning into a predefined set of categories, but would like to know more if there was a limitation that you were running into.

So the main problem I was having was 2 fold.

  • Since the shelves are free form there isn't a set number of predefined categories we can use
  • If we limit it to the top shelves, most are things like "to read" or "own" which are somewhat meaningless. You can get more categories as you go down like "fantasy" or "horror" but the coverage is much lower and then risks most things not being classified

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants