New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

LLM fine-tuning tutorial #26940

Draft

dehume wants to merge 1 commit into master from dennis/doc-654-finetuning-llm-content

+684 −0

Contributor

dehume commented Jan 8, 2025

Summary & Motivation

Create a tutorial for fine tuning an LLM with OpenAI and DuckDB

How I Tested These Changes

Changelog

Insert changelog entry or delete this section.


          LLM fine-tuning init

f22c531

Contributor Author

dehume commented Jan 8, 2025

@cmpadden Putting up the code first if you want to review that. Still discussing how we want to structure tutorial content

Contributor

danielgafni commented Jan 8, 2025

Just passing by: consider mentioning dynamic partitions to represent separate model checkpoints (experiments).

I understand it will complicate the tutorial but training & using multiple model checkpoints at a time is what ML users tend to do a lot in practice.

They'll probably need a ways for these checkpoints to a least co-exist in Dagster.

dehume commented

View reviewed changes

examples/project_llm_fine_tune/project_llm_fine_fune/assets.py

+                  query = f"""
+                      select * from enriched_graphic_novels
+                  """
+                  with duckdb_resource.get_connection() as conn:

Contributor Author

dehume Jan 8, 2025

One thing I couldn't figure out with an asset check, if it's possible to use the output of an asset that is not part of the check. Having to reread with DuckDB but if you could just take in enriched_graphic_novels it would save some lines.

Contributor

cmpadden Jan 9, 2025

Did you give additional_deps a shot?

https://docs.dagster.io/_apidocs/asset-checks#asset-checks

Assets that are upstream dependencies, but do not correspond to a parameter of the decorated function. These dependencies will apply to the underlying op that executes the check. These should not include the asset parameter, which is always included as a dependency.

cmpadden requested changes

View reviewed changes

Contributor

cmpadden left a comment

Overall, I think the implementation and demo are great.

I would like to know if we can have a more general prompt than is_manga, for example assigning into a predefined set of categories, but would like to know more if there was a limitation that you were running into.

Overall awesome work, thanks.

docs/docs-beta/docs/tutorials/category-two/llm-fine-tuning/index.md

    
              # Fine-Tune an LLM

              In this tutorial, you'll build an pipeline with Dagster that:

Contributor

cmpadden Jan 9, 2025

Suggested change

      
            In this tutorial, you'll build an pipeline with Dagster that:
          
            In this tutorial, you'll build a pipeline with Dagster that:

docs/docs-beta/docs/tutorials/category-two/llm-fine-tuning/index.md

    
              In this tutorial, you'll build an pipeline with Dagster that:

              - Imports data from Goodreads DuckDB

Contributor

cmpadden Jan 9, 2025

Suggested change

      
            - Imports data from Goodreads DuckDB
          
            - Loads a public Goodreads JSON dataset into DuckDB

docs/docs-beta/docs/tutorials/category-two/llm-fine-tuning/index.md

    
              - Imports data from Goodreads DuckDB

              - Does feature engineering to enhance the data

              - Creates and validates a training and validation file

Contributor

cmpadden Jan 9, 2025

@dehume I'll likely know what this means once I go through the code, but at this point I'm not entirely sure what this means.

docs/docs-beta/docs/tutorials/category-two/llm-fine-tuning/index.md

Comment on lines +54 to +55

    
                 python -m venv dagster_tutorial

                 source dagster_tutorial/bin/activate

Contributor

cmpadden Jan 9, 2025

As an aside, I wonder if we can take the docs-revamp as an opportunity to use uv venv here.

docs/docs-beta/docs/tutorials/category-two/llm-fine-tuning/index.md

Comment on lines +106 to +111

    
              In the `dagster-llm-fine-tune` root directory, there are three configuration files that are common in Python package management. These files manage dependencies and identify the Dagster modules in the project.

              | File | Purpose |

              |------|---------|

              | pyproject.toml | This file is used to specify build system requirements and package metadata for Python projects. It is part of the Python packaging ecosystem. |

              | setup.cfg | This file is used for configuration of your Python package. It can include metadata about the package, dependencies, and other configuration options. |

              | setup.py | This script is used to build and distribute your Python package. It is a standard file in Python projects for specifying package details. |

Contributor

cmpadden Jan 9, 2025

@neverett can provide more feedback here, but this general information might be better suited for another section of the docs, and then referenced across the tutorials.

examples/project_llm_fine_tune/project_llm_fine_fune/assets.py



		@dg.asset(
		kinds={"OpenAI"},

Contributor

cmpadden Jan 9, 2025

Suggested change

      
                kinds={"OpenAI"},
          
                kinds={"openai"},

examples/project_llm_fine_tune/project_llm_fine_fune/assets.py



		@dg.asset(
		kinds={"OpenAI"},

Contributor

cmpadden Jan 9, 2025

Suggested change

      
                kinds={"OpenAI"},
          
                kinds={"openai"},

examples/project_llm_fine_tune/project_llm_fine_fune/assets.py

+                  query = f"""
+                      select * from enriched_graphic_novels
+                  """
+                  with duckdb_resource.get_connection() as conn:

Contributor

cmpadden Jan 9, 2025

Did you give additional_deps a shot?

https://docs.dagster.io/_apidocs/asset-checks#asset-checks

Assets that are upstream dependencies, but do not correspond to a parameter of the decorated function. These dependencies will apply to the underlying op that executes the check. These should not include the asset parameter, which is always included as a dependency.

examples/project_llm_fine_tune/project_llm_fine_fune/utils/__init__.py

    
                  """

                  with open(file_name, "r") as training_file:

                      for line in training_file:

                          yield eval(line.strip())

Contributor

cmpadden Jan 9, 2025

eval from a file makes me nervous, and has security implications.

if we do go forward with it, we might need to add a warning.

examples/project_llm_fine_tune/project_llm_fine_fune/utils/__init__.py

Comment on lines +26 to +27

    
                  Returns:

                      Generator: transcript for podcast if present

Contributor

cmpadden Jan 9, 2025

Incorrect return description

Contributor Author

dehume commented Jan 9, 2025

I would like to know if we can have a more general prompt than is_manga, for example assigning into a predefined set of categories, but would like to know more if there was a limitation that you were running into.

So the main problem I was having was 2 fold.

Since the shelves are free form there isn't a set number of predefined categories we can use
If we limit it to the top shelves, most are things like "to read" or "own" which are somewhat meaningless. You can get more categories as you go down like "fantasy" or "horror" but the coverage is much lower and then risks most things not being classified

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet