
Draft document classifier #8

Open
wants to merge 2 commits into main
Conversation

vitawasalreadytaken
Contributor

I've placed two somewhat unrelated things into this branch because they both depend on #6 and on some research-y code from our old, exploratory repository.

  • research/consultation_topics/label_exploration.ipynb takes a look at the consultation topic labels we have.
  • research/document_types/VM_draft_classifier.ipynb is a bare-bones document type classifier, trained on Fedlex data only, making binary "is a DRAFT" vs. "is not a DRAFT" predictions.

We can split these up again if this branch stops making sense.

Handing this over to you now, @orieger 🙂

@vitawasalreadytaken vitawasalreadytaken changed the base branch from main to feature/data-pipeline November 27, 2024 10:23
Base automatically changed from feature/data-pipeline to main December 2, 2024 13:49
@vitawasalreadytaken
Contributor Author

This is now rebased on main where #6 has been merged.

@vitawasalreadytaken
Contributor Author

All dependencies, such as the research libraries and code, are now merged into main (via #14), so I have rebased this PR on main. Everything that exists in main now conforms to our linter rules, so this branch should also be made compliant before we can merge it.

print(f"Number of dropped empty texts: {empty_count} ({100 * empty_count / len(df_input):.1f}%)")
return df.loc[~empty_index]

def create_embeddings(column_to_embed: str, cache_directory: pathlib.Path = REPOSITORY_ROOT / "data" / "embeddings-cache"):
Contributor Author

This function relies on global variables (df_input and embedding_model). Instead of using global state, let's pass the necessary data in as arguments. Something like this:

Suggested change
def create_embeddings(column_to_embed: str, cache_directory: pathlib.Path = REPOSITORY_ROOT / "data" / "embeddings-cache"):
def create_embeddings(strings_to_embed: pd.Series, embedding_model: embeddings.EmbeddingModel, cache_directory: pathlib.Path = REPOSITORY_ROOT / "data" / "embeddings-cache") -> np.ndarray:
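
For illustration, a minimal sketch of what the refactored function could look like with that signature. It assumes the module-level imports already present in the script (pathlib, numpy as np, pandas as pd, research.lib.embeddings) and REPOSITORY_ROOT; embedding_model.embed() is a placeholder for whatever the real EmbeddingModel API provides, and the caching logic is elided:

```python
def create_embeddings(
    strings_to_embed: pd.Series,
    embedding_model: embeddings.EmbeddingModel,
    cache_directory: pathlib.Path = REPOSITORY_ROOT / "data" / "embeddings-cache",
) -> np.ndarray:
    cache_directory.mkdir(parents=True, exist_ok=True)
    # All inputs come from the caller; no global state is read or written.
    vectors = [embedding_model.embed(text) for text in strings_to_embed]
    return np.asarray(vectors)
```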

print(embeddings_doc_content_plain.shape)
return embeddings_doc_content_plain

def save_model(model_file_name):
Contributor Author

This is similar - let's pass classifier in as an argument.
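
For example, a rough sketch of the parameterized version, assuming the model is a scikit-learn-style estimator serialized with joblib (the notebook's actual serialization may differ):

```python
import pathlib

import joblib


def save_model(classifier, model_file_name: str) -> pathlib.Path:
    # The classifier is passed in explicitly instead of being read from a global.
    output_path = pathlib.Path(model_file_name)
    joblib.dump(classifier, output_path)
    return output_path
```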

from research.lib import data_access, embeddings

REPOSITORY_ROOT = (pathlib.Path().cwd() / ".." / "..").resolve()
sys.path.append(str(REPOSITORY_ROOT))
Contributor Author

Is this necessary, given that the imports above (from research.lib ...) apparently work?

Contributor Author

Note: I had to run the script like this:

cd research/document_types
PYTHONPATH=../../ uv run draft_classification.py

so that a) the research.lib imports on line 22 work and b) the relative path to draft_classification.toml works.

REPOSITORY_ROOT = (pathlib.Path().cwd() / ".." / "..").resolve()
sys.path.append(str(REPOSITORY_ROOT))

dotenv.load_dotenv()
Contributor Author

Side effects such as this should be in the if __name__ == "__main__": block, otherwise they're triggered even when this module is imported by something else.
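
Roughly like this (a sketch only; main() and its contents are placeholders):

```python
import dotenv


def main() -> None:
    # Runs only when the file is executed as a script, not when it is imported.
    dotenv.load_dotenv()
    # ... rest of the script ...


if __name__ == "__main__":
    main()
```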

with open('draft_classification.toml', 'r') as f:
    config = toml.load(f)

os.environ["MLFLOW_TRACKING_URI"] = config["tracking"]["tracking_uri"]
Contributor Author

This is also a side effect that should not be triggered at import time. In addition, we can use https://mlflow.org/docs/1.22.0/python_api/mlflow.html#mlflow.set_tracking_uri instead of manipulating environment variables. (I know it was like that in the original notebook but we can improve things in this new script 🙂)
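
Something like this, for example (a sketch; the config keys are the ones from the current script):

```python
import mlflow
import toml


def main() -> None:
    config = toml.load("draft_classification.toml")
    # Configure MLflow through its API instead of mutating os.environ at import time.
    mlflow.set_tracking_uri(config["tracking"]["tracking_uri"])


if __name__ == "__main__":
    main()
```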

Comment on lines 29 to 30
with open('draft_classification.toml', 'r') as f:
    config = toml.load(f)
Contributor Author

Suggested change
with open('draft_classification.toml', 'r') as f:
    config = toml.load(f)
config = toml.load("draft_classification.toml")

However, it's also a side effect that would preferably happen in if __name__ == "__main__"...

### Preprocessing ###
df_input = remove_rows_with_missing_text(df_input)
# set target variable
df_input.loc[:, "is_draft"] = (df_input.loc[:, "document_type"] == "DRAFT").astype(bool)
Contributor Author

Oddly, I'm getting this warning on this line. I don't know why.

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
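
One likely cause (an assumption, I haven't verified it against this code): remove_rows_with_missing_text returns df.loc[~empty_index], which pandas can treat as a view of df_input, so the assignment above writes to a potential copy of a slice. A minimal, self-contained illustration of the pattern and the usual fix, with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"document_type": ["DRAFT", "LETTER"], "text": ["a", None]})

# A slice like this may be a view of df, so assigning to it can raise
# SettingWithCopyWarning:
subset = df.loc[df["text"].notna()]

# Taking an explicit copy gives an independent frame and silences the warning:
subset = df.loc[df["text"].notna()].copy()
subset.loc[:, "is_draft"] = (subset["document_type"] == "DRAFT").astype(bool)
```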

vitawasalreadytaken pushed a commit that referenced this pull request Dec 13, 2024
@vitawasalreadytaken vitawasalreadytaken changed the title Topic label exploration & draft document classifier Draft document classifier Jan 7, 2025
Development

Successfully merging this pull request may close these issues.

Build a binary document type classifier: "DRAFT vs. not-a-DRAFT"
2 participants