save and use pre-computed embeddings, replace tf_idf vectorizer method #16
The LeetTopic class now accepts an "embeddings" parameter for passing in pre-computed embeddings.
If no embeddings are passed, the default encoding process runs as before.
Purpose: significantly improve the performance of LeetTopic, since encoding the documents is the most time-consuming step.
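The fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `resolve_embeddings`, `encode`, and `toy_encode` are hypothetical names, and the toy encoder stands in for a real sentence encoder.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def resolve_embeddings(
    documents: Sequence[str],
    encode: Callable[[Sequence[str]], List[Vector]],
    embeddings: Optional[List[Vector]] = None,
) -> List[Vector]:
    """Return the caller's embeddings if given, otherwise encode the documents."""
    if embeddings is None:
        # Default path: run the (slow) encoding step.
        return encode(documents)
    if len(embeddings) != len(documents):
        # Pre-computed embeddings must line up one-to-one with the documents.
        raise ValueError(
            f"got {len(embeddings)} embeddings for {len(documents)} documents"
        )
    return embeddings


def toy_encode(documents: Sequence[str]) -> List[Vector]:
    # Hypothetical stand-in for a real encoder: [character count, word count].
    return [[float(len(d)), float(len(d.split()))] for d in documents]
```

The length check is an addition of this sketch; it catches embeddings that were computed for a different set of documents.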
Save the embeddings to a pickle file after computing them (this should perhaps be controlled by a parameter of the LeetTopic class, such as "save_embeddings=True").
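One way the save-and-reuse step could look, as a sketch under assumptions: `load_or_compute_embeddings`, `encode`, and `cache_path` are hypothetical names, and only `save_embeddings` comes from the suggestion above. It uses the standard-library `pickle` module.

```python
import pickle
from pathlib import Path


def load_or_compute_embeddings(documents, encode, cache_path, save_embeddings=True):
    """Load embeddings from cache_path if it exists, else compute (and optionally save) them."""
    path = Path(cache_path)
    if path.exists():
        # Only unpickle files you created yourself: pickle can execute code on load.
        with path.open("rb") as fh:
            return pickle.load(fh)

    embeddings = encode(documents)
    if save_embeddings:
        with path.open("wb") as fh:
            pickle.dump(embeddings, fh)
    return embeddings
```

This sketch does not check whether the cached file matches the current documents, so the cache should be deleted whenever the corpus or the encoder changes.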
The "get_feature_names()" method of tfidf_vectorizer (scikit-learn) was deprecated in scikit-learn 1.0 and removed in 1.2, so it is replaced by "get_feature_names_out()".
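If older scikit-learn versions still need to be supported, a small compatibility helper is one option. This is a sketch only: `get_feature_names_compat` is a hypothetical helper, not part of this PR, which simply switches to the new method.

```python
def get_feature_names_compat(vectorizer):
    """Return the vectorizer's feature names as a list, on old or new scikit-learn."""
    if hasattr(vectorizer, "get_feature_names_out"):
        # Available since scikit-learn 1.0; returns an array of strings.
        return list(vectorizer.get_feature_names_out())
    # Older releases only provide the now-removed method.
    return list(vectorizer.get_feature_names())
```

The helper accepts any fitted vectorizer object that exposes one of the two methods.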
Removed a duplicate import line for SentenceTransformer.
Usage of pre-computed embeddings: