Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

save and use pre-computed embeddings, replace tf_idf vectorizer method #16

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

EquinetPaul
Copy link

@EquinetPaul EquinetPaul commented Feb 17, 2023

  1. The LeetTopic class can take a parameter "embeddings" that can be used if you want to use a pre-computed embedding.
    If no embedding is passed as a parameter, the default encoding process is followed.
    Purpose: significantly improve the performance of LeetTopic, knowing that the document encoding is the most time-consuming part

  2. Save the embeddings to a pickle object file after calculating them (this functionality should maybe be passed as a parameter of the LeetTopic class like "save_embeddings=True")

  3. The "get_feature_names()" method of tfidf_vectorizer (scikit-learn) is deprecated (scikit-learn>=1.2) and should be replaced by "get_feature_names_out()".

  4. Simply removed an import line that appeared twice for SentenceTransformer.

Usage of pre-computed embeddings:

# Load embeddings
with open("embeddings.pickle", "rb") as fichier:
    precomputed_embeddings = pickle.load(fichier)

# LeetTopic
leet_df, topic_data = leet_topic.LeetTopic(df,
                                          document_field="descriptions",
                                          html_filename="demo.html",
                                          spacy_model="fr_core_news_md",
                                          embeddings = precomputed_embeddings )

remove duplicate importation of SentenceTransformer
replace "get_feature_names()" method by "get_feature_names_out()" for scikit-learn>=1.2
add a parameter to LeetTopic class to take pre-computed embeddings to accelerate general processing
Save the embeddings to a pickle object file after calculating them
@joelsjlee
Copy link
Collaborator

Hi Paul,
Thank you for the PR! Indeed I think being able to save and load the embeddings may be a useful part of this application. A couple things I'm thinking about, and I would also like to get @wjbmattingly's thoughts on this:

  1. I wonder if np.save and np.load would be easier here rather than pickle. I think that np.save defaults to using pickle anyways and the embeddings are numpy arrays. Maybe there is a performance difference?
  2. If we do this in either implementation, we should also probably put a parameter so that the user can name the embeddings file.
  3. The newer scikit learn function name will be updated soon! Thanks for the reminder.

@EquinetPaul
Copy link
Author

EquinetPaul commented Feb 17, 2023

Hi,
Oh yes, you are right about using numpy save/load since the output of the embedding is of type "numpy.ndarray".

# Save 
np.save(save_embeddings_file_name , doc_embeddings)

# Load 
doc_embeddings= np.load(embeddings_file_name)

Yes, the parameters to consider if we want to make the embedding parameterizable could be:

  1. if the embedding is passed as a parameter as a numpy.ndarray.
def LeetTopic(df: pd.DataFrame,
            ...
            embeddings = None,
            save_embeddings_file_name = "embeddings.save",
            ...
            ):

or
2. If the filename of the embedding is passed as a parameter (and then it needs to load it)

def LeetTopic(df: pd.DataFrame,
            ...
            embeddings_file_name = None,
            save_embeddings_file_name = "embeddings.save",
            ...
            ):

In any case:

  • The embedding is calculated if the variable embeddings:numpy.ndarray or embeddings_file_name:str is not passed as a parameter.
  • The embedding is saved to a file if the variable save_embeddings_file_name:str is passed as a parameter.

Up to you :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants