save and use pre-computed embeddings, replace tf_idf vectorizer method #16

EquinetPaul · 2023-02-17T13:03:05Z

The LeetTopic function accepts a new "embeddings" parameter for passing pre-computed embeddings.
If no embeddings are passed, the default encoding process runs.
Purpose: significantly reduce LeetTopic's run time, since encoding the documents is the most time-consuming step.
The embeddings are saved to a pickle file after they are computed (this should perhaps be controlled by a parameter of LeetTopic, such as "save_embeddings=True").
The "get_feature_names()" method of tfidf_vectorizer (scikit-learn) was deprecated in scikit-learn 1.0 and removed in 1.2; it is replaced by "get_feature_names_out()".
Removed an import line for SentenceTransformer that appeared twice.
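
For anyone who needs the vectorizer call to work on both old and new scikit-learn releases, a minimal sketch follows. The helper name `feature_names` is hypothetical and not part of LeetTopic or this PR; it prefers `get_feature_names_out()` and falls back to the legacy `get_feature_names()` only when the newer method is missing.

```python
def feature_names(vectorizer):
    """Return the vocabulary terms of a fitted vectorizer as a plain list.

    Hypothetical helper for illustration; not part of LeetTopic.
    """
    # Newer scikit-learn releases expose get_feature_names_out().
    getter = getattr(vectorizer, "get_feature_names_out", None)
    if getter is None:
        # Older releases only provide the legacy method.
        getter = vectorizer.get_feature_names
    return list(getter())
```

If only scikit-learn 1.2 or later has to be supported, calling `get_feature_names_out()` directly, as this PR does, is simpler.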

Usage of pre-computed embeddings:

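import pickle

from leet_topic import leet_topic  # import style assumed from the call below

# Load embeddings
with open("embeddings.pickle", "rb") as fichier:
    precomputed_embeddings = pickle.load(fichier)

# LeetTopic (df is an existing pandas DataFrame with a "descriptions" column)
leet_df, topic_data = leet_topic.LeetTopic(
    df,
    document_field="descriptions",
    html_filename="demo.html",
    spacy_model="fr_core_news_md",
    embeddings=precomputed_embeddings,
)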

joelsjlee · 2023-02-17T18:23:31Z

Hi Paul,
Thank you for the PR! Indeed, I think being able to save and load the embeddings may be a useful part of this application. A couple of things I'm thinking about, and I would also like to get @wjbmattingly's thoughts on this:

I wonder if np.save and np.load would be easier here than pickle. I think that np.save defaults to using pickle anyway, and the embeddings are numpy arrays. Maybe there is a performance difference?
If we do this in either implementation, we should probably also add a parameter so that the user can name the embeddings file.
The newer scikit-learn function name will be updated soon! Thanks for the reminder.
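
To make the np.save / np.load option concrete, here is a minimal round-trip sketch. The helper names `save_embeddings` and `load_embeddings` and the demo array are assumptions for illustration, not code from the PR. For plain numeric arrays np.save writes the binary .npy format (pickle only comes into play for object arrays), and it appends ".npy" to a file name that lacks it, so the sketch normalizes the name before saving.

```python
import os
import tempfile

import numpy as np


def save_embeddings(path, embeddings):
    """Write a numeric array in .npy format and return the path actually written."""
    # np.save appends ".npy" when the given name lacks that suffix,
    # so normalize the name up front to keep save and load in sync.
    if not path.endswith(".npy"):
        path = path + ".npy"
    np.save(path, embeddings)
    return path


def load_embeddings(path):
    """Read an array previously written by save_embeddings."""
    # Numeric arrays load without pickle support.
    return np.load(path, allow_pickle=False)


# Stand-in for real document embeddings: 4 documents, 3 dimensions.
demo = np.arange(12, dtype=np.float32).reshape(4, 3)
with tempfile.TemporaryDirectory() as tmp:
    written = save_embeddings(os.path.join(tmp, "embeddings"), demo)
    restored = load_embeddings(written)
```

Returning the normalized path lets the caller pass the same string to the load step later.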

EquinetPaul · 2023-02-17T18:51:05Z

Hi,
Oh yes, you are right about using numpy save/load since the output of the embedding is of type "numpy.ndarray".

import numpy as np

# Save (np.save appends ".npy" if the file name does not already end with it)
np.save(save_embeddings_file_name, doc_embeddings)

# Load
doc_embeddings = np.load(embeddings_file_name)

Yes, if we want to make the embeddings configurable, the parameters to consider could be:

1. If the embeddings are passed as a parameter, as a numpy.ndarray:

def LeetTopic(df: pd.DataFrame,
            ...
            embeddings = None,
            save_embeddings_file_name = "embeddings.save",
            ...
            ):

or
2. If the file name of the embeddings is passed as a parameter (LeetTopic then has to load the file):

def LeetTopic(df: pd.DataFrame,
            ...
            embeddings_file_name = None,
            save_embeddings_file_name = "embeddings.save",
            ...
            ):

In any case:

The embeddings are computed only if neither embeddings (numpy.ndarray) nor embeddings_file_name (str) is passed as a parameter.
The embeddings are saved to a file if save_embeddings_file_name (str) is passed as a parameter.
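
The two rules above could be sketched roughly as follows. This is an illustration only, not the code in the PR: `resolve_embeddings` and the `encode` callable are hypothetical names, both input options are accepted side by side for demonstration, and `save_embeddings_file_name` defaults to None here so that saving stays opt-in (unlike the "embeddings.save" default in the signatures above).

```python
import os
import tempfile

import numpy as np


def resolve_embeddings(docs, encode, embeddings=None,
                       embeddings_file_name=None,
                       save_embeddings_file_name=None):
    """Return document embeddings, computing them only when none are supplied."""
    if embeddings is not None:
        # Option 1: the caller passed an array directly.
        return np.asarray(embeddings)
    if embeddings_file_name is not None:
        # Option 2: the caller passed the name of a file to load.
        return np.load(embeddings_file_name)
    # Neither was given, so fall back to the encoding step.
    computed = np.asarray(encode(docs))
    if save_embeddings_file_name is not None:
        # Use a name ending in ".npy"; np.save appends that suffix otherwise.
        np.save(save_embeddings_file_name, computed)
    return computed


calls = []


def fake_encode(texts):
    # Stand-in for the real sentence encoder; records each time it runs.
    calls.append(len(texts))
    return [[float(len(t)), 1.0] for t in texts]


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "embeddings.npy")
    first = resolve_embeddings(["a", "bcd"], fake_encode,
                               save_embeddings_file_name=cache)
    second = resolve_embeddings(["a", "bcd"], fake_encode,
                                embeddings_file_name=cache)
```

In the second call the encoder is never invoked, which is where the time saving would come from.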

Up to you :)

EquinetPaul added 4 commits February 17, 2023 13:37

Update leet_topic.py

eb2eff9

remove duplicate importation of SentenceTransformer

Update leet_topic.py

ff8cc52

replace "get_feature_names()" method by "get_feature_names_out()" for scikit-learn>=1.2

Update leet_topic.py

2784c91

add a parameter to LeetTopic class to take pre-computed embeddings to accelerate general processing

Update leet_topic.py

02932be

Save the embeddings to a pickle object file after calculating them
