Relates to ptypes-nlesc/ptypes-topic-modeling#14
Text Cleaning
- If the `bigrams` option is enabled, generate bigrams from the list of words and add them to the list.

Text Vectorization
The text vectorization process has been updated with the following changes:
- `CountVectorizer` is used to transform the cleaned titles into a frequency matrix.
- The `max_df` parameter is set to 0.9, discarding words that appear in more than 90% of the titles.
- The `min_df` parameter is set to 25, discarding words that appear in fewer than 25 titles.
- The `token_pattern` parameter is set to `'\w+|$[\d.]+|\S+'`, which matches words, dollar amounts, and non-whitespace sequences.
- The frequency matrix is converted to an array and the feature names are retrieved using the `get_feature_names_out` method.