Add title topic modelling #4

lyashevska · 2024-03-13T15:43:47Z

Split the title into individual words.
Remove stopwords from the list of words.
Apply a word rooter function to each word, unless the word contains a '#'.
If the bigrams option is enabled, generate bigrams from the list of words and add them to the list.
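
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `STOPWORDS` set, the toy `word_rooter`, the lowercasing, the `_` bigram separator, and the choice to return a space-joined string are all assumptions made for the example.

```python
# Hypothetical stand-ins: the PR does not show its stopword list or rooter.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to"}


def word_rooter(word):
    """Toy rooter: strip a trailing plural 's' (stand-in for a real stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean_title(title, bigrams=False):
    # 1. Split the title into individual words (lowercasing is an assumption).
    words = title.lower().split()
    # 2. Remove stopwords.
    words = [w for w in words if w not in STOPWORDS]
    # 3. Root every word, except words that contain a '#'.
    words = [w if "#" in w else word_rooter(w) for w in words]
    # 4. Optionally append bigrams built from adjacent words.
    if bigrams:
        words = words + [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return " ".join(words)


cleaned = clean_title("Topics of the #papers and models", bigrams=True)
print(cleaned)
```

Words containing `#` pass through unrooted, so hashtag-like tokens keep their original form.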

The text vectorization process has been updated with the following changes:

A CountVectorizer is used to transform the cleaned titles into a frequency matrix.
The max_df parameter is set to 0.9, discarding words that appear in more than 90% of the titles.
The min_df parameter is set to 25, discarding words that appear in fewer than 25 titles.
The token_pattern parameter is set to '\w+|\$[\d.]+|\S+', which matches words, dollar amounts, and non-whitespace sequences.
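
As a rough sketch of this configuration, the vectorizer could be constructed as below. It assumes the dollar sign in the pattern is escaped (`\$`), since an unescaped `$` would act as an end-of-string anchor rather than match dollar amounts; the sample string is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Parameters as listed in the description; the escaped "\$" is an assumption.
vectorizer = CountVectorizer(
    max_df=0.9,   # drop terms found in more than 90% of titles
    min_df=25,    # drop terms found in fewer than 25 titles
    token_pattern=r"\w+|\$[\d.]+|\S+",
)

# build_tokenizer() returns the function that applies token_pattern,
# which lets us inspect the tokenization without fitting anything.
tokenize = vectorizer.build_tokenizer()
tokens = tokenize("cost $9.99 #nlp")
print(tokens)
```

The alternatives are tried left to right at each position, so `\w+` handles ordinary words, `\$[\d.]+` picks up amounts such as `$9.99`, and `\S+` catches anything else, for example tokens starting with `#`.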

The frequency matrix is converted to an array and the feature names are retrieved using the get_feature_names_out method.

Olga Lyashevska added 2 commits March 13, 2024 16:38

Add title topic modelling

d24e46f

Fit LDA on titles

1a84336

lyashevska marked this pull request as ready for review March 20, 2024 14:07

lyashevska merged commit 19b736b into main Mar 20, 2024

lyashevska deleted the title-topic-modelling branch March 20, 2024 14:07
