Add title topic modelling #4

Merged
merged 2 commits into from
Mar 20, 2024

lyashevska
Contributor

Relates to ptypes-nlesc/ptypes-topic-modeling#14

Text Cleaning

  1. Split the title into individual words.
  2. Remove stopwords from the list of words.
  3. Apply a word rooter (stemming) function to each word, skipping any word that contains a '#'.
  4. If the bigrams option is enabled, generate bigrams from the list of words and add them to the list.
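The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `STOPWORDS` set, the naive `word_rooter` suffix stripper, the lowercasing, and the underscore used to join bigrams are all assumptions made so the example runs on its own. A real implementation would more likely use a proper stopword list and stemmer.

```python
# Hypothetical stand-ins: the PR description does not show the stopword
# list or the rooter, so both are simplified placeholders.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "with"}


def word_rooter(word):
    """Naive suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_title(title, bigrams=False):
    # 1. Split the title into individual words (lowercasing is an assumption).
    words = title.lower().split()
    # 2. Remove stopwords from the list of words.
    words = [w for w in words if w not in STOPWORDS]
    # 3. Root each word, unless the word contains a '#'.
    words = [w if "#" in w else word_rooter(w) for w in words]
    # 4. Optionally build bigrams from adjacent words and append them.
    if bigrams:
        words = words + [f"{a}_{b}" for a, b in zip(words, words[1:])]
    # Join back into one string so a vectorizer can consume it.
    return " ".join(words)


print(clean_title("The making of #maps and charts", bigrams=True))
```

Returning a single space-joined string is a design choice for this sketch: it keeps the output directly usable as input to a text vectorizer.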

Text Vectorization

The text vectorization process has been updated with the following changes:

  1. A CountVectorizer is used to transform the cleaned titles into a frequency matrix.
  2. The max_df parameter is set to 0.9, discarding words that appear in more than 90% of the titles.
  3. The min_df parameter is set to 25, discarding words that appear in fewer than 25 titles.
  4. The token_pattern parameter is set to '\w+|\$[\d.]+|\S+', which matches words, dollar amounts, and any other non-whitespace sequences.

The frequency matrix is converted to an array and the feature names are retrieved using the get_feature_names_out method.

@lyashevska lyashevska marked this pull request as ready for review March 20, 2024 14:07
@lyashevska lyashevska merged commit 19b736b into main Mar 20, 2024
@lyashevska lyashevska deleted the title-topic-modelling branch March 20, 2024 14:07