Topic Modeling, Text Analysis #3

Open
gleporeNARA opened this issue Mar 20, 2017 · 0 comments
gleporeNARA commented Mar 20, 2017

Mallet:
http://mallet.cs.umass.edu/index.php

"MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text."

See also:
http://programminghistorian.org/lessons/topic-modeling-and-mallet

Use the standalone Java version at:
https://github.com/senderle/topic-modeling-tool

The output below was generated with the number of topics set to 15 and shows the top 5 keywords for each topic. The corpus was a collection of 450 Civil War obituaries and 50 running race reports (two very different categories of data).

#################OUTPUT#############
List of Topics

  1. Obituaries Russell County Death Date
  2. Alderson man Fields county Captain
  3. years Mr Rev home church
  4. Image Lebanon VA News 11
  5. Jones grandchildren Lewis great Browning
  6. mile race run Dwight Race
  7. Jackson Duff Bundy Steele left
  8. Kiser Ball Hurt Hendricks Norton
  9. Honaker Fogleman ago Love Buckles
  10. death good God friends friend
  11. Litton Vicars Vermillion Gap Va
  12. Bausell Bays Hill Camp Webb
  13. train Porter head man struck
  14. race miles time finish running
  15. Obituaries Russell County Obituary Soldiers
##########################
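A listing in this shape can be read into a lookup table for further processing. The following is a minimal sketch, assuming the output is available as plain text with one `N. keyword keyword ...` line per topic, as shown above; the `parse_topics` helper and the shortened `LISTING` sample are illustrative only and are not part of MALLET or the topic modeling tool.

```python
import re

# A few lines copied from the output above; a real run would have all 15.
LISTING = """\
  1. Obituaries Russell County Death Date
  2. Alderson man Fields county Captain
  6. mile race run Dwight Race
  14. race miles time finish running
"""


def parse_topics(text):
    """Return {topic_number: [keyword, ...]} from 'N. w1 w2 ...' lines."""
    topics = {}
    for line in text.splitlines():
        # Leading spaces, a number, a period, then the keywords.
        match = re.match(r"\s*(\d+)\.\s+(.*\S)\s*$", line)
        if match:
            topics[int(match.group(1))] = match.group(2).split()
    return topics


topics = parse_topics(LISTING)
for number in sorted(topics):
    print(number, topics[number])
```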
Topics 6 and 14 clearly correspond to the 50 running race reports, and the other topics neatly highlight various aspects of the Civil War obituaries. We can massage this data by replacing the topic numbers with user-chosen categories:

List of Topics

Obituaries - Obituaries Russell County Death Date
Names, locations, and ranks - Alderson man Fields county Captain
Religious - years Mr Rev home church
Newspaper images - Image Lebanon VA News 11
Family - Jones grandchildren Lewis great Browning
Running and people - mile race run Dwight Race
Family names - Jackson Duff Bundy Steele left
Family names - Kiser Ball Hurt Hendricks Norton
Names and locations - Honaker Fogleman ago Love Buckles
Religious - death good God friends friend
Names and locations - Litton Vicars Vermillion Gap Va
Names and locations - Bausell Bays Hill Camp Webb
Death and names - train Porter head man struck
Running - race miles time finish running
Obituaries - Obituaries Russell County Obituary Soldiers
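The relabeling above was done by hand, but the same substitution can be scripted once the labels are chosen. This is a rough sketch, not something the topic modeling tool provides: the keywords and labels are copied from the lists above, and the `label_topics` function name is made up for illustration.

```python
# Top keywords per topic, copied from the output above.
TOPIC_KEYWORDS = {
    1: "Obituaries Russell County Death Date",
    2: "Alderson man Fields county Captain",
    3: "years Mr Rev home church",
    4: "Image Lebanon VA News 11",
    5: "Jones grandchildren Lewis great Browning",
    6: "mile race run Dwight Race",
    7: "Jackson Duff Bundy Steele left",
    8: "Kiser Ball Hurt Hendricks Norton",
    9: "Honaker Fogleman ago Love Buckles",
    10: "death good God friends friend",
    11: "Litton Vicars Vermillion Gap Va",
    12: "Bausell Bays Hill Camp Webb",
    13: "train Porter head man struck",
    14: "race miles time finish running",
    15: "Obituaries Russell County Obituary Soldiers",
}

# User-chosen labels, one per topic number (a human judgment call).
TOPIC_LABELS = {
    1: "Obituaries",
    2: "Names, locations, and ranks",
    3: "Religious",
    4: "Newspaper images",
    5: "Family",
    6: "Running and people",
    7: "Family names",
    8: "Family names",
    9: "Names and locations",
    10: "Religious",
    11: "Names and locations",
    12: "Names and locations",
    13: "Death and names",
    14: "Running",
    15: "Obituaries",
}


def label_topics(keywords, labels):
    """Return 'label - keywords' lines, falling back to 'Topic N' if unlabeled."""
    lines = []
    for number in sorted(keywords):
        label = labels.get(number, "Topic {}".format(number))
        lines.append("{} - {}".format(label, keywords[number]))
    return lines


for line in label_topics(TOPIC_KEYWORDS, TOPIC_LABELS):
    print(line)
```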

From this we can further group the labeled topics into broader categories and simplify:

Category 1 - Obituaries, names, locations, ranks, religious, newspaper images, family, family names, death.

Category 2 - Running and people.

Thus we have gone from close to 500 documents to two categories in a few minutes.
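The collapse into two categories can be sketched in the same way. The snippet below assumes the hand-chosen labels from the list above and treats any label in a hypothetical `RUNNING_LABELS` set as Category 2 and everything else as Category 1. Note that it groups topics, not documents: assigning each of the documents to a category would need the per-document topic proportions, which are not shown in this issue.

```python
from collections import defaultdict

# Hand-chosen label for each topic number, copied from the list above.
TOPIC_LABELS = {
    1: "Obituaries",
    2: "Names, locations, and ranks",
    3: "Religious",
    4: "Newspaper images",
    5: "Family",
    6: "Running and people",
    7: "Family names",
    8: "Family names",
    9: "Names and locations",
    10: "Religious",
    11: "Names and locations",
    12: "Names and locations",
    13: "Death and names",
    14: "Running",
    15: "Obituaries",
}

# Labels that belong to the running category (an assumption for this sketch).
RUNNING_LABELS = {"Running", "Running and people"}


def group_by_category(labels):
    """Return {category_name: [topic_number, ...]} for the two categories."""
    groups = defaultdict(list)
    for number, label in sorted(labels.items()):
        category = "Category 2" if label in RUNNING_LABELS else "Category 1"
        groups[category].append(number)
    return dict(groups)


groups = group_by_category(TOPIC_LABELS)
for category in sorted(groups):
    print(category, groups[category])
```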

@kentdphan kentdphan self-assigned this Mar 21, 2017
@gleporeNARA gleporeNARA added this to the 2.0 milestone Apr 20, 2017