Pseudo Machine Learning News headline categorizer in Ruby

adamkovesdi/news-categorizer

Machine Learning news headline categorizer in Ruby

In this primitive machine learning experiment, a Ruby program categorizes news article headlines into one of the following categories:

  • Business
  • Science and technology
  • Entertainment
  • Health

Aggregated historical news headline datasets are used to initially train the 'brain' of the machine.

These datasets are parsed into dictionaries that hold one wordlist per category listed above. Each word has a frequency value associated with it: the number of times the word appeared in headlines of that category.
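The parsing step described above could look roughly like the sketch below. It is not the project's actual code: the `build_wordlists` helper, the `[title, category]` pair format, and the sample headlines are assumptions made for illustration, and reading the rows out of the CSV file is left out.

```ruby
# Hypothetical training rows: [headline, one-letter category].
# The real program reads these from uci-news-aggregator.csv.
SAMPLE_ROWS = [
  ['Fed raises interest rates', 'b'],
  ['New drug trial shows promise', 'm'],
  ['Drug maker shares fall', 'b']
].freeze

# Builds { category => { word => count } } from the training rows.
def build_wordlists(rows)
  # Unknown categories get a fresh wordlist whose counts default to 0.
  wordlists = Hash.new { |hash, category| hash[category] = Hash.new(0) }
  rows.each do |title, category|
    # Lowercase the headline and keep only runs of letters as words.
    title.downcase.scan(/[a-z]+/).each do |word|
      wordlists[category][word] += 1
    end
  end
  wordlists
end

wordlists = build_wordlists(SAMPLE_ROWS)
puts wordlists['b']['drug'] # 'drug' appears once in the business headlines
puts wordlists['m']['drug'] # and once in the health headlines
```

Keeping a separate hash per category means a lookup for one word in one category is a single hash access, which is all the categorization step needs.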

Wordlists are adjusted to exclude stop words.
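As a rough illustration of the stop word adjustment, the sketch below drops common words from a wordlist. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical; the list the project actually uses is not shown in this README.

```ruby
require 'set'

# Illustrative stop word list only; the project's real list may differ.
STOP_WORDS = Set.new(%w[a an and for in of on the to]).freeze

# Returns a copy of the wordlist without any stop word entries.
def remove_stop_words(wordlist)
  wordlist.reject { |word, _count| STOP_WORDS.include?(word) }
end

counts = { 'the' => 120, 'drug' => 14, 'for' => 95, 'trial' => 6 }
filtered = remove_stop_words(counts)
p filtered.keys # only 'drug' and 'trial' remain
```

Without this step, very frequent words such as "the" would dominate the frequency sums of every category and drown out the words that actually distinguish them.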

The algorithm cleans and tokenizes the given input, then sums the frequencies of the input's words in each category and reports the category with the highest sum as the most probable one.
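A minimal sketch of this scoring step follows. The `BRAIN` contents and the `tokenize`, `score`, and `categorize` helpers are assumptions for illustration, not the project's implementation. The category letters follow the sample output in the Usage section, where `b` and `m` are assumed to stand for business and health.

```ruby
# Hypothetical, tiny "brain": category letter => { word => frequency }.
BRAIN = {
  'b' => { 'drug' => 2, 'shares' => 9 },
  'm' => { 'drug' => 14, 'doctors' => 8, 'alzheimers' => 3 }
}.freeze

# Cleans the input: lowercase, strip everything but letters and whitespace.
def tokenize(headline)
  headline.downcase.gsub(/[^a-z\s]/, '').split
end

# Sums, per category, the stored frequency of every word in the headline.
def score(headline, brain)
  words = tokenize(headline)
  brain.transform_values do |wordlist|
    words.sum { |word| wordlist.fetch(word, 0) }
  end
end

# Picks the category whose frequency sum is the highest.
def categorize(headline, brain)
  score(headline, brain).max_by { |_category, total| total }.first
end

headline = 'Doctors found an effective drug for Alzheimers'
scores = score(headline, BRAIN)
puts categorize(headline, BRAIN) # prints "m" (2 for b, 8 + 14 + 3 = 25 for m)
```

Note that these sums are raw scores rather than normalized probabilities, so they are only comparable across categories for the same headline.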

Usage

  1. First download the uci-news-aggregator.csv dataset (102.9MB) from here
  2. Adjust the 'brain size' by changing DEBUGLIMIT in parsedata.rb, or set DEBUG = 0 to disable the limit completely
  3. Run program:
$ ruby main.rb /path/to/uci-news-aggregator.csv

Processing uci-news-aggregator.csv
Finished processing uci-news-aggregator.csv
Records processed: 20000

[...]

Brain initialization complete
Let me try to categorize your sentence (type quit or Ctrl+D to exit)
> Doctors found an effective drug for Alzheimers
Highest probability: m {"b"=>7, "t"=>27, "e"=>78, "m"=>338}
> 
  4. Enter news headlines on standard input to evaluate

Further steps of development

  • Training facility
  • Web aggregation
  • Human supervised brain training
