Corpora

Siamak Barzegar edited this page Jul 14, 2017 · 7 revisions

Each pre-built model was built from one of the following text corpora.

googlenews300neg - The Google News Corpus

Available only in English, this corpus was offered by Google. For more information, please check the official page.

Preprocess

Preprocessing identified multi-word expressions and joined each one into a single token with underscores (Barack Obama -> Barack_Obama).

  • remove stop-words: false
  • lowercase: false
  • minimum word size: 1
  • apply stem: false
  • apply lemma: false
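
The multi-word joining described above can be sketched as follows. This is only an illustration under stated assumptions: the function name join_phrases and the hand-written phrase list are hypothetical, and the sketch does not reproduce how the expressions were actually detected in the Google News corpus. It shows the token transformation only, with case preserved to match "lowercase: false".

```python
from typing import List, Sequence, Set, Tuple


def join_phrases(tokens: Sequence[str], phrases: Set[Tuple[str, ...]]) -> List[str]:
    """Join known multi-word expressions into single underscore tokens.

    `phrases` is a hypothetical, pre-computed set of expressions; at each
    position the longest matching expression wins.
    """
    max_len = max((len(p) for p in phrases), default=1)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word expressions.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                out.append("_".join(candidate))
                i += n
                break
        else:
            # No expression starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


# Illustrative phrase list (an assumption, not taken from the corpus).
EXAMPLE_PHRASES = {("Barack", "Obama"), ("New", "York"), ("New", "York", "City")}
```

For instance, calling join_phrases on the tokens of "Barack Obama visited New York City" with EXAMPLE_PHRASES keeps "visited" as is and merges the two names into one token each.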

wiki-2014 / wiki-2016 - Wikipedia 2014/2016

Available for all languages, this corpus is a Wikipedia dump: the 2014 dump for every language except JA and KO, and the 2016 dump for JA and KO.

Preprocess

During the preprocessing step, we ignored redirect and disambiguation pages.
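
A page filter of this kind might look like the sketch below. It is an assumption-laden illustration, not the filter that was actually used: the function name is_content_page is hypothetical, the redirect check relies on the "#REDIRECT" marker of English wikitext, and the disambiguation check looks only for a few common English template names. Other language editions use localized markers that this sketch does not cover.

```python
import re

# Assumption: common English disambiguation template names only.
_DISAMBIG_RE = re.compile(r"\{\{\s*(disambiguation|disambig|dab)\b", re.IGNORECASE)


def is_content_page(wikitext: str) -> bool:
    """Return False for redirect and disambiguation pages (English heuristics)."""
    if wikitext.lstrip().upper().startswith("#REDIRECT"):
        return False
    if _DISAMBIG_RE.search(wikitext):
        return False
    return True
```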

  • remove stop-words: true
  • lowercase: true
  • minimum word size: 3
  • apply stem: true
  • apply lemma: false
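
The settings listed above can be read as a small token pipeline. The sketch below is a minimal illustration under several assumptions: the names PreprocessConfig, preprocess, STOP_WORDS and toy_stem are hypothetical, the stop-word list is a tiny sample, the order of the steps is a guess, and toy_stem is a crude suffix stripper standing in for whichever stemmer was really used.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class PreprocessConfig:
    remove_stop_words: bool
    lowercase: bool
    min_word_size: int
    stemmer: Optional[Callable[[str], str]] = None  # None means "apply stem: false"


# Tiny illustrative stop-word list (an assumption, not the real one).
STOP_WORDS = frozenset({"the", "a", "an", "of", "in", "is", "and", "to"})


def toy_stem(word: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tokens: Iterable[str], config: PreprocessConfig) -> List[str]:
    out: List[str] = []
    for tok in tokens:
        if config.lowercase:
            tok = tok.lower()
        if config.remove_stop_words and tok.lower() in STOP_WORDS:
            continue
        # Assumed order: the length filter runs before stemming.
        if len(tok) < config.min_word_size:
            continue
        if config.stemmer is not None:
            tok = config.stemmer(tok)
        out.append(tok)
    return out


# The wiki-2014 / wiki-2016 settings from the list above.
WIKI_CONFIG = PreprocessConfig(
    remove_stop_words=True, lowercase=True, min_word_size=3, stemmer=toy_stem
)
```

Keeping the stemmer as a pluggable callable means a real stemmer could be swapped in without touching the rest of the pipeline.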

cn-169

Available only in English, this corpus was offered by:
