Corpora

Juliano Efson Sales edited this page May 24, 2018 · 7 revisions

Each pre-built model was built from one of the following text corpora.

googlenews - The Google News Corpus

Available only in English, this corpus was provided by Google. For more information, please check the official page.

Preprocess

The preprocessing step identified multi-word expressions and merged each one into a single token joined by underscores (Barack Obama -> Barack_Obama).

  • remove stop-words: false
  • lowercase: false
  • minimum word size: 1
  • apply stem: false
  • apply lemma: false
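
The settings above, together with the multi-word merging step, can be sketched as follows. This is a minimal illustration, not the code used to build the models: `PreprocessConfig`, `merge_phrases`, `preprocess`, and the hand-written phrase set are hypothetical names, and how the original pipeline detected multi-word expressions is not described on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreprocessConfig:
    """Hypothetical container mirroring the settings listed on this page."""
    remove_stop_words: bool = False
    lowercase: bool = False
    min_word_size: int = 1
    apply_stem: bool = False
    apply_lemma: bool = False


# Values listed for the googlenews corpus.
GOOGLENEWS = PreprocessConfig()


def merge_phrases(tokens, phrases):
    """Greedily replace the longest known phrase at each position
    with a single underscore-joined token."""
    max_len = max((len(p) for p in phrases), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                merged.append("_".join(candidate))
                i += n
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def preprocess(text, config, phrases):
    """Whitespace-tokenize, then apply the lowercase, phrase-merging and
    minimum-size settings. Stop-word removal, stemming and lemmatization
    are disabled for every corpus here, so they are not implemented."""
    tokens = text.split()
    if config.lowercase:
        tokens = [t.lower() for t in tokens]
    tokens = merge_phrases(tokens, phrases)
    return [t for t in tokens if len(t) >= config.min_word_size]


phrases = {("Barack", "Obama"), ("New", "York")}  # assumed phrase list
print(preprocess("Barack Obama visited New York", GOOGLENEWS, phrases))
```

In this sketch lowercasing runs before merging, so a configuration with `lowercase=True` would need a lowercased phrase set.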

wiki-2018 / wiki-2014 - Wikipedia 2018/2014

This corpus is a dump of Wikipedia. The 2018 dump is available for all languages; the 2014 dump is used only for the English DEP model.

Preprocess

During the preprocessing step, we ignored Redirect and Disambiguation pages.

  • remove stop-words: false
  • lowercase: true
  • minimum word size: 0
  • apply stem: false
  • apply lemma: false
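
The page filtering and settings above can be sketched as follows. This is an assumption-laden illustration rather than the original pipeline: the `(title, wikitext)` input format, the function names, and the detection heuristics (a leading `#REDIRECT` directive, a `(disambiguation)` title suffix, or a `{{disambiguation}}`-style template, as used on English Wikipedia) are all hypothetical choices, and other language editions use different markers.

```python
import re

# Heuristics assumed for English Wikipedia markup.
REDIRECT_RE = re.compile(r"^\s*#redirect\b", re.IGNORECASE)
DISAMBIG_RE = re.compile(r"\{\{\s*(disambiguation|disambig)\b", re.IGNORECASE)


def is_redirect(wikitext):
    """True if the page body starts with a redirect directive."""
    return bool(REDIRECT_RE.match(wikitext))


def is_disambiguation(title, wikitext):
    """True if the title or a template marks the page as a disambiguation page."""
    return (title.lower().endswith("(disambiguation)")
            or bool(DISAMBIG_RE.search(wikitext)))


def wiki_tokens(pages):
    """Yield (title, tokens) for content pages only.

    Tokens are lowercased (lowercase: true). A minimum word size of 0
    means no token is dropped for being too short."""
    for title, wikitext in pages:
        if is_redirect(wikitext) or is_disambiguation(title, wikitext):
            continue
        yield title, wikitext.lower().split()


pages = [
    ("Obama", "#REDIRECT [[Barack Obama]]"),
    ("Mercury (disambiguation)", "Mercury may refer to: {{disambiguation}}"),
    ("Barack Obama", "Barack Obama served as President"),
]
for title, tokens in wiki_tokens(pages):
    print(title, tokens)
```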

cn-169

Available only in English at the moment, this corpus was offered by:

Preprocess

The preprocessing step identified multi-word expressions and merged each one into a single token joined by underscores (Barack Obama -> Barack_Obama).

  • remove stop-words: false
  • lowercase: true
  • minimum word size: 1
  • apply stem: false
  • apply lemma: false