Corpora
Juliano Efson Sales edited this page May 24, 2018 · 7 revisions
Each pre-built model processed one of the following text corpora.
Available only in English, this corpus was offered by Google. For more information, please check the official page.
Preprocessing identified multi-word expressions and merged each into a single token joined by underscores (Barack Obama -> Barack_Obama).
- remove stop-words: false
- lowercase: false
- minimum word size: 1
- apply stem: false
- apply lemma: false
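The multi-word merging described above can be illustrated with a minimal sketch. It assumes the expressions are already known and supplied as a set of token tuples; the function name `merge_multiword`, the `PHRASES` set, and the greedy longest-match strategy are illustrative assumptions, not the method actually used to build this corpus. Case is left untouched, matching `lowercase: false`.

```python
def merge_multiword(tokens, phrases, max_len=3):
    """Replace the longest known phrase at each position with one underscore-joined token."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "New York City" wins over "New York".
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                merged.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts here: keep the single token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical phrase inventory, for illustration only.
PHRASES = {("Barack", "Obama"), ("New", "York"), ("New", "York", "City")}

tokens = "Barack Obama visited New York City".split()
result = merge_multiword(tokens, PHRASES)
print(result)
```

Here the two known expressions become `Barack_Obama` and `New_York_City`, while `visited` passes through unchanged.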
Available for all languages, this corpus is the 2018 Wikipedia dump. The one exception is the English DEP model, which uses the 2014 dump.
During preprocessing, we ignored redirect and disambiguation pages.
- remove stop-words: false
- lowercase: true
- minimum word size: 0
- apply stem: false
- apply lemma: false
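As a rough sketch of how the settings listed above might be applied to a token stream, the snippet below models them as a small configuration object. The names `PreprocessConfig`, `preprocess`, and `STOP_WORDS` are hypothetical, and the reading of "minimum word size" as "drop tokens shorter than this many characters" is an assumption. Stemming and lemmatization are disabled for every corpus on this page, so they are deliberately left unimplemented.

```python
from dataclasses import dataclass


@dataclass
class PreprocessConfig:
    # Defaults mirror the settings listed above for the Wikipedia corpus.
    remove_stop_words: bool = False
    lowercase: bool = True
    min_word_size: int = 0
    apply_stem: bool = False
    apply_lemma: bool = False


# Tiny illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = frozenset({"the", "of", "a"})


def preprocess(tokens, config):
    """Apply the lowercase, stop-word, and minimum-size settings to a token list."""
    if config.apply_stem or config.apply_lemma:
        raise NotImplementedError("stemming and lemmatization are not part of this sketch")
    out = []
    for tok in tokens:
        if config.lowercase:
            tok = tok.lower()
        if config.remove_stop_words and tok.lower() in STOP_WORDS:
            continue
        # Assumed meaning: tokens shorter than min_word_size are dropped.
        if len(tok) < config.min_word_size:
            continue
        out.append(tok)
    return out


wiki_tokens = preprocess(["The", "Capital", "of", "France"], PreprocessConfig())
print(wiki_tokens)
```

With the default values only lowercasing takes effect, so every token is kept; turning on `remove_stop_words` or raising `min_word_size` would filter the output further.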
Available only in English at the moment, this corpus draws on the following sources:
- ConceptNet 5.5, which contains data from Wiktionary, WordNet, and many contributors to Open Mind Common Sense projects, edited by Rob Speer.
- Common Crawl, suggested by Jeffrey Pennington, Richard Socher, and Christopher Manning. For more information, please check the official page.
- Google News, offered by Google. For more information, please check the official page.
Preprocessing identified multi-word expressions and merged each into a single token joined by underscores (Barack Obama -> Barack_Obama).
- remove stop-words: false
- lowercase: true
- minimum word size: 1
- apply stem: false
- apply lemma: false