
A collection of scripts to download, clean and tokenize a Wikipedia dump, and split the corpus into train/val/test sets.


toizzy/wiki-corpus-creator


Scripts to easily create a sentence-segmented, tokenized corpus in any language from Wikipedia dumps.

The pipeline is in make_lang_corpus.sh. The script downloads a Wikipedia dump and runs process_wiki_dump.py to process the dump into a file that has one sentence per line, with tokens separated by spaces. It then runs split_corpus.sh to split the corpus into train/val/test sets.

./make_lang_corpus.sh LANG makes a corpus in language LANG.

LANG is the language code that Wikipedia uses. For example, the Greek Wikipedia is at el.wikipedia.org, so to make a Greek corpus you would run ./make_lang_corpus.sh el from this directory.
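To illustrate how a language code maps to a dump file, here is a minimal sketch. It assumes the public dumps.wikimedia.org layout for the latest pages-articles dump; the function name dump_url is hypothetical, and make_lang_corpus.sh may build its download URL differently.

```python
def dump_url(lang: str) -> str:
    """Build the URL of the latest pages-articles dump for a language code.

    Assumption: the standard dumps.wikimedia.org layout. This is an
    illustration, not necessarily what make_lang_corpus.sh does.
    """
    # Wikimedia database names use underscores where language codes use hyphens.
    db = lang.replace("-", "_") + "wiki"
    return f"https://dumps.wikimedia.org/{db}/latest/{db}-latest-pages-articles.xml.bz2"


if __name__ == "__main__":
    print(dump_url("el"))
```

For el, this points at the elwiki dump, matching the el.wikipedia.org example above.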

split_corpus.sh splits the corpus into five fifths and takes one fifth of the val and test sets from each fifth. This avoids cases where the whole test or validation set comes from a single nonstandard document in the original Wikipedia corpus, which would skew evaluation.
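The splitting strategy can be sketched in Python as follows. This is an illustration only, not the contents of split_corpus.sh: the function name split_corpus and its parameters are hypothetical, and taking the held-out sentences from the start of each fifth is an assumption.

```python
def split_corpus(sentences, n_test, n_val, n_chunks=5):
    """Split sentences into train/val/test, drawing an equal share of the
    val and test sets from each of n_chunks contiguous chunks.

    Assumption: held-out sentences come from the start of each chunk;
    the real script may pick them from elsewhere in the chunk.
    """
    chunk_len = len(sentences) // n_chunks
    test_per = n_test // n_chunks  # test sentences taken from each chunk
    val_per = n_val // n_chunks    # val sentences taken from each chunk

    train, val, test = [], [], []
    for i in range(n_chunks):
        start = i * chunk_len
        # The last chunk absorbs any remainder sentences.
        end = len(sentences) if i == n_chunks - 1 else start + chunk_len
        chunk = sentences[start:end]
        test.extend(chunk[:test_per])
        val.extend(chunk[test_per:test_per + val_per])
        train.extend(chunk[test_per + val_per:])
    return train, val, test


if __name__ == "__main__":
    corpus = [f"s{i}" for i in range(100)]
    train, val, test = split_corpus(corpus, n_test=10, n_val=5)
    print(len(train), len(val), len(test))
```

Because each fifth contributes only a fifth of the held-out data, a single unusual article can affect at most a fraction of the test or validation set.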

Parameters that you could change

  • As it stands, process_wiki_dump.py stops once the corpus length exceeds 100,000,000 tokens. You can change this limit through the max_tokens parameter.
  • split_corpus.sh makes a test set of 20,000 sentences and a val set of 2,000 sentences, and puts the rest in train. You can change these sizes by changing those numbers in the script.
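The token cap can be pictured with the following sketch. It assumes tokens are counted by splitting each output line on whitespace, and that processing stops after the first sentence that pushes the total past the limit. The helper name take_until_max_tokens is hypothetical; only max_tokens comes from the script.

```python
def take_until_max_tokens(sentences, max_tokens=100_000_000):
    """Yield tokenized sentences until the running token count exceeds max_tokens.

    Assumption: each sentence is a string of space-separated tokens,
    matching the one-sentence-per-line output format.
    """
    total = 0
    for sentence in sentences:
        yield sentence
        total += len(sentence.split())
        if total > max_tokens:
            break  # corpus length now exceeds the cap, so stop here


if __name__ == "__main__":
    demo = ["a b", "c d e", "f"]
    print(list(take_until_max_tokens(demo, max_tokens=4)))
```

Lowering max_tokens is a simple way to get a quick, small corpus when testing the pipeline on a new language.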

Python requirements

The Python script process_wiki_dump.py requires stanfordnlp and gensim, both available via pip, and xml.etree, which is part of the Python standard library.
