Scripts to easily create a sentence-segmented, tokenized corpus in any language from Wikipedia dumps.
The pipeline is in `make_lang_corpus.sh`. The script downloads a Wikipedia dump and runs `process_wiki_dump.py` to process the dump into a file with one sentence per line and tokens separated by spaces. It then runs `split_corpus.sh` to split the corpus into train/val/test sets.
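For orientation, here is a minimal sketch of what the processing step could look like. It is not the actual `process_wiki_dump.py`; the use of gensim's `extract_pages`/`filter_wiki`, the `stanfordnlp` tokenize pipeline, and the function and parameter names are assumptions.

```python
# Minimal sketch of the processing step (NOT the actual process_wiki_dump.py).
# Assumes a .xml.bz2 Wikipedia dump and that the stanfordnlp models for the
# target language have already been fetched with stanfordnlp.download(lang).
import bz2

import stanfordnlp
from gensim.corpora.wikicorpus import extract_pages, filter_wiki


def process_dump(dump_path, out_path, lang, max_tokens=100_000_000):
    nlp = stanfordnlp.Pipeline(processors="tokenize", lang=lang)
    n_tokens = 0
    with bz2.open(dump_path, "rb") as dump, \
            open(out_path, "w", encoding="utf-8") as out:
        # extract_pages yields (title, raw wikitext, pageid) for main-namespace pages
        for _title, raw, _pageid in extract_pages(dump, filter_namespaces=("0",)):
            text = filter_wiki(raw).strip()  # strip wiki markup, keep plain text
            if not text:
                continue
            doc = nlp(text)
            for sentence in doc.sentences:
                tokens = [token.text for token in sentence.tokens]
                out.write(" ".join(tokens) + "\n")  # one sentence per line
                n_tokens += len(tokens)
                if n_tokens > max_tokens:  # stop once the corpus is big enough
                    return
```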
`./make_lang_corpus.sh LANG` makes a corpus in language `LANG`. `LANG` is the language code that Wikipedia uses. For example, the Greek Wikipedia is at el.wikipedia.org, so if I wanted to make a Greek corpus I would run `./make_lang_corpus.sh el` from this directory.
`split_corpus.sh` splits the corpus into fifths and takes one fifth of the val and test sets from each fifth. This avoids cases where the whole test or validation set comes from a single nonstandard document in the original Wikipedia corpus, which would skew evaluation.
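To make the split concrete, here is a small Python sketch of the same idea. The real `split_corpus.sh` is a shell script; the function name and the assumption that the fifths are contiguous are mine.

```python
# Illustrative sketch of the fifths-based split; split_corpus.sh itself is a
# shell script, so this is only a Python equivalent assuming contiguous fifths.
def split_corpus(lines, n_test=20_000, n_val=2_000, n_fifths=5):
    """Draw 1/n_fifths of the test and val sentences from each fifth of the corpus."""
    fifth = len(lines) // n_fifths
    train, val, test = [], [], []
    for i in range(n_fifths):
        start = i * fifth
        end = (i + 1) * fifth if i < n_fifths - 1 else len(lines)
        chunk = lines[start:end]
        t = n_test // n_fifths  # test sentences taken from this fifth
        v = n_val // n_fifths   # val sentences taken from this fifth
        test.extend(chunk[:t])
        val.extend(chunk[t:t + v])
        train.extend(chunk[t + v:])
    return train, val, test
```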
- As it stands now, `process_wiki_dump.py` stops when the corpus length exceeds 100,000,000 tokens. You can change this by changing the `max_tokens` parameter. `split_corpus.sh` makes a `test` corpus of 20,000 sentences and a `val` corpus of 2,000 sentences, and puts the rest in `train`. You can change this by changing the numbers.
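Tying the numbers above to the split sketch, a hypothetical driver could look like the following; the corpus and output file names are assumptions.

```python
# Hypothetical driver for the split sketch above; file names are assumptions.
with open("corpus.txt", encoding="utf-8") as f:
    lines = f.readlines()

train, val, test = split_corpus(lines, n_test=20_000, n_val=2_000)

for name, part in [("train", train), ("val", val), ("test", test)]:
    with open(f"{name}.txt", "w", encoding="utf-8") as out:
        out.writelines(part)
```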
The Python script `process_wiki_dump.py` requires `stanfordnlp` and `gensim`, both available on pip, as well as `xml.etree` from the Python standard library.