DKPro Spelling is a highly configurable spellchecking application.
It is language-invariant: a minimal setup for any language requires only a tokenizer and a dictionary.
A named entity recognizer and (unigram) language model are likely to improve results.
Resources included in this repository are:
- Corpora:
  - The English, German and Czech MERLIN corpora and the German Litkey corpus in the LeSpell format.
  - Scripts to convert the Italian CItA and English TOEFL-Spell corpora into the LeSpell format.
- Dictionaries for English, German, Italian and Czech and a script to generate a .txt dictionary from hunspell .dic and .aff files.
- Phonetic versions (*_phoneme_map.txt) of the above dictionaries and a script to create a phonetic dictionary from a given .txt dictionary.
- Keyboard distance files for English, German, Italian and Czech.
- Unigram language models for English, German and Italian (see here for an example on how to incorporate unigram frequencies to rerank correction candidates).
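To illustrate the idea of reranking correction candidates by unigram frequency, here is a minimal Python sketch. It is not the repository's implementation: the function name `rerank_by_unigram_frequency` and the toy counts are assumptions made for illustration, and a real setup would load counts from one of the unigram language model files.

```python
from collections import Counter


def rerank_by_unigram_frequency(candidates, unigram_counts):
    """Sort candidates by descending unigram frequency.

    sorted() is stable, so candidates with equal frequency keep their
    original relative order (for example an edit-distance ranking).
    Words missing from the counts are treated as frequency 0.
    """
    return sorted(candidates, key=lambda word: -unigram_counts.get(word, 0))


# Hypothetical toy counts, not taken from any of the provided models.
toy_counts = Counter({"their": 900, "there": 1200, "three": 300})
ranked = rerank_by_unigram_frequency(["three", "their", "there", "thare"], toy_counts)
print(ranked)
```

The unseen word `thare` ends up last because it has no count, which is the behaviour one would usually want from a frequency-based reranker.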
See an example end-to-end pipeline here and one that employs the different candidate correction methods here. For easy access we also provide jars for the default configuration (see User Mode).
Before you use phonetic spellchecking on a new corpus, please pre-generate phonetic representations of misspellings/out-of-dictionary words in it as shown here and place copies of the *_phoneme_map.txt dictionaries as well as any custom phonetic dictionaries in the respective language folders here.
Please make sure to git clone this repository rather than downloading it as a .zip. This ensures that the jars and other large files will be downloaded properly.
For Web 1T reranking to work, set the WEB1T system variable to point to the location of Web 1T (export WEB1T="PATH_TO_WEB1T").
In this folder you need subfolders /en, /de, /it, /cz for the respective languages. Within these, you need subfolders /*gms as well as files index-*gms and the aggregated_counts.cnt file. You can obtain Web 1T from the Linguistic Data Consortium.
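The folder layout described above can be checked before running the tool. The following Python sketch is only a convenience check under the assumptions stated in this README (language subfolders, `*gms` subfolders, `index-*gms` files, and `aggregated_counts.cnt`); the helper name `missing_web1t_parts` is hypothetical and not part of the repository.

```python
import os
from pathlib import Path

LANGUAGES = ("en", "de", "it", "cz")


def missing_web1t_parts(root, languages=LANGUAGES):
    """Return a list of problems found in the Web 1T layout (empty if none)."""
    problems = []
    root = Path(root)
    for lang in languages:
        lang_dir = root / lang
        if not lang_dir.is_dir():
            problems.append(f"missing folder: {lang_dir}")
            continue
        # At least one n-gram subfolder matching *gms is expected.
        if not any(p.is_dir() for p in lang_dir.glob("*gms")):
            problems.append(f"no *gms subfolder in {lang_dir}")
        # At least one index file matching index-*gms is expected.
        if not any(p.is_file() for p in lang_dir.glob("index-*gms")):
            problems.append(f"no index-*gms file in {lang_dir}")
        if not (lang_dir / "aggregated_counts.cnt").is_file():
            problems.append(f"missing aggregated_counts.cnt in {lang_dir}")
    return problems


if __name__ == "__main__":
    web1t = os.environ.get("WEB1T")
    if web1t is None:
        print("WEB1T is not set")
    else:
        for problem in missing_web1t_parts(web1t):
            print(problem)
```

If you only work with one language, pass a shorter tuple, for example `missing_web1t_parts(path, languages=("de",))`.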
You may have to unzip some of the dictionaries.
<corpus name="EXAMPLE_NAME"><text id="EXAMPLE_ID" lang="en">
We mark <error correct="misspellings" type="typo">mispellings</error> as shown in this example.
</text></corpus>
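As a rough illustration of how this annotation can be read programmatically, the following Python sketch parses the example above with the standard library. The function name `extract_errors` and the dictionary keys are assumptions for illustration; only the element and attribute names (`corpus`, `text`, `error`, `correct`, `type`, `id`, `lang`) come from the example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<corpus name="EXAMPLE_NAME"><text id="EXAMPLE_ID" lang="en">
We mark <error correct="misspellings" type="typo">mispellings</error> as shown in this example.
</text></corpus>"""


def extract_errors(xml_string):
    """Collect every annotated error together with its gold correction."""
    root = ET.fromstring(xml_string)
    errors = []
    for text in root.iter("text"):
        for error in text.iter("error"):
            errors.append({
                "text_id": text.get("id"),
                "lang": text.get("lang"),
                "misspelling": error.text,
                "correction": error.get("correct"),
                "type": error.get("type"),
            })
    return errors


print(extract_errors(SAMPLE))
```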
Supports English (en), German (de), Italian (it) and Czech (cz).
Requires WEB1T system variable to be set.
Use the tool in its default configuration: Generate correction candidates based on Damerau-Levenshtein Distance, rerank them using Web 1T trigrams.
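For readers unfamiliar with distance-based candidate generation, here is a minimal Python sketch. It implements the restricted Damerau-Levenshtein distance (also called optimal string alignment), which counts insertions, deletions, substitutions and transpositions of adjacent characters. Which variant the tool itself uses is not stated here, and the names `osa_distance` and `generate_candidates` as well as the `max_distance` threshold are assumptions for illustration.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def generate_candidates(word, dictionary, max_distance=1):
    """Return dictionary entries within max_distance, closest first."""
    scored = [(osa_distance(word, entry), entry) for entry in dictionary]
    return [entry for dist, entry in sorted(scored) if dist <= max_distance]


print(generate_candidates("teh", ["the", "tea", "ten", "dog"]))
```

Comparing against every dictionary entry is quadratic per word and only suitable for small dictionaries; it is shown here purely to make the ranking criterion concrete.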
Spellcheck a .txt file.
Outputs a .tsv file with a ranked list of corrections.
To run an example: java -jar DKPro_Spellcheck.jar de spelling/src/main/resources/corpora/test_de.txt
java -jar DKPro_Spellcheck.jar [LANGUAGE] [PATH_TO_TXT]
Evaluate error correction on a corpus annotated with errors (in the LeSpell XML format).
Outputs recall@k and lists of words that are corrected correctly/incorrectly.
To run an example: java -jar DKPro_Spellcheck_EvaluateCorrection.jar cz spelling/src/main/resources/corpora/merlin-CZ_spelling.xml
java -jar DKPro_Spellcheck_EvaluateCorrection.jar [LANGUAGE] [PATH_TO_XML]
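As a reference for interpreting the reported metric, the sketch below shows one common definition of recall@k: the share of annotated errors whose gold correction appears among the top k ranked candidates. This is an assumption about the metric, not a description of the evaluation jar's internals, and the function name `recall_at_k` and the toy data are hypothetical.

```python
def recall_at_k(ranked_candidates, gold_corrections, k):
    """Fraction of errors whose gold correction is within the top k candidates.

    ranked_candidates: one ranked candidate list per annotated error.
    gold_corrections: the gold correction for each error, in the same order.
    """
    if len(ranked_candidates) != len(gold_corrections):
        raise ValueError("need exactly one candidate list per gold correction")
    if not gold_corrections:
        raise ValueError("no errors to evaluate")
    hits = sum(
        1
        for candidates, gold in zip(ranked_candidates, gold_corrections)
        if gold in candidates[:k]
    )
    return hits / len(gold_corrections)


# Hypothetical toy data: three errors with their ranked candidates.
toy_ranked = [["the", "tea"], ["hose", "house"], ["cat"]]
toy_gold = ["the", "house", "car"]
print(recall_at_k(toy_ranked, toy_gold, 1), recall_at_k(toy_ranked, toy_gold, 2))
```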
As Web 1T is quite large, you may (especially for English) want to set -Djava.io.tmpdir="PATH_TO_DIR" to a folder with enough space. The full command then becomes:
java -Djava.io.tmpdir="PATH_TO_DIR" -jar [JAR_NAME] [LANG] [PATH_TO_FILE]
@InProceedings{le-spell-2022,
author = {Bexte, Marie and Laarmann-Quante, Ronja and Horbach, Andrea and Zesch, Torsten},
title = {LeSpell - A Multi-Lingual Benchmark Corpus of Spelling Errors to Develop Spellchecking Methods for Learner Language},
booktitle = {Proceedings of the Language Resources and Evaluation Conference},
month = {June},
year = {2022},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {697--706},
url = {https://aclanthology.org/2022.lrec-1.73}
}