DKPro Spelling

DKPro Spelling is a highly configurable spellchecking application.
It is language-independent: a minimal setup for any language requires only a tokenizer and a dictionary.
A named entity recognizer and (unigram) language model are likely to improve results.

Resources included in this repository are:

- Corpora:
  - The English, German and Czech MERLIN corpora and the German Litkey corpus in the LeSpell format.
  - Scripts to convert the Italian CItA and English TOEFL-Spell corpora into the LeSpell format.
- Dictionaries for English, German, Italian and Czech and a script to generate a .txt dictionary from hunspell .dic and .aff files.
- Phonetic versions (*_phoneme_map.txt) of the above dictionaries and a script to create a phonetic dictionary from a given .txt dictionary.
- Keyboard distance files for English, German, Italian and Czech.
- Unigram language models for English, German and Italian (see here for an example on how to incorporate unigram frequencies to rerank correction candidates).
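
The idea of reranking correction candidates by unigram frequency can be pictured with a few lines of Python. This is only an illustrative sketch, not the repository's implementation: the helper name rerank_by_unigram, the in-memory frequency table and the alphabetical tie-breaking are assumptions made for this example.

```python
from typing import Dict, List


def rerank_by_unigram(candidates: List[str], freqs: Dict[str, int]) -> List[str]:
    """Order candidates by descending unigram frequency (hypothetical helper).

    Candidates missing from the frequency table count as frequency 0;
    ties are broken alphabetically so the result is deterministic.
    """
    return sorted(candidates, key=lambda w: (-freqs.get(w, 0), w))


# Toy frequencies, invented for this example only.
toy_freqs = {"their": 900, "there": 1200, "three": 300}
print(rerank_by_unigram(["three", "their", "there", "theer"], toy_freqs))
# prints ['there', 'their', 'three', 'theer']
```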

See an example end-to-end pipeline here and one that employs the different candidate correction methods here. For easy access we also provide jars for the default configuration (see User Mode).

Before you use phonetic spellchecking on a new corpus, please pre-generate phonetic representations of the misspellings/out-of-dictionary words in it as shown here. Then place copies of the *_phoneme_map.txt dictionaries, as well as any custom phonetic dictionaries, in the respective language folders here.

Setup

Please make sure to git clone this repository rather than to download it as a .zip. This ensures that the jars and other large files will be downloaded properly.

For Web 1T reranking to work, set the WEB1T environment variable to the location of your Web 1T data (export WEB1T="PATH_TO_WEB1T"). This folder needs the subfolders /en, /de, /it and /cz for the respective languages. Each of these needs subfolders /*gms as well as the files index-*gms and aggregated_counts.cnt. You can obtain Web 1T from the Linguistic Data Consortium.
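
To check the Web 1T folder layout before running the tool, a small script along these lines may help. It is a sketch under assumptions: the function name missing_web1t_parts is made up for this example, and it assumes that the index-*gms files and aggregated_counts.cnt sit directly inside each language folder, as the description above suggests.

```python
import os
from pathlib import Path
from typing import List, Sequence

LANGUAGES = ("en", "de", "it", "cz")


def missing_web1t_parts(root: Path, languages: Sequence[str] = LANGUAGES) -> List[str]:
    """Return descriptions of missing pieces; an empty list means the layout looks complete."""
    problems = []
    for lang in languages:
        lang_dir = root / lang
        if not lang_dir.is_dir():
            problems.append(f"{lang_dir}: language folder missing")
            continue
        # At least one n-gram subfolder such as 1gms, 2gms, ...
        if not any(p.is_dir() for p in lang_dir.glob("*gms")):
            problems.append(f"{lang_dir}: no *gms subfolder")
        # At least one index file such as index-1gms
        if not any(p.is_file() for p in lang_dir.glob("index-*gms")):
            problems.append(f"{lang_dir}: no index-*gms file")
        if not (lang_dir / "aggregated_counts.cnt").is_file():
            problems.append(f"{lang_dir}: aggregated_counts.cnt missing")
    return problems


if __name__ == "__main__":
    web1t = os.environ.get("WEB1T")
    if web1t is None:
        print("WEB1T is not set")
    else:
        for problem in missing_web1t_parts(Path(web1t)):
            print(problem)
```

If you only work with one language, pass for example languages=("de",) to skip the checks for the others.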

You may have to unzip some of the dictionaries.

LeSpell Error Annotation Format

<corpus name="EXAMPLE_NAME"><text id="EXAMPLE_ID" lang="en">
  We mark <error correct="misspellings" type="typo">mispellings</error> as shown in this example.
</text></corpus>
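
Because the format is plain XML, annotated errors can be read with a standard XML parser. The following sketch uses Python's xml.etree.ElementTree on the example above; the function name extract_errors is an assumption for illustration, and real corpus files may carry additional attributes that are not handled here.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<corpus name="EXAMPLE_NAME"><text id="EXAMPLE_ID" lang="en">
  We mark <error correct="misspellings" type="typo">mispellings</error> as shown in this example.
</text></corpus>"""


def extract_errors(xml_string):
    """Yield (text_id, misspelling, correction, error_type) for each <error> element."""
    root = ET.fromstring(xml_string)
    for text in root.iter("text"):
        for err in text.iter("error"):
            yield (text.get("id"), err.text, err.get("correct"), err.get("type"))


for row in extract_errors(SAMPLE):
    print(row)
# prints ('EXAMPLE_ID', 'mispellings', 'misspellings', 'typo')
```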

User Mode

Supports English (en), German (de), Italian (it) and Czech (cz).
Requires the WEB1T environment variable to be set (see Setup).

The provided jars run the tool in its default configuration: correction candidates are generated based on Damerau-Levenshtein distance and then reranked using Web 1T trigrams.
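
To illustrate distance-based candidate generation, here is a small Python sketch. It implements the restricted variant of Damerau-Levenshtein distance (optimal string alignment), which counts insertions, deletions, substitutions and transpositions of adjacent characters. Which variant the tool itself uses is not stated here, and the names osa_distance and candidates as well as the max_distance threshold are assumptions for this example.

```python
from typing import Iterable, List


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]


def candidates(word: str, dictionary: Iterable[str], max_distance: int = 1) -> List[str]:
    """Dictionary words within max_distance of word, sorted alphabetically."""
    return sorted(w for w in dictionary if osa_distance(word, w) <= max_distance)


print(candidates("teh", {"the", "ten", "tea", "them"}))
# prints ['tea', 'ten', 'the']
```

Comparing against every dictionary entry is fine for a demonstration, but it scales linearly with dictionary size.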

Spellcheck a .txt file.
Outputs .tsv with ranked list of corrections.
To run an example: java -jar DKPro_Spellcheck.jar de spelling/src/main/resources/corpora/test_de.txt

java -jar DKPro_Spellcheck.jar [LANGUAGE] [PATH_TO_TXT]

Evaluate error correction on a corpus annotated with errors (in the LeSpell XML format).
Outputs recall@k and lists of words that are corrected correctly/incorrectly.
To run an example: java -jar DKPro_Spellcheck_EvaluateCorrection.jar cz spelling/src/main/resources/corpora/merlin-CZ_spelling.xml

java -jar DKPro_Spellcheck_EvaluateCorrection.jar [LANGUAGE] [PATH_TO_XML]
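
For readers unfamiliar with recall@k, the following sketch shows one common way to compute it: the share of errors whose gold correction appears among the top k ranked candidates. The exact definition used by the evaluation jar is not spelled out here, so treat the function recall_at_k and the toy data as assumptions.

```python
from typing import List, Sequence, Tuple


def recall_at_k(items: Sequence[Tuple[str, List[str]]], k: int) -> float:
    """Share of errors whose gold correction is among the top-k candidates.

    Each item is (gold_correction, ranked_candidates), best candidate first.
    """
    if not items:
        return 0.0
    hits = sum(1 for gold, ranked in items if gold in ranked[:k])
    return hits / len(items)


# Toy data, invented for this example only.
toy = [
    ("misspellings", ["misspellings", "misspelling"]),
    ("the", ["ten", "the", "tea"]),
    ("house", ["horse", "hose"]),
]
print(recall_at_k(toy, 1))  # one of three gold corrections is ranked first
print(recall_at_k(toy, 3))  # two of three appear within the top three
```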

As Web 1T is quite large, you may (especially for English) want to point -Djava.io.tmpdir to a folder with enough space. The full command then becomes java -Djava.io.tmpdir="PATH_TO_DIR" -jar [JAR_NAME] [LANG] [PATH_TO_FILE]
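
If you call the jars from a script, the command line can be assembled programmatically. The helper below is a hypothetical convenience wrapper (build_command is not part of this repository); it only builds the argument list described above and leaves running it to you.

```python
from typing import List, Optional


def build_command(jar: str, lang: str, path: str, tmpdir: Optional[str] = None) -> List[str]:
    """Assemble the java call; the tmpdir option is added only if given."""
    cmd = ["java"]
    if tmpdir is not None:
        cmd.append(f"-Djava.io.tmpdir={tmpdir}")
    cmd += ["-jar", jar, lang, path]
    return cmd


print(build_command("DKPro_Spellcheck.jar", "de", "test_de.txt", tmpdir="/data/tmp"))
# To actually run it (requires Java and the jar to be present):
#   import subprocess
#   subprocess.run(build_command(...), check=True)
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.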

Cite

@InProceedings{le-spell-2022,
  author    = {Bexte, Marie  and  Laarmann-Quante, Ronja  and  Horbach, Andrea  and  Zesch, Torsten},
  title     = {LeSpell - A Multi-Lingual Benchmark Corpus of Spelling Errors to Develop Spellchecking Methods for Learner Language},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {697--706},
  url       = {https://aclanthology.org/2022.lrec-1.73}
}

