This is a slightly adapted version of Miðeind's GreynirCorrect, from which this repository is forked.
This version is adapted for use in a text-to-speech (TTS) text pre-processing pipeline, but the documentation below also points out access points for quick adaptation to other use cases in language technology applications.
GreynirCorrect is a Python 3 (>= 3.6) package and command line tool for checking and correcting spelling and grammar in Icelandic text.
GreynirCorrect relies on the Greynir package, by the same authors, to tokenize and parse text.
GreynirCorrect is documented in detail here.
For instructions on how to use GreynirCorrect as is, please consult the original repository.
To try out GreynirCorrect for your use case, a good place to start is to create your own API script in src/reynir_correct/tools/. For examples, see the existing scripts diffchecker.py and tts_frontend.py. This of course only applies if you plan to change the core code base; otherwise you should use GreynirCorrect or GreynirCorrect4LT as a library.
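For orientation, here is a minimal sketch of what such a tool script could look like when GreynirCorrect is used as a library. It relies on the token-level tokenize() API documented for the original GreynirCorrect; the script name, the line-by-line handling and the space-joined output are illustrative choices, not part of the existing tools.

```python
# Minimal sketch of a custom tool script (e.g. src/reynir_correct/tools/my_tool.py).
import sys

from reynir_correct import tokenize  # token-level spell correction


def correct_line(line: str) -> str:
    """Return the line with token-level corrections applied."""
    # tokenize() yields corrected tokens; tok.txt holds the (possibly corrected) text
    return " ".join(tok.txt for tok in tokenize(line) if tok.txt)


def main() -> None:
    for line in sys.stdin:
        line = line.strip()
        if line:
            print(correct_line(line))


if __name__ == "__main__":
    main()
```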
GreynirCorrect relies heavily on the Greynir package, and a lot of "magic" happens under the hood, not directly accessible from GreynirCorrect. The tokenizing process also performs corrections along the way, and essentially three tokenizers work together in GreynirCorrect:
tokenizer.tokenize() from the Tokenizer package
bintokenizer.py from GreynirPackage
errtokenizer.py from GreynirCorrect: src/reynir_correct/errtokenizer.py
Since we cannot - and should not - interfere with the tokenizers from Tokenizer and GreynirPackage, we use errtokenizer to override processes as needed. A spell correction pipeline is defined in bintokenizer as class DefaultPipeline. Familiarize yourself with this pipeline to get an overview of how GreynirCorrect works.
The DefaultPipeline is overridden as class CorrectionPipeline in errtokenizer.py. Here we have the possibility to override the PhaseFunctions defined in the DefaultPipeline to change the default tokenizing/correcting process. In the default implementation of GreynirCorrect, the pipeline in CorrectionPipeline looks like this (remember: DefaultPipeline is defined in bintokenizer.py in GreynirPackage, CorrectionPipeline in errtokenizer.py):
DefaultPipeline.tokenize_without_annotation # core function: tokenizer.tokenize()
CorrectionPipeline.correct_tokens # core function: errtokenizer.parse_errors()
DefaultPipeline.parse_static_phrases # core function: bintokenizer.StaticPhraseStream.process()
DefaultPipeline.annotate # core function: bintokenizer.annotate()
DefaultPipeline.recognize_entities # not implemented
CorrectionPipeline.check_spelling # core function: CorrectionPipeline.check_spelling()
DefaultPipeline.parse_phrases_1 # core function: bintokenizer.parse_phrases_1()
DefaultPipeline.parse_phrases_2 # core function: bintokenizer.parse_phrases_2()
DefaultPipeline.parse_phrases_3 # core function: bintokenizer.parse_phrases_3()
DefaultPipeline.fix_abbreviations # core function: bintokenizer.fix_abbreviations()
DefaultPipeline.disambiguate_phrases # core function: bintokenizer.DisambiguationStream.process()
CorrectionPipeline.final_correct # core function: CorrectionPipeline.final_correct()
To change the tokenization itself, i.e. to change the way the input text is split up into tokens, we need to override the first PhaseFunction and change DefaultPipeline.tokenize_without_annotation to CorrectionPipeline.tokenize_without_annotation. In the current 4LT implementation, tokens starting with < are not split up as they would be by the original tokenizer. For example, the original tokenizer would split <sil> into < sil >, making the token sil a candidate for correction to til. The text-to-speech (TTS) pipeline relies on all tags being kept as is at this stage, since previous analysis steps may already have added tags, for example SSML tags for the TTS system.
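As an illustration of this kind of override, the sketch below shows one way a tag-preserving first phase could pre-split the input so that tokens such as <sil> reach the rest of the pipeline intact. The regex, the helper name and the subclass shown in the comment are assumptions for illustration only; the actual signatures of the PhaseFunctions should be checked in DefaultPipeline and CorrectionPipeline.

```python
# Illustrative sketch only: the real PhaseFunction signatures live in
# bintokenizer.DefaultPipeline / errtokenizer.CorrectionPipeline and may differ.
import re
from typing import Iterator

TAG_RE = re.compile(r"<[^<>\s]+>")  # matches simple tags such as <sil>


def split_keeping_tags(text: str) -> Iterator[str]:
    """Yield raw chunks, keeping <...> tags intact and splitting the rest on whitespace."""
    pos = 0
    for m in TAG_RE.finditer(text):
        yield from text[pos:m.start()].split()
        yield m.group(0)  # the tag itself, untouched
        pos = m.end()
    yield from text[pos:].split()


# A subclass could then feed these protected chunks to the underlying tokenizer
# instead of the raw text, roughly along these lines (pseudocode):
#
# class TagPreservingPipeline(CorrectionPipeline):
#     def tokenize_without_annotation(self):
#         ...  # tokenize each chunk from split_keeping_tags() separately,
#              # passing <...> chunks through unchanged
```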
GreynirCorrect uses different word lists and maps to assist with spell checking and correction. For LT applications we might want to add our own lists of words that need special treatment during the process. In the case of the TTS frontend pipeline, we add a list of lemmas representing words that are typical output of a text normalization system for TTS. The normalization system expands abbreviations and replaces digits with their written-out forms, e.g. 5 becomes five in a TTS-normalized text. In Icelandic, many of these words are inflected, and while the normalizer has certain means to determine the correct case, gender and number of an expanded word, it does make errors. To increase the likelihood of a correct form, we attach all possible forms to these words, should they occur in the input, and let the parser in GreynirPackage choose the most likely PoS tags for the token. In a post-processing step we can then correct the token according to the PoS tags.
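The sketch below only illustrates the idea of that post-processing step; the tag names and the forms table are hypothetical stand-ins, not the actual data structures used by GreynirCorrect4LT.

```python
from typing import Dict

# Hypothetical candidate forms of a normalizer-output word, keyed by a simplified
# case+number tag of the kind the parser might assign to the token.
CANDIDATE_FORMS: Dict[str, Dict[str, str]] = {
    "króna": {               # lemma produced when e.g. "500 kr." is expanded
        "nf_ft": "krónur",   # nominative plural
        "þgf_ft": "krónum",  # dative plural
        "ef_ft": "króna",    # genitive plural
    },
}


def fix_by_pos(lemma: str, parsed_tag: str, current_form: str) -> str:
    """Replace the normalizer's guess with the form matching the parser's tag, if known."""
    return CANDIDATE_FORMS.get(lemma, {}).get(parsed_tag, current_form)


# If the normalizer emitted "krónur" but the parser tagged the token as dative plural:
# fix_by_pos("króna", "þgf_ft", "krónur")  ->  "krónum"
```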
To add a new word list or word information file, add it to reynir_correct/config/. There you can see the files GreynirCorrect already uses. To read a word information file on initialization and to be able to use an object containing this information, you need to edit two files: reynir_correct/config/GreynirCorrect.conf and reynir_correct/settings.py.
GreynirCorrect.conf also contains several word lists, but for additional data it is better to create separate files. To tell GreynirCorrect to read your data on initialization, define the data structure you want to use (set, map, ...) in the settings file, add a corresponding handling method, and add an entry to CONFIG_HANDLERS in settings.read(). In GreynirCorrect.conf, add an include statement at the end of the file; see tts_normalizer_words as an example.
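The sketch below illustrates that pattern inside reynir_correct/settings.py. Apart from CONFIG_HANDLERS, settings.read() and the tts_normalizer_words include mentioned above, all names are hypothetical, and the exact shape of the handler registration should be checked against the existing entries in settings.py.

```python
# Hypothetical example of wiring up an extra word list in reynir_correct/settings.py.

# 1) Define the data structure that will hold the new word information
class MyTTSWords:
    SET: set = set()

    @staticmethod
    def add(word: str) -> None:
        MyTTSWords.SET.add(word)


# 2) Add a handler that is called for each line of the corresponding config section
def handle_my_tts_words(line: str) -> None:
    word = line.strip()
    if word and not word.startswith("#"):
        MyTTSWords.add(word)


# 3) Register the handler so that settings.read() picks up the new section,
#    following the existing entries in the CONFIG_HANDLERS mapping, e.g.:
#
# CONFIG_HANDLERS = {
#     ...,
#     "my_tts_words": handle_my_tts_words,
# }
#
# Finally, include your data file at the end of GreynirCorrect.conf, in the same
# way as tts_normalizer_words.
```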
To change or add functionality of GreynirCorrect, you will most likely edit, override, or add functions in errtokenizer.py. Study the CorrectionPipeline to find the best-fitting access point for your adaptation. Also, make sure to import your data from settings, as described in the previous section, if you have added any extra word information data. To see how functionality may be added, study errtokenizer.check_normalized_words().
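As a rough idea of what such an addition can look like, the generator below filters a token stream in the style of the phases in errtokenizer.py. The token attribute used here (txt) follows the Tokenizer package; how a correction or error annotation is actually attached to a CorrectToken should be taken from check_normalized_words() and the other functions in errtokenizer.py.

```python
from typing import Iterable, Iterator, Set


def check_my_words(token_stream: Iterable, my_words: Set[str]) -> Iterator:
    """Pass tokens through, treating those found in a custom word set specially."""
    for tok in token_stream:
        if tok.txt and tok.txt.lower() in my_words:
            # ...apply a correction or attach an error annotation here,
            # in the same way as the existing functions in errtokenizer.py...
            pass
        yield tok
```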
GreynirCorrect runs on CPython 3.6 or newer, and on PyPy 3.6 or newer. It has been tested on Linux, macOS and Windows. The PyPI package includes binary wheels for common environments, but if the setup on your OS requires compilation from source, you may need
$ sudo apt-get install python3-dev
...or something to similar effect to enable this.
To install this package (assuming you have Python >= 3.6 with pip installed):
$ pip install reynir-correct
If you want to be able to edit the source, do as follows (assuming you have git installed):
$ git clone https://github.com/mideind/GreynirCorrect
$ cd GreynirCorrect
$ # [ Activate your virtualenv here if you have one ]
$ pip install -e .
The package source code is now in GreynirCorrect/src/reynir_correct.
After installation, the corrector can be invoked directly from the command line:
$ correct input.txt output.txt
...or:
$ echo "Þinngið samþikkti tilöguna" | correct
Þingið samþykkti tillöguna
Input and output files are encoded in UTF-8. If the files are not given explicitly, stdin and stdout are used for input and output, respectively.
Empty lines in the input are treated as sentence boundaries.
By default, the output consists of one sentence per line, where each line ends with a single newline character (ASCII LF, chr(10), "\n").
Within each line, tokens are separated by spaces.
The following (mutually exclusive) options can be specified on the command line:
| Option | Description |
| --- | --- |
| --csv | Output token objects in CSV format, one per line. Sentences are separated by lines containing 0,"","" |
| --json | Output token objects in JSON format, one per line. |
| --normalize | Normalize punctuation, causing e.g. quotes to be output in Icelandic form and hyphens to be regularized. |
| --grammar | Output whole-sentence annotations, including corrections and suggestions for spelling and grammar. Each sentence in the input is output as a text line containing a JSON object, terminated by a newline. |
The CSV and JSON formats of token objects are identical to those documented for the Tokenizer package.
The JSON format of whole-sentence annotations is identical to the one documented for the Yfirlestur.is HTTPS REST API.
Type correct -h to get a short help message.
$ echo "Atvinuleysi jógst um 3%" | correct
Atvinnuleysi jókst um 3%
$ echo "Barnið vil grænann lit" | correct --csv
6,"Barnið",""
6,"vil",""
6,"grænan",""
6,"lit",""
0,"",""
Note how vil is not corrected, as it is a valid and common word, and the correct command does not perform grammar checking by default.
$ echo "Pakkin er fyrir hestin" | correct --json
{"k":"BEGIN SENT"}
{"k":"WORD","t":"Pakkinn"}
{"k":"WORD","t":"er"}
{"k":"WORD","t":"fyrir"}
{"k":"WORD","t":"hestinn"}
{"k":"END SENT"}
To perform whole-sentence grammar checking and annotation as well as spell checking, use the --grammar option:
$ echo "Ég kláraði verkefnið þrátt fyrir að ég var þreittur." | correct --grammar
{
"original":"Ég kláraði verkefnið þrátt fyrir að ég var þreittur.",
"corrected":"Ég kláraði verkefnið þrátt fyrir að ég var þreyttur.",
"tokens":[
{"k":6,"x":"Ég","o":"Ég"},
{"k":6,"x":"kláraði","o":" kláraði"},
{"k":6,"x":"verkefnið","o":" verkefnið"},
{"k":6,"x":"þrátt fyrir","o":" þrátt fyrir"},
{"k":6,"x":"að","o":" að"},
{"k":6,"x":"ég","o":" ég"},
{"k":6,"x":"var","o":" var"},
{"k":6,"x":"þreyttur","o":" þreittur"},
{"k":1,"x":".","o":"."}
],
"annotations":[
{
"start":6,
"end":6,
"start_char":35,
"end_char":37,
"code":"P_MOOD_ACK",
"text":"Hér er réttara að nota viðtengingarhátt
sagnarinnar 'vera', þ.e. 'væri'.",
"detail":"Í viðurkenningarsetningum á borð við 'Z'
í dæminu 'X gerði Y þrátt fyrir að Z' á sögnin að vera
í viðtengingarhætti fremur en framsöguhætti.",
"suggest":"væri"
},
{
"start":7,
"end":7,
"start_char":38,
"end_char":41,
"code":"S004",
"text":"Orðið 'þreittur' var leiðrétt í 'þreyttur'",
"detail":"",
"suggest":"þreyttur"
}
]
}
The output has been formatted for legibility - each input sentence is actually represented by a JSON object in a single line of text, terminated by newline.
Note that the corrected field only includes token-level spelling correction (in this case þreittur -> þreyttur), but no grammar corrections. The grammar corrections are found in the annotations list.
To apply corrections and suggestions from the annotations, replace source text or tokens (as identified by the start and end, or start_char and end_char, properties) with the suggest field, if present.
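A small post-processing script along the following lines can apply those suggestions. It is a sketch based on the example above: it assumes that start and end are inclusive token indices into the tokens list, and it simply joins the resulting tokens with spaces, which loses the original spacing around punctuation.

```python
import json
import sys


def apply_suggestions(sentence: dict) -> str:
    """Apply the 'suggest' fields of a --grammar JSON object to its token list."""
    tokens = [tok["x"] for tok in sentence["tokens"]]
    # Apply annotations from right to left so that earlier indices stay valid
    for ann in sorted(sentence["annotations"], key=lambda a: a["start"], reverse=True):
        suggest = ann.get("suggest")
        if suggest:
            tokens[ann["start"] : ann["end"] + 1] = [suggest]
    return " ".join(tokens)


for line in sys.stdin:
    if line.strip():
        print(apply_suggestions(json.loads(line)))
```

For example, piping echo "Ég kláraði verkefnið þrátt fyrir að ég var þreittur." | correct --grammar into this script would apply the 'væri' and 'þreyttur' suggestions from the annotations shown above.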
To run the built-in tests, install pytest, cd to your GreynirCorrect subdirectory (and optionally activate your virtualenv), then run:
$ python -m pytest
Parts of this software are developed under the auspices of the Icelandic Government's 5-year Language Technology Programme for Icelandic, which is managed by Almannarómur and described here (English version here).
![Miðeind ehf.](https://github.com/mideind/GreynirPackage/raw/master/doc/_static/MideindLogoVert100.png?raw=true)
Copyright © 2021 Miðeind ehf.
GreynirCorrect's original author is Vilhjálmur Þorsteinsson.
This software is licensed under the MIT License:
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
GreynirCorrect indirectly embeds the Database of Icelandic Morphology (Beygingarlýsing íslensks nútímamáls), abbreviated BÍN, and directly uses Ritmyndir (https://bin.arnastofnun.is/DMII/LTdata/comp-format/nonstand-form/), a collection of non-standard word forms. Miðeind does not claim any endorsement by the BÍN authors or copyright holders.
The BÍN source data are publicly available under the CC BY-SA 4.0 license, as further detailed here in English and here in Icelandic.
In accordance with the BÍN license terms, credit is hereby given as follows:
Beygingarlýsing íslensks nútímamáls. Stofnun Árna Magnússonar í íslenskum fræðum. Höfundur og ritstjóri Kristín Bjarnadóttir.
GreynirCorrect4LT is implemented and maintained by Grammatek ehf. as part of the Icelandic Government's 5-year Language Technology Programme for Icelandic.