Skip to content

Latest commit

 

History

History
 
 

normalization

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 

Turkish Text Normalization

Turkish Spell Checker

Spell checker provides methods for checking if a word is correctly written and can give suggestions for a word. TurkishSpellChecker class is used for this.

Instantiation:

TurkishMorphology morphology = TurkishMorphology.createWithDefaults();
TurkishSpellChecker spellChecker = new TurkishSpellChecker(morphology);

After this, several methods are available. check(String input) method is used for detecting if a word is spelled correctly.

String[] words = {"okuyabileceğimden","okuyablirim", "Ankara", "Ankar'ada", "3'de", "3'te"};
for (String word : words) {
    System.out.println(word + " = " + spellChecker.check(word));
} 

Output will be

okuyabileceğimden = true
okuyablirim = false
Ankara = true
Ankar'ada = false
3'de = false
3'te = true

Spelling Suggestion

Currently Zemberek provides a simple suggestion mechanism.

For getting suggestions for incorrect words suggestForWord method can be used:

String[] words = {"okuyablirim", "tartısıyor", "Ankar'ada", "knlıca"};
for (String word : words) {
    System.out.println(word + " = " + spellChecker.suggestForWord(word));
} 

Will give these suggestions:

okuyablirim = [okuyabilirim]
tartısıyor = [tartışıyor, tartılıyor, tartınıyor]
Ankar'ada = [Ankara'da, Ankara'ya, Ankara'dan, Ankaray'da, Ankara'ma, Antara'da, Ankara'ca, Cankara'da, Anakarada, Ankara'na, Angara'da, Ankara'mda, Ankara'nda, Ankaray'a]
knlıca = [Kanlıca, kanlıca, In'lıca, kılıca, anlıca, Kanlı'ca, kınlıca, kalıca]

If user provides a higher order language model (bi-gram models are sufficient) and context words, ranking of the suggestions may improve. Method below is used for this.

public List<String> suggestForWord(
            String word,
            String leftContext,
            String rightContext,
            NgramLanguageModel lm)

Limitations

  • It only may correct for 1 insertion, 1 deletion, 1 substitution and 1 transposition errors.
  • It ranks the results with an internal unigram language model.
  • There is no deasciifier.
  • It does not correct numbers, dates and times.
  • There may be junk results.
  • For shorter words, there will be a lot of suggestions (sometimes >50 ).
  • Suggestion function is not so fast (Around 500-1000 words/second).