Settings: Settings - Part-of-speech Tagging - Tagsets - Mapping Settings - Allow editing of tagset mapping of Stanza's Armenian (Eastern), Armenian (Western), Basque, Buryat (Russia), Danish, French, Greek (Modern), Hebrew (Modern), Hungarian, Ligurian, Manx, Marathi, Nigerian Pidgin, Pomak, Portuguese, Russian, Sanskrit, Sindhi, Sorbian (Upper), and Telugu part-of-speech taggers
BLKSerene committed Jan 13, 2024
1 parent e00ca2a · commit f5693ac
Showing 67 changed files with 529 additions and 331 deletions.
CHANGELOG.md: 10 changes (5 additions, 5 deletions)
@@ -26,7 +26,8 @@
- Work Area: Add Profiler - Lexical Diversity - Brunét's Index / Honoré's statistic

### ✨ Improvements
- Menu: Allow editing of tagset mapping of spaCy's Catalan, Danish, French, Greek (Modern), Macedonian, Norwegian (Bokmål), Portuguese, Russian, Spanish, and Ukrainian part-of-speech taggers
- Settings: Settings - Part-of-speech Tagging - Tagsets - Mapping Settings - Allow editing of tagset mapping of spaCy's Catalan, Danish, French, Greek (Modern), Macedonian, Norwegian (Bokmål), Portuguese, Russian, Spanish, and Ukrainian part-of-speech taggers
- Settings: Settings - Part-of-speech Tagging - Tagsets - Mapping Settings - Allow editing of tagset mapping of Stanza's Armenian (Eastern), Armenian (Western), Basque, Buryat (Russia), Danish, French, Greek (Modern), Hebrew (Modern), Hungarian, Ligurian, Manx, Marathi, Nigerian Pidgin, Pomak, Portuguese, Russian, Sanskrit, Sindhi, Sorbian (Upper), and Telugu part-of-speech taggers
- Utils: Update custom stop word lists

### 📌 Bugfixes
@@ -534,12 +535,11 @@
### ✨ Improvements
- File Area: Update Tokenized/Tagged
- File Area: Update support for XML files
- Menu: Disable editing of part-of-speech tag mappings for spaCy's part-of-speech taggers
- Settings: Settings - POS Tagging - Tagsets - Mapping Settings - Disable editing of tagset mapping of spaCy's part-of-speech taggers
- Settings: Update Settings - Files - Tags
- Utils: Update botok's Tibetan word tokenizer, part-of-speech tagger, and lemmatizer
- Utils: Update Chinese (Traditional) stop word lists
- Utils: Update NLTK's word tokenizers
- Utils: Update part-of-speech tag mappings for spaCy's part-of-speech taggers
- Utils: Update PyThaiNLP's CRFCut
- Utils: Update PyThaiNLP's part-of-speech taggers
- Utils: Update PyThaiNLP's Thai word tokenizers
@@ -643,7 +643,7 @@
- Work Area: Add Overview - Count of Clauses / Clause Length / Paragraph/Sentence/Token Length (Standard Deviation)

### ✨ Improvements
- Utils: Update part-of-speech tag mappings for pybo's Tibetan part-of-speech tagger
- Utils: Update tagset mapping of pybo's Tibetan part-of-speech tagger
- Utils: Update pybo's Tibetan tokenizers, part-of-speech tagger, and lemmatizer
- Utils: Update PyThaiNLP's Thai stop word list
- Utils: Update Sacremoses's tokenizers and detokenizer
@@ -681,7 +681,7 @@
### ✨ Improvements
- Misc: Disable mouse wheel events for combo boxes and spin boxes when they are not focused
- Utils: Update spaCy's sentencizer
- Utils: Update part-of-speech tag mappings for spaCy's English part-of-speech tagger
- Utils: Update tagset mapping of spaCy's English part-of-speech tagger

### 📌 Bugfixes
- File Area: Fix Open Folder
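
A tagset mapping, as referenced throughout this changelog entry, pairs each tag a part-of-speech tagger can emit with a Universal POS tag; editing the mapping means editing those pairs. A minimal, hypothetical sketch of the idea (the tag names, fallback value, and helper below are illustrative only, not Wordless's actual data structures):

```python
# Hypothetical sketch of a tagset mapping: tagger-specific tags on the left,
# Universal POS tags on the right. Editing the mapping means editing these pairs.
EXAMPLE_TAGSET_MAPPING = {
    'NOUN': 'NOUN',
    'PROPN': 'PROPN',
    'VERB': 'VERB',
    'ADJ': 'ADJ',
    'ADP': 'ADP',
}

def map_to_universal(tag, mapping = EXAMPLE_TAGSET_MAPPING):
    # Fall back to 'X' (other) for tags that are missing from the mapping
    return mapping.get(tag, 'X')

print(map_to_universal('VERB'))     # VERB
print(map_to_universal('UNKNOWN'))  # X
```
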
tests/tests_nlp/tests_spacy/test_spacy.py: 34 changes (13 additions, 21 deletions)
@@ -39,10 +39,17 @@ def wl_test_spacy(
wl_test_sentence_tokenize(lang, results_sentence_tokenize_trf, results_sentence_tokenize_lg)
wl_test_word_tokenize(lang, results_word_tokenize)

# Tokenized
tokens = wl_word_tokenization.wl_word_tokenize_flat(
main,
text = getattr(wl_test_lang_examples, f'SENTENCE_{lang.upper()}'),
lang = lang
)

if lang != 'other':
wl_test_pos_tag(lang, results_pos_tag, results_pos_tag_universal)
wl_test_lemmatize(lang, results_lemmatize)
wl_test_dependency_parse(lang, results_dependency_parse)
wl_test_pos_tag(lang, tokens, results_pos_tag, results_pos_tag_universal)
wl_test_lemmatize(lang, tokens, results_lemmatize)
wl_test_dependency_parse(lang, tokens, results_dependency_parse)

def wl_test_sentence_tokenize(lang, results_trf, results_lg):
lang_no_suffix = wl_conversion.remove_lang_code_suffixes(main, lang)
@@ -109,7 +116,7 @@ def wl_test_word_tokenize(lang, results):

assert tokens == results

def wl_test_pos_tag(lang, results, results_universal):
def wl_test_pos_tag(lang, tokens, results, results_universal):
lang_no_suffix = wl_conversion.remove_lang_code_suffixes(main, lang)
test_sentence = getattr(wl_test_lang_examples, f'SENTENCE_{lang.upper()}')
pos_tagger = f'spacy_{lang_no_suffix}'
@@ -130,11 +137,6 @@ def wl_test_pos_tag(lang, results, results_universal):
)

# Tokenized
tokens = wl_word_tokenization.wl_word_tokenize_flat(
main,
text = test_sentence,
lang = lang
)
tokens_tagged_tokenized = wl_pos_tagging.wl_pos_tag(
main,
inputs = tokens,
@@ -179,7 +181,7 @@

assert [token[0] for token in tokens_tagged_tokenized_long] == [str(i) for i in range(101) for j in range(10)]

def wl_test_lemmatize(lang, results):
def wl_test_lemmatize(lang, tokens, results):
lang_no_suffix = wl_conversion.remove_lang_code_suffixes(main, lang)
test_sentence = getattr(wl_test_lang_examples, f'SENTENCE_{lang.upper()}')
lemmatizer = f'spacy_{lang_no_suffix}'
@@ -193,11 +195,6 @@ def wl_test_lemmatize(lang, results):
)

# Tokenized
tokens = wl_word_tokenization.wl_word_tokenize_flat(
main,
text = test_sentence,
lang = lang
)
lemmas_tokenized = wl_lemmatization.wl_lemmatize(
main,
inputs = tokens,
@@ -240,7 +237,7 @@

assert lemmas_tokenized_long == [str(i) for i in range(101) for j in range(10)]

def wl_test_dependency_parse(lang, results):
def wl_test_dependency_parse(lang, tokens, results):
lang_no_suffix = wl_conversion.remove_lang_code_suffixes(main, lang)
test_sentence = getattr(wl_test_lang_examples, f'SENTENCE_{lang.upper()}')
dependency_parser = f'spacy_{lang_no_suffix}'
@@ -254,11 +251,6 @@ def wl_test_dependency_parse(lang, results):
)

# Tokenized
tokens = wl_word_tokenization.wl_word_tokenize_flat(
main,
text = test_sentence,
lang = lang
)
dependencies_tokenized = wl_dependency_parsing.wl_dependency_parse(
main,
inputs = tokens,
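
The spaCy test changes above follow a single pattern: the sample sentence is tokenized once in wl_test_spacy via wl_word_tokenization.wl_word_tokenize_flat, and the resulting tokens are passed into wl_test_pos_tag, wl_test_lemmatize, and wl_test_dependency_parse rather than being re-tokenized inside each helper. A stripped-down sketch of that refactor, with placeholder functions standing in for the Wordless utilities:

```python
# Simplified sketch of the refactor: hoist tokenization out of the helpers so
# every helper receives the same pre-tokenized input. The functions below are
# placeholders, not the Wordless APIs.
def word_tokenize_flat(text):
    return text.split()

def test_pos_tag(tokens, expected_tags):
    assert len(tokens) == len(expected_tags)

def test_lemmatize(tokens, expected_lemmas):
    assert len(tokens) == len(expected_lemmas)

def run_lang_tests(text, expected_tags, expected_lemmas):
    # Before: each test_* helper tokenized `text` on its own.
    # After: tokenize once and pass the tokens everywhere.
    tokens = word_tokenize_flat(text)

    test_pos_tag(tokens, expected_tags)
    test_lemmatize(tokens, expected_lemmas)

run_lang_tests(
    'The quick brown fox .',
    ['DET', 'ADJ', 'ADJ', 'NOUN', 'PUNCT'],
    ['the', 'quick', 'brown', 'fox', '.']
)
```

The Stanza tests below apply the same pattern.
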
tests/tests_nlp/tests_stanza/test_stanza.py: 72 changes (27 additions, 45 deletions)
@@ -36,32 +36,34 @@ def wl_test_stanza(
):
wl_nlp_utils.check_models(main, langs = [lang], lang_utils = [[wl_test_get_lang_util(main, lang)]])

if lang not in ['zho_cn', 'zho_tw', 'srp_latn']:
lang_stanza = wl_conversion.remove_lang_code_suffixes(main, lang)
else:
lang_stanza = lang

if lang_stanza in wl_nlp_utils.get_langs_stanza(main, util_type = 'word_tokenizers'):
if lang in wl_nlp_utils.get_langs_stanza(main, util_type = 'word_tokenizers'):
wl_test_sentence_tokenize(lang, results_sentence_tokenize)
wl_test_word_tokenize(lang, results_word_tokenize)

if lang_stanza in wl_nlp_utils.get_langs_stanza(main, util_type = 'pos_taggers'):
wl_test_pos_tag(lang, results_pos_tag, results_pos_tag_universal)
# Tokenized
tokens = wl_word_tokenization.wl_word_tokenize_flat(
main,
text = getattr(wl_test_lang_examples, f'SENTENCE_{lang.upper()}'),
lang = lang
)

if lang in wl_nlp_utils.get_langs_stanza(main, util_type = 'pos_taggers'):
wl_test_pos_tag(lang, tokens, results_pos_tag, results_pos_tag_universal)

if lang_stanza in wl_nlp_utils.get_langs_stanza(main, util_type = 'lemmatizers'):
wl_test_lemmatize(lang, results_lemmatize)
if lang in wl_nlp_utils.get_langs_stanza(main, util_type = 'lemmatizers'):
wl_test_lemmatize(lang, tokens, results_lemmatize)

if lang_stanza in wl_nlp_utils.get_langs_stanza(main, util_type = 'dependency_parsers'):
wl_test_dependency_parse(lang, results_dependency_parse)
if lang in wl_nlp_utils.get_langs_stanza(main, util_type = 'dependency_parsers'):
wl_test_dependency_parse(lang, tokens, results_dependency_parse)

if lang_stanza in wl_nlp_utils.get_langs_stanza(main, util_type = 'sentiment_analyzers'):
wl_test_sentiment_analyze(lang, results_sentiment_analayze)
if lang in wl_nlp_utils.get_langs_stanza(main, util_type = 'sentiment_analyzers'):
wl_test_sentiment_analyze(lang, tokens, results_sentiment_analayze)

def wl_test_get_lang_util(main, lang):
if lang not in ['zho_cn', 'zho_tw', 'srp_latn']:
lang_util = f'stanza_{wl_conversion.remove_lang_code_suffixes(main, lang)}'
else:
if lang in ['zho_cn', 'zho_tw', 'srp_latn']:
lang_util = f'stanza_{lang}'
else:
lang_util = f'stanza_{wl_conversion.remove_lang_code_suffixes(main, lang)}'

return lang_util

@@ -80,7 +82,7 @@ def wl_test_sentence_tokenize(lang, results):
print(f'{sentences}\n')

# The count of sentences should be more than 1
if lang in ['cop', 'fro', 'kaz', 'pcm', 'qpm', 'san', 'srp_latn']:
if lang in ['fro', 'kaz', 'pcm', 'qpm']:
assert len(sentences) == 1
else:
assert len(sentences) > 1
@@ -104,7 +106,7 @@ def wl_test_word_tokenize(lang, results):
# The count of tokens should be more than 1
assert len(tokens) > 1
# The count of tokens should be more than the length of tokens split by space
if lang in ['chu', 'cop', 'grc', 'pcm', 'orv', 'san', 'tel']:
if lang in ['chu', 'cop', 'pcm', 'orv']:
assert len(tokens) == len(test_sentence.split())
elif lang == 'vie':
assert len(tokens) < len(test_sentence.split())
@@ -113,7 +115,7 @@

assert tokens == results

def wl_test_pos_tag(lang, results, results_universal):
def wl_test_pos_tag(lang, tokens, results, results_universal):
test_sentence = getattr(wl_test_lang_examples, f'SENTENCE_{lang.upper()}')
pos_tagger = wl_test_get_lang_util(main, lang)

@@ -133,11 +135,6 @@ def wl_test_pos_tag(lang, results, results_universal):
)

# Tokenized
tokens = wl_word_tokenization.wl_word_tokenize_flat(
main,
text = test_sentence,
lang = lang
)
tokens_tagged_tokenized = wl_pos_tagging.wl_pos_tag(
main,
inputs = tokens,
@@ -182,7 +179,7 @@

assert [token[0] for token in tokens_tagged_tokenized_long] == [str(i) for i in range(101) for j in range(10)]

def wl_test_lemmatize(lang, results):
def wl_test_lemmatize(lang, tokens, results):
test_sentence = getattr(wl_test_lang_examples, f'SENTENCE_{lang.upper()}')
lemmatizer = wl_test_get_lang_util(main, lang)

Expand All @@ -195,11 +192,6 @@ def wl_test_lemmatize(lang, results):
)

# Tokenized
tokens = wl_word_tokenization.wl_word_tokenize_flat(
main,
text = test_sentence,
lang = lang
)
lemmas_tokenized = wl_lemmatization.wl_lemmatize(
main,
inputs = tokens,
@@ -241,14 +233,14 @@
)

if lang in [
'bul', 'cop', 'grc', 'ell', 'hin', 'isl', 'lit', 'glv', 'pcm', 'pol',
'orv', 'sme', 'san', 'cym'
'bul', 'chu', 'cop', 'est', 'got', 'grc', 'ell', 'hin', 'isl', 'lij',
'lit', 'glv', 'pcm', 'pol', 'orv', 'sme', 'san', 'tur', 'cym'
]:
assert len(lemmas_tokenized_long) == 101 * 10
else:
assert lemmas_tokenized_long == [str(i) for i in range(101) for j in range(10)]

def wl_test_dependency_parse(lang, results):
def wl_test_dependency_parse(lang, tokens, results):
test_sentence = getattr(wl_test_lang_examples, f'SENTENCE_{lang.upper()}')
dependency_parser = wl_test_get_lang_util(main, lang)

Expand All @@ -261,11 +253,6 @@ def wl_test_dependency_parse(lang, results):
)

# Tokenized
tokens = wl_word_tokenization.wl_word_tokenize_flat(
main,
text = test_sentence,
lang = lang
)
dependencies_tokenized = wl_dependency_parsing.wl_dependency_parse(
main,
inputs = tokens,
@@ -316,7 +303,7 @@

assert [dependency[0] for dependency in dependencies_tokenized_long] == [str(i) for i in range(101) for j in range(10)]

def wl_test_sentiment_analyze(lang, results):
def wl_test_sentiment_analyze(lang, tokens, results):
test_sentence = getattr(wl_test_lang_examples, f'SENTENCE_{lang.upper()}')
sentiment_analyzer = wl_test_get_lang_util(main, lang)

@@ -329,11 +316,6 @@ def wl_test_sentiment_analyze(lang, results):
)

# Tokenized
tokens = wl_word_tokenization.wl_word_tokenize_flat(
main,
text = test_sentence,
lang = lang
)
sentiment_scores_tokenized = wl_sentiment_analysis.wl_sentiment_analyze(
main,
inputs = [tokens],
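
One detail worth noting in the reworked wl_test_get_lang_util above: the codes 'zho_cn', 'zho_tw', and 'srp_latn' keep their suffix when the Stanza utility name is built, while all other codes have their suffix removed first. A standalone sketch of that branch, with a simplified stand-in for wl_conversion.remove_lang_code_suffixes and hedged demo calls:

```python
# Sketch of the language-code handling shown above. The suffix-stripping helper
# below is a simplified stand-in, not the real wl_conversion implementation.
CODES_KEEPING_SUFFIX = ('zho_cn', 'zho_tw', 'srp_latn')

def remove_lang_code_suffixes(lang):
    # Assumed behavior: drop a trailing region/script suffix such as '_gb'
    return lang.split('_', 1)[0]

def get_stanza_lang_util(lang):
    if lang in CODES_KEEPING_SUFFIX:
        return f'stanza_{lang}'  # keep the suffix for these codes
    return f'stanza_{remove_lang_code_suffixes(lang)}'

print(get_stanza_lang_util('zho_cn'))  # stanza_zho_cn
print(get_stanza_lang_util('eng_gb'))  # stanza_eng (hypothetical code)
print(get_stanza_lang_util('dan'))     # stanza_dan
```
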