Mismatch between pred_tag and root list size #7

Open
faraday opened this issue Nov 29, 2019 · 1 comment

faraday commented Nov 29, 2019

@onurgu Thank you for sharing this project.

In train.py, the following lookup is used to pick the disambiguated root for a word:
first_sentence['roots'][word_idx][pred_tag]

However, a trained model can produce an out-of-range index in pred_tag.

An example:
[{'sentence_length': 4, 'surface_forms': ['Ali', 'ata', 'bakabilir', '.'], 'surface_form_lengths': [3, 3, 9, 1], 'roots': [['Ali'], ['at', 'at', 'ata', 'ata'], ['bak', 'bak'], ['.']], 'root_lengths': [[3], [2, 2, 3, 3], [3, 3], [1]], 'morph_tokens': [[['Noun', 'Prop', 'A3sg', 'Pnon', 'Nom']], [['Noun', 'A3sg', 'Pnon', 'Dat'], ['Verb', 'Pos', 'Opt', 'A3sg'], ['Noun', 'A3sg', 'Pnon', 'Nom'], ['Verb', 'Pos', 'Imp', 'A2sg']], [['Verb', 'Pos^DB', 'Verb', 'Able', 'Aor', 'A3sg'], ['Verb', 'Pos^DB', 'Verb', 'Able', 'Aor^DB', 'Adj', 'Zero']], [['Punc']]], 'morph_token_lengths': [[5], [4, 4, 4, 4], [6, 7], [1]]}]

For the word ata, pred_tag can turn out to be 4. The candidate root list for that word is ['at', 'at', 'ata', 'ata'], which has 4 items, so its valid indices are 0 to 3. pred_tag=4 is therefore out of range and cannot address any item in the list.
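A minimal sketch of the mismatch and one possible guard, using the roots from the example above. The helper `select_root` and its fallback-to-first-candidate policy are hypothetical and are not part of train.py; how an out-of-range prediction should really be handled is a decision for the maintainers.

```python
# Candidate roots per word, copied from the example sentence above.
roots = [['Ali'], ['at', 'at', 'ata', 'ata'], ['bak', 'bak'], ['.']]


def select_root(roots, word_idx, pred_tag, fallback=0):
    """Hypothetical helper (not in train.py): return the root chosen by
    pred_tag, or the candidate at `fallback` if pred_tag is out of range."""
    candidates = roots[word_idx]
    if 0 <= pred_tag < len(candidates):
        return candidates[pred_tag]
    return candidates[fallback]


word_idx, pred_tag = 1, 4  # 'ata' has 4 candidates, so valid indices are 0..3

# The unguarded lookup from the issue fails for pred_tag == 4.
try:
    roots[word_idx][pred_tag]
    raised = False
except IndexError:
    raised = True

# The guarded lookup falls back to the first candidate instead of failing.
guarded = select_root(roots, word_idx, pred_tag)
```

The guard only hides the symptom. If the model can emit an index beyond the number of candidate analyses for a word, the prediction step itself may need to be restricted to the valid candidates.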

This bug is not related to training data size. I can train a model without this problem using a much smaller sample.

Owner

onurgu commented Dec 27, 2019

Hi,

I couldn't reproduce this problem.

When does the condition pred_tag == 4 occur? Does it happen when the training data size is large?

My model gives this output:

Ali ata bakabilir.
Reading script from "tfeatures.scr"
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%

  *****  LEXICON LOOK-UP  *****


LOOKUP STATISTICS (success with different strategies):
strategy 0:     3 times         (75.00 %)
strategy 1:     1 times         (25.00 %)
strategy 2:     0 times         (0.00 %)
strategy 3:     0 times         (0.00 %)
not found:      0 times         (0.00 %)

corpus size:    4 words
execution time: 0 sec
speed:          4 words/sec

  *****  END OF LEXICON LOOK-UP  *****

file processed
file processed
1/1 [==============================] - 2s
{'surface_form_lengths': [3, 3, 9, 1], 'root_lengths': [[3], [2, 2, 3, 3], [3, 3], [1]], 'surface_forms': [u'Ali', u'ata', u'bakabilir', u'.'], 'morph_token_lengths': [[5], [4, 4, 4, 4], [6, 7], [1]], 'morph_tokens': [[[u'Noun', u'Prop', u'A3sg', u'Pnon', u'Nom']], [[u'Noun', u'A3sg', u'Pnon', u'Dat'], [u'Verb', u'Pos', u'Opt', u'A3sg'], [u'Noun', u'A3sg', u'Pnon', u'Nom'], [u'Verb', u'Pos', u'Imp', u'A2sg']], [[u'Verb', u'Pos^DB', u'Verb', u'Able', u'Aor', u'A3sg'], [u'Verb', u'Pos^DB', u'Verb', u'Able', u'Aor^DB', u'Adj', u'Zero']], [[u'Punc']]], 'sentence_length': 4, 'roots': [[u'Ali'], [u'at', u'at', u'ata', u'ata'], [u'bak', u'bak'], [u'.']]}
Ali Ali+Noun+Prop+A3sg+Pnon+Nom
ata ata+Noun+A3sg+Pnon+Nom
bakabilir bak+Verb+Pos^DB+Verb+Able+Aor+A3sg
. .+Punc
