
How to evaluate models? #51

Open
annaproxy opened this issue Jul 19, 2021 · 2 comments
Labels: Discussion: Research and Philosophy · help wanted

Comments

annaproxy (Collaborator) commented Jul 19, 2021

More of a philosophical question.

  • We don't want models with perfect training accuracy: few novel words are generated.
  • We don't want training accuracy that is too low: implausible words are generated (e.g. bvbvmcnv).

Any papers/blog posts on this are welcome!

Sasafrass (Owner) commented
This is a very interesting question.
Some potential avenues of research:

  • Optimize for models with the lowest training accuracy that still generate coherent and phonetically sound words. The soundness of the words could be assessed with some kind of NLP parser. However, this raises the issue that, for instance, (cool new) slang words may not always adhere to (Dutch) phonetic norms.
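As one minimal sketch of such a soundness check (an assumption, not what the repo does): instead of a full NLP parser, a character bigram model trained on a reference lexicon can score how "word-like" a generated string is. The lexicon and the function names below are hypothetical placeholders.

```python
import math
from collections import Counter

def train_bigram_model(lexicon):
    """Count character bigrams (with word-boundary markers) over a lexicon."""
    bigrams, unigrams = Counter(), Counter()
    for word in lexicon:
        chars = ["^"] + list(word) + ["$"]
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    return bigrams, unigrams

def plausibility(word, bigrams, unigrams, alpha=1.0, vocab_size=30):
    """Average add-alpha-smoothed log-probability per bigram; higher = more word-like."""
    chars = ["^"] + list(word) + ["$"]
    logp = 0.0
    for a, b in zip(chars, chars[1:]):
        logp += math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * vocab_size))
    return logp / (len(chars) - 1)

# Toy Dutch-ish lexicon (hypothetical; in practice, use the training word list).
lexicon = ["amsterdam", "rotterdam", "utrecht", "eindhoven", "groningen", "haarlem"]
bigrams, unigrams = train_bigram_model(lexicon)

print(plausibility("amsterven", bigrams, unigrams))  # scores higher: familiar bigrams
print(plausibility("bvbvmcnv", bigrams, unigrams))   # scores lower: unseen bigrams
```

A threshold on this score could then filter out strings like `bvbvmcnv`, though, as noted above, it would also penalize legitimate slang or loanwords whose phonotactics differ from the lexicon.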

annaproxy (Collaborator, Author) commented Jul 19, 2021

Under the assumption that the model only gets better on the training data the longer you train, non-converged models are one way to still get a certain level of 'babbling' behaviour (term borrowed from https://arxiv.org/pdf/2010.04637.pdf).

One easy way would be to check every (few) epochs whether the model is memorizing too many words: simply sample a large number of words from the model (at various temperatures), and some fraction n% of these words will be memorized from the training data.

We would have to find a good value for n experimentally. The n% can also serve as a sort of proxy for full-word accuracy, though we could instead set a threshold accuracy at the character level.
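The check described above could be sketched roughly as follows. The `model_sample(temperature)` callable is a hypothetical interface, not the repo's actual API; it is assumed to return one generated word per call.

```python
import random

def memorization_rate(model_sample, train_words, num_samples=1000,
                      temperatures=(0.5, 1.0, 1.5)):
    """Estimate n: the fraction of sampled words that appear verbatim in the training data.

    `model_sample(temperature)` is a hypothetical callable returning one generated word.
    """
    train_set = set(train_words)
    memorized = 0
    for _ in range(num_samples):
        word = model_sample(random.choice(temperatures))
        if word in train_set:
            memorized += 1
    return memorized / num_samples

# Demo with a deterministic stub model that alternates between a memorized
# and a novel word, so the rate is exactly 0.5.
from itertools import cycle
train_words = ["amsterdam", "utrecht"]
stub = cycle(["amsterdam", "amstelhoven"])  # one memorized, one novel
rate = memorization_rate(lambda t: next(stub), train_words, num_samples=100)
print(rate)  # 0.5
```

Run every few epochs, this gives a curve of n over training; stopping (or selecting a checkpoint) when n crosses a chosen threshold would implement the early-stopping idea above.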

Edit: A quick look at n for an upcoming model of plaatsnamen (Dutch place names) :)
[screenshot attached]
