
Generating Pokemon names with a character-level language model RNN

This is a letter-based language model for inventing words using an RNN. Here, I will walk through how this sort of language model works, why it is an interesting topic, and then use it to generate new Pokemon names.

Why RNNs?

A great deal of the data we care about - music, language, anything with a time dimension - comes in the form of sequences. Unfortunately, standard fully connected neural networks do not handle sequences well. They expect input and output data of a fixed size, which is a poor fit for words or sentences that vary in length. While one could pad every sequence to a maximum size, this would still require a larger network than necessary, an inefficiency we'd like to avoid.

The more serious limitation is that even if you did feed a sequence into a standard neural network, it would not be able to generalize what it learns at one input position to other positions. For example, if the sentence 'Harry Potter took off his glasses' helped it learn that 'Harry' is a name, this would not necessarily improve its ability to recognize that 'Harry' is a name in the sequence 'She looked up at Harry', because there 'Harry' occurs at a different position.

Recurrent Neural Networks (RNNs) are a class of neural networks that share parameters across input positions. This allows what is learned at one time-step to be generalized to others. They also maintain an internal state (which we can think of as "memory") that is passed forward between time-steps.

While there are many variations on RNNs, this is an implementation of a vanilla RNN, the simplest version.

Language Models

Broadly speaking, language models are built to predict the probability of a sequence. This can be used to:

  • Generate new plausible sequences
    • E.g., inventing words, writing sentences, writing music
  • Pick the most plausible sequence given a few options
    • E.g., helping a handwriting recognition system decide that an unreadable letter in a three-letter word between a and d is probably n.
  • Suggest likely ways to complete a partial sequence
    • E.g., autocomplete.

In our case, we will train a model to predict P(word), given the sequence of letters within it. At each time step, the RNN attempts to predict the next letter, given the previous letter and its internal memory state.

For example, in the space of English words, we would expect P(a|ca) to be low, since "caa" is not a common letter combination in English. By the same token, P(r|ca) should be higher, because many English words contain "car": "car", "carry", "carnival", and so on. This is the sort of thing we want our model to learn.

If we input a word like "cat" into our language model, it will output the probability of each letter given the previous ones. So we will have values for: P(c), P(a|c), P(t|ca), P(end|cat). Here, end is a special tag that tells the language model to stop adding letters. It is important when words can be of variable length.

Since the basic rule of conditional probability is that:

P(A, B) = P(A) · P(B|A)

it follows that:

P(word) = P(y<1>) · P(y<2>|y<1>) · P(y<3>|y<1>y<2>) · ... · P(end|whole word)

For "cat", that is P(cat) = P(c) · P(a|c) · P(t|ca) · P(end|cat). In other words, we can compute P(word) by multiplying together the conditional probabilities of each letter.
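
As a quick numeric illustration, here is a minimal Python sketch of this product. The probability values are made up for the example; a real model would output them at each time-step.

# Hypothetical conditional probabilities for the word "cat".
p_c_given_start = 0.05   # P(c)
p_a_given_c = 0.20       # P(a|c)
p_t_given_ca = 0.15      # P(t|ca)
p_end_given_cat = 0.30   # P(end|cat)

# P(cat) is the product of the per-letter conditionals.
p_cat = p_c_given_start * p_a_given_c * p_t_given_ca * p_end_given_cat
print(round(p_cat, 6))   # 0.00045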

The Architecture

Our vanilla RNN consists of one simple "cell". At each time-step t we feed the previous letter y<t-1> and the previous cell activation a<t-1> into the cell, and it outputs a probability distribution for the current letter, ŷ<t>. Since there will be no previous letter or activation for the first letter of a word, we'll simply feed in a vector of zeros for both.
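
To make the cell's computation concrete, here is a minimal NumPy sketch of a single forward step. The parameter names (Wax, Waa, Wya, ba, by) and the vocabulary size are illustrative assumptions, not the exact names or values used in this repo; the hidden size of 80 matches the default node count mentioned below.

import numpy as np

vocab_size, hidden_size = 27, 80   # e.g. 26 letters plus an end tag; 80 hidden nodes
rng = np.random.default_rng(0)
Wax = rng.normal(scale=0.01, size=(hidden_size, vocab_size))   # input -> hidden
Waa = rng.normal(scale=0.01, size=(hidden_size, hidden_size))  # hidden -> hidden
Wya = rng.normal(scale=0.01, size=(vocab_size, hidden_size))   # hidden -> output
ba = np.zeros((hidden_size, 1))
by = np.zeros((vocab_size, 1))

def rnn_step(y_prev, a_prev):
    """One time-step: previous one-hot letter + previous activation -> distribution over the next letter."""
    a_t = np.tanh(Wax @ y_prev + Waa @ a_prev + ba)   # new cell activation
    logits = Wya @ a_t + by
    y_hat = np.exp(logits) / np.sum(np.exp(logits))   # softmax over the vocabulary
    return y_hat, a_t

# For the first letter of a word, both inputs are vectors of zeros.
y_hat_1, a_1 = rnn_step(np.zeros((vocab_size, 1)), np.zeros((hidden_size, 1)))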

The Loss Function

To compute the loss for a given word, we first compute the loss at each time-step t (each letter). We use the cross-entropy loss commonly paired with softmax outputs, which considers only the probability assigned to the "correct" letter:

L<t>(ŷ<t>, y<t>) = -Σ_i y_i<t> · log(ŷ_i<t>)

Since y<t> is a one-hot letter vector with zero entries in all but one index, and anything times zero is zero, only the index where y_i<t> = 1 counts toward the sum.

We can then compute the loss for a word by simply summing its per-letter losses:

L = Σ_t L<t>(ŷ<t>, y<t>)
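
As a sketch, the same computation in NumPy, assuming y_hats is a list of predicted distributions and ys the matching one-hot target letters (both hypothetical names, not identifiers from this repo):

import numpy as np

def letter_loss(y_hat, y):
    # Cross-entropy: only the index where y == 1 contributes to the sum.
    return -np.sum(y * np.log(y_hat))

def word_loss(y_hats, ys):
    # The loss for a word is the sum of its per-letter losses.
    return sum(letter_loss(y_hat, y) for y_hat, y in zip(y_hats, ys))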

Generating Words

Once we have trained the model, our goal is to invent new words in a similar style to the training words. To sample a word from the model, we pick each letter y<t> at random, weighted by the probability distribution ŷ<t> that the model outputs. The chosen y<t> is then fed into the model at the next time-step. This continues until the end tag is chosen.
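
A minimal sampling loop might look like the sketch below, reusing the hypothetical rnn_step from the architecture sketch above. idx_to_char, end_idx, and max_len are illustrative assumptions about the setup, not names from this repo.

import numpy as np

def sample_word(rnn_step, vocab_size, hidden_size, idx_to_char, end_idx, max_len=20):
    y_prev = np.zeros((vocab_size, 1))    # no previous letter at the first time-step
    a_prev = np.zeros((hidden_size, 1))   # no previous activation either
    letters = []
    for _ in range(max_len):
        y_hat, a_prev = rnn_step(y_prev, a_prev)
        # Pick the next letter at random, weighted by the predicted distribution.
        idx = np.random.choice(vocab_size, p=y_hat.ravel())
        if idx == end_idx:                # the end tag stops the word
            break
        letters.append(idx_to_char[idx])
        y_prev = np.zeros((vocab_size, 1))
        y_prev[idx] = 1                   # feed the chosen letter back in
    return "".join(letters)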

Generating Pokemon Names

I used this code to train a language model on a list of 796 Pokemon names (see word_lists/pokemon_names.txt), for the purpose of generating new Pokemon names. Here are a few examples of names it learned to generate:

Generated Pokemon names:

  • Tintorn
  • Fyreion
  • Benelon
  • Rantio
  • Zoreion
  • Wireoon
  • Sirg
  • Qindlor
  • Fergai
  • Siltion

For comparison, here are some real Pokemon names:

  • Yanmega
  • Leafeon
  • Glaceon
  • Gliscor
  • Mamoswine
  • Gallade
  • Palpitoad
  • Seismitoad
  • Throh
  • Sawk

You can find this model in the models folder. To load and generate words with it, just type python task.py --model-dir models/pokemon_names_model --num-samples [NUMBER OF WORDS YOU WANT TO GENERATE]. This will save the generated words in output/sample.txt.

Overall, I am quite happy with the quality of the generated names. I can even imagine what kind of Pokemon Fyreion, Rantio, or Tintorn might be. And Sirg is just plain cool.

However, when I generated 500 Pokemon names and compared them to the real list, I noticed that the generated names did not have as much variation. For example, most of the generated names ended in n, and almost all were exactly 7 characters long. It is common for real Pokemon names to have those qualities, but not nearly to this extent. If I were to put more time into this project, I would focus on addressing this loss of variation.

Training Your Own

To train your own character-level language model using this code, you only need a list of words stored in a .txt file. There should be one word per line.
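
To be concrete about the format, here is a hypothetical snippet that reads such a file; the path is just an example, and the repo's own loading code may differ.

# word_lists/my_words.txt contains one word per line, e.g.:
#   charizard
#   bulbasaur
#   squirtle
with open("word_lists/my_words.txt") as f:
    words = [line.strip() for line in f if line.strip()]
print(len(words), "training words loaded")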

To train:

  1. Download this code and navigate to the project directory
  2. Run python task.py --train True --data-dir [PATH_TO_YOUR_WORD_LIST]
  3. The model will be saved into a checkpoints directory by default. You can save it somewhere else by adding --save-dir [YOUR_SAVE_LOCATION] to the train command.

After some experimentation, I found that the following hyperparameters worked well:

nodes = 80
learning_rate = 0.001
num_epochs = 150
optimizer = Adam

These are set as the defaults, but if you wish to change them, there are command line arguments to set them all, except for the Adam optimizer, which you must switch in the code.

In addition to Pokemon names, there is training data for dinosaur species names, and for Tolkien-style Elvish words in the word_lists folder if you want to give those a try.

Acknowledgements
