
Maximum-entropy IRTGs

In an ordinary IRTG, we can equip each rule with a weight; this lets us represent PCFG-style probability models. Because of the built-in locality restrictions and independence assumptions, however, such models are too limited for many applications.

Alto therefore offers you the option to use maximum-entropy IRTGs, in which probabilities are defined via feature functions in the context of a log-linear model. Together with a vector of weights for these feature functions, a maxent IRTG represents a conditional probability distribution over derivation trees given the input(s).
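Concretely, with feature functions f1, ..., fn and weights w1, ..., wn, the probability of a derivation tree t for an input x has the usual log-linear form (a sketch in our own notation; D(x) stands for the set of derivation trees the grammar assigns to x):

p(t | x) = exp(w1·f1(t) + ... + wn·fn(t)) / Z(x)

where Z(x) = sum over all t' in D(x) of exp(w1·f1(t') + ... + wn·fn(t')).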

In this tutorial, we talk briefly about how to define, train, and use maximum-entropy IRTGs.

Representing max-ent IRTGs

Let's start by looking at an example grammar. Load the grammar maxent-cfg.irtg from the examples archive into Alto. At first glance, this grammar looks like an ordinary IRTG over the string algebra, representing a context-free grammar:

[Screenshot: maxent-cfg.irtg loaded in Alto's grammar window]

However, a closer look at the grammar file reveals that unlike the IRTGs we have worked with so far, maxent-cfg.irtg defines a number of feature functions (after the interpretation block and before the first grammar rule):

interpretation i: de.up.ling.irtg.algebra.StringAlgebra

feature f1: de.up.ling.irtg.maxent.ChildOfFeature('VP','PP')
feature f2: de.up.ling.irtg.maxent.ChildOfFeature('N','PP')

S! -> r1(NP,VP)
  [i] *(?1,?2)

VP -> r4(V,NP)
  [i] *(?1,?2)

This means that the log-linear probability distribution is defined using two feature functions called f1 and f2. Each of them is an instance of the class ChildOfFeature. f1 counts how many times the derivation tree uses a rule that has parent nonterminal VP and a child nonterminal PP, while f2 counts rules with parent N and a PP child. In this way, the model can learn preferences for PP attachment.
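For instance, rules of the following shapes are the ones on which f1 and f2 fire (the rule names r5 and r7 are hypothetical; the actual rules in maxent-cfg.irtg may be named differently):

VP -> r5(VP,PP)
  [i] *(?1,?2)

N -> r7(N,PP)
  [i] *(?1,?2)

Every use of the first rule in a derivation tree contributes 1 to f1 (attachment of the PP to the verb phrase), and every use of the second contributes 1 to f2 (attachment of the PP to the noun).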

Training and parsing

If you look into the "Tools" window in Alto, you will see that the "Maximum Entropy" item, which has been grayed out until now, is available. That's because Alto recognized maxent-cfg.irtg as a maximum-entropy IRTG (from the presence of the feature declarations), and thus made the maxent-specific functionality available.

You can train weights for the features by selecting Tools -> Maximum Entropy -> Train and choosing a suitable annotated corpus (e.g., pcfg-annotated-training.txt from the examples distribution). After a short moment, this will open a new window showing weights for f1 and f2:

[Screenshot: trained feature weights for f1 and f2]

You can now go back to the grammar window for maxent-cfg.irtg and select Tools -> Parse to parse a sentence (e.g., "john watches the woman with the telescope"). Then, from the chart window, select Tools -> Show Language. This opens the language viewer for the two parse trees of the sentence. Observe that the weights are no longer necessarily numbers between zero and one; here they are 64 and 0.043:

[Screenshot: language viewer showing the two parse trees with their scores]

These are the scores of the two derivations, i.e. exp(w1·f1(t) + ... + wn·fn(t)). The conditional probabilities are obtained by normalizing these scores so that they sum to one; for instance, the probability of the derivation with VP attachment is 64/64.043 = 0.9993.

Note also that you can save the feature weights to a file for later use with Tools -> Maximum Entropy -> Save Weights.

Feature functions

Above, we described f1 and f2 as returning counts of rules of a certain type in the derivation tree. Technically, the feature functions we can use in Alto must be local -- that is, the value of the feature function on a derivation tree t is the sum of the values of a "local" version of the feature function over the different nodes of t. In the example grammar, f1 can be seen as a function that looks at an individual node u of t and returns 1 if the rule that was used at u has parent VP and a child PP. Thus if we sum its value over all nodes of t, we get the rule-counting behavior described above. This locality assumption is standard in log-linear parsing, because dynamic programming is much harder for nonlocal features.
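In symbols (our own notation): if f_loc is the local version of a feature f and rule(u) denotes the grammar rule used at node u of the derivation tree t, then

f(t) = sum over all nodes u of t of f_loc(rule(u)).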

Technically, feature functions in Alto are functions that look at a chart entry (i.e., a rule in the RTG that represents the chart) and return a double value. Alto comes with two predefined feature functions that you can use in your grammars:

  • ChildOfFeature(A,B) returns 1 for rules whose parent is A and that have a child with nonterminal B, and 0 otherwise.
  • RuleNameFeature(r) returns 1 for rules whose label (= terminal symbol) is r, and 0 otherwise. It thus counts occurrences of a specific rule, as shown in the example below.
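
For example, following the syntax of the f1 and f2 declarations above, a feature that counts uses of rule r4 from the grammar excerpt could be declared like this (f3 is just an illustrative name):

feature f3: de.up.ling.irtg.maxent.RuleNameFeature('r4')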

In addition, you can also define your own feature functions. To do this, implement a subclass of de.up.ling.irtg.maxent.FeatureFunction and put it on your classpath when you start Alto. FeatureFunction contains a single abstract method which you need to implement:

public abstract Double evaluate(Rule rule, TreeAutomaton<State> chart, MaximumEntropyIrtg irtg);

Here chart is the parse chart (an object of class TreeAutomaton), and rule is the rule that you want to evaluate.

Your class can have a constructor that takes an arbitrary number of parameters of class String (or none). When you define the grammar, the individual features are constructed as instances of your class with the given arguments. So for instance, f1 above was an instance of ChildOfFeature to whose constructor we passed VP and PP as the two String arguments it expected.
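As a minimal sketch, here is a custom feature whose local value is 1 for every rule, so that its summed value counts the number of rules in the derivation tree (i.e., the tree's size). The class name and the import paths for Rule and TreeAutomaton are assumptions not spelled out in this tutorial, and whether FeatureFunction takes type parameters may depend on your Alto version; only the evaluate signature is taken from above:

import de.up.ling.irtg.automata.Rule;
import de.up.ling.irtg.automata.TreeAutomaton;
import de.up.ling.irtg.maxent.FeatureFunction;
import de.up.ling.irtg.maxent.MaximumEntropyIrtg;

// Hypothetical feature: its local value is 1 for every rule, so summed over a
// derivation tree it measures the size of that tree.
public class DerivationSizeFeature extends FeatureFunction {

    // Constructors may take String arguments passed in from the grammar file;
    // this feature does not need any.
    public DerivationSizeFeature() {
    }

    @Override
    public Double evaluate(Rule rule, TreeAutomaton chart, MaximumEntropyIrtg irtg) {
        // Local value of the feature for a single chart rule.
        return 1.0;
    }
}

In the grammar file, such a feature would then be declared just like f1 and f2, with the fully qualified class name after the colon.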

Notes

We have so far tried maximum-entropy IRTGs only for relatively small grammars and relatively modest numbers of features. The code may contain bugs, and may not be as efficient as it could be. If you have suggestions for improvements, or would simply like to share your experiences, please get in touch.
