Training
This section of the wiki explains how to estimate rule probabilities for IRTGs in order to obtain generative models. Alto offers two main techniques for this task: maximum likelihood estimation from explicitly given gold derivation trees, which simply estimates the probability of each rule from the frequency of its use, and approximate maximum likelihood estimation using an expectation maximization (EM) algorithm. Both algorithms are available via code and via the GUI, and both set the rule weights of the derivation tree automaton to the newly estimated values.
For derivation trees to be unambiguous with respect to the rules of the automaton that were used to generate them, each rule of the automaton must use a different terminal symbol (label). Enforcing this condition is unproblematic for IRTGs, because the way a label contributes to the language is determined only by its homomorphic images. Therefore, if two rules use the same label, the label of one of them can be replaced with a new, unique label, which is then associated with the same homomorphic images as the original one. This transformation must be applied before estimating probabilities from parse charts.
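The relabeling step can be sketched as follows. This is only an illustration on plain strings, not Alto code: the class `UniqueLabels`, the method `relabel`, and the `_1`, `_2` suffix scheme are assumptions made for this example. In a real IRTG, each fresh label would also have to be mapped to the same homomorphic images as the label it replaces.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of Alto: makes rule labels pairwise distinct. */
final class UniqueLabels {
    /**
     * Returns a copy of the given labels (one per rule) in which every
     * repeated label is replaced by a fresh label that occurs nowhere else.
     */
    static List<String> relabel(List<String> ruleLabels) {
        Set<String> used = new HashSet<>(ruleLabels); // names that are already taken
        Set<String> seen = new HashSet<>();           // labels already assigned to a rule
        List<String> result = new ArrayList<>();
        for (String label : ruleLabels) {
            if (seen.add(label)) {
                result.add(label); // the first rule with this label keeps it
                continue;
            }
            int suffix = 1;
            String fresh = label + "_" + suffix;
            while (!used.add(fresh)) { // add returns false if the name is taken
                suffix++;
                fresh = label + "_" + suffix;
            }
            seen.add(fresh);
            result.add(fresh);
        }
        return result;
    }
}
```

For example, the labels `r1, r2, r1, r1` become `r1, r2, r1_1, r1_2`.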
Maximum Likelihood training requires an annotated corpus file like the one shown here:
/// IRTG annotated corpus file, v1.0
///
/// Alto Lab corpus #2: PTB Section 00, <= 100 characters
/// (exported on 2017-03-30 16:15:22)
///
/// interpretation string: class de.up.ling.irtg.algebra.WideStringAlgebra
/// interpretation tree: class de.up.ling.irtg.algebra.TreeWithAritiesAlgebra
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
S(NP-SBJ(NP(NNP(Pierre),NNP(Vinken)),','(','),ADJP(NP(CD('61'),NNS(years)),JJ(old)),','(',')),VP(MD(will),VP(VB(join),NP(DT(the),NN(board)),PP-CLR(IN(as),NP(DT(a),JJ(nonexecutive),NN(director))),NP-TMP(NNP('Nov.'),CD('29')))),'.'('.'))
r28(r10(r3(r1,r2),r4,r9(r7(r5,r6),r8),r4),r26(r11,r25(r12,r15(r13,r14),r21(r16,r20(r17,r18,r19)),r24(r22,r23))),r27)
In this format, every instance is followed directly by a tree that expresses a derivation tree of the IRTG to be trained. If such a corpus is available, training through the GUI works as follows: load the IRTG to be trained in the Alto GUI, select "Maximum Likelihood Training" under "Tools", and then select the annotated corpus on which the grammar is to be trained. The trained grammar can then be saved via "File->Save IRTG".
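To make the role of the derivation tree line concrete, the following sketch counts how often each rule label occurs in such a term. It is not how Alto reads corpora. The class `DerivationCounts` is hypothetical, and the tokenizer assumes that labels are unquoted and contain no parentheses, commas, or whitespace, as in the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Alto: counts rule uses in a derivation tree term. */
final class DerivationCounts {
    // Assumption: a label is any run of characters other than '(', ')', ',' and whitespace.
    private static final Pattern LABEL = Pattern.compile("[^(),\\s]+");

    /** Maps each rule label to the number of times it occurs in the term. */
    static Map<String, Integer> count(String derivationTerm) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher matcher = LABEL.matcher(derivationTerm);
        while (matcher.find()) {
            counts.merge(matcher.group(), 1, Integer::sum);
        }
        return counts;
    }
}
```

Applied to the derivation tree in the corpus above, this yields 28 distinct labels; `r4`, the rule used for both commas, is counted twice and every other rule once.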
In code, any IRTG offers the method trainML(Corpus trainingData). This method works if each instance of the training data is associated with a derivation tree and if each rule of the automaton that underlies the IRTG has a unique label. If the former condition is not met, a NullPointerException is thrown. If the latter condition is not met, an UnsupportedOperationException is thrown.
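Conceptually, frequency-based estimation turns rule counts into probabilities. The sketch below is not the implementation behind trainML. It assumes, as is standard for probabilistic grammars, that each count is divided by the total count of all rules with the same parent state. The class `RelativeFrequency` and the `parentState` map are hypothetical names introduced for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration, not Alto code: relative-frequency estimation. */
final class RelativeFrequency {
    /**
     * @param ruleCounts  how often each rule label occurs in the gold derivations
     * @param parentState the parent (left-hand-side) state of each rule label
     * @return for each rule, its count divided by the total count of all rules
     *         that share its parent state
     */
    static Map<String, Double> estimate(Map<String, Integer> ruleCounts,
                                        Map<String, String> parentState) {
        Map<String, Integer> totalPerState = new HashMap<>();
        for (Map.Entry<String, Integer> e : ruleCounts.entrySet()) {
            String state = parentState.get(e.getKey());
            if (state == null) {
                throw new IllegalArgumentException("no parent state for rule " + e.getKey());
            }
            totalPerState.merge(state, e.getValue(), Integer::sum);
        }
        Map<String, Double> probabilities = new HashMap<>();
        for (Map.Entry<String, Integer> e : ruleCounts.entrySet()) {
            int total = totalPerState.get(parentState.get(e.getKey()));
            // A state whose rules were never used gets probability 0 for all of them.
            probabilities.put(e.getKey(), total == 0 ? 0.0 : e.getValue() / (double) total);
        }
        return probabilities;
    }
}
```

With counts `r1 = 3` and `r2 = 1` for two rules of state `NP`, and `r3 = 4` for the only rule of state `S`, the estimates are 0.75, 0.25, and 1.0.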
Expectation Maximization does not require explicitly given derivations. It is an iterative algorithm that uses numerical methods to approximate a local optimum of the rule weights with respect to the likelihood of the corpus.
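A single EM iteration can be illustrated on a toy setting in which the candidate derivations of every instance are listed explicitly. This is a simplification made for the example: Alto works on parse charts rather than enumerated derivations, and the class `EmStep` and its parameters are hypothetical. The sketch also assumes that every rule label has an entry in both `probs` and `parentState`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration, not Alto code: one EM iteration over enumerated derivations. */
final class EmStep {
    /**
     * @param corpus      for each instance, its candidate derivations as lists of rule labels
     * @param probs       current probability of every rule label
     * @param parentState parent (left-hand-side) state of every rule label
     * @return the re-estimated rule probabilities
     */
    static Map<String, Double> iterate(List<List<List<String>>> corpus,
                                       Map<String, Double> probs,
                                       Map<String, String> parentState) {
        Map<String, Double> expected = new HashMap<>();
        for (List<List<String>> derivations : corpus) {
            // E-step: weight each derivation by the product of its rule probabilities.
            double[] score = new double[derivations.size()];
            double total = 0.0;
            for (int i = 0; i < score.length; i++) {
                score[i] = 1.0;
                for (String rule : derivations.get(i)) {
                    score[i] *= probs.get(rule);
                }
                total += score[i];
            }
            if (total == 0.0) {
                continue; // no derivation with positive probability: nothing to count
            }
            // Add each rule use, weighted by the posterior of its derivation.
            for (int i = 0; i < score.length; i++) {
                for (String rule : derivations.get(i)) {
                    expected.merge(rule, score[i] / total, Double::sum);
                }
            }
        }
        // M-step: normalize expected counts over rules with the same parent state.
        Map<String, Double> totalPerState = new HashMap<>();
        for (Map.Entry<String, Double> e : expected.entrySet()) {
            totalPerState.merge(parentState.get(e.getKey()), e.getValue(), Double::sum);
        }
        Map<String, Double> updated = new HashMap<>();
        for (String rule : probs.keySet()) {
            double stateTotal = totalPerState.getOrDefault(parentState.get(rule), 0.0);
            updated.put(rule, stateTotal == 0.0
                    ? probs.get(rule) // state never used: keep the old probability
                    : expected.getOrDefault(rule, 0.0) / stateTotal);
        }
        return updated;
    }
}
```

Suppose rules `a` and `b` share a parent state and both start at 0.5. If one instance can only be derived with `a` and a second instance with either `a` or `b`, the expected counts are 1.5 and 0.5, so one iteration moves the probabilities to 0.75 and 0.25.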
In order to use Expectation Maximization training from the GUI, it is necessary to first load the IRTG. Then EM can be run on an unannotated corpus which has a form like:
/// IRTG unannotated corpus file, v1.0
///
/// Alto Lab corpus #2: PTB Section 00, <= 100 characters
/// (exported on 2017-03-31 11:47:04)
///
/// interpretation i: class de.up.ling.irtg.algebra.StringAlgebra
the woman watches the woman
Run EM training by selecting "Tools->EM Training" and then selecting a corpus to train on. Training might take some time, depending on both grammar and corpus size. The resulting trained grammar can be stored by using "File->Save IRTG".
The method trainEM(Corpus trainingData, int iterations, double threshold, ProgressListener listener) in the IRTG class trains the IRTG using expectation maximization. "iterations" is the maximum number of iterations for which EM training runs. "threshold" is the minimum amount by which the log-likelihood must improve in each iteration; training stops once an iteration improves it by less than this amount. The ProgressListener is informed about the progress of each training iteration and is ignored if it is null. Either "iterations" or "threshold" must be greater than 0. If iterations is less than or equal to 0, it is set to Integer.MAX_VALUE. If threshold is less than 0, it is set to Double.POSITIVE_INFINITY. The corpus must have attached charts.
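The interplay of the two stopping parameters can be sketched with a generic loop. This is an assumption about how such a stopping rule is typically written, not the code inside trainEM, and it does not reproduce Alto's handling of non-positive arguments. The class `EmLoop` is hypothetical, and `emIteration` stands for any routine that performs one EM iteration and returns the resulting corpus log-likelihood.

```java
import java.util.function.DoubleSupplier;

/** Hypothetical illustration, not Alto code: iteration cap plus improvement threshold. */
final class EmLoop {
    /**
     * Calls emIteration until maxIterations is reached or until an iteration
     * improves the log-likelihood by less than threshold.
     *
     * @return the number of iterations that were performed
     */
    static int run(DoubleSupplier emIteration, int maxIterations, double threshold) {
        double previous = Double.NEGATIVE_INFINITY; // so the first iteration never stops early
        int done = 0;
        while (done < maxIterations) {
            double current = emIteration.getAsDouble(); // one EM iteration
            done++;
            if (current - previous < threshold) {
                break; // improvement too small: treat training as converged
            }
            previous = current;
        }
        return done;
    }
}
```

If successive iterations yield log-likelihoods of -10, -5, -4.9, and -4.89, a threshold of 0.5 stops the loop after the third iteration, because the improvement from -5 to -4.9 is below the threshold. With the same threshold and a cap of 2 iterations, the loop stops after the second.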