Maximum Entropy Model LING 572 Fei Xia 02/07-02/09/06
History
• The concept of maximum entropy can be traced back along multiple threads to Biblical times.
• Introduced to NLP by Berger et al. (1996).
• Used in many NLP tasks: MT, tagging, parsing, PP attachment, LM, …

Outline
• Modeling: intuition, basic concepts, …
• Parameter training
• Feature selection
• Case study

Reference papers
• (Ratnaparkhi, 1997)
• (Ratnaparkhi, 1996)
• (Berger et al., 1996)
• (Klein and Manning, 2003)
These papers use different notations.

Modeling

The basic idea
• Goal: estimate p
• Choose p with maximum entropy (or “uncertainty”) subject to the constraints (or “evidence”).

$$H(p) = -\sum_{x \in A \times B} p(x) \log p(x)$$

$$x = (a, b), \text{ where } a \in A \wedge b \in B$$

Setting
• From training data, collect (a, b) pairs:
  – a: thing to be predicted (e.g., a class in a classification problem)
  – b: the context
  – Ex: POS tagging:
    • a = NN
    • b = the words in a window and the previous two tags
• Learn the probability of each (a, b): p(a, b)

Features in POS tagging (Ratnaparkhi, 1996)
[Figure: feature table, labeled with the context (a.k.a. history) and the allowable classes.]

Maximum Entropy
• Why maximum entropy?
• Maximize entropy = minimize commitment
• Model all that is known and assume nothing about what is unknown.
  – Model all that is known: satisfy a set of constraints that must hold
  – Assume nothing about what is unknown: choose the most “uniform” distribution, i.e., choose the one with maximum entropy

Ex1: Coin-flip example (Klein & Manning 2003)
• Toss a coin: p(H) = p1, p(T) = p2.
• Constraint: p1 + p2 = 1
• Question: what’s your estimate of p = (p1, p2)?
• Answer: choose the p that maximizes H(p)

$$H(p) = -\sum_{x} p(x) \log p(x)$$

[Figure: H plotted against p1, with p1 = 0.3 marked.]

Coin-flip example (cont)
[Figure: H plotted over p1 and p2, with the constraints p1 + p2 = 1 and p1 + p2 = 1.0, p1 = 0.3 marked.]

Ex2: An MT example (Berger et al., 1996)
Possible translation for the word “in” is:
Constraint:
Intuitive answer:

An MT example (cont)
Constraints:
Intuitive answer:

An MT example (cont)
Constraints:
Intuitive answer: ??
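Returning to the coin-flip example (Ex1), the claim that the unconstrained answer is the uniform distribution can be checked numerically. The sketch below is only an illustration: the function names and the grid search are assumptions, and log base 2 is chosen here because the slides do not fix a base.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy H(p) = -sum_x p(x) log p(x).

    Terms with p(x) = 0 are skipped, following the convention 0 log 0 = 0.
    The base (2, i.e. bits) is an assumption of this sketch.
    """
    return -sum(p * math.log(p, base) for p in probs if p > 0.0)

def coin_entropy(p1):
    """Entropy of a coin with p(H) = p1 and p(T) = 1 - p1."""
    return entropy([p1, 1.0 - p1])

# Search a grid of p1 values that all satisfy the constraint p1 + p2 = 1.
grid = [i / 100 for i in range(101)]
best_p1 = max(grid, key=coin_entropy)

print(best_p1)            # the entropy-maximizing p1 on the grid
print(coin_entropy(0.3))  # entropy when p1 is additionally fixed to 0.3
```

With only the constraint p1 + p2 = 1, the maximum lies at p1 = 0.5. Once p1 = 0.3 is imposed as a further constraint, only one distribution remains, and its entropy is lower than the unconstrained maximum.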
Ex3: POS tagging (Klein and Manning, 2003)

Ex3 (cont)

Ex4: overlapping features (Klein and Manning, 2003)

Modeling the problem
• Objective function: H(p)
• Goal: Among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p).

$$p^* = \arg\max_{p \in P} H(p)$$

• Question: How to represent constraints?

Features
• A feature (a.k.a. feature function, indicator function) is a binary-valued function on events:

$$f_j : \mathcal{E} \to \{0, 1\}, \quad \mathcal{E} = A \times B$$

• A: the set of possible classes (e.g., tags in POS tagging)
• B: space of contexts (e.g., neighboring words/tags in POS tagging)
• Ex:

$$f_j(a, b) = \begin{cases} 1 & \text{if } a = \text{DET} \ \&\ \text{curWord}(b) = \text{“that”} \\ 0 & \text{o.w.} \end{cases}$$

Some notations
• Finite training sample of events: $S$
• Observed probability of x in S: $\tilde{p}(x)$
• The model p’s probability of x: $p(x)$
• The jth feature: $f_j$
• Observed expectation of $f_j$ (empirical count of $f_j$): $E_{\tilde{p}} f_j = \sum_x \tilde{p}(x) f_j(x)$
• Model expectation of $f_j$: $E_p f_j = \sum_x p(x) f_j(x)$

Constraints
• Model’s feature expectation = observed feature expectation

$$E_p f_j = E_{\tilde{p}} f_j$$

• How to calculate $E_{\tilde{p}} f_j$?

$$E_{\tilde{p}} f_j = \sum_x \tilde{p}(x) f_j(x) = \frac{1}{N} \sum_{i=1}^{N} f_j(x_i)$$

$$f_j(a, b) = \begin{cases} 1 & \text{if } a = \text{DET} \ \&\ \text{curWord}(b) = \text{“that”} \\ 0 & \text{o.w.} \end{cases}$$

[Figure: training data and the observed events extracted from it.]

Restating the problem
The task: find p* s.t.

$$p^* = \arg\max_{p \in P} H(p), \quad \text{where } P = \{ p \mid E_p f_j = E_{\tilde{p}} f_j, \ j \in \{1, ..., k\} \}$$

Objective function: $-H(p)$
Constraints:

$$\{ E_p f_j = E_{\tilde{p}} f_j = d_j, \ j \in \{1, ..., k\} \}$$

$$\sum_x p(x) = 1$$

Add a feature $f_0$:

$$\forall a, b: \ f_0(a, b) = 1, \qquad E_p f_0 = E_{\tilde{p}} f_0 = 1$$

Questions
• Is P empty?
• Does p* exist?
• Is p* unique?
• What is the form of p*?
• How to find p*?

What is the form of p*? (Ratnaparkhi, 1997)

$$P = \{ p \mid E_p f_j = E_{\tilde{p}} f_j, \ j \in \{1, ..., k\} \}$$

$$Q = \{ p \mid p(x) = \pi \prod_{j=1}^{k} \alpha_j^{f_j(x)}, \ \alpha_j > 0 \}$$

Theorem: if $p^* \in P \cap Q$ then $p^* = \arg\max_{p \in P} H(p)$. Furthermore, p* is unique.
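The observed expectation $E_{\tilde{p}} f_j$ from the Constraints slide is just the fraction of training events on which the feature fires. A minimal sketch, assuming a made-up four-event sample; the dictionary layout for contexts and all names below are hypothetical, not taken from the slides:

```python
def f_det_that(a, b):
    """Indicator feature from the slides: 1 if a = DET and curWord(b) = "that", else 0.

    The context b is represented as a dict with a "cur_word" key; this
    representation is an assumption of the sketch.
    """
    return 1 if a == "DET" and b["cur_word"] == "that" else 0

# Hypothetical training sample S: a list of observed (a, b) events.
events = [
    ("DET", {"cur_word": "that"}),
    ("IN", {"cur_word": "that"}),
    ("DET", {"cur_word": "the"}),
    ("DET", {"cur_word": "that"}),
]

def observed_expectation(feature, sample):
    """E_{p~} f_j = (1/N) * sum_i f_j(x_i) over the N training events."""
    return sum(feature(a, b) for a, b in sample) / len(sample)

d_j = observed_expectation(f_det_that, events)
print(d_j)  # the feature fires on 2 of the 4 events
```

The value `d_j` is the target that the model expectation $E_p f_j$ must match under the constraint $E_p f_j = E_{\tilde{p}} f_j$.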
Using Lagrangian multipliers
Minimize A(p):

$$A(p) = -H(p) - \sum_{j=0}^{k} \lambda_j (E_p f_j - d_j)$$

Set $A'(p) = 0$:

$$\frac{\partial}{\partial p(x)} \left( \sum_x p(x) \log p(x) - \sum_{j=0}^{k} \lambda_j \Big( \big( \sum_x p(x) f_j(x) \big) - d_j \Big) \right) = 0$$

$$1 + \log p(x) - \sum_{j=0}^{k} \lambda_j f_j(x) = 0$$

$$\log p(x) = \sum_{j=0}^{k} \lambda_j f_j(x) - 1$$

$$p(x) = e^{\sum_{j=0}^{k} \lambda_j f_j(x) - 1} = e^{\sum_{j=1}^{k} \lambda_j f_j(x) + \lambda_0 - 1}$$

$$p(x) = \frac{e^{\sum_{j=1}^{k} \lambda_j f_j(x)}}{Z}, \quad \text{where } Z = e^{1 - \lambda_0}$$

Two equivalent forms

$$p(x) = \pi \prod_{j=1}^{k} \alpha_j^{f_j(x)}$$

$$p(x) = \frac{e^{\sum_{j=1}^{k} \lambda_j f_j(x)}}{Z}$$

$$\pi = \frac{1}{Z}, \qquad \lambda_j = \ln \alpha_j$$

Relation to Maximum Likelihood
The log-likelihood of the empirical distribution $\tilde{p}$ as predicted by a model q is defined as

$$L(q) = \sum_x \tilde{p}(x) \log q(x)$$

Theorem: if $p^* \in P \cap Q$ then $p^* = \arg\max_{q \in Q} L(q)$. Furthermore, p* is unique.

Summary (so far)
Goal: find p* in P, which maximizes H(p).

$$P = \{ p \mid E_p f_j = E_{\tilde{p}} f_j, \ j \in \{1, ..., k\} \}$$

It can be proved that when p* exists it is unique.
The model p* in P with maximum entropy is the model in Q that maximizes the likelihood of the training sample $\tilde{p}$.

$$Q = \{ p \mid p(x) = \pi \prod_{j=1}^{k} \alpha_j^{f_j(x)}, \ \alpha_j > 0 \}$$

Summary (cont)
• Adding constraints (features): (Klein and Manning, 2003)
  – Lower maximum entropy
  – Raise maximum likelihood of data
  – Bring the distribution further from uniform
  – Bring the distribution closer to data

Parameter estimation

Algorithms
• Generalized Iterative Scaling (GIS): (Darroch and Ratcliff, 1972)
• Improved Iterative Scaling (IIS): (Della Pietra et al., 1995)

GIS: setup
Requirements for running GIS:
• Obey form of model and constraints:

$$p(x) = \frac{e^{\sum_{j=1}^{k} \lambda_j f_j(x)}}{Z}, \qquad E_p f_j = d_j$$

• An additional constraint:

$$\forall x: \ \sum_{j=1}^{k} f_j(x) = C$$

Let

$$C = \max_x \sum_{j=1}^{k} f_j(x)$$

Add a new feature $f_{k+1}$:

$$\forall x: \ f_{k+1}(x) = C - \sum_{j=1}^{k} f_j(x)$$

GIS algorithm
• Compute $d_j$, j = 1, …, k+1
• Initialize $\lambda_j^{(1)}$ (any values, e.g., 0)
• Repeat until convergence
  – For each j
    • Compute

$$E_{p^{(n)}} f_j = \sum_x p^{(n)}(x) f_j(x), \quad \text{where } p^{(n)}(x) = \frac{e^{\sum_{j=1}^{k+1} \lambda_j^{(n)} f_j(x)}}{Z}$$

    • Update

$$\lambda_j^{(n+1)} = \lambda_j^{(n)} + \frac{1}{C} \log \frac{d_j}{E_{p^{(n)}} f_j}$$

Approximation for calculating feature expectation

$$E_p f_j = \sum_x p(x) f_j(x) = \sum_{a \in A, b \in B} p(a, b) f_j(a, b) = \sum_{a \in A, b \in B} p(b) \, p(a \mid b) f_j(a, b)$$

$$\approx \sum_{a \in A, b \in B} \tilde{p}(b) \, p(a \mid b) f_j(a, b) = \sum_{b \in B} \tilde{p}(b) \sum_{a \in A} p(a \mid b) f_j(a, b) = \frac{1}{N} \sum_{i=1}^{N} \sum_{a \in A} p(a \mid b_i) f_j(a, b_i)$$

Properties of GIS
• $L(p^{(n+1)}) \ge L(p^{(n)})$
• The sequence is guaranteed to converge to p*.
• Convergence can be very slow.
• The running time of each iteration is O(NPA):
  – N: the training set size
  – P: the number of classes
  – A: the average number of features that are active for a given event (a, b).

IIS algorithm
• Compute $d_j$, j = 1, …, k+1 and $f^{\#}(x) = \sum_{j=1}^{k} f_j(x)$
• Initialize $\lambda_j^{(1)}$ (any values, e.g., 0)
• Repeat until convergence
  – For each j
    • Let $\Delta\lambda_j$ be the solution to

$$\sum_x p^{(n)}(x) f_j(x) e^{\Delta\lambda_j f^{\#}(x)} = d_j$$

    • Update

$$\lambda_j^{(n+1)} = \lambda_j^{(n)} + \Delta\lambda_j$$

Calculating $\Delta\lambda_j$
If $\forall x: \ \sum_{j=1}^{k} f_j(x) = C$, then

$$\Delta\lambda_j = \frac{1}{C} \log \frac{d_j}{E_{p^{(n)}} f_j}$$

and GIS is the same as IIS.
Else $\Delta\lambda_j$ must be calculated numerically.

Feature selection

Feature selection
• Throw in many features and let the machine select the weights
  – Manually specify feature templates
• Problem: too many features
• An alternative: greedy algorithm
  – Start with an empty set S
  – Add a feature at each iteration

Notation
With the feature set S: $p_S$
After adding a feature f: $p_{S \cup f}$
The gain in the log-likelihood of the training data: $L(p_{S \cup f}) - L(p_S)$

Feature selection algorithm (Berger et al., 1996)
• Start with S being empty; thus $p_S$ is uniform.
• Repeat until the gain is small enough
  – For each candidate feature f
    • Compute the model $p_{S \cup f}$ using IIS
    • Calculate the log-likelihood gain
  – Choose the feature with maximal gain, and add it to S
Problem: too expensive

Approximating gains (Berger et al., 1996)
• Instead of recalculating all the weights, calculate only the weight of the new feature.

Training a MaxEnt Model
Scenario #1:
• Define feature templates
• Create the feature set
• Determine the optimum feature weights via GIS or IIS
Scenario #2:
• Define feature templates
• Create candidate feature set S
• At every iteration, choose the feature from S (with max gain) and determine its weight (or choose top-n features and their weights).
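The GIS procedure described above can be sketched on a tiny, fully enumerable event space. This is an illustrative sketch only: the function names, the toy events and the empirical distribution are hypothetical, the model is a joint p(x) summed exactly over all events (no approximation of the feature expectation), and a fixed iteration count stands in for a real convergence test.

```python
import math

def model(events, feats, lam):
    """p(x) = exp(sum_j lam_j * f_j(x)) / Z over a finite event space."""
    scores = {x: math.exp(sum(l * f(x) for l, f in zip(lam, feats))) for x in events}
    z = sum(scores.values())
    return {x: s / z for x, s in scores.items()}

def expectation(p, f, events):
    """E_p f = sum_x p(x) f(x)."""
    return sum(p[x] * f(x) for x in events)

def gis(events, features, empirical, iterations=500):
    # C is the largest number of active features on any event.
    c = max(sum(f(x) for f in features) for x in events)
    # Correction feature f_{k+1}(x) = C - sum_j f_j(x), so feature values sum to C everywhere.
    feats = list(features) + [lambda x: c - sum(f(x) for f in features)]
    d = [expectation(empirical, f, events) for f in feats]  # observed expectations d_j
    lam = [0.0] * len(feats)                                # initialize all weights to 0
    for _ in range(iterations):
        p = model(events, feats, lam)
        e = [expectation(p, f, events) for f in feats]
        # lam_j <- lam_j + (1/C) * log(d_j / E_p f_j); features with zero
        # expectation are skipped here to avoid log(0) (a sketch-level shortcut).
        lam = [l + math.log(dj / ej) / c if dj > 0 and ej > 0 else l
               for l, dj, ej in zip(lam, d, e)]
    return model(events, feats, lam), lam

# Hypothetical toy problem: events are (class, current word) pairs.
events = [("DET", "that"), ("DET", "the"), ("IN", "that"), ("IN", "the")]
empirical = {("DET", "that"): 0.3, ("DET", "the"): 0.5,
             ("IN", "that"): 0.2, ("IN", "the"): 0.0}
f1 = lambda x: 1 if x == ("DET", "that") else 0
f2 = lambda x: 1 if x[0] == "DET" else 0

p_star, weights = gis(events, [f1, f2], empirical)
print(p_star)
```

In this toy setting the two IN events share the same feature values, so the fitted model gives them equal probability even though the sample never contains ("IN", "the"): the model matches the constrained expectations and stays uniform elsewhere.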
Case study

POS tagging (Ratnaparkhi, 1996)
• Notation variation:
  – $f_j(a, b)$: a: class, b: context
  – $f_j(h_i, t_i)$: h: history for ith word, t: tag for ith word
• History:

$$h_i = \{ w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2} \}$$

• Training data:
  – Treat it as a list of $(h_i, t_i)$ pairs.
  – How many pairs are there?

Using a MaxEnt Model
• Modeling:
• Training:
  – Define feature templates
  – Create the feature set
  – Determine the optimum feature weights via GIS or IIS
• Decoding:

Modeling

$$P(t_1, ..., t_n \mid w_1, ..., w_n) = \prod_{i=1}^{n} p(t_i \mid w_1^n, t_1^{i-1}) \approx \prod_{i=1}^{n} p(t_i \mid h_i)$$

$$p(t \mid h) = \frac{p(h, t)}{\sum_{t' \in T} p(h, t')}$$

Training step 1: define feature templates
[Figure: table of feature templates, with columns for the history $h_i$ and the tag $t_i$.]

Step 2: Create feature set
• Collect all the features from the training data
• Throw away features that appear fewer than 10 times

Step 3: determine the feature weights
• GIS
• Training time:
  – Each iteration: O(NTA):
    • N: the training set size
    • T: the number of allowable tags
    • A: average number of features that are active for a (h, t).
  – About 24 hours on an IBM RS/6000 Model 380.
• How many features?

Decoding: Beam search
• Generate tags for $w_1$, find the top N, set $s_{1j}$ accordingly, j = 1, 2, …, N
• For i = 2 to n (n is the sentence length)
  – For j = 1 to N
    • Generate tags for $w_i$, given $s_{(i-1)j}$ as previous tag context
    • Append each tag to $s_{(i-1)j}$ to make a new sequence.
  – Find the N highest-probability sequences generated above, and set $s_{ij}$ accordingly, j = 1, …, N
• Return the highest-probability sequence $s_{n1}$.

Beam search

Viterbi search

Decoding (cont)
• Tags for words:
  – Known words: use tag dictionary
  – Unknown words: try all possible tags
• Ex: “time flies like an arrow”
• Running time: O(NTAB)
  – N: sentence length
  – B: beam size
  – T: tagset size
  – A: average number of features that are active for a given event

Experiment results

Comparison with other learners
• HMM: MaxEnt uses more context
• SDT: MaxEnt does not split data
• TBL: MaxEnt is statistical and it provides probability distributions.
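The beam-search decoding procedure from the Decoding slide can be sketched as follows. This is an illustration under stated assumptions, not Ratnaparkhi's implementation: `beam_search`, `toy_prob` and the `lexicon` probabilities are hypothetical, and the toy conditional model ignores the previous tags, whereas a trained MaxEnt model would compute p(t | h) from the full history.

```python
import math

def beam_search(words, tagset, cond_prob, beam_size=3):
    """Keep the beam_size most probable tag sequences after each word.

    cond_prob(i, words, prev_tags, tag) is assumed to return p(tag | h_i).
    Returns the best (tag sequence, log-probability) pair.
    """
    beam = [((), 0.0)]  # one empty sequence with log-probability 0
    for i in range(len(words)):
        candidates = []
        for tags, logp in beam:
            for tag in tagset:
                p = cond_prob(i, words, tags, tag)
                if p > 0.0:  # tags with zero probability are never extended
                    candidates.append((tags + (tag,), logp + math.log(p)))
        # Keep only the N highest-probability extended sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]
    return beam[0]

# Hypothetical tag dictionary with made-up probabilities per word.
lexicon = {
    "time": {"NN": 0.8, "VB": 0.2},
    "flies": {"VBZ": 0.6, "NNS": 0.4},
    "like": {"IN": 0.7, "VB": 0.3},
    "an": {"DT": 1.0},
    "arrow": {"NN": 1.0},
}

def toy_prob(i, words, prev_tags, tag):
    """Toy stand-in for p(t | h): looks up the current word only."""
    return lexicon[words[i]].get(tag, 0.0)

sentence = "time flies like an arrow".split()
tagset = ["NN", "NNS", "VB", "VBZ", "IN", "DT"]
best_tags, best_logp = beam_search(sentence, tagset, toy_prob)
print(best_tags, math.exp(best_logp))
```

The two nested loops over beam entries and tags, each requiring one model evaluation over the active features, correspond to the B, T and A factors in the O(NTAB) running time.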
MaxEnt Summary
• Concept: choose the p* that maximizes entropy while satisfying all the constraints.
• Max likelihood: p* is also the model within a model family that maximizes the log-likelihood of the training data.
• Training: GIS or IIS, which can be slow.
• MaxEnt handles overlapping features well.
• In general, MaxEnt achieves good performance on many NLP tasks.

Additional slides

Ex4 (cont)
??