Introduction to Markov Models (CIS 521 - Intro to AI)
Transcription
Introduction to Markov Models
But first: a few preliminaries. Estimating the probability of phrases of words, sentences, etc. (Corrected 3/30.)

What's a Word? Tokenization Is Tricky…

How to Find Sentences??

Q1: How to estimate the probability of a given sentence?
A crucial step in speech recognition (and lots of other applications).
First guess: products of unigrams, $\hat{P}(W) \approx \prod_{w \in W} P(w)$.
Proposed word lattice: form/farm, subsidy/subsidies, for/far.
Unigram counts (in 1.7 million words of AP text): form 183, subsidy 15, for 18185, farm 74, subsidies 55, far 570.
Not quite right…

Predicting a word sequence II
Next guess: products of bigrams. For $W = w_1 w_2 w_3 \ldots w_n$, $\hat{P}(W) \approx \prod_{i=1}^{n-1} P(w_i w_{i+1})$.
Same proposed word lattice. Bigram counts (in the same 1.7 million words of AP text): form subsidy 0, form subsidies 0, farm subsidy 0, farm subsidies 4, subsidy for 2, subsidy far 0, subsidies for 6, subsidies far 0.
Just right… but the counts are tiny! Why?

How can we estimate P correctly?
Problem: the Naïve Bayes model for bigrams violates independence assumptions. Let's do this right…
Let $W = w_1 w_2 w_3 \ldots w_n$. Then, by the chain rule,
$P(W) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 \ldots w_{n-1})$.
We can estimate $P(w_2 \mid w_1)$ by the Maximum Likelihood Estimator $\frac{\mathrm{Count}(w_1 w_2)}{\mathrm{Count}(w_1)}$, and $P(w_3 \mid w_1 w_2)$ by $\frac{\mathrm{Count}(w_1 w_2 w_3)}{\mathrm{Count}(w_1 w_2)}$, and so on…

and finally, estimating P(wn | w1 w2 … wn-1)
Again, we can estimate $P(w_n \mid w_1 w_2 \ldots w_{n-1})$ with the MLE $\frac{\mathrm{Count}(w_1 w_2 \ldots w_n)}{\mathrm{Count}(w_1 w_2 \ldots w_{n-1})}$.
So to decide pat vs. pot in "Heat up the oil in a large p?t", compute for pot:
Count("Heat up the oil in a large pot") / Count("Heat up the oil in a large") = 0/0

Hmm.. the Web Changes Things (2008 or so)
Even the web in 2008 yields low counts!

Statistics and the Web II
So, P("pot" | "heat up the oil in a large ___") = 8/49 ≈ 0.16.

…. But the web has grown!!!
Now 165/891 ≈ 0.185.

So….
A larger corpus won't help much unless it's HUGE… but the web is!!! What if we only have 100 million words for our estimates??

A BOTEC (back-of-the-envelope) Estimate of What We Can Estimate
What parameters can we estimate with 100 million words of training data, assuming (for now) a uniform distribution over only 5000 words? Even with $10^8$ words of data, for even trigrams we encounter the sparse data problem…

The Markov Assumption: Only the Immediate Past Matters

The Markov Assumption: Estimation
We estimate the probability of each $w_i$ given previous context by $P(w_i \mid w_1 w_2 \ldots w_{i-1}) = P(w_i \mid w_{i-1})$, which can be estimated by $\frac{\mathrm{Count}(w_{i-1} w_i)}{\mathrm{Count}(w_{i-1})}$.
So we're back to counting only unigrams and bigrams!! AND we have a correct, practical estimation method for P(W) given the Markov assumption!
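As a concrete companion to the estimators above, here is a minimal Python sketch (mine, not from the slides: the function names, the whitespace tokenization, and the toy corpus are illustrative assumptions, and sentence-boundary markers are omitted) that builds the Count tables and scores a word sequence by the chain rule under the bigram Markov assumption:

```python
from collections import Counter

def train_counts(tokens):
    """Unigram and bigram Count tables from a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def p_bigram(w_prev, w, unigrams, bigrams):
    """MLE conditional: P(w | w_prev) = Count(w_prev w) / Count(w_prev)."""
    if unigrams[w_prev] == 0:
        return 0.0
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def p_sequence(words, unigrams, bigrams, total):
    """Chain rule under the Markov assumption:
    P(W) = P(w1) * prod_i P(w_i | w_{i-1})."""
    p = unigrams[words[0]] / total          # P(w1) from unigram counts
    for prev, cur in zip(words, words[1:]):
        p *= p_bigram(prev, cur, unigrams, bigrams)
    return p

# Toy usage; counts this small are exactly the sparse-data problem the slides describe.
corpus = "heat up the oil in a large pot and heat the pot".split()
uni, bi = train_counts(corpus)
print(p_sequence("heat up the oil in a large pot".split(), uni, bi, total=len(corpus)))
```

On real data the raw product underflows quickly, so practical implementations sum log-probabilities instead; either way, the slides' point holds: under the Markov assumption everything reduces to unigram and bigram counts.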
Markov Models

Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method
To generate a sequence of n words given unigram estimates:
• Fix some ordering of the vocabulary $v_1 v_2 v_3 \ldots v_k$.
• For each word $w_i$, $1 \le i \le n$:
  - Choose a random value $r_i$ between 0 and 1.
  - Set $w_i$ = the first $v_j$ such that $\sum_{m=1}^{j} P(v_m) \ge r_i$.

Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method (continued)
To generate a sequence of n words given a 1st-order Markov model (i.e. conditioned on one previous word):
• Fix some ordering of the vocabulary $v_1 v_2 v_3 \ldots v_k$.
• Use the unigram method to generate an initial word $w_1$.
• For each remaining $w_i$, $2 \le i \le n$:
  - Choose a random value $r_i$ between 0 and 1.
  - Set $w_i$ = the first $v_j$ such that $\sum_{m=1}^{j} P(v_m \mid w_{i-1}) \ge r_i$.

The Shannon/Miller/Selfridge method trained on Shakespeare
(This and the next two slides from Jurafsky.)

Wall Street Journal just isn't Shakespeare

Shakespeare as corpus
N = 884,647 tokens, V = 29,066 types.
Shakespeare produced 300,000 bigram types out of $V^2 = 844$ million possible bigrams.
• So 99.96% of the possible bigrams were never seen (have zero entries in the table).
Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare.

The Sparse Data Problem Again
How likely is a 0 count? Much more likely than I let on!!!
English word frequencies are well described by Zipf's Law. Zipf (1949) characterized the relation between word frequency $f$ and rank $r$ as $f \cdot r \approx C$ (for a constant $C$), i.e. $r \approx C/f$, so $\log(r) \approx \log(C) - \log(f)$: purely Zipfian data plots as a straight line on a log-log scale.
*Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f).

Word frequency & rank in the Brown Corpus vs. Zipf / Zipf's law for the Brown corpus
(Plots; from Interactive Mathematics, http://www.intmath.com.) Lots of area under the tail of this curve!

Smoothing
At least one unknown word is likely per sentence, given Zipf!! To fix the 0's this causes, we can smooth the data:
• Assume we know how many types never occur in the data.
• Steal probability mass from types that occur at least once.
• Distribute this probability mass over the types that never occur.

Smoothing
"This black art is why NLP is taught in the engineering school." – Jason Eisner
Smoothing is like Robin Hood:
• it steals from the rich
• and gives to the poor

Solution: Add-One Smoothing (again)
Pro: very simple technique.
Cons:
• Probability of frequent n-grams is underestimated.
• Probability of rare (or unseen) n-grams is overestimated.
• Therefore, too much probability mass is shifted towards unseen n-grams.
• All unseen n-grams are smoothed in the same way.
Using a smaller added count improves things, but only somewhat. More advanced techniques (Kneser-Ney, Witten-Bell) use properties of component (n-1)-grams and the like… (Hint for this homework.)
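The Shannon/Miller/Selfridge procedure above maps directly onto code. The following is a minimal sketch under stated assumptions (the helper names train_bigram_counts, sample_from, and generate are mine, as is the fallback to the unigram distribution when a history was never observed; the slides do not specify these details):

```python
import random
from collections import Counter, defaultdict

def train_bigram_counts(tokens):
    """Unigram counts plus per-history bigram counts from a token list."""
    unigrams = Counter(tokens)
    bigrams = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        bigrams[prev][cur] += 1
    return unigrams, bigrams

def sample_from(counts, rng):
    """One Shannon/Miller/Selfridge step: fix an ordering of the vocabulary,
    draw r in [0, 1), and return the first word whose cumulative probability
    reaches r."""
    vocab = sorted(counts)                  # some fixed ordering v1 ... vk
    total = sum(counts.values())
    r = rng.random()
    cumulative = 0.0
    for v in vocab:
        cumulative += counts[v] / total
        if cumulative >= r:
            return v
    return vocab[-1]                        # guard against floating-point rounding

def generate(n, unigrams, bigrams, seed=0):
    """Generate n words from a 1st-order Markov model: the first word from the
    unigram distribution, each later word conditioned on the previous word."""
    rng = random.Random(seed)
    words = [sample_from(unigrams, rng)]
    for _ in range(n - 1):
        history = bigrams.get(words[-1])
        words.append(sample_from(history if history else unigrams, rng))
    return " ".join(words)

# Toy usage; with a corpus this tiny the output is mostly the corpus itself,
# which is the Shakespeare-as-corpus point in miniature.
tokens = "to be or not to be that is the question".split()
uni, bi = train_bigram_counts(tokens)
print(generate(8, uni, bi))
```

And for the add-one smoothing the slides close with, a similarly minimal sketch of the standard add-k estimate, P(w | w_prev) = (Count(w_prev w) + k) / (Count(w_prev) + k*V) with k = 1 for add-one; the formula itself is not spelled out on the slides, and the names below are again illustrative:

```python
from collections import Counter

def train(tokens):
    """Unigram and bigram Count tables from a token list."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def p_add_k(w_prev, w, unigrams, bigrams, k=1.0):
    """Add-k smoothed bigram probability (k = 1 is add-one / Laplace)."""
    V = len(unigrams)                       # observed vocabulary size
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * V)

# Unseen bigrams now get a small nonzero probability; a smaller added count
# shifts less mass to them, as the slides note.
uni, bi = train("form subsidy for farm subsidies far".split())
print(p_add_k("farm", "subsidy", uni, bi))          # unseen pair, k = 1
print(p_add_k("farm", "subsidy", uni, bi, k=0.1))   # smaller added count
```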