Thesis defense slides
Transcription
Generalizing Local Coherence Modeling
Thesis Defense
Micha Elsner
Department of Computer Science, Brown University
January 10, 2011

Too much data, too little time
The Wikileaks release contains 250,000 documents! At one document per minute, nonstop, that is nearly 6 months of reading.
Analysts, journalists and historians need tools for dealing with massive corpora.
Automated search, extraction and summary: the only way ordinary people can cope!

Multidocument summarization
- Sentence selection
- "Patients who had symptoms" of what?
- Reordering

How to reorder
Researchers start to build models...
- (Marcu '97), (Mellish+al '98), (Barzilay+al '02)
...using linguistic theories of coherence (Halliday+Hasan '76).
Coherence: the structure of information in a discourse. It gives readers the context they need... to understand new information.

A solution looking for a problem
A model of coherence scores documents... but what documents should we test on?
- Building a summarizer is time-consuming...
- Hard to separate ordering from sentence selection!
Several proxy tasks based on summarization.

Discrimination
The most common evaluation for coherence models (Barzilay+Lapata '05), following (Karamanis+al '04):
- Single news article instead of summary
- Compares permutations of the same document
But is this all there is to coherence?

Thesis statement
The question we're asking: how do we know if current models generalize?
- The theory of coherence applies to all discourse...
- Do the models?
Our answer: chat disentanglement is a good evaluation for coherence models.
- Uses conversation instead of news
- And different documents instead of permutations
- A well-defined task with measurable performance

A crowded chat room
You intuitively know the structure is this: [figure]
Not this: [figure]
Especially not this! [figure]

Relating the problems
Both incorrect structures lack coherence.
News: "Patients who had symptoms..." / "A crow infected with West Nile..."
Chat: "Of course, that's how they make money!" / "You deserve a trophy." / "People lose limbs, or get killed."
But before we apply this idea, we have some work to do!

Overview
- Introduction
- Disentangling chat: a corpus... and a baseline model
- Local coherence models: ...used in this study
- Phone dialogues: an intermediate domain
- Internet chat: a difficult domain
- Conclusions

Disentangling chat: a corpus... and a baseline model

Infrastructure
To study this task, we need:
- Data
- Metrics for scoring our results
- Baseline and ceiling performance
We provide all of these.
- A few previous studies: (Shen+al '06), (Aoki+al '03)
- No prior published corpora

Making a corpus
Six annotators marked 800 lines of chat
- From a Linux tech support forum on IRC
- Corpus is now publicly available
- Used in (Adams '08), (Wang+Oard '09)

An initial model
Correlation clustering framework (a sketch of the pipeline follows below):
- Classify each pair of utterances as same thread or different thread
- Partition the transcript to keep "same" utterances together and split "different" ones apart
- NP-hard, so we use heuristics
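A minimal sketch of this classify-then-partition pipeline, assuming a hypothetical pairwise scorer same_thread_prob standing in for the real classifier. The single-pass heuristic here is in the spirit of the greedy "Vote" algorithm listed in the bonus material, not the thesis's exact procedure:

    def greedy_disentangle(utterances, same_thread_prob):
        """Greedy correlation clustering of a chat transcript.

        same_thread_prob(u, v) -> probability that u and v belong to
        the same thread.  Scores above 0.5 vote "same", below 0.5
        vote "different".  Each utterance joins the existing thread
        with the highest net vote, or starts a new thread if every
        thread votes against it.
        """
        threads = []
        for u in utterances:
            best_score, best_thread = 0.0, None
            for thread in threads:
                # net preference of this thread for u
                score = sum(same_thread_prob(u, v) - 0.5 for v in thread)
                if score > best_score:
                    best_score, best_thread = score, thread
            if best_thread is None:
                threads.append([u])  # no thread wants u: start a new one
            else:
                best_thread.append(u)
        return threads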
Classifying utterances
Pair of utterances: same conversation or different? (F-score here is the harmonic mean of precision and recall on these pairwise decisions.)
- Chat-based features (F 66%): time between utterances; same speaker; speaker's name mentioned
- Discourse features (F 58%): questions, answers, greetings, etc.
- Word overlap (F 56%): weighted by word probability in corpus; a simplistic coherence feature
- Combined model (F 71%)

Assigning a single sentence
It's easy to maximize the objective locally...
- Even though the global problem is hard

    Accuracy: Same as previous 56; Corr. Clustering 76

Full-scale partitioning
Via a greedy algorithm: [figure]

One-to-one overlap
[figure]
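The slides show this metric only as a figure; for reference, here is a sketch of the one-to-one overlap computation as defined in (Elsner+Charniak ACL '08): gold and predicted conversations are paired off by an optimal bipartite matching, and the score is the percentage of utterances that fall inside matched pairs. This version assumes scipy is available:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def one_to_one_overlap(gold, pred):
        """One-to-one overlap between two threadings of a transcript.

        gold, pred: one thread label per utterance.  Threads are
        paired by a maximum-weight bipartite matching; the score is
        the fraction of utterances whose gold and predicted threads
        are matched to each other.
        """
        gold_ids = {g: i for i, g in enumerate(sorted(set(gold)))}
        pred_ids = {p: i for i, p in enumerate(sorted(set(pred)))}
        overlap = np.zeros((len(gold_ids), len(pred_ids)), dtype=int)
        for g, p in zip(gold, pred):
            overlap[gold_ids[g], pred_ids[p]] += 1
        rows, cols = linear_sum_assignment(-overlap)  # maximize total overlap
        return overlap[rows, cols].sum() / len(gold)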
Results
Agreement (one-to-one overlap):

    Annotators        53
    Best Baseline     35 (Pause 35)
    Corr. Clustering  41

- Annotators don't always agree... some make finer distinctions than others
- Tested a variety of simple baselines
- Model outperforms all baselines

Local coherence models: ...used in this study

A menagerie of models
We've explored a lot of models... and created several new ones! We choose to present a few:
- Local: coherence at the sentence-to-sentence level; important for efficiency
- Different aspects: a broad spectrum of models
- Powerful: all the ingredients of a state-of-the-art ordering system
Not all our own work.

Entity grid
Model of transitions from sentence to sentence (Lapata+Barzilay '05, Barzilay+Lapata '05):

    Text:                                           Syntactic role of "the Rabbit":
    Suddenly a White Rabbit ran by her.             subject
    Alice heard the Rabbit say "I shall be late!"   object
    The Rabbit took a watch out of its pocket.      subject
    Alice started to her feet.                      missing
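A minimal sketch of how such a grid becomes a document representation. This is a simplification: the real entity grid uses a parser and coreference to find each entity's role in each sentence, while here the role sequences are given directly:

    from collections import Counter
    from itertools import product

    ROLES = ["S", "O", "X", "-"]  # subject, object, other, absent

    def transition_probs(grid):
        """grid: entity -> list of roles, one per sentence.

        Returns estimated probabilities of adjacent-sentence role
        transitions (S->O, S->-, ...), the entity grid's features.
        """
        counts = Counter()
        for roles in grid.values():
            counts.update(zip(roles, roles[1:]))
        total = sum(counts.values())
        return {t: counts[t] / total for t in product(ROLES, repeat=2)}

    # The example from the slide:
    grid = {"Rabbit": ["S", "O", "S", "-"],
            "Alice":  ["-", "S", "-", "S"]}
    print(transition_probs(grid)[("S", "O")])  # 1 of 6 transitions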
Topical entity grid
Relationships between different words:
    "a crow infected with West Nile..."
    "the outbreak was the first..."
Our own work.
- Represents words in a semantic space: LDA (Blei+al '01)
- An entity-grid-like model of transitions
- Semantics can be noisy...
- More sensitive than the entity grid, but easy to fool!

IBM Model 1
Single sentence of context. Learns word-to-word relationships directly.
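A sketch of how an IBM Model 1 coherence score can work, assuming a hypothetical learned translation table t_prob (the real model is trained with EM on adjacent sentence pairs):

    import math

    def ibm1_coherence(prev_words, next_words, t_prob, smooth=1e-6):
        """Score the next sentence given one sentence of context.

        IBM Model 1: each word of the next sentence is generated by
        some word of the previous one (or by NULL), so
            P(next | prev) = prod over w of: mean over v of t(w | v)
        where t_prob(w, v) is a word-to-word translation probability.
        """
        context = list(prev_words) + ["<NULL>"]
        logp = 0.0
        for w in next_words:
            p = sum(t_prob(w, v) for v in context) / len(context)
            logp += math.log(p + smooth)
        return logp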
Pronouns
Detect passages with stranded pronouns: (Charniak+Elsner '09), (Elsner+Charniak '08)

Old vs new information
New information needs complex packaging: "Secretary of State Hillary Clinton".
Old information doesn't: "Clinton".
Soft constraints: put the new-looking phrase first (Elsner+Charniak '08), following (Poesio+al '05).

Discrimination results

    Chance         50
    EGrid          76
    Topical EGrid  72
    IBM-1          77
    Pronouns       70
    New info       72
    Combined       82

Our plan
From news... to phone conversations... to IRC chat!
Phone dialogues on selected topics
- Manually transcribed/parsed
- Switchboard corpus

Phone dialogues: an intermediate domain

Discrimination on dialogue

    Chance         50
    EGrid          86
    Topical EGrid  71
    IBM-1          85
    Pronouns       72
    New info       55
    Combined       88

Main effect of document length (uninteresting).
New-information model restricted to news.

Synthetic transcripts
(Experiments in this section ignore speaker identity.)

Single-utterance disentanglement

    Chance            50
    Corr. Clustering  76
    EGrid             77
    Topical EGrid     78
    IBM-1             69
    Pronouns          52
    Time              58
    Combined          83

- Coherence approach outperforms previous work
- Topical model improves
- Pronouns much worse
Best models: sensitive, with many-sentence context.

Full-scale disentanglement on dialogue

    Chance            50
    Corr. Clustering  59
    EGrid             59
    Topical EGrid     60
    IBM-1             56
    Pronouns          54
    Time              55
    Combined          64

- Coherence approach outperforms previous work
- Results predictable from single utterance

What we've learned so far
Conversational data
- Most models work
- The new-information model doesn't: too tied to news
Disentanglement
- Can be solved with coherence models!
- Models should be sensitive to small changes...
- And use lots of context sentences
- Even if this makes them vulnerable to noise

Internet chat: a difficult domain

Our plan
From news... to phone conversations... to IRC chat!
IRC data from previous study
- Automatically parsed

Chat-specific features
Important cues not based on coherence:
- Time between utterances
- Same speaker
- Speaker's name mentioned
Same as in the earlier study; a strong starting point. We evaluate other models combined with these features.

Single-utterance disentanglement

    Chat-specific       74
    Corr. Clustering    76
    Chat+EGrid          79
    Chat+Topical EGrid  77
    Chat+IBM-1          76
    Chat+Pronouns       74

- Still outperform previous work
- Lexical models not as good; lack of data: trained on phone conversations
- Pronouns same as before

More data
800 utterances not enough for you?
- Much larger corpora from (Martell+Adams '08)
- Using our annotation software and protocol
- 20000 total utterances from three newsgroups

    M+A corpora: Corr. Clustering 89, EGrid 93

Full-scale?
Search for the most coherent structure? Doesn't work...
- Creates too many conversations
- Coherence models capture relationships between sentences, not how many sentences!
Coherence seems necessary, not sufficient... we need to know: how long do chats last? How often do they start?
Our main goal: learn about coherence, not chat.

What we've learned now
IRC data
- Unlexicalized entity grid works
- Lexicalized models worse... need IRC training data
Disentanglement
- Single-utterance performance still good
- An arbitrary number of threads requires non-coherence information

Conclusions

Thesis statement
The question we asked: how do we know if current models generalize?
- The theory of coherence applies to all discourse...
- Do the models?
Our answer: used chat disentanglement to evaluate generalization:
- Most models work on conversation
- More context is important for disentanglement
- Word-to-word and topical models work in principle... given suitable training data

Thanks!
Eugene Charniak, Regina Barzilay, Mark Johnson
Infinitely indebted to... Will Headden, Matt Lease, Rebecca Mason, David McClosky, Ben Swanson, Jenine Turner, Joe Austerweil, Sasha Berkoff, Stu Black, Li-Juan Cai, Carleton Coffrin, Adam Darlow, Aparna Das, Jennie Duggan, Naomi Feldman, Steve Gomez, Dan Grollman, Brendan Hickey, Suman Karumuri, Dae-Il Kim, Dan L. Klein, Yuri Malitsky, Yulia Malitskaia, Erik Murphy, Jason Pacheco, Deepak Santhanam, Warren Schudy, Becca Schaffner, Allison Smith, Ben Swanson, Engin Ural... and my family.
...and paid for by Brown, DARPA, NSF, and the Google Fellowship for NLP.

Supporting publications
- Chat corpus and baseline model: (Elsner+Charniak ACL '08)
- Heuristics for correlation clustering: (E+Schudy ILP-NLP '09)
- Modeling coherence: (E+Austerweil+C NAACL '07), (E+C ACL short '08), (E+C in preparation)
- Applying coherence to disentanglement: (E+C '11 submitted), (E+C Journal of CL '10)
- Coreference: (C+E EACL '09), (E+C+Johnson NAACL '09), (E+C ACL short '10)
- Sentence fusion: (E+Santhanam in preparation)
- Parsing: (E+Swift+Allen+Gildea IWPT '05), (C+Johnson+E+al NAACL '06)

Special bonus material!
- Other applications of coherence
- Correlation clustering algorithms
- Complete correlation clustering numbers
- No, really... why doesn't full IRC disentanglement work?

Evaluating coherence models
- I showed you this example task (Barzilay+Lapata '05)...
- In the same framework, you can search for an optimal order (NP-hard)
- Or do a single-utterance task called insertion
What non-summary applications are there?

Scoring documents
Evaluating the scores directly... human ratings of coherence
- Experiments: how coherent/readable is this, 1-5?
- e.g. (Pitler+Nenkova '08)
- Expensive, and somewhat subjective
- Essay grades: (Miltsakaki+Kukich '04), (Burstein+al '10)
- Data is proprietary; grades not entirely coherence-based
Data expensive... domain still informative writing.

Segmentation
Discourse segments: topical units (paragraph, section, chapter...)
- Useful for building a table of contents or index
- First model: TextTiling (Hearst '97)
- Since then many others...
- Often time-dependent language modeling (e.g. the HMM-topic model (Blei+Moreno '01))
- Usually global, not local, models
- Used on many domains
Not all coherence models segment... we have to build in segmental structure explicitly with latent variables.

Affinity matrix
[figure]

Correlation clustering
Given green (+) edges with weights $w^+_{ij}$ and red (-) edges with weights $w^-_{ij}$, partition to minimize disagreement:

    $\min_x \sum_{ij} x_{ij} w^-_{ij} + (1 - x_{ij}) w^+_{ij}$

subject to: the $x_{ij}$ form a consistent clustering, i.e. the relation must be transitive ($x_{ij} \wedge x_{jk} \rightarrow x_{ik}$).
Minimization is NP-hard (Bansal et al. '04).

Bounds
[figure]

Results

                     Objective  Local  One-to-one
    Bounds
      Trivial bound     0%        -       -
      SDP bound        13.0%      -       -
    Local search
      First/BOEM       19.3%     74      41
      Vote/BOEM        20.0%     73      46
      Sim Anneal       20.3%     73      42
      Best/BOEM        21.3%     73      43
      BOEM             21.5%     72      22
      Pivot/BOEM       22.0%     72      45
    Greedy
      Vote             26.3%     72      44
      Best             37.1%     67      40
      Pivot            44.4%     66      39
      First            58.3%     62      39

The objective doesn't always predict performance. Most edges have weight near .5:
- Some systems link too much.
- This doesn't affect the local metric much...
- But the global metric suffers.
In this situation, it is useful to have an external measure of quality.
Better inference is still useful:
- Vote/BOEM 12% better than (Elsner+Charniak '08).
- Exact same classifier!
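For reference on the local-search rows above, a minimal sketch of one-element-move search (BOEM = best one-element move). This is a simplified, hypothetical version of the algorithms compared in (E+Schudy ILP-NLP '09):

    def boem(labels, weight):
        """Best-one-element-move search for correlation clustering.

        labels: integer cluster id per item (modified in place).
        weight(i, j): positive if i and j prefer the same cluster,
        negative if they prefer different clusters.
        Repeatedly apply the single most improving move of one item
        to another (or a brand-new) cluster, until a local optimum.
        """
        n = len(labels)
        while True:
            fresh = max(labels) + 1  # a brand-new singleton cluster
            best_gain, best_move = 0.0, None
            for i in range(n):
                # affinity of item i toward each candidate cluster
                affinity = {fresh: 0.0}
                for j in range(n):
                    if j != i:
                        affinity[labels[j]] = affinity.get(labels[j], 0.0) + weight(i, j)
                current = affinity.get(labels[i], 0.0)
                for c, a in affinity.items():
                    if a - current > best_gain:
                        best_gain, best_move = a - current, (i, c)
            if best_move is None:
                return labels
            i, c = best_move
            labels[i] = c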
Results

    Metric       Annotators  Model  Best Baseline
    Mean 1-to-1     53         41    35 (Pause 35)
    Max 1-to-1      64         52    56 (Pause 65)
    Min 1-to-1      36         32    28 (Blocks 80)
    Mean local      81         73    62 (Speaker)
    Max local       87         75    69 (Speaker)
    Min local       75         71    54 (Speaker)
    Mean F          54         43    37 (Speaker)
    Max F           66         58    47 (Speaker)
    Min F           35         33    29 (Blocks 65)

Reminder: what I told you before
Search for the most coherent structure? Doesn't work...
- Creates too many conversations
- Coherence models capture relationships between sentences, not how many sentences!
Coherence seems necessary, not sufficient... we need to know: how long do chats last? How often do they start?
Our main goal: learn about coherence, not chat.

Model of chat length: why not?
It's not hard to model chat length, is it?
- No, but... integrating it is hard!
Combining models: log-linear framework (Soricut+Marcu '06)

    $P(d) \propto \exp(\lambda_1 f_{\mathrm{egrid}}(d) + \lambda_2 f_{\mathrm{ibm}}(d) + \lambda_3 f_{\mathrm{time\,gap}}(d) + \lambda_4 f_{\mathrm{speaker\,ids}}(d) + \cdots)$

The framework is discriminative...
- Trained to maximize $P(d)$ relative to $P(d')$
- No generative version: intractable to normalize over all $d$

Choice of contrast documents
Training the combined model: we need a good document $d$ and a contrast $d'$.
- Discrimination: $d'$ is a permutation of $d$
- Single-utterance disentanglement: $d'$ is $d$ with one utterance swapped
- Full-scale disentanglement:
  - Should base $d$ and $d'$ on decisions in the search tree (like SEARN (Daume '06))
  - But this requires search inside training: 10hrs/doc!
  - So we use the combination from single-utterance disentanglement for the full task...
  - But the chat length model is useless for single-utterance disentanglement!
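A minimal sketch of the log-linear combination and the pairwise training signal just described, with hypothetical feature functions standing in for the coherence models above. The real system's training procedure may differ; this uses a simple perceptron-style update:

    def loglinear_score(doc, features, weights):
        """Unnormalized log-probability: sum_i lambda_i * f_i(doc)."""
        return sum(w * f(doc) for f, w in zip(features, weights))

    def contrast_update(good, contrast, features, weights, lr=0.1):
        """Nudge the weights so the good document outscores its contrast.

        Discriminative training: we never normalize over all possible
        documents, only compare the score of d to that of a generated
        contrast d' (a permutation, a swap, or a search-tree decision).
        """
        if loglinear_score(good, features, weights) <= \
           loglinear_score(contrast, features, weights):
            for i, f in enumerate(features):
                weights[i] += lr * (f(good) - f(contrast))
        return weights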