Thesis defense slides (transcription)
Generalizing Local Coherence Modeling
Thesis Defense
Micha Elsner
Department of Computer Science
Brown University
January 10, 2011
Too much data, too little time

The Wikileaks release contains 250,000 documents!
One document per minute, nonstop: nearly 6 months of reading.

Analysts, journalists and historians need tools for dealing with massive corpora.
Automated search, extraction and summary: the only way ordinary people can cope!
Multidocument summarization

- Sentence selection
  - “Patients who had symptoms” – of what?
- Reordering
How to reorder

Researchers start to build models...
- (Marcu `97), (Mellish+al `98), (Barzilay+al `02)
- Using linguistic theories of coherence (Halliday+Hasan `76)

Coherence
The structure of information in a discourse:
it gives readers the context they need to understand new information.
A solution looking for a problem

A model of coherence scores documents...
...but what documents should we test on?
- Building a summarizer is time-consuming...
- Hard to separate ordering from sentence selection!

Several proxy tasks based on summarization.
Discrimination

The most common evaluation for coherence models (Barzilay+Lapata `05), following (Karamanis+al `04):
- A single news article instead of a summary
- Compares permutations of the same document

But is this all there is to coherence?
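Discrimination is easy to reproduce for any scoring model. Here is a minimal Python sketch, assuming a hypothetical coherence_score function that maps a list of sentences to a real number; it counts how often the original order outscores a random permutation (50% is chance).

```python
import random

def discrimination_accuracy(docs, coherence_score, n_perms=20, seed=0):
    """Fraction of (original, permutation) pairs where the model
    prefers the original order; 50% is chance performance."""
    rng = random.Random(seed)
    wins, total = 0, 0
    for doc in docs:                     # each doc: a list of sentences
        orig = coherence_score(doc)
        for _ in range(n_perms):
            perm = doc[:]
            rng.shuffle(perm)
            if perm == doc:              # skip identity permutations
                continue
            total += 1
            wins += orig > coherence_score(perm)
    return wins / total
```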
Thesis statement

The question we're asking:
How do we know if current models generalize?
- The theory of coherence applies to all discourse...
- Do the models?

Our answer:
Chat disentanglement is a good evaluation for coherence models:
- Using conversation instead of news
- And different documents instead of permutations
- A well-defined task with measurable performance
A crowded chat room

You intuitively know the structure is this: [figure]
Not this: [figure]
Especially not this! [figure]
Relating the problems

Both incorrect structures lack coherence.

News:
  Patients who had symptoms...
  A crow infected with West Nile...

Chat:
  Of course, that's how they make money!
  You deserve a trophy.
  People lose limbs, or get killed.

But before we apply this idea, we have some work to do!
Overview

Introduction
Disentangling chat: a corpus... and a baseline model
Local coherence models: ...used in this study
Phone dialogues: an intermediate domain
Internet chat: a difficult domain
Conclusions
Overview (next: Disentangling chat)
Infrastructure

To study this task, we need:
- Data
- Metrics for scoring our results
- Baseline and ceiling performance

We provide all of these.
- A few previous studies: (Shen+al `06), (Aoki+al `03)
- No prior published corpora
Making a corpus

Six annotators marked 800 lines of chat
- From a Linux tech support forum on IRC
- The corpus is now publicly available
- Used in (Adams `08), (Wang+Oard `09)
An initial model

Correlation clustering framework:
- Classify each pair of utterances as “same thread” or “different thread”
- Partition the transcript to keep “same” utterances together and split “different” ones apart
- NP-hard, so we use heuristics
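As a concrete sketch of the objective (my notation, not the slide's): given the classifier's probability that two utterances share a thread, a candidate partition pays 1 − p for every pair it joins and p for every pair it splits, and we want the partition with the smallest total penalty.

```python
def cc_cost(thread_of, same_prob):
    """Correlation-clustering disagreement for a candidate partition.

    thread_of:  dict utterance id -> thread id
    same_prob:  dict (i, j) with i < j -> P(same thread), from the classifier
    """
    cost = 0.0
    for (i, j), p in same_prob.items():
        if thread_of[i] == thread_of[j]:
            cost += 1.0 - p   # joined a pair the classifier calls "different"
        else:
            cost += p         # split a pair the classifier calls "same"
    return cost
```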
Classifying utterances

Pair of utterances: same conversation or different?
(F-score: the harmonic mean of precision and recall.)

Chat-based features (F 66%)
- Time between utterances
- Same speaker
- Speaker's name mentioned

Discourse features (F 58%)
- Questions, answers, greetings, etc.

Word overlap (F 56%)
- Weighted by word probability in corpus
- A simplistic coherence feature

Combined model (F 71%)
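A hedged sketch of the pairwise classifier, with stand-in features (the real model's feature set is richer, and the field names here are hypothetical):

```python
from sklearn.linear_model import LogisticRegression

def pair_features(u1, u2):
    """Hypothetical feature vector for one utterance pair."""
    overlap = len(set(u1["words"]) & set(u2["words"]))
    return [
        abs(u2["time"] - u1["time"]),           # time gap
        float(u1["speaker"] == u2["speaker"]),  # same speaker
        float(u1["speaker"] in u2["words"]),    # name mention
        overlap,                                # word overlap
    ]

# X = [pair_features(a, b) for a, b in labeled_pairs]; y = same-thread labels
# clf = LogisticRegression().fit(X, y)
# p_same = clf.predict_proba(X)[:, 1]   # feeds the partitioning step
```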
Assigning a single sentence

It's easy to maximize the objective locally...
- Even though the global problem is hard

  Method            Accuracy
  Same as previous  56
  Corr. Clustering  76
Full-scale partitioning

Via a greedy algorithm:
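A sketch of the general greedy idea, under my own simplifications (sum-of-scores linking and a fixed threshold; the thesis's exact scoring and tie-breaking may differ): walk the transcript once, attaching each utterance to the best existing thread or starting a new one.

```python
def greedy_disentangle(utts, same_prob, threshold=0.5):
    """One greedy pass over the transcript.

    same_prob: dict (i, j) with i < j -> P(same thread).
    Returns a list of threads, each a list of utterance indices."""
    threads = []
    for j in range(len(utts)):
        best, best_score = None, 0.0
        for t in threads:
            # net evidence for attaching j to thread t
            score = sum(same_prob[(i, j)] - threshold for i in t)
            if score > best_score:
                best, best_score = t, score
        if best is None:
            threads.append([j])   # no thread beats the threshold
        else:
            best.append(j)
    return threads
```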
One-to-one overlap
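The one-to-one metric can be computed exactly with an optimal bipartite matching. A sketch using scipy's Hungarian-algorithm solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def one_to_one(gold, pred):
    """Match gold and predicted threads 1:1 so total utterance overlap
    is maximized; report matched utterances as a percentage.
    gold/pred: dicts mapping utterance id -> thread id."""
    g_ids = sorted(set(gold.values()))
    p_ids = sorted(set(pred.values()))
    overlap = np.zeros((len(g_ids), len(p_ids)))
    for u in gold:
        overlap[g_ids.index(gold[u]), p_ids.index(pred[u])] += 1
    rows, cols = linear_sum_assignment(-overlap)   # maximize overlap
    return 100.0 * overlap[rows, cols].sum() / len(gold)
```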
Results

  System            Agreement
  Annotators        53
  Best Baseline     35 (Pause 35)
  Corr. Clustering  41

- Annotators don't always agree...
- Some make finer distinctions than others
- Tested a variety of simple baselines
- The model outperforms all baselines
Overview (next: Local coherence models)
A menagerie of models

We've explored a lot of models... and created several new ones!
We choose to present a few:
- Local: coherence at the sentence-to-sentence level
  - Important for efficiency
- Different aspects: a broad spectrum of models
- Powerful: all the ingredients of a state-of-the-art ordering system
- Not all our own work.
Entity grid

A model of transitions from sentence to sentence (Lapata+Barzilay `05, Barzilay+Lapata `05):

  Text                                            Syntactic role (of the Rabbit)
  Suddenly a White Rabbit ran by her.             subject
  Alice heard the Rabbit say “I shall be late!”   object
  The Rabbit took a watch out of its pocket.      subject
  Alice started to her feet.                      missing
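A toy version of the grid for this passage, with transitions counted per entity (roles simplified to S/O/-; treating “her” as an object mention of Alice is my simplification):

```python
from collections import Counter

# Rows: entities; columns: the four sentences; cells: syntactic role
# (S = subject, O = object, - = absent).
grid = {
    "Rabbit": ["S", "O", "S", "-"],
    "Alice":  ["O", "S", "-", "S"],
}

def transition_probs(grid, k=2):
    """Distribution over length-k role transitions: the entity grid's
    basic feature set for scoring a document."""
    counts = Counter()
    for roles in grid.values():
        for i in range(len(roles) - k + 1):
            counts[tuple(roles[i:i + k])] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

print(transition_probs(grid))   # e.g. ('S', 'O'): 1/6, ('S', '-'): 2/6, ...
```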
Topical entity grid

Relationships between different words:
“a crow infected with West Nile...”
“the outbreak was the first...”

Our own work.
- Represents words in a “semantic space”: LDA (Blei+al `01)
- An entity-grid-like model of transitions
- “Semantics” can be noisy...
- More sensitive than the entity grid, but easy to fool!
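The actual model is grid-like; the sketch below shows only the underlying idea, assuming a hypothetical word_topics table (word → LDA topic distribution): adjacent sentences are scored by similarity in topic space, so “infected” and “outbreak” can relate without sharing a string.

```python
import numpy as np

def sentence_topic(sent, word_topics):
    """Mean topic vector of a sentence's known words; word_topics is a
    hypothetical dict word -> LDA topic distribution (numpy array)."""
    vecs = [word_topics[w] for w in sent if w in word_topics]
    return np.mean(vecs, axis=0) if vecs else None

def topical_coherence(sents, word_topics):
    """Mean cosine similarity between adjacent sentences in topic space."""
    sims = []
    for a, b in zip(sents, sents[1:]):
        va, vb = sentence_topic(a, word_topics), sentence_topic(b, word_topics)
        if va is not None and vb is not None:
            sims.append(float(va @ vb) /
                        (np.linalg.norm(va) * np.linalg.norm(vb)))
    return sum(sims) / len(sims) if sims else 0.0
```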
IBM Model 1

- A single sentence of context
- Learns word-to-word relationships directly
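A sketch of Model-1-style scoring, assuming a hypothetical ttable of learned word-to-word probabilities: each word of the current sentence is explained by some word of the previous sentence (or a smoothed NULL).

```python
import math

def model1_logprob(prev, cur, ttable, null_prob=1e-4):
    """log P(cur | prev), IBM Model 1 style: every word of cur is
    generated by a uniformly chosen word of prev (or NULL).
    ttable: dict (prev_word, cur_word) -> translation probability."""
    lp = 0.0
    for t in cur:
        p = sum(ttable.get((s, t), 0.0) for s in prev) / (len(prev) + 1)
        lp += math.log(p + null_prob)   # crude NULL/unknown smoothing
    return lp
```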
Pronouns

Detect passages with stranded pronouns:
(Charniak+Elsner `09), (Elsner+Charniak `08)
Old vs. new information

New information needs complex packaging:
“Secretary of State Hillary Clinton”
Old information doesn't:
“Clinton”

Soft constraints: put the “new”-looking phrase first
(Elsner+Charniak `08), following (Poesio+al `05)
Discrimination results

  Model          Accuracy
  Chance         50
  EGrid          76
  Topical EGrid  72
  IBM-1          77
  Pronouns       70
  New info       72
  Combined       82
Our plan

From news... to phone conversations... to IRC chat!

Phone dialogues on selected topics
- Manually transcribed/parsed
- Switchboard corpus
Overview (next: Phone dialogues)
Discrimination on dialogue

  Model          Accuracy
  Chance         50
  EGrid          86
  Topical EGrid  71
  IBM-1          85
  Pronouns       72
  New info       55
  Combined       88

Main effect of document length (uninteresting)
New-information model restricted to news
Synthetic transcripts

(Experiments in this section ignore speaker identity)
Single-utterance disentanglement

  Model             Accuracy
  Chance            50
  Corr. Clustering  76
  EGrid             77
  Topical EGrid     78
  IBM-1             69
  Pronouns          52
  Time              58
  Combined          83

- Coherence approach outperforms previous
- Topical model improves
- Pronouns much worse

Best models: sensitive, many-sentence context
Full-scale disentanglement on dialogue

  Model             Score
  Chance            50
  Corr. Clustering  59
  EGrid             59
  Topical EGrid     60
  IBM-1             56
  Pronouns          54
  Time              55
  Combined          64

- Coherence approach outperforms previous
- Results predictable from single-utterance performance
What we've learned so far

Conversational data
- Most models work
- The new-information model doesn't work
  - Too tied to news

Disentanglement
- Can be solved with coherence models!
- Models should be sensitive to small changes...
- And use lots of context sentences
- Even if this makes them vulnerable to noise
Overview (next: Internet chat)
Our plan

From news... to phone conversations... to IRC chat!

IRC data from the previous study
- Automatically parsed
Chat-specific features

Important cues not based on coherence:
- Time between utterances
- Same speaker
- Speaker's name mentioned

Same as in the earlier study; a strong starting point.
We evaluate other models combined with these features.
Single-utterance disentanglement

  Model               Accuracy
  Chat-specific       74
  Corr. Clustering    76
  Chat+EGrid          79
  Chat+Topical EGrid  77
  Chat+IBM-1          76
  Chat+Pronouns       74

- Still outperform previous
- Lexical models not as good
  - Lack of data: trained on phone conversations
- Pronouns same as before
More data

800 utterances not enough for you?
- Much larger corpora from (Martell+Adams `08)
- Using our annotation software and protocol
- 20,000 total utterances from three newsgroups

On the M+A corpora: Corr. Clustering 89, EGrid 93
Full-scale?

Search for the most coherent structure? Doesn't work...
- Creates too many conversations
- Coherence models capture relationships between sentences...
- Not how many sentences!

Coherence seems necessary, not sufficient...
We need to know: how long do chats last? How often do they start?

Our main goal: learn about coherence, not chat
What we've learned now

IRC data
- The unlexicalized entity grid works
- Lexicalized models are worse...
  - They need IRC training data

Disentanglement
- Single-utterance performance is still good
- An arbitrary number of threads requires non-coherence information
Overview (next: Conclusions)
Thesis statement

The question we asked:
How do we know if current models generalize?
- The theory of coherence applies to all discourse...
- Do the models?

Our answer:
We used chat disentanglement to evaluate generalization:
- Most models work on conversation
- More context is important for disentanglement
- Word-to-word and topical models work in principle...
  - Given suitable training data
Thanks!

Eugene Charniak, Regina Barzilay, Mark Johnson

Infinitely indebted to...
Will Headden, Matt Lease, Rebecca Mason, David McClosky, Ben Swanson, Jenine Turner

Joe Austerweil, Sasha Berkoff, Stu Black, Li-Juan Cai, Carleton Coffrin, Adam Darlow, Aparna Das, Jennie Duggan, Naomi Feldman, Steve Gomez, Dan Grollman, Brendan Hickey, Suman Karumuri, Dae-Il Kim, Dan L. Klein, Yuri Malitsky, Yulia Malitskaia, Erik Murphy, Jason Pacheco, Deepak Santhanam, Warren Schudy, Becca Schaffner, Allison Smith, Ben Swanson, Engin Ural...
and my family.

And paid for by... Brown, DARPA, NSF, and the Google Fellowship for NLP
Supporting publications

- Chat corpus and baseline model: (Elsner+Charniak ACL `08)
- Heuristics for correlation clustering: (E+Schudy ILP-NLP `09)
- Modeling coherence: (E+Austerweil+C NAACL `07), (E+C ACL short `08), (E+C in preparation)
- Applying coherence to disentanglement: (E+C `11 submitted), (E+C Journal of CL `10)
- Coreference: (C+E EACL `09), (E+C+Johnson NAACL `09), (E+C ACL short `10)
- Sentence fusion: (E+Santhanam in preparation)
- Parsing: (E+Swift+Allen+Gildea IWPT `05), (C+Johnson+E+al NAACL `06)
Special bonus material!

- Other applications of coherence
- Correlation clustering algorithms
- Complete correlation clustering numbers
- No, really... why doesn't full IRC disentanglement work?
Evaluating coherence models

- I showed you this example task (Barzilay+Lapata `05)...
- In the same framework, you can search for an optimal order (NP-hard)
- Or do a single-utterance task called insertion

What non-summary applications are there?
Scoring documents

Evaluating the scores directly...

Human ratings of coherence
- Experiments: “how coherent/readable is this, 1-5?”
  - e.g. (Pitler+Nenkova `08)
- Expensive, and somewhat subjective
- Essay grades: (Miltsakaki+Kukich `04), (Burstein+al `10)
  - Data is proprietary, and grades are not entirely coherence-based

Data is expensive... and the domain is still informative writing
Segmentation

Discourse segments are topical units: paragraph, section, chapter...
- Useful for building a table of contents or index
- First model: TextTiling (Hearst `97)
- Since then many others...
  - Often time-dependent language modeling (e.g. the HMM-topic model (Blei+Moreno `01))
- Usually global, not local, models
- Used on many domains

Not all coherence models segment...
you have to build in segmental structure explicitly, with latent variables
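For flavor, a block-comparison sketch in the TextTiling spirit (not Hearst's full algorithm, which adds smoothing and depth scoring): low cosine similarity between the windows on either side of a gap suggests a topic boundary.

```python
from collections import Counter
import math

def tile_scores(tokens, w=20):
    """Cosine similarity between the w tokens before and after each
    candidate gap; deep valleys suggest topic boundaries."""
    scores = []
    for gap in range(w, len(tokens) - w):
        left = Counter(tokens[gap - w:gap])
        right = Counter(tokens[gap:gap + w])
        dot = sum(left[t] * right[t] for t in left)
        norm = math.sqrt(sum(c * c for c in left.values()) *
                         sum(c * c for c in right.values()))
        scores.append((gap, dot / norm if norm else 0.0))
    return scores
```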
Affinity matrix
Correlation clustering

Given green (+) edges with weights $w^+_{ij}$ and red (−) edges with weights $w^-_{ij}$,
partition to minimize disagreement:

  $\min_x \sum_{ij} x_{ij}\, w^-_{ij} + (1 - x_{ij})\, w^+_{ij}$

subject to $x$ forming a consistent clustering: the relation must be transitive,
$x_{ij} \wedge x_{jk} \Rightarrow x_{ik}$ (where $x_{ij} = 1$ means $i$ and $j$ share a cluster).

Minimization is NP-hard (Bansal et al. `04).
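On toy inputs the objective can be minimized exactly by enumerating set partitions, which is useful for sanity-checking heuristics; a sketch (variable names mine):

```python
from itertools import combinations

def all_partitions(items):
    """Enumerate set partitions (Bell-number many: tiny inputs only)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in all_partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield part + [[first]]

def exact_cc(n, w_plus, w_minus):
    """Exact correlation clustering by exhaustive search.
    w_plus/w_minus: dicts (i, j) with i < j -> edge weights.
    Pairs placed together pay w_minus; pairs split apart pay w_plus."""
    def cost(part):
        cluster = {i: c for c, block in enumerate(part) for i in block}
        return sum(w_minus[(i, j)] if cluster[i] == cluster[j] else w_plus[(i, j)]
                   for i, j in combinations(range(n), 2))
    return min(all_partitions(list(range(n))), key=cost)
```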
Bounds
Results

  Group          Method         Objective   Local   One-to-one
  Bounds         Trivial bound      0%        -        -
  Bounds         SDP bound         13.0%      -        -
  Local search   First/BOEM        19.3%     74       41
  Local search   Vote/BOEM         20.0%     73       46
  Local search   Sim Anneal        20.3%     73       42
  Local search   Best/BOEM         21.3%     73       43
  Local search   BOEM              21.5%     72       22
  Local search   Pivot/BOEM        22.0%     72       45
  Greedy         Vote              26.3%     72       44
  Greedy         Best              37.1%     67       40
  Greedy         Pivot             44.4%     66       39
  Greedy         First             58.3%     62       39
The objective doesn't always predict performance.
Most edges have weight .5:
- Some systems link too much.
- That doesn't affect the local metric much...
- But the global metric suffers.

In this situation, it's useful to have an external measure of quality.
Better inference is still useful:
- Vote/BOEM is 12% better than (Elsner+Charniak `08).
- The exact same classifier!
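All the */BOEM systems in the table share one local-search move. A sketch of best one-element move, assuming a cost function over whole partitions (a thin wrapper over cc_cost above would do): repeatedly apply the single-utterance move that most decreases the objective, stopping at a local optimum.

```python
def boem(threads, cost):
    """Best one-element move local search.
    threads: list of lists of utterance ids; cost: partition -> float."""
    improved = True
    while improved:
        improved = False
        best_move, best_cost = None, cost(threads)
        for ti, t in enumerate(threads):
            for u in t:
                for tj in range(len(threads) + 1):   # +1 = a new thread
                    if tj == ti:
                        continue
                    cand = [list(x) for x in threads] + [[]]
                    cand[ti].remove(u)
                    cand[tj].append(u)
                    cand = [x for x in cand if x]    # drop emptied threads
                    c = cost(cand)
                    if c < best_cost:
                        best_move, best_cost = cand, c
        if best_move is not None:
            threads, improved = best_move, True
    return threads
```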
Results

  Metric       Annotators   Model   Best Baseline
  Mean 1-to-1      53         41    35 (Pause 35)
  Max 1-to-1       64         52    56 (Pause 65)
  Min 1-to-1       36         32    28 (Blocks 80)
  Mean local       81         73    62 (Speaker)
  Max local        87         75    69 (Speaker)
  Min local        75         71    54 (Speaker)
  Mean F           54         43    37 (Speaker)
  Max F            66         58    47 (Speaker)
  Min F            35         33    29 (Blocks 65)
Reminder: what I told you before

Search for the most coherent structure? Doesn't work...
- Creates too many conversations
- Coherence models capture relationships between sentences...
- Not how many sentences!

Coherence seems necessary, not sufficient...
We need to know: how long do chats last? How often do they start?

Our main goal: learn about coherence, not chat
Model of chat length: why not?

It's not hard to model chat length, is it?
- No, but... integrating it is hard!

Combining models
Log-linear framework (Soricut+Marcu `06):

  $P(d) \propto \exp\big(\lambda_1 \cdot \text{egrid}(d) + \lambda_2 \cdot \text{ibm}(d) + \lambda_3 \cdot \text{time gap}(d) + \lambda_4 \cdot \text{speaker ids}(d) + \dots\big)$

The framework is discriminative...
- Trained to maximize $P(d)$ relative to $P(d')$
- No generative version: intractable to normalize over all $d$
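A sketch of the combination, with a perceptron-style update standing in for the actual training procedure (feature functions and learning rate are mine):

```python
import numpy as np

def score(doc, feats, lam):
    """Unnormalized log-linear score: lam . f(doc)."""
    return float(np.dot(lam, [f(doc) for f in feats]))

def perceptron_step(doc, contrast, feats, lam, lr=0.1):
    """If the contrast d' scores at least as high as the good document d,
    nudge the weights toward d's features and away from d's contrast."""
    f_d = np.array([f(doc) for f in feats], dtype=float)
    f_c = np.array([f(contrast) for f in feats], dtype=float)
    if np.dot(lam, f_c) >= np.dot(lam, f_d):
        lam = lam + lr * (f_d - f_c)
    return lam
```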
Choice of contrast documents

Training the combined model needs a good document d and a contrast d′.
- Discrimination: d′ is a permutation of d
- Single-utterance disentanglement: d′ is d with one utterance swapped
- Full-scale disentanglement:
  - Should base d and d′ on decisions in the search tree (like SEARN (Daume `06))
  - Requires search inside training: 10 hrs/doc!
  - So we use the combination from single-utterance disentanglement for the full task
  - But the chat length model is useless for single-utterance disentanglement!
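Generating the first two kinds of contrast document is cheap. A sketch (the exact swap procedure in the thesis may differ; here one_swap just exchanges two utterances of d):

```python
import random

def permuted(doc, rng=random):
    """Contrast for discrimination: a random permutation of d."""
    d = doc[:]
    rng.shuffle(d)
    return d

def one_swap(doc, rng=random):
    """Contrast for single-utterance disentanglement: d with one
    utterance swapped (here: exchange two positions of d)."""
    d = doc[:]
    i, j = rng.sample(range(len(d)), 2)
    d[i], d[j] = d[j], d[i]
    return d
```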