Transcription
Learning to Analyze Sequences
Fernando Pereira
with
Axel Bernal, Koby Crammer, Kuzman Ganchev,
Ryan McDonald
Department of Computer and Information Science
University of Pennsylvania
Thanks: Steve Carroll, Artemis Hatzigeorgiu, John Lafferty, Kevin Lerman,
Gideon Mann, Andrew McCallum, Fei Sha, Peter White
NSF EIA 0205456, EIA 0205448, IIS 0428193
Annotation Markup
[Screenshot: BioIE annotation viewer (file source_file_1156_28611.src) displaying PubMed article #10027390 (Rimm DL, Caca K, Hu G, Harrison FB, Fearon ER, "Frequent nuclear/cytoplasmic localization of beta-catenin without exon 3 mutations in malignant melanoma," Am J Pathol 1999 Feb;154(2):325-9), with its abstract text, entity annotations, an annotation legend, and annotation display controls.]
Sequences Everywhere
• syntax: sentence structure (e.g., "John saw Mary in the park")
• content (entities): gene/protein mentions in text (e.g., beta-catenin)
• genes: gene structure over DNA (Igenic, Einit, E1, Eterm)
Analyzing Sequences
• Mapping from sequences (documents, sentences, genes) to structures
Analysis Challenges
• Interacting decisions
  [Figure: two competing analyses of the phrase "fake news show"]
• Diverse sequence features
• Inference – computing best analysis – may be costly
General Setting
• Sequences over some alphabet: x ∈ Σ*
• A set of possible analyses for each sequence: y ∈ Y(x)
• A parametric inference method for selecting a "best" analysis: ŷ = h(x, w)
• Learn the parameters w from examples
Previous Approaches
• Generative: learn stochastic process parameters
  ŷ = arg max_y P(x, y | w)
• Sequential: learn how to build an analysis
  ŷ_i = c(x, y_1, …, y_{i−1}, w)
Previous Approaches
• Generative
  • HMMs, probabilistic grammars
  • Require a complete representation of the input-output relation
  • Hard to model non-independent features
• Sequential
  • Allow full exploitation of input features
  • But cannot trade off decisions at different positions: label-bias problem
Structured Linear Models
• Generalize linear classification
  ŷ = arg max_y w · F(x, y)
• Features based on local domains (see the sketch below)
  F(x, y) = Σ_{C ∈ C(x)} f_C(x, y),  with f_C(x, y) = f_C(x, y_C)
  [Figure: label sequence JJ NNS NN over the words "fake news show"]
• Efficient dynamic programming for tree-structured interactions
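A minimal sketch (not from the talk) of such a structured linear score, with global features that decompose over local domains; here the domains are label bigrams over a short word sequence, and all feature names and weights are invented.

```python
# Structured linear scoring with features summed over local domains.
from collections import Counter

def local_features(words, labels, i):
    """f_C for the clique at position i: depends only on (labels[i-1], labels[i])."""
    prev = labels[i - 1] if i > 0 else "<s>"
    return Counter({
        f"word={words[i]}|label={labels[i]}": 1.0,
        f"bigram={prev}->{labels[i]}": 1.0,
    })

def global_features(words, labels):
    """F(x, y) = sum over local domains of f_C(x, y_C)."""
    F = Counter()
    for i in range(len(words)):
        F.update(local_features(words, labels, i))
    return F

def score(w, words, labels):
    """w . F(x, y)."""
    F = global_features(words, labels)
    return sum(w.get(feat, 0.0) * val for feat, val in F.items())

# Toy usage: the weight vector is hypothetical.
w = {"word=fake|label=JJ": 1.0, "bigram=JJ->NN": 0.5}
print(score(w, ["fake", "news", "show"], ["JJ", "NN", "NN"]))
```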
Learning
• Prior knowledge
  • local domains C(x)
  • local feature functions f_C
• Adjust w to optimize an objective function on some training data
  w* = arg min_w λ‖w‖² + Σ_i L(x_i, y_i; w)   [regularizer + loss]
Margin
• Score advantage between correct and candidate classifications
  m(x, y, y′; w) = w · F(x, y) − w · F(x, y′)
Losses
• Log loss ⇒ maximize probability of correct output
  L(x, y; w) = log Σ_{y′} e^(−m(x, y, y′; w))
• Hamming loss ⇒ minimize distance-adjusted misclassification
  L(x, y; w) = max_{y′} [d(y, y′) − m(x, y, y′; w)]_+
• Search over y′: dynamic programming on "good" graphs
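An illustrative brute-force computation of the two losses above over all candidate taggings y′ (only feasible for tiny examples; real systems use the dynamic programming mentioned on the slide). The score function stands for any w · F(x, y).

```python
# Brute-force log loss and Hamming (distance-adjusted hinge) loss.
import itertools, math

def hamming(y, y_prime):
    """d(y, y'): number of positions where the two labelings disagree."""
    return sum(a != b for a, b in zip(y, y_prime))

def log_loss(score, x, y, label_set):
    """L(x, y; w) = log sum_{y'} exp(-m(x, y, y'; w)); y is a label tuple."""
    s_y = score(x, y)
    total = sum(math.exp(score(x, yp) - s_y)
                for yp in itertools.product(label_set, repeat=len(x)))
    return math.log(total)

def hinge_hamming_loss(score, x, y, label_set):
    """L(x, y; w) = max_{y'} [d(y, y') - m(x, y, y'; w)]_+ ."""
    s_y = score(x, y)
    worst = max(hamming(y, yp) - (s_y - score(x, yp))
                for yp in itertools.product(label_set, repeat=len(x)))
    return max(0.0, worst)
```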
Online Training
• Process one training instance at a time
• Very simple
• Predictable runtime, small memory
• Adaptable to different loss functions
• Basic idea:
  w = 0
  for t = 1, …, T:
    for i = 1, …, N:
      classify x_i, incurring loss l
      update w to reduce l
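A sketch of that basic loop in code, instantiated with the simplest update I know (a structured-perceptron-style step) rather than the talk's specific rules; feats and decode are assumed helper functions.

```python
# Generic online training loop over structured examples.
def online_train(data, feats, decode, T):
    """data: list of (x, y) pairs; feats(x, y): feature dict F(x, y);
    decode(x, w): best-scoring structure under weights w."""
    w = {}
    for t in range(T):
        for x, y in data:
            y_hat = decode(x, w)          # classify x_i
            if y_hat != y:                # incurred a loss
                for f, v in feats(x, y).items():       # move toward correct
                    w[f] = w.get(f, 0.0) + v
                for f, v in feats(x, y_hat).items():   # and away from wrong
                    w[f] = w.get(f, 0.0) - v
    return w
```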
Online maximum margin (MIRA)
• Project onto the subspace where the correct structure scores "far enough" above all incorrect ones
  w = 0
  for t = 1, …, T:
    for i = 1, …, N:
      w ← arg min_{w′} ½ ‖w′ − w‖²
          s.t. ∀y : w′ · F(x_i, y_i) − w′ · F(x_i, y) ≥ d(y_i, y)
• Exponentially many y's: select best k instead (sketch below)
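A hedged sketch of the MIRA step with k = 1: keeping only the constraint for the single best-scoring incorrect structure, the quadratic program has the closed-form solution below. The helpers feats, decode, and hamming are assumptions, not the talk's code.

```python
# One-best MIRA update: smallest change to w that satisfies the margin constraint.
def mira_update(w, x, y, feats, decode, hamming):
    y_hat = decode(x, w)                     # best (possibly wrong) structure
    if y_hat == y:
        return w
    delta = dict(feats(x, y))                # delta = F(x, y) - F(x, y_hat)
    for f, v in feats(x, y_hat).items():
        delta[f] = delta.get(f, 0.0) - v
    margin = sum(w.get(f, 0.0) * v for f, v in delta.items())
    norm_sq = sum(v * v for v in delta.values())
    if norm_sq == 0.0:
        return w
    # Smallest step making the margin at least the Hamming distance d(y, y_hat).
    tau = max(0.0, (hamming(y, y_hat) - margin) / norm_sq)
    return {**w, **{f: w.get(f, 0.0) + tau * v for f, v in delta.items()}}
```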
Analysis by Tagging
[Diagram: input x = x1 ⋯ xn → structured classifier → output y = y1 ⋯ yn]
• Labels give the role of corresponding inputs
• Information extraction
• Part-of-speech tagging
• Shallow parsing
• Gene prediction
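As an illustration of the "structured classifier" box, here is a generic Viterbi decoder for a bigram-factored tagging score; the label set and score function are stand-ins, not the talk's actual model.

```python
# Viterbi decoding of the best label sequence under a bigram-factored score.
def viterbi(words, labels, score):
    """score(words, i, prev_label, label) -> local score; returns the best tag sequence."""
    n = len(words)
    best = {y: score(words, 0, None, y) for y in labels}
    back = []
    for i in range(1, n):
        prev_best, best, ptr = best, {}, {}
        for y in labels:
            p, s = max(((py, prev_best[py] + score(words, i, py, y))
                        for py in labels), key=lambda t: t[1])
            best[y], ptr[y] = s, p
        back.append(ptr)
    y_last = max(best, key=best.get)
    path = [y_last]
    for ptr in reversed(back):          # walk the backpointers right to left
        path.append(ptr[path[-1]])
    return list(reversed(path))
```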
Metrics
              positive   negative
  correct        TP         TN
  incorrect      FP         FN
• precision / specificity:  Sp = P = TP / (TP + FP)
• recall / sensitivity:     Sn = R = TP / (TP + FN)
• F1 = 2PR / (P + R)
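The same metrics as code, with hypothetical counts:

```python
# Precision, recall, and F1 from true/false positive and false negative counts.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)        # Sp = P
    recall = tp / (tp + fn)           # Sn = R
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(prf(tp=80, fp=20, fn=20))       # -> (0.8, 0.8, 0.8)
```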
Features
• Conjunctions of
  • Label configuration
  • Input properties
    • Term identity
    • Membership in term list
    • Orthographic patterns
  • Conjunctions of these for current and surrounding words
• Feature induction: generate only those conjunctions that help prediction
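A rough sketch of the kind of feature conjunctions listed above (orthographic patterns, term-list membership, and neighboring words conjoined with the label); the term list and patterns are placeholders, not the actual system's.

```python
# Conjunction features over word properties for the current and surrounding words.
import re

GENE_TERMS = {"beta-catenin", "axin"}      # hypothetical term list

def word_properties(word):
    props = [f"lower={word.lower()}"]
    if word in GENE_TERMS:
        props.append("in-gene-list")
    if re.search(r"\d", word):
        props.append("has-digit")
    if re.fullmatch(r"[A-Z][a-z]+", word):
        props.append("init-cap")
    return props

def conjunction_features(words, i, label):
    feats = []
    for offset in (-1, 0, 1):                       # surrounding words
        j = i + offset
        if 0 <= j < len(words):
            for p in word_properties(words[j]):
                feats.append(f"{offset}:{p}&label={label}")
    return feats
```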
Gene/protein results
  features        log loss   Hamming   FP+2FN
  textual           81.2       81.8     84.6
  +clusters         82.6       82.2     85
  +dictionaries     82.5       83.3     85.7
  all               83.5       84.4     86.4
http://fable.chop.edu/index.jsp
Gene Structure
[Diagram: a DNA sequence segmented into gene-structure states: intergenic (Igenic), first exon (Einit), intron, internal exon (E1), intron, last exon (Eterm), intergenic (Igenic).]
Gene Prediction
• Accurate predictions very important for biomedical research uses
• Ambiguous evidence
  • Especially for first and last exons
• Many noisy evidence sources
• Only incremental advances in machine-learning methods in the last ten years
Training Methods
[Diagram: previous approaches train each component separately from the annotated training set (signal detector, content sensor, gene structure parameters) and then assemble the resulting parameters into a gene model. Our approach updates a single parameter set: predict genes on the training set, compare the prediction with the annotation, compute a parameter update, and repeat, checking convergence of the parameters against predictions on an annotated development set; the converged parameters define the gene model.]
Possible Analyses
[Diagram: state machine over gene structures on both strands (+/−), from INI to END, with an intergenic state IG and exon/intron states E0, E1, E2, Einit, Eterm, Esingle, Is0, Is1, Is2, IL0, IL1, IL2.]
Gene Features
• Statistics of protein-coding DNA
  • Individual codons and bicodons
• Length of exon states: semi-Markov model
• Motifs and motif conjunctions
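For the first feature family, a small sketch of codon and bicodon counting over a candidate coding region (reading frame 0); the example sequence is made up.

```python
# Codon and bicodon frequency statistics over a DNA string.
from collections import Counter

def codon_counts(dna):
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    bicodons = [a + b for a, b in zip(codons, codons[1:])]
    return Counter(codons), Counter(bicodons)

codons, bicodons = codon_counts("ATGGCCATTGTAATG")
print(codons.most_common(3), bicodons.most_common(2))
```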
Transcript and gene-level accuracies are respectively 30% and 30.6% better than Augustus, the second-best program overall. This means that our better accuracy results obtained in the first two single-gene sequence sets scale well to chromosomal regions with multiple, alternatively-spliced genes.
Level           GenScan       Genezilla     GenScan++     Augustus      CRAIG
                Sn     Sp     Sn     Sp     Sn     Sp     Sn     Sp     Sn     Sp
Base            84.0   62.1   87.6   50.9   76.7   79.3   76.9   76.1   84.4   80.8
Exon  All       59.6   47.7   62.5   50.5   51.6   64.8   52.1   63.6   60.8   72.7
      Initial   28.0   23.5   36.4   25.0   25.5   47.8   34.7   38.1   37.3   55.2
      Internal  72.6   54.3   73.9   63.2   68.0   62.8   59.1   74.7   71.7   81.2
      Terminal  33.0   31.6   36.7   28.5   25.7   53.9   37.6   45.5   33.3   52.6
      Single    28.1   31.0   44.1   14.5   35.0   45.7   43.9   25.5   55.9   26.4
Transcript       8.1   11.4   10.3    9.9    6.0   17.0   10.9   16.9   13.5   23.8
Gene            16.7   11.4   20.6    9.9   12.5   17.0   22.3   16.9   26.6   23.8

Table 4: Accuracy Results for ENCODE294.
Sensitivity (Sn) and specificity (Sp) results for each level and exon type.
Dataset      GenScan       Genezilla      GenScan++      Augustus
BGHM953      0.03          1.66 x 10^-6   5.2 x 10^-66   1.3 x 10^-5
TIGR251      2.2 x 10^-7   0.22           1.4 x 10^-23   1.4 x 10^-25
ENCODE294    0.17          >= 0.5         2.33 x 10^-22  4.7 x 10^-16
ENm002: Prediction of Gene SLC22A4
[Figure: genome browser tracks over positions 380000–440000 of ENCODE region ENm002, comparing the HAVANA annotation of SLC22A4 with predictions from CRAIG, Augustus, GenScan, GenScan++, Genezilla, and HMMGene.]
ENm005: Prediction of Gene IFNAR1
[Figure: genome browser tracks over positions 960000–1020000 of ENCODE region ENm005, comparing the HAVANA annotation of IFNAR1 with predictions from CRAIG, Augustus, GenScan, GenScan++, Genezilla, and HMMGene.]
Higher Accuracy Gene Prediction with CRAIG
Learning to Parse
[Figure: dependency tree for "John saw Mary in the park", rooted at "saw", compared with the lexicalized phrase-structure tree S(saw), VP(saw), PP(in), NP(park) over PN(John), V(saw), PN(Mary), P(in), D(the), N(park).]
Why Dependencies?
• Capture meaningful predicate-argument relations in language
• O(n³) parsing with a small grammar constant, vs. O(n⁵) with a large grammar constant for lexicalized phrase structure
• Useful in many language processing tasks
  • machine translation
  • relation extraction
  • question answering
  • textual entailment
Edge-Based Factorization
• Basic (first-order) version: all the edges
  s(i, j) = w · f(i, j)
  s(x, y) = w · F(x, y) = Σ_{(i,j)∈y} s(i, j)
[Excerpt from the paper, section 2 "Dependency Parsing and Spanning Trees", 2.1 "Edge Based Factorization": "In what follows, x = x1 · · · xn represents a generic input sentence, and y represents a generic dependency tree for sentence x. Seeing y as the set of tree edges, we write (i, j) ∈ y if there is a dependency in y from word xi to word xj. In this paper we follow the common method of factoring the score of a dependency tree as the sum of the scores of all edges in the tree (Eisner, 1996). In particular, we define the score of an edge to be the dot product between a high dimensional feature representation of the edge and a weight vector: s(i, j) = w · f(i, j)."]
Parse Scoring
[Figure: dependency tree for "root(0) John(1) hit(2) the(3) ball(4) with(5) the(6) bat(7)"]
  s(x, y) = s(0, 2) + s(2, 1) + s(2, 4) + s(4, 3) + s(2, 5) + s(5, 7) + s(7, 6)
[Excerpt from the paper: "... long history (Hudson, 1984). Figure 1 shows a dependency tree for the sentence, John hit the ball with the bat. A dependency tree must satisfy the tree constraint: each word must have exactly one incoming edge."]
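A small sketch of this edge-factored scoring, using the tree above; the edge feature names and weights are invented.

```python
# Edge-factored parse score: s(x, y) = sum over tree edges of w . f(head, mod).
def edge_features(words, head, mod):
    return {f"head={words[head]}|mod={words[mod]}": 1.0,
            f"dir={'R' if head < mod else 'L'}": 1.0}

def tree_score(w, words, heads):
    """heads[m] = index of the head of word m; index 0 is the artificial root."""
    total = 0.0
    for mod, head in heads.items():
        for f, v in edge_features(words, head, mod).items():
            total += w.get(f, 0.0) * v
    return total

words = ["<root>", "John", "hit", "the", "ball", "with", "the", "bat"]
heads = {1: 2, 2: 0, 3: 4, 4: 2, 5: 2, 6: 7, 7: 5}   # the tree scored above
print(tree_score({"head=hit|mod=ball": 2.0}, words, heads))
```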
Finding the Best Parse
  y* = arg max_y s(x, y)
• General idea: maximum spanning tree
[Figure: weighted directed graph over root, John, saw, Mary with candidate edge scores; the best parse is the highest-scoring spanning tree rooted at root.]
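A naive illustration of the argmax: enumerate every head assignment for a tiny sentence, keep only the valid trees, and return the highest-scoring one. This is only for illustration (the practical algorithms are on the next slide), and the edge scores are invented, not the figure's.

```python
# Brute-force best dependency parse for very short sentences.
import itertools

def is_tree(heads, n):
    """heads[m] for m = 1..n; valid iff every word reaches the root 0 without a cycle."""
    for m in range(1, n + 1):
        seen, cur = set(), m
        while cur != 0:
            if cur in seen:
                return False          # cycle
            seen.add(cur)
            cur = heads[cur]
    return True

def best_parse(n, s):
    """s[(head, mod)] = edge score; words are 1..n, 0 is the root."""
    best, best_score = None, float("-inf")
    for assignment in itertools.product(range(n + 1), repeat=n):
        heads = {m: assignment[m - 1] for m in range(1, n + 1)}
        if any(heads[m] == m for m in heads) or not is_tree(heads, n):
            continue
        total = sum(s.get((h, m), 0.0) for m, h in heads.items())
        if total > best_score:
            best, best_score = heads, total
    return best, best_score

# Toy scores over "root(0) John(1) saw(2) Mary(3)"
s = {(0, 2): 10.0, (2, 1): 30.0, (2, 3): 30.0, (0, 1): 9.0, (1, 2): 20.0}
print(best_parse(3, s))   # -> ({1: 2, 2: 0, 3: 2}, 70.0)
```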
Inference Algorithms
• Nested dependencies
  • O(n³) Eisner algorithm
  • dynamic programming
  • related to CKY for probabilistic CFGs
• Crossing dependencies
  • O(n²) Chu-Liu-Edmonds algorithm
  • greedy
Features
[Figure: edge feature templates built from the words and part-of-speech tags of the sentence.]
Parsing Multiple Languages
• Same methods, standard feature templates
  DATA SET     UA    LA
  Arabic       79.3  66.9
  Bulgarian    92.0  87.6
  Chinese      91.1  85.9
  Czech        87.3  80.2
  Danish       90.6  84.8
  Dutch        83.6  79.2
  German       90.4  87.3
  Japanese     92.8  90.7
  Portuguese   91.4  86.8
  Slovene      83.2  73.4
  Spanish      86.1  82.3
  Swedish      88.9  82.5
  Turkish      74.7  63.2
  AVERAGE      87.0  80.8
  (UA: unlabeled attachment accuracy; LA: labeled attachment accuracy)
What’s Next?
• Reduce training data requirements
• Structural correspondence learning: unsupervised adaptation from one (training) domain to another
• What if inference is intractable? Can we approximate?
• Self-training: introduce diversity in learning so that the trained predictor can estimate its own confidence and label new training data