Notes - desdevises

Phylogenetic
Reconstruction
Yves Desdevises
Université Pierre et Marie Curie (Paris 6)
Observatoire Océanologique de Banyuls
France
desdevises@obs-banyuls.fr
http://desdevises.free.fr
http://desdevises.free.fr/Phylogenetic_reconstruction
1
References
• Felsenstein J. 2004. Inferring phylogenies. Sinauer.
• Lemey P., Salemi M. & Vandamme A.-M. 2009. The phylogenetic handbook. Second Edition. Cambridge University Press.
• Hall B. 2007. Phylogenetic trees made easy. Third Edition.
Sinauer.
• Page R. & Holmes E. 1998. Molecular evolution: a
phylogenetic approach. Blackwell.
• Nei M. & Kumar S. 2000. Molecular Evolution and
Phylogenetics. Oxford University Press.
2
• Goal: propose a hypothesis of relationships
between several taxa
• Phylogeny = tree (≠ ladder)
• Speciation: binary
• Based on homology: similarity from a common
ancestor
• Indicates the existence of a common ancestor
• Identified from a phylogenetic tree, and basis
to build it!
3
Phylogenetic trees
[Figure: example phylogenetic trees of fish species (Symphodus, Labrus, Ctenolabrus, Cheilinus, Epibulus, Stetojulis, Halichoeres, Labropsis, Anampses, Labroides, Labrichthys, Coris, Hemigymnus, Thalassoma, Pictilabrus, Notolabrus, Bodianus, Clepticus, and Pagrus major), drawn in several layouts]
4
• Cladogram
• No branch lengths
• Clades
• Phylogram
• Branch lengths
Ultrametric tree
Additive tree
5
[Figure: tree of terminal taxa A to J, labelled with leaves (= terminal taxa), a clade, terminal branches, internal branches, a polytomy, a node and the root]
6
• Speciation
7
Hypothesis
A
B
C
8
Rooting
• Gives the branching order
• Use of an outgroup
• Rest = ingroup
[Figure: an unrooted tree becomes a rooted tree once an outgroup is added]
9
• Outgroup: sister taxon of the ingroup
• Characters shared between outgroup and ingroup = ancestral characters
• Sometimes no outgroup: root the tree at equal distance from the tips (needs branch lengths) = midpoint rooting
[Figure: two trees of taxa A to F]
10
• Groups
• Monophyletic (clade): natural
group
• Mammals
• Paraphyletic
• Reptiles
• Polyphyletic
• Algae, protozoans
11
Characters
• Organisms are composed of different features
• These features are different among taxa: Character
states
• All character states form a character
• These states are produced by heritable changes
• Phylogenetic inference is performed from
differences between character states
12
• We want to establish the ancestor-descendant link
from the presence/absence of character states
• We look for the appearance of new character
states in descendants
• The different character states are homologies
• Taxa sharing this new character state (derived) form
clades
• Example: hair in mammals
• Characters can be differentially weighted
13
• Homology
14
15
• Homoplasy
16
• Ancestral characters: plesiomorphies
• Shared ancestral characters: symplesiomorphies
• Derived characters: apomorphies
• Shared derived characters: synapomorphies
• Ideally, identify clades
• Non shared derived characters = particular to a
given taxon: autapomorphies
17
18
Homology
• Homologies are supposed to show similarities in:
• position
• structure
• development
• A recognized criterion to support homology is the
congruence with other characters
19
Dog
Lizard
Frog
Human
Change
HAIR
Absent
Present
20
Homoplasy
• Non homologous similarities
• Results from independent evolution
• Convergence
• Parallelism
• Reversion
• Blurs phylogenetic signal: may lead to false
evolutionary relationships
21
Parallelism
Convergence
Reversion
22
[Figure: character TAIL (absent/present) mapped on two trees of Frog, Lizard, Dog and Human]
23
• Without homoplasy, phylogenetic inference
would be easy
• Main problem of phylogenetic reconstruction: discriminate homoplasy (noise) from homology (signal)
• Data quality (“good” phylogenetic signal) is
more important than method used
24
• If there is only one correct tree, when characters
support different trees, at least one contains
homoplasies
[Figure: characters HAIR (absent/present) and TAIL (absent/present), each mapped on a tree of Frog, Lizard, Dog and Human]
25
Congruence
• The chosen tree is the tree maximising the number
of congruent characters
[Figure: tree of Frog, Lizard, Dog and Human; changes in HAIR, MILK, ... support the clade MAMMALS (Dog, Human)]
26
Case of molecular data
• Homoplasy is more common with molecular than
morphological data
• Few states (4 for DNA: A G C T)
• Chemically close
• Evolutionary rates can be high
• No identification of homoplasy via structure or
development
27
Data
• Fossils: rare
• Morphological characters
• Molecular character: DNA, proteins, ...
• By far the most used now: models, numerous
characters, less subjective, ...
• But... phylogeny of the DNA fragment (≠ taxa)
• Future: genomes ➙ phylogenomics
• Others (behaviour, hosts, habitat, ...)
28
Morphological data
• Homology difficult to identify
• Characters often not numerous: problem when
studying many taxa, especially if they are closely
related
• Some subjective decisions
• Evolutionary processes poorly known: this limits the choice of method
• Require coding
• Sometimes difficult
• Hypotheses on character evolution
29
Coding
• Binary: Presence/absence = 0/1
• Multiple states (ordered or not): definition of
step numbers between states
• Additive binary coding: e.g. 00, 01, 10, 11
• Linear coding: e.g. 0, 1, 2
• Both can be combined
30
31
Molecular data
• Nucleotides or amino acids (for ancient divergences)
• Characters = base (or AA) positions
• Character states = base (or AA) identity
• Important step: alignment
• Sometimes manual
• Automated methods: manual editing required
• No test: no null hypothesis
• Can use information on secondary structure or
coding nature
32
• Nucleotides: only 4 states (in 2 types)
• Evolution can be modelled
• Homoplasy “easy”
33
• Amino acids
• 20 states
• 5 categories
• Evolution much
more difficult to
model
• Codons
• 61 states!
34
• Gene tree ≠ species tree
• Genes: orthologous or paralogous
[Figure: duplication of an ancestral gene giving two sets of copies (a, b, c and A, B, C), with orthologs and paralogs indicated; a tree is built from the starred genes (A*, b*, C*)]
35
Alignment
<---------------(--------------------HELIX 19---------------------)
<---------------(22222222-000000-111111-00000-111111-0000-22222222
Thermus ruber    UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGA
Th. thermophilus UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGA
E.coli           UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGA
Ancyst.nidulans  UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGA
B.subtilis       UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGA
Chl.aurantiacus  UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGA
match            **        ***        * ** ** *                 **
• Hypothesis of positional homology between nucleotides
or AA
• Methods
• Manual (Seaview, BioEdit, Se-Al, ...)
• Automated (ClustalX, MAFFT, POY, MUSCLE, TCoffee, ...)
• Combination (what we do)
36
• Alignment easy or not
• Coding sequence or not
• Use AA (codons) for alignment
• Consider AA types (size, polarity,
hydrophobicity)
• Sequences may be more or less divergent
• Homology can be variable within regions
• Alignment performed by adding insertion-deletion events (indels) via gaps: limited by penalties (unless at sequence ends)
37
• Goal of automated alignment: maximise
alignment score
• Example
[Figure: dot plot of the two sequences]
GATTC
GAATTC
We define:
Match = +1
Mismatch = 0
Indel = -1
38
GA-TTC
GAATTC
Column scores: 1 1 -1 1 1 1
Score = 4

G-ATTC
GAATTC
Column scores: 1 -1 1 1 1 1
Score = 4

GATTC-
GAATTC
Column scores: 1 1 0 1 0 -1
Score = 2

2 optimal alignments (score = 4)
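The scoring scheme above (match = +1, mismatch = 0, indel = -1) can be checked with a short sketch. The function names are illustrative assumptions, and the dynamic programming routine is the classical Needleman-Wunsch recursion, used here as one possible way to maximise the alignment score, not as the method of any particular program.

```python
def score_alignment(top, bottom, match=1, mismatch=0, indel=-1):
    """Score two already-aligned strings of equal length ('-' marks a gap)."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += indel
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def best_global_score(s, t, match=1, mismatch=0, indel=-1):
    """Best global alignment score by dynamic programming (Needleman-Wunsch)."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * indel          # prefix of s aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * indel          # prefix of t aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            score[i][j] = max(diagonal,
                              score[i - 1][j] + indel,
                              score[i][j - 1] + indel)
    return score[-1][-1]


print(score_alignment("GA-TTC", "GAATTC"))   # first optimal alignment
print(score_alignment("GATTC-", "GAATTC"))   # suboptimal alignment
print(best_global_score("GATTC", "GAATTC"))  # best achievable score
```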
39
• Need to define a gap opening penalty (GOP) and a gap extension penalty (GEP), the latter generally lower (favours extending existing gaps over opening new gaps everywhere in the alignment)
• GOP and GEP may vary along sequences, because of gap presence and biochemical features (e.g. hydrophilic AA)
• Substitutions can be differentially weighted (some easier than others; e.g. for AA: BLOSUM 62 matrix)
40
• Analytically complex problem: the “best” alignment cannot be guaranteed as the number of sequences rises (multiple alignment)
• Progressive alignment (e.g. Clustal)
• Estimation of a guide tree (NJ) from pairwise alignments
• The closest sequences are aligned first, then increasingly distant ones
• Fast but no optimality criterion
41
• Global or local alignment
• Global: considers the whole sequence length. Good if divergence is low and sizes are similar
• Local: by region. Better if variable regions
• Hybrid (semiglobal or glocal)
42
• Informative regions can be automatically
selected after the alignment, by removing
badly aligned parts
• GBlocks
• Several options are available to modify
the stringency of selection
43
Saturation
• Multiple hits
• Multiple substitutions at the same site
• At fast evolving sites
Seq 1 AGCGAG
Seq 2 GCGGAC
[Figure: numbered successive substitutions (1 to 3) at a single site in Seq 1 and Seq 2]
44
• rRNA small subunit
45
• 3 observable changes
• 12 actual changes
46
• Detection
• Plot transitions (Ti) vs transversions (Tv)
• Plot % differences between sequences vs time (if
available)
• Plot uncorrected vs corrected distances
47
Saturation
No saturation
(Jukes-Cantor)
48
• Correction
• Use evolutionary model to correct divergence
between sequences
• Remove fast evolving sites (e.g. third codon
position)
• Use different weights for Ti and Tv
• Use only Tv
• Use more slowly evolving sequences
49
Bias
• Long branch attraction
• If the method assumes that all sites change
at the same rate
[Figure: long branch attraction. True tree and inferred tree for taxa A, B, C and D, with branch lengths p and q]
50
• Codon usage bias: some codons more used for the
same AA
51
• Base compositional differences in lineages (LogDet, heterogeneous ML)
• Example: % GC in thermophilic bacteria
[Figure: true tree vs inferred tree for Aquifex (73% G+C), Thermus (72%), Bacillus (50%) and Deinococcus (52%)]
52
Optimality criteria
• To choose the “best tree”
• Hypothesis on how evolution works
• Different in different methods
• Number of steps
• Sum of branch lengths
• Likelihood
53
Several methods
The best??
• Parsimony
• Distance
• Maximum likelihood
• Bayesian inference
• If there is an optimality criterion, topologies
must be compared to find the best
54
Topologies: number
• Number of unrooted trees (for n taxa)
\prod_{i=3}^{n} (2i-5) = (2n-5)(2n-7)\cdots(3)(1)
• Number of rooted trees
\prod_{i=2}^{n} (2i-3) = (2n-3)(2n-5)\cdots(3)(1)
• Examples
• 5 taxa: 105 rooted trees
• 8 taxa: 135 135
• 10 taxa: 34 459 425
• 50 taxa: 3 × 10^74 (> atoms in the universe!!)
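The two products can be evaluated directly. The sketch below (function names are illustrative) reproduces the rooted counts quoted for 5, 8 and 10 taxa; for 50 taxa, the order of magnitude 10^74 corresponds to the unrooted count, the rooted count being 97 times larger.

```python
def n_unrooted(n):
    """Number of unrooted binary trees for n taxa: product of (2i - 5), i = 3..n."""
    count = 1
    for i in range(3, n + 1):
        count *= 2 * i - 5
    return count


def n_rooted(n):
    """Number of rooted binary trees for n taxa: product of (2i - 3), i = 2..n."""
    count = 1
    for i in range(2, n + 1):
        count *= 2 * i - 3
    return count


for taxa in (5, 8, 10):
    print(taxa, n_rooted(taxa))
print(50, f"{n_unrooted(50):.2e}", f"{n_rooted(50):.2e}")
```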
55
• Algorithms to explore the treespace
• Exhaustive search if few taxa (10-12 for parsimony): examines all topologies
• Branch and Bound: partly explores treespace (about 20 taxa in parsimony), efficient
• Heuristic search, not guaranteed to find the best tree, but faster: finds a “good” tree via a guided agglomeration procedure and rearranges it to find a better tree
56
Treespace
Suboptimal island of
trees
Global optimum
Starting trees
“Treespace”
57
• Rearrangements:
• NNI = Nearest Neighbour Interchange
• Faster but less rigorous than other techniques
58
• SPR = Subtree Pruning Regrafting
59
• TBR = Tree Bisection Reconnection
• More rigorous but slower
• Launch several independent searches with
exhaustive algorithm
60
Parsimony
61
Cladistics
• Two lineages are more closely related to
each other than to another if they share a
more recent common ancestor
• Phylogenetic hypotheses = hypothesis of a
common ancestor
• Associated to reconstruction via parsimony
• MP = Maximum Parsimony
62
Parsimony
• “Ockham’s razor”
Pluralitas non est ponenda sine necessitate
• Favour simplest solution
• Choose between competing phylogenetic
hypotheses
• Maximize congruences and minimize homoplasies
• Assess character fit to trees
• Method based on individual characters
63
Character fit
• Minimum number of steps (from one state to
another) required to explain the observed
distribution of character states
• This is determined by character optimisation (mapping)
via parsimony
• Optimisation is different on different trees
• Changes may be non unique for a single tree with a
given number of steps: branch length may not be
defined
64
Example
[Figure: character Hair (absent/present) mapped on two trees of Frog, Crocodile, Bird, Kangaroo, Bat and Human: 1 step on one tree, 2 steps on the other]
65
Parsimony analysis
• For a set of characters, determine the fit (number of steps) of each character to the tree
• The sum over all characters (× weights, if any) is called tree length
• The most parsimonious trees (MPT) are those with the smallest length
• Informative character: at least 2 states, each present in at least 2 taxa
• Optimality criterion (= objective function): number of steps = tree length
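Counting the steps of one unordered character on a given tree can be sketched with Fitch's algorithm. The two topologies below are hypothetical (they are not taken from the slides) and are written as nested tuples; hair is coded 1 in Kangaroo, Bat and Human and 0 elsewhere.

```python
def fitch(tree, states):
    """Return (possible root states, number of steps) for a binary tree.

    tree: a taxon name (str) or a pair (left_subtree, right_subtree).
    states: dict mapping each taxon name to its character state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch(left, states)
    right_set, right_steps = fitch(right, states)
    shared = left_set & right_set
    if shared:                       # children agree: no extra step
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


hair = {"Frog": 0, "Crocodile": 0, "Bird": 0, "Kangaroo": 1, "Bat": 1, "Human": 1}

# Hypothetical tree in which the hairy taxa form a clade
tree_1 = ("Frog", (("Crocodile", "Bird"), ("Kangaroo", ("Bat", "Human"))))
# Hypothetical tree in which Kangaroo groups with Crocodile
tree_2 = ("Frog", (("Crocodile", "Kangaroo"), ("Bird", ("Bat", "Human"))))

print(fitch(tree_1, hair)[1])  # steps required on tree_1
print(fitch(tree_2, hair)[1])  # steps required on tree_2
```

Summing these step counts over all characters gives the tree length used as the optimality criterion.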
66
• Several MPT may be obtained
• Several trees: consensus
• Trees give hypotheses on character evolution
• Branch lengths: number of changes. Generally
underestimated. Not the objective in MP
• Several indices to assess fit between tree and data
(Consistency Index, Retention Index, ...)
67
Consensus
• Strict
• Semi-strict
• Majority-rule
68
Character types
• Different costs for state change
• Wagner (ordered, additive): morphology
0→1→2
• Fitch (non ordered, non additive, equal costs):
DNA, AA, morphology
A ⎯ G
T ⎯ C
69
• Sankoff (generalized)
A ⎯ G 1 step
T ⎯ C 5 steps
• Typical example: different weights for transitions
and transversions
• Symmetrical or asymmetrical costs
70
Stepmatrices
Purines (Pu): A, G
Pyrimidines (Py): C, T
Transitions (Ti): Pu ↔ Pu, Py ↔ Py
Transversions (Tv): Pu ↔ Py
Cost of a change, from (rows) to (columns):
       A  C  G  T
  A    0  5  1  5
  C    5  0  5  1
  G    1  5  0  5
  T    5  1  5  0
Transitions are the easiest changes
Transversions are more numerous
71
Generalized parsimony
• = Weighted parsimony
• Different costs for different changes
• Minimize costs sum = global cost
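A minimal sketch of generalized (Sankoff) parsimony for one nucleotide site, assuming the cost scheme of the stepmatrix (transition = 1, transversion = 5). The trees and the helper names are hypothetical illustrations.

```python
INF = float("inf")
BASES = "ACGT"
PURINES = {"A", "G"}


def cost(x, y):
    """Cost of changing base x into base y: 0 if equal, 1 for Ti, 5 for Tv."""
    if x == y:
        return 0
    return 1 if (x in PURINES) == (y in PURINES) else 5


def sankoff(tree):
    """Map each base to the minimum cost of the subtree if its root has that base.

    tree: a single base (str) for a leaf, or a pair (left, right).
    """
    if isinstance(tree, str):
        return {b: (0 if b == tree else INF) for b in BASES}
    left, right = (sankoff(child) for child in tree)
    return {
        b: min(cost(b, x) + left[x] for x in BASES)
           + min(cost(b, y) + right[y] for y in BASES)
        for b in BASES
    }


def weighted_length(tree):
    """Global cost of the site on the tree (minimum over root states)."""
    return min(sankoff(tree).values())


print(weighted_length((("A", "G"), ("C", "T"))))  # two transitions, one transversion
print(weighted_length((("A", "C"), ("A", "C"))))  # two transversions
```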
72
• Problem: define costs
• Knowledge on molecular evolution is used to
define costs
• Transitions/transversions (Ti/Tv, numbers or
rate)
• Substitution rate heterogeneity, e.g. for
different codon position
73
Algorithms
1. Calculate topologies
2. Optimize all characters and calculate length
• Long if many taxa
• Algorithms
• Exhaustive search if few taxa (about 10): examines all
topologies
• Branch and Bound: partly explores treespace, for about 20 taxa, efficient
• Heuristic search, not guaranteed to find the best tree, but faster: finds “good” trees and rearranges them
74
Parsimony - Advantages
• Simple
• No explicit evolutionary model
• Tree and character evolution
• Good if homoplasy rare
• Good for morphological characters
75
Parsimony - Drawbacks
• Problem if many homoplasy, or concentrated in
some regions
• Long branch attraction (Felsenstein Zone)
• Underestimates branch lengths
• Implicit evolutionary model: behaviour may not be
clear
• More justified on philosophical than numerical
bases
76
Maximum likelihood
77
• Maximum Likelihood = ML
• Method based on individual characters
• Uses an explicit evolutionary model
• MP sometimes considered as a special case of ML
• The most computationally complex method
• Model very important: only for molecular data
78
Principle
• Answers the question:
What is the probability of observing the data given the evolutionary model (process and tree)?
• Pr(D|T)
• Estimation of model parameter values to maximize this probability: likelihood
• Of course, we look for the tree (topology and
length)
• Compute likelihood for all topologies: heuristic
algorithm
79
Nucleotides
Probability of
A : AACG
B : ACCG
C : AACA
D : AATG
Given
[Figure: tree with tips A, B, C, D]
P =
     A  C  G  T
A    a  b  c  d
C    b  a  e  f
G    c  e  a  g
T    d  c  f  a
π = [A, C, G, T]
80
Parameters
• Base frequencies: π = [A, C, G, T]
• Sum = 1
• Substitution rates: P matrix
• Row sum = 1
• Function of bases and time (branch lengths)
• Heterogeneity: Γ
• Tree
• Topology
• Branch lengths
P =
     A  C  G  T
A    a  b  c  d
C    b  a  e  f
G    c  e  a  g
T    d  c  f  a
[Figure: tree with tips A, B, C, D]
81
Substitution rate heterogeneity
Parameter: α
- high: rate = 1 at all sites
- small (0.5): few changes for most sites
- 0: all rates different
In practice, a discrete distribution with 4 classes gives good results
82
• The probability of observing a given sequence is the product of frequencies (composition) by substitution rates (considering branch lengths)
Example
P (for a given branch length b) =
      A      C      G      T
A   0.976  0.01   0.007  0.007
C   0.002  0.983  0.005  0.01
G   0.003  0.01   0.979  0.007
T   0.002  0.013  0.005  0.979
π = [0.1, 0.4, 0.2, 0.3]
Sequences CCAT and CCGT, separated by a branch of length b
Likelihood = π_C P_{C→C} × π_C P_{C→C} × π_A P_{A→G} × π_T P_{T→T}
= 0.4 × 0.983 × 0.4 × 0.983 × 0.1 × 0.007 × 0.3 × 0.979
= 0.00003
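The same product can be reproduced numerically. In this sketch the matrix rows are assumed to be the starting base and the columns the final base, in the order A, C, G, T, which is consistent with the terms used in the example; the function name is illustrative.

```python
import math

BASES = "ACGT"
PI = dict(zip(BASES, [0.1, 0.4, 0.2, 0.3]))          # base frequencies
P = {                                                # P[x][y]: probability of x -> y over branch b
    "A": dict(zip(BASES, [0.976, 0.01, 0.007, 0.007])),
    "C": dict(zip(BASES, [0.002, 0.983, 0.005, 0.01])),
    "G": dict(zip(BASES, [0.003, 0.01, 0.979, 0.007])),
    "T": dict(zip(BASES, [0.002, 0.013, 0.005, 0.979])),
}


def pair_likelihood(seq_from, seq_to):
    """Likelihood of two aligned sequences, sites treated as independent."""
    likelihood = 1.0
    for x, y in zip(seq_from, seq_to):
        likelihood *= PI[x] * P[x][y]
    return likelihood


L = pair_likelihood("CCAT", "CCGT")
print(L)            # about 0.00003
print(math.log(L))  # log-likelihood, a negative number
```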
83
• The likelihood L changes with branch length
[Figure: likelihood L (axis from 0 to 0.0002) plotted against branch length b (axis from 0 to 0.6)]
ML for a branch length of 0.330614
84
• Very small number: compute log(L)
• Additivity: log(AT) = log(A) + log(T)
• Negative number (0<L<1)
• Do the same thing for the whole tree
• for all topologies and branch lengths
• for all sequences of a fixed length, and sequences
at the nodes (ancestral sequences)
• while estimating the best parameters
• Very long...
85
• ...and: changes do not happen the same way at
the same places
• Constraints in structure
• Codon position
• Active site
• etc...
• And substitution rate varies with time for a fixed
position: heterotachy
86
• We can add a proportion of invariant sites
(estimation via ML is possible, another parameter)
• Compute α for variable sites and/or different models for different positions
• Codon position
• Alpha-helix
87
Basic models for DNA
α: transitions
β: transversions
Model                               Base frequencies      Substitution rates
Jukes-Cantor (JC)                   πA = πC = πG = πT     α = β
Felsenstein 81 (F81)                πA ≠ πC ≠ πG ≠ πT     α = β
Kimura 2 parameters (K2P)           πA = πC = πG = πT     α ≠ β
Hasegawa-Kishino-Yano 85 (HKY 85)   πA ≠ πC ≠ πG ≠ πT     α ≠ β
Kimura 3 parameters (K3P)           πA = πC = πG = πT     α ≠ β1 ≠ β2
Tamura-Nei (TrN)                    πA ≠ πC ≠ πG ≠ πT     α1 ≠ α2 ≠ β (two transition rates, one transversion rate)
Symmetric (SYM)                     πA = πC = πG = πT     6 different rates
General Time Reversible (GTR)       πA ≠ πC ≠ πG ≠ πT     6 different rates
88
Coding sequences
• Different constraints on different codon positions
• Partition sequence according to codon position
and assign different model/parameters. Various
options
• SRD06 (Shapiro et al. 2006)
• Link positions 1 and 2
• Position 3 can have different rate, Ti/Tv, Γ
• Use codon model
89
• Use information in genetic code: codon model
• Computationally intensive
• GY94 (Goldman & Yang 1994, Muse & Gaut 1994) (MrBayes, HyPhy, PAML)
• New parameter ω = ratio nonsynonymous/
synonymous substitutions
90
Proteins (amino acids)
• Model: probability of change of an AA to another
(PhyML, PhyloWin, Puzzle, Phylip)
• 20 AA: many more possibilities than nucleotides,
estimation is difficult
• Many empirical models (Dayhoff, JTT, WAG,
Blosum, ...), from sequence pairs or tree-based
comparisons on big datasets
• Some models based on codons (REV)
• Take into account AA characteristics
91
Model choice
• The more parameters a model has
• The better it fits the data
• The longer the computing time
• The more uncertain the estimation (= variance increases = degrees of freedom decrease)
92
• Need of a compromise
• Beyond a certain point, choosing a more complex model does not significantly increase the likelihood
• Solution: hLRT or AIC (Modeltest, MrModelTest,
ProtTest)
• hLRT (hierarchical likelihood ratio test): compares
models (must be nested)
• AIC (Akaike information criterion): estimates model
fit to data
• AIC = 2k - 2logL, where k is the number of free
parameters
• Choose model with lower AIC
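As an illustration of AIC = 2k - 2logL, the sketch below compares three models. The log-likelihood values are hypothetical, chosen only for the example; k counts the free parameters of the substitution model (0 for JC, 4 for HKY85, 8 for GTR), ignoring branch lengths, which are common to all three.

```python
def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 logL."""
    return 2 * k - 2 * log_likelihood


# Hypothetical (logL, k) values for one dataset, for illustration only
models = {
    "JC": (-2500.0, 0),
    "HKY85": (-2440.0, 4),
    "GTR": (-2438.5, 8),
}

scores = {name: aic(log_l, k) for name, (log_l, k) in models.items()}
best = min(scores, key=scores.get)

for name, value in scores.items():
    print(name, value)
print("Lowest AIC:", best)
```

Here GTR has the highest likelihood, but its four extra parameters are not compensated by the small gain in logL, so the intermediate model is retained.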
93
• Estimating parameters while estimating topology takes a very long time
• If the tree is roughly correct, parameter
estimation is stable
• Parameter estimation from a fixed tree (rapidly
constructed via e.g. MP, NJ)
• Use these parameters to estimate topology
94
Likelihood Ratio Test
• To test many hypotheses
• Comparison of two nested hypotheses: one (H0) is a
special case of the other (H1)
• Statistic Δ = logL1 - logL0
• If no difference, 2Δ follows a χ² distribution with degrees of freedom equal to the difference in parameters between the two hypotheses
• Comparison of models, topologies (KH and SH tests), lengths (molecular clock), ...
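A sketch of the test for two nested models, using only the standard library. The upper tail of the χ² distribution is computed with a standard recurrence for integer degrees of freedom; the log-likelihoods in the example are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) of a chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        q, k = math.exp(-half), 2                 # closed form for df = 2
    else:
        q, k = math.erfc(math.sqrt(half)), 1      # closed form for df = 1
    while k < df:                                 # recurrence from df = k to df = k + 2
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def likelihood_ratio_test(log_l0, log_l1, df):
    """Return (2 * delta, p-value) for H0 nested in H1."""
    delta = log_l1 - log_l0
    statistic = 2.0 * delta
    return statistic, chi2_sf(statistic, df)


# Hypothetical values: JC (H0) against HKY85 (H1), 4 additional parameters
statistic, p_value = likelihood_ratio_test(-2500.0, -2440.0, df=4)
print(statistic, p_value)
```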
95
ML - Advantages
• Considers saturation
• Reliable branch lengths
• Consistent: with a good model, converges toward the
right tree with increasing number of data
• With good model, not affected by LBA
• Uses all the data (no “informative sites”)
• Evolutionary process and ancestral sequences
• Quite robust
96
ML - Drawbacks
• Inconsistent with a wrong model
• Even the most complex model simplifies reality
• Still computationally intensive: needs heuristics, hence compromises
97
Bayesian inference
• Recent and now widely used method (MrBayes,
PhyloBayes, BayesPhylogenies)
• Same models as ML (MrModelTest)
• Gives posterior probability of parameters (among which topology and branch lengths), based on previous knowledge on data: prior probability (controversial)
98
• What is the probability of the model/theory/tree
given the data?
• Pr(T|D) = (Pr(T) Pr(D|T)) / Pr(D)
posterior = (prior × likelihood) / probability of the data
99
• Bayes formula combines prior probability and likelihood to yield posterior probability: if the prior is chosen as non informative (e.g. flat), then the posterior probability (pp) mainly depends on the likelihood
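A toy illustration of Bayes formula over three candidate trees. The likelihood values and tree names are hypothetical; with a flat prior the posterior is simply the likelihood rescaled to sum to 1, whereas an informative prior shifts it.

```python
def posterior(prior, likelihood):
    """Pr(T|D) = Pr(T) Pr(D|T) / Pr(D), with Pr(D) the sum over all trees."""
    joint = {tree: prior[tree] * likelihood[tree] for tree in prior}
    pr_data = sum(joint.values())
    return {tree: value / pr_data for tree, value in joint.items()}


# Hypothetical likelihoods Pr(D|T) for three topologies
likelihood = {"T1": 2e-5, "T2": 1e-5, "T3": 1e-5}

flat_prior = {"T1": 1 / 3, "T2": 1 / 3, "T3": 1 / 3}
informative_prior = {"T1": 0.1, "T2": 0.8, "T3": 0.1}

print(posterior(flat_prior, likelihood))
print(posterior(informative_prior, likelihood))
```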
100
• BI does not search for “the” best tree (and parameters), but explores treespace with a Markov chain Monte Carlo (MCMC) and samples trees once a plateau is reached (i.e. high probability trees): confidence intervals, assessment of clade support (pp)
• No validation step needed: a high number of trees is produced, and a consensus based on a sample of trees gives the probabilities of clade support (if the model is good): faster than ML
• Problem: running chains long enough. Several chains are used to better explore treespace (MCMCMC = Metropolis coupled MCMC) and to avoid getting stuck on local optima
101
102
Traditional approach
(ML, MP)
Bayesian inference
Tend to choose trees with
higher pp
Long!
MCMC
After a delay: sample
trees with high pp
103
• Bayesian analyses
estimate marginal
(Tree B) rather
than joint (Tree A)
probability
• ML selects Tree
A (highest peak)
• BI chooses Tree
B (most
voluminous
peak)
104
• Example: MrBayes output
Rough plot of parameter LnL
[Plot: LnL against generation (1 to 100000); LnL rises steeply from -72924.41 and then stays on a plateau near -47216.46]
• 100000 iterations (generations)
• Sample a tree/100 generations: 1000 trees
• Burnin: discard first 200 trees (keep only trees from
plateau) and make consensus
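The bookkeeping of sampling and burn-in can be sketched as follows. The sampled trees are hypothetical and each tree is reduced to the set of clades it contains; the posterior probability of a clade is then its frequency among the trees kept after burn-in.

```python
from collections import Counter


def clade_support(sampled_trees, burnin):
    """Frequency of each clade among the trees kept after discarding the burn-in."""
    kept = sampled_trees[burnin:]
    counts = Counter(clade for tree in kept for clade in tree)
    return {clade: n / len(kept) for clade, n in counts.items()}


n_generations, sample_every, burnin = 100000, 100, 200
n_sampled = n_generations // sample_every      # number of trees written by the chain

# Hypothetical sample: each tree is a frozenset of clades (frozensets of taxa)
AB, AC = frozenset("AB"), frozenset("AC")
tree_ab, tree_ac = frozenset([AB]), frozenset([AC])
sampled = [tree_ac] * 200 + [tree_ab] * 600 + [tree_ac] * 200

support = clade_support(sampled, burnin)
print(n_sampled, len(sampled) - burnin)        # trees sampled, trees kept
print(support[AB], support[AC])
```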
105
• Many applications: ancestral character states,
divergence time estimates... not only for
phylogenetic reconstruction
• Clades pp tend to be higher than bootstrap
values (e.g. from ML): precision overestimated?
• Maybe not, pp should not be interpreted as
bootstrap proportions
106
Distances
107
• Assessment of the mean number of changes between two taxa
• Based on distances, not individual characters
• Data sometimes only as distances (DNA/DNA hybridisation, serology, morphometry, ...)
• If not, transformation of the data into a distance matrix
• Mainly for molecular data
108
• The percentage of differences between sequences (p-distance, Hamming distance) generally underestimates the true distance because of saturation
• Especially true if sequences are distantly related
• Use a model to correct distances: parameters reflect
the way we think molecular evolution works (same
models as ML: JC, K2P, GTR, ...)
• These models can consider substitution rate
heterogeneity (Γ)
• LogDet distance allows for different base frequencies
in sequences
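As a sketch of distance correction, the p-distance can be converted to a Jukes-Cantor distance with d = -3/4 ln(1 - 4p/3). The function names are illustrative; the correction is undefined for p ≥ 0.75, where sequences are fully saturated under this model.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of aligned sites that differ between two sequences."""
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


def jc_distance(p):
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for p in (0.05, 0.1, 0.3, 0.6):
    print(p, round(jc_distance(p), 4))   # the correction grows with divergence

print(p_distance("AGCGAG", "GCGGAC"))
```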
109
• Coding DNA: synonymous (or silent) substitutions
(do not change AA) or not
• Evolutionary rate higher for synonymous
• Ka = non synonymous distance = non synonymous
substitutions/non synonymous sites
• Ks = synonymous distance = synonymous
substitutions/synonymous sites
• These distances consider Ti and Tv (K2P)
• Close sequences: only Ks informative
• Distant sequences: Ks is saturated, Ka informative
110
Algorithms
• Main: Neighbor-Joining (NJ)
• Additive trees
• Derived methods: BioNJ, weighbor...
• Sometimes (not anymore): UPGMA
• Ultrametric trees (molecular clock)
111
• NJ: starts with a star tree and forms pairs which minimise tree length (sum of branch lengths)
[Figure: star tree of taxa 1 to 8 and a later stage of the agglomeration]
• Tends to generate the shortest tree, but no optimisation during the agglomeration procedure (which is very fast)
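The first agglomeration step of NJ can be sketched with the usual pair criterion Q(i,j) = (n-2) d(i,j) - r_i - r_j, where r_i is the sum of distances from i to all other taxa; the pair with the lowest Q is joined. The four-taxon matrix below is an illustrative additive matrix and the function name is an assumption.

```python
from itertools import combinations


def nj_q_values(taxa, dist):
    """Q criterion of Neighbor-Joining for every pair of taxa."""
    n = len(taxa)

    def d(i, j):
        return dist[frozenset((i, j))]

    totals = {i: sum(d(i, j) for j in taxa if j != i) for i in taxa}
    return {
        (i, j): (n - 2) * d(i, j) - totals[i] - totals[j]
        for i, j in combinations(taxa, 2)
    }


taxa = ["A", "B", "C", "D"]
dist = {
    frozenset("AB"): 0.4, frozenset("AC"): 0.4, frozenset("AD"): 0.8,
    frozenset("BC"): 0.6, frozenset("BD"): 1.0, frozenset("CD"): 0.8,
}

q = nj_q_values(taxa, dist)
lowest = min(q.values())
candidates = [pair for pair, value in q.items() if abs(value - lowest) < 1e-9]
print(candidates)   # pairs that NJ may join first
```

With four taxa, joining (A, B) or (C, D) describes the same unrooted tree, which is why the two pairs tie.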
112
Parameters
• Model must fit to data, we must find the right
parameters
• Number of invariant sites
• Heterogeneous substitution rate along sequence alignment
• Substitution rates different for different types of
change
113
• Starting distances ≠ patristic distances (computed
from the tree)
• Else it would be easy because the tree is additive
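[Figure: additive tree ((A,B),(C,D)) with terminal branches A = 0.1, B = 0.3, C = 0.2, D = 0.6 and internal branch = 0.1]
      A     B     C     D
A     -    0.4   0.4   0.8
B    0.4    -    0.6   1.0
C    0.4   0.6    -    0.8
D    0.8   1.0   0.8    -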
114
• Different in real life
• Stochastic errors even with a perfect model
• Model never perfect (evolutionary model and
algorithm)
• Need a criterion to assess the fit of original data to
the tree (topology and branch lengths)
• Fitch-Margoliash: least-squares
• Minimum evolution (ME): minimize tree length
• The algorithm itself is not guaranteed to reach the optimum of the criterion, even if NJ is a good approximation: better to add an optimization step
115
Distances - Advantages
• Fast: the only method if the number of taxa is
very high
• Many models, can be tested via ML
• LogDet very useful when base composition
varies, but does not consider substitution rate
heterogeneity (remove invariant sites)
116
Distances - Drawbacks
• Information loss: impossible to reverse to sequences
from distances
• No scenarios on character evolution
• Generally performs less well than ML (simulations)
• Poor for old divergences
117
Validation
118
• Any data yield a tree, even without phylogenetic
signal
• No way to test if this is “the” right tree: no
interesting null hypothesis
• But we can assess the confidence we have in a tree
• Many methods based on randomisation
(destruction or alteration of phylogenetic signal)
• Most methods are independent of the tree
reconstruction method
119
Bootstrap (non parametric)
• Resampling technique
• Create new datasets (100, 1000,...) from the original: random selection of characters (columns) with replacement (without replacement: jackknife)
• Noise in the phylogenetic structure = estimation of sampling variance
• Build a tree from each dataset
• Compute majority-rule consensus of all trees
• Percentage of clade occurrence = support
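The resampling step alone can be sketched as below: columns of the alignment are drawn with replacement until a dataset of the original size is obtained. The short alignment and the function name are illustrative; tree building for each replicate is left to the chosen reconstruction method.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (non parametric bootstrap).

    alignment: dict mapping taxon name to an aligned sequence (equal lengths).
    rng: a random.Random instance, so that replicates are reproducible.
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        taxon: "".join(sequence[c] for c in columns)
        for taxon, sequence in alignment.items()
    }


# Toy alignment used only for illustration
alignment = {
    "sp1": "TACATAAACAAG",
    "sp2": "TACATAAACAAG",
    "sp4": "TACATAAATAAG",
    "sp5": "TACATAAACAAA",
}

rng = random.Random(42)
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
print(replicates[0])
```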
120
• Widely used
• Supposes character independence
• Supposes they are identically distributed
• Not a statistical test
• Often too conservative
• Requires many characters: usually not good for
morphology
121
Parametric bootstrapping
• Select a model from the data (ModelTest)
• Estimate topology
• Use model and topology to generate data via simulation (SeqGen)
• Analyse variation of simulated datasets: topology, confidence interval (dating, ...), topology comparison tests (SOWH, ...)
122
Permutation Tail Probability
• Statistical test. H0: no phylogenetic structure
• Measure a statistic on tree (e.g. length)
• Destroy original data structure via random permutations (randomisation)
• Generate a distribution of the statistic under H0
• PTP: proportion of datasets (original and permuted) with a statistic at least as good as the observed one (for tree length: ≤ observed length)
123
Randomisation
• Keep number of taxa, characters and character states
Original data
‘TAXA’   ‘CHARACTERS’
         1 2 3 4 5 6 7 8
R-P      R P R P R P R P
A-E      A E A E A E A E
N-R      N R N R N R N R
D-M      D M D M D M D M
O-U      O U O U O U O U
M-T      M T M T M T M T
L-E      L E L E L E L E
Y-D      Y D Y D Y D Y D
Randomly permuted data
‘TAXA’   ‘CHARACTERS’
         1 2 3 4 5 6 7 8
R-P      N U D E R T O U
A-E      R E A P L E A D
N-R      M R M M A D N P
D-M      L T R E Y M D R
O-U      D E Y U D E Y M
M-T      O M O T O U L T
L-E      Y D N D M P M E
Y-D      A P L R N R R E
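The randomisation shown above can be sketched by shuffling each character (column) independently among taxa, which keeps the number of taxa, the number of characters and the states present in each character. The function name is illustrative.

```python
import random


def permute_characters(matrix, rng):
    """Shuffle the states of each character among taxa, column by column.

    matrix: dict mapping taxon name to a list of character states.
    """
    taxa = list(matrix)
    n_characters = len(matrix[taxa[0]])
    permuted = {taxon: [None] * n_characters for taxon in taxa}
    for c in range(n_characters):
        column = [matrix[taxon][c] for taxon in taxa]
        rng.shuffle(column)                       # destroys covariation between characters
        for taxon, state in zip(taxa, column):
            permuted[taxon][c] = state
    return permuted


# The demonstration matrix: characters alternate between RANDOMLY and PERMUTED
matrix = {
    f"{r}-{p}": [r, p] * 4
    for r, p in zip("RANDOMLY", "PERMUTED")
}

shuffled = permute_characters(matrix, random.Random(1))
for taxon, states in shuffled.items():
    print(taxon, " ".join(states))
```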
124
[Figure: frequency distribution of a measure of data quality (e.g. tree length, ML, pairwise incompatibilities), from GOOD to BAD, with a 95% cutoff. On the GOOD side of the cutoff: PASS TEST (reject null hypothesis); on the BAD side: FAIL TEST]
125
• Phylogenetic signal
Tree length   Number of replicates     Tree length   Number of replicates
-------------------------------------------------------------------------
1222*         1                        1686          8
1669          1                        1687          7
1671          1                        1688          6
1672          1                        1689          8
1673          1                        1690          6
1674          1                        1691          3
1675          2                        1692          2
1676          2                        1693          3
1678          1                        1694          3
1679          2                        1695          3
1680          4                        1696          3
1681          5                        1697          2
1682          8                        1699          2
1683          4                        1702          1
1684          4                        1704          2
1685          2                        1705          1
126
• No signal
Tree length   Number of replicates     Tree length   Number of replicates
-------------------------------------------------------------------------
1924          3                        1940          6
1926          1                        1941          7
1927          4                        1942          4
1928          1                        1943          2
1929          2                        1944          1
1930          8                        1945          1
1931          6                        1946          1
1932          5                        1947          1
1933          4                        1950          3
1934          4                        1952          1
1935          5                        1953          1
1936          1                        1955          1
1937          8                        1958          1
1938*         11
1939          7
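The PTP can be computed from the two frequency tables above. This sketch assumes that the asterisk marks the tree length of the original data and that the replicate counts include the original dataset; the PTP is then the proportion of datasets whose tree is no longer than the observed one.

```python
def ptp(length_counts, observed):
    """Proportion of datasets with a tree length <= the observed length."""
    total = sum(length_counts.values())
    as_short = sum(n for length, n in length_counts.items() if length <= observed)
    return as_short / total


# Tree length -> number of replicates, copied from the two tables
signal = {
    1222: 1, 1669: 1, 1671: 1, 1672: 1, 1673: 1, 1674: 1, 1675: 2, 1676: 2,
    1678: 1, 1679: 2, 1680: 4, 1681: 5, 1682: 8, 1683: 4, 1684: 4, 1685: 2,
    1686: 8, 1687: 7, 1688: 6, 1689: 8, 1690: 6, 1691: 3, 1692: 2, 1693: 3,
    1694: 3, 1695: 3, 1696: 3, 1697: 2, 1699: 2, 1702: 1, 1704: 2, 1705: 1,
}
no_signal = {
    1924: 3, 1926: 1, 1927: 4, 1928: 1, 1929: 2, 1930: 8, 1931: 6, 1932: 5,
    1933: 4, 1934: 4, 1935: 5, 1936: 1, 1937: 8, 1938: 11, 1939: 7,
    1940: 6, 1941: 7, 1942: 4, 1943: 2, 1944: 1, 1945: 1, 1946: 1, 1947: 1,
    1950: 3, 1952: 1, 1953: 1, 1955: 1, 1958: 1,
}

print(ptp(signal, observed=1222))      # small PTP: H0 rejected
print(ptp(no_signal, observed=1938))   # large PTP: H0 not rejected
```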
127
• H0 easily rejected: PTP identifies only very poor data
• Does not identify where the structure is in the data
128
Bremer index
• BI = Decay index (TreeRot)
• Only for parsimony
• A strong clade should appear in trees slightly longer
than MPT
• BI = number of steps needed to “break” a clade
• For a tree = sum of BI for each clade
129
• The better a group is supported, the higher BI is
• BI > 0 only for clades occurring in MPT
• BI not standardised (≠ bootstrap): interpretation
may not be simple
• Generally in accordance with bootstrap
130
Data combination
131
• Several datasets (genes, morphology, ...): several
trees
• Important issue because of increasing use of
genomes (many genes!) in phylogenetics
• What should we do if they are not congruent?
• Compare trees or combine them via a consensus
• Combine data (total evidence) and build a new tree
• Conditional combination: before combining, test
data homogeneity and/or difference between trees
132
• Consensus
133
• Combination (total evidence)
134
Partition homogeneity test
• ILD test (Incongruence Length Difference)
• Principle
• For the same data, compare the summed tree length (or ML score) for
observed and random partitions
• If it is not significantly different, data are
homogeneous: combine
• If significant difference: keep separate trees or
discard taxa generating conflict
135
Same alignment, partitioned in two different ways:

sp1  TACATAAACAAGCCTAAAATGCGACACTACGTTCACTGTTACGCTCTCCACTGCCTAGACGAAGAAGCTTCA
sp2  TACATAAACAAGCCCAAAATGCGACACTACGTCCACTGTTATGCTCTCCACTGCCTAGACGAAGACGCTTCA
sp3  TACATAAACAAGCCCAAAATGCGACACTACGTCCACTGTTACGCTCTTCACTGCCTAGACGAGGATGCCTCG
sp4  TACATAAATAAGCCAAAAATGCGACACTACGTTCATTGTTACGCACTCCATTGCCTCGACGAAGAAGCTTCA
sp5  TACATAAACAAACCCAAAATGCGACACTACGTCCACTGTTATGCTCTCCACTGTCTAGACGAAGACGCTTCG
sp6  TACATAAACAAGCCCAAGATGCGTCACTACGTCCACTGCTACGCCCTCCACTGTCTCGACGAGGAGGCCTCG
sp7  TACATAAACAAACCAAAAATGCGACACTACGTCCATTGTTACGCCCTACACTGCCTAGACGAAGACGCTTCA
sp8  TACATAAACAAACCAAAAATGCGACACTACGTCCATTGTTACGCCCTACACTGCCTAGACGAAGACGCTTCA

First partitioning:  Partition 1: L = 12; Partition 2: L = 9;  sum L = 21
Second partitioning: Partition 1: L = 14; Partition 2: L = 11; sum L = 25
136
Sum of tree lengths   Number of replicates   Sum of tree lengths   Number of replicates
-------------------   --------------------   -------------------   --------------------
1661                  1                      1672                  10
1662                  2                      1673                  7
1663                  1                      1674                  4
1665*                 9                      1675                  4
1666                  8                      1676                  1
1667                  9                      1677                  4
1668                  5                      1678                  2
1669                  11                     1679                  1
1670                  10                     1680                  1
1671                  9                      1683                  1

* = sum of lengths for original partition
P value = 1 - (87/100) = 0.130000
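The two mechanical parts of the ILD test can be sketched as follows: drawing a random partition of the sites with the same sizes as the original one, and computing the P value from the table above. The tree-length computation itself is left to a parsimony program. The function names, the seed and the partition sizes (72 sites split 40/32) are assumptions for the example.

```python
import random


def random_partition(n_sites, size_first, rng):
    """Split site indices at random into two partitions of fixed sizes."""
    sites = list(range(n_sites))
    rng.shuffle(sites)
    return sorted(sites[:size_first]), sorted(sites[size_first:])


def ild_p_value(observed_sum, replicate_counts):
    """P = 1 - (replicates with a larger sum of lengths) / (all replicates)."""
    total = sum(replicate_counts.values())
    longer = sum(count for length_sum, count in replicate_counts.items()
                 if length_sum > observed_sum)
    return 1 - longer / total


# Sum of tree lengths -> number of replicates, transcribed from the table.
REPLICATES = {
    1661: 1, 1662: 2, 1663: 1, 1665: 9, 1666: 8, 1667: 9, 1668: 5, 1669: 11,
    1670: 10, 1671: 9, 1672: 10, 1673: 7, 1674: 4, 1675: 4, 1676: 1, 1677: 4,
    1678: 2, 1679: 1, 1680: 1, 1683: 1,
}

first, second = random_partition(72, 40, random.Random(0))  # hypothetical sizes
p_value = ild_p_value(1665, REPLICATES)
print(len(first), len(second), round(p_value, 2))
```

With the tabulated replicates, 87 of 100 sums are larger than the starred value, which reproduces the reported P value of 0.13.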
137
Assessing difference
between trees
• Templeton test
• One of the earliest approaches (Templeton 1983)
• Comparison of topologies with different lengths:
is this difference significantly different from 0?
• List characters with different lengths
• Do a Wilcoxon test (signed rank, non-
parametric)
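A minimal pure-Python sketch of the signed-rank step, assuming the number of steps of each character has already been counted on both topologies. The step counts are hypothetical, and the P value relies on a normal approximation without tie or continuity correction, so it is only indicative for so few characters.

```python
import math


def wilcoxon_signed_rank(differences):
    """Return (W+, W-, two-sided p) for paired differences.

    Zero differences are dropped and tied absolute values share their mean rank.
    """
    ordered = sorted((d for d in differences if d != 0), key=abs)
    n = len(ordered)
    if n == 0:
        return 0.0, 0.0, 1.0
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(ordered[end + 1]) == abs(ordered[start]):
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[k] = mean_rank
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, ordered) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, ordered) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    return w_plus, w_minus, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical number of steps per character on two competing topologies.
steps_tree_a = [1, 2, 1, 3, 1, 2, 2, 1]
steps_tree_b = [1, 3, 2, 3, 3, 1, 4, 2]
differences = [a - b for a, b in zip(steps_tree_a, steps_tree_b)]
w_plus, w_minus, p_value = wilcoxon_signed_rank(differences)
print(w_plus, w_minus, p_value)
```

Characters with the same length on both trees carry no information for the test, which is why only characters with different lengths are listed.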
138
• Symmetric difference (PAUP)
• Statistic: number of different partitions between
trees (topologies only)
• Assess the observed statistic against a null
distribution generated from random topologies
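The statistic itself can be sketched in a few lines. The representation of trees as nested tuples of taxon names and the function names are assumptions for this example. Trees are treated as unrooted, each internal branch being recorded as the side of the split that does not contain a reference taxon.

```python
def splits(tree):
    """Non-trivial bipartitions of a tree given as nested tuples of taxon names."""
    found = []

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.append(members)
        return members

    taxa = leaves(tree)
    reference = min(taxa)  # each split is stored as the side without this taxon
    result = set()
    for clade in found:
        side = taxa - clade if reference in clade else clade
        if 1 < len(side) < len(taxa) - 1:  # skip trivial splits
            result.add(side)
    return result


def symmetric_difference(tree_a, tree_b):
    """Number of bipartitions present in one tree but not in the other."""
    return len(splits(tree_a) ^ splits(tree_b))


tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
print(symmetric_difference(tree_1, tree_2))
```

Only topologies enter the calculation; branch lengths are ignored, as stated on the slide.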
139
• ML-based tests
• Kishino-Hasegawa test (1989) (PAUP)
• Statistic: difference in lnL (likelihood ratio)
or length (steps) between trees (around 0 if
not significant)
• Trees must be selected a priori (not best ML
tree against suboptimal tree)
• Null distribution from differences between
sites or generated from pseudoreplicates
(bootstrapping) because of non-normality
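A hedged sketch of the variant in which the null distribution comes from the differences between sites: the variance of the total difference is estimated from the per-site differences and compared with 0 using a normal approximation. The per-site log-likelihoods are hypothetical, and the bootstrap variant mentioned above is not shown.

```python
import math


def kh_test(site_lnl_a, site_lnl_b):
    """Return (delta, two-sided p) from per-site log-likelihoods of two trees."""
    diffs = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    n = len(diffs)
    delta = sum(diffs)  # lnL(A) - lnL(B)
    mean = delta / n
    # Variance of the summed difference, estimated from the site differences.
    variance = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    if variance == 0:
        return delta, (1.0 if delta == 0 else 0.0)
    z = delta / math.sqrt(variance)
    return delta, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical per-site log-likelihoods under two trees chosen a priori.
lnl_tree_a = [-2.0, -3.0, -1.5, -2.5]
lnl_tree_b = [-2.5, -2.0, -2.0, -3.5]
delta, p_value = kh_test(lnl_tree_a, lnl_tree_b)
print(delta, p_value)
```

In practice the site log-likelihoods would come from an ML program, and real alignments have far more than four sites.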
140
• Test observed difference against null distribution
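[Figure: distribution of step/likelihood differences at each site, centred on 0, with sites favouring tree A on one side and sites favouring tree B on the other; the mean of the distribution is compared with the expected value.]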
141
• Shimodaira-Hasegawa test (1999) (PAUP)
• In most cases, trees are selected a posteriori, from the
phylogenetic analysis: KH test not appropriate
• In such cases, the SH test corrects the bias in H0
rejection by the KH test, but follows the same principle
• Comparison of multiple topologies
• Approximately Unbiased test (Shimodaira, 2002)
(Consel)
• Like KH and SH tests, it is a winning sites test
• Less conservative than SH test, because of a
better way to generate pseudoreplicates
142
• Swofford Waddell Olsen Hillis (SWOH) test
• Uses parametric bootstrapping
• H0: topology A (hypothetical) is not different
from B (observed, e.g. ML tree from the data)
• Use a statistic assessing differences between A
and B: likelihood ratio, number of steps, ...
• Compute best model with A and simulate data
on this topology (SeqGen)
• From the simulated dataset, find the likelihood
for topology A and compute ML tree
• Compute Δ (if LRT) for each pair of trees
143
• Do this many times: distribution of the statistic
Δ to assess the significance of the observed value
• If observed Δ > 95 % of the simulated values of
Δ, reject H0
• More power than KH, SH, and AU tests, but
depends on model, which has to be correct
• Bayesian methods: computationally highly
demanding (still quite infeasible)
144
Supertrees
145
• Combine trees with partially overlapping taxa
• Bigger tree
• Several methods (at least 17)
• Indirect: matrix constructed from tree, and
analysis with an optimality criterion (e.g. MRP,
MRD, MRC, MRF)
• Direct: combination of topologies in a consensus-like way (e.g. MinCut, Modified MinCut)
146
147
Matrix Representation
with Parsimony (MRP)
• Most used technique
• Reconstruct a matrix from the trees (RadCon, Rainbow)
and analyse it with parsimony (PAUP): can be very
slow
• Clades (nodes) are coded as characters and can be weighted
(e.g. by bootstrap support in the source trees)
• Classical validation indices can be used
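The matrix coding can be sketched as follows, using the usual convention of 1 for taxa inside a clade, 0 for taxa in the same source tree but outside the clade, and ? for taxa absent from that tree. Trees as nested tuples and the function names are assumptions for this example; weighting and the parsimony analysis itself are not shown.

```python
def clades(tree):
    """Return (taxa, internal clades) for a tree given as nested tuples."""
    found = []

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.append(members)
        return members

    taxa = leaves(tree)
    # The root clade (all taxa) is uninformative and is left out.
    return taxa, [clade for clade in found if clade != taxa]


def mrp_matrix(trees):
    """One binary character per clade of each source tree."""
    all_taxa = sorted(set().union(*(clades(tree)[0] for tree in trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in trees:
        taxa, internal = clades(tree)
        for clade in internal:
            for taxon in all_taxa:
                if taxon not in taxa:
                    rows[taxon].append("?")  # taxon missing from this tree
                else:
                    rows[taxon].append("1" if taxon in clade else "0")
    return {taxon: "".join(states) for taxon, states in rows.items()}


# Two hypothetical source trees with partially overlapping taxa.
source_trees = [(("A", "B"), "C"), (("B", "C"), "D")]
matrix = mrp_matrix(source_trees)
for taxon, states in matrix.items():
    print(taxon, states)
```

The resulting matrix can then be analysed like any other character matrix, which is why the classical validation indices remain available.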
148
149
MinCut
• Direct analysis: no optimality criterion
• Fast
• No supertree validation
• Good with compatible source trees
150
• Uses of supertrees
• Combining trees from different data/studies
• Phylogenomics: genes are often unequally
present in the taxa under study
• Metagenomics: taxa partially and unequally
represented in sequences
➡Many gaps in the matrix:
• Supermatrix (as is)
• Design several complete sub-matrices,
compute subtrees, build supertree
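A small sketch of the "as is" supermatrix: gene alignments are concatenated and taxa lacking a gene receive missing data. The gene alignments, taxon names and function names are hypothetical, and each alignment is assumed to be a dictionary of equal-length sequences.

```python
def supermatrix(gene_alignments):
    """Concatenate alignments; taxa absent from a gene are filled with '?'."""
    taxa = sorted(set().union(*gene_alignments))
    rows = {taxon: "" for taxon in taxa}
    for alignment in gene_alignments:
        length = len(next(iter(alignment.values())))
        for taxon in taxa:
            rows[taxon] += alignment.get(taxon, "?" * length)
    return rows


def missing_fraction(matrix):
    """Proportion of cells of the matrix that are missing data."""
    cells = "".join(matrix.values())
    return cells.count("?") / len(cells)


# Hypothetical genes, unequally present in the taxa under study.
gene_1 = {"sp1": "ACGT", "sp2": "ACGA", "sp3": "ACTT"}
gene_2 = {"sp1": "GGC", "sp3": "GGT", "sp4": "GAT"}
combined = supermatrix([gene_1, gene_2])
for taxon, sequence in combined.items():
    print(taxon, sequence)
print("missing:", missing_fraction(combined))
```

With only two genes a quarter of this toy matrix is already missing, which illustrates why the alternative of complete sub-matrices plus a supertree is considered.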
151
[Figure: phylogenetic tree placing numbered environmental sequences among reference taxa (e.g. Ostreococcus, Bathycoccus, Mantoniella, Tetraselmis, Haplosporidium, Phaeocystis, Emiliania, Bolidomonas, Nannochloropsis, Nyctotheroides, Tetrahymena, Alexandrium); the remaining labels are not recoverable from the transcription.]
• e.g. Sargasso Sea environmental sequences
152
• Sometimes too many taxa/data to perform the analysis
• Design well-chosen sub-datasets, analyse them
individually, then combine: divide-and-conquer
• Supermatrix (in addition to the above-mentioned
problem): many missing data, which increases
computing time
153
• Possible conflicts between supertree and supermatrix
154
Phylogenomics
155
156
• Genomes: more accurate and precise phylogenies?
Not so simple...
• Very large dataset: computation difficult
• Genomes are plastic: duplications (total, partial),
fusions, chromosome fissions, LGT, ...
• No good model of genomic evolution
• Still difficult: be very careful to control biases
157
• Stochastic (random) error is reduced only by
increasing the number of characters
• The possibility of systematic error still remains, for
example caused by a wrong choice of method or
model
158
• 3 main biases
• Composition bias: sequences with the same
composition tend to cluster
• Check from sequences
• Long branch attraction
• Good taxon sampling
• Heterotachy: the substitution rate at a given
position changes through time
• Hard to detect and correct
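For the first bias, a quick check from the sequences can be sketched as follows: compute the base frequencies of each sequence and look for taxa whose composition departs from the others. The sequences and taxon names are hypothetical, and this descriptive check is not a formal test of compositional homogeneity.

```python
from collections import Counter


def base_frequencies(sequence):
    """Relative frequencies of A, C, G and T, ignoring gaps and ambiguities."""
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def gc_content(sequence):
    """Proportion of G + C among the unambiguous bases."""
    frequencies = base_frequencies(sequence)
    return frequencies["G"] + frequencies["C"]


# Hypothetical sequences: taxon2 is much richer in G + C than the others.
sequences = {
    "taxon1": "ATATATGC",
    "taxon2": "GCGCGCAT",
    "taxon3": "ATTTAAGC",
}
gc_by_taxon = {taxon: gc_content(seq) for taxon, seq in sequences.items()}
for taxon, gc in gc_by_taxon.items():
    print(taxon, gc)
```

Taxa with a similar, unusual composition that also cluster together in the tree deserve a closer look before the grouping is interpreted.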
159
Genomes
• More characters
• New character types: gene order, gene content,
nucleotide signature (DNA strings), rare genomic
changes
• 2 main approaches
• Classical: sequences (gene concatenation) and
phylogeny (supermatrix or supertree)
• Whole genome features: gene order, gene content,
DNA string
• + 1: rare genomic changes
160
Classical
methods
161
• Resolution of difficult phylogenetic problems
(e.g. Tree of Life, Eukaryotes, Bilateria)
• Evolution of gene groups: mutations, selective
pressure
• Identification of lateral gene transfer
162
• Example: Tree of Life (Nature, 2005)
- Purple: identified by genomics
- Yellow: confirmed by genomics
163
• Example: Tree of Life (Science, 2006)
164
• Example: Eukaryote phylogeny
165
• Example: Classical picture of deuterostome
evolution
166
• Genomic data (Nature, 2006)
- 146 genes
- Classical methods:
sequences
- Bias control
167
Summary
Data: DNA, AA, morphology, ...
Alignment: software + eye
Data quality: saturation, homogeneity, ...
Method (depends on data type, taxa number)
  Characters
    BI, ML: model?
    MP: weighting? (sites, changes)
  Distances: model?
    Optimality criterion: yes (ME, ...)
    Optimality criterion: no (NJ, ...)
Tree(s)
Validation: bootstrap, PTP, Bremer, ...
168
Software
• Plenty!!... often free!
• ...but almost all for molecular data, with various
methods (MEGA, SeaView, DAMBE, FastDNAml,
PhyML, MrBayes, Phylobayes, Tree-Puzzle, ...).
• For morphological data (and molecular): Phylip (free
but not simple), PAUP (the best, but not free) which
contains many methods and tests
• Numerous programs to read and edit trees (TreeView,
TreeEdit, NJ-Plot, FigTree, TreeDyn...)
• For consensus (RadCon, PAUP, Component, ...),
supertrees (RadCon, Rainbow, Clann, SuperTree, ...)
169