Here - Tandy Warnow

Coalescent-based species tree estimation with
many hundreds of taxa and thousands of genes
Siavash Mirarab (1, 2), Tandy Warnow (3)
(1) University of Texas at Austin, (2) University of California San Diego,
(3) University of Illinois at Urbana-Champaign
[Figure: phylogenomics. Sequence alignments for gene 1, gene 2, ..., gene 999, gene 1000 sampled from Gorilla, Human, Chimpanzee, and Orangutan are used to infer a species tree on Orangutan, Chimpanzee, Gorilla, and Human.]
“gene” here refers to a portion of the genome (not a functional gene)
I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome
Gene tree discordance
The species tree: Gorilla Human Chimp Orangutan
A gene tree (gene 1): Gorilla Human Chimp Orang.
A gene tree (gene 1000): Gorilla Chimp Human Orang.
Causes of gene tree discordance include:
• Incomplete Lineage Sorting (ILS)
• Duplication and loss
• Horizontal Gene Transfer (HGT)
Incomplete Lineage Sorting (ILS)
• A random process related to having multiple versions of each gene in a population
• Omnipresent; most likely for short branches or large population sizes
• We have statistical models of ILS (multi-species coalescent)
• The species tree defines the probability distribution on gene trees, and is identifiable from the distribution on gene trees [Degnan and Salter, Int. J. Org. Evolution, 2005]
[Figure: Tracing alleles through generations]
Step 2: Species tree reconstruction
Approach 1: Concatenation (the traditional approach)
[Figure: the alignments of gene 1, gene 2, ..., gene 1000 for Orangutan, Chimpanzee, Gorilla, and Human are joined into one supermatrix, followed by phylogeny inference.]
• Statistically inconsistent and can even be positively misleading (proved for unpartitioned maximum likelihood) [Roch and Steel, Theo. Pop. Gen., 2014]
• Mixed accuracy in simulations [Kubatko and Degnan, Systematic Biology, 2007] [Mirarab, et al., Systematic Biology, 2014]
Approach 2: Summary methods
[Figure: gene tree estimation for gene 1, gene 2, ..., gene 1000, followed by a summary method that combines the gene trees into a species tree.]
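The concatenation step can be illustrated with a small sketch. It joins per-gene alignments into one supermatrix and pads any taxon missing from a gene with gaps. The function name `concatenate`, the dictionary layout, and the labels `taxon_1` to `taxon_4` are assumptions made for this illustration; the slide does not say which sequence row belongs to which species. The sequences are the gene 1 and gene 2 rows shown on the slide.

```python
def concatenate(alignments):
    """Join per-gene alignments (dicts: taxon -> aligned sequence) into a supermatrix."""
    # Collect every taxon that appears in at least one gene.
    taxa = sorted(set().union(*[aln.keys() for aln in alignments]))
    rows = {taxon: [] for taxon in taxa}
    for aln in alignments:
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("all sequences within one gene must have equal length")
        width = widths.pop()
        for taxon in taxa:
            # A taxon absent from this gene contributes a run of gap characters.
            rows[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}


# Hypothetical taxon labels; sequences copied from the slide.
gene1 = {
    "taxon_1": "ACTGCACACCG",
    "taxon_2": "ACTGC-CCCCG",
    "taxon_3": "AATGC-CCCCG",
    "taxon_4": "-CTGCACACGG",
}
gene2 = {
    "taxon_1": "CTGAGCATCG",
    "taxon_2": "CTGAGC-TCG",
    "taxon_3": "ATGAGC-TC-",
    "taxon_4": "CTGA-CAC-G",
}
supermatrix = concatenate([gene1, gene2])
```

The last row of the result reproduces the concatenated row printed on the slide. A phylogeny inference tool would then be run on the supermatrix as a whole, which is what ignores gene tree discordance.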
Scalable ILS-based summary methods
Approach 2: Summary methods
[Figure: gene tree estimation for gene 1, gene 2, ..., gene 1000, followed by a summary method. Figure labels: Data, Error, Error-free gene trees.]
• Summary methods can be statistically consistent given true gene trees
• STAR, STELLS, BUCKy-pop, MP-EST, NJst, ASTRAL [ECCB 2014], …
Statistical Binning (SB, WSB) [Science, 2014] [PLoS ONE, 2015]
ASTRAL-I and ASTRAL-II [Bioinformatics, 2014, 2015]
Avian phylogenomics [Science, 2014]
Plant phylogenomics (1KP) [PNAS, 2014]
Properties of quartet trees in presence of ILS
[Figure: the three possible quartet topologies on Gorilla, Human, Chimp, and Orang., with frequencies p1 = 30%, p2 = 30%, p3 = 40% (dominant).]
• For 4 species, the dominant quartet topology is the species tree [Allman, et al. 2010]
• For >4 species, the dominant topology may be different from the species tree [Degnan and Rosenberg, 2006]
1. Break up each input gene tree into $\binom{n}{4}$ trees on 4 taxa (quartet trees)
2. Find all $\binom{n}{4}$ dominant quartet topologies
3. Combine dominant quartet trees
• Alternative: weight the $3\binom{n}{4}$ quartet topologies by their frequency and find the optimal tree
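The 30% / 30% / 40% frequencies on this slide match the standard multispecies coalescent result for an unrooted quartet: with an internal branch of length t in coalescent units, the topology matching the species tree has probability 1 - (2/3)e^(-t) and each alternative has probability (1/3)e^(-t). That formula is not printed on the slide, so the sketch below assumes it, and the function names are illustrative.

```python
import math


def quartet_probabilities(t):
    """Return (p_match, p_alt1, p_alt2) for an internal branch of length t (coalescent units)."""
    if t < 0:
        raise ValueError("branch length must be non-negative")
    p_alt = math.exp(-t) / 3.0
    return 1.0 - 2.0 * p_alt, p_alt, p_alt


def branch_length_for(p_match):
    """Invert the formula: the branch length that gives the dominant topology frequency p_match."""
    if not (1.0 / 3.0 < p_match < 1.0):
        raise ValueError("p_match must lie strictly between 1/3 and 1")
    return -math.log(1.5 * (1.0 - p_match))


# Branch length that would produce the slide's 40% dominant frequency.
t = branch_length_for(0.40)
probs = quartet_probabilities(t)
```

Under this assumption a dominant frequency of 40% corresponds to a very short internal branch (about 0.1 coalescent units), and the dominant topology always stays above 1/3, which is why it identifies the species tree for 4 species.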
Maximum Quartet Support Species Tree
[Mirarab, et al., ECCB, 2014]
• Optimization Problem (suspected NP-Hard): Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees

$$\mathrm{Score}(T) = \sum_{t \in \mathcal{T}} \left| Q(T) \cap Q(t) \right|$$

where $\mathcal{T}$ is the set of all input gene trees, $t$ is a gene tree, and $Q(T)$ is the set of quartet trees induced by $T$.
• Theorem: Statistically consistent under the multispecies coalescent model when solved exactly
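The score can be computed by brute force for small inputs, as in the sketch below. This is only an illustration of the definition, not the dynamic programming that ASTRAL uses. Trees are written as nested tuples of taxon names, a representation chosen here for convenience, and the example gene trees are toy topologies.

```python
from itertools import combinations


def clusters(tree):
    """Return (all leaves of tree, list of leaf sets below each internal node)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clusters(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def quartets(tree):
    """Set of resolved quartet topologies induced by tree, each stored as {{a, b}, {c, d}}."""
    leaves, found = clusters(tree)
    result = set()
    for four in combinations(sorted(leaves), 4):
        four = frozenset(four)
        for cluster in found:
            side = cluster & four
            if len(side) == 2:  # this branch splits the four taxa two against two
                result.add(frozenset([side, four - side]))
                break
    return result


def score(species_tree, gene_trees):
    """Number of induced quartet trees shared between species_tree and the gene trees."""
    q_species = quartets(species_tree)
    return sum(len(q_species & quartets(t)) for t in gene_trees)


species = ((("Human", "Chimp"), "Gorilla"), "Orang")
genes = [
    ((("Human", "Chimp"), "Gorilla"), "Orang"),
    ((("Gorilla", "Chimp"), "Human"), "Orang"),
    (("Human", "Chimp"), ("Gorilla", "Orang")),
]
total = score(species, genes)
```

In this toy input, two of the three gene trees induce the same quartet as the candidate species tree, so the score is 2. Quartets that a gene tree leaves unresolved (for example, around a polytomy) are simply not counted.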
ASTRAL-I
[Mirarab, et al., ECCB, 2014]
• ASTRAL solves the problem exactly using dynamic programming:
  • Exponential running time (feasible for <18 species)
• Introduced a constrained version of the problem
  • Draws the set of branches in the species tree from a given set X = {all bipartitions in all gene trees}
  • Motivation: given a large number of gene trees, each species tree branch appears in at least one gene tree
  • Theorem: the constrained version remains statistically consistent
  • Running time: $O(n^2 k |X|^2)$ for n species and k genes
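The constraint set X can be sketched as the union of the non-trivial bipartitions of all input gene trees. The nested-tuple tree representation and the function names below are assumptions made for this illustration and are not taken from the ASTRAL code.

```python
def leaf_sets(tree, out):
    """Return the leaves of tree; append the leaf set below each internal node to out."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset()
    for child in tree:
        leaves |= leaf_sets(child, out)
    out.append(leaves)
    return leaves


def bipartitions(tree):
    """Non-trivial bipartitions of an unrooted tree, each stored as {side, other side}."""
    below = []
    taxa = leaf_sets(tree, below)
    splits = set()
    for side in below:
        other = taxa - side
        # Skip trivial splits that separate fewer than two taxa from the rest.
        if len(side) >= 2 and len(other) >= 2:
            splits.add(frozenset([side, other]))
    return splits


def constraint_set(gene_trees):
    """X = all bipartitions found in any input gene tree."""
    x = set()
    for tree in gene_trees:
        x |= bipartitions(tree)
    return x


genes = [
    ((("Human", "Chimp"), "Gorilla"), "Orang"),
    ((("Gorilla", "Chimp"), "Human"), "Orang"),
    (("Human", "Chimp"), ("Gorilla", "Orang")),
]
X = constraint_set(genes)
species_split = frozenset([frozenset(["Human", "Chimp"]), frozenset(["Gorilla", "Orang"])])
```

Here X holds two bipartitions, and the one separating Human and Chimp from Gorilla and Orang. is among them, so a search restricted to X can still reach a tree containing that branch.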
ASTRAL-I on biological datasets
• 1KP: 103 plant species, 400-800 genes
• Yang, et al. 96 Caryophyllales species, 1122 genes
• Dentinger, et al. 39 mushroom species, 208 genes
• Giarla and Esselstyn. 19 Philippine shrew species, 1112 genes
• Laumer, et al. 40 flatworm species, 516 genes
• Grover, et al. 8 cotton species, 52 genes
• Hosner, Braun, and Kimball. 28 quail species, 11 genes
• Simmons and Gatesy. 47 angiosperm species, 310 genes
[Slide image: first page of Wickett, Mirarab, et al., “Phylotranscriptomic analysis of the origin and early diversification of land plants”, PNAS, www.pnas.org/cgi/doi/10.1073/pnas.1323926111]
Future datasets
• 1200 plants with ~400 genes (1KP consortium)
• 250 avian species with 2000 genes (with LSU, UF, and Smithsonian)
• 200 avian species with whole genomes (with Genome 10K, international)
• 250 suboscine species (birds) with ~2000 genes (with LSU and Tulane)
• 140 insects with 1400 genes (with U. Illinois at Urbana-Champaign)
Shortcomings of ASTRAL-I
• Even the constrained version was too slow for more than about 200 species and hundreds of genes
• The constraint set X did not include true species tree branches for some challenging datasets, resulting in low accuracy in some cases
• Input gene trees could not have polytomies
ASTRAL-II
1. Faster calculation of the score function inside DP
  • $O(nk)$ instead of $O(n^2 k)$ for n species and k genes
  • Post-order traversal of input trees instead of set operations
2. Add extra bipartitions to the set X using heuristic approaches
  • Resolving consensus trees by subsampling taxa
  • Using quartet-based distances to find likely branches
3. Ability to take as input gene trees with polytomies
Simulation study
• Variable parameters:
  • Number of species: 10 – 1000
  • Number of genes: 50 – 1000
  • Amount of ILS: low, medium, high
  • Deep versus recent speciation
• 11 model conditions (50 replicas each) with heterogeneous gene tree error
• Compare to NJst, MP-EST, concatenation (CA-ML)
• Evaluate accuracy using FN rate: the percentage of branches in the true tree that are missing from the estimated tree
[Figure: simulation pipeline. True (model) species tree, true gene trees, sequence data, estimated gene trees, estimated species tree; example taxa Finch, Falcon, Owl, Eagle, Pigeon.]
Paper excerpt overlaid on the slide (on the heuristics for extending the set X):
[...] look at all pairs of leaves chosen each from one of the children of u. For each such pair of leaves, there are $\binom{u'}{2}$ quartet trees that put that pair together, where $u'$ is the number of leaves outside the node u. This will examine each pair of nodes in each of the input k nodes exactly once and would therefore require $O(n^2 k)$ computations. The final score can be normalized by the maximum number of input quartet trees that include a pair of taxa.
Given the similarity matrix, we calculate an UPGMA tree and add all its bipartitions to the set X. This heuristic adds relatively few bipartitions, but the matrix is used in the next heuristic, which is our main addition mechanism.
Greedy: We estimate the greedy consensus of the gene trees at different threshold levels (0, 1/100, 2/100, 5/100, 1/10, 1/4, 1/3). For each polytomy in each greedy consensus tree, we resolve the polytomy in multiple ways and add bipartitions implied by those resolutions to the set X. First, we resolve the polytomy by applying the UPGMA algorithm to the similarity matrix, starting from the clades given by the polytomy. Then, we sample one taxon from each side of the polytomy randomly, and use the greedy consensus of the gene trees restricted to this subsample to find a resolution of the polytomy (we randomly resolve any multifurcations in this [...]
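The FN rate used to evaluate accuracy in the simulation study can be sketched as follows: it is the fraction of non-trivial branches (bipartitions) of the true tree that are absent from the estimated tree. The nested-tuple tree representation and the function names are assumptions for this illustration, and the two bird topologies are made-up toy inputs, not results from the study.

```python
def leaf_sets(tree, out):
    """Return the leaves of tree; append the leaf set below each internal node to out."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset()
    for child in tree:
        leaves |= leaf_sets(child, out)
    out.append(leaves)
    return leaves


def bipartitions(tree):
    """Non-trivial bipartitions of an unrooted tree, each stored as {side, other side}."""
    below = []
    taxa = leaf_sets(tree, below)
    return {
        frozenset([side, taxa - side])
        for side in below
        if len(side) >= 2 and len(taxa - side) >= 2
    }


def fn_rate(true_tree, estimated_tree):
    """Fraction of true-tree branches that are missing from the estimated tree."""
    true_splits = bipartitions(true_tree)
    if not true_splits:
        return 0.0
    missing = true_splits - bipartitions(estimated_tree)
    return len(missing) / len(true_splits)


# Toy topologies on the taxa named in the slide figure.
true_tree = ((("Finch", "Falcon"), "Owl"), ("Eagle", "Pigeon"))
estimated = ((("Finch", "Owl"), "Falcon"), ("Eagle", "Pigeon"))
rate = fn_rate(true_tree, estimated)
```

In this toy case the estimated tree recovers the Eagle and Pigeon branch but misses the Finch and Falcon branch, so the FN rate is 0.5, which the plots would report as 50%.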
ASTRAL-I versus ASTRAL-II
[Figure: species tree topological error (FN) and running time (hours) for ASTRAL-I, ASTRAL-II, and ASTRAL-II + true st, with 50, 200, and 1000 genes under low, medium, and high ILS (panels labeled 10M, 2M, 500K and 1e-06, 1e-07).]
200 species, deep ILS
Figure S2: Comparison of various variants of ASTRAL with 200 taxa and varying tree shapes and number of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRAL-II + true st shows the case where the true species tree is added to the search space; this is included to approximate an ideal (e.g. exact) solution to the quartet problem.
Tree accuracy when varying the number of species
[Figure: species tree topological error (FN), 4% to 16%, versus number of species (10, 50, 100, 200, 500, 1000) for ASTRAL-II, NJst, and MP-EST.]
1000 genes, “medium” levels of recent ILS
Running time when varying the number of species
[Figure: running time (hours), 0 to 20, versus number of species (10, 50, 100, 200, 500, 1000) for ASTRAL-II, NJst, and MP-EST.]
1000 genes, “medium” levels of recent ILS
Tree accuracy when varying the level of ILS
[Figure: species tree topological error (FN), 0% to 30%, for ASTRAL-II, NJst, and CA-ML; panels for 1000 genes, 200 genes, and 50 genes; x-axis: tree length (controls the amount of ILS), 10M (L), 2M (M), 500K (H), with more ILS toward 500K.]
200 species, recent ILS
Impact of gene tree error (using true gene trees)
[Figure: species tree topological error (FN), 0.0 to 0.4, for ASTRAL-II, ASTRAL-II (true gt), and CA-ML, with 50, 200, and 1000 genes under low, medium, and high ILS.]
Figure S6: Comparison of ASTRAL-II run on estimated and true gene trees with 200 taxa and varying tree shapes and number of genes.
• When we divide our 50 replicates into low, medium, or high gene tree estimation error, ASTRAL tends to be better with low error
[Figure: species tree error (FN), 0.0 to 0.4, with 50 and 200 genes, for low, medium, and high gene tree estimation error. Fig. 5. Comparison of species tree accuracy with 200 taxa, divided into three categories [...]]
Insights on biological data
• Main question: The placement of Amborella at the base of angiosperms
• Xi et al. (2014) used a collection of 310 genes sampled from 46 species.
• Conflicting results:
  • Concatenation puts Amborella at the base (H1)
  • MP-EST puts Amborella + water lilies at the base (H2)
• Xi et al. conclude ILS is the cause
• ASTRAL, like many other recent studies (e.g., 1KP), recovers H1
• ILS is not necessarily the cause
[Figure: (A) ASTRAL-II and (B) MP-EST species trees with branch support values. Taxa: Arabidopsis, Brassica, Carica, Theobroma, Gossypium, Citrus, Manihot, Ricinus, Populus, Malus, Fragaria, Cannabis, Cucumis, Medicago, Glycine, Quercus, Betula, Eucalyptus, Vitis, Striga, Mimulus, Sesamum, Ipomoea, Solanum, Coffea, Helianthus, Lactuca, Panax, Camellia, Silene, Aquilegia, Persea, Liriodendron, Aristolochia, Musa, Phoenix, Sorghum, Oryza, Phalaenopsis, Dioscorea, Nuphar, Amborella, Picea, Pinus, Zamia, Selaginella.]
Fig. 6. Comparison of species trees computed on the angiosperm dataset of Xi et al. (2014). MP-EST and ASTRAL-II differ in the placement of Amborella; the concatenation tree agrees with ASTRAL-II
Summary
• Genome-scale data provides a wealth of information for resolving long-standing phylogenetic questions
• ASTRAL-II improves on ASTRAL-I in terms of both accuracy and running time
• ASTRAL-II can handle datasets with 1000 genes from 1000 taxa in a day of single-CPU running time
• ASTRAL dominates other summary methods; however, concatenation is better when gene trees have high error
• In the future, we need to further explore the impact of model violations, recombination, missing data, and multiple sources of gene tree discordance (e.g., HGT)
Acknowledgments
…
Tandy Warnow
Keshav Pingali
S.M. Bayzid
Nam Nguyen (now at UIUC)
Jim Leebens-Mack (UGA)
Norman Wickett (U Chicago)
Gane Wong (U of Alberta)
Théo Zimmermann
Guojie Zhang (BGI, China)
Tom Gilbert (U Copenhagen)
Erich Jarvis (Duke, HHMI)
Bastien Boussau (Université Lyon)
…
Ed Braun (U Florida)
HHMI international student fellowship