Here - Tandy Warnow
Transcription
Coalescent-based species tree estimation with many hundreds of taxa and thousands of genes
Siavash Mirarab (1, 2), Tandy Warnow (3)
(1) University of Texas at Austin, (2) University of California San Diego, (3) University of Illinois at Urbana-Champaign

Phylogenomics
[Figure: sequence alignments for gene 1, gene 2, gene 999, and gene 1000 sampled from Orangutan, Gorilla, Chimpanzee, and Human, and the tree relating the four species.]
• "Gene" here refers to a portion of the genome (not a functional gene)
• I'll use the term "gene" to refer to "c-genes": recombination-free orthologous stretches of the genome

Gene tree discordance
[Figure: the species tree on Gorilla, Human, Chimp, and Orangutan, next to two gene trees. The tree for gene 1 is drawn with leaves Gorilla, Human, Chimp, Orang.; the tree for gene 1000 is drawn with leaves Gorilla, Chimp, Human, Orang.]
Causes of gene tree discordance include:
• Incomplete Lineage Sorting (ILS)
• Duplication and loss
• Horizontal Gene Transfer (HGT)

Incomplete Lineage Sorting (ILS)
[Figure: tracing alleles through generations.]
• A random process related to having multiple versions of each gene in a population
• Omnipresent; most likely for short branches or large population sizes
• We have statistical models of ILS (multi-species coalescent)
• The species tree defines the probability distribution on gene trees, and is identifiable from the distribution on gene trees [Degnan and Salter, Int. J. Org. Evolution, 2005]

Traditional approach: concatenation
Step 2: Species tree reconstruction
• Approach 1: Concatenation. The alignments of gene 1, gene 2, through gene 1000 are joined into one supermatrix, and a single phylogeny is inferred from it.
• Approach 2: Summary methods. A gene tree is estimated for each gene, and a summary method combines the gene trees into a species tree.
[Figure: the two pipelines for Orangutan, Gorilla, Chimpanzee, and Human.]
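Approach 1 can be illustrated with a minimal Python sketch that joins per-gene alignments into a supermatrix. The function name `concatenate`, the dict-based input layout, the toy sequences, and the choice to pad a species that is missing from a gene with gap characters are all assumptions made for this illustration; they are not taken from the talk.

```python
def concatenate(gene_alignments, species):
    """Join per-gene alignments column-wise into one supermatrix.

    gene_alignments: list of dicts mapping species name -> aligned sequence.
    A species absent from a gene is padded with gaps (an assumption of this sketch).
    """
    supermatrix = {name: [] for name in species}
    for alignment in gene_alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one gene must have equal length")
        width = lengths.pop()
        for name in species:
            supermatrix[name].append(alignment.get(name, "-" * width))
    return {name: "".join(parts) for name, parts in supermatrix.items()}


# Toy data, made up for illustration only.
species = ["Orangutan", "Gorilla", "Chimpanzee", "Human"]
gene1 = {"Orangutan": "ACTGC", "Gorilla": "ACTGA", "Chimpanzee": "AATGC", "Human": "AATGC"}
gene2 = {"Orangutan": "CTGA", "Gorilla": "CTGA", "Human": "ATGA"}  # Chimpanzee missing

matrix = concatenate([gene1, gene2], species)
for name in species:
    print(name, matrix[name])
```

The resulting matrix would then be handed to a single phylogeny inference run, which is the step the summary-method pipeline replaces with one gene tree per gene.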
Traditional approach: concatenation
• Statistically inconsistent and can even be positively misleading (proved for unpartitioned maximum likelihood) [Roch and Steel, Theo. Pop. Gen., 2014]
• Mixed accuracy in simulations [Kubatko and Degnan, Systematic Biology, 2007] [Mirarab, et al., Systematic Biology, 2014]

Scalable ILS-based summary methods
[Figure: the two pipelines from data to species tree (supermatrix and phylogeny inference for concatenation; gene tree estimation and a summary method for Approach 2), with the labels "Data" and "Error".]
Scalable ILS-based summary methods
• Summary methods can be statistically consistent given true (error-free) gene trees
• STAR, STELLS, BUCKy-pop, MP-EST, NJst, ASTRAL [ECCB 2014], …
• Statistical Binning (SB, WSB) [Science, 2014] [PLoS ONE, 2015]
• ASTRAL-I and ASTRAL-II [Bioinformatics, 2014, 2015]
• Avian phylogenomics [Science, 2014]
• Plant phylogenomics (1KP) [PNAS, 2014]

Properties of quartet trees in presence of ILS
• For 4 species, the dominant quartet topology is the species tree [Allman, et al. 2010]
[Figure: the three possible quartet topologies on Orang., Chimp, Gorilla, and Human, observed in the gene trees with frequencies p1 = 30%, p2 = 30%, and p3 = 40%; the 40% topology is marked as dominant.]
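Finding the dominant quartet topology amounts to counting how often each of the three topologies appears among the gene trees. The Python sketch below shows one way to do that, assuming each unrooted gene tree is given as a list of bipartitions (one side of each internal branch, as a frozenset of taxon names). The function names, this representation, and the toy counts are assumptions of the sketch; the 30/30/40 split mirrors the frequencies on the slide, but which topology receives the 40% is chosen arbitrarily here.

```python
from collections import Counter


def quartet_topology(bipartitions, quartet):
    """Return the topology a tree induces on four taxa, or None if unresolved.

    The topology ab|cd is encoded as frozenset({frozenset({a, b}), frozenset({c, d})}).
    """
    four = frozenset(quartet)
    for side in bipartitions:
        inside = four & side
        if len(inside) == 2:  # the branch separates the four taxa two against two
            return frozenset([inside, four - inside])
    return None


def dominant_topology(gene_trees, quartet):
    """Count the topologies induced by the gene trees and return the most frequent one."""
    counts = Counter()
    for tree in gene_trees:
        topology = quartet_topology(tree, quartet)
        if topology is not None:
            counts[topology] += 1
    return counts.most_common(1)[0][0], counts


# Toy gene trees on four taxa; each has a single internal branch.
quartet = ["Human", "Chimp", "Gorilla", "Orang"]
gene_trees = (
    [[frozenset({"Human", "Chimp"})]] * 4      # 40%
    + [[frozenset({"Human", "Gorilla"})]] * 3  # 30%
    + [[frozenset({"Chimp", "Gorilla"})]] * 3  # 30%
)

best, counts = dominant_topology(gene_trees, quartet)
print(sorted(sorted(pair) for pair in best), counts[best])
```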
Properties of quartet trees in presence of ILS
• For 4 species, the dominant quartet topology is the species tree [Allman, et al. 2010]
• For >4 species, the dominant topology may be different from the species tree [Degnan and Rosenberg, 2006]
1. Break up each input gene tree into $\binom{n}{4}$ trees on 4 taxa (quartet trees)
2. Find all $\binom{n}{4}$ dominant quartet topologies
3. Combine dominant quartet trees
• Alternative: weight all $3\binom{n}{4}$ quartet topologies by their frequency and find the optimal tree
[Figure: the three quartet topologies on Orang., Chimp, Gorilla, and Human with frequencies p1 = 30%, p2 = 30%, and p3 = 40% (dominant).]

Maximum Quartet Support Species Tree [Mirarab, et al., ECCB, 2014]
• Optimization problem (suspected NP-hard): find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees

$$\mathrm{Score}(T) = \sum_{t \in \mathcal{T}} \left| Q(T) \cap Q(t) \right|$$

where $\mathcal{T}$ is the set of all input gene trees, $t$ is a gene tree, and $Q(T)$ is the set of quartet trees induced by $T$.
• Theorem: statistically consistent under the multispecies coalescent model when solved exactly

ASTRAL-I [Mirarab, et al., ECCB, 2014]
• ASTRAL solves the problem exactly using dynamic programming:
  • Exponential running time (feasible for <18 species)
• Introduced a constrained version of the problem
  • Draws the set of branches in the species tree from a given set X = {all bipartitions in all gene trees}
  • Motivation: given a large number of gene trees, each species tree branch appears in at least one gene tree
  • Theorem: the constrained version remains statistically consistent
  • Running time: $O(n^2 k |X|^2)$ for n species and k genes

ASTRAL-I on biological datasets
• 1KP: 103 plant species, 400-800 genes
• Yang, et al. 96 Caryophyllales species, 1122 genes
• Dentinger, et al. 39 mushroom species, 208 genes
• Giarla and Esselstyn. 19 Philippine shrew species, 1112 genes
• Laumer, et al. 40 flatworm species, 516 genes
• Grover, et al. 8 cotton species, 52 genes
• Hosner, Braun, and Kimball. 28 quail species, 11 genes
• Simmons and Gatesy. 47 angiosperm species, 310 genes
[Figure: first page of the 1KP paper, Wickett, Mirarab, et al., "Phylotranscriptomic analysis of the origin and early diversification of land plants," PNAS, 2014, www.pnas.org/cgi/doi/10.1073/pnas.1323926111.]
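The quartet support score Score(T) defined for the Maximum Quartet Support Species Tree problem can be sketched by brute force in Python. This is only an illustration of the objective, not ASTRAL's dynamic programming: it enumerates all quartets explicitly, assumes every gene tree contains every taxon, and represents each unrooted tree as a list of bipartitions (one side of each internal branch). The function names and the five-taxon toy trees are hypothetical.

```python
from itertools import combinations


def induced_quartets(bipartitions, taxa):
    """Return the set of resolved quartet topologies induced by an unrooted tree."""
    result = set()
    for quartet in combinations(sorted(taxa), 4):
        four = frozenset(quartet)
        for side in bipartitions:
            inside = four & side
            if len(inside) == 2:  # this branch splits the quartet two against two
                result.add(frozenset([inside, four - inside]))
                break
    return result


def quartet_score(species_tree, gene_trees, taxa):
    """Sum over gene trees t of |Q(T) intersect Q(t)| for the candidate species tree T."""
    q_species = induced_quartets(species_tree, taxa)
    return sum(len(q_species & induced_quartets(t, taxa)) for t in gene_trees)


# Toy example on five taxa (made up for illustration).
taxa = {"A", "B", "C", "D", "E"}
tree_ab = [frozenset({"A", "B"}), frozenset({"D", "E"})]  # ((A,B),C,(D,E))
tree_ac = [frozenset({"A", "C"}), frozenset({"D", "E"})]  # ((A,C),B,(D,E))
gene_trees = [tree_ab, tree_ab, tree_ac]

print(quartet_score(tree_ab, gene_trees, taxa))
print(quartet_score(tree_ac, gene_trees, taxa))
```

In this toy input the candidate that matches the majority of gene trees receives the higher score, which is the behavior the optimization problem rewards. The explicit enumeration costs on the order of $n^4$ quartets per tree, which is why it is only usable for very small examples.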
ASTRAL-I on biological datasets
Future datasets
• 1200 plants with ~400 genes (1KP consortium)
• 250 avian species with 2000 genes (with LSU, UF, and Smithsonian)
• 200 avian species with whole genomes (with Genome 10K, international)
• 250 suboscine species (birds) with ~2000 genes (with LSU and Tulane)
• 140 insects with 1400 genes (with U. Illinois at Urbana-Champaign)

Shortcomings of ASTRAL-I
• Even the constrained version was too slow for more than about 200 species and hundreds of genes
• The constraint set X did not include true species tree branches for some challenging datasets, resulting in low accuracy in some cases
• Input gene trees could not have polytomies

ASTRAL-II
1. Faster calculation of the score function inside DP
  • $O(nk)$ instead of $O(n^2 k)$ for n species and k genes
  • Post-order traversal of input trees instead of set operations
2. Add extra bipartitions to the set X using heuristic approaches
  • Resolving consensus trees by subsampling taxa
  • Using quartet-based distances to find likely branches
3. Ability to take as input gene trees with polytomies

Simulation study
• Variable parameters:
  • Number of species: 10-1000
  • Number of genes: 50-1000
  • Amount of ILS: low, medium, high
  • Deep versus recent speciation
• 11 model conditions (50 replicas each) with heterogeneous gene tree error
• Compare to NJst, MP-EST, concatenation (CA-ML)
• Evaluate accuracy using FN rate: the percentage of branches in the true tree that are missing from the estimated tree
[Figure: simulation pipeline on Finch, Falcon, Owl, Eagle, and Pigeon: true (model) species tree, true gene trees, sequence data, estimated gene trees, estimated species tree.]
[Paper text shown in the background of the slide, starting and ending mid-sentence:]
… look at all pairs of leaves chosen each from one of the children of u. For each such pair of leaves, there are $\binom{u'}{2}$ quartet trees that put that pair together, where $u'$ is the number of leaves outside the node u. This will examine each pair of nodes in each of the k input trees exactly once and would therefore require $O(n^2 k)$ computations. The final score can be normalized by the maximum number of input quartet trees that include a pair of taxa. Given the similarity matrix, we calculate an UPGMA tree and add all its bipartitions to the set X. This heuristic adds relatively few bipartitions, but the matrix is used in the next heuristic, which is our main addition mechanism.
Greedy: We estimate the greedy consensus of the gene trees at different threshold levels (0, 1/100, 2/100, 5/100, 1/10, 1/4, 1/3). For each polytomy in each greedy consensus tree, we resolve the polytomy in multiple ways and add bipartitions implied by those resolutions to the set X. First, we resolve the polytomy by applying the UPGMA algorithm to the similarity matrix, starting from the clades given by the polytomy. Then, we sample one taxon from each side of the polytomy randomly, and use the greedy consensus of the gene trees restricted to this subsample to find a resolution of the polytomy (we randomly resolve any multifurcations in this …

ASTRAL-I versus ASTRAL-II
200 species, deep ILS
[Figure S2: Comparison of various variants of ASTRAL with 200 taxa and varying tree shapes and number of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRAL-II + true st shows the case where the true species tree is added to the search space; this is included to approximate an ideal (e.g. exact) solution to the quartet problem. Panels: tree lengths 10M (low ILS), 2M (medium ILS), and 500K (high ILS), in rows labeled 1e-06 and 1e-07; x-axis: genes (50, 200, 1000); y-axes: species tree topological error (FN) and running time (hours); series: ASTRAL-I, ASTRAL-II, ASTRAL-II + true st.]
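The FN rate used as the error measure in these comparisons can be sketched in Python as follows. The sketch assumes each unrooted tree is given as a list of bipartitions (one side of each internal branch) over the same taxon set; the function names and the two toy trees on the bird taxa are made up for illustration, and the value is returned as a fraction rather than a percentage.

```python
def canonical(side, taxa):
    """Represent a bipartition by the side that excludes a fixed reference taxon.

    Either side of a branch describes the same bipartition, so both trees are
    normalized to the same side before they are compared.
    """
    side = frozenset(side)
    reference = min(taxa)
    return frozenset(taxa) - side if reference in side else side


def fn_rate(true_tree, estimated_tree, taxa):
    """Fraction of true-tree branches that are missing from the estimated tree."""
    true_splits = {canonical(s, taxa) for s in true_tree}
    estimated_splits = {canonical(s, taxa) for s in estimated_tree}
    return len(true_splits - estimated_splits) / len(true_splits)


# Toy trees (topologies invented for this example).
taxa = {"Finch", "Falcon", "Owl", "Eagle", "Pigeon"}
true_tree = [frozenset({"Finch", "Falcon"}), frozenset({"Eagle", "Pigeon"})]
estimated_tree = [frozenset({"Owl", "Eagle", "Pigeon"}), frozenset({"Owl", "Eagle"})]

print(fn_rate(true_tree, estimated_tree, taxa))
```

Here the estimated tree recovers the Finch + Falcon branch (written from its other side) but not the Eagle + Pigeon branch, so one of the two true branches is missing.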
Tree accuracy when varying the number of species
1000 genes, "medium" levels of recent ILS
[Figure: species tree topological error (FN), axis ticks 4% to 16%, against number of species (10, 50, 100, 200, 500, 1000) for ASTRAL-II, NJst, and MP-EST.]

Running time when varying the number of species
1000 genes, "medium" levels of recent ILS
[Figure: running time (hours), axis ticks 0 to 20, against number of species (10, 50, 100, 200, 500, 1000) for ASTRAL-II, NJst, and MP-EST.]

Tree accuracy when varying the level of ILS
200 species, recent ILS
[Figure: species tree topological error (FN), axis ticks 0% to 30%, against tree length (controls the amount of ILS: 10M = low, 2M = medium, 500K = high, with more ILS to the right) for ASTRAL-II, NJst, and CA-ML; panels for 1000 genes, 200 genes, and 50 genes.]

Impact of gene tree error (using true gene trees)
[Figure S6: Comparison of ASTRAL-II run on estimated and true gene trees with 200 taxa and varying tree shapes and number of genes. Panels: tree lengths 10M (low ILS), 2M (medium ILS), and 500K (high ILS), in rows labeled 1e-06 and 1e-07; x-axis: genes (50, 200, 1000); y-axis: species tree topological error (FN); series: ASTRAL-II, ASTRAL-II (true gt), CA-ML.]
Impact of gene tree error (using true gene trees)
[Figure S6: Comparison of ASTRAL-II run on estimated and true gene trees with 200 taxa and varying tree shapes and number of genes. Series: ASTRAL-II, ASTRAL-II (true gt), CA-ML.]
• When we divide our 50 replicates into low, medium, or high gene tree estimation error, ASTRAL tends to be better with low error
[Fig. 5: Comparison of species tree accuracy with 200 taxa, divided into three categories … Panels for 50 and 200 genes; x-axis: low, medium, high; y-axis: species tree error (FN).]

Insights on biological data
• Main question: the placement of Amborella at the base of angiosperms
• Xi et al. (2014) used a collection of 310 genes sampled from 46 species.
• Conflicting results:
  • Concatenation puts Amborella at the base (H1)
  • MP-EST puts Amborella + water lilies at the base (H2)
• Xi et al. conclude ILS is the cause
• ASTRAL, like many other recent studies (e.g., 1KP), recovers H1
• ILS is not necessarily the cause
[Fig. 6: Comparison of species trees computed on the angiosperm dataset of Xi et al. (2014). MP-EST and ASTRAL-II differ in the placement of Amborella; the concatenation tree agrees with ASTRAL-II. Panel A: ASTRAL-II; panel B: MP-EST; branches are labeled with support values. Taxa: Arabidopsis, Brassica, Carica, Theobroma, Gossypium, Citrus, Manihot, Ricinus, Populus, Malus, Fragaria, Cannabis, Cucumis, Medicago, Glycine, Quercus, Betula, Eucalyptus, Vitis, Striga, Mimulus, Sesamum, Ipomoea, Solanum, Coffea, Helianthus, Lactuca, Panax, Camellia, Silene, Aquilegia, Persea, Liriodendron, Aristolochia, Musa, Phoenix, Sorghum, Oryza, Phalaenopsis, Dioscorea, Nuphar, Amborella, Picea, Pinus, Zamia, Selaginella.]

Summary
• Genome-scale data provides a wealth of information for resolving long-standing phylogenetic questions
• ASTRAL-II improves on ASTRAL-I in terms of both accuracy and running time
• ASTRAL-II can handle datasets with 1000 genes from 1000 taxa in a day of single-CPU running time
• ASTRAL dominates other summary methods; however, concatenation is better when gene trees have high error
• In the future, we need to further explore the impact of model violations, recombination, missing data, and multiple sources of gene tree discordance (e.g., HGT)

Acknowledgments
• …, Tandy Warnow, Keshav Pingali, S.M. Bayzid, Théo Zimmermann
• Nam Nguyen (now at UIUC)
• Jim Leebens-Mack (UGA), Norman Wickett (U Chicago), Gane Wong (U of Alberta)
• Guojie Zhang (BGI, China), Tom Gilbert (U Copenhagen), Erich Jarvis (Duke, HHMI), Bastien Boussau (Université Lyon), …, Ed Braun (U Florida)
• HHMI international student fellowship