Principal Components Analysis in Tree Space
Tom M. W. Nye, Newcastle University, tom.nye@ncl.ac.uk
Isaac Newton Institute, 21 June 2011

Summary
The problem
- Take a sample of trees on a fixed set of taxa:
  - gene trees
  - bootstrap or posterior sample
- How can you characterize the data set? Quantify variability?
- Standard tools of multivariate analysis: clustering, Multi-Dimensional Scaling (MDS), Principal Components Analysis (PCA)
- Trees aren't vectors, so how will PCA work?
A solution
- Regard PCA as a geometrical procedure
- Re-frame this procedure in tree space

Example
[Figure: example trees on a common set of taxa: btaurus, canis, human, pantro, mmulatta, mmusculus, rnorvegicus, monodelphis, gallus, xtropicalis, takifugu, tetraodon, danio, ciona, dmelanogaster, dpseudoobscura, anopheles, apis, celegans, cbriggsae, yeast]
Data thanks to Anne Kupczok and Arndt von Haeseler

Why PCA?
- Potentially have 1000s of trees on 100s of species. How do the data vary?
- Principal components analysis:
  - Which features vary the most?
  - How are features correlated?
  - Quantification: proportion of variance explained
- Analysing tree geometry (rather than topology) has advantages:
  - Geometry and topology are interdependent
  - Differences in geometry offer finer resolution of comparison

A Naive Approach
- A split is a bi-partition of the leaves induced by cutting a branch:
  [Figure: tree on leaves A, B, C, D, E with the branch inducing the split AB|CDE]
- Trees consist of sets of compatible splits, e.g. AB|CDE and AC|BDE cannot both be in a tree
- Embed tree-space in $\mathbb{R}^M$:
  - On $m$ leaves there are $M = 2^{m-1} - 1$ possible splits
  - Associate each split with a basis vector of $\mathbb{R}^M$ and represent a tree as a vector of branch lengths
  - Each tree contains at most $2m - 3 \ll M$ splits
- Principal components in $\mathbb{R}^M$ usually do not correspond to trees
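
The split bookkeeping behind the naive embedding can be made concrete with a small sketch. It is an illustration only, not code from the talk: the function names (`all_splits`, `compatible`, `embed`) and the convention of labelling each split by the side that excludes a fixed reference leaf are assumptions made here.

```python
from itertools import combinations

def all_splits(leaves):
    """Enumerate every bipartition of `leaves` into two non-empty parts.

    Each split is labelled by the side that does NOT contain a fixed
    reference leaf (a convention assumed for this sketch).
    """
    ref = min(leaves)
    others = sorted(set(leaves) - {ref})
    return [frozenset(side)
            for r in range(1, len(others) + 1)
            for side in combinations(others, r)]

def compatible(a, b, leaves):
    """Splits A|A' and B|B' can share a tree iff at least one of the four
    intersections of a side of the first with a side of the second is empty."""
    leaves = frozenset(leaves)
    a, b = frozenset(a), frozenset(b)
    return any(not (x & y) for x in (a, leaves - a) for y in (b, leaves - b))

def embed(tree, splits):
    """Represent a tree {split: branch length} as a vector in R^M."""
    return [tree.get(s, 0.0) for s in splits]

leaves = "ABCDE"
splits = all_splits(leaves)  # M = 2**(5 - 1) - 1 = 15 coordinates
# Hypothetical tree containing the splits AB|CDE and ABC|DE.
tree = {frozenset("CDE"): 1.0, frozenset("DE"): 0.5}
vector = embed(tree, splits)
```

For m = 5 the vector has 15 coordinates, of which any single tree can fill at most 2m − 3 = 7, so a generic direction in this space mixes incompatible splits such as AB|CDE and AC|BDE.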

Geometry of PCA
[Figure: scatter of data points, built up over four steps]
1. Find the centre of the data (the point which minimizes the sum of squared distances)
2. Consider a line through the centre
3. Project the points onto the line
4. Identify the line that maximizes the variance of the projected points (or minimizes the sum of squared distances from the points to the line)

Topological Decomposition of Tree-Space (1)
- Trees with the same fixed topology are represented as points in the positive orthant of $\mathbb{R}^{m-3}$ (ignoring the branches to the leaves), e.g. $m = 5$ taxa are represented by a quadrant in $\mathbb{R}^2$
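
For trees that share one topology on m = 5 taxa, each tree is just a pair of interior branch lengths, so the four geometric steps of PCA listed above can be tried out with ordinary plane geometry. The sketch below is an illustration with made-up points and hypothetical function names; the grid search over directions is a choice made here for transparency, not a method from the talk.

```python
import math

def centre(points):
    """Step 1: in Euclidean space the minimizer of the sum of squared
    distances is the coordinate-wise mean."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def projected_variance(points, c, theta):
    """Steps 2-3: project onto the line through c at angle theta and
    return the variance of the projected points."""
    ux, uy = math.cos(theta), math.sin(theta)
    scores = [(p[0] - c[0]) * ux + (p[1] - c[1]) * uy for p in points]
    return sum(s * s for s in scores) / len(points)

def first_component(points, steps=3600):
    """Step 4: pick the direction with the largest projected variance and
    report the proportion of total variance it explains."""
    c = centre(points)
    angles = [math.pi * k / steps for k in range(steps)]
    theta = max(angles, key=lambda t: projected_variance(points, c, t))
    total = sum((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                for p in points) / len(points)
    return c, theta, projected_variance(points, c, theta) / total

# Made-up branch-length pairs for trees with one fixed 5-taxon topology.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
c, theta, explained = first_component(points)
```

The same four steps are what has to be re-expressed once the data leave a single orthant, where there is no mean and no linear projection to fall back on.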
[Figure: for m = 5 taxa, the quadrant in $\mathbb{R}^2$ for one topology, with axes AB|CDE and ABC|DE and example trees on leaves A, B, C, D, E]

Topological Decomposition of Tree-Space (2)
- Orthants are stitched together along their faces
  [Figure: quadrants with axes AB|CDE, ABC|DE and ABE|CD, each shown with its tree]
- Nearest neighbour interchange:
  [Figure: trees containing the splits ABE|CD, ABD|CE and ABC|DE]

Geodesic Metric
- Billera, Holmes, Vogtmann (2001) proved existence of the geodesic metric
- Consider paths composed of straight line segments in each orthant
- Define the length of a path to be the sum of its segment lengths
- Then:
  - The geodesic between two points x, y is the shortest path between the points
  - Geodesic metric d(x, y) = length of the geodesic
  - Within each orthant, geodesic metric = Euclidean metric
- Calculation:
  - Anne Kupczok and Megan Owen presented algorithms (2007)
  - Owen and Provan (2009) developed an $O(m^3)$ algorithm

Example
[Figure: 5-taxon tree space with axes DE|ABC, AC|BDE, AB|CDE and BE|ACD, showing points x1, x2, y1, y2 and paths between them]
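
A small sketch may help fix ideas for the 5-taxon case. It covers only the two situations where the geodesic is a single straight segment (same orthant, or two orthants glued along a shared face, which unfold into a flat region) and gives the length of the path through the origin as an upper bound otherwise. This is not the Owen and Provan algorithm; the dictionary representation, the function names and the branch lengths are assumptions of this illustration.

```python
import math

def straight_segment_distance(x, y):
    """Geodesic distance for trees in the same orthant, or in two orthants
    that differ in exactly one split (sketch; not a general algorithm).

    x, y: dicts mapping split labels to positive interior branch lengths.
    """
    shared = set(x) & set(y)
    only_x, only_y = set(x) - shared, set(y) - shared
    if len(only_x) > 1 or len(only_y) > 1:
        raise ValueError("orthants are neither equal nor adjacent")
    sq = sum((x[s] - y[s]) ** 2 for s in shared)
    # Unfold the two orthants along their common face: the two differing
    # splits lie on opposite sides of zero on a single axis.
    gap = sum(x[s] for s in only_x) + sum(y[s] for s in only_y)
    return math.sqrt(sq + gap ** 2)

def cone_path_length(x, y):
    """Length of the path x -> origin -> y, an upper bound on d(x, y)."""
    def norm(t):
        return math.sqrt(sum(v * v for v in t.values()))
    return norm(x) + norm(y)

# Two hypothetical trees one nearest neighbour interchange apart.
x = {"AB|CDE": 3.0, "DE|ABC": 1.0}
y = {"AC|BDE": 1.0, "DE|ABC": 4.0}
d_xy = straight_segment_distance(x, y)   # sqrt(3**2 + (3 + 1)**2)
bound = cone_path_length(x, y)
```

When the orthants are further apart, the geodesic may pass through intermediate orthants or through the origin, which is where the dedicated algorithms cited on the slide are needed.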

ΦPCA: PCA with the geodesic metric
For the first principal component:
1. Find the centre $x_0$ of the data $x_1, \ldots, x_n$
2. Consider each line through $x_0$:
   - a line L is a geodesic between all points $x, y \in L$
   - lines extend to infinity in two directions
3. Project $x_1, \ldots, x_n$ onto the line by finding the points $y_i$ on the geodesic closest to $x_i$
4. Identify the line that maximizes the variance of $y_1, \ldots, y_n$ (or minimizes $\sum_i d(x_i, y_i)^2$)

Details
- Finding the centre $x_0$:
  - Cannot 'add' trees, so cannot compute a mean
  - Minimizing $\sum_i d(x_0, x_i)^2$ is computationally expensive
  - Use the majority consensus tree
- Projection:
  - Find the closest point on L to each point $x_i$
  - Numerical search made efficient by
    - Euclidean approximation
    - shifting end points slightly: the geodesic doesn't change much
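
The two steps on the 'Details' slide can be sketched as follows. Both functions are illustrations under stated assumptions rather than the talk's implementation: averaging branch lengths over the trees that contain a split is a choice made here (the slide only says to use the majority consensus tree), and the projection assumes a bounded search interval on which the distance to the path has a single minimum. The demo uses a Euclidean line as a stand-in for a geodesic.

```python
import math
from collections import Counter

def majority_consensus(trees):
    """Centre stand-in: keep the splits present in more than half the trees.

    Each tree is a dict {split label: branch length}. Averaging lengths
    over the trees containing the split is an assumption of this sketch.
    """
    counts = Counter(s for tree in trees for s in tree)
    keep = [s for s, c in counts.items() if c > len(trees) / 2]
    return {s: sum(t[s] for t in trees if s in t) / counts[s] for s in keep}

def project_onto_path(x, path, dist, lo, hi, tol=1e-9):
    """Ternary search for the parameter t in [lo, hi] minimizing
    dist(x, path(t)); assumes a single minimum on the interval."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if dist(x, path(m1)) < dist(x, path(m2)):
            hi = m2
        else:
            lo = m1
    t = (lo + hi) / 2
    return t, path(t)

def line(t):
    """Euclidean stand-in for a line through the centre, unit speed."""
    return (t / math.sqrt(2), t / math.sqrt(2))

t_star, closest = project_onto_path((2.0, 0.0), line, math.dist, -10.0, 10.0)
```

In tree space, `dist` would be replaced by a geodesic distance routine and `path` by a parametrization of the candidate line; the sum of squared residuals $\sum_i d(x_i, y_i)^2$ then scores that line.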

Searching for the Optimal Line
- No analytic solution: need to search for the optimal line
- Geometrical problem: direction vector in the initial quadrant
- Topological problem: which orthants?
- ΦPCA algorithm:
  - Add one split $p$ and its NNI replacement $p'$ at a time
    - $p, p'$ must satisfy a compatibility constraint with L
    - optimize the direction vector given splits $p, p'$
    - for each pair $p, p'$ and direction vector, perform the projection
  - Analogous to one coordinate at a time in regular PCA
- Search algorithms for $p, p'$:
  - greedy approach
  - simulated annealing: birth / death moves

Results
1. Application to metazoan (animal) data set
2. Long branch attraction simulation example

Metazoan data set (von Haeseler group, J. Comp. Biol. 2008): 118 gene trees from 20 metazoan species
[Figure: Metazoan Data Set (2008; 118 gene trees from 20 metazoan species). Sequences of tree diagrams, with some frames marked ∗; the leaf labels are the 20 metazoan species plus yeast.]

Long branch attraction (LBA) simulation
- When the true tree contains long branches next to short branches, long branches are erroneously placed together in estimated phylogenies
- Simulation:
  - Simulate 100 DNA alignments from 'true tree' (genes)
  - Estimate trees from alignments (gene trees)
  - Apply ΦPCA to the resultant collection of alternative trees

Long Branch Attraction Tree
Tree taken from Philippe, Syst. Biol. 2005, 'An Empirical assessment of LBA artefacts'
[Figure: tree with leaves apicomp_1 to apicomp_6, cilliates, diatoms_1, diatoms_2, algae, oomycetes, plants_1 to plants_6, rhodophyta, conosa_1, conosa_2, fungi_1 to fungi_8, metazoa_1 to metazoa_4, choanoflagellida, crenarch_1 to crenarch_3, euryarch_1 to euryarch_3, guillardia, microsporidia]

LBA Simulation
[Figure: sequence of tree diagrams on the same leaves, with the final frame marked ∗]

Summary
- How can you understand variability in large sets of trees?
- Attempt to re-interpret PCA as a geometrical procedure in tree-space
- Incorporates geometrical and topological information about trees via the geodesic metric
- Computational feasibility is achieved by greedy or stochastic search for the optimal line
- Method successfully identifies the most variable features and correlations between features
- Captures known variability in experimental data and reveals long branch attraction in simulated data

Further work:
- Higher components: approximation by 'planes'
- Better theory of distributions on tree space

Thanks to:
- Anne Kupczok and Arndt von Haeseler
- Two anonymous reviewers