Robustness of the Parsimonious Reconciliation Method in Cophylogeny
Laura Urbini, Blerina Sinaimeri, Catherine Matias, Marie-France Sagot
L. Urbini, AlCoB, Trujillo, Spain, June 21-22, 2016

Introduction: The cophylogeny problem

Introduction: Reconciliation model
Reconcile the trees through a mapping of S into H (the two trees play asymmetric roles).
Events that can be recovered: cospeciation, duplication, host switch, loss.

Definition
Input: H, S (rooted trees), the map φ : Leaves(S) → Leaves(H), and a cost vector c = ⟨c_c, c_d, c_s, c_l⟩.
Output: a reconciliation function ρ : V(S) → V(H) that extends φ (i.e. ∀v ∈ Leaves(S), ρ(v) = φ(v)).
ρ partitions the set V(S) into three sets:
- Σ: vertices associated with cospeciation.
- ∆: vertices associated with duplication.
- Γ: vertices associated with host switches.
The loss events are related to host vertices.
Parsimony method: assign a cost to each event and minimize the total cost.

Limitations of the reconciliation model (leaves with multiple associations)
The model makes a strong assumption on the input data: one symbiont leaf is mapped to at most one host leaf.
Datasets are obtained for each choice of the multiple associations.
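A minimal Python sketch of how such single-association datasets could be enumerated. The function name `single_host_datasets` and the toy leaf and host labels are hypothetical and do not come from the slides. The sketch assumes that each derived dataset keeps exactly one host per symbiont leaf, so the number of datasets is the product of the numbers of candidate hosts per leaf.

```python
from itertools import product

def single_host_datasets(associations):
    """Yield every map phi that keeps exactly one host per symbiont leaf.

    `associations` maps each symbiont leaf to the set of host leaves it is
    observed on (hypothetical input format, chosen for illustration).
    """
    leaves = sorted(associations)
    choices = [sorted(associations[leaf]) for leaf in leaves]
    for combo in product(*choices):
        # One derived dataset: each symbiont leaf mapped to a single host.
        yield dict(zip(leaves, combo))

# Toy example: s2 has two candidate hosts and s3 has three.
associations = {
    "s1": {"h1"},
    "s2": {"h1", "h2"},
    "s3": {"h2", "h3", "h4"},
}
datasets = list(single_host_datasets(associations))
print(len(datasets))  # 1 * 2 * 3 = 6
```

Each derived map can then be reconciled separately and the resulting optimal solutions compared.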
Do changes in the associations lead to similar reconciliations?

Limitations of the reconciliation model (rooting a phylogenetic tree)
The model makes a strong assumption on the input data: the trees are rooted.
Datasets are obtained for each choice of rooting.
Do changes of the root lead to similar reconciliations?

Plateau property [GET13]: all optimal rootings form a subtree, called the plateau, from which the rootings along every path toward a leaf have monotonically increasing cost.
Coevolution → "many" cospeciations → "low" total reconciliation cost.
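To make the notion of a plateau concrete, here is a small Python sketch that groups the minimum-cost rootings into connected components. It is an illustration only, not the method used by the authors. It assumes that each candidate rooting is an edge of the unrooted symbiont tree, that two rootings are adjacent when their edges share a vertex, and that the optimal reconciliation cost of every rooting has already been computed. The function name `plateaux` and the costs below are hypothetical.

```python
from collections import defaultdict

def plateaux(edge_cost):
    """Return the connected components formed by the minimum-cost rootings.

    `edge_cost` maps an edge (u, v) of the unrooted tree to the optimal
    reconciliation cost obtained when the tree is rooted on that edge.
    """
    best = min(edge_cost.values())
    optimal = [edge for edge, cost in edge_cost.items() if cost == best]

    # Index the optimal edges by the vertices they touch.
    touching = defaultdict(list)
    for edge in optimal:
        for vertex in edge:
            touching[vertex].append(edge)

    seen, components = set(), []
    for start in optimal:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            edge = stack.pop()
            component.append(edge)
            for vertex in edge:
                for neighbour in touching[vertex]:
                    if neighbour not in seen:
                        seen.add(neighbour)
                        stack.append(neighbour)
        components.append(sorted(component))
    return components

# Hypothetical costs on a 5-leaf unrooted tree (leaves a..e, internal x, y, z).
edge_cost = {
    ("a", "x"): 2, ("b", "x"): 4, ("x", "y"): 3, ("c", "y"): 4,
    ("y", "z"): 3, ("d", "z"): 2, ("e", "z"): 5,
}
components = plateaux(edge_cost)
original_root = ("x", "y")
root_in_plateau = any(original_root in comp for comp in components)
print(len(components))   # 2
print(root_in_plateau)   # False
```

With these toy costs the two optimal rootings are not adjacent, so there are two plateaux, and the hypothetical original root lies in neither of them.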

Limitations of the reconciliation model
The model makes strong assumptions on the input data, but in practice:
- One symbiont leaf can be associated with more than one host leaf.
- Finding the root of a phylogenetic tree is often problematic.
We explore the robustness of the parsimonious model to errors in the input, given H, S, φ and c:
- Change the associations for leaves with multiple associations.
- Change the root of the symbiont tree:
  - Try all possible rootings and test the plateau property.
  - Try the rootings at distance k ≤ max(5% |V(S)|, 3) from the original root.
Do input changes lead to similar reconciliations?

EUCALYPT [DBS+14]
Problem: generating all optimal reconciliations.
- The number of optimal reconciliations can be exponential in the size of the trees.
- A polynomial-delay algorithm: the time between two successive solutions is polynomial in the size of the input.
eucalypt.gforge.inria.fr

The Input
A dataset consists of a pair of trees H, S and a map φ.
We consider the following cost vectors c = ⟨c_c, c_d, c_s, c_l⟩ ∈ C, where
C = {⟨−1, 1, 1, 1⟩, ⟨0, 1, 1, 1⟩, ⟨0, 1, 2, 1⟩, ⟨0, 2, 3, 1⟩, ⟨1, 1, 1, 1⟩, ⟨1, 1, 3, 1⟩}.

The Output
A reconciliation is summarised as a pattern of integers π = ⟨n_c, n_d, n_s, n_l⟩.
Definition: for a given input H, S, map φ and cost vector c, the optimal solution is the multiset of patterns
Λ_{H,S,φ,c} = {π; π has optimal cost}.
Dissimilarity between two multisets of patterns:
$$d(\Lambda_1, \Lambda_2) = \frac{\left\| \sum_{\pi \in \Lambda_1} \pi - \sum_{\pi \in \Lambda_2} \pi \right\|}{(|\Lambda_1| + |\Lambda_2|) \cdot \max_{\pi \in \Lambda_1 \cup \Lambda_2} \|\pi\|} \qquad (1)$$
Example 1:
Λ1 = {[4, 2, 0, 1], [4, 2, 0, 1], [5, 1, 1, 0]}
Λ2 = {[4, 1, 0, 1]}
$$d(\Lambda_1, \Lambda_2) = \frac{\left\| ([4,2,0,1] + [4,2,0,1] + [5,1,1,0]) - [4,1,0,1] \right\|}{(3 + 1) \cdot \max(7, 7, 7, 6)} = \frac{\|[9,4,1,1]\|}{4 \cdot 7} = \frac{15}{28} \approx 0.536$$
These values are consistent with ‖·‖ being the sum of the absolute values of the entries (L1 norm), so that ‖[9, 4, 1, 1]‖ = 15.

Datasets
Biological datasets: 15 datasets, among them:
- EC: Encyrtidae (7 leaves) & Coccidae (10 leaves)
- PP: Primates (36 leaves) & Pinworms (40 leaves)
- RH: Rodents (34 leaves) & Hantaviruses (42 leaves)
Multiple associations: 3 of these datasets present multiple associations (namely MP, SBL, SFC).
Simulated datasets: for each biological dataset we created 50 simulated datasets. The simulated datasets are used only for testing the rooting of the trees.

Results: Perturbation of associations (leaves with multiple associations)
SBL dataset: 5 out of the 8 leaves of the symbiont tree have multiple associations → 560 datasets. Cost vector ⟨0, 1, 1, 1⟩.
- 70% of the datasets have optimum cost 7.
- 30% change the optimum cost value (from 7 to 6, 8 or 9).
- 65.5% have a dissimilarity different from 0.
- 8.5% reach the largest dissimilarity (0.6).
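The dissimilarity of Equation (1) behind these percentages can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes that ‖·‖ is the L1 norm, which is consistent with the values in Example 1, and that both multisets are non-empty. The function names are hypothetical.

```python
def l1_norm(vector):
    """Sum of absolute values of the entries (assumed norm)."""
    return sum(abs(x) for x in vector)

def pattern_sum(patterns):
    """Component-wise sum of a multiset of patterns <n_c, n_d, n_s, n_l>."""
    return [sum(column) for column in zip(*patterns)]

def dissimilarity(lambda1, lambda2):
    """Dissimilarity d(Lambda1, Lambda2) of Equation (1).

    Both arguments are non-empty lists of patterns; repeated patterns are
    kept, since the optimal solutions form a multiset.
    """
    difference = [a - b for a, b in zip(pattern_sum(lambda1), pattern_sum(lambda2))]
    largest = max(l1_norm(p) for p in lambda1 + lambda2)
    return l1_norm(difference) / ((len(lambda1) + len(lambda2)) * largest)

# Example 1 from the slides.
lambda1 = [[4, 2, 0, 1], [4, 2, 0, 1], [5, 1, 1, 0]]
lambda2 = [[4, 1, 0, 1]]
print(round(dissimilarity(lambda1, lambda2), 3))  # 0.536
```

Two identical multisets give a dissimilarity of 0, which matches the way the results above count datasets whose dissimilarity differs from 0.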

Results: Rerooting (testing the plateau property)
- 2 biological datasets and several simulated datasets have more than one plateau.
- The plateau property is not valid in our model (because of host switches).
- In 37% of the biological datasets and in 17% of the simulated datasets, the original root is not in the plateau.
- Hypothesis for the real datasets: the original root is not in the correct position.

Results: Rerooting (at distance k)
Distance k ≤ max(5% |V(S)|, 3) from the original root.
- Real datasets: the dissimilarity of the reconciliations globally increases as k increases.
- Simulated datasets: the dissimilarity of the reconciliations globally increases as k increases.

Conclusions (leaves with multiple associations)
Associating a symbiont with a unique host in case of multiple associations:
- This does not have a big impact on the reconciliation cost.
- The choice of leaf associations may have a strong impact on the variability of the reconciliation output.
Open problem: simulating the coevolution of symbionts and hosts while allowing multiple associations.

Conclusions (rooting a phylogenetic tree)
Rerooting:
- The number of plateaux depends on the presence of host switches.
- The original root may not be inside the plateau.
- In general, the variance of the dissimilarity of the reconciliations increases with the distance k.
Open problems:
- Is there a relation between the number of plateaux and the level of dissimilarity of the patterns?
- Is there a relation between the number of plateaux and the number of host switches in the optimal solutions?

Thank you

References
[DBS+14] Beatrice Donati, Christian Baudet, Blerina Sinaimeri, Pierluigi Crescenzi, and Marie-France Sagot. EUCALYPT: efficient tree reconciliation enumerator. Algo. Mol. Biol., 10(1):3, 2014.
[GET13] Pawel Górecki, Oliver Eulenstein, and Jerzy Tiuryn. Unrooted tree reconciliation: A unified approach. IEEE/ACM Trans. Comput. Biology Bioinf., 10(2):522–536, 2013.

Time feasibility of the solutions
- If no time information is available, finding an optimal time-feasible solution is NP-hard.
- Allowing time-infeasible host switches → polynomial time.
- Checking the time consistency of a solution → polynomial time.