Saccharomycetes Reveals Polyphyly of the Alternative Yeast Codon Usage
Genome Biology and Evolution Advance Access published July 22, 2014 doi:10.1093/gbe/evu152

Molecular Phylogeny of Sequenced Saccharomycetes Reveals Polyphyly of the Alternative Yeast Codon Usage

Stefanie Mühlhausen1, Martin Kollmar1§

1 Group Systems Biology of Motor Proteins, Department of NMR-based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Göttingen, Germany

§ Corresponding author, Tel.: +49-551-2012260, Fax: +49-551-2012202

Email addresses:
SM: stmu@nmr.mpibpc.mpg.de
MK: mako@nmr.mpibpc.mpg.de

© The Author(s) 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Downloaded from http://gbe.oxfordjournals.org/ by guest on October 6, 2014

Abstract

The universal genetic code defines the translation of nucleotide triplets, called codons, into amino acids. In many Saccharomycetes a unique alteration of this code affects the translation of the CUG codon, which is normally translated as leucine. Most of the species encoding CUG alternatively as serine belong to the Candida genus and were grouped into a so-called CTG clade. However, the “Candida genus” is not a monophyletic group and several Candida species are known to use the standard CUG translation. The codon identity could have been changed in a single branch, the ancestor of the Candida, or in several branches independently, leading to a polyphyletic alternative yeast codon usage (AYCU). In order to resolve the monophyly or polyphyly of the AYCU, we performed a phylogenomics analysis of 26 motor and cytoskeletal proteins from 60 sequenced yeast species. By investigating the CUG codon positions with respect to sequence conservation at the respective alignment positions we were able to unambiguously assign the standard code or AYCU. Quantitative analysis of the highly conserved leucine and serine alignment positions showed that 61.1% and 17% of the CUG codons coding for leucine and serine, respectively, are at highly conserved positions, while only 0.6% and 2.3% of the CUG codons, respectively, are at positions conserved in the respective other amino acid. Plotting the codon usage onto the phylogenetic tree revealed the polyphyly of the AYCU with Pachysolen tannophilus and the CTG clade branching independently within a time span of 30 to 100 Myr.

Keywords

Genetic code, codon reassignment, codon usage, evolution, Candida

Introduction

The standard genetic code has been altered in many organisms. In eukaryotes, natural alterations have been identified in mitochondria, in which the universal stop codon UGA is translated as tryptophan (Yokobori et al. 2001; Watanabe & Yokobori 2011), in ciliates, in which the function of the stop codons UAA and UAG was changed to code for glutamine (Tourancheau et al. 1995), and in some Candida yeasts (Ohama et al. 1993; Sugita & Nakase 1999; Santos et al. 2011), in which the translation of the leucine codon CUG was changed to serine.

In taxonomic terms, the definition of the genus Candida is not very specific. Candida species comprise a group of about 850 organisms (Robert et al. 2013), which can be distantly related as for example Candida glabrata, Ca. albicans, and Ca. caseinolytica. Although it has been claimed that almost all Candida species, with the exceptions of Ca. glabrata and Ca. krusei, belong to a single clade (Butler et al. 2009), which is commonly referred to as “Candida clade”, many analyses showed that Candida species are spread over the entire Saccharomycetes clade (Diezmann et al. 2004; Kurtzman & Suzuki 2010; Kurtzman et al. 2011; Kurtzman 2011). In addition, different names have been given to the same yeast species in the past depending on the reproductive stage in which they had been identified. For example, the names Meyerozyma guilliermondii (teleomorph), Ca. guilliermondii (anamorph), Yamadazyma guilliermondii and Pichia guilliermondii are used for the same holomorph (whole fungus, including anamorph and teleomorph).

Because Ca. albicans is the most prominent representative of the genus Candida and was the first shown to use the alternative CUG encoding, the term “Candida clade” is often used synonymously with “CTG clade” and most studies about CUG encoding concentrate on Ca. albicans and closely related Candida species. Therefore, depending on which node is used for defining the “Candida clade”, several species of the branch containing Ca. albicans might decode CUG as leucine and many species outside the “Candida clade” are also named Candida. Similarly, there are many non-Candida species decoding CUG as serine. In order to resolve this ambiguity, one of the major questions is whether the change of identity of the CUG codon happened to a single branch or to several branches independently. The latter scenario would be in agreement with the proposed timing of the code alteration (~ 270 Ma) and the split of the Saccharomyces and Candida clades about 170 Ma (Massey et al. 2003; Santos et al. 2011) implying that the CUG codon was highly ambiguous for approximately 100 Myr. Species that branched within this time span could have adopted either of the two decoding possibilities.

Most phylogenetic analyses of yeast species used 28S rRNA, 18S rRNA, actin and translation elongation factor-1α for reconstructing single or combined trees, which is a very successful approach when analyzing hundreds of species (Kurtzman 2011). However, higher accuracy for tree topologies is obtained in phylogenomic studies that use dozens to thousands of concatenated homologs in tree computations (Fitzpatrick et al. 2006). For the latter approach it is important to distinguish orthologs from paralogs. Only a few yeast species have been included in phylogenomic studies so far.

In addition to assigning the CUG codon usage by phylogenetic comparisons it has been suggested to base the translation of codon CUG on the presence of Q9 as the major ubiquinone (Sugita & Nakase 1999). Ubiquinone (Coenzyme Q, CoQ) is an electron carrier in the respiratory chain and involved in oxidative stress resistance (Szkopińska 2000; Turunen et al. 2004). Most enzymes of the ubiquinone biosynthesis pathway are conserved between Saccharomyces cerevisiae and Ca. albicans. However, the Ca. albicans ubiquinone has a nonaprenyl side chain (Q9) whereas the budding yeast has six isoprene units (Q6). Although a great variety of CoQ types are present in the genus Candida (Schauer & Hanschke 1999), the species analyzed at that time, which were known to decode CUG as serine, lacked galactose in the cells and possessed Q9, with the exception of Clavispora lusitaniae (Q8). Based on these ideas, an extensive study revealed 89 yeast strains to possess coenzyme Q9 (except for Cl. lusitaniae) and to lack galactose in the cell (Sugita & Nakase 1999). However, in vitro translation assays showed 11 of these species to use codon CUG as leucine. The incongruence in many instances of the presence of Q9, absence of galactose, and molecular phylogenetic data has also been reported by others (Kurtzman & Robnett 1997; Diezmann et al. 2004; Kurtzman & Suzuki 2010).

Sixty yeast species (not counting strains and data under embargo) have already been sequenced and assembled (http://www.diark.org, visited May 20, 2014). Only half of these genomes, which include both yeasts of the “CTG clade” and of the Saccharomyces clade, have been analyzed in detail so far. In addition, sequencing efforts increasingly focus on genomes of rarely studied species, whose phylogenetic relationships are not clear. For most of these genomes, gene predictions were even done and made available using both the Standard Code and the Alternative Yeast Nuclear Code (AYNC) resulting in mixed datasets. In order to determine whether there was a time span with a CUG codon usage ambiguity or whether the codon usage alteration could be attributed to a single event, we performed a phylogenomic analysis of highly validated protein sequences of 26 members of the actin, actin-related, tubulin, myosin, kinesin, dynein, and actin-capping protein families of the sequenced yeast species. CUG codon translation has been assigned by analyzing the respective amino acids of the sequences in the context of the sequence alignments.

Materials and Methods

Identification and Annotation of the Proteins

Some fungal actin-related proteins and myosins have been extracted from previously published datasets (Odronitz & Kollmar 2007; Hammesfahr & Kollmar 2012). The sequences were updated based on newer genome assemblies if necessary and corrected for CUG usage. The data for the other proteins and the species not included in the published datasets have essentially been obtained as described (Odronitz & Kollmar 2007). Briefly, the corresponding gene regions have been identified in TBLASTN searches starting with the respective protein sequences of homologs of Saccharomyces cerevisiae. The respective genomic regions were submitted to AUGUSTUS (Stanke & Morgenstern 2005) to obtain gene predictions. However, feature sets are only available for a few species of the Saccharomycetes clade.
Therefore, all hits were subsequently manually analyzed at the genomic DNA level. When necessary, gene predictions were corrected by comparison with the homologs already included in the multiple sequence alignments. In particular, the short exons at the 5’ ends of the tubulin and actin genes were not always identified in automatic gene predictions. EST data to help in the annotation are only available for a few species in public databases.

In recent years, genome-sequencing efforts have been extended from sequencing species from new branches to sequencing closely related organisms. Here, these species include for example Saccharomyces and Eremothecium species. Protein sequences from these closely related species have been obtained by using the cross-species functionality of WebScipio (Odronitz et al. 2008; Hatje et al. 2011). Nevertheless, for all these genomes TBLASTN searches have been performed. With this strategy, we sought to ensure that we would not miss more divergent protein family homologs, which might have been derived by species-specific inventions or duplications. However, the Saccharomycetes belong to the fast-evolving species and therefore these cross-species gene reconstructions only worked for a few species.

All sequence-related data (protein names, corresponding species, sequences, and gene structure reconstructions) and references to genome sequencing centers are available at CyMoBase (http://www.cymobase.org; Odronitz & Kollmar 2006). A list of the analyzed species, their abbreviations as used in the alignments and trees, as well as references and accession numbers is also available as supplementary file S1, Supplementary Material online. WebScipio (Hatje et al. 2011; Odronitz et al. 2008) was used to reconstruct the gene structure of each sequence. Throughout this study we use the teleomorph names of all species.

Generating the Multiple Sequence Alignment

The protein sequences were added to the already existing multiple sequence alignments (Odronitz & Kollmar 2007; Hammesfahr & Kollmar 2012). In detail, we first aligned every newly predicted sequence to its supposed closest relative using ClustalW (Thompson et al. 2002) and then added it to the multiple sequence alignment. During the subsequent sequence validation process, we manually adjusted the obtained alignment by removing wrongly predicted sequence regions and filling gaps. Still, in those sequences derived from low-coverage genomes many gaps remained. To maintain the integrity of exons preceded or followed by gaps, gaps reflecting missing parts of the genomes were added to the multiple sequence alignment. Sequence homologs of the same protein family (e.g., α-tubulin, β-tubulin, and γ-tubulin) were kept in one single protein family alignment. The actin/actin-related protein alignment therefore contains 11 sub-families, the CapZ alignment 2, the tubulin alignment 3, the myosin alignment 3, and the kinesin alignment 6. Only single dynein heavy chain homologs have been found in the analyzed species. The sequence alignments can be obtained from CyMoBase.

Computing Sequence Conservation

The residue conservation at alignment positions was calculated with the conservation code toolbox as implemented by Capra & Singh (2007). Conservation was estimated with the property-entropy method, an entropy measurement refined with respect to chemical properties of amino acids. Scores were calculated with conservation of adjacent amino acids incorporated (window size 3) and not (window size 0). Except for window size and scoring method, standard parameters were used. Unexpectedly, although window sizes were applied, the rest of the alignment still seemed to have an influence, albeit marginal, on the conservation scores.
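To make the difference between column-wise scores (window size 0) and window-based scores (window size 3) concrete, the following is a minimal sketch and not the toolbox used in this study. It swaps in plain Shannon entropy for the property-entropy method, and the function names, the gap handling and the mixing factor `weight` are assumptions made for illustration only.

```python
import math
from collections import Counter

GAP = "-"

def column_conservation(column):
    """Return 1 - normalized Shannon entropy of the residues in one column.

    Simplification: plain Shannon entropy, not the property-entropy method.
    Gaps are ignored; a column consisting only of gaps scores 0.0.
    """
    residues = [aa for aa in column if aa != GAP]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    max_entropy = math.log(20)  # 20 standard amino acids
    return 1.0 - entropy / max_entropy

def alignment_conservation(sequences, window=0, weight=0.5):
    """Score every column, optionally mixing in the neighbouring columns.

    window: number of columns considered on each side (0 = column-wise).
    weight: hypothetical mixing factor between a column and its neighbours.
    """
    columns = list(zip(*sequences))
    raw = [column_conservation(col) for col in columns]
    if window == 0:
        return raw
    smoothed = []
    for i, score in enumerate(raw):
        lo, hi = max(0, i - window), min(len(raw), i + window + 1)
        neighbours = raw[lo:i] + raw[i + 1:hi]
        if neighbours:
            mean = sum(neighbours) / len(neighbours)
            score = (1 - weight) * score + weight * mean
        smoothed.append(score)
    return smoothed
```

With a window, an invariant column flanked by variable columns scores lower than it does column-wise, which is why both settings are reported.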
In addition, amino acids given as “X” are replaced by hyphens “-” by the software, which denote gap positions in the alignment. Gap positions, however, indicate a biologically relevant deletion of this position in the sequence and thus have a strong influence on the conservation score. Therefore, conservation estimates were performed on small sections of the concatenated alignment including Schizosaccharomycetes sequences. For each serine and leucine position, the alignment was reduced to this position and 15 adjacent positions in each direction. Sequences with CUG codons at the respective leucine or serine position were removed from the alignment sections.

Computing and Visualising Phylogenetic Trees

For calculating phylogenetic trees, the alignments of the CapZ, actin and actin-related, tubulin, kinesin, and the myosin protein families were split into sub-family alignments and the respective alignments were concatenated (supplementary file S2, Supplementary Material online). As outgroup, sequences from Schizosaccharomycetes (Sc. pombe, Sc. octosporus, Sc. japonicus, and Sc. cryophilus) were taken. Gblocks v.0.91b (Talavera & Castresana 2007) was run with standard parameters and half gap positions allowed, to reduce the concatenated alignment from 28,202 amino acids to 9,788 amino acids in 230 blocks. The phylogenetic trees were generated using four different methods: Neighbor-Joining (NJ), maximum-likelihood (ML), Bayesian inference and split networks.

1) ClustalW v.2.0.10 (Thompson et al. 2002) was used to calculate unrooted trees with the NJ method. For each dataset, bootstrapping with 1,000 replicates was performed.

2) ML analysis with estimated proportion of invariable sites and bootstrapping (1,000 replicates) was performed using RAxML v.7.3.1 (Stamatakis et al. 2008). To this end, ProtTest v.3.2 was used first to determine the most appropriate of the available 112 amino acid substitution models (Darriba et al. 2011). Within ProtTest, the tree topology was calculated with the BioNJ algorithm and both the branch lengths and the model of protein evolution were optimized simultaneously. The Akaike Information Criterion with a modification to control for small sample size (AICc, with alignment length representing sample size) identified the WAG model with gamma model of rate heterogeneity and site-specific evolutionary rates to be the best.

3) The ML analysis was repeated on an additional dataset generated from the concatenated alignment. In this dataset, amino acids encoded by CUG codons were substituted by “X”. Gblocks was invoked as described above, with standard parameters and half gap positions allowed, to reduce the alignment to 8,903 positions in 297 blocks. The phylogenetic tree was generated with RAxML under the WAG+Γ+F model and 1,000 replicates were performed.

4) Posterior probabilities were generated using MrBayes v.3.2.1 (Ronquist & Huelsenbeck 2003). Using the mixed amino-acid option, two independent runs with 100,000 generations, 4 chains, and a random starting tree were performed. MrBayes used the WAG model for all protein alignments. The trees were calculated based on a reduced dataset generated by Gblocks using standard parameters allowing no gap positions. This dataset includes 5,038 positions in 131 blocks. Trees were sampled every 1,000th generation and the first 25% of the trees were discarded as “burn-in” before generating a consensus tree.

5) An unrooted phylogenetic split network was generated with SplitsTree v.4.13.1 (Huson & Bryant 2006). The Neighbor-Net method as implemented in SplitsTree was used to identify alternative splits.
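For reference, the small-sample correction used for model selection in step 2 has the standard form given here. The symbols are not taken from the text and are introduced only for this reminder: k is the number of free parameters of the model, \hat{L} the maximized likelihood, and n the sample size, which in this analysis is the alignment length.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}
```

The correction term vanishes as n grows, so for long concatenated alignments AICc and AIC rank the candidate models almost identically.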
Phylogenetic trees and networks were visualized with FigTree v.1.3.1 (Rambaut and Drummond) and SplitsTree v.4.13.1, respectively, and are available as supplementary fig. S1, Supplementary Material online.

Estimating the Divergence Times

The divergence times of species using AYCU were estimated with a penalized-likelihood approach as implemented in treePL (Smith & O’Meara 2012). The underlying tree topology was generated with RAxML under the WAG+Γ+F model on the alignment with CUG codons assigned. The splits between Sa. cerevisiae and Ca. albicans, and Sa. cerevisiae and Sc. pombe (Heckman et al. 2001; Douzery et al. 2004) were constrained both individually and combined.

Results

Phylogeny of Sequenced Yeast Species

In order to obtain a reliable phylogeny of the sequenced yeast species we chose a small-scale phylogenomics approach. The more data (i.e. orthologs) included in the alignment the better and more robust the phylogenetic trees should become. Another important parameter for the reliability of the computed trees is the quality of the underlying sequence data. Here, we manually determined 26 motor and cytoskeletal proteins for each of the 60 sequenced yeast species. These 26 proteins belong to six major protein families, the actin/actin-related proteins, the tubulins, the myosins, the kinesins, the dynein heavy chain proteins, and the actin capping proteins of the CapZ complex. We included all proteins of these six families in the analysis except some kinesins that are unique to certain species. Many of the actin and tubulin genes contain very short exons of one to three residues that are not included in datasets of automatic gene predictions. By manually inspecting the genomic DNA sequences we could completely reconstruct all genes as long as the genomic sequences did not contain gaps.
The concatenated sequences consist of on average 19,300 residues per species amounting to 28,202 alignment positions. Gblocks (Talavera & Castresana 2007) was used with less stringent and more stringent parameters to reduce the alignment to 9,788 and 5,038 positions, respectively. NJ and ML methods were used to construct trees from the extended alignment, and the Bayesian approach to compute a tree from the shorter alignment (supplementary fig. S1, Supplementary Material online). To evaluate the influence of misassigned CUG codons on the tree topology all CUG codons were translated with “X” and the tree reconstruction repeated. The topology of this unbiased tree is identical to that based on the alignment with assigned CUG codons. As outgroup we included four Schizosaccharomyces species. The constructed trees have almost the same topology with the major exception being the placing of Kluyveromyces in the NJ tree, in which they group with Lachancea and Eremothecium. To highlight alternative phylogenetic relationships between the species we also constructed a phylogenetic network using the Neighbor-Net method (fig. 1).

In agreement with other studies (Kurtzman & Robnett 2013b, 2013a), the trees show that Yarrowia lipolytica is the earliest diverging Saccharomycetes. The Saccharomycetaceae form a highly supported group together with the early separating Pfaffomycetaceae (supplementary fig. S1, Supplementary Material online). In our analysis, Pichia kudriavzevii and Ogataea parapolymorpha branch together with Dekkera bruxellensis, Kuraishia capsulata, and Komagataella pastoris. This branch is more closely related to the Ca. albicans species group than to the Saccharomycetaceae (supplementary fig. S1, Supplementary Material online), but the species encode CUG as leucine and not as serine. The species, for which the AYCU has been determined previously (e.g., Ca. albicans, Ca. dubliniensis and Ca. parapsilosis), are part of a highly supported group in all phylogenetic trees (fig. 1 and supplementary fig. S1, Supplementary Material online). Therefore, we refer to this group from here on as CTG clade. The CTG clade consists of several “Candida” species including Ca. albicans and Ca. tenuis, and also many others such as species from Metschnikowiaceae and Debaryomycetaceae. In all trees, Pachysolen groups outside the CTG clade as sister to the Pichiaceae/Ogataea (supplementary fig. S1, Supplementary Material online). In the phylogenetic network, Pachysolen is placed at the origin of the CTG clade but there are many connections supporting alternative topologies. Interestingly, species of the Candida genus are spread all over the tree grouping to the Saccharomycetes and to various subgroups of the CTG clade (fig. 1).

Conservation of Amino Acids at CUG Codon Positions in Class I Myosins

In order to determine whether CUG codons are coding for conserved amino acids (leucine or serine) we performed a thorough analysis of the alignments of all protein subfamilies. As representative example a part of the alignment of the class I myosins is presented in figure 2 (the full class I myosin alignment is available in supplementary fig. S2, Supplementary Material online). This analysis provided several surprising findings:

1) The class I myosin alignment contains 33 positions with 100% leucine conservation but only 11 positions that show 100% serine conservation.

2) Except for Wickerhamomyces anomalus and Candida tenuis (no CUG codons in their class I myosin genes), and Cl. lusitaniae, Ca. tropicalis, and Pachysolen (two to three CUG codons in their class I myosin genes but not at conserved positions), all Saccharomycetaceae, Pfaffomycetaceae, Pichiaceae, and Ogataea species, and D. bruxellensis, K. capsulata, K. pastoris, and Y. lipolytica, have at least one CUG codon at one of the 100% conserved leucine positions.

3) Except for the sequences of the Debaryomyces and Spathaspora species investigated here, whose CUG positions are in almost all cases identical, none of the other closely related species (e.g., Saccharomyces or Pichiaceae) have conserved CUG positions (fig. 2). This is especially apparent for the six analyzed Saccharomyces species that have duplicated class I myosins but do not have even a single conserved CUG position between the two paralogs. In addition, CUG positions are not conserved within any of the orthologs.

There is one position in the alignment (position 322), at which ten of the species of the CTG clade have a CUG-encoded serine whereas serines, alanines and glycines are found at that position in the other species (fig. 2).

Conservation of Amino Acids at CUG Codon Positions in General

The analysis of the complete data with respect to amino acid similarity at alignment positions, at which at least one leucine/serine is present, showed that positions with leucines are more strongly conserved than positions with serines although in total serines have been observed at as many alignment positions as leucines (fig. 3A and supplementary fig. S3, Supplementary Material online). We determined the conservation of each position alone (window size = 0) and within its closest environment (window size = 3). The analysis of CUG codons present at leucine positions showed that CUG codons are not evenly distributed to all positions with leucines but enriched at highly conserved alignment positions (fig. 3A). In contrast, there are far fewer CUG codons at serine positions, and these are in general present at less conserved positions compared with the leucine encoding CUG codons.
However, there is a small but considerable number of CUG codons at highly conserved and thus discriminative serine positions (fig. 3A). In order to determine the amino acid conservation, it is extremely important to resolve protein sub-family relationships. For example, the kinesin-1 sub-family members have a highly conserved serine at the same position where the kinesin-5 proteins have a highly conserved leucine (fig. 3B).

Assignment of the Codon Usage Scheme by Amino Acid Conservation Patterns

In order to determine the CUG codon translation we investigated whether the CUG codons within the analyzed sequences of each species are at highly conserved leucine or serine positions in the alignment. Therefore, we first determined the percentage of CUG codons at extremely conserved leucine and serine positions (conservation score of ≥90%; fig. 4). The analysis resulted in three groups of species: The first group, including Sa. cerevisiae, had 2.8-26.4% (on average 14.3%) of the CUG codons at highly conserved leucine positions. The second group, including Ca. albicans, had CUG codons at conserved serine positions although the total numbers were considerably lower than those for the leucine positions (0.9 to 7.5%; on average 3.4%). Finally, Pachysolen tannophilus did not have any CUG codons at either conserved position. A similar situation was found when analyzing the CUG codon distribution at a conservation score of more than or equal to 80% (fig. 4). In the same group of species as at the conservation score of more than or equal to 90%, 8.4-44.4% (on average 24.5%) of the CUG codons are at leucine positions. At that conservation level, one of the CUG codons of Yarrowia is found at a conserved serine position. At the same conservation score (≥80%), all other species have CUG codons at highly conserved serine positions (2.2-11.8%; on average 6.1%).
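The per-species percentages reported here can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used in the study: the data layout (`cug_columns`, `column_info`), the function names and in particular the simple majority rule in `suggest_decoding` are hypothetical. The text describes a comparison of percentages across species groups rather than a fixed decision rule.

```python
def cug_position_profile(cug_columns, column_info, threshold=0.9):
    """Return the fractions of CUG codons at conserved Leu and Ser columns.

    cug_columns: alignment columns at which one species has a CUG codon.
    column_info: maps column -> (consensus residue, conservation score in 0..1),
                 assumed to be computed without the CUG-bearing sequences.
    threshold:   minimum conservation score (0.9 corresponds to >=90%).
    """
    if not cug_columns:
        return {"L": 0.0, "S": 0.0}
    hits = {"L": 0, "S": 0}
    for col in cug_columns:
        residue, score = column_info.get(col, (None, 0.0))
        if residue in hits and score >= threshold:
            hits[residue] += 1
    return {aa: n / len(cug_columns) for aa, n in hits.items()}

def suggest_decoding(profile):
    """Hypothetical rule: pick the amino acid with the larger share."""
    if profile["L"] == profile["S"]:
        return "undetermined"
    return "standard (Leu)" if profile["L"] > profile["S"] else "AYCU (Ser)"
```

A species without CUG codons at any conserved leucine or serine column, as reported for Pachysolen tannophilus at the ≥90% level, comes out as "undetermined" in this sketch, so a lower threshold or additional evidence is needed.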
In five species, one to two of the CUG codons are at conserved leucine positions (Candida metapsilosis, Ca. tenuis, Cl. lusitaniae, Spathaspora arborariae and Millerozyma). Thus, the species clearly separate into two groups, one with many CUG codons almost exclusively at highly conserved leucine positions, and the other with CUG codons at serine positions (fig. 4). At a modest sequence conservation level (conservation score ≥50%), 5591 (61.1%) of the CUG codons of the first group are at leucine positions (the maximum is 70.2% of the CUG codons in Saccharomyces kudriavzevii). At this conservation level 54 CUG codons (0.6%) are at modestly conserved serine positions (conservation score ≥50%). At the same conservation level, 17% of the CUG codons of the second group (in total 332 codons) are at serine positions, whereas 44 CUG codons (2.4%) are at leucine positions with the same conservation level. The same pattern can be observed when using scores calculated column-wise instead of window-wise (window size 0; supplementary fig. S4, Supplementary Material online).

CUG Codon Positions Shared between Species

It was proposed that all CUG codons had been removed during the time of codon ambiguity and later reintroduced within the Saccharomycetaceae and Candida branches (Massey et al. 2003). If this model were true, species of the Saccharomycetaceae and Candida would be expected to have shared CUG codon positions within but not across these two branches. In order to evaluate CUG codon position conservation we determined the number of CUG codons shared between every two species (fig. 5). This CUG codon position analysis is independent of CUG codon assignment. For comparison, we analyzed the conservation of the CUG codon positions at all alignment positions (fig. 5, lower triangle) and at positions with a modest conservation level of more than or equal to 50% (fig. 5, upper triangle). Independent of whether sequence conservation at alignment positions is required, the yeast species group into two distinct classes. Species of the first group containing the Saccharomycetaceae, Pfaffomycetaceae, Pichiaceae, and Ogataea species, and Dekkera bruxellensis, Komagataella pastoris, Kuraishia capsulata, and Yarrowia lipolytica share considerably more CUG positions than those from the second group formed by the CTG-clade species. When all CUG positions are considered, there are a few positions shared between the two groups. However, if only CUG positions at a modest conservation level of more than or equal to 50% are evaluated, the two groups do not share any CUG positions. The few CUG positions shared at all alignment positions between the two species groups reflect CUG positions in disordered and non-conserved protein regions (see, e.g., the alignment of the class-1 myosin tail regions in supplementary fig. S2, Supplementary Material online). Interestingly, Pachysolen does not share any CUG positions with any other species (fig. 5).

Mapping the Assigned Codon Usage onto the Species Tree

In order to determine the mono- or polyphyly of the AYCU assignment, we mapped the assigned codon usage onto the reconstructed species tree (fig. 6). This mapping supports the polyphyly of the AYCU assignment because the CTG clade and Pachysolen branched at different points during the evolution of the Saccharomycetales. The CUG codon position conservation is also in accordance with the species tree and the codon usage assignment. The CTG clade species group together in the species tree and share CUG codon positions implying that at least some of them have been introduced in the last common ancestor of the CTG clade.
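The pairwise comparison behind fig. 5 amounts to intersecting, for every pair of species, the sets of alignment columns that carry a CUG codon. The following is a minimal sketch of that idea. The function name, the data layout and the optional `allowed_columns` filter (standing in for the restriction to positions with a conservation score of ≥50%) are assumptions for illustration.

```python
from itertools import combinations

def shared_cug_positions(cug_by_species, allowed_columns=None):
    """Count alignment columns with a CUG codon in both species of each pair.

    cug_by_species:  maps a species name to the set of alignment columns
                     at which that species has a CUG codon.
    allowed_columns: optional set of columns to restrict the comparison to,
                     e.g. columns above a conservation threshold.
    """
    shared = {}
    for a, b in combinations(sorted(cug_by_species), 2):
        common = cug_by_species[a] & cug_by_species[b]
        if allowed_columns is not None:
            common &= allowed_columns
        shared[(a, b)] = len(common)
    return shared

# Toy input with invented column numbers, only to show the call pattern.
toy = {
    "Sa_cerevisiae": {5, 12, 40},
    "Sa_kudriavzevii": {5, 12, 77},
    "Ca_albicans": {101, 322},
    "Pa_tannophilus": {250},
}
pair_counts = shared_cug_positions(toy)
```

Because only codon positions are compared, the result does not depend on whether CUG is translated as leucine or serine, which matches the independence from codon assignment noted in the text.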
Pachysolen diverged separately from the CTG clade and therefore does not share any CUG codon positions with species from the CTG clade.

Timeline for the Separation of the CTG Clade

In order to estimate the time span in which the CUG codon usage might have been ambiguous, we applied published branching time estimations to our species trees. However, the divergence time estimations for the splits between Archiascomycetes and other Ascomycota and within the Saccharomycetes differ by more than 700 Myr although extensive molecular and fossil record data have been used in respective studies (Heckman et al. 2001; Hedges et al. 2004; Douzery et al. 2004; Berbee & Taylor 2007; for more references see Hedges et al. 2006). Therefore, we chose some of the most extreme estimates for the calibration (table 1). The divergence time of the CTG clade was calibrated on the basis of the topology and branch lengths of the ML tree under penalized-likelihood constraints as implemented in treePL (Smith & O’Meara 2012). The splits of Sc. pombe and Sa. cerevisiae, and of Sa. cerevisiae and Ca. albicans were set to 1,144 and 841 Ma (Heckman et al. 2001), and 587 and 235 Ma (Douzery et al. 2004), respectively. Under these two sets of constraints, the separation of the CTG clade was estimated to 440 and 135 Ma, respectively (table 1), implying that the reassignment of the CUG codon in the CTG clade happened at most 400 and 100 Myr, respectively, after the split of the Saccharomycetaceae and the CTG clade (fig. 6). Similarly, a date of at least 190 Ma (640 Ma, respectively) was obtained for the emergence of Pachysolen indicating that ancient yeast species emerging in the time of 220 to 190 Ma (740 to 640 Ma, respectively) could have independently reassigned the CUG codon translation. This time span is the minimum time of CUG codon ambiguity.
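The time spans quoted in this paragraph can be checked with simple differences of the dates given in the text. The reading assumed here is that the upper bound for the reassignment is the calibrated Sa. cerevisiae / Ca. albicans split minus the estimated separation of the CTG clade, and that the minimum span is the width of the interval given for the emergence of Pachysolen. Grouping the numbers by calibration source is likewise an interpretation of the "respectively" constructions.

```latex
\begin{aligned}
\text{Heckman et al. (2001):}\quad & 841 - 440 = 401 \approx 400~\text{Myr}, \qquad 740 - 640 = 100~\text{Myr},\\
\text{Douzery et al. (2004):}\quad & 235 - 135 = 100~\text{Myr}, \qquad\quad\;\; 220 - 190 = 30~\text{Myr}.
\end{aligned}
```

Under this reading, the two calibrations bracket the minimum span of ambiguity between 30 and 100 Myr.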
The maximum time span depends on the last split before the first appearance of the AYCU (split of the Saccharomycetaceae and the branch containing the CTG clade, Pichiaceae, and Ogataea; fig. 6) and both the first split after the last assignment of the AYCU (split of the branch containing Pichiaceae and Ogataea from Pachysolen) and the divergence of the CTG clade (135 and 440 Ma, respectively). The maximum time span of codon ambiguity is thus estimated to be 100 Myr (400 Myr, respectively). Discussion The AYCU has already been assigned to ten of the sequenced yeast genomes, Ca. albicans (Jones et al. 2004; Butler et al. 2009), Ca. dubliniensis (Jackson et al. 2009), Ca. parapsilosis (Butler et al. 2009), Ca. tropicalis (Butler et al. 2009), Cl. lusitaniae - 18 - Downloaded from http://gbe.oxfordjournals.org/ by guest on October 6, 2014 CTG clade (fig. 6). Similarly, a date of at least 190 Ma (640 Ma, respectively) was (Butler et al. 2009), Debaryomyces hansenii (Dujon et al. 2004; Butler et al. 2009), Lodderomyces elongisporus (Butler et al. 2009), Mey. guilliermondii (Butler et al. 2009), Scheffersomyces stipites (Jeffries et al. 2007), and Spathaspora passalidarum (Wohlbach et al. 2011) and the standard codon usage was shown for several of the Saccharomyces species. Mass spectrometry and Edman sequencing of SMKT demonstrated that Millerozyma farinosa also uses the AYCU (Suzuki et al. 2002). An in vitro translation study showed 78 Candida species to use the AYCU including the above listed Candida strains (Sugita & Nakase 1999), and in addition Ca. recently (Wohlbach et al. 2011). Nevertheless, the gene prediction datasets for Ca. tanzawaensis and Ca. tenuis available at the respective sequencing centers include predictions both with AYCU and with standard codon usage. However, for highly resolved phylogenetic studies, protein ortholog assignments and functional studies, naming only a few applications, the accuracy of the protein sequences is important. 
The best and most accurate way to determine a species codon usage would be the experimental analysis of some proteins known to contain CUG codons by, for example, mass spectrometry. However, as long as proteins cannot be enriched this is extremely difficult to conduct, and the most prevalent endogenous proteins might not contain CUG codons. For example, our analysis showed that actin genes from only a few yeasts (11 out of 60 analyzed) contain CUG codons. Other approaches only provide indirect evidence. As discussed, the presence or absence of cellular components such as galactose or Q9 does not correlate with CUG codon usage. Here, we hypothesized, that the codon usage can unambiguously be assigned by analyzing three types of data: A highly resolved phylogenetic tree, sequence conservation at CUG coding positions, and conservation of CUG codon positions across species. To - 19 - Downloaded from http://gbe.oxfordjournals.org/ by guest on October 6, 2014 tanzawaensis and Ca. tenuis, for which the genome sequence had been assembled obtain significant numbers of sequences we were restricted to analyse sequenced genomes. Up to May 2014, genomes of 60 yeast species have been sequenced, assembled and released. We did not include genomes in the analysis, which were sequenced with low-coverage, and only included one strain per species (e.g., 73 different Sa. cerevisiae strains have been sequenced, of which only the reference strain was included in our analysis). To avoid bias by misassigning CUG codons we excluded all CUG positions from the alignment and the sequence conservation calculations. proteins, which are key-components of eukaryotic cells. We decided to only use proteins in the analysis that are at least present in all unikonts (the eukaryotic superkingdom including Fungi, Metazoa, Apusozoa and Amoebozoa) to ensure that we do not have extensive missing data. 
The topology of the species tree was almost identical independent of the reconstruction method used (NJ, ML and Bayesian; fig. 6). Although this particular set of species has never been analyzed before, the branching of most sub-branches and individual species is comparable to those found in previously published yeast trees (see references in Introduction section). The species known to use either the standard code (Saccharomycetaceae) or the AYCU (see above) grouped into two distinct clades. It seemed obvious to assume, but had to be shown, that all other species within these two groups, for which the codon usage has not yet been confirmed, use the same codon usage as the known species. In terms of a mono- or polyphyletic origin of the AYCU it would be necessary to resolve the codon usage of the species of the other major branches (Phaffomycetaceae clade and the - 20 - Downloaded from http://gbe.oxfordjournals.org/ by guest on October 6, 2014 For the reconstruction of the yeast tree we concatenated 26 cytoskeletal and motor branch containing the Pichiaceae and Ogataea) and the species that diverged before the combined Saccharomycetaceae/CTG clades such as Y. lipolytica. To determine the most probable CUG codon usage for each species we analyzed the amino acid conservation at the respective CUG codon positions in the protein sequence alignments. Amino acid positions highly conserved over hundreds of million years (divergence time of Schizosaccharomyces and Saccharomycetes) are expected to represent structurally and functionally important sites in proteins. A small residues in the other species, seems very unlikely. Similarly, large hydrophobic residues at positions conserved in small and polar amino acids should be strongly disfavoured. Our data show that the assignment of the standard codon usage is facilitated by the enrichment of CUGs at highly conserved leucine positions. 
In this respect only determining the class I myosin of a species with unknown CUG codon usage and comparing this sequence to the available data (fig. 2 and supplementary fig. S2, Supplementary Material online) would in most cases be enough to find out whether the respective species uses the standard or the alternative yeast CUG codon translation. If CUG codons can be mapped to conserved leucine positions, it will be highly probable that the respective species translates CUG as leucine. The AYCU will be assigned to a species if CUG codons are mapped to highly conserved serine positions or if CUG codons are not present at highly conserved leucine positions (proof by contradiction). In addition to the class I myosins, several proteins could be identified and compared until a conserved serine position was found. The latter approach would lead to an unambiguous assignment and was used in our analysis here. It has been proposed that CUG codons were not reintroduced randomly in CTG - 21 - Downloaded from http://gbe.oxfordjournals.org/ by guest on October 6, 2014 polar residue at a position, which is characterized by conserved large hydrophobic clade genomes but by avoiding structurally sensitive sites (Rocha et al. 2011). This is in agreement with the different biochemical properties of leucine and serine and the associated consequences for protein folding and solubility. Leucine positions are in general stronger conserved than serine positions, leucines are usually located in the core of protein domains, and mutations at leucine positions are therefore restricted. Compared with serine, the introduction of leucine encoding CUG codons is possible by synonymous mutations from CUN codons or by transition from thymine from the UUG codon. Transitions are more favourable and generated at higher frequencies than conservation of the CUG codons within the yeast species using the Standard Code. 
In contrast, mutating an existing serine codon to a CUG codon would require at least two mutations. Mutations from other codons than leucine or serine codons would either require the less likely transversions (e.g., GUG to CUG) or very unusual amino acid mutations (e.g., from proline to serine). Therefore, it is not surprising to find less CUG codons in CTG clade species, and those codons at less conserved positions. Overall, we could demonstrate, that especially the CUG codons coding for leucine, and also a substantial number of the CUG codons in species using the AYCU, are at conserved alignment positions (fig. 4). Although seven of the CUG codons coding for leucine and serine, respectively, are at positions supporting the respective opposite codon translation based on an amino acid conservation level of more than or equal to 80%, the majority of the CUG codon positions supports either the standard or the AYCU. This is in accordance with protein structure and in vivo studies showing that single point mutations and even partial reversion of the CUG identity are tolerated in Ca. albicans although large numbers of mutations cause profound cellular changes such as morphological variation and altered gene expression (Cutfield et al. 2000; - 22 - Downloaded from http://gbe.oxfordjournals.org/ by guest on October 6, 2014 transversions. Such silent substitutions seem to be very common given the low Gomes et al. 2007; Miranda et al. 2007; Rocha et al. 2011; Miranda et al. 2013; Bezerra et al. 2013). Because we excluded all CUG codon positions from the analysis, our assignments are independent from previous knowledge. Our data unambiguously supports the assignment of the standard codon usage to the Saccharomycetaceae, and the AYCU to the eight species, which were known to translate CUG as serine. 
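The mutational argument in the Discussion (one synonymous step from CUN, one transition from UUG, at least two steps from any serine codon, a transversion from GUG) can be checked by enumerating base differences to CUG. The following is a minimal sketch, not part of the original analysis; the helper names `substitutions` and `min_steps` are hypothetical, and the codon lists are the leucine and serine codons of the standard genetic code.

```python
# Illustrative sketch; helper names are hypothetical and not taken from the study.
PURINES = {"A", "G"}
TARGET = "CUG"

# Leucine and serine codons of the standard genetic code.
LEU_CODONS = ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"]
SER_CODONS = ["UCU", "UCC", "UCA", "UCG", "AGU", "AGC"]


def substitutions(codon, target=TARGET):
    """List (position, from_base, to_base, kind) for every base that differs."""
    changes = []
    for pos, (src, dst) in enumerate(zip(codon, target), start=1):
        if src != dst:
            # Purine<->purine or pyrimidine<->pyrimidine is a transition.
            same_class = (src in PURINES) == (dst in PURINES)
            kind = "transition" if same_class else "transversion"
            changes.append((pos, src, dst, kind))
    return changes


def min_steps(codons, target=TARGET):
    """Smallest number of point mutations from any codon in the set to the target."""
    return min(len(substitutions(c, target)) for c in codons if c != target)


for codon in ["CUA", "UUG", "UCG", "GUG"]:
    print(codon, "->", TARGET, substitutions(codon))
print("Leu codons, minimum steps:", min_steps(LEU_CODONS))
print("Ser codons, minimum steps:", min_steps(SER_CODONS))
```

Counting only point substitutions is a simplification: it ignores mutation rates and selection, so it illustrates the number and type of steps rather than their likelihood.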
From these data we conclude that the CUG codon assignment can unambiguously be resolved by analyzing a considerable number of sequences and that our CUG codon translation assignment is robust. Therefore, we assigned the AYCU to the CTG clade species and Pachysolen. Although our data strongly support an unambiguous assignment of the CUG codon usage, we acknowledge that the final judgement should come from experiments.

When plotting the assigned AYCU onto the species in the phylogenetic trees it became evident that the CUG codon usage must have been ambiguous for a long time in the ancestry of the Saccharomycetes, leading to polyphyly of the AYCU (fig. 6). This means that ancestral species that emerged during this time span could have adopted either the standard codon usage or the AYCU. In the case of a long time span of codon ambiguity, the original CUG codons in the ancestor of the Saccharomycetes were most likely erased and later reintroduced in the emerging branches either as leucine codons or as serine codons. In the less likely and mutually exclusive scenario, the CUG ambiguity appeared only once in the ancestor of the CTG clade. The combination of the reconstructed species trees with our CUG codon assignment strongly supports the first scenario of a polyphyletic origin of the AYCU.

Based on the phylogeny of the analyzed species Pachysolen does not group to the CTG clade. In addition, Pachysolen does not have any CUG codons at highly conserved leucine positions, but at conserved serine positions. The phylogenetic grouping of Pachysolen has not been analyzed in detail so far, and not at all in the context of a taxonomic sampling as used in our study. The evolutionary grouping and CUG codon usage were also not analyzed in the study of the Pachysolen draft genome sequence (Liu et al. 2012).
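The assignment logic used in the Discussion (CUG codons at highly conserved leucine positions point to the standard code, CUG codons at highly conserved serine positions point to the AYCU) can be sketched as a simple vote over alignment columns. This is a deliberately simplified illustration under stated assumptions: the function names are hypothetical, conservation is measured as the plain fraction of the most frequent residue rather than the Property-Entropy divergence used in the study, and the 90% threshold is only an example value.

```python
from collections import Counter

# Illustrative sketch; names and the simple conservation measure are assumptions.


def conserved_residue(column, threshold=0.9):
    """Return the residue that fills at least `threshold` of a column, else None.

    Gaps ('-') and masked CUG-encoded residues ('X') are ignored.
    """
    residues = [r for r in column if r not in "-X"]
    if not residues:
        return None
    residue, count = Counter(residues).most_common(1)[0]
    return residue if count / len(residues) >= threshold else None


def assign_cug_translation(columns_with_cug, threshold=0.9):
    """Vote on the CUG translation of one species.

    `columns_with_cug` holds, for each CUG codon of the query species, the
    alignment column formed by the residues of the other species.
    """
    votes = Counter(conserved_residue(col, threshold) for col in columns_with_cug)
    leu, ser = votes.get("L", 0), votes.get("S", 0)
    if leu > ser:
        return "standard (CUG = Leu)"
    if ser > leu:
        return "AYCU (CUG = Ser)"
    return "undetermined"


print(assign_cug_translation(["LLLLLLLLLL", "LLLLLLLL", "SSTSA"]))
print(assign_cug_translation(["SSSSSSSS", "SSSSSSSX", "LIVMLIVM"]))
```

The sketch returns "undetermined" when neither residue wins the vote, whereas the text additionally treats the absence of CUG codons at highly conserved leucine positions as support for the AYCU; that second criterion is left out here to keep the example short.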
A previous study based on 28S rDNA and 18S rDNA only resolved the Saccharomycetes but not the grouping of Pachysolen and none of the other clades shown to be monophyletic in our trees (including the CTG clade). In addition, support for the branchings was low (Suh et al. 2006). In contrast, we used a phylogenomics approach for the tree reconstruction resulting in strongly supported branching of Pachysolen. The separate branching of Pachysolen is in agreement with the previously proposed model of a long time span of codon usage ambiguity (Sugita & Nakase 1999). If the CUG codon usage was ambiguous for a long time, it would be natural to assume that species diverged within this time span, and that those species could have adopted either of the usages. Supposing that the codon usage was ambiguous for a long time, it seems highly non-parsimonious that either species did not branch within this time span or that all species except the ancestor of the CTG clade adopted the standard codon usage. A monophyletic group of the species using the AYCU is also highly unlikely given the missing conservation of CUG codon positions between the CTG clade species and Pachysolen.

To date the codon reassignments we dated the ML tree based on various separation time estimates for the splits of Sc. pombe and Sa. cerevisiae, and Sa. cerevisiae and Ca. albicans (fig. 6 and table 1). Accordingly, there is a maximum time span of codon ambiguity of 100-400 Myr ranging from the first possible appearance of the serine tRNASerCAG (separation of the Saccharomycetaceae and the branch containing the CTG clade, the Pichiaceae and Ogataea) to the split of the most basal species of the CTG clade (Ca. tenuis). Assuming an unambiguous usage of the CUG codon in Pachysolen and the CTG clade, the minimal time span of CUG codon ambiguity was 30 to 100 million years (fig. 6 and table 1).
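The quoted time spans follow from simple interval arithmetic on table 1. The sketch below reproduces them for the two fully constrained calibrations; it assumes that the maximum span corresponds to the width of the "CUG codon ambiguity" interval and the minimum span to the width of the "CUG codon reassignment" interval, which is a reading of the table legend rather than a documented procedure. The dictionary layout and the function name `span` are hypothetical.

```python
# Illustrative sketch; interval values (in Ma) are copied from table 1,
# using the rows in which both splits were constrained.
CALIBRATIONS = {
    "Heckman et al. 2001": {"ambiguity": (440, 841), "reassignment": (640, 740)},
    "Douzery et al. 2004": {"ambiguity": (135, 235), "reassignment": (190, 220)},
}


def span(interval):
    """Width of a (younger, older) interval in million years."""
    younger, older = interval
    return older - younger


for source, row in CALIBRATIONS.items():
    print(
        source,
        "maximum ambiguity span:", span(row["ambiguity"]), "Myr;",
        "minimum ambiguity span:", span(row["reassignment"]), "Myr",
    )
```

With the Heckman et al. constraints the maximum span is 401 Myr, which the text rounds to 400 Myr.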
This is well in agreement with a previous study, which suggested an ambiguous usage of the CUG codon for 100 million years (Sugita & Nakase 1999). However, the presented time estimates are based on the species analyzed in this study. It is reasonable to assume that further sequenced species will change the time estimates in two directions (this might already happen by analyzing the yeast species whose data are still under embargo). The inclusion of more basal CTG clade species will decrease the current codon ambiguity time. The identification of further species using the AYCU, which do not group to the CTG clade or to Pachysolen, instead might increase both the codon ambiguity and codon reassignment time ranges. Such species include yeasts, which might even branch before the split of the Saccharomycetaceae and the CTG clade/Pichiaceae/Ogataea, as well as additional species diverging in the sister branch of the CTG clade, and species separating early in the Saccharomycetaceae branch.

Acknowledgments

We would like to thank Prof. Christian Griesinger for his continuous generous support, and Prof. Toni Gabaldón and Dr. Qinhong Wang for their permission to use the Ca. metapsilosis CANME and Ca. maltosa Xu316 genome assembly data prior to publication. This project has been funded by the Deutsche Forschungsgemeinschaft DFG Grant KO 2251/13-1 to M.K. and a Synaptic Systems fellowship to S.M., and was partly supported by the Göttingen Graduate School of Neurosciences and Molecular Biosciences (DFG Grants GSC 226/1 and GSC 226/2).

Authors' contributions

SM and MK assembled and annotated genes, analyzed the data and wrote the manuscript.

References

Berbee ML, Taylor JW. 2007. Rhynie chert: a window into a lost world of complex plant-fungus interactions. New Phytol. 174:475–479. doi: 10.1111/j.1469-8137.2007.02080.x.
Bezerra AR et al. 2013. Reversion of a fungal genetic code alteration links proteome instability with genomic and phenotypic diversification. Proc. Natl. Acad. Sci. U.S.A. 110:11079–11084. doi: 10.1073/pnas.1302094110.
Butler G et al. 2009. Evolution of pathogenicity and sexual reproduction in eight Candida genomes. Nature. 459:657–662. doi: 10.1038/nature08064.
Capra JA, Singh M. 2007. Predicting functionally important residues from sequence conservation. Bioinformatics. 23:1875–1882. doi: 10.1093/bioinformatics/btm270.
Crooks GE, Hon G, Chandonia JM, Brenner SE. 2004. WebLogo: a sequence logo generator. Genome Res. 14:1188–1190.
Cutfield JF, Sullivan PA, Cutfield SM. 2000. Minor structural consequences of alternative CUG codon usage (Ser for Leu) in Candida albicans exoglucanase. Protein Eng. 13:735–738.
Darriba D, Taboada GL, Doallo R, Posada D. 2011. ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics. 27:1164–1165. doi: 10.1093/bioinformatics/btr088.
Diezmann S, Cox CJ, Schönian G, Vilgalys RJ, Mitchell TG. 2004. Phylogeny and evolution of medical species of Candida and related taxa: a multigenic analysis. J. Clin. Microbiol. 42:5624–5635. doi: 10.1128/JCM.42.12.5624-5635.2004.
Douzery EJP, Snell EA, Bapteste E, Delsuc F, Philippe H. 2004. The timing of eukaryotic evolution: Does a relaxed molecular clock reconcile proteins and fossils? Proc. Natl. Acad. Sci. U.S.A. 101:15386–15391. doi: 10.1073/pnas.0403984101.
Dujon B et al. 2004. Genome evolution in yeasts. Nature. 430:35–44. doi: 10.1038/nature02579.
Fitzpatrick DA, Logue ME, Stajich JE, Butler G. 2006. A fungal phylogeny based on 42 complete genomes derived from supertree and combined gene analysis. BMC Evol. Biol. 6:99. doi: 10.1186/1471-2148-6-99.
Gomes AC et al. 2007. A genetic code alteration generates a proteome of high diversity in the human pathogen Candida albicans. Genome Biol. 8:R206.
doi: 10.1186/gb-2007-8-10-r206.
Hammesfahr B, Kollmar M. 2012. Evolution of the eukaryotic dynactin complex, the activator of cytoplasmic dynein. BMC Evol. Biol. 12:95. doi: 10.1186/1471-2148-12-95.
Hatje K et al. 2011. Cross-species protein sequence and gene structure prediction with fine-tuned Webscipio 2.0 and Scipio. BMC Res. Notes. 4:265. doi: 10.1186/1756-0500-4-265.
Heckman DS et al. 2001. Molecular Evidence for the Early Colonization of Land by Fungi and Plants. Science. 293:1129–1133. doi: 10.1126/science.1061457.
Hedges SB, Blair JE, Venturi ML, Shoe JL. 2004. A molecular timescale of eukaryote evolution and the rise of complex multicellular life. BMC Evol. Biol. 4:2. doi: 10.1186/1471-2148-4-2.
Hedges SB, Dudley J, Kumar S. 2006. TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics. 22:2971–2972. doi: 10.1093/bioinformatics/btl505.
Huson DH, Bryant D. 2006. Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol. 23:254–267. doi: 10.1093/molbev/msj030.
Jackson AP et al. 2009. Comparative genomics of the fungal pathogens Candida dubliniensis and Candida albicans. Genome Res. 19:2231–2244. doi: 10.1101/gr.097501.109.
Jeffries TW et al. 2007. Genome sequence of the lignocellulose-bioconverting and xylose-fermenting yeast Pichia stipitis. Nat. Biotechnol. 25:319–326. doi: 10.1038/nbt1290.
Jones T et al. 2004. The diploid genome sequence of Candida albicans. Proc. Natl. Acad. Sci. U.S.A. 101:7329–7334. doi: 10.1073/pnas.0401648101.
Kurtzman C, Fell JW, Boekhout T. 2011. The Yeasts: A Taxonomic Study. 5th edition. London, Burlington, San Diego: Elsevier.
Kurtzman CP. 2011. Phylogeny of the ascomycetous yeasts and the renaming of Pichia anomala to Wickerhamomyces anomalus. Antonie Van Leeuwenhoek. 99:13–23. doi: 10.1007/s10482-010-9505-6.
Kurtzman CP, Robnett CJ. 2013a. Alloascoidea hylecoeti gen. nov., comb.
nov., Alloascoidea africana comb. nov., Ascoidea tarda sp. nov., and Nadsonia starkeyihenricii comb. nov., new members of the Saccharomycotina (Ascomycota). FEMS Yeast Res. 13:423–432. doi: 10.1111/1567-1364.12044. Kurtzman CP, Robnett CJ. 1997. Identification of clinically important ascomycetous yeasts based on nucleotide divergence in the 5’ end of the large-subunit (26S) ribosomal DNA gene. J. Clin. Microbiol. 35:1216–1223. Kurtzman CP, Robnett CJ. 2013b. Relationships among genera of the Saccharomycotina (Ascomycota) from multigene phylogenetic analysis of type species. FEMS Yeast Res. 13:23–33. doi: 10.1111/1567-1364.12006. Liu X, Kaas RS, Jensen PR, Workman M. 2012. Draft genome sequence of the yeast Pachysolen tannophilus CBS 4044/NRRL Y-2460. Eukaryotic Cell. 11:827. doi: 10.1128/EC.00114-12. Massey SE et al. 2003. Comparative evolutionary genomics unveils the molecular mechanism of reassignment of the CTG codon in Candida spp. Genome Res. 13:544– 557. doi: 10.1101/gr.811003. Miranda I et al. 2007. A genetic code alteration is a phenotype diversity generator in the human pathogen Candida albicans. PLoS ONE. 2:e996. doi: 10.1371/journal.pone.0000996. Miranda I et al. 2013. Candida albicans CUG mistranslation is a mechanism to create cell surface variation. MBio. 4: pii e00285-13. doi: 10.1128/mBio.00285-13. Odronitz F, Kollmar M. 2007. Drawing the tree of eukaryotic life based on the analysis of 2,269 manually annotated myosins from 328 species. Genome Biol. 8:R196. doi: 10.1186/gb-2007-8-9-r196. Odronitz F, Kollmar M. 2006. Pfarao: a web application for protein family analysis customized for cytoskeletal and motor proteins (CyMoBase). BMC Genomics. 7:300. doi: 10.1186/1471-2164-7-300. Odronitz F, Pillmann H, Keller O, Waack S, Kollmar M. 2008. WebScipio: an online tool for the determination of gene structures using protein sequences. BMC Genomics. 9:422. doi: 10.1186/1471-2164-9-422. Ohama T et al. 1993. 
Non-universal decoding of the leucine codon CUG in several Candida species. Nucleic Acids Res. 21:4039–4045. - 28 - Downloaded from http://gbe.oxfordjournals.org/ by guest on October 6, 2014 Kurtzman CP, Suzuki M. 2010. Phylogenetic analysis of ascomycete yeasts that form coenzyme Q-9 and the proposal of the new genera Babjeviella, Meyerozyma, Millerozyma, Priceomyces, and Scheffersomyces. Mycoscience. 51:2–14. doi: 10.1007/s10267-009-0011-5. Rambaut A, Drummond A. 2009. FigTree v1.3.1. http://tree.bio.ed.ac.uk/software/figtree/, visited May 9, 2014. Robert V et al. 2013. MycoBank gearing up for new horizons. IMA Fungus. 4:371– 379. doi: 10.5598/imafungus.2013.04.02.16. Rocha R, Pereira PJB, Santos MAS, Macedo-Ribeiro S. 2011. Unveiling the structural basis for translational ambiguity tolerance in a human fungal pathogen. PNAS. 108:14091–14096. doi: 10.1073/pnas.1102835108. Ronquist F, Huelsenbeck JP. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 19:1572–1574. Santos MAS, Gomes AC, Santos MC, Carreto LC, Moura GR. 2011. The genetic code of the fungal CTG clade. C. R. Biol. 334:607–611. doi: 10.1016/j.crvi.2011.05.008. Smith SA, O’Meara BC. 2012. treePL: divergence time estimation using penalized likelihood for large phylogenies. Bioinformatics. 28:2689–2690. doi: 10.1093/bioinformatics/bts492. Stamatakis A, Hoover P, Rougemont J. 2008. A rapid bootstrap algorithm for the RAxML Web servers. Syst. Biol. 57:758–771. doi: 10.1080/10635150802429642. Stanke M, Morgenstern B. 2005. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res. 33:W465–467. doi: 10.1093/nar/gki458. Sugita T, Nakase T. 1999. Non-universal usage of the leucine CUG codon and the molecular phylogeny of the genus Candida. Syst. Appl. Microbiol. 22:79–86. doi: 10.1016/S0723-2020(99)80030-7. Suh S-O, Blackwell M, Kurtzman CP, Lachance M-A. 2006. Phylogenetics of Saccharomycetales, the ascomycete yeasts. 
Mycologia. 98:1006–1017. Suzuki C, Kashiwagi T, Hirayama K. 2002. Alternative CUG codon usage (Ser for Leu) in Pichia farinosa and the effect of a mutated killer gene in Saccharomyces cerevisiae. Protein Eng. 15:251–255. Szkopińska A. 2000. Ubiquinone. Biosynthesis of quinone ring and its isoprenoid side chain. Intracellular localization. Acta Biochim. Pol. 47:469–480. Talavera G, Castresana J. 2007. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol. 56:564–577. doi: 10.1080/10635150701472164. Thompson JD, Gibson TJ, Higgins DG. 2002. Multiple sequence alignment using ClustalW and ClustalX. Curr Protoc Bioinformatics. Chapter 2:Unit 2.3. doi: 10.1002/0471250953.bi0203s00. - 29 - Downloaded from http://gbe.oxfordjournals.org/ by guest on October 6, 2014 Schauer F, Hanschke R. 1999. [Taxonomy and ecology of the genus Candida]. Mycoses. 42 Suppl 1:12–21. Tourancheau AB, Tsao N, Klobutcher LA, Pearlman RE, Adoutte A. 1995. Genetic code deviations in the ciliates: evidence for multiple and independent events. EMBO J. 14:3262–3267. Turunen M, Olsson J, Dallner G. 2004. Metabolism and function of coenzyme Q. Biochim. Biophys. Acta. 1660:171–199. Watanabe K, Yokobori S-I. 2011. tRNA Modification and Genetic Code Variations in Animal Mitochondria. J Nucleic Acids. 2011:623095. doi: 10.4061/2011/623095. Wohlbach DJ et al. 2011. Comparative genomics of xylose-fermenting fungi for enhanced biofuel production. Proc. Natl. Acad. Sci. U.S.A. 108:13212–13217. doi: 10.1073/pnas.1103039108. Figure Legends Figure 1 Phylogenetic relationship between Saccharomycetes The unrooted phylogenetic network was generated using the Neighbor-Net method as implemented in SplitsTree 4.1.3.1. The Schizosaccharomycetes species were included as outgroup. The network strongly supports the Saccharomycetaceae and the CTG clade (highlighted in orange). The grouping of Pachysolen tannophilus is not unambiguously resolved. 
Species of the genus "Candida" are highlighted in blue (teleomorph names) and green (if anamorphs of the species are called "Candida") showing the paraphyly (or mis-assingment) of this genus. Orange and purple dots mark species, for which alternative or standard codon usage has already been shown elsewhere. - 30 - Downloaded from http://gbe.oxfordjournals.org/ by guest on October 6, 2014 Yokobori S, Suzuki T, Watanabe K. 2001. Genetic code variations in mitochondria: tRNA as a major determinant of genetic code plasticity. J. Mol. Evol. 53:314–326. doi: 10.1007/s002390010221. Figure 2 Sequence alignment of the yeast class I myosins highlighting leucines and serines encoded by CUG The protein sequence alignment represents part of the class I myosin alignment (for the complete alignment see fig. S2, Supplementary Material online). Numbers on the left denote the residue numbers of the first amino acids of the sequences in this section of the alignment. All CUG positions occurring in the aligned class I myosin genes are highlighted. We assigned the most probable translation scheme to each species and translated the CUG codons accordingly. Blue and green boxes indicate Figure 3 Conservation of serine and leucine positions A) The charts show the amino acid conservation at all alignment positions of the Gblocks reduced concatenated alignment of the 26 cytoskeletal and motor proteins, at which at least one leucine (upper charts) or one serine (lower charts) is present. The sequence conservation has been determined based on the Property-Entropy divergence, as described in (Capra & Singh 2007). With a window size of 0, each column is scored independently (left row), while the surrounding 3 columns are also taken into account with a window size of 3 (right row). Blue bars represent the number of alignment positions with a conservation score for leucine and serine residues, respectively, within the given half-bounded intervals. 
Red bars denote the number of alignment positions with respect to conservation, at which at least one CUG codon is present independent of its translation. Green bars give the total numbers of CUG codons at the respective alignment positions. B) The weblogos (Crooks et al. 2004) show the sequence conservation of two kinesin sub-families, kinesin-1 and kinesin-5, within the family-defining motor domain around the highly - 31 - Downloaded from http://gbe.oxfordjournals.org/ by guest on October 6, 2014 CUG codons coding for leucine and serine, respectively. conserved switch II and α-helix α4 motifs. At the position within α4 marked by a grey bar kinesin-1 sequences contain a highly conserved serine while kinesin-5 sequences contain a highly conserved leucine indicating the need to resolve subfamily relationships when determining CUG codon usage by sequence conservation. Figure 4 CUG codon usage at conserved leucine and serine positions The graph presents the CUG codon usage with respect to alignment position alignment positions with conservation scores of ≥90%, ≥80%, and ≥50% and a window size of 3, respectively, in the concatenated alignment of the 26 motor and cytoskeletal proteins. Red and blue colours denote the percentages of CUG codons present at alignment positions enriched in serines and leucines, respectively. For comparison, we plotted the percentages of CUG codons at positions of the assigned codon translation to the left (% CUG codons at leucine positions for species using the standard code and % CUG codons at serine positions for species using the AYCU) and the percentages of CUG codons present at alignment positions enriched in the respective other amino acid to the right. When considering only highly conserved alignment positions (≥90% conserved) the CUG codon translation assignment is unambiguous. Species using the AYCU are highlighted in bold. 
Figure 5 CUG codon positions conserved between species The heatmap represents the number of CUG codon positions shared between every two species. The upper triangle shows the number of shared CUG codon positions at those positions in the concatenated alignment of the full-length sequences, which have - 32 - Downloaded from http://gbe.oxfordjournals.org/ by guest on October 6, 2014 conservation. For each species we determined the percentage of CUG codons at a conservation score of at least 50%, the lower triangle the number of shared CUG codons at all alignment positions. The diagonal represents the total number of CUG codons in the respective species. The number of CUG codons is coloured on a logarithmic scale. Species encoding CUG as serine are typed in bold. Figure 6 Timeline of the CUG codon reassignment The tree presents the maximum-likelihood topology generated under the Γ + WAGF model in RAxML showing branch lengths for the concatenated alignments of 26 bootstrap replicates), MrBayes (posterior probabilities) and ClustalW trees (1,000 replicates) is indicated (for more details see fig. S1, Supplementary Material online). Species using the AYCU are highlighted in bold. With the splits of Sc. pombe and Sa. cerevisiae and Sa. cerevisiae and Ca. albicans set to 587 mya and 235 mya, respectively, the divergence of the CTG clade was estimated to 190 mya using treePL (Smith & O’Meara 2012). The scale bar denotes amino acid substitutions per site. The width and colour of the branches to extant species represent the total number of CUG codons in the respective concatenated sequences. Tables Table 1 Estimation of the time span of the CUG codon reassignment Sc. pombe-Sa. cerevisiae Sa. cerevisiae-Ca. albicans CUG codon ambiguity CUG codon reassignment ( 1,850 841a 520-841 740-855) 1,144a 730 415-730 605-695 - 33 - Downloaded from http://gbe.oxfordjournals.org/ by guest on October 6, 2014 cytoskeletal and motor proteins. 
Support for major branches of the RAxML (1,000 1,144a 841a 440-841 640-740 485 235b 135-235 195-225 587b 360 205-360 295-340 587b 235b 135-235 190-220 Estimated times in million years for the split between Sa. cerevisiae and Ca. albicans, Sa. cerevisiae and Sc. pombe and the reassignment of the CUG codon were calculated with a Penalized-likelihood program for the phylogenetic tree generated with RAxML a Time constrains for the splits between Sa. cerevisiae and Ca. albicans, and Sa. cerevisiae and Sc. pombe as reported by (Heckman et al. 2001) and b(Douzery et al. 2004), respectively, were included both individually and combined. For each CUG codon reassignment time computation, constrained time estimates are typed in bold. The CUG codon reassignment time estimates are given as differences between the time estimates for the divergence of P. tannophilus from the Pichiaceae and Ogataea species (first number), and the first occurrence of the tRNASerCAG (second number; separation of the CTG clade from the branch containing P. tannophilus, the Pichiaceae and Ogataea species). The CUG codon ambiguity time estimates, however, are given as differences between the time estimates for the latest possible fixation of the tRNASerCAG (first number; separation of Ca. tenuis, the most basal species in the CTG clade) and the earliest possible invention of the tRNASerCAG (second number; separation of the Saccharomycetaceae and the CTG clade). The time estimate for the divergence of Sc. pombe and Sa. cerevisiae (1,850 mya) based on the constraint of 841 mya for the split of Sa. cerevisiae and Ca. albicans predates the emergence of eukaryotes and should be ignored. - 34 - Downloaded from http://gbe.oxfordjournals.org/ by guest on October 6, 2014 using the WAG+Γ+F model. Supporting Information Figure S1 Phylogenetic trees This file contains the phylogenetic trees generated with ClustalW, RAxML and MrBayes based on the concatenated alignments of 26 motor and cytoskeletal proteins. 
Bootstrap support values and posterior probabilities are reported in absolute values (ClustalW) and relative values (RAxML and MrBayes). In addition, the RAxML tree dated by constraining the time for the splits between Sa. cerevisiae and Ca. albicans, and Sa. cerevisiae and Sc. pombe, is included. The scale bar denotes million years. Moreover, a RAxML tree was generated on a modified alignment, in which CTG encoded amino acids were replaced by "X". In all trees, major branches as well as the "CTG-clade" are highlighted with colours.

Figure S2 Alignment of the yeast class I myosins with all CUG positions highlighted
All CUG positions occurring in the aligned class I myosin genes are highlighted. Blue and green boxes indicate CUG codons coding for leucine and serine, respectively.

Figure S3 Conservation of serine and leucine positions
In contrast to fig. 3, alignment positions were scored on all sequences of our dataset, including inferred CUG translations. Scores were calculated with window sizes of 0 (charts on the left) and 3 (charts on the right).

Figure S4 CUG codon usage at conserved leucine and serine positions
Alignment positions were scored without sequences containing CUG codons. In contrast to fig. 4, the window size for score computations was set to 0.

File S1 List of analyzed species
This table lists the teleomorph name (used as reference) of every analyzed species, names of anamorphs, alternatively used scientific names and the species abbreviations as used throughout this analysis. For every species the number of CapZ, actin and actin-related, tubulin, kinesin, myosin, and dynein heavy chain genes is given. Genome analyses, if published, accession numbers and sequencing centres are referenced.

File S2 Concatenated alignment
This zip-compressed file contains the concatenated multiple sequence alignment of the 26 cytoskeletal and motor proteins.

File S3 Conservation of serine and leucine positions
This file contains the raw data to fig. 3 and fig. S3.

File S4 CUG codon usage at conserved leucine and serine positions
This file contains the raw data to fig. 4 and fig. S4.
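The modified alignment mentioned for Figure S1, in which CTG encoded amino acids were replaced by "X", can be illustrated with a minimal sketch. The function name `mask_ctg_residues` and its inputs are hypothetical and not taken from the authors' pipeline. The sketch assumes that each aligned protein sequence comes with its in-frame, intron-free coding sequence and that gaps are written as "-".

```python
def mask_ctg_residues(aligned_protein: str, cds: str,
                      mask: str = "X", gap: str = "-") -> str:
    """Return the aligned protein with every CTG/CUG-encoded residue masked.

    Hypothetical helper: assumes `cds` is the in-frame coding sequence of the
    ungapped protein (no introns, no frameshifts).
    """
    cds = cds.upper().replace("U", "T")  # accept RNA or DNA notation
    masked = []
    codon_index = 0
    for residue in aligned_protein:
        if residue == gap:
            # Gaps have no codon; keep them and do not advance in the CDS.
            masked.append(gap)
            continue
        codon = cds[3 * codon_index: 3 * codon_index + 3]
        if len(codon) < 3:
            raise ValueError("coding sequence is shorter than the protein")
        masked.append(mask if codon == "CTG" else residue)
        codon_index += 1
    return "".join(masked)


# Toy example: the second residue is encoded by CTG and is therefore masked.
print(mask_ctg_residues("ML-SK", "ATGCTGTCAAAA"))  # MX-SK
```

Masking rather than translating leaves the tree inference unaffected by any assumption about whether CUG is decoded as leucine or serine in a given species.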
[Figure: phylogenetic tree of the analysed species. Tip labels: species and strain names. Labelled groups include the CTG clade and the Pichiaceae. Legend: reported standard codon usage; reported alternative codon usage.]

[Figure: multiple sequence alignment of the class I myosin sequences (Myo1, Myo1A, Myo1B), alignment columns 310 to 400.]

[Figure: histograms for leucine residues and serine residues at window size 0 and window size 3. x-axis: conservation score [%]; y-axis: occurrence. Legend: # alignment positions with a score in the respective range; # alignment positions with a score in the respective range and with at least 1 CUG codon; # CUG codons at alignment positions with a score in the respective range. Sequence logos (bits) for kinesin-1 and kinesin-5, positions 1 to 55, with switch II and α4 marked.]

[Figure: percentage of CUG codons per species ([%] CUG codons, 0 to 100). Series: serine and leucine positions with a minimum conservation score of 90%, 80% and 50%.]

[Figure 5: heatmap of shared CUG codon positions between species. Labels: # CUGs; # CUGs at positions with a conservation score of ≥50%; # CUGs at identical alignment positions. Logarithmic colour scale: 1, 3, 10, 31, 100, 316.]

[Figure 6: dated tree. Branch support given as triplets (for example 100/100/100); node ages marked in Ma (for example 587 Ma and 235 Ma); scale bar: 0.2 amino acid substitutions per site; colour scale: # CTG codons, 0 to 800.]
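The layout of the Figure 5 heatmap (diagonal: total CUG codons per species; upper triangle: shared CUG positions with a conservation score of at least 50%; lower triangle: shared CUG positions at all alignment positions) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `shared_cug_matrix`, the input format (a set of alignment positions carrying a CUG codon per species) and the per-position score list are all hypothetical.

```python
from typing import Dict, List, Sequence, Set


def shared_cug_matrix(cug_positions: Dict[str, Set[int]],
                      conservation: Sequence[float],
                      threshold: float = 50.0) -> List[List[int]]:
    """Build a Figure-5-style matrix of shared CUG codon positions.

    Assumed inputs (hypothetical format):
      cug_positions: species name -> alignment positions holding a CUG codon
      conservation:  conservation score [%] for every alignment position
    """
    species = list(cug_positions)
    # Alignment positions that reach the conservation threshold.
    conserved = {pos for pos, score in enumerate(conservation)
                 if score >= threshold}
    n = len(species)
    matrix = [[0] * n for _ in range(n)]
    for i, a in enumerate(species):
        for j, b in enumerate(species):
            shared = cug_positions[a] & cug_positions[b]
            if i == j:
                matrix[i][j] = len(cug_positions[a])   # diagonal: total CUGs
            elif i < j:
                matrix[i][j] = len(shared & conserved)  # upper: conserved only
            else:
                matrix[i][j] = len(shared)              # lower: all positions
    return matrix


# Toy data with three species and four alignment positions.
toy = {"A": {0, 1, 2}, "B": {1, 2, 3}, "C": {2}}
scores = [10.0, 60.0, 40.0, 90.0]
for row in shared_cug_matrix(toy, scores):
    print(row)
```

For plotting, a logarithmic colour scale as described in the caption would need zero counts to be handled separately, since the logarithm of zero is undefined.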