Understanding genome evolution in non-model taxa is negatively affected by homology-based transposable element identification

Roy N. Platt II & David A. Ray
Department of Biological Sciences, Texas Tech University

Abstract

Transposable elements (TEs) are mobile genetic elements with the ability to replicate throughout a host genome. In some taxa, TEs reach copy numbers in the hundreds of thousands or millions and can occupy more than half of the genome. The increasing number of reference genomes from non-model species has outpaced efforts to identify and annotate TE content and, when applied, annotation methods vary significantly between projects. Here we demonstrate the pitfalls of a homology-dependent method of TE identification using examples from Mammalia and Insecta. De novo repeat identification and manual curation identified more than a hundred new TEs in both the naked mole rat (Heterocephalus glaber) and prairie vole (Microtus ochrogaster) genomes. When these genomes were re-annotated using these novel repeats as well as the available rodent repeat libraries, the recognized portion of the genome increased by 3-5%. More importantly, the average genetic distance within TE families decreased, implying younger, more recent TE accumulation than was previously thought. Reanalysis of the postman butterfly (Heliconius melpomene) genome recovered similar results: increased recognition of younger TEs. These observations imply that homology-based searches are unable to identify novel, lineage-specific repeats and that the accuracy of homology-based TE annotations decreases as the phylogenetic distance between taxa increases. This would mean that families or, in the case of horizontal transfer events, entire classes of TEs may go unrecognized. In order to understand the role that TEs may play in genome evolution, they must first be identified using de novo repeat identification and manual curation.
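The link between within-family genetic distance and inferred age can be made concrete with a toy calculation. The sketch below is illustrative only: the sequences, the function names (`count_mismatches`, `mean_divergence`, `estimate_age_my`) and the substitution rate are hypothetical assumptions, not values or methods from the analyses described here. It counts mismatches between pre-aligned TE copies and two candidate consensus sequences, then converts the mean proportion of mismatches to an age with a simple age = divergence / rate approximation.

```python
from statistics import mean


def count_mismatches(consensus: str, copy: str) -> int:
    """Count mismatched positions between two pre-aligned, equal-length sequences."""
    if len(consensus) != len(copy):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(consensus, copy) if a != b)


def mean_divergence(consensus: str, copies: list[str]) -> float:
    """Mean proportion of mismatched sites between each copy and the consensus."""
    return mean(count_mismatches(consensus, c) / len(consensus) for c in copies)


def estimate_age_my(divergence: float, rate_per_site_per_my: float) -> float:
    """Naive age estimate in millions of years: divergence divided by rate."""
    return divergence / rate_per_site_per_my


# Hypothetical toy data (20 bp). The "distant" consensus differs from the
# lineage-specific consensus at three sites; each genomic copy carries one
# private mutation relative to the lineage-specific consensus.
lineage_consensus = "ACGTACGTACGTACGTACGT"
distant_consensus = "ACGAACGTTCGTACCTACGT"
copies = [
    "ATGTACGTACGTACGTACGT",
    "ACGTACGTACATACGTACGT",
    "ACGTACGTACGTACGTACTT",
]

# Assumed substitution rate (per site per million years); purely illustrative.
RATE = 0.002

for name, consensus in [("lineage-specific", lineage_consensus),
                        ("distant", distant_consensus)]:
    mismatches = [count_mismatches(consensus, c) for c in copies]
    age = estimate_age_my(mean_divergence(consensus, copies), RATE)
    print(f"{name}: mismatches per copy = {mismatches}, naive age = {age:.1f} My")
```

The same three copies appear several times more diverged, and therefore older, when scored against the distant consensus, although nothing about the copies themselves has changed.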
The Problem

Figure 1. Repeat identification. The accuracy of homology-based repeat identification is driven by the query element used. In the example shown, a "human" element is used to identify elements in the mouse genome. As a result, all of the repeats are identified, but the number of mismatches is artificially inflated. In contrast, repeat identification with the more appropriate "rodent" repeat recovers the same repeats, but with only one mismatch between the consensus element and each individual locus. The number of mismatches, or mutations, is used as a proxy for age. The repeats in the mouse genome are almost identical, but the estimated age of the elements varies drastically depending on the query element used.

Figure 2. Homology-based TE annotations using human transposable elements (TEs). TEs in several mammalian genomes were identified and quantified using human TEs. The amount of TE sequence identified using human TEs is given as a percentage of the known repeat content. Time since divergence from the human lineage for each taxon was taken from TimeTree.org. Taxonomically similar species are grouped together by color. The dotted line represents 100% recognition. (Axis labels: Percent transposable elements identified; Millions of years.)

Methods

Figure 1. Species examined. The naked mole rat (Heterocephalus glaber), prairie vole (Microtus ochrogaster) and postman butterfly (Heliconius melpomene).

Homology-based curation
1. Genomes were masked with clade-specific repeats (Rodentia & Arthropoda)
2. Repeat content was quantified with RepeatMasker

De novo curation
1. Genomes were masked with clade-specific repeats (Rodentia & Arthropoda)
2. Novel repeats were identified with RepeatModeler (Heliconius from Lavoie et al. 2014)
3. Novel repeats were manually verified through a BLAST, extension, alignment protocol
4. Repeats were classified based on sequence hallmarks and the 80-80-80 rule
5. Repeat content was quantified with the de novo and ancestral repeats using RepeatMasker

Table 1. Transposable element load (Mb) in the naked mole rat (Heterocephalus glaber) and the prairie vole (Microtus ochrogaster) using rodent-specific and de novo transposable element libraries. Rodent-specific libraries were taken from Repbase (August 2014). De novo libraries were combined with the rodent-specific libraries in an effort to generate complete annotations.

                                  Naked mole rat        Prairie vole
                                Rodent   de novo      Rodent   de novo
Class I Retrotransposons         594.8    661.65      510.13     601.6
  LTRs                          157.39     175.2      169.23     251.4
    ERV                           7.55      7.45        1.17       0.9
    ERV1                         17.05     15.47        8.63      11.3
    ERV2                         21.35     14.61       71.46     147.7
    ERV3                        110.65     84.39       85.65      84.8
    Gypsy                         0.54      0.51        0.07       0.1
    LTR                           0.25     52.77        2.25       6.6
  LINEs                         368.84    400.35      172.28     188.6
    CR1                          16.18     15.94        1.61       1.6
    L1                          352.16    383.94      170.57     186.9
    L2                            0.12      0.11        0.02         0
    Penelope                      0.01      0.01           0         0
    R4                            0.01      0.01           0         0
    RTE                           0.02      0.02        0.01         0
    RTEX                          0.33      0.31        0.07       0.1
    Tx1                           0.01      0.01           0         0
  SINEs                          68.51     86.04      168.60     161.6
    SINE1/7SL                    68.42     74.29       66.38      58.4
    SINE2                            0     11.66      102.15     103.2
    SINE3/5S                      0.04      0.04        0.04         0
    Vingi                            0         0           0         0
    Unk                           0.05      0.05        0.03         0
  Unclassified non-LTRs           0.06      0.06        0.02         0
    Unclassified                  0.06      0.06        0.02         0
Class II DNA Transposons         33.16     40.15       14.75      14.7
    PiggyBac                         0      1.73           0         0
    TcMariner                    14.45     18.85        3.87       3.9
    hAT                          15.33     16.42        8.03       8.0
    MuDR                          1.43      1.24        0.25       0.3
    Helitron                      0.13      0.13        0.02         0
    Kolobok                       0.02      0.02        0.01         0
    Unk                            1.8      1.76        2.57       2.6
Unclassified TEs                  5.61      20.1        7.00       7.6
    Unclassified                  5.61      20.1        7.00       7.6
Total                           633.57    721.84      531.88     623.9

Figure 3. Differences in TE accumulation histories of the (A & B) naked mole rat (Heterocephalus glaber), (C & D) prairie vole (Microtus ochrogaster), and (E & F) postman butterfly (Heliconius melpomene) before and after de novo TE identification and curation. To identify known TEs based on homology only, RepeatMasker searches used all known mammal TEs against the (A) mole rat and (C) prairie vole genomes and all known arthropod TEs against the (E) postman butterfly genome. De novo identification and curation altered the content, quantity and distribution of elements identified in the (B) mole rat, (D) prairie vole and (F) postman butterfly genomes. The divergence of each element from its consensus sequence was calculated and binned to show the accumulation profile for each taxon. For the mole rat and prairie vole, highly mutable CpG sites were excluded from analyses. (Panel labels: Homology only; Full curation. Axis labels: Divergence from consensus; TE-derived nucleotides; Count.)

Major Findings

• Using de novo repeat identification, more than 100 novel TE subfamilies were recovered in each of the prairie vole and naked mole rat genomes.
• Novel TE subfamilies occupied more than 100 Mb in both rodent genomes, an increase of 15-20% over what was previously estimated.
• In all three species, de novo repeat identification, compared with homology-dependent identification, resulted in reduced nucleotide diversity within repeat subfamilies.
• In mammals, re-analysis using de novo repeats shifted the estimated age of lineage-specific repeats by 40-45 million years.
This number reflects the age difference between the subject genome and the closest relative with known repeats (usually Mus and Rattus).

Recommendations

The examples presented here indicate that the homology-based analyses commonly employed by genome sequencing, assembly and analysis projects do not provide an accurate picture of TE content or accumulation patterns. As more genome sequences become available, it is imperative to provide full, detailed repeat annotations. Relying on homology to elements from the closest relative will create a negative feedback loop in which poor repeat annotations in "taxon A" lead to poor annotations in "taxon B"…

The most accurate repeat annotations are possible through:
1. Repeat identification through de novo computational resources
2. Verification of element capture, often requiring manual curation
3. Classification of novel elements to the family level or beyond
4. Re-annotation using known ancestral repeats plus newly identified elements

By following the principles outlined here, we can significantly improve our ability to understand the biology of TEs and genome evolution in general.

Acknowledgments

We would like to thank Robert Hubley, Laura Berdugo-Blanco, Sarah Mangum, and Wesli Stubbs for their support. Citations are available upon request.
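Note on the classification step. The 80-80-80 rule named in the Methods is commonly summarized as follows: two elements are assigned to the same family when their alignment is at least 80 bp long, shows at least 80% sequence identity, and covers at least 80% of the compared region. The sketch below is a minimal illustration under that reading; the `PairwiseHit` type, the `passes_80_80_80` function, and the choice of the query length as the coverage denominator are hypothetical assumptions, not the protocol used for the analyses presented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairwiseHit:
    """Summary of one pairwise alignment (hypothetical container)."""
    aligned_length: int  # alignment length in bp
    identical: int       # identical positions within the alignment
    query_length: int    # length of the region being classified, in bp


def passes_80_80_80(hit: PairwiseHit,
                    min_length: int = 80,
                    min_identity: float = 0.80,
                    min_coverage: float = 0.80) -> bool:
    """Return True if the hit meets all three thresholds of the rule."""
    if hit.query_length <= 0:
        raise ValueError("query_length must be positive")
    if hit.aligned_length < min_length:
        return False  # alignments shorter than 80 bp are not considered
    identity = hit.identical / hit.aligned_length
    # Assumption: coverage is measured relative to the query region.
    coverage = hit.aligned_length / hit.query_length
    return identity >= min_identity and coverage >= min_coverage


# Hypothetical examples
same_family = PairwiseHit(aligned_length=900, identical=810, query_length=1000)
too_short = PairwiseHit(aligned_length=60, identical=60, query_length=60)
too_diverged = PairwiseHit(aligned_length=900, identical=630, query_length=1000)
low_coverage = PairwiseHit(aligned_length=500, identical=490, query_length=1000)

for hit in (same_family, too_short, too_diverged, low_coverage):
    print(hit, passes_80_80_80(hit))
```

In practice the alignment statistics would come from a search tool's output rather than hand-entered values, and elements that fail the test against every known family are candidates for a new family.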