Supplementary Online Material

BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs

Felipe A. Simão†, Robert M. Waterhouse†*, Panagiotis Ioannidis, Evgenia V. Kriventseva, and Evgeny M. Zdobnov*

Department of Genetic Medicine and Development, University of Geneva Medical School and Swiss Institute of Bioinformatics, rue Michel-Servet 1, 1211 Geneva, Switzerland.
† Equal contribution.
* To whom correspondence should be addressed: Robert.Waterhouse@unige.ch, Evgeny.Zdobnov@unige.ch

Contents:
1. BUSCO: Benchmarking Universal Single-Copy Orthologs
1.1. BUSCO selection
1.2. Hidden Markov models, ancestral sequences and block profiles
1.3. Candidate BUSCO matches from genome assemblies
1.4. Gene prediction: assessing genome assemblies and transcriptomes
1.5. BUSCO match assignment
1.6. Classification: Complete, Duplicated, Fragmented, Missing
1.7. Training Augustus gene finding parameters
2. BUSCO completeness versus N50 contiguity
3. BUSCO versus CEGMA assessment of genome assembly completeness
4. BUSCO assessments of genomes, transcriptomes, and gene sets
5. BUSCO and CEGMA analysis run-times
6. References

QUEST FOR QUALITY: “BUSCO CALIDAD”, “BUSCO QUALIDADE”
http://busco.ezlab.org

Simão, Waterhouse, et al. 2015, Supplementary Online Material

1. BUSCO: Benchmarking Universal Single-Copy Orthologs

1.1. BUSCO selection

Benchmarking Universal Single-Copy Orthologs (BUSCO) sets are collections of orthologous groups with near-universally-distributed single-copy genes in each species, selected from OrthoDB root-level orthology delineations across arthropods, vertebrates, metazoans, fungi, and eukaryotes (Kriventseva, et al., 2014; Waterhouse, et al., 2013). BUSCO groups were selected from each major radiation of the species phylogeny by requiring genes to be present as single-copy orthologs in at least 90% of the species. In the remaining species they may be lost or duplicated but, to ensure broad phyletic distribution, they cannot all be missing from one sub-clade. The species that define each major radiation were selected to include the majority of OrthoDB species, excluding only those with unusually high numbers of missing or duplicated orthologs, while retaining representation from all major sub-clades. Because of this widespread presence, any BUSCO can be expected to be found as a single-copy ortholog in any newly-sequenced genome from the appropriate phylogenetic clade (Waterhouse, et al., 2011).
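As a rough illustration of the selection criteria above, the following Python sketch filters orthologous groups by the fraction of species carrying exactly one copy and by sub-clade representation. The data layout (`copy_counts`, `subclades`) and the function name `select_busco_groups` are hypothetical, chosen only for this example; they are not taken from OrthoDB or the BUSCO pipeline.

```python
from typing import Dict, List


def select_busco_groups(
    copy_counts: Dict[str, Dict[str, int]],
    subclades: Dict[str, List[str]],
    min_single_copy_fraction: float = 0.9,
) -> List[str]:
    """Return IDs of groups that satisfy the two selection criteria.

    copy_counts maps group ID -> {species: number of gene copies}
    (hypothetical layout; a species absent from the inner dict counts as 0).
    subclades maps sub-clade name -> list of its species.
    """
    all_species = [sp for members in subclades.values() for sp in members]
    selected = []
    for group_id, counts in copy_counts.items():
        # Criterion 1: single-copy in at least 90% of the species.
        n_single = sum(1 for sp in all_species if counts.get(sp, 0) == 1)
        if n_single / len(all_species) < min_single_copy_fraction:
            continue
        # Criterion 2: the gene must not be missing from every species
        # of any one sub-clade.
        represented_everywhere = all(
            any(counts.get(sp, 0) > 0 for sp in members)
            for members in subclades.values()
        )
        if represented_everywhere:
            selected.append(group_id)
    return selected


# Toy example: ten species in two sub-clades (all names are made up).
species = [f"s{i}" for i in range(10)]
subclades = {"A": species[:9], "B": species[9:]}
copy_counts = {
    "OG1": {sp: 1 for sp in species},                          # kept
    "OG2": {**{sp: 1 for sp in species}, "s0": 2, "s1": 2},    # only 80% single-copy
    "OG3": {**{sp: 1 for sp in species}, "s9": 0},             # missing from all of sub-clade B
    "OG4": {**{sp: 1 for sp in species}, "s0": 0},             # 90% single-copy, kept
}
selected = select_busco_groups(copy_counts, subclades)
```

In this toy input, `OG3` passes the 90% threshold but is rejected because its only absence wipes out an entire sub-clade, which is the situation the second criterion is meant to exclude.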
A total of 38 arthropods (3’078 BUSCO groups), 41 vertebrates (4’425 BUSCO groups), 93 metazoans (1’008 BUSCO groups), 125 fungi (1’438 BUSCO groups), and 99 eukaryotes (431 BUSCO groups) were selected from OrthoDB to make up the initial BUSCO sets. These were then filtered based on uniqueness and conservation, as described below, to produce the final BUSCO sets for each clade, representing 2’675 genes for arthropods, 3’023 for vertebrates, 843 for metazoans, 1’438 for fungi, and 429 for eukaryotes. For bacteria, 40 universal marker genes were selected from Mende, et al. (2013).

1.2. Hidden Markov models, ancestral sequences and block profiles

Hidden Markov models: For each BUSCO group, multiple sequence alignments (MSAs) were built with ClustalOmega (Sievers and Higgins, 2014) using the orthologous protein sequences of each BUSCO. The MSAs were then used to build amino acid-level hidden Markov model (HMM) profiles using HMMER 3 (Eddy, 2011). Subsequently, all BUSCO input sequences were searched (hmmsearch) against the complete library of HMM profiles to identify and remove any BUSCO groups whose members could not be reliably distinguished from each other by their profiles, and hence ensure reliable profile-delineated orthology. In total, 376, 852, and 156 groups were removed in this way from the arthropod, vertebrate, and metazoan sets, respectively, while none were removed from the fungi or eukaryote sets. The remaining, reliably-distinguishable BUSCO sets were then analysed to delineate the two parameters, ‘expected-score’ and ‘expected-length’, that define the BUSCO-specific cut-offs used to classify a match as orthologous or not and as complete or not. The ‘expected-score’ cut-off is defined as 90% of the minimum bitscore from an HMM search of all of a BUSCO group’s members against its own HMM profile (i.e. 90% of the score of the lowest-scoring match among the sequences used to build the profile).
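The ‘expected-score’ definition above amounts to a one-line calculation. The sketch below assumes the bitscores of a group's own members against its profile have already been collected into a list; the function name `expected_score_cutoff` is hypothetical and used only for illustration.

```python
from typing import Sequence


def expected_score_cutoff(member_bitscores: Sequence[float],
                          fraction: float = 0.9) -> float:
    """Return `fraction` (90% by default) of the lowest member bitscore.

    member_bitscores: bitscores of the sequences used to build a BUSCO
    group's HMM profile, each searched against that same profile.
    """
    if not member_bitscores:
        raise ValueError("at least one member bitscore is required")
    return fraction * min(member_bitscores)


# Illustrative values only, not taken from any real BUSCO group.
cutoff = expected_score_cutoff([250.0, 300.0, 410.5])
```

With these made-up scores the lowest member scores 250.0, so the cut-off is 225.0.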
To be classified as a true ortholog, any BUSCO-matching gene from the species being assessed (from its genome, transcriptome, or gene set) must score above the ‘expected-score’ cut-off. For a match to be classified as ‘complete’, it must satisfy the ‘expected-length’ cut-off, which is defined using each BUSCO group’s protein length distribution (Figure S1). Any BUSCO-matching gene from the species being assessed whose protein length falls within two standard deviations (2σ) of the BUSCO group’s mean length is classified as ‘complete’.

Consensus sequences: For each BUSCO group, an amino acid consensus sequence was generated from its respective HMM profile using HMMER’s default hmmemit settings for a majority-rule consensus sequence. During BUSCO assessments of genome assemblies, these consensus sequences are searched against the genome of the species being assessed to identify the best-matching genomic regions that may encode the corresponding BUSCO-matching gene.

Figure S1. Distribution of the percent differences between BUSCO group member proteins and the group’s mean protein length (negative = shorter than the mean, positive = longer than the mean; values of one and two standard deviations are shown with lines). Insets: spread of BUSCO group member protein lengths compared to BUSCO group mean lengths for arthropods (left) and vertebrates (right).

Block profiles: For each BUSCO group, a ‘block profile’ was built to guide automated gene predictions with Augustus (Keller, et al., 2011). Block profiles are position-specific frequency matrices that model conserved regions of multiple sequence alignments. The BUSCO group block profiles were created from their corresponding protein multiple sequence alignments using the msa2prfl script from the Augustus package.
Several highly-divergent BUSCO groups failed to produce reliable block profiles, even after processing their alignments with the Augustus preparealign script, and were therefore removed from the assessment sets: 27, 149, 51, 0 and 2 BUSCO groups were removed from the arthropod, vertebrate, metazoan, fungi and eukaryote sets, respectively.

1.3. Candidate BUSCO matches from genome assemblies

Regions in a genome likely to encode BUSCO-matching genes are identified by tBLASTn searches (Camacho, et al., 2009) with the reconstructed consensus sequences of each BUSCO. Neighbouring high-scoring segment pairs (HSPs) from the tBLASTn searches are merged if located within 50 Kbp of each other, thus defining the span of the genomic regions to be evaluated. These genomic regions are then ranked according to the total length of the consensus sequence aligned, and up to three regions are selected for the subsequent gene prediction steps. The second- and third-ranked regions must have consensus sequence alignment lengths of at least 70% of the aligned length of the top-ranking region. Selecting more than just the best candidate BUSCO match allows for the identification of duplicates of BUSCOs that are normally rarely duplicated; if numerous in the assessed genome, these could indicate erroneously assembled haplotypes. Lastly, the selected genomic regions are extended with flanking regions of 5 Kbp (small genomes) or 20 Kbp (large genomes); these are the default parameters, and users can specify their own flank-extension lengths.

1.4. Gene prediction: assessing genome assemblies and transcriptomes

The candidate BUSCO-matching regions identified in the previous step are extracted from the genome being assessed for processing by the Augustus automated gene prediction procedure.
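The candidate-region selection of Section 1.3 (merge nearby HSPs, rank by aligned consensus length, keep up to three regions, add flanks) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the pipeline's actual code: the `HSP` and `Region` classes and both function names are hypothetical, strand is ignored, the aligned lengths of merged HSPs are simply summed, and region ends are not clipped to the contig length.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HSP:
    contig: str
    start: int        # genomic start coordinate (bp)
    end: int          # genomic end coordinate (bp)
    aligned_len: int  # consensus residues aligned in this HSP


@dataclass
class Region:
    contig: str
    start: int
    end: int
    aligned_len: int  # summed over merged HSPs (a simplifying assumption)


def merge_hsps(hsps: List[HSP], max_gap: int = 50_000) -> List[Region]:
    """Merge HSPs on the same contig lying within max_gap bp of each other."""
    regions: List[Region] = []
    for hsp in sorted(hsps, key=lambda h: (h.contig, h.start)):
        last = regions[-1] if regions else None
        if last and last.contig == hsp.contig and hsp.start - last.end <= max_gap:
            last.end = max(last.end, hsp.end)
            last.aligned_len += hsp.aligned_len
        else:
            regions.append(Region(hsp.contig, hsp.start, hsp.end, hsp.aligned_len))
    return regions


def select_regions(regions: List[Region], max_regions: int = 3,
                   min_fraction: float = 0.7, flank: int = 5_000) -> List[Region]:
    """Keep up to max_regions top-ranked regions and extend them by flank bp."""
    ranked = sorted(regions, key=lambda r: r.aligned_len, reverse=True)
    if not ranked:
        return []
    top_len = ranked[0].aligned_len
    # Lower-ranked regions must reach 70% of the top region's aligned length.
    kept = [r for r in ranked[:max_regions] if r.aligned_len >= min_fraction * top_len]
    return [Region(r.contig, max(0, r.start - flank), r.end + flank, r.aligned_len)
            for r in kept]


# Made-up HSPs for one BUSCO consensus sequence.
hsps = [
    HSP("c1", 1_000, 2_000, 100),
    HSP("c1", 30_000, 31_000, 80),     # within 50 Kbp of the first HSP: merged
    HSP("c1", 200_000, 201_000, 150),  # too far away: separate region
    HSP("c2", 500, 1_500, 60),         # below 70% of the top region: dropped
]
regions = merge_hsps(hsps)
candidates = select_regions(regions)
```

The `flank` default of 5,000 bp corresponds to the small-genome setting mentioned in the text; a caller assessing a large genome would pass 20,000 instead.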
Gene prediction is performed on each candidate region using the corresponding BUSCO group’s block profile and default gene finding parameters (unless otherwise specified by the user). Successful Augustus gene prediction for each BUSCO group produces an initial BUSCO gene set whose protein sequences are then evaluated using the BUSCO-specific cut-offs to determine true orthology and completeness. High-confidence predicted BUSCO genes can then be selected from this initial gene set to train Augustus, so that the automated gene prediction procedure can be rerun with these genome-trained parameters (see Section 1.7). For assessing transcriptomes, if the transcripts have not already been pre-processed to extract protein-coding genes, then the longest open reading frame (ORF) is selected for assessment.

1.5. BUSCO match assignment

This step uses the properties of each BUSCO group’s HMM profile to determine whether a significantly matching protein sequence is likely orthologous or just homologous. Significant matches are first determined by searching the full set of protein sequences to be assessed against the complete library of BUSCO group HMM profiles using HMMER’s hmmsearch. As described above, filtering of the initial BUSCO sets ensured that each library contains only reliably-distinguishable profiles. The set of protein sequences to be assessed may be from the Augustus-predicted BUSCO gene set, a transcriptome-based gene set, or the annotated ‘Official Gene Set’ (OGS). For each hmmsearch sequence-profile alignment, two measures are computed and evaluated: the alignment bitscore and the total length of sequence aligned to the HMM profile. For a BUSCO-matching gene to be considered orthologous, the alignment bitscore must be greater than or equal to the ‘expected-score’ of the corresponding BUSCO group (see Section 1.2 for the ‘expected-score’ definition). Genes that pass the ‘expected-score’ cut-off are then evaluated for protein length completeness as described below.
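The assignment step described above can be sketched as a simple filter over search hits. The `Hit` record and the `assign_orthologs` function are hypothetical names introduced for this illustration, and the sketch assumes that hmmsearch results have already been parsed into such records; it does not call HMMER itself.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    gene_id: str
    busco_id: str
    bitscore: float
    aligned_length: int  # total length of sequence aligned to the profile


def assign_orthologs(hits: List[Hit],
                     expected_scores: Dict[str, float]) -> Dict[str, List[Hit]]:
    """Group hits by BUSCO, keeping only those at or above the cut-off."""
    accepted: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        cutoff = expected_scores.get(hit.busco_id)
        # A hit is treated as orthologous if bitscore >= 'expected-score'.
        if cutoff is not None and hit.bitscore >= cutoff:
            accepted[hit.busco_id].append(hit)
    return dict(accepted)


# Made-up hits and cut-offs for illustration.
expected_scores = {"BUSCO_A": 225.0, "BUSCO_B": 400.0}
hits = [
    Hit("gene1", "BUSCO_A", 310.2, 480),
    Hit("gene2", "BUSCO_A", 120.0, 200),  # homologous but below the cut-off
    Hit("gene3", "BUSCO_B", 400.0, 650),  # exactly at the cut-off: accepted
]
orthologs = assign_orthologs(hits, expected_scores)
```

The accepted hits keep their aligned lengths so that the subsequent length-based evaluation can be applied to them.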

1.6. Classification: Complete, Duplicated, Fragmented, Missing

The final stage of the assessment classifies each arthropod, vertebrate, metazoan, fungal, or eukaryote BUSCO as complete, duplicated, fragmented, or missing from the gene set being assessed. Classification of BUSCO-matching genes that meet the ‘expected-score’ cut-off employs the protein length distribution of each BUSCO to determine whether the ortholog is ‘Complete’ or ‘Fragmented’. Orthologs are considered to be ‘Complete’ if the length of their aligned sequence is within two standard deviations (2σ) of the BUSCO group’s mean length (i.e. 95% expectation); otherwise they are classified as ‘Fragmented’ recoveries (Figure S1). A BUSCO is classified as ‘Duplicated’ when multiple BUSCO-matching genes meet both the ‘expected-score’ and the ‘expected-length’ cut-offs, i.e. multiple copies of full-length orthologs are found in the gene set being assessed. Lastly, any BUSCO without a BUSCO-matching gene that meets the ‘expected-score’ cut-off is classified as ‘Missing’.

1.7. Training Augustus gene finding parameters

Augustus predictions using generic parameters may produce sub-optimal gene predictions. Training the prediction parameters using the most reliable gene structures obtained from the initial set of predictions can substantially improve the results. To train Augustus, BUSCO-matching genes classified as ‘Complete’ and single-copy are selected to form a high-quality training dataset. The selected gene structures are extracted and used to build GenBank files (gff2smallgb) suitable for training Augustus (etraining).
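As an illustration of the classification rules of Section 1.6 and of the selection of training genes just described, the sketch below labels one BUSCO from the lengths of its accepted orthologs and then picks the single-copy ‘Complete’ BUSCOs. Both function names are hypothetical, and the input is assumed to contain only matches that already passed the ‘expected-score’ cut-off; the sketch does not run Augustus or any of its scripts.

```python
from typing import Dict, List, Sequence


def classify_busco(ortholog_lengths: Sequence[float],
                   mean_length: float, sd_length: float) -> str:
    """Classify one BUSCO from the lengths of its accepted orthologs."""
    if not ortholog_lengths:
        return "Missing"
    # 'Complete' matches lie within two standard deviations of the mean;
    # anything outside that window is treated as a fragmented recovery.
    n_complete = sum(
        1 for length in ortholog_lengths
        if abs(length - mean_length) <= 2 * sd_length
    )
    if n_complete > 1:
        return "Duplicated"
    if n_complete == 1:
        return "Complete"
    return "Fragmented"


def select_training_buscos(classifications: Dict[str, str]) -> List[str]:
    """Return BUSCO IDs that are Complete and single-copy."""
    return sorted(b for b, status in classifications.items() if status == "Complete")


# Made-up group statistics: mean length 500, standard deviation 50.
classifications = {
    "BUSCO_A": classify_busco([480], 500, 50),       # one full-length match
    "BUSCO_B": classify_busco([480, 510], 500, 50),  # two full-length matches
    "BUSCO_C": classify_busco([300], 500, 50),       # too short
    "BUSCO_D": classify_busco([], 500, 50),          # no accepted match
}
training_set = select_training_buscos(classifications)
```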
This procedure results in the creation of genome-specific gene finding parameters. For the vast majority of genomes evaluated, these genome-specific parameters substantially increase the sensitivity and specificity of Augustus predictions compared to ‘generic’ gene finding parameters, at both the gene and exon levels. A second round of Augustus gene prediction is then performed using these genome-specific parameters on all BUSCO-matching candidate regions where initial predictions failed or did not yield a ‘Complete’ ortholog. Orthology assessment, protein length evaluations, and final classifications are then performed as outlined above to produce the final BUSCO assessment results.

Augustus allows for the possibility of further sensitivity and specificity gains by applying multiple rounds of metaparameter optimisation performed using OptimizeAugustus. However, this extra optimisation step generally more than doubles the run-time of a typical genome assembly assessment, without large improvements in assessment sensitivity. Thus, for default genome assembly assessments, this extra optimisation step is not performed unless specified by the user (--long mode). This option is made available to users because, although the improvements from this extra optimisation step are minimal for the purposes of assembly assessments, they can prove valuable when using BUSCO sets to train gene predictors for subsequent use as part of multi-evidence-based whole genome annotation pipelines.

2. BUSCO completeness versus N50 contiguity

BUSCO assessment of genome assembly completeness is designed to provide a more detailed quantification of assembly quality than traditional measures such as scaffold N50 metrics of assembly contiguity.
Comparing BUSCO completeness with N50 contiguity for a selection of genomes ranging from fragmented draft assemblies to chromosome-level genome assemblies reveals a low correlation (r = 0.149) between these measures (Figure S2). Thus, even fragmented assemblies with relatively low N50 values can encode fairly complete gene sets, and some assemblies that appear to be of good quality based on contiguity measures are not necessarily more complete in terms of expected gene content. Additionally, when assessing gene sets, it is clear that species with very high gene counts are not necessarily the most complete, nor are those with rather low gene counts necessarily incomplete (Waterhouse, 2015). For a typical eukaryotic draft assembly, BUSCO assessments suggest that assemblies with N50 values on the order of 50 Kbp are capable of yielding fairly complete gene sets.

Figure S2. BUSCO completeness versus N50 contiguity. Nine outliers with N50 values above 10’000 Kbp are not shown, each of which achieves more than 90% BUSCO completeness.

3. BUSCO versus CEGMA assessment of genome assembly completeness

The Core Eukaryotic Genes Mapping Approach (CEGMA) is a widely-used method to assess genome assembly completeness in terms of gene content (Parra, et al., 2007; Parra, et al., 2009), but it does not provide a means for directly assessing gene sets. CEGMA employs a set of 248 conserved Core Eukaryotic Genes (CEGs) expected to be present in any newly sequenced eukaryotic genome. The CEGs are derived from eukaryotic KOGs (Tatusov, et al., 2003) and are composed of orthologous protein sequences from six eukaryotic species (human, fruit fly, roundworm, thale cress, fission yeast and baker’s yeast), for which a corresponding HMM profile is built from their multiple sequence alignments.
In order to perform a like-for-like comparison of the CEGMA and BUSCO genome assembly and gene set assessments, a subset of 250 of the 429 eukaryote BUSCOs was selected, namely those with the lowest variation in their ‘expected-score’ and ‘expected-length’ parameters. As the CEGMA pipeline does not perform gene set assessments, an analysis pipeline was built to use the CEGMA HMM profiles instead of the BUSCO HMM profiles. In addition, the pipeline employed the cut-offs that CEGMA uses to determine the presence/absence (from the provided ‘cutoff_file’ with the cut-offs for CEGMA HMMs) and complete/partial (complete, >70% CEG length) status of potentially orthologous matches. Thus, BUSCO assessments of genome assemblies and gene sets were performed with normal default options except for substituting the full eukaryote BUSCO set with a subset of only 250 in order to match the number of CEGMA CEGs. The CEGMA assessments of genome assemblies were performed with normal default options, and CEGMA assessments of gene sets were enabled by building a pipeline to use CEGMA HMM profiles and cut-offs. The results for the assessments of 40 species are shown in Figure 2 of the main text. They reveal generally consistent BUSCO assessments across highly divergent lineages from fungi to human, with somewhat less consistent results from the CEGMA assessments (the BUSCO linear regression follows the diagonal more closely than that of CEGMA).

Linear regressions of each set, adjusted $R^2$:
BUSCO: $R^2 = 0.718$
CEGMA: $R^2 = 0.413$

$R^2 = SSR / SST$, where $SSR = \sum_i (\hat{y}_i - \bar{y})^2$ and $SST = \sum_i (y_i - \bar{y})^2$; $y_i$ is the $i$th observed value, $\hat{y}_i$ is the $i$th expected value from the best-fit line, and $\bar{y}$ is the mean of $y$.

To evaluate against the diagonal ($x = y$) instead of the best-fit line, the expected value ($\hat{y}_i$) simply becomes the x value ($x_i$), and there is no intercept term (i.e. the line passes through $x = y = 0$), so:

$R^2_{(x=y)} = 1 - (SSE / SST)$, where $SSE = \sum_i (y_i - \hat{y}_i)^2$.
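The diagonal-based statistic defined above can be computed directly from paired values. The following Python sketch is only an illustration of the formula; the function name `r_squared_vs_diagonal` is hypothetical and the example data are made up, not taken from the assessments.

```python
from typing import Sequence


def r_squared_vs_diagonal(x: Sequence[float], y: Sequence[float]) -> float:
    """Return 1 - SSE/SST, using the diagonal x = y as the expected values."""
    if len(x) != len(y) or not y:
        raise ValueError("x and y must be non-empty and of equal length")
    mean_y = sum(y) / len(y)
    # SSE: squared deviations of the observations from the diagonal (y_hat = x).
    sse = sum((yi - xi) ** 2 for xi, yi in zip(x, y))
    # SST: squared deviations of the observations from their mean.
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst


# Points lying exactly on the diagonal give a value of 1.
perfect = r_squared_vs_diagonal([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
# Points far from the diagonal can give a negative value (here SSE = 8, SST = 2).
poor = r_squared_vs_diagonal([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Unlike the best-fit $R^2$, this quantity is not bounded below by zero: it becomes negative whenever SSE exceeds SST, meaning the diagonal describes the data worse than a horizontal line through the mean of $y$ would.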
BUSCO: $R^2_{(x=y)} = 1 - (1281.6 / 3440.5) = 0.63$
CEGMA: $R^2_{(x=y)} = 1 - (5944.3 / 1936.3) = -2.07$

4. BUSCO assessments of genomes, transcriptomes, and gene sets

The BUSCO assessment pipeline was applied to 70 available genome assemblies and their corresponding official gene sets, as well as to 93 additional gene sets and 96 transcriptomes. The detailed results are shown in Table S1 in C[D],F,M,n BUSCO notation. The evaluated genome assemblies include both high-quality reference genomes (e.g. Homo sapiens) and de novo assemblies of non-model organisms, sampling a wide range of fold-coverage levels, N50 sizes, sequencing technologies, and assembly strategies. These genomes represent the four major BUSCO lineages with 41 arthropods from 13 different orders, 3 vertebrates from 3 different orders, 11 basal metazoans, and 15 fungal species from 12 different orders. The gene sets chosen for these assessments comprise 41 arthropods, 26 vertebrates, 11 basal metazoans and 15 fungal species. A total of 96 transcriptomes were also evaluated; sequences were typically derived from mRNA extracted from different tissue types. The transcriptomes analysed cover a total of 11 fungal species (14 transcriptomes), 39 arthropods (44 transcriptomes), 18 vertebrates (28 transcriptomes) and 10 basal metazoans (13 transcriptomes). Duplications [D] were not assessed (n.a.) for unfiltered gene sets or transcriptomes that contained multiple transcripts of the same gene, as this would lead to overestimates of BUSCO duplications.

Table S1. Current assessment completeness metrics in BUSCO notation (C:complete [D:duplicated], F:fragmented, M:missing, n:genes) sampling different types of data and a variety of eukaryotic species.
Lineage | Species | Sample type | Identifier | N50 (Kbp) | BUSCO assessment
Vertebrates | Homo sapiens | Genome | GCA_000001405.15 | 67,794 | C:89% [D:1.5%], F:6.0%, M:4.5%, n:3023
Vertebrates | Homo sapiens | Gene set | GRCh37.75 | - | C:99% [D:1.7%], F:0.0%, M:0.0%, n:3023
Vertebrates | Mus musculus | Genome | GCA_000001635.4 | 52,589 | C:78% [D:3.0%], F:19%, M:2.5%, n:3023
Vertebrates | Mus musculus | Gene set | GRCm38.75 | - | C:99% [D:2.5%], F:99%, M:0.1%, n:3023
Vertebrates | Ornithorhynchus anatinus | Genome | GCF_000002275.2 | 991 | C:55% [D:0.8%], F:25%, M:18%, n:3023
Vertebrates | Ornithorhynchus anatinus | Gene set | OANA5.75 | - | C:72% [D:1.1%], F:19%, M:8.2%, n:3023
Vertebrates | Callithrix jacchus | Gene set | C_jacchus3.2.1.75 | - | C:97% [D:2.9%], F:1.7%, M:0.8%, n:3023
Vertebrates | Callithrix jacchus | Transcriptome | GI:532219616 Bladder | - | C:76% [D:17%], F:5.5%, M:18%, n:3023
Vertebrates | Callithrix jacchus | Transcriptome | GI:532292355 hippocampus | - | C:79% [D:18%], F:4.5%, M:15%, n:3023
Vertebrates | Callithrix jacchus | Transcriptome | GI:532349506 Cortex | - | C:34% [D:7.6%], F:34%, M:64%, n:3023
Vertebrates | Callithrix jacchus | Transcriptome | GI:532452938 S. muscle | - | C:69% [D:13%], F:6.0%, M:24%, n:3023
Vertebrates | Callithrix jacchus | Transcriptome | GI:532524775 Cerebellum | - | C:76% [D:19%], F:5.1%, M:18%, n:3023
Vertebrates | Pan troglodytes | Gene set | CHIMP2.14.75 | - | C:96% [D:0.5%], F:1.2%, M:1.9%, n:3023
Vertebrates | Pan troglodytes | Transcriptome | GI:410228237 adipose SC | - | C:75% [D:15%], F:3.8%, M:20%, n:3023
Vertebrates | Pan troglodytes | Transcriptome | GI:410308999 Fibroblast | - | C:75% [D:16%], F:3.7%, M:21%, n:3023
Simão, Waterhouse, et al.
2015, Supplementary Online Material: Page 7 of 13 Lineage Species Anolis carolinensis Latimeria chalmnae Rana clamitans Pseudoacris regilla Salmo salar Oreochromis niloticus Ameiurus nebulosus Ursus maritimus Tripterygion delaisi Atractaspis aterrima Latimeria menadoensis Hynobius chinensis Carduelis chloris Maylandia zebra Chinchilla lanigera Ailuropoda melanoleuca Bos taurus Danio rerio Felis catus Ficedula albicollis Gallus gallus Gorilla gorilla Loxodonta africana Macaca mulatta Monodelphis domestica Mustela putorius Oreochromis niloticus Oryctolagus cuniculus Oryzias latipes Pongo abelii Sus scrofa Taeniopygia guttata Takifugu rubripes Xenopus tropicalis Xiphophorus maculatus Acromyrmex echinatior Acyrtosiphon pisum Aedes aegypti Anopheles gambiae Arthropods Apis mellifera Atta cephalotes Bombyx mori Camponotus floridanus Danaus plexippus Daphnia pulex Dendroctonus ponderosa Drosophila anannasse Sample type Transcriptome Gene set Transcriptome Transcriptome Transcriptome Transcriptome Transcriptome Transcriptome Transcriptome Transcriptome Transcriptome Transcriptome Transcriptome Transcriptome Transcriptome Transcriptome Transcriptome Transcriptome Transcriptome Gene set Gene set Gene set Gene set Gene set Gene set Gene set Gene set Gene set Gene set Gene set Gene set Gene set Gene set Gene set Gene set Gene set Gene set Gene set Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Identifier GI:410268357 Endothelium AnoCar2.0.75 GI:614142443 Skeletal GI:464801713 Whole GI:387756559 Muscle GI:451274083 Unknown GI:451272305 Unknown GI:666988260 Mixed GI:555682626 Spleen GI:472819489 Unknown GI:510063642 Fat GI:572723144 Brain GI:673456880 Venom GI:673404158 Venom GI:559559797 Testis GI:570932341 Unknown GI:617996660 Blood GI:614241491 Kidney GI:618625375 Trachea ailMel1.75 UMD3.175 Zv9.75 Felis_catus_6.2.75 FicAlb_1.4.75 
Galga4.75 gorGor3.1.75 loxAfr3.75 MMUL_1.75 BROADO5.75 MusPutFur1.0.75 Orenil1.0.75 OryCun2.0.75 MEDAKA1.75 PPYG2.75 Sscrofa10.2.75 taeGut3.2.4.75 FUGU4.75 JGI_4.2.75 Xipmac4.4.2.75 Aech_2.0 Aech_OGS_v3.8 GCA_000142985.2 GCA_000142985.2.22 AaegL3 AaegL3.2 AgamP4 AgamP4.2 Amel_v4.5 Amel_OGS_v3.2 Acep 1.0 Acep OGS v1.2 GCA_000151625.1 GLEAN set Cflor_v3.3 Cflor_OGS_v3.3 DanPle_1.0.22 DanPle_1.0.22 GCA_000187875.1 GCA_000187875.1.22 GCA_000355655.2 GCA_000355655.2.22 Dana_r1.3 N50 (Kbp) 1,110 86 1,547 49,364 997 5,154 4,008 451 52 642 628 4,599 BUSCOs assessment C:75% [D:15%], F:3.5%, M:21%, n:3023 C:89% [D:2.6%], F:6.8%, M:3.4%, n:3023 C:58% [D:14%], F:8.7%, M:32%, n:3023 C:27% [D:15%], F:18%, M:53%, n:3023 C:37% [D:6.9%], F:11%, M:50%, n:3023 C:21% [D:0.3%], F:13%, M:65%, n:3023 C:20% [D:0.4%], F:16%, M:63%, n:3023 C:19% [D:7.8%], F:6.6%, M:74%, n:3023 C:39% [D:0.4%], F:16%, M:44%, n:3023 C:7.3% [D:0.2%], F:10%, M:82%, n:3023 C:50% [D:29%], F:5.5%, M:44%, n:3023 C:35% [D:13%], F:17%, M:47%, n:3023 C:0.7% [D:0.0%], F:1.0%, M:98%, n:3023 C:4.4% [D:0.5%], F:6.8%, M:88%, n:3023 C:71% [D:15%], F:6.5%, M:22%, n:3023 C:59% [D:7.3%], F:13%, M:26%, n:3023 C:31% [D:0.2%], F:12%, M:55%, n:3023 C:64% [D:15%], F:8.7%, M:26%, n:3023 C:80% [D:44%], F:5.7%, M:14%, n:3023 C:97% [D:1.3%], F:1.8%, M:0.3%, n:3023 C:97% [D:1.3%], F:1.6%, M:0.5%, n:3023 C:95% [D:8.3%], F:3.2%, M:1.7%, n:3023 C:96% [D:1.2%], F:2.8%, M:0.5%, n:3023 C:88% [D:2.0%], F:4.1%, M:7.8%, n:3023 C:90% [D:2.4%], F:3.5%, M:6.0%, n:3023 C:96% [D:2.6%], F:1.7%, M:2.1%, n:3023 C:96% [D:1.5%], F:2.3%, M:1.0%, n:3023 C:94% [D:2.0%], F:4.5%, M:0.9%, n:3023 C:95% [D:4.0%], F:2.3%, M:1.6%, n:3023 C:97% [D:1.4%], F:1.7%, M:1.0%, n:3023 C:96% [D:5.1%], F:1.4%, M:2.5%, n:3023 C:93% [D:2.7%], F:3.0%, M:3.2%, n:3023 C:83% [D:3.2%], F:5.4%, M:11%, n:3023 C:95% [D:1.1%], F:3.3%, M:1.1%, n:3023 C:83% [D:7.4%], F:6.8%, M:10%, n:3023 C:81% [D:3.2%], F:7.5%, M:11%, n:3023 C:89% [D:5.2%], F:3.5%, M:7.3%, n:3023 C:93% [D:3.4%], F:3.5%, 
M:2.5%, n:3023 C:93% [D:3.6%], F:4.7%, M:1.3%, n:3023 C:91% [D:2.6%], F:8.0%, M:0.6%, n:2675 C:96% [D:8.8%], F:2.8%, M:0.5%, n:2675 C:72% [D:6.1%], F:15%, M:12%, n:2675 C:89% [D:14%], F:4.1%, M:5.9%, n:2675 C:86% [D:13%], F:10%, M:3.2%, n:2675 C:93% [D:17%], F:3.6%, M:3.0%, n:2675 C:93% [D:4.7%], F:4.1%, M:2.5%, n:2675 C:97% [D:10%], F:1.4%, M:0.8%, n:2675 C:93% [D:2.9%], F:5.1%, M:0.9%, n:2675 C:97% [D:9%], F:2.1%, M:0.1%, n:2675 C:89% [D:2.6%], F:8.7%, M:1.3%, n:2675 C:91% [D:7.7%], F:7.5%, M:0.5%, n:2675 C:73% [D:2.2%], F:17%, M:8.3%, n:2675 C:75% [D:7.0%], F:14%, M:10%, n:2675 C:92% [D:3.1%], F:6.6%, M:0.5%, n:2675 C:95% [D:8.7%], F:3.9%, M:0.4%, n:2675 C:83% [D:8.6%], F:11%, M:4.3%, n:2675 C:86% [D:9.0%], F:9.5%, M:3.7%, n:2675 C:83% [D:3.9%], F:11%, M:5.1%, n:2675 C:84% [D:10%], F:11%, M:4.0%, n:2675 C:77% [D:6.1%], F:15%, M:7.2%, n:2675 C:82% [D:11%], F:10%, M:6.6%, n:2675 C:96% [D:3.7%], F:1.9%, M:1.9%, n:2675 Simão, Waterhouse, et al. 2015, Supplementary Online Material: Page 8 of 13 Lineage Species Drosophila erecta Drosophila grimshawi Drosophila melanogaster Drosophila mojavensis Drosophila persimilis Drosophila pseudobscura Drosophila sechelia Drosophila simulans Drosophila virilis Drosophila willistoni Drosophila yakuba Harpegnathos saltator Heliconius melpomene Ixodes scapularis Linepithema humile Lutzomyia longipalpis Manduca sexta Megaselia scalaris Metaseiulus occidentalis Musca domestica Nasonia vitripennis Pediculus humanus Phlebotomus papatasi Pogonomyrmex barbatus Solenopsis invicta Rhodnius prolixus Strigamia maritima Tetranychus urticae Tribolium castaneum Acanthoscurria geniculata Anopheles sinensis Anthonomus grandis Sample type Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set 
Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Genome Gene set Transcriptome Transcriptome Transcriptome Identifier Dana_r1.3 Dere_r1.3 Dere_r1.3 Dgri_r1.3 Dgri_r1.3 Dmel_r5.55 Dmel_r5.55 Dmoj_r1.3 Dmoj_r1.3 Dper_r1.3 Dper_r1.3 Dpse_r3.1 Dpse_r3.1 Dsec_r1.3 Dsec_r1.3 Dsim_r1.4 Dsim_r1.4 Dvir_r1.2 Dvir_r1.2 Dwil_r1.3 Dwil_r1.3 Dyak_r1.3 Dyak_r1.3 Hsal_v3.3 Hsal_OGS_v3.3 Hmel_v1.22 Hmel_v1.22 IscaW1 IscaW1.3 Lhum_v1.0 Lhum_OGS_v1.2 Llonj1.1 Llonj1.1 GCA_000262585.1 OGS2_20140407 Mscal_v1.22 Mscal_v1.22 Mocc_1.0 Mocc_1.0 v2.0.2 v2.0.2 Nvit_v1.0 Nvit_OGS_v1.2 PhumU2 PhumU2.1 Ppapi1.1 Ppapi1.1 Pbar_v1.0 Pbar_OGS_v1.2 Sinv_v1.0 Sinv_OGS_v2.2.3 RproC1 RprocC1.2 Smar1.22 GCA_000239435.1.22 GCA_000239435.1 GCA_000239435.1.22 Tcas3.22 Tcas_OGS_v2 GI:598795695 whole GI:656597267 unknown GI:562777735 whole N50 (Kbp) 18,748 8,399 23,011 24,764 1,869 12,541 2,123 857 10,161 4,511 21,770 601 194 76 1,402 85 664 1 896 226 698 497 0.87 819 558 847 139 2,993 19,135 BUSCOs assessment C:98% [D:9.6%], F:0.8%, M:0.1%, n:2675 C:98% [D:4.7%], F:1.4%, M:0.4%, n:2675 C:99% [D:9.3%], F:0.2%, M:0.1%, n:2675 C:97% [D:6.2%], F:2.2%, M:0.4%, n:2675 C:99% [D:11%], F:0.4%, M:0.0%, n:2675 C:98% [D:6.4%], F:0.6%, M:0.3%, n:2675 C:99% [D:9.1%], F:0.2%, M:0.0%, n:2675 C:97% [D:4.4%], F:2.2%, M:0.4%, n:2675 C:99% [D:9.6%], F:0.8%, M:0.1%, n:2675 C:93% [D:5.6%], F:5.8%, M:0.8%, n:2675 C:93% [D:9.3%], F:5.6%, M:0.7%, n:2675 C:96% [D:6.3%], F:2.2%, M:1.1%, n:2675 C:98% [D:11%], F:0.6%, M:0.6%, n:2675 C:96% [D:5.1%], F:2.8%, M:0.7%, n:2675 C:96% [D:8.9%], F:3.0%, M:0.3%, n:2675 C:85% [D:4.6%], F:9.0%, M:5.0%, n:2675 C:84% [D:7.6%], F:6.9%, M:8.0%, n:2675 C:96% [D:5.2%], F:2.4%, M:0.6%, n:2675 C:99% [D:9.6%], F:0.7%, M:0.1%, n:2675 C:97% [D:5.5%], F:1.7%, M:0.4%, n:2675 C:99% [D:10%], F:0.6%, M:0.2%, n:2675 C:97% [D:6.5%], F:1.5%, M:0.7%, n:2675 C:98% [D:10%], F:0.8%, M:0.2%, n:2675 C:89% [D:3.2%], F:9.6%, 
M:1.1%, n:2675
C:95% [D:9.0%], F:3.8%, M:0.7%, n:2675
C:77% [D:2.0%], F:11%, M:10%, n:2675
C:74% [D:6.7%], F:14%, M:11%, n:2675
C:58% [D:1.7%], F:21%, M:19%, n:2675
C:69% [D:6.6%], F:23%, M:7.1%, n:2675
C:92% [D:3.3%], F:7.0%, M:0.6%, n:2675
C:95% [D:8.8%], F:4.0%, M:0.1%, n:2675
C:73% [D:6.3%], F:10%, M:16%, n:2675
C:66% [D:9.7%], F:13%, M:20%, n:2675
C:81% [D:4.4%], F:12%, M:6.1%, n:2675
C:80% [D:10%], F:10%, M:8.2%, n:2675
C:16% [D:0.6%], F:21%, M:61%, n:2675
C:21% [D:1.4%], F:20%, M:58%, n:2675
C:76% [D:4.9%], F:12%, M:10%, n:2675
C:82% [D:14%], F:10%, M:6.5%, n:2675
C:91% [D:4.3%], F:5.3%, M:2.7%, n:2675
C:97% [D:29%], F:2.3%, M:0.5%, n:2675
C:91% [D:6.0%], F:5.1%, M:3.2%, n:2675
C:94% [D:10%], F:4.0%, M:1.1%, n:2675
C:92% [D:3.9%], F:6.1%, M:1.6%, n:2675
C:93% [D:9.1%], F:4.9%, M:1.3%, n:2675
C:33% [D:3.2%], F:33%, M:33%, n:2675
C:54% [D:6.1%], F:20%, M:25%, n:2675
C:90% [D:2.9%], F:8.5%, M:0.7%, n:2675
C:93% [D:8.2%], F:6.5%, M:0.3%, n:2675
C:74% [D:2.4%], F:19%, M:6.3%, n:2675
C:80% [D:6.5%], F:14%, M:5.4%, n:2675
C:85% [D:2.5%], F:12%, M:2.5%, n:2675
C:74% [D:8.3%], F:9.1%, M:16%, n:2675
C:84% [D:5.9%], F:12%, M:3.2%, n:2675
C:87% [D:12%], F:8.3%, M:4.6%, n:2675
C:61% [D:4.5%], F:12%, M:25%, n:2675
C:69% [D:11%], F:9.6%, M:20%, n:2675
C:95% [D:5.8%], F:3.9%, M:0.8%, n:2675
C:95% [D:10%], F:3.0%, M:1.3%, n:2675
C:65% [D:n.a.], F:13%, M:20%, n:2675
C:36% [D:n.a.], F:22%, M:41%, n:2675
C:18% [D:n.a.], F:16%, M:65%, n:2675

Species (transcriptomes): Bactrocera dorsalis; Belgica antarctica; Calanus finmarchicus; Ceratitis capitata; Cherax quadricarinatus; Corydalinae sp.; Delia antiqua; Dendroctonus frontalis; Drosophila ercepeae; Drosophila malerkotliana m.; Drosophila malerkotliana p.; Drosophila merina; Drosophila miranda; Drosophila pseudoananassae n.; Drosophila pseudoananassae p.; Drosophila serrata; Echinogammarus veneris; Enallagma hageni; Folsomia candida; Hyalella azteca; Ips typographus; Ixodes scapularis; Ixodes ricinus; Latrodectus hesperus; Melita plumosa; Mengenilla moldrzyki; Musca domestica; Nannochorista philpotti; Nilaparvata lugens; Orchesella cincta; Polistes canadensis; Pontastacus leptodactylus; Priacma serrata; Spodoptera exigua; Stegodyphus mimosarum; Teleopsis dalmanni; Teleopsis whitei; Themira biloba

| Sample type | Identifier | BUSCOs assessment |
| --- | --- | --- |
| Transcriptome | GI:618068638 unknown | C:87% [D:n.a.], F:5.9%, M:6.4%, n:2675 |
| Transcriptome | GI:418280542 whole | C:79% [D:n.a.], F:10%, M:9.8%, n:2675 |
| Transcriptome | GI:592958556 unknown | C:84% [D:n.a.], F:7.3%, M:8.5%, n:2675 |
| Transcriptome | GI:647215886 unknown | C:78% [D:n.a.], F:11%, M:10%, n:2675 |
| Transcriptome | GI:577749858 whole | C:87% [D:n.a.], F:7.3%, M:5.6%, n:2675 |
| Transcriptome | GI:512174511 hypodermis | C:7.8% [D:n.a.], F:7.6%, M:84%, n:2675 |
| Transcriptome | GI:661070030 whole | C:14% [D:n.a.], F:20%, M:64%, n:2675 |
| Transcriptome | GI:604701913 whole | C:55% [D:n.a.], F:15%, M:28%, n:2675 |
| Transcriptome | GI:452943093 whole | C:56% [D:n.a.], F:22%, M:21%, n:2675 |
| Transcriptome | GI:570540147 unknown | C:18% [D:n.a.], F:16%, M:65%, n:2675 |
| Transcriptome | GI:570549742 unknown | C:19% [D:n.a.], F:16%, M:64%, n:2675 |
| Transcriptome | GI:570523813 unknown | C:29% [D:n.a.], F:24%, M:45%, n:2675 |
| Transcriptome | GI:570504412 unknown | C:25% [D:n.a.], F:20%, M:53%, n:2675 |
| Transcriptome | GI:645592147 unknown | C:91% [D:n.a.], F:4.2%, M:4.0%, n:2675 |
| Transcriptome | GI:570451470 unknown | C:6.2% [D:n.a.], F:21%, M:72%, n:2675 |
| Transcriptome | GI:570485056 whole | C:8.5% [D:n.a.], F:21%, M:70%, n:2675 |
| Transcriptome | GI:480512000 unknown | C:40% [D:n.a.], F:22%, M:36%, n:2675 |
| Transcriptome | GI:595402945 unknown | C:20% [D:n.a.], F:8.0%, M:71%, n:2675 |
| Transcriptome | GI:459275420 total | C:6.9% [D:n.a.], F:7.6%, M:85%, n:2675 |
| Transcriptome | GI:570625125 unknown | C:47% [D:n.a.], F:14%, M:38%, n:2675 |
| Transcriptome | GI:510074665 unknown | C:5.9% [D:n.a.], F:3.8%, M:90%, n:2675 |
| Transcriptome | GI:510092454 unknown | C:6.6% [D:n.a.], F:5.4%, M:87%, n:2675 |
| Transcriptome | GI:459277393 antenna | C:19% [D:n.a.], F:20%, M:59%, n:2675 |
| Transcriptome | GI:604952323 Synganglion | C:27% [D:n.a.], F:26%, M:46%, n:2675 |
| Transcriptome | GI:556088131 salivary | C:77% [D:n.a.], F:8.4%, M:13%, n:2675 |
| Transcriptome | GI:618730332 unknown | C:82% [D:n.a.], F:8.4%, M:9.3%, n:2675 |
| Transcriptome | GI:510208131 whole | C:6.4% [D:n.a.], F:6.3%, M:87%, n:2675 |
| Transcriptome | GI:660742704 whole | C:9.5% [D:n.a.], F:13%, M:76%, n:2675 |
| Transcriptome | GI:604923024 unknown | C:64% [D:n.a.], F:19%, M:15%, n:2675 |
| Transcriptome | GI:661012745 whole | C:31% [D:n.a.], F:31%, M:37%, n:2675 |
| Transcriptome | GI:672467144 salivary | C:74% [D:n.a.], F:12%, M:12%, n:2675 |
| Transcriptome | GI:570587022 unknown | C:44% [D:n.a.], F:11%, M:44%, n:2675 |
| Transcriptome | GI:452055806 multiple | C:51% [D:n.a.], F:22%, M:26%, n:2675 |
| Transcriptome | GI:556694752 hypodermis | C:73% [D:n.a.], F:11%, M:14%, n:2675 |
| Transcriptome | GI:557011125 hepatopancreas | C:44% [D:n.a.], F:12%, M:43%, n:2675 |
| Transcriptome | GI:661240973 Unknown | C:11% [D:n.a.], F:16%, M:72%, n:2675 |
| Transcriptome | GI:548816146 unknown | C:29% [D:n.a.], F:14%, M:55%, n:2675 |
| Transcriptome | GI:598904898 whole | C:14% [D:n.a.], F:16%, M:68%, n:2675 |
| Transcriptome | GI:615270444 whole | C:92% [D:n.a.], F:6.0%, M:1.6%, n:2675 |
| Transcriptome | GI:619803922 whole | C:90% [D:n.a.], F:4.6%, M:5.3%, n:2675 |
| Transcriptome | GI:654236640 wildtype | C:71% [D:n.a.], F:16%, M:11%, n:2675 |

Lineage: Other metazoans

| Species | Sample type | Identifier | N50 (Kbp) | BUSCOs assessment |
| --- | --- | --- | --- | --- |
| Brugia malayi | Genome | GCA_000002995.3 | 37 | C:60% [D:1.5%], F:13%, M:25%, n:843 |
| Brugia malayi | Gene set | B_malayi_3.0.22 | | C:77% [D:9.7%], F:5.1%, M:17%, n:843 |
| Caenorhabditis briggsae | Genome | CB4 | 17,512 | C:76% [D:2.9%], F:7.5%, M:16%, n:843 |
| Caenorhabditis briggsae | Gene set | CB4.22 | | C:85% [D:11%], F:3.5%, M:11%, n:843 |
| Caenorhabditis elegans | Genome | GCA_000002985.3 | 17,494 | C:85% [D:6.9%], F:2.8%, M:11%, n:843 |
| Caenorhabditis elegans | Gene set | WBcel235.22 | | C:90% [D:11%], F:1.7%, M:7.5%, n:843 |
| Caenorhabditis japonica | Genome | GCA_000147155.1 | 94 | C:63% [D:4.8%], F:13%, M:22%, n:843 |
| Caenorhabditis japonica | Gene set | C_japonica-7.0.1.22 | | C:67% [D:9.4%], F:11%, M:20%, n:843 |
| Helobdella robusta | Genome | GCA_000326865.1 | 3,060 | C:74% [D:3.4%], F:10%, M:14%, n:843 |
| Helobdella robusta | Gene set | GCA_000326865.1.22 | | C:85% [D:12%], F:9.9%, M:4.2%, n:843 |
| Loa loa | Genome | GCA_00018385.2 | 174 | C:80% [D:6.6%], F:2.4%, M:17%, n:843 |
| Loa loa | Gene set | Loa_loa_v3.22 | | C:81% [D:8.5%], F:4.5%, M:14%, n:843 |
| Lottia gigantea | Genome | GCA_00032785.1 | 1,870 | C:89% [D:2.3%], F:4.3%, M:5.8%, n:843 |
| Lottia gigantea | Gene set | GCA_00032785.1.22 | | C:90% [D:13%], F:7.8%, M:2.1%, n:843 |
| Nematostella vectensis | Genome | GCA_000209225.1 | 472 | C:78% [D:3.5%], F:10%, M:10%, n:843 |
| Nematostella vectensis | Gene set | GCA_000209225.1.22 | | C:83% [D:15%], F:14%, M:2.8%, n:843 |
| Schistosoma mansoni | Genome | GCA_000237925.2 | 34,464 | C:56% [D:4.3%], F:8.3%, M:34%, n:843 |
| Schistosoma mansoni | Gene set | ASM2379v2.22 | | C:65% [D:7.8%], F:8.3%, M:26%, n:843 |
| Strongylocentrotus purpuratus | Genome | GCA_000002235.2 | 167 | C:87% [D:6.5%], F:7.8%, M:4.9%, n:843 |
| Strongylocentrotus purpuratus | Gene set | GCA_000002235.2.22 | | C:83% [D:19%], F:15%, M:0.7%, n:843 |
| Trichoplax adhaerens | Genome | GCA_000150275.1 | 5,978 | C:81% [D:1.1%], F:7.8%, M:10%, n:843 |
| Trichoplax adhaerens | Gene set | ASM1507v1.22 | | C:85% [D:11%], F:12%, M:2.3%, n:843 |

Species (transcriptomes): Ancylostoma ceylanicum; Aplysia californica; Apostichopus japonicus; Asterias amurensis; Bithynia siamensis goniomphalos; Evechinus chloroticus; Henricia sp. AR-2014; Patiria miniata; Patiria pectinifera; Procotyla fluviatilis

| Sample type | Identifier | BUSCOs assessment |
| --- | --- | --- |
| Transcriptome | GI:595744344 Unknown | C:16% [D:n.a.], F:38%, M:44%, n:843 |
| Transcriptome | GI:613602134 chemokine | C:88% [D:n.a.], F:8.1%, M:2.8%, n:843 |
| Transcriptome | GI:614063388 Gills | C:88% [D:n.a.], F:8.4%, M:3.5%, n:843 |
| Transcriptome | GI:606015213 Heart | C:77% [D:n.a.], F:12%, M:9.3%, n:843 |
| Transcriptome | GI:594457164 Salivary | C:41% [D:n.a.], F:23%, M:34%, n:843 |
| Transcriptome | GI:638469663 Unknown | C:68% [D:n.a.], F:24%, M:6.9%, n:843 |
| Transcriptome | GI:638532954 Unknown | C:59% [D:n.a.], F:28%, M:11%, n:843 |
| Transcriptome | GI:480970007 Unknown | C:57% [D:n.a.], F:24%, M:17%, n:843 |
| Transcriptome | GI:559461775 Unknown | C:92% [D:n.a.], F:5.3%, M:2.6%, n:843 |
| Transcriptome | GI:638872012 Unknown | C:90% [D:n.a.], F:7.9%, M:1.1%, n:843 |
| Transcriptome | GI:638728087 Ovary | C:88% [D:n.a.], F:10%, M:1.1%, n:843 |
| Transcriptome | GI:638651248 Unknown | C:80% [D:n.a.], F:18%, M:1.6%, n:843 |
| Transcriptome | GI:528026207 Unknown | C:54% [D:n.a.], F:18%, M:26%, n:843 |

Lineage: Fungi

| Species | Sample type | Identifier | N50 (Kbp) | BUSCOs assessment |
| --- | --- | --- | --- | --- |
| Ashbya gossypii | Genome | GCA_000091025.4 | 1,476 | C:95% [D:4.5%], F:1.8%, M:2.9%, n:1438 |
| Ashbya gossypii | Gene set | | | C:95% [D:7.3%], F:3.8%, M:0.9%, n:1438 |
| Aspergillus nidulans | Genome | GCA_000011425.1 | 3,704 | C:98% [D:1.8%], F:0.9%, M:0.2%, n:1438 |
| Aspergillus nidulans | Gene set | | | C:95% [D:11%], F:2.8%, M:1.8%, n:1438 |
| Cryptococcus neoformans | Genome | GCA_000091045.1 | 1,438 | C:92% [D:5.4%], F:2.5%, M:4.8%, n:1438 |
| Cryptococcus neoformans | Gene set | | | C:90% [D:7.1%], F:5.9%, M:3.1%, n:1438 |
| Gibberella zeae | Genome | GCA_000240135.2 | 5,350 | C:98% [D:1.3%], F:1.3%, M:0.2%, n:1384 |
| Gibberella zeae | Gene set | | | C:97% [D:11%], F:2.0%, M:0.2%, n:1384 |
| Komagataella pastoris | Genome | GCA_000027005.1 | 2,394 | C:93% [D:5.0%], F:4.5%, M:2.0%, n:1438 |
| Komagataella pastoris | Gene set | | | C:93% [D:8.5%], F:3.8%, M:2.7%, n:1438 |
| Neurospora crassa | Genome | GCA_000182925.1 | 6,000 | C:98% [D:6.5%], F:0.6%, M:0.6%, n:1438 |
| Neurospora crassa | Gene set | | | C:97% [D:10%], F:1.5%, M:0.6%, n:1438 |
| Phaeosphaeria nodorum | Genome | GCA_000146915.1 | 1,045 | C:96% [D:6.0%], F:3.1%, M:0.2%, n:1438 |
| Phaeosphaeria nodorum | Gene set | | | C:91% [D:9.7%], F:8.4%, M:0.4%, n:1438 |
| Puccinia graminis | Genome | GCA_000149925.1 | 964 | C:63% [D:5.6%], F:20%, M:15%, n:1438 |
| Puccinia graminis | Gene set | | | C:85% [D:11%], F:8.0%, M:6.3%, n:1438 |
| Saccharomyces cerevisiae | Genome | GCA_000146045.2 | 924 | C:96% [D:5.2%], F:0.4%, M:2.7%, n:1438 |
| Saccharomyces cerevisiae | Gene set | | | C:98% [D:8.6%], F:1.1%, M:0%, n:1438 |
| Schizosaccharomyces pombe | Genome | GCA_000002945.2 | 4,539 | C:89% [D:3.8%], F:2.7%, M:7.7%, n:1438 |
| Schizosaccharomyces pombe | Gene set | | | C:90% [D:9.5%], F:5.7%, M:3.3%, n:1438 |
| Sclerotinia sclerotiorum | Genome | GCA_000146945.1 | 1,625 | C:70% [D:3.5%], F:3.8%, M:25%, n:1438 |
| Sclerotinia sclerotiorum | Gene set | | | C:67% [D:8%], F:7.4%, M:25%, n:1438 |
| Tuber melanosporum | Genome | GCA_000151645.1 | 638 | C:95% [D:5.0%], F:4.1%, M:0.6%, n:1438 |
| Tuber melanosporum | Gene set | | | C:91% [D:9.0%], F:6.2%, M:2.3%, n:1438 |
| Ustilago maydis | Genome | GCA_000328475.1 | 127 | C:92% [D:5.9%], F:3.1%, M:4.4%, n:1438 |
| Ustilago maydis | Gene set | | | C:88% [D:7.5%], F:6.6%, M:5.0%, n:1438 |
| Verticillium dahliae | Genome | GCA_000150675.1 | 1,273 | C:95% [D:4.4%], F:3.5%, M:0.9%, n:1438 |
| Verticillium dahliae | Gene set | | | C:94% [D:9.4%], F:4.5%, M:0.9%, n:1438 |
| Yarrowia lipolytica | Genome | GCA_000002525.1 | 3,633 | C:97% [D:5.4%], F:2.1%, M:0.6%, n:1438 |
| Yarrowia lipolytica | Gene set | | | C:96% [D:8.8%], F:2.9%, M:0.6%, n:1438 |

Species (transcriptomes): Agaricus subrufescens; Armillaria ostoyae; Hypsizygus marmoreus; Ophiocordyceps sinensis; Phakopsora pachyrhizi; Puccinia striiformis f.sp. tritici; Pyrenochaeta lycopersici; Spraguea lophii; Termitomyces clypeatus; Trametes sanguinea; Uromyces appendiculatus

| Sample type | Identifier | BUSCOs assessment |
| --- | --- | --- |
| Transcriptome | GI:645683639 Unknown | C:7.7% [D:n.a.], F:28%, M:63%, n:1438 |
| Transcriptome | GI:480500433 RNA1 | C:45% [D:n.a.], F:42%, M:11%, n:1438 |
| Transcriptome | GI:612225315 Unknown | C:59% [D:n.a.], F:34%, M:6.4%, n:1138 |
| Transcriptome | GI:630075070 Unknown | C:38% [D:n.a.], F:36%, M:24%, n:1438 |
| Transcriptome | GI:452772923 Thai1 | C:9.3% [D:n.a.], F:12%, M:78%, n:1438 |
| Transcriptome | GI:509494464 PST | C:32% [D:n.a.], F:35%, M:32%, n:1438 |
| Transcriptome | GI:509507311 Haustorium | C:22% [D:n.a.], F:33%, M:43%, n:1438 |
| Transcriptome | GI:509515198 Spore | C:17% [D:n.a.], F:32%, M:49%, n:1438 |
| Transcriptome | GI:589143963 unknown | C:94% [D:n.a.], F:4.8%, M:0.1%, n:1438 |
| Transcriptome | GI:520759716 Spore | C:6.4% [D:n.a.], F:11%, M:82%, n:1438 |
| Transcriptome | GI:595370870 treated | C:95% [D:n.a.], F:4.3%, M:0.0%, n:1438 |
| Transcriptome | GI:595351039 untreated | C:91% [D:n.a.], F:7.5%, M:1.1%, n:1438 |
| Transcriptome | GI:511189810 BAFC2126 | C:18% [D:n.a.], F:30%, M:50%, n:1438 |
| Transcriptome | GI:452898896 SWBR1 | C:34% [D:n.a.], F:25%, M:39%, n:1438 |

5.
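BUSCO and CEGMA analysis run-times

The total run-times of default-parameter BUSCO and CEGMA assessments of genome assemblies and gene sets were evaluated on representative species from different metazoan lineages (Table S2). All analyses were performed using 4 CPUs with up to 8 GB of RAM. BUSCO assessments were performed using the eukaryote and metazoan sets, as well as the largest specific set for each species.

Table S2. BUSCO and CEGMA assessment run-times on four representative species.

| Species | Dataset | Analysis | Run-time |
| --- | --- | --- | --- |
| Drosophila melanogaster | Genome, 180 Mbp | 2’675 arthropod BUSCOs | 7.6h |
| Drosophila melanogaster | Genome, 180 Mbp | 843 metazoan BUSCOs | 3.2h |
| Drosophila melanogaster | Genome, 180 Mbp | 429 eukaryote BUSCOs | 1.4h |
| Drosophila melanogaster | Genome, 180 Mbp | 250 eukaryote BUSCOs | 0.81h |
| Drosophila melanogaster | Genome, 180 Mbp | 248 CEGMA genes | 2.5h |
| Drosophila melanogaster | Gene set, 13’918 | 2’675 arthropod BUSCOs | 1.4h |
| Drosophila melanogaster | Gene set, 13’918 | 843 metazoan BUSCOs | 0.5h |
| Drosophila melanogaster | Gene set, 13’918 | 429 eukaryote BUSCOs | 0.36h |
| Drosophila melanogaster | Gene set, 13’918 | 250 eukaryote BUSCOs | 0.15h |
| Drosophila melanogaster | Gene set, 13’918 | 248 CEGMA genes | N/A |
| Heliconius melpomene | Genome, 269 Mbp | 2’675 arthropod BUSCOs | 8.1h |
| Heliconius melpomene | Genome, 269 Mbp | 843 metazoan BUSCOs | 3.6h |
| Heliconius melpomene | Genome, 269 Mbp | 429 eukaryote BUSCOs | 0.91h |
| Heliconius melpomene | Genome, 269 Mbp | 250 eukaryote BUSCOs | 0.58h |
| Heliconius melpomene | Genome, 269 Mbp | 248 CEGMA genes | 5.7h |
| Heliconius melpomene | Gene set, 12’669 | 2’675 arthropod BUSCOs | 0.35h |
| Heliconius melpomene | Gene set, 12’669 | 843 metazoan BUSCOs | 0.18h |
| Heliconius melpomene | Gene set, 12’669 | 429 eukaryote BUSCOs | 0.12h |
| Heliconius melpomene | Gene set, 12’669 | 250 eukaryote BUSCOs | 0.1h |
| Heliconius melpomene | Gene set, 12’669 | 248 CEGMA genes | N/A |
| Homo sapiens | Genome, 3’381 Mbp | 3’023 vertebrate BUSCOs | 29h |
| Homo sapiens | Genome, 3’381 Mbp | 843 metazoan BUSCOs | 13h |
| Homo sapiens | Genome, 3’381 Mbp | 429 eukaryote BUSCOs | 6.5h |
| Homo sapiens | Genome, 3’381 Mbp | 250 eukaryote BUSCOs | 2.8h |
| Homo sapiens | Genome, 3’381 Mbp | 248 CEGMA genes | 25.3h |
| Homo sapiens | Gene set, 20’364 | 3’023 vertebrate BUSCOs | 2.6h |
| Homo sapiens | Gene set, 20’364 | 843 metazoan BUSCOs | 1.2h |
| Homo sapiens | Gene set, 20’364 | 429 eukaryote BUSCOs | 0.5h |
| Homo sapiens | Gene set, 20’364 | 250 eukaryote BUSCOs | 0.21h |
| Homo sapiens | Gene set, 20’364 | 248 CEGMA genes | N/A |
| Caenorhabditis elegans | Genome, 100 Mbp | 843 metazoan BUSCOs | 5.3h |
| Caenorhabditis elegans | Genome, 100 Mbp | 429 eukaryote BUSCOs | 1.36h |
| Caenorhabditis elegans | Genome, 100 Mbp | 250 eukaryote BUSCOs | 0.88h |
| Caenorhabditis elegans | Genome, 100 Mbp | 248 CEGMA genes | 1.7h |
| Caenorhabditis elegans | Gene set, 20’447 | 843 metazoan BUSCOs | 0.5h |
| Caenorhabditis elegans | Gene set, 20’447 | 429 eukaryote BUSCOs | 0.3h |
| Caenorhabditis elegans | Gene set, 20’447 | 250 eukaryote BUSCOs | 0.1h |
| Caenorhabditis elegans | Gene set, 20’447 | 248 CEGMA genes | N/A |

6. References

Camacho, C., et al. (2009) BLAST+: architecture and applications, BMC Bioinformatics, 10, 421.
Eddy, S.R. (2011) Accelerated Profile HMM Searches, PLoS Comput Biol, 7, e1002195.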
Keller, O., et al. (2011) A novel hybrid gene prediction method employing protein multiple sequence alignments, Bioinformatics, 27, 757-763.
Kriventseva, E.V., et al. (2014) OrthoDB v8: update of the hierarchical catalog of orthologs and the underlying free software, Nucleic Acids Res.
Mende, D.R., et al. (2013) Accurate and universal delineation of prokaryotic species, Nat Methods, 10, 881-884.
Parra, G., Bradnam, K. and Korf, I. (2007) CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes, Bioinformatics, 23, 1061-1067.
Parra, G., et al. (2009) Assessing the gene space in draft genomes, Nucleic Acids Res, 37, 289-297.
Sievers, F. and Higgins, D.G. (2014) Clustal Omega, accurate alignment of very large numbers of sequences, Methods Mol Biol, 1079, 105-116.
Tatusov, R., et al. (2003) The COG database: an updated version includes eukaryotes, BMC Bioinformatics, 4, 41.
Waterhouse, R.M. (2015) A maturing understanding of the composition of the insect gene repertoire, Current Opinion in Insect Science, 1.
Waterhouse, R.M., et al. (2013) OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs, Nucleic Acids Research, 41, D358-D365.
Waterhouse, R.M., Zdobnov, E.M. and Kriventseva, E.V. (2011) Correlating Traits of Gene Retention, Sequence Divergence, Duplicability and Essentiality in Vertebrates, Arthropods, and Fungi, Genome Biology and Evolution, 3, 75-86.
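The BUSCO assessments reported in the tables of this supplement use a compact notation, for example "C:95% [D:9.0%], F:3.8%, M:0.7%, n:2675", where C, D, F and M are the percentages of Complete, Duplicated, Fragmented and Missing BUSCOs and n is the number of BUSCOs in the set. The following is a minimal sketch of how such strings could be read into numeric fields, assuming only the notation as printed in the tables. The names BuscoSummary and parse_busco_summary are hypothetical and are not part of the BUSCO software.

```python
import re
from typing import NamedTuple, Optional


class BuscoSummary(NamedTuple):
    """Numeric fields of one assessment string (hypothetical container)."""
    complete: float              # C: percent complete
    duplicated: Optional[float]  # D: percent duplicated, None when given as n.a.
    fragmented: float            # F: percent fragmented
    missing: float               # M: percent missing
    total: int                   # n: number of BUSCOs in the set


# Matches the notation used in the tables, e.g.
# "C:95% [D:9.0%], F:3.8%, M:0.7%, n:2675" or "C:65% [D:n.a.], ...".
_SUMMARY = re.compile(
    r"C:(?P<c>[\d.]+)%\s*\[D:(?P<d>[\d.]+%|n\.a\.)\],\s*"
    r"F:(?P<f>[\d.]+)%,\s*M:(?P<m>[\d.]+)%,\s*n:(?P<n>\d+)"
)


def parse_busco_summary(text: str) -> BuscoSummary:
    """Parse one assessment string; raise ValueError if it does not match."""
    match = _SUMMARY.search(text)
    if match is None:
        raise ValueError(f"not a BUSCO assessment string: {text!r}")
    d = match.group("d")
    return BuscoSummary(
        complete=float(match.group("c")),
        # Transcriptome rows report D as "n.a.", so keep that as None.
        duplicated=None if d == "n.a." else float(d.rstrip("%")),
        fragmented=float(match.group("f")),
        missing=float(match.group("m")),
        total=int(match.group("n")),
    )
```

Calling parse_busco_summary on a table entry returns a named tuple whose fields can be compared or sorted directly, for instance to rank assemblies by the missing percentage. Rows with "D:n.a." yield None for the duplicated field rather than zero, so that an unreported value is not mistaken for a measured one.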