Supporting Information Appendix
Rapid diversification of five Oryza AA genomes
associated with rice adaptation
Qun-Jie Zhang1,2,*, Ting Zhu1,2,*, En-Hua Xia1,2,*, Chao Shi1,2,*, Yun-Long Liu1,*, Yun
Zhang1,*, Yuan Liu1,2,*, Wen-Kai Jiang1, You-Jie Zhao1, Shu-Yan Mao1, Li-Ping
Zhang1, Hui Huang1, Jun-Ying Jiao1, Ping-Zhen Xu1, Qiu-Yang Yao1,2, Fan-Chun
Zeng3, Li-Li Yang1, Ju Gao3, Da-Yun Tao4, Yue-Ju Wang5, Jeffery L. Bennetzen6,
Li-Zhi Gao1**
1. Plant Germplasm and Genomics Center, Germplasm Bank of Wild Species in
Southwestern China, Kunming Institute of Botany, the Chinese Academy of
Sciences, Kunming 650201, China
2. University of the Chinese Academy of Sciences, Beijing 100039, China
3. Faculty of Life Science and Technology, Kunming University of Science and
Technology, Kunming 650500, China
4. Institute of Cereal Crop Sciences, Yunnan Academy of Agricultural Sciences,
Kunming 650205, China
5. Department of Natural Sciences, Northeastern State University, Broken Arrow,
OK 74014, USA
6. Department of Genetics, University of Georgia, Athens, GA 30602-7223, USA
* Authors who have made equal contributions
**Corresponding Author:
Li-Zhi Gao
Tel/Fax: +86-871-5223277
E-mail: Lgao@mail.kib.ac.cn
Table of Contents
Supplemental Section S1 –Genome Sequencing, Assembly and Validation
1.1 Plant materials
1.2 DNA library construction and Illumina sequencing
1.3 Estimation of genome size
1.3.1 Estimation of nuclear DNA content by flow cytometric analysis
1.3.2 Estimation of nuclear DNA content by k-mer analysis
1.4 Genome assembly and quality assessment
1.4.1 de novo sequence assembly
1.4.2 Assessment of assembly quality
Supplemental Section S2 –Genome Annotation
2.1 Transcriptome sequencing
2.1.1 Plant materials and growth conditions
2.1.2 RNA extraction and transcriptome sequencing
2.2 Prediction of protein-coding gene models and quality validation
2.2.1 Protein-coding gene annotation
2.2.2 Gene function annotation
2.2.3 Quality validation of gene models
2.3 Annotation of non-coding RNA genes
2.4 Annotation of repeat sequences
2.4.1 Annotation of transposable elements
2.4.2 Annotation of SSRs
Supplemental Section S3–Phylogenomic Analysis and Speciation Timing of the
AA-genome Oryza Species
3.1 Identification of orthologous genes across AA-genome Oryza species
3.2 Phylogenomic analysis
3.3 Speciation times of the six AA-genome Oryza species
Supplemental Section S4–Comparisons of Orthologous Genomic Regions across
the six AA-genome Oryza Species
4.1 Construction of a synteny map for the six AA-genome Oryza
4.2 Estimation of genome divergence
4.3 Comparative sequence analyses of orthologous genomic regions in the six AA-genome
Oryza species
4.3.1 GS5-orthologous genomic regions
4.3.2 PROG1-orthologous genomic regions
Supplemental Section S5 –Genome-wide Assessment of Structural Variation in
the AA-genome Oryza species
5.1 Assessment of structural variation
5.2 Assessment of segmental duplications
Supplemental Section S6 –Dynamics and Evolution of Gene Families across the
AA-genome Oryza Species
6.1 Identification of gene families
6.2 Gene family expansions and contractions
6.3 Lineage-specific gene families
6.4 Accelerated evolution of gene families
6.5 Molecular evolution of agronomically important gene families
6.5.1 NBS-LRR resistance gene family
6.5.2 WRKY gene family
6.5.3 MADS-box gene family
Supplemental Section S7 –Identification of Gene Loss in the AA-genome Oryza
species
7.1 Loss of gene families
7.2 Identification of gene loss events and/or novel genes in SAT based on WGS reads
mapping
Supplemental Section S8 –Gain and Loss of Agronomically Important Genes
across the AA-genome Oryza Species
8.1 Computational identification of the gain and loss of agronomically important genes using
whole genome reads mapping
8.2 Evolutionary dynamics of rice speciation genes
8.3 Computational validation and experimental confirmation of the PROG1 gene in the eight
AA-genome Oryza species
Supplemental Section S9–Molecular Evolution of Protein-coding Genes
9.1 Accelerated evolution of protein-coding genes
9.2 Genome-wide scan for positive selection
9.3 Gene function classification
9.3.1 Functional analyses of positively selected genes
9.3.2 Flower development and reproduction
9.3.3 Stress resistance
Supplemental Section S10 –Evolutionary Analyses of Non-coding RNAs across
AA-genome Oryza Species
10.1 Evolutionary dynamics of non-coding RNA genes among the AA-genome Oryza species
10.2 Sequence evolution of non-coding RNA genes
10.3 Exemplar studies of some important miRNA genes in AA-genome Oryza species
REFERENCES
Supporting Figures
Supporting Tables
Supplemental Section S1 –Genome Sequencing, Assembly and Validation
1.1 Plant materials
A total of five accessions of AA-genome Oryza species, including O. nivara, O. glaberrima,
O. barthii, O. glumaepatula and O. meridionalis, were requested from Genetic Resources Center
(GRC), International Rice Research Institute (IRRI) (Table S1). Names of the sequenced species
were abbreviated, with NIV representing O. nivara, GLA representing O. glaberrima, BAR
representing O. barthii, GLU representing O. glumaepatula, and MER representing O.
meridionalis. All samples were grown in the greenhouse at the Germplasm Bank of Wild Species
in Southwest China, Kunming Institute of Botany, the Chinese Academy of Sciences. Fresh leaves
were harvested and used either directly for isolation of nuclei or frozen in liquid nitrogen and
stored at -80°C prior to DNA extraction for Illumina library construction.
1.2 DNA library construction and Illumina sequencing
We employed a whole-genome shotgun (WGS) sequencing strategy with next-generation
sequencing technologies to sequence the five AA-genome Oryza species NIV, GLA, BAR,
GLU and MER. For genome sequencing, high-quality DNA of NIV, GLA, BAR, GLU and MER
was individually prepared using a standard CTAB extraction with modification [1]. The quantity
and quality of the extracted DNA were checked by both electrophoresis on a 0.8% agarose gel
and a NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, Wilmington, Delaware).
For each species, two types of small-insert paired-end libraries (~300 bp, ~500 bp) and four
types of large-insert mate-pair libraries (2 Kb, 4 Kb, 6 Kb, 8 Kb) were prepared using
Illumina's paired-end and mate-pair kits, respectively (Illumina, San Diego, CA) (Table S2). For
the short-insert paired-end libraries, at least 5 µg of genomic DNA was fragmented by
nebulization with compressed nitrogen gas. For the long-insert mate-pair libraries, 10-30 µg of
high-quality genomic DNA was fragmented in the same way. The DNA fragments were
circularized by self-ligation; after digestion of the remaining linear DNA, the circularized
DNA was fragmented again. Before adapter ligation, the fragments that
contained the biotinylated ends of the original size-selected fragment were purified using
streptavidin-coated magnetic beads. After quality control and concentration estimation of DNA
samples with an Agilent 2100 bioanalyzer (Agilent Technologies, Palo Alto, CA, USA), libraries
were sequenced with both Illumina Genome Analyzer IIx (GAIIx) and HiSeq 2000 sequencers
by following the standard Illumina protocols (Illumina, San Diego, CA). A total of 48 libraries
were finally used to generate ~232 Gb of raw paired-end read data with read lengths ranging from
36 to 120 bp. A series of filtering and checking steps were used to eliminate artificial reads caused
by PCR duplication and adapter contamination. This yielded ~123 Gb of data that were judged
high-quality reads for the subsequent de novo genome assembly (Table S2). Library quality was
checked by examining the distributions of both insert size and sequencing depth (Figure S1a).
The actual insert lengths were then determined by mapping paired-end reads to the complete O.
sativa ssp. japonica cv. Nipponbare genome sequence (abbreviated as SAT, MSU v7.0) (Figure
S1b). We also constructed different types of insert libraries to decrease the risk of non-random
distribution of the data (Table S2).
1.3 Estimation of genome size
1.3.1 Estimation of nuclear DNA contents by flow cytometric analysis
The NIV, GLA, BAR, GLU and MER plants used as DNA sources for genomic sequencing
were individually collected in May, 2012 from the greenhouse at the Kunming Institute of Botany,
Chinese Academy of Sciences. Approximately 40-50 mg of mature leaf tissue was used for sample
preparation. Nuclei suspensions were prepared in Otto buffers [2, 3], namely Otto I: 100 mM citric
acid, 2.0 mM dithiothreitol (Sigma-Aldrich Chemie GmbH, Steinheim, Germany), 0.5% (v/v)
Tween 20 (pH 2-3), and Otto II: 400 mM Na2HPO4·12H2O (pH 8-9). In each case, the leaves were
chopped with a new sharp razor blade directly in a petri dish containing 1 mL of ice-cold Otto I
buffer. The resulting homogenate was filtered through a 50-µm nylon filter and centrifuged at
150 × g for 5 min to remove cell fragments and large debris. The supernatant was removed
(leaving ~100 µl of liquid), and then the nuclei were resuspended by gentle shaking, followed by
adding 100 µl of fresh ice-cold Otto I buffer. The nuclei were incubated at room temperature for
60 min with occasional shaking. Then 1 mL of Otto II buffer containing 50 µg mL-1 RNase (Fluka,
Buchs, Switzerland) and 50 µg mL-1 propidium iodide (PI) (Sigma, St Louis, MO, USA) was added
to the nuclei. The sample was kept in the dark at room temperature for 5-15 min and was then analyzed
by flow cytometry.
Nuclear samples were analyzed by using a BD FACSCalibur (USA) flow cytometer. The
instrument was equipped with an air-cooled argon-ion laser tuned at 15 mW and operating at 488
nm. PI fluorescence was collected through a 645-nm dichroic long-pass filter and a 620-nm
band-pass filter. Prior to analysis the instrument was checked for a linear response with
Flow-Check fluorospheres (Beckman-Coulter, Hialeah, FL, USA). The amplifier system was set
to a constant voltage and gain throughout the experiments. The results were acquired using the
CellQuest software, and the coefficients of variation (CVs) of the peaks were calculated.
Nuclear genome size was calculated from the ratio of the 2C peaks of the sample and the
standard, assuming a linear relationship.
In this study, an improved Otto buffer proved suitable for measuring these rice species,
giving low CVs (< 5%). SAT was selected as the standard, with a genome size of 0.794 pg
(~389 Mb) [4, 5]. Based on 1 pg DNA = 978 mega base pairs (Mbp) [3], the genome sizes of NIV,
GLA, BAR, GLU and MER were estimated to be approximately 341, 380, 370, 388 and 413 Mb,
respectively, with CVs < 5% (Table S3; Figure S2).
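The peak-ratio calculation described above can be illustrated with a minimal sketch. It assumes that the 0.794 pg value for SAT is a 2C content, so that halving the converted value gives the 1C genome size; the function name and the peak values are hypothetical and are not taken from Table S3.

```python
PG_TO_MBP = 978.0  # 1 pg DNA = 978 Mbp, as stated in the text


def genome_size_mb(sample_peak, standard_peak, standard_2c_pg=0.794):
    """Estimate the 1C genome size (Mb) from the ratio of 2C peak positions.

    Assumption: standard_2c_pg is the 2C DNA content of the standard (SAT),
    and fluorescence scales linearly with DNA content.
    """
    sample_2c_pg = standard_2c_pg * (sample_peak / standard_peak)
    return sample_2c_pg * PG_TO_MBP / 2.0


# Hypothetical peak positions (arbitrary fluorescence units).
print(round(genome_size_mb(100.0, 100.0), 1))  # standard against itself
print(round(genome_size_mb(95.0, 100.0), 1))   # sample with a slightly smaller genome
```

Under this assumption, a sample whose 2C peak coincides with the standard is assigned about 388 Mb, which is close to the ~389 Mb quoted for SAT.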
1.3.2 Estimation of nuclear DNA contents by k-mer analysis
The genome sizes of the sequenced rice species were further estimated using k-mer
frequencies. This method has been successfully employed to assess genome size in species
such as giant panda, pigeon pea and Setaria italica [6-8]. For the k-mer frequency distribution
analysis, we used filtered reads from the short-insert libraries, in which the duplication rate is
low, to reduce the influence of sequencing errors. Genome size, G, was estimated by the
Lander-Waterman equation, in which N is the number of reads used, L is the average read length,
K is the k-mer length (set to 17 in our analysis), and D is the sequencing depth. The peak value of
the k-mer frequency curve was calculated by the pregraph module implemented in SOAPdenovo
(version 1.05) [9], and then transformed to sequencing depth D by determining the occurrence
distribution (Figure S1a).
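The equation itself is not shown in the text above. As a hedged illustration, the sketch below uses a commonly applied k-mer form of the Lander-Waterman estimate, G = N × (L − K + 1) / D_k, where D_k is the peak k-mer depth, together with the usual conversion D = D_k × L / (L − K + 1) to base-level sequencing depth. Both expressions are assumptions chosen to be consistent with the symbols N, L, K and D defined above; they may differ from the exact expression used by the authors, and the input values are hypothetical.

```python
def kmer_genome_size(n_reads, read_len, k, kmer_peak_depth):
    """Genome size from the total number of k-mers and the peak k-mer depth."""
    total_kmers = n_reads * (read_len - k + 1)
    return total_kmers / kmer_peak_depth


def base_depth_from_kmer_depth(kmer_peak_depth, read_len, k):
    """Convert peak k-mer depth to base-level sequencing depth D."""
    return kmer_peak_depth * read_len / (read_len - k + 1)


# Hypothetical inputs: 1,000,000 reads of 100 bp, K = 17, k-mer peak at depth 21.
n_reads, read_len, k, peak = 1_000_000, 100, 17, 21
genome = kmer_genome_size(n_reads, read_len, k, peak)
depth = base_depth_from_kmer_depth(peak, read_len, k)
print(genome, depth)
# The same genome size follows from N * L / D once D is the base-level depth.
print(n_reads * read_len / depth)
```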
Estimates of genome size for each species differed between the k-mer and flow
cytometric analyses (Table S3). Overall, the results showed that genome sizes varied by < 9%
across the five AA-genome Oryza species.
1.4 Genome assembly and quality assessment
1.4.1 de novo sequence assembly
The whole-genome de novo assemblies were built using SOAPdenovo, with M = 1
adopted in order to obtain high-quality contigs. GapCloser (v2.0) was then used to fill gaps with
paired-end reads. We constructed numerous libraries and generated datasets with different depths
depending on the different levels of heterozygosity of these rice species to maximally reduce the
influence of non-randomness of sequence reads in the genome. Finally, 13, 7, 7, 9 and 11 libraries
were used for NIV, GLA, BAR, GLU and MER, respectively (Tables S2 and S4). To link and
extend the generated scaffolds we next downloaded BAC End Sequences (BESs) from OMAP
(http://www.omap.org/). A total of 51,820 and 34,747 pairs for NIV and GLA, respectively, were
first added by using Bambus [10] to build the currently reported assemblies. Recently released
BES data of 10,228 and 13,515 pairs for MER and GLU, respectively, were also employed to
extend scaffolds for these two species with relatively strict cutoffs (Table S4).
After adding BES data to NIV and GLA the lengths of ScafN50 were increased from ~133
Kb to ~512 Kb and from ~241 Kb to ~722 Kb, respectively, but there was little effect on CtgN50.
The coverage of the assembled genomes ranged from 87.8% (MER) to 94.9% (NIV) when
compared with their lengths estimated by k-mer analysis, and the unclosed gaps in scaffolds
ranged from ~16.9 Mb (BAR) to ~35.3 Mb (NIV) (Table S4). Among these five assembled
genomes, >94% of scaffolds were longer than 10 Kb (96.8%, 97.5%, 94.2%, 97.1% and 95.3% for
NIV, GLA, BAR, GLU and MER, respectively) (Table S5; Figure S3). Comparisons between the
five assembled genomes (NIV, GLA, BAR, GLU and MER) and the SAT genome showed that
these unclosed gaps were mainly composed of transposable elements (TEs) (Table S6). The
average length of protein-coding genes was ~2.85 Kb in SAT [11], with 97.3% of genes shorter
than 10 Kb.
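The ScafN50 and CtgN50 statistics quoted above follow the usual N50 definition: the length of the sequence at which the cumulative length, counted from the longest sequence downwards, first reaches half of the total assembly length. A minimal sketch of that definition is given below; the function name and the example lengths are hypothetical and are not taken from Table S4.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


# Hypothetical scaffold lengths in Kb.
print(n50([10, 20, 30, 40]))  # 30: 40 + 30 = 70 reaches half of 100
```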
1.4.2 Assessment of assembly quality
To assess the quality of the assembled genomes we used three evaluation approaches. First,
we separately mapped our assembled genome sequences (NIV, GLA, BAR, GLU and MER) to the
SAT genome to estimate the coverage of the assemblies using MUMmer [12] (Figure S4a). Our
results showed that the mapping rates to the repeat sequence-free SAT genome ranged from 77.5%
to 95.8% (Table S7). However, such estimates reflect not only assembly quality but also
genome divergence from SAT. Thus, we further compared orthologous genomic
segments based on the aligned orthology map of the six AA-genome Oryza species (SAT, NIV,
GLA, BAR, GLU and MER) (see Supplemental Section S4.1 for details), and calculated the
coverage of the gene-containing genomic regions corresponding to the SAT genome for each
species. Our results showed high degrees of coverage, ranging from 97.8% to 98.7% (Table S7).
We next evaluated the assembly quality by using Nipponbare gene models to examine whether
they are split into multiple contigs in the assemblies. Our results showed that, at three different
cutoffs of gene coverage (50%, 80% and 90%), the numbers of genes that are split into multiple
contigs varied slightly among the five rice species but the majority of them seemed to be intact in
the assembled contigs. The numbers of genes that are split into multiple contigs are fewer than
180 in the five rice species (Table S7). Second, we took the four species (NIV, GLA, GLU and
MER) with BES data from OMAP and aligned these BES with our de novo assembled genomes.
Of them, BES datasets for NIV, GLA and BAR were separately mapped to our de novo assembled
scaffolds before the de novo assemblies were improved by BES integration (Figure S4b). After
the removal of multi-matched BESs that might have been caused by repeat sequences, the
orientation differences between the two datasets were calculated, yielding potential error values
from 0.84% (GLU) to 1.8% (MER) (Table S8). The majority of these apparent assembly errors
may be products of unmasked (i.e., low copy number) repeat sequences, but some apparent
misassemblies might also reflect intra-specific rearrangements. Finally, by running both
MUMmer and BLASTZ [13], the assembly quality was further evaluated by aligning our assembled
genomes against sequences from the short arms of Chromosome 3 generated by OMAP. These
short arm sequences (http://www.omap.org/), which were assembled based on BAC shotgun
sequences using the Sanger sequencing technology, were masked by RepeatMasker [14]. The
lengths of short arms of Chromosome 3 for GLA, GLU and MER were approximately 17.5 Mb,
16.4 Mb and 18.5 Mb, and the masked regions together with gaps were calculated to represent
25.8%, 30.5% and 30.7%, respectively, in the three genomes. The mapping results (Figure S5)
showed that the identity and coverage between the two datasets were >99.2% and 83.5%,
respectively, across the whole short arms of chromosome 3. The evaluations described above
suggest that the assembled genome sequences provide a reliable framework
for subsequent comparative analyses among the AA-genome Oryza species.
Supplemental Section S2 –Genome Annotation
2.1 Transcriptome sequencing
2.1.1 Plant materials and growth conditions
Five accessions of AA-genome species (NIV, GLA, BAR, GLU and MER) were used for
transcriptome sequencing (Tables S1 and S9). Seeds were subjected to heat treatment (52℃ for
five days) until dormancy was broken. Then the dehulled seeds were sterilized with 5% NaClO
bleach and washed five times with sterilized water. Sterilized seeds were germinated and grown
on MS medium at 26 ± 2℃ under a 12 h-light/12 h-dark photoperiod. Roots and shoots from
one-month-old seedlings were collected, immediately frozen in liquid nitrogen and stored at -80℃
until RNA isolation. Plants for NIV, GLA and BAR were grown in the greenhouse at 28 ± 2℃ at
the Kunming Institute of Botany, Chinese Academy of Sciences. Considering the photoperiod and
temperature-sensitive characteristics of some wild rice species, GLU and MER were grown in the
greenhouse at 30 ± 2℃ at the Yunnan Institute of Tropical Crops, Jinghong, Yunnan, China. Flag
leaves and panicles at booting stage were collected from the greenhouse-grown plants,
immediately frozen in liquid nitrogen and stored at -80℃ until RNA isolation.
2.1.2 RNA extraction and transcriptome sequencing
Total RNA was isolated using a water-saturated phenol method. Briefly, one gram of tissue was
ground to a fine powder in liquid nitrogen and then thoroughly covered with the extraction
buffer (100 mM Tris-Cl pH 8.5, 100 mM NaCl, 20 mM EDTA pH 8.0, 1% SDS, 2%
beta-mercaptoethanol) and water-saturated phenol. After the homogenate had thawed, chloroform
and NaOAc (pH 4.0) were added. The mixture was vortexed for 1 min and centrifuged at 12,000
rpm for 15 min at 4℃. The supernatant was precipitated with 2.5 volumes of 100% ethanol and 5
M NaCl at -20℃ overnight. The pellet was washed with 70% ethanol and dried. All samples were
treated with RNase-free DNase I (Takara) in the presence of RNase inhibitor to remove residual
DNA. The extracted RNA was quantified using a NanoDrop-1000 UV-VIS spectrophotometer
(NanoDrop) and checked for integrity using an Agilent 2100 Bioanalyzer (Agilent Technologies,
Palo Alto, CA, USA).
For high-throughput sequencing, sequencing libraries were constructed following the
manufacturer's recommendations for the Illumina RNA-Seq kit (mRNA-Seq Sample Prep Kit P/N
1004814). In brief, oligo(dT) beads were used to isolate poly-A-containing mRNA. First-strand
cDNA was synthesized using random hexamer primers and reverse transcriptase (Invitrogen). The
second-strand cDNA was synthesized using RNase H (Invitrogen) and DNA polymerase I (New
England BioLabs). After second-strand cDNA synthesis and adaptor ligation, fragments of
approximately 300 bp were excised after gel electrophoresis. cDNA fragments were enriched by
PCR for 18 cycles. The products were then sequenced with the Illumina HiSeq 2000 using the
paired-end read module (100 bp). In total, we obtained 144.0 Gb of clean data from 20 libraries
(Tables S9 and S10).
2.2 Prediction of protein-coding gene models and quality validation
2.2.1 Protein-coding gene annotation
Before performing gene prediction, we masked the five assembled genomes using
RepeatMasker [14] with two approaches: soft masking for Exonerate (Version 2.2.0) and hard
masking for ab initio prediction.
2.2.1.1 ab initio gene prediction
We used AUGUSTUS (Version 2.5.5) [15], GlimmerHMM [16] and GeneMarkHMM [17] to
perform the ab initio annotation. The training models used for these three packages were "maize",
"rice", and "O_sativa", respectively.
2.2.1.2 Homology-based gene prediction
We performed homology-based gene prediction with Exonerate and GeneWise (Version 2.2.0)
[18], separately. a) Exonerate: Exonerate, a splice-site-informed software package for
sequence alignment, allows users to align sequences under many alignment models. We aligned the
homologous peptides from the sorghum, maize, SAT, O. sativa ssp. indica cv. 93-11 (abbreviated
as IND) and Brachypodium proteomes to the rice genome assemblies using Exonerate. Here, the
allowed minimum and maximum intron lengths were 20 bp and 3000 bp, respectively. b)
GeneWise: to reduce the running time of GeneWise, we first aligned the SAT protein sequences
to the assembled rice genomes and grouped all the HSPs into gene-like structures with genBlastA
[19] using suitable parameters (-e 1e-2 -m 6000). We then extracted the target gene fragments
from the genome by extending 5000 bp at both ends of the alignment regions, and finally aligned
the protein sequences of the SAT genome to these DNA fragments with GeneWise.
2.2.1.3 EST-aided annotation
To improve the quality of gene predictions, we employed an EST assembly tool, Program to
Assemble Spliced Alignments (PASA) [20], to assemble 167,613 rice EST sequences into
protein-coding gene models. We began by filtering and aligning these sequences onto the genome
assembly, and further filtered these alignments and then clustered them based on the alignment
compatibility. Finally, through a dynamic programming process, these EST alignment clusters
were stitched into a set of consistent, non-overlapping EST assemblies. The output included a
fasta file for the proteins and a GFF3 file describing the gene structures.
2.2.1.4 EVidenceModeler combining
We used the EVidenceModeler (EVM) [21] package to combine the ab initio gene predictions,
protein alignments and transcript alignments described above into weighted consensus gene
structures. Additionally, we ran the program pasa_asmbls_to_training_set.dbi to extract the
longest ORFs from the resulting PASA alignment assemblies, termed the PASA-supported terminal
exons supplement, for EVM combining. EVM provides a flexible and intuitive annotation system
for combining diverse evidence types into a single automated gene structure. We fed the genome
sequences, gene predictions, alignment data, and an evidence weight file into the EVM package.
The output contained candidate gene models. After finishing the steps described above, we
used PASA and Oryza cDNA/EST data (ftp://ftp.gramene.org/pub/gramene/genebuild/)
again to update the EVM consensus predictions, adding UTR annotations and models for
alternatively spliced isoforms.
2.2.1.5 Further filtering
To further filter the candidate gene models annotated above, three additional steps were
taken. First, we removed models with peptide lengths shorter than 50 aa; second, we removed
models in which stop codons existed within the peptides; and third, TE-related gene models were
filtered out. This last step was accomplished by using our predicted Oryza proteome to query
the MSU Oryza Repeat Database with TBLASTN. Gene models with matches above specific
cut-offs (E-value 1e-10, coverage ≥ 40%) that were annotated as TE-related genes were removed.
After this filtering, we identified 39,045, 41,490, 41,476, 42,283, 41,605 and 39,106 genes for
NIV, GLA, BAR, GLU and MER, respectively (Table S11).
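The three filtering steps can be summarized in a short sketch. It assumes that peptides are given as strings in which "*" marks a stop codon and that the set of TE-related model IDs has already been derived from the TBLASTN search; the function names, the model IDs and the example peptides are hypothetical.

```python
def passes_basic_filters(peptide, min_len=50):
    """Keep a model only if its peptide is >= min_len aa and has no internal stop."""
    # A single trailing '*' is treated as the terminal stop, not part of the peptide.
    seq = peptide[:-1] if peptide.endswith("*") else peptide
    if len(seq) < min_len:
        return False
    if "*" in seq:  # stop codon inside the peptide
        return False
    return True


def filter_models(models, te_related_ids=frozenset()):
    """Apply the length, internal-stop and TE-related filters to {id: peptide}."""
    return {
        gene_id: pep
        for gene_id, pep in models.items()
        if passes_basic_filters(pep) and gene_id not in te_related_ids
    }


# Hypothetical candidate models.
models = {
    "g1": "M" + "A" * 49 + "*",             # 50 aa, kept
    "g2": "M" + "A" * 48,                   # 49 aa, too short
    "g3": "M" + "A" * 30 + "*" + "A" * 30,  # internal stop
    "g4": "M" + "A" * 99,                   # flagged as TE-related below
}
print(sorted(filter_models(models, te_related_ids={"g4"})))
```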
2.2.2 Gene function annotation
The motifs and domains within gene models were identified by InterProScan (Version 4.5)
[22] and by Pfam, using screens against protein databases. Gene Ontology IDs for each gene were
obtained from the corresponding InterPro entry. If the best hit of a gene in any of these
processes was "function unknown" or "putative", the second-best hit was used to assign the
function, and so on until no more hits met the alignment criteria, in which case the gene was
characterized as functionally unknown (Table S11).
2.2.3 Quality validation of gene models
Three methods were used for gene model quality validation, namely transcriptome evidence,
EST evidence and homologous peptide evidence. First, the five reference transcriptomes for NIV,
GLA, BAR, GLU and MER from the twenty EST libraries (Tables S9 and S10) were aligned
to the assembled genomes using TopHat [23], and the alignments were then used as input for
Cufflinks [24] with the default parameters to finish the transcript assemblies (Table S12). The
Cuffcompare program [24] was then used to compare Cufflinks-assembled transcripts to our predicted gene
models. Second, we downloaded the EST sequences from the NCBI dbEST database
(keywords:txid4527[Organism:exp]), PlantGDB (http://www.plantgdb.org/ESTCluster/) database
and Gramene database (http://www.gramene.org/), and then BLAT was used to align all the EST
sequences to the gene models with identity ≥ 95% and coverage ≥ 90%. Third, four datasets of
peptides from the SAT, IND, GLA and BRA genomes were downloaded from the Gramene
database (http://www.gramene.org/), and then all these peptide sequences were aligned to the gene
models by using BLAT with identity ≥ 30% and coverage ≥ 90%. Numbers and percentages of
gene models that were supported by transcripts (RNA-Seq), protein and/or ESTs for NIV, GLA,
BAR, GLU and MER, respectively, were summarized in Table S12.
2.3 Annotation of non-coding RNA genes
The five different types of non-coding RNA (ncRNA) genes, namely transfer RNA (tRNA) genes,
ribosomal RNA (rRNA) genes, small nucleolar RNA (snoRNAs) genes, small nuclear RNA (snRNAs)
genes and microRNA (miRNAs) genes, were predicted using de novo and homology search methods. To
better examine the evolutionary dynamics of ncRNA genes among the six AA-genome Oryza, the
methodology was first tested by annotating tRNA, rRNA, snoRNA and snRNA genes in the SAT genome
and was subsequently applied to the NIV, GLA, BAR, GLU and MER genomes. We used tRNAscan-SE
algorithms (version 1.23) with default parameters [25] to identify tRNA genes. We annotated 551, 491, 588,
621 and 598 tRNA genes in the NIV, GLA, BAR, GLU and MER genomes, respectively (Table S13;
Figure S6). The rRNA genes (5S, 18S, and 28S) were predicted using RNAmmer algorithms with default
parameters [26]. We annotated 10, 23, 16, 29 and 61 rRNA genes in the NIV, GLA, BAR, GLU and MER
genomes, respectively (Table S13; Figure S6). The snoRNA genes were annotated using snoScan with the
yeast rRNA methylation sites and yeast rRNA sequences provided by the snoScan distribution [27]. In total,
we annotated 259, 232, 229, 194 and 195 snoRNA genes in the NIV, GLA, BAR, GLU and MER genomes,
respectively (Table S13; Figure S6). The snRNA genes were identified by using the INFERNAL software
on the Rfam database (release 9.1) with default parameters [28, 29]. In total, we annotated 116, 121, 120,
125 and 117 snRNA genes in the NIV, GLA, BAR, GLU and MER genomes, respectively (Table S13;
Figure S6).
We annotated microRNAs in two steps. First, we downloaded the existing rice miRNA entries from
miRBase release 18.0 [30]. Then, the conserved miRNAs were identified by mapping all
miRBase-recorded SAT miRNA precursor sequences against the assembled NIV, GLA, BAR, GLU and
MER genomes using BLASTN with cutoffs at E-value < 1e-5, identity > 80%, and query coverage > 80%.
Second, additional miRNA genes were identified by aligning all miRBase-recorded grass miRNA
precursor sequences against our assembled genomes using BLASTN with cutoffs at E-value < 1e-5,
identity > 60%, and query coverage > 60%, because deep-sequencing genome projects have found
numerous new miRNAs in grasses very recently. The grasses used in our analysis were Sorghum bicolor,
Saccharum officinarum, Triticum aestivum, T. turgidum, Zea mays, Brachypodium distachyon, Aegilops
tauschii, Hordeum vulgare and Festuca arundinacea. When a miRNA was mapped to a target Oryza
genome, the surrounding sequence was then checked for hairpin structures. Those loci that fulfilled
the secondary-structure requirements of miRNA precursors were annotated as additional miRNA genes.
We excluded miRNA genes that produced multiple hits, most likely repeated sequences, in the six
assembled rice genomes (Table S13; Dataset S1; Figure S6).
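The two BLASTN screening steps differ only in their cutoffs, which can be expressed as a single filter. The sketch below assumes hits in the default 12-column BLAST tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and a separately supplied table of query lengths; the function name, sequence IDs and hit lines are hypothetical.

```python
def passes_cutoffs(line, query_lengths, max_evalue=1e-5,
                   min_identity=80.0, min_qcov=80.0):
    """Check one tabular BLASTN hit against E-value, identity and query coverage."""
    fields = line.rstrip("\n").split("\t")
    qseqid = fields[0]
    pident = float(fields[2])
    qstart, qend = int(fields[6]), int(fields[7])
    evalue = float(fields[10])
    # Query coverage: aligned span of the query as a percentage of its length.
    qcov = 100.0 * (abs(qend - qstart) + 1) / query_lengths[qseqid]
    return evalue < max_evalue and pident > min_identity and qcov > min_qcov


# Hypothetical precursor lengths and hits.
query_lengths = {"precursorA": 100, "precursorB": 100}
hit_strict = "precursorA\tscaffold1\t95.0\t90\t4\t0\t1\t90\t1000\t1089\t1e-30\t150"
hit_relaxed = "precursorB\tscaffold2\t70.0\t65\t19\t0\t1\t65\t500\t564\t1e-8\t60"

# Step 1 (SAT precursors): identity > 80%, query coverage > 80%.
print(passes_cutoffs(hit_strict, query_lengths))
# Step 2 (other grasses): identity > 60%, query coverage > 60%.
print(passes_cutoffs(hit_relaxed, query_lengths, min_identity=60.0, min_qcov=60.0))
```

Hits that pass either screen would still need the hairpin check and the multiple-hit exclusion described above, which this sketch does not attempt.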
2.4 Annotation of Repeat Sequences
2.4.1 Annotation of Transposable Elements
Construction of transposon library by RepeatModeler. Previously annotated transposons in the
rice genome were downloaded from Repbase [31]. In addition, RepeatModeler
(http://www.repeatmasker.org/RepeatModeler.html) was used to identify and model repeat families. This
approach included two de-novo repeat finding programs, RECON [32] and RepeatScout [33]. The output
of RepeatModeler was used to construct the library of DNA transposons and non-LTR retrotransposons
that was supplied to RepeatMasker to annotate the five assembled Oryza genomes.
Construction of a library of de novo-identified LTR retrotransposons. First, de novo searches
were performed with LTR_STRUCT [34] against the five sequenced AA-genomes and the SAT reference
genome for the purpose of comparison, and then the output was checked with LTR_Finder [35]. False
positives were removed with blast searches between LTRs and internal sequences. All remaining intact
LTR retrotransposons that were de novo identified were classified into families by BLASTClust, a program
within the standalone BLAST package used to cluster either protein or nucleotide sequences. We used the
family classification criterion that 5' LTR sequences of the same family would share at least 80% identity
over at least 80% of their length [36]. Pfam [37] was next run against the intact LTR retrotransposon
library to determine the order of protein-coding genes. The position of the RT gene relative to the IN
(integrase) gene in the pol gene was used to classify retroelements into Ty3-gypsy (PR-RT-IN), Ty1-copia
(PR-IN-RT) or unclassified superfamilies [38]. The obtained results were checked by homology searches
using RT gene sequences of Copia and Gypsy. We then classified LTR retrotransposons containing RT
genes into families with BLASTClust using the criterion that the RT protein sequence homology was >80%
identity over >80% of the RT gene length. We then aligned LTR retrotransposon elements using
ClustalW [39] and manually checked them against the criteria described above. By this strategy, a total of
3,271 LTR retrotransposons were identified and placed into 651 families. To improve the efficiency of the
use of these LTR retrotransposons in a RepeatMasker library, nested TE sequences were removed by BLASTN
searches and sequences were combined to create exemplars for each family. We further removed a total of
37 false positives from the identified single-member families through BLASTN searches against the
genome sequences and NR database. Then, we combined the collected Oryza elements with LTR
retrotransposons downloaded from Repbase Update [31] to create a RepeatMasker library containing 1,502
sequences for further annotation of LTR retrotransposons in the six AA-genome species. The annotation of
TEs in the six rice genomes is summarized in Table S14.
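The superfamily assignment by domain order described above (PR-RT-IN for Ty3-gypsy, PR-IN-RT for Ty1-copia) reduces to comparing the positions of the RT and IN domains. The sketch below assumes that the Pfam hits of each element have already been reduced to an ordered list of domain labels along the pol gene; the function name and the example inputs are hypothetical.

```python
def classify_superfamily(domains):
    """Classify an LTR retrotransposon from the 5'->3' order of its pol domains."""
    if "RT" not in domains or "IN" not in domains:
        return "unclassified"
    # RT before IN: Ty3-gypsy (PR-RT-IN); IN before RT: Ty1-copia (PR-IN-RT).
    if domains.index("RT") < domains.index("IN"):
        return "Ty3-gypsy"
    return "Ty1-copia"


print(classify_superfamily(["PR", "RT", "IN"]))
print(classify_superfamily(["PR", "IN", "RT"]))
print(classify_superfamily(["PR", "RT"]))
```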
2.4.2 Annotation of SSRs
Simple Sequence Repeats (SSRs) in the six rice genomes were identified and located using MISA
(http://pgrc.ipk-gatersleben.de/misa/). Then, all the annotated SSRs were classified by the size and copy
number of their tandemly repeated units: monomer (one nucleotide, n ≥ 12), dimer (two nucleotides, n ≥ 6),
trimer (three nucleotides, n ≥ 4), tetramer (four nucleotides, n ≥ 3), pentamer (five nucleotides, n ≥ 3), and
hexamer (six nucleotides, n ≥ 3). We merged SSRs from the plus and minus strands, as well as motifs that
differ only by reading frame, into a single subtype [40]. For example, we combined (AC)n, (CA)n, (GT)n and (TG)n into (AC)n. As
a result, the monomer category had two subtypes as (A/T)n and (G/C)n, the dimer had four members
(AC/GT, AG/CT, AT, GC), the trimer had ten subtypes, the tetramer had 33 subtypes, the pentamer had
102 subtypes, and the hexamer had 350 subtypes. Each studied Oryza species had all subtypes of the first
five types. For the hexamer, however, 306 subtypes were shared by the six species, while the subtypes
AACGTT, ACCGGT, AGCGCT, AACGGT and AATCGT were not found in any species. The annotation
of SSRs in the six rice genomes is summarized in Table S15 and Dataset S2.
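The subtype counts quoted above follow from merging motifs that are rotations of one another (reading-frame shifts) or reverse complements (strand), after discarding motifs that merely repeat a shorter unit. The sketch below is an independent illustration of that counting, not the MISA code or the authors' script; the function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit (ACAC repeats AC)."""
    k = len(motif)
    return not any(k % p == 0 and motif == motif[:p] * (k // p) for p in range(1, k))


def canonical(motif):
    """Alphabetically smallest rotation of the motif or of its reverse complement."""
    revcomp = motif.translate(COMPLEMENT)[::-1]
    return min(s[i:] + s[:i] for s in (motif, revcomp) for i in range(len(s)))


def count_subtypes(k):
    """Number of distinct SSR subtypes for a repeat unit of length k."""
    motifs = ("".join(p) for p in product("ACGT", repeat=k))
    return len({canonical(m) for m in motifs if is_primitive(m)})


print([canonical(m) for m in ("AC", "CA", "GT", "TG")])
print([count_subtypes(k) for k in range(1, 7)])
```

Run for unit lengths one to six, the counts are 2, 4, 10, 33, 102 and 350, matching the numbers of subtypes given in the text.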
Data analysis of the SSRs from monomers to hexamers in the six AA-genome Oryza species showed
that trimers were the most abundant, followed by tetramers and dimers, while hexamers contributed the
fewest SSRs (Figure S7). We further classified each type into two subgroups by using the length of the
SSRs, ≥ 20 bp and < 20 bp [41]. As shown in Figure S7, most of the SSRs were shorter than 20 bp in these
rice genomes. The proportions of SSRs longer than 20 bp were slightly higher for dimers, pentamers and
hexamers than the other three types. For dimers, the percentage of SSRs < 20 bp was one-third in SAT, but
accounted for less than a quarter in the other AA-genome species.
Supplemental Section S3–Phylogenomic Analysis and Speciation Timing of the
AA-genome Oryza Species
3.1 Identification of orthologous genes across AA-genome Oryza species
We identified high-confidence 1:1 orthologs by combining three methods: OrthoMCL [42], read
mapping and synteny analyses. First, we searched for orthologous clusters by using OrthoMCL, as
described in Supplemental Section S6.1. As the number of orthologous clusters and their sizes depend on
the BLAST similarity threshold and MCL inflation i-parameter, higher similarity cut-offs and larger
inflation resulted in a greater number of small clusters. We identified conserved orthologous clusters by
using a re-cluster method based on OrthoMCL, with which we changed the BLAST e-values from 1e-30 to
1e-1 and inflation i-parameters from 1.5 to 5.0. A conserved orthologous cluster was defined by having its
size and members not change with the adjustment of e-values and inflation i-parameters. After obtaining
conserved OrthoMCL clusters, we used an in-house Perl script to extract the 1:1 orthologous clusters from
these conserved OrthoMCL clusters. In total, we identified 7,692 candidate 1:1 orthologs among the seven
Oryza genomes with the FF-genome species, O. brachyantha (BRA) [43], as an outgroup.
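The re-clustering idea can be illustrated with a small sketch: a cluster is treated as conserved if exactly the same set of genes appears as a cluster in every parameter run, and 1:1 clusters are those with one gene per species. The gene identifiers, the "SPECIES|gene" naming scheme and the function names are hypothetical, and this is not the in-house script mentioned above.

```python
from collections import Counter


def conserved_clusters(runs):
    """Clusters (frozensets of gene IDs) that appear unchanged in every run."""
    as_sets = [{frozenset(cluster) for cluster in run} for run in runs]
    return set.intersection(*as_sets)


def one_to_one(clusters, all_species):
    """Keep clusters holding exactly one gene from each species.

    Assumes gene IDs look like 'SPECIES|gene' (an illustrative convention).
    """
    kept = []
    for cluster in clusters:
        counts = Counter(gene.split("|")[0] for gene in cluster)
        if set(counts) == set(all_species) and all(n == 1 for n in counts.values()):
            kept.append(cluster)
    return kept


# Two toy runs with different e-value / inflation settings.
run_a = [["SAT|g1", "NIV|g1", "BRA|g1"], ["SAT|g2", "SAT|g3", "NIV|g2", "BRA|g2"]]
run_b = [["SAT|g1", "NIV|g1", "BRA|g1"], ["SAT|g2", "NIV|g2", "BRA|g2"], ["SAT|g3"]]

stable = conserved_clusters([run_a, run_b])
print(one_to_one(stable, ["SAT", "NIV", "BRA"]))
```

In the toy data only the first cluster survives, because the second cluster changes membership between runs.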
Second, we filtered the reads (phred score ≥ 20; length ≥ 25 bp) and aligned the trimmed reads to
the SAT genome using bowtie [44] with suitable parameters (-n 2 -e 70 -l 28 -m 1 -S) to identify
single-copy genes. With these parameter settings, only uniquely mapping reads were retained. As a result,
no reads should, in principle, be mapped to multiple-copy genes. However, a small number of reads were still
mapped to some multiple-copy genes, owing to sequencing errors. We calculated average depth and
coverage for each SAT gene based on the read mapping results. Although multiple-copy genes may be
mapped with a few reads due to sequencing errors, the read depth of these multiple-copy genes will not
exceed that of the whole genome sequencing (34.2×). Subsequently, we filtered out those genes with
coverage < 100% or average depth < 34.2×, and the retained sequences (coverage = 100%; depth ≥
34.2×) were judged as representing the candidate single-copy genes. In total, this approach identified 14,477
single-copy genes in the SAT genome.
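The retention rule (coverage = 100% and depth ≥ 34.2×) amounts to a simple per-gene filter. The sketch below assumes per-gene coverage and mean depth have already been computed from the alignments; the gene names and the dictionary layout are hypothetical.

```python
GENOME_DEPTH = 34.2  # whole-genome sequencing depth quoted in the text


def candidate_single_copy(gene_stats, min_depth=GENOME_DEPTH):
    """Return gene IDs with full coverage and mean depth at or above min_depth.

    gene_stats maps gene ID -> (coverage fraction in [0, 1], mean read depth).
    """
    return sorted(
        gene
        for gene, (coverage, depth) in gene_stats.items()
        if coverage >= 1.0 and depth >= min_depth
    )


toy_stats = {
    "geneA": (1.00, 36.1),  # kept
    "geneB": (0.97, 40.0),  # incomplete coverage
    "geneC": (1.00, 3.5),   # only a few stray reads
}
print(candidate_single_copy(toy_stats))
```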
Third, in order to identify collinear gene pairs between the SAT and BRA genomes, we compared the
SAT protein sequences against the protein sequences of the BRA genome by using BLASTP (1e-5), and
filtered the BLAST results before chaining with the blast_to_raw.py script (shipped in the quota-alignment
software package) [45]. In this way, local duplications and matches with a C-score < 0.5 were removed
(--tandem_Nmax=10 --cscore=0.5). Next, quota_align.py was used to generate synteny blocks in the
pairwise genome comparisons (--merge --Dm=20 --min_size=5 --quota=1:1) [45]. Finally, qa_plot.py
was employed to generate a dot plot for visualizing the quota_align.py results, and qa_to_pairs.py was
applied to obtain a total of 18,059 collinear gene pairs between the SAT and BRA genomes.
The high-confidence 1:1 orthologous gene sets were rigorously confirmed by using these four criteria:
a) gene sets must be the conserved 1:1 orthologs that were obtained via the re-cluster-OrthoMCL pipeline;
b) the conserved 1:1 orthologs of SAT must be the single-copy genes established by the read mapping
method; c) genes between the SAT and BRA genomes within the conserved 1:1 orthologs must be
supported by the conserved synteny; d) the length of each gene among the conserved 1:1 orthologs must be
longer than 300 aa. Consequently, this approach identified 2,305 orthologous single-copy gene families
with high confidence among the seven Oryza species (Table S16). Of these, we next
randomly selected seven candidate genes (LOC_Os01g59660, LOC_Os01g71960, LOC_Os01g72220,
LOC_Os02g56850, LOC_Os03g01590, LOC_Os03g12660, LOC_Os08g44530) from the SAT genome for
PCR amplification and sequencing with ABI 3730, and experimentally validated 7/7 genes (100%) in the
NIV, GLA, BAR, GLU and MER genomes, demonstrating efficient and accurate identification of
orthologous genes among the AA-genome Oryza species.
3.2 Phylogenomic analysis
We performed phylogenomic analysis of the six sequenced AA-genome Oryza species by using 1:1
single-copy orthologous genes. A total of 2,305 1:1 single-copy orthologous genes were individually aligned
with MAFFT [46] and then concatenated to construct a phylogenomic tree using RAxML [47] with BRA as
the outgroup. These genes robustly resolved phylogenetic relationships among the six AA-genome species
with all the nodes obtaining full (100%) bootstrap support (Figure S8). In spite of the power of
genome-wide orthologous genes to resolve phylogenetic relationships among closely related species [48],
highly incongruent gene trees from individual genes are quite common [49, 50]. Hence, we randomly
sampled a set of 100 genes across the genomes to separately reconstruct gene trees; among them, we
observed that only 11 (11%) completely supported the inferred species tree. Consensus networks [51]
constructed from these gene trees also showed substantial incongruence among different topologies
(Figure S9). This finding may be caused by the different evolutionary histories of these genes, and
incomplete lineage sorting may also be a reason for the observed results [49, 50].
3.3 Speciation times of the six AA-genome Oryza species
To time the speciation events leading to the major extant AA-genome lineages, we applied the average dS of the
orthologous genes with a synonymous substitution rate of 6.5 × 10⁻⁹ substitutions per site per year [52].
The divergence time dating suggested a recent radiation of about 4.8 Myr within major AA-genome
lineages and 35.3 Myr between AA- and FF-genomes (Figure S8). These results are quite different from
previous molecular clock estimates that Oryza diversified by the Middle Miocene period (15-14 Mya) [53]
and AA lineages emerged very recently, only about 2 Mya [54, 55], using chloroplast gene fragments or
fewer nuclear genes, respectively. Our large data set (2,305 single-copy orthologous genes) thus provides
a more robust estimate of the divergence time and is consistent with new fossils that suggest an older Oryzeae
origin of about 63 Mya [56].
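The text does not spell out the formula used to convert dS into time. A minimal sketch, assuming the standard molecular clock relation T = dS / (2r), where r is the synonymous substitution rate per site per year and the factor of two accounts for both diverging lineages, is shown below. The example dS value is hypothetical and chosen only to illustrate the scale involved.

```python
RATE = 6.5e-9  # synonymous substitutions per site per year, as quoted in the text [52]


def divergence_time_myr(ds, rate_per_site_per_year=RATE):
    """Divergence time in Myr under the assumed relation T = dS / (2 * rate)."""
    return ds / (2.0 * rate_per_site_per_year) / 1e6


# Hypothetical average pairwise dS, for illustration only (not a value from the paper).
example_ds = 0.0624
print(round(divergence_time_myr(example_ds), 2))
```

Under this assumption, an average dS of roughly 0.06 corresponds to a split of about 4.8 Myr at the quoted rate.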
Supplemental Section S4–Comparisons of Orthologous Genomic Regions across the six
AA-genome Oryza Species
4.1 Construction of a synteny map for the six AA-genome Oryza
To aid detailed evolutionary analyses, we identified and aligned orthologous genomic regions from
the six assembled AA-genomes using MERCATOR [72] and MAVID [57]. A total of seven steps were
taken, as follows: first, we obtained the results of gene annotation for each genome (Table S11); second,
all exons from a candidate genome were compared against all exons in the other five genomes using BLAT,
and significant alignments between exons were recorded; third, a graph was constructed with each vertex
corresponding to an exon and edges joining vertices whose corresponding exons had significant alignments;
fourth, cliques in this graph were identified; fifth, neighboring cliques were joined to form runs; sixth, the
extent of each run for each genome was output as orthologous segments; and, finally, orthologous
sequence alignments were generated using the confirmed phylogenetic relationships, in Newick format:
((((SAT:0.002395, NIV:0.002659):0.001616, (BAR:0.001242, GLA:0.001793):0.001899):0.001268, GLU:0.004631):0.010971, MER:0.014768)
[58]. From this analysis, we obtained a total of 2,971 orthologous genomic
segments across the six AA-genome Oryza species, ranging in identity from 74.7% in MER to 85.1% in
SAT (Table S17). Note that the majority of these six genomes were covered by the identified orthologous
segment regions, which consisted of 40,435, 39,258, 39,902, 39,182, 39,258 and 36,106 genes in SAT, NIV,
GLA, BAR, GLU and MER, respectively. The average gene number per block ranged from ~12.2 in MER
to ~13.6 in SAT. Of the 2,971 orthologous genomic segments, there were 1,202 segments that were larger
than 100 Kb (43.07%) (Table S18). This orthologous synteny map across AA-genome Oryza species
provided a framework to subsequently perform comparative and evolutionary analyses.
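Steps three and four above (building an exon graph and finding cliques) can be sketched with a basic Bron-Kerbosch search. This is a generic illustration of the technique, not Mercator's implementation, which may use a different algorithm and additional heuristics; the exon identifiers are hypothetical.

```python
def build_graph(hits):
    """Undirected graph {exon: set(neighbouring exons)} from significant exon pairs."""
    adj = {}
    for a, b in hits:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def maximal_cliques(adj):
    """Yield every maximal clique (Bron-Kerbosch, without pivoting)."""
    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            yield frozenset(current)
            return
        for v in list(candidates):
            yield from expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    yield from expand(set(), set(adj), set())


# Toy BLAT hits between exons of three genomes.
hits = [
    ("SAT:e1", "NIV:e1"), ("SAT:e1", "GLA:e1"), ("NIV:e1", "GLA:e1"),
    ("SAT:e2", "NIV:e2"),
]
for clique in maximal_cliques(build_graph(hits)):
    print(sorted(clique))
```

Each clique groups exons that all align significantly to one another, which is what allows neighboring cliques to be chained into runs in the next step.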
4.2 Estimation of genome divergence
To determine the global extent of genome divergences between Asian cultivated rice (SAT) and the
other five AA-genome Oryza analyzed, we used the three data sets of orthologous nucleotide sequences,
protein sequences, and 103 orthologous genomic segments totaling ~15 Mb (~3.9% of the rice genome)
with an average length of approximately 100 Kb (Table S19). Levels of divergence represented by
nonsynonymous (dN) and synonymous (dS) substitutions of the 2,305 orthologous genes were separately
calculated by using the codeml program implemented in PAML [59]. The average synonymous
substitutions exceeded nonsynonymous substitutions for each pairwise comparison, suggesting widespread
purifying selection on these detected orthologous genes. Both synonymous and nonsynonymous
substitutions were in agreement with their phylogenetic positions in the topology, indicating the increase of
substitutions with the species divergence. Pairwise genome divergences between SAT and these species at
the amino acid level were also consistent with their phylogenetic positions in the topology (Table S19;
Figure S8). Genome divergence between orthologous genomic segments represented by pairwise distances
implemented in MEGA5 [60] far exceeded orthologous nucleotide sequence variation inside gene coding
sequences, indicating that neutral evolution, especially TE insertions and subsequent turnover in intergenic
regions, accounts for most of the genomic sequence divergence [61].
4.3 Comparative sequence analyses of orthologous genomic regions in the six AA-genome Oryza
species
Comparative genomics is a powerful approach to understand gene and genome evolution. By placing
multiple Oryza genome comparisons in a phylogenetic context, previous studies recorded many historic
genomic changes that led to the diversification of this genus and have obtained a general understanding
about Oryza genome evolution [61-64]. Of the 2,971 genomic segments we identified in this study, we
carefully checked ~100 and found that some segments exhibited a high conservation of genomic
architecture among these six rice species, while others diverged unexpectedly rapidly from one to another.
To improve the sensitivity of evolutionary inferences and study sequence evolution of the six Oryza
AA-genomes studied here, which diverged over a relatively short time frame of ~4.8 Myr, we selected and
compared two genomic regions surrounding two agronomically important loci, GS5 and PROG1.
4.3.1 GS5-orthologous genomic regions
The GS5 locus is a major gene controlling grain size and weight in rice [65, 66]. We compared a
31.5-kb reference segment from the SAT genome surrounding GS5 with orthologous regions from the
other five AA-genome species to study sequence evolution around a domestication locus. Sequence
alignment and annotation of the orthologous genomic regions for these species revealed highly conserved
gene colinearity and structure in the GS5 region (Figure S10; Tables S20 and S21). Although the majority
of TEs were conserved among orthologous genomic regions, lineage-specific differences in transposon
amplification appear to be responsible for the different sizes of the examined orthologous genomic regions.
Consistent with patterns observed in Supplemental Section S10, we found that TE insertions into the
three genes were mainly DNA transposons such as miniature inverted-repeat transposable elements
(MITEs), while TEs in intergenic regions were often LTR retrotransposons. Although there were several
cases of TE insertions specific to a single studied species within GS5-orthologous genomic regions,
patterns for most TE insertions and/or deletions within either genic or intergenic regions were shared by
two or more of the studied species and met phylogenetic expectations. We also observed that, in some cases,
different types of TEs were nested together to form mixed TE clusters.
To determine selection pressures on the GS5 gene, we used the codeml program implemented in
PAML to separately perform analyses of dN/dS (ω) ratios under the branch and site models [67]. Estimates
under both models indicate that the two cultivated species have historically experienced positive selection
(both ω values of SAT and GLA >> 1), while their wild progenitors were under purifying selection (dN
value of NIV = 0.000002, dS value of NIV = 0.000000; ω value of BAR < 1). GS5 genes were also
observed to be under strong purifying selection for the other two wild species (ω values of GLU and MER
<< 1) (Figure S11). We also investigated selection pressure using the site model, and observed some
specific sites of the GS5 gene under strong positive selection in the AA-genome Oryza species (Table S22).
These results suggest that the GS5 gene underwent directional artificial selection independently during the
domestication of Asian and African cultivated rice, thus leading to the convergent increase in
grain size and weight of cultivated rice compared to their wild ancestors.
4.3.2 PROG1-orthologous genomic regions
The PROG1 gene determines several important agronomic characters, including prostrate growth,
great grain number and high grain yield in Asian cultivated rice [68, 69]. To investigate the evolution of
the studied Oryza AA-genomes, we compared a 279-kb reference segment encompassing PROG1 of the
SAT genome with orthologous regions from the other five diploid AA-genomes. Here, we retrieved
278,955 bp, 253,313 bp, 201,587 bp, 192,527 bp, 226,763 bp and 196,320 bp of the PROG1 orthologous
genomic regions from the SAT, NIV, GLA, BAR, GLU and MER genomes, respectively. The alignment
of sequences for these species and annotation of the orthologous genomic regions showed generally
conserved gene colinearity and structure in the PROG1 region (Tables S23 and S24). The conservation of
genomic structures and TE presence within the PROG1 genomic regions could largely be explained by the
phylogenetic relationships of the studied species. Of them, GLA and BAR exhibited the highest colinearity,
while relatively lower sequence conservation was observed among SAT, NIV, GLU and MER. However,
our analysis revealed the architectural complexities and dynamic evolution of this region that have
occurred over the past 4.8 Myr. Comparative analysis of the evolutionary history of the annotated genes
indicated independent and lineage-specific gene gain and loss from this region among the six genomes as
frequent causes of synteny disruption. In addition to lineage-specific amplification and subsequent loss of
LTR retrotransposon elements in SAT, NIV and GLU, for example, massive lineage-specific insertions or
movements of genes/gene fragments also account for length differences among the examined orthologous
genomic regions. It is interesting to find that, apart from the conserved gene content, several genes appear to
have been generated de novo or uniquely deleted in the AA-genome lineages. For example, the
PROG1 gene could be identified in SAT, NIV and GLU but was apparently lost in GLA, BAR and MER.
In comparison, 10 of the 42 genes in the SAT genome are largely responsible for the expansion of
sequence length in this species. Compared with the SAT PROG1 region, we detected several
micro-rearrangements in NIV including a relatively long-range deletion spanning about 30 Kb.
Supplemental Section S5 –Genome-wide Assessment of Structural Variation in the
AA-genome Oryza species
5.1 Assessment of structural variation
A total of 2,971 orthologous genomic segments across the six AA-genome Oryza species (Table S18), produced
by Mercator [72], were used to detect genome-wide insertions and deletions (indels). Mercator identifies
syntenic regions with one-to-one orthology relationships to ensure only one best alignment for each locus.
All segments of the five de novo assemblies (NIV, GLA, BAR, GLU and MER) were individually aligned to
the corresponding orthologous region of SAT by LASTZ [70], with the high-scoring segment pair chaining
option, ambiguous 'N' treatment and gap-free extension tolerance up to 50 Kb enabled. Using
several algorithms from SOAPsv [71] (http://soap.genomics.org.cn/SOAPsv.html), alignment errors and
inaccurately predicted gaps in the assemblies were corrected, and the best hits contributing most to the
co-linearity between orthologous segments were selected if more than one alignment overlapped at the
same SAT locus. Finally, we extracted gaps in the best pairwise alignments as candidate insertions (gaps
opened in SAT but sequences existed in corresponding AA-genome scaffolds) and candidate deletions (the reverse case). We
compared all indels with assembly gap positions in the five assembled genomes, and found that only ~2%
of insertions corresponded to gap regions while no deletions overlapped with gaps (Table S25).
Furthermore, these gaps were all wrapped in non-N insertion sequences and away from breakpoints,
indicating that these insertions are bona fide structural variations, although their lengths cannot be
precisely detected due to the existence of assembly gaps.
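The final extraction step, reading indels off the gaps of a pairwise alignment, can be illustrated as below. The sketch assumes the alignment is available as two equal-length gapped strings with SAT as the reference; it does not reproduce the LASTZ or SOAPsv processing, and the function name is hypothetical.

```python
def call_indels(ref_aln, qry_aln):
    """Return (insertions, deletions), each a list of (reference position, length).

    A gap ('-') in the reference row is an insertion in the query genome;
    a gap in the query row is a deletion. Positions are 0-based reference offsets.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    calls = {"ins": [], "del": []}
    ref_pos = 0                        # reference bases consumed so far
    kind, start, length = None, 0, 0   # state of the current gap run
    for r, q in zip(ref_aln, qry_aln):
        if r == "-" and q != "-":
            now = "ins"
        elif q == "-" and r != "-":
            now = "del"
        else:
            now = None
        if now != kind:                # a gap run ends or a new one begins
            if kind is not None:
                calls[kind].append((start, length))
            kind, start, length = now, ref_pos, 0
        if now is not None:
            length += 1
        if r != "-":
            ref_pos += 1
    if kind is not None:               # flush a gap run that reaches the end
        calls[kind].append((start, length))
    return calls["ins"], calls["del"]


insertions, deletions = call_indels("ACG--TACGT", "ACGTTTA--T")
print(insertions, deletions)
```

In the toy alignment, the two-column gap in the reference row is reported as an insertion and the two-column gap in the query row as a deletion.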
In the five assembled rice genomes, we identified a total of 232,900-514,924 putative insertions
(ranging from 1-47,650 bp in length) corresponding to 29.43-55.61 Mb, and 240,928-539,026 deletions
(ranging from 1-40,230 bp in length) affecting 10.16-22.12 Mb (Table S25). Note that the lengths of
identified insertions may be under-represented due to assembly gaps. When we examined patterns of
the size distribution, assembly gap-related insertions were also removed because of the difficulty in
accurately predicting the lengths of gaps. The size distribution of structural variations is consistent with
previous findings [73] that longer variations were less abundant. The majority of detected indels were
small in size; at least 90% of all indels were < 100 bp, whereas 70% were < 10 bp (Figure S12). There
were several exceptions, with significantly increased numbers of indels ranging in size from 100 bp to
300 bp; as previously reported [74, 75], this may be explained by the enrichment of DNA transposon
insertions (Figure S13). For example, TE annotation of these indels (230-250 bp) showed that more than
half of them were composed of DNA transposons, ranging from 52.3% (NIV) to 78.4% (MER), of which Tc1
and Tourist/Harbinger were apparently predominant. However, we also detected some indels spanning
more than 40 Kb (Table S25; Figure S12). The larger indels are most likely under-represented in our
output data due to the constraints of the applied detection method; the majority of indels detected in the
five rice genomes were smaller than 1 Kb (Figure S12).
The genomic positions of the identified indels shorter than 50 bp in length were
mapped and located in the SAT genome, showing that a total of 86% of the predicted indels were located in
intronic (20%) and intergenic regions (66%) (Figure S14). As expected, indels were less abundant in
protein-coding regions (5%), suggesting that they are more likely to have negative impacts there and are thus readily
eliminated by purifying selection. We analyzed the size distribution of insertions and deletions within
protein-coding sequences and found that there were peaks at positions that are multiples of three (Figure
S15), owing to negative selection on frame-shift indels [76]. Overall, these findings demonstrate that
far-reaching structural variation has affected not only genomic architectural heterogeneities but also the
evolution of protein-coding genes.
To understand how large-scale structural variants have driven genome evolution, we placed all
identified genomic structural variants onto the phylogenetic tree of the six AA-genome Oryza species.
Here we applied a 'code' to denote the presence or absence of each structural variation event in the six rice genomes:
'1' symbolizes the orthologous presence in a species, '0' signifies the absence, and the order of the 'code'
follows the evolutionary relationships of the species tree (SAT/NIV/BAR/GLA/GLU/MER). Note that
each bit denotes the state in a species. For example, '010000' indicates that the indel only occurred in NIV.
We characterized and classified these indels into two types. First, the regular type indicates that the
insertion/deletion event is clearly supported by the robust phylogenetic relationships of the six studied
species (Figure S8) (e.g., '000100' denotes that an indel only appears in GLA, and '110000' designates a
SAT/NIV-specific indel); and second, the random type includes all other insertion and deletion events that
cannot be explained by any phylogenetic relationships (Table S26). There were 991,155 lineage-specific
insertions and 963,201 lineage-specific deletions of the regular type, compared with 169,453 insertions and 139,069
deletions of the random type, in the six AA-genome Oryza species; the former are about 6-7 times as numerous as
the latter.
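The presence/absence code can be classified mechanically once the clades of the species tree are listed. The sketch below is one possible reading of the rule, under two stated assumptions: a code is "regular" when the species carrying the variant form a single species or a monophyletic group of the tree in Figure S8, and codes with all bits equal are treated as uninformative. The authors' exact rule (for example, how shared absences were handled) is not given in the text.

```python
SPECIES = ["SAT", "NIV", "BAR", "GLA", "GLU", "MER"]  # bit order given in the text

# Multi-species monophyletic groups of the species tree (assumed from Figure S8).
CLADES = [
    {"SAT", "NIV"},
    {"BAR", "GLA"},
    {"SAT", "NIV", "BAR", "GLA"},
    {"SAT", "NIV", "BAR", "GLA", "GLU"},
]


def species_with_variant(code):
    """Set of species whose bit is '1' in a six-character code."""
    if len(code) != len(SPECIES) or set(code) - {"0", "1"}:
        raise ValueError("code must be six characters of 0/1")
    return {sp for sp, bit in zip(SPECIES, code) if bit == "1"}


def classify(code):
    """Return 'regular', 'random' or 'uninformative' for a presence/absence code."""
    present = species_with_variant(code)
    if len(present) in (0, len(SPECIES)):
        return "uninformative"
    if len(present) == 1 or present in CLADES:
        return "regular"
    return "random"


for code in ("010000", "000100", "110000", "101000"):
    print(code, classify(code))
```

The first three codes are the examples from the text and come out as regular; '101000' (SAT and BAR only) matches no clade and is classed as random.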
To estimate the pace at which these indels were inserted into and/or removed from the six AA-genomes, we
further examined the regular type of structural variants by locating them at the five dated nodes (a:
0.26 Myr; b: 1.2 Myr; c: 1.6 Myr; d: 1.8 Myr; e: 4.8 Myr). Our results suggested that, during the past 4.8
Myr, the rates of both insertions and deletions have been constant and have thus contributed greatly to
rice gene and genome evolution. The numbers of structural variation events that occurred along different
branches are proportional to their branch lengths, further supporting the observation that the
occurrences of large-scale structural variations are closely related to their phylogenetic positions and
divergence times from SAT (Figure S8; Table S26). The number of insertions/deletions that occurred in the
African branch (GLA/BAR), for example, was one-third lower than in the Asian branch (SAT/NIV) due to the
relatively recent speciation and lineage divergence.
5.2 Assessment of segmental duplications
We performed a genome-wide comparative analysis of high-identity segmental duplications in the six
AA-genome Oryza species. As large, highly identical duplicated regions are often missing,
collapsed, or mis-assigned in assemblies, detection power is subject to an ascertainment bias that depends on
the quality of the genome assembly. To avoid the effect of different assemblies, we used the whole-genome shotgun
sequence detection method (WSSD) [77-79] to detect recent segmental duplications in these rice genomes.
The applied databases included the next-generation sequencing paired-end reads of the five rice genomes
generated by the Illumina platform, as described in Supplemental Section S1; short-read sequencing data
set of SAT was obtained from the DDBJ Sequence Read Archive
(http://trace.ddbj.nig.ac.jp/dra/index_e.shtml, acc=DRR001555) (Table S27). Trimmomatic [80] was used
to trim adapter and low quality (Phred quality < 20) sequences. All known common repeats and
low-complexity sequences were masked with RepeatMasker [81], and a second level of masking with
Tandem Repeats Finder (TRF) [82] was run to remove short tandem repeats. We mapped the filtered reads
to the SAT reference genome using bwa [83], removed PCR duplicates, and calculated average insert sizes
and standard deviations for paired ends that mapped in the correct orientation using Picard
(http://picard.sourceforge.net) (Table S27). mrFAST [84] was used to place filtered reads at all
possible locations in the reference genome, allowing an edit distance of up to 4 bp per read. Reads
exceeding 72 bp in length were truncated to 72 bp. We classified read pairs as discordant when their
mapping span exceeded the average insert size plus four standard deviations [85, 86].
We initially assessed the dynamic range response of short sequence data mapped by mrFAST through
determining the read depth for a set of 2,305 1:1 orthologous genes where copy number status had been
previously confirmed with computational methods (see Supplemental Section S3.1 for details). Using
these benchmark loci, we determined the average read depth and variance for 5-kb (unmasked) regions for
all chromosomal loci. Using mrCaNaVaR [78], read-depth profiles were independently constructed for
each sequencing library and corrected for G+C bias introduced during library construction (Table S27). We
considered a region to be a high-identity segmental duplication interval, based on the criteria of [87], when
at least six of seven consecutive 5-Kb genome windows had a read depth at least three standard deviations
above the mean depth calculated for the single-copy regions.
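The window criterion can be sketched as a sliding scan over per-window read depths. This is an illustration of the stated rule only (six of seven consecutive windows at or above mean + 3 SD), not the mrCaNaVaR implementation; the depth values, mean and standard deviation in the example are hypothetical.

```python
def duplicated_windows(depths, mean, sd, span=7, min_hits=6, n_sd=3.0):
    """Flag windows that lie in a run of `span` consecutive windows in which
    at least `min_hits` have depth >= mean + n_sd * sd."""
    threshold = mean + n_sd * sd
    high = [d >= threshold for d in depths]
    flagged = [False] * len(depths)
    for start in range(len(depths) - span + 1):
        if sum(high[start:start + span]) >= min_hits:
            for i in range(start, start + span):
                flagged[i] = True
    return flagged


# Hypothetical depths for thirteen consecutive 5-Kb windows.
depths = [30, 30, 30, 80, 85, 90, 35, 88, 92, 81, 30, 30, 30]
flags = duplicated_windows(depths, mean=34.0, sd=5.0)
print([i for i, f in enumerate(flags) if f])
```

Allowing one window in seven to fall below the threshold keeps a single low-depth window (index 6 in the toy data) from splitting an otherwise continuous duplicated interval.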
The WSSD method has been shown [88-90] to be efficient in
comprehensively identifying putative segmental duplications greater than 20 Kb in size and in accurately
predicting absolute copy number variation of duplicated segments and genes using next-generation
sequence reads. In this analysis, we identified a total of 11.83 Mb of segmental duplications > 20 Kb in
SAT, and 4.93 Mb of reference sequences were duplicated in NIV, 4.51 Mb in GLA, 4.82 Mb in BAR, 5.62
Mb in GLU, and 6.36 Mb in MER (Dataset S3a), which involved a total of 1,628 genes. Gene ontology (GO)
analysis of these genes showed that their functional classes were significantly enriched for several specific
biological functions (e.g., cell death, response to stress, defense response genes) (Dataset S3b). The
identification of a number of segmental duplications, which were fewer than the large-scale insertion and deletion
events, suggests that mechanisms other than non-allelic homologous recombination may also have made
contributions to rice genome evolution.
Supplemental Section S6 –Dynamics and Evolution of Gene Families among the
AA-genome Oryza Species
6.1 Identification of gene families in the AA-genome Oryza species
The OrthoMCL pipeline [42] was used to cluster the predicted genes from the seven Oryza genomes
(SAT, NIV, GLA, BAR, GLU, MER and BRA) into gene families on the basis of the
similarities of protein sequences. Alternatively spliced isoforms and TE-related genes were filtered out from the original
proteomes (Table S28). The longest isoform of each gene was retained and an all-against-all comparison using
BLASTP (1e-5) was performed. Clustering was then performed based on a Markov cluster algorithm (MCL)
using OrthoMCL (inflation 1.5). This analysis resulted in a total of 39,293 gene families (Table S29).
6.2 Lineage-specific gene families of the AA-genome Oryza species
After collecting all gene families, we classified them according to the presence or absence of genes
for a given species and determined which gene families were lineage-specific. Among the 1,779 gene
clusters representing 3,805 genes unique to BAR, 728 (19.13%) contained InterPro [22] domains or were
assigned gene ontology categories. The remaining 3,077 were previously unidentified predicted genes
of unknown function. The most common GO terms associated with GLA and BAR lineage-specific families included
protein binding (GO:0005515), oxidation-reduction process (GO:0055114) and zinc ion binding
(GO:0008270) (Dataset S4). By comparison, 1,039 clusters containing 2,239 genes were unique to the SAT
and NIV lineages. Of the 2,239 genes, 591 (26.40%) had annotated functions, which mainly involved protein
phosphorylation (GO:0006468), oxidation-reduction process (GO:0055114) and protein binding
(GO:0005515) (Dataset S4).
6.3 Gene family expansions/contractions of the AA-genome Oryza species
In order to estimate rates of gene gain and loss, we applied an updated version of the likelihood model
[91] implemented in the software package CAFE v2.2 [92]. This method models gene family evolution
as a stochastic birth and death process, where genes are gained and lost independently along each branch
of a phylogenetic tree. A parameter, λ, describes the rate of change as the probability that a gene family
either expands (via gene gain) or contracts (via gene loss) per gene per million years, and can now be
estimated independently for all branches. After excluding lineage-specific families and likely annotation
artifacts, we inferred a total of 17,563 gene families that may have been present in the most recent common
ancestor (MRCA) of rice ("Likelihood Analysis" in Dataset S5). For these gene families (n = 17,563),
parameters were estimated by maximizing the likelihood of the observed family sizes. We first attempted
to estimate a fully parameterized model with 12 different values of λ, one for each branch of the tree,
with the latest version of the program CAFE v2.2 [92]. Given that the search space of the 12-parameter (12-p)
model is likely too large to find a single global maximum, we created a 3-p model by assigning
branches to one of three rate categories, fast (λ1), medium (λ2), and slow (λ3), based on the best
branch-specific rate estimates from the above 12-p model. This 3-parameter (3-p) model always converged
to a single maximum (λ1 = 0.0602; λ2 = 0.0507; λ3 = 0.0023). The "fast" branches of the 3-p
tree include the terminal lineages leading to GLU and NIV; the "slow" branches include the terminal
lineages leading to GLA and BRA. The estimated parameters (λ-values) for each branch of the 12-p model
were:
(((((SAT_0.0161:1.2,NIV_0.0287:1.2)_0.0126:0.4,(GLA_0.0047:0.26,BAR_0.0124:0.26)_0.0170:1.34)_0.0243:0.2,GLU_0.0302:1.8)_0.0093:3,MER_0.0120:4.8)_0.0008:30.5,BRA_0.0005:35.3)
The λ-values for the 3-p model were:
(((((SAT_0.0507:1.2,NIV_0.0602:1.2)_0.0507:0.4,(GLA_0.0023:0.26,BAR_0.0507:0.26)_0.0507:1.34)_0.0602:0.2,GLU_0.0602:1.8)_0.0023:3,MER_0.0023:4.8)_0.0023:30.5,BRA_0.0023:35.3)
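The annotated trees above encode each branch as label_rate:length. A short sketch for reading the per-branch rates back out and grouping terminal lineages by rate category is given below; the regular expression and function names are illustrative assumptions, and the tree string is the 3-p tree from the text with its line break removed.

```python
import re
from collections import defaultdict

TREE_3P = (
    "(((((SAT_0.0507:1.2,NIV_0.0602:1.2)_0.0507:0.4,"
    "(GLA_0.0023:0.26,BAR_0.0507:0.26)_0.0507:1.34)_0.0602:0.2,"
    "GLU_0.0602:1.8)_0.0023:3,MER_0.0023:4.8)_0.0023:30.5,BRA_0.0023:35.3)"
)

# label (empty for internal branches), rate, branch length in Myr
BRANCH = re.compile(r"([A-Z]*)_(\d+\.\d+):(\d+(?:\.\d+)?)")


def branch_rates(tree):
    """List of (label, rate, branch_length_myr); internal branches are labelled 'internal'."""
    return [
        (name or "internal", float(rate), float(length))
        for name, rate, length in BRANCH.findall(tree)
    ]


def terminal_by_rate(tree):
    """Map each rate value to the terminal lineages assigned to it."""
    groups = defaultdict(list)
    for name, rate, _ in branch_rates(tree):
        if name != "internal":
            groups[rate].append(name)
    return dict(groups)


print(len(branch_rates(TREE_3P)))
print(terminal_by_rate(TREE_3P))
```

Parsed this way, the tree has 12 branches, and the terminal lineages fall into the three rate classes as listed in the tree string (NIV and GLU fast; SAT and BAR medium; GLA, MER and BRA slow).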
6.4 Accelerated evolution of gene families in the AA-genome Oryza species
The maximum likelihood approach to studying gene family evolution allows us to identify individual
gene families that are evolving at rates of gain and loss significantly higher than the genome-wide
average. Such families can exhibit either larger-than-expected expansions or contractions, which may
either be confined to a single lineage or reflect large changes across the overall phylogeny. Of the 17,563
gene families inferred to have been present in the MRCA of AA-genome species, we found that 552
exhibited significant expansions or contractions (P-value < 0.0001, Dataset S6). At this level of
significance, only slightly more than one family is expected by chance. Annotation of InterPro domains
and GO assignments showed that these rapidly evolving families were associated with many biological
processes (Dataset S6). We are particularly interested in gene families with large lineage-specific
expansions, as it is likely that adaptive selection may act on lineage-specific traits through these changes.
Of the 552 rapidly evolving families, we identified ten that showed large changes in copy number among
the four terminal branches (SAT, NIV, BAR and GLA) and performed the annotation of InterPro and GO
enrichment (Dataset S7). Of them, the largest gene family (GF_2) had 188 copies across all seven Oryza
genomes and contained Tyrosine-protein kinase, catalytic domain (IPR020635), Leucine-rich repeat
domain (IPR013210), and Protein kinase, ATP binding site (IPR017441).
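As a quick check on the significance threshold used in this section, the number of families expected to pass P < 0.0001 by chance is the number of families tested multiplied by the threshold. The short Python sketch below makes that arithmetic explicit; treating the families as independent tests under the null is an assumption made here for illustration.

```python
# Numbers taken from the text; independence of the tests is an assumption.
n_families = 17563
p_threshold = 0.0001
observed_significant = 552

# Expected number of families reaching the threshold by chance alone.
expected_by_chance = n_families * p_threshold
print(f"expected by chance: {expected_by_chance:.2f}; observed: {observed_significant}")
```

The expectation is about 1.76 families, against 552 observed.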
6.5 Molecular evolution of agronomically important gene families
6.5.1 NBS-LRR resistance gene family
A complete set of the NBS-encoding sequences was identified in the seven rice genomes using a
reiterative process. First, the predicted proteins from the seven rice genomes were screened using HMMER
V.3 [93] against the raw Hidden Markov Model (HMM) corresponding to the Pfam NBS (NB-ARC)
family (PF00931.17; http://pfam.sanger.ac.uk/). The analyses using the raw NBS domain HMM resulted in
a total of 715 (SAT), 562 (NIV), 518 (GLA), 549 (BAR), 469 (GLU), 491 (MER), and 373 (BRA)
candidates. Subsequently, high-quality protein sets (E-value < 1e-60) from the seven genomes were aligned using
MUSCLE [94] and used to construct seven species-specific NBS HMMs, one per genome, with
the module "hmmbuild". Searching again with these seven new rice-specific models, we identified 631 (SAT),
489 (NIV), 450 (GLA), 476 (BAR), 392 (GLU), 416 (MER), and 307 (BRA) NBS-candidate proteins
(E-value ≤ 1e-2) (Table S30). These results suggest an expansion of
NBS-encoding genes in the AA-genome species compared with BRA.
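The two E-value cutoffs in this procedure (a strict one to pick seed proteins for hmmbuild, a permissive one for the final candidates) can be illustrated with a small filter over an hmmsearch table. This is a minimal Python sketch, assuming the HMMER3 `--tblout` layout in which the target name is the first column and the full-sequence E-value is the fifth; the function name `select_hits` and the toy table are hypothetical and are not taken from the original pipeline.

```python
def select_hits(tblout_text, max_evalue):
    """Return target names whose full-sequence E-value is <= max_evalue.

    Assumes HMMER3 --tblout layout: whitespace-separated columns with the
    target name in column 1 and the full-sequence E-value in column 5.
    """
    hits = []
    for line in tblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        name, evalue = fields[0], float(fields[4])
        if evalue <= max_evalue and name not in hits:
            hits.append(name)
    return hits

# Hypothetical toy table: protein IDs and scores are invented for illustration.
EXAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias
protA          -          NB-ARC      PF00931.17 3.2e-75  250.1  0.1
protB          -          NB-ARC      PF00931.17 4.0e-12   45.3  0.0
protC          -          NB-ARC      PF00931.17 0.5        5.2  0.0
"""

seed_set = select_hits(EXAMPLE_TBLOUT, 1e-60)   # strict cutoff: seeds for hmmbuild
candidates = select_hits(EXAMPLE_TBLOUT, 1e-2)  # permissive cutoff: final candidates
```

In the workflow described above, the seed set would next be aligned with MUSCLE and passed to hmmbuild before a second hmmsearch round; those steps are not reproduced here.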
NBS-encoding resistance genes are often associated with other domains such as TIR and CC in the
N-terminal regions or a variable number of LRRs in the C-terminal region. To detect TIR and LRR
domains that are contained within the NBS resistance candidate genes obtained above, we again conducted
PFAM searches. The raw TIR HMM (PF01582.15) and nine LRR HMMs (PF00560.28, PF07723.8,
PF07725.7, PF12799.2, PF13306.1, PF13516.1, PF13504.1, PF13855.1, PF14580.1) were downloaded
from the PFAM database (http://pfam.sanger.ac.uk/), and searched against the final gene sets of 631 (SAT),
489 (NIV), 450 (GLA), 476 (BAR), 392 (GLU), 416 (MER) and 307 (BRA) NBS-encoding proteins using
HMMER V3 (E-value ≤ 1e-2). Both TIR and LRR domains were validated using NCBI conserved domain
database and Multiple Expectation Maximization for Motif Elicitation (MEME) [95]. As reported
previously [96], PFAM analysis could not identify the CC motif in the N-terminal region, and thus we
detected the CC domain using ncoils with the default parameters [97]. To analyze the
chromosomal locations of NBS-encoding proteins, we performed BLAT searches of the other six rice
genomes against the SAT genome; 486 (99.38%; NIV), 441 (98.00%; GLA), 468 (98.32%; BAR),
384 (97.96%; GLU), and 408 (98.08%; MER) NBS-encoding genes were positioned on the chromosomes.
Chromosome 11 was the richest in NBS-encoding genes (NIV, 122; GLA, 112; BAR, 119; GLU,
94; MER, 114), while Chromosome 3 harbored the fewest, with only 11 (NIV), 15 (GLA), 19
(BAR), 13 (GLU), and 13 (MER) genes (Figure S16). Such an unequal distribution of NBS-encoding
genes among chromosomes is thus also observed in the Oryza AA-genomes, consistent with a phenomenon that is
ubiquitous in other plant genomes.
6.5.1.1 Pid3 gene
Resistance genes (R-genes) protect plants against pathogens by producing R proteins. The majority of
R proteins contain a nucleotide-binding site and a carboxy-terminal leucine-rich repeat domain [98]. Rice
blast disease, caused by the fungus Magnaporthe grisea, is one of the most serious diseases of rice.
Nearly all the cloned blast resistance genes encode NBS-LRR proteins. The large number of published
blast R genes has revealed the importance of the NBS-LRR gene family in the improvement of rice blast
disease resistance. Pid3 is a well-characterized blast R-gene that was first identified in the
indica variety Digu [99]. Using the protein sequence of Pid3 (FJ745364) as the query, we performed
BLAST searches against our five sequenced genomes and BRA, and identified six Pid3
orthologous gene sequences in total. To fully understand the molecular evolution of the Pid3 genes in the studied rice
species, we also downloaded and incorporated protein sequences of Pid3-Nipponbare (FJ773286) and
Pid3-9311 (FJ773285) for the analyses. Comparisons of domain structure and sequence similarities showed
that the protein sequences of the eight Pid3 genes are highly conserved, especially within the NBS domain
(Figure S17). Notably, the Pid3 gene in the japonica varieties (Pid3-Nipponbare) was identified as a
pseudogene due to a C to T nonsense mutation. This mutation leads to the production of a truncated 737 aa
protein with a disrupted LRR region (Figure S17). However, the mutation was not found in indica
(9311), BRA, or the five AA-genome species sequenced in this study. These results suggest
that the pseudogenization of Pid3 in japonica occurred after the divergence of indica and japonica, in
agreement with a previous study [99].
6.5.1.2 Pi-ta gene
Blast disease, caused by the fungus M. grisea, is one of the most serious diseases in rice. The
resistance gene, Pi-ta, protects rice against this pathogen, and encodes a predicted cytoplasmic receptor
protein with a nucleotide-binding site (NBS) and a leucine-rich domain (LRD). Pi-ta was found as a
single-copy gene and first cloned from O. sativa cv. Yashiromochi [100], which induced defense responses
against the blast fungus carrying the virulence gene AVR-Pita. The physical interaction between Pi-ta and
AVR-Pita was dependent on a single amino acid at the 918th codon position located in the LRD of Pi-ta
[100]. When alanine is the amino acid at this position, the binding of Pi-ta to AVR-Pita induces a
hypersensitive response, whereas when serine is at this position, Pi-ta cannot bind AVR-Pita and O.
sativa is susceptible to the blast fungus. Here, using BLAST searches with the Pi-ta gene
reported previously [100], we identified five Pi-ta genes from the five sequenced rice genomes. The
detailed domain structure and amino acid variation showed overall high sequence conservation at the Pi-ta
locus (Figure S18); only 20 polymorphic sites (eight in the N-terminal region, four in the NBS domain,
and eight in the leucine-rich region) were identified. At position 918, serine was present in O. sativa cv.
Nipponbare and all other species except NIV and O. sativa cv. Yashiromochi. Similarly, isoleucine was retained
at position 6 in all except NIV and O. sativa cv. Yashiromochi. These observations suggest that the
substitution from isoleucine to serine at position 6 and the subsequent replacement of serine with alanine at
position 918 may have produced the Pi-ta protein that recognizes AVR-Pita in the NIV and Yashiromochi
variants. We sampled ten representative NIV individuals with sequencing depths of > 10X [101], performed
read mapping, and confirmed that our findings for Pid3 and Pi-ta still hold at the population level
(Dataset S8). Our results further support findings in O. rufipogon and other related species [102, 103].
6.5.2 WRKY gene family
We systematically identified the WRKY genes in the seven Oryza genomes by adopting a similar
reiterative procedure as described in Xu et al. [101]. The protein sequences and annotation files of BRA
and SAT were downloaded from Gramene (http://www.gramene.org) and Rice Genome Annotation Project
(http://rice.plantbiology.msu.edu/index.shtml), respectively. The predicted proteins from the seven rice
genomes were screened using HMMER (Version 3.0) [93] against the raw Hidden Markov Model (HMM),
corresponding to the PFAM WRKY family (PF03106.10; http://pfam.sanger.ac.uk/). The analysis using the
raw HMM of the WRKY domain resulted in 148 (SAT), 94 (NIV), 98 (GLA), 83 (BAR), 97 (GLU), 91
(MER), and 94 (BRA) candidates. Of these, a high-quality protein set (E-value < 1E-20) was selected and the
WRKY domain peptide sequences were aligned using MUSCLE [94]. The alignment output was
used to build the seven rice species-specific WRKY HMMs using the module "hmmbuild". Scanning
with these new rice-specific models against the seven rice peptide databases with a cutoff E-value of 1e-10,
we identified 128 (SAT), 92 (NIV), 88 (GLA), 81 (BAR), 93 (GLU), 83 (MER) and 88 (BRA) WRKY
genes in total. The presence of the WRKY domains in the protein structure was validated using the NCBI
conserved domain database (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) and Multiple
Expectation Maximization for Motif Elicitation (MEME) [95]. The 129 WRKY gene models in the SAT
genome were also collected from the Plant Transcription Factor Database (PlnTFDB v3.0,
http://plntfdb.bio.uni-potsdam.de/v3.0/). Of them, we characterized 127 (98.5%) by the above-mentioned
procedure, indicating that it can be suitably employed to identify the WRKY genes in other plant species.
To compare with other plant species, we directly obtained the WRKY genes of the Arabidopsis
thaliana, Zea mays and Sorghum bicolor genomes from PlnTFDB, and classified them into three
groups based on the number of WRKY domains and the pattern of the zinc-finger motif. Briefly,
Group I proteins typically contain two WRKY domains including a C2H2 motif, Group II proteins have a
single WRKY domain and a C2H2 zinc-finger motif, and Group III proteins also possess a single WRKY
domain but with a C2-H-C zinc-finger-like motif. The number of WRKY genes (~93) in the rice species is
comparable to that in Sorghum bicolor (~94) and A. thaliana (~88) but is much lower than in Z. mays (~202) (Table
S31).
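The three-group classification rule stated above can be written as a small decision function. This is a minimal Python sketch of that rule only; the function name `classify_wrky` and the string labels for the zinc-finger types are illustrative assumptions, and real classification would start from domain annotations rather than pre-counted inputs.

```python
def classify_wrky(n_wrky_domains, zinc_finger):
    """Assign a WRKY protein to Group I, II or III following the rules in the text.

    n_wrky_domains: number of WRKY domains detected in the protein.
    zinc_finger: zinc-finger motif type, "C2H2" or "C2HC" (written C2-H-C in the text).
    Returns None when the combination is not covered by the three rules.
    """
    if n_wrky_domains == 2 and zinc_finger == "C2H2":
        return "I"
    if n_wrky_domains == 1 and zinc_finger == "C2H2":
        return "II"
    if n_wrky_domains == 1 and zinc_finger == "C2HC":
        return "III"
    return None
```

For example, `classify_wrky(1, "C2HC")` returns `"III"`, while a combination outside the stated rules returns `None` so that it can be inspected manually.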
6.5.3 MADS-box gene family
A total of 77 previously identified rice MADS-box protein sequences [104] were obtained and aligned
using MUSCLE [94]. The multiple alignment result was then used to build a MADS-box Hidden Markov
Model with the HMMER program [93]. Using this model, the amino acid sequences of the seven Oryza
genomes were searched using the hmmsearch command. A total of 70 (NIV), 73 (GLA), 72 (BAR), 72
(GLU), 70 (MER) and 60 (BRA) MADS-box proteins were obtained with an E-value
threshold of 0.1, as recommended by the HMMER user's guide. All candidate MADS-box proteins were
validated using NCBI conserved domains and multiple expectation maximization for motif elicitation
(MEME). In order to classify the Oryza MADS-box genes, we aligned MADS-box gene sequences against
the previously reported rice MADS-box sequences [104] (Table S32).
Supplemental Section S7 –Identification of Gene Loss in the AA-genome Oryza species
7.1 Loss of gene families across the AA-genome Oryza species
We estimated that the MRCA of the seven Oryza species contained ~20,873 genes on the basis of all of
the inferred family sizes at the root of the tree (Dataset S5). Indeed, many families that experienced
contractions were completely lost along one or more branches of the phylogenetic tree. These "extinctions"
occurred on almost every branch of the tree, and include genes involved in a wide variety of biological
functions. In total, 6,987 families inferred to have been present in the MRCA have no members in
at least one extant genome. The most common functional categories of extinct families involve protein
kinase, regulation of transcription, zinc finger, pentatricopeptide repeat, and binding-related functions (but note that the
function of 40.2% was categorized as ambiguous or unknown). A complete list of extinct families is given
in Dataset S9. We found that 798 families were present in the MRCA but appear entirely lost in the SAT
genome; of them, 431 are still present in NIV. A similar phenomenon was also observed in other species
(GLA: 401; BAR: 436; GLU: 424; MER: 438; BRA: 798) with an average estimate of 488.
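The notion of a family that was present in the MRCA but is absent from an extant genome can be sketched as a simple filter over a table of family sizes. The Python snippet below is illustrative only: the function name `lost_families`, the family identifiers and all counts are invented, and the inferred MRCA sizes are assumed to be available as a separate mapping.

```python
def lost_families(family_sizes, mrca_sizes, species):
    """Families with an inferred MRCA size > 0 but zero copies in `species`."""
    return sorted(fam for fam, counts in family_sizes.items()
                  if mrca_sizes.get(fam, 0) > 0 and counts.get(species, 0) == 0)

# Hypothetical toy data: family IDs and counts are invented for illustration.
family_sizes = {
    "GF_a": {"SAT": 0, "NIV": 2, "BRA": 1},
    "GF_b": {"SAT": 3, "NIV": 3, "BRA": 0},
    "GF_c": {"SAT": 0, "NIV": 0, "BRA": 0},
}
mrca_sizes = {"GF_a": 1, "GF_b": 2, "GF_c": 0}

lost_in_sat = lost_families(family_sizes, mrca_sizes, "SAT")
# Subset of the SAT losses that still have members in NIV.
lost_in_sat_kept_in_niv = [f for f in lost_in_sat if family_sizes[f]["NIV"] > 0]
```

Families with an inferred MRCA size of zero (GF_c in the toy data) are not counted as losses, which mirrors the restriction to families inferred to have been present in the MRCA.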
7.2 Identification of gene loss events and/or novel genes in SAT based on WGS reads mapping
To identify gene loss events and/or novel genes in SAT, we adopted a previously
reported methodology with some modifications [101]. We first used SOAP2 (version 2.21) [105] to align all
filtered short reads to the SAT reference genome, allowing two mismatches and a minimum read length of
35 bp. After read mapping, we assembled the unmapped reads into contigs with SOAPdenovo
for each species [9] using the default parameters. Contigs shorter than 2 kb and redundant
sequences identified by a self-alignment method were excluded from further analyses. In sum, we identified
3,165 contigs with a total length of 8.23 Mb for all six species. We then BLASTed all these
candidate "novel" contigs against the SAT genome to search for homologous sequences. We found that
1,067 (33.7%) contigs had homologous sequences in the reference genome
with coverage >30% and identity >80%, indicating that these contigs very likely derive
from diverged homologs of reference sequences, which would explain the mapping difficulties. The remaining
2,098 contigs were either truly novel sequences or located in non-assembled heterochromatic regions. The
average GC content of these 5.64 Mb sequences was 41.5%, which is comparable to the GC content of the
genome (43.5%). We conducted de novo gene annotation with AUGUSTUS for these 2,098 contigs and
annotated a total of 823 putative novel genes. We next compared them with our above-annotated gene set
using BLASTN and retained genes whose coverage and identity both were approximately equal to 100%.
Finally, we confirmed that 490 de novo genes were lost in the SAT genome, which is in strong support of
the estimation of gene loss in SAT as identified above based on the entire loss of gene families across the
AA-genome Oryza species (Table S33). We further performed functional annotation of the 490 proteins by
using InterProScan as described above. The average gene length is substantially shorter than that
estimated from the whole genome (957 bp versus 2,300 bp), indicating that many of the annotated genes
may not be intact. Of the 490 genes, 163 (33.3%) could be functionally annotated. The most common
domains in these novel genes were NB-ARC (IPR002182), Tyrosine-protein kinase (IPR020635)
and Leucine-rich repeat (IPR013210) (Dataset S10). These terms are noteworthy, as previous work has
uncovered evidence for the evolution of truly de novo proteins with the same functions [106]. We
tested for enrichment using the hypergeometric test in R: phyper (q=47, m=445, n=40747, k=490,
lower.tail=FALSE, log.p=FALSE). Here, m represents the number of disease-related genes (445), n
indicates the number of remaining SAT genes after excluding disease-related genes (41192-445 =
40747), k represents the number of sampled genes (490), and q is the number of
disease-related genes among the sampled genes (47). We obtained a P-value of 4.473738e-31,
indicating a significant enrichment of disease-related genes among the 490 novel genes that were lost in
the SAT genome.
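For readers without R, the upper-tail hypergeometric probability used in the phyper call can be computed directly from binomial coefficients. The following Python sketch uses only the standard library and the parameter values given in the text; the function name `hypergeom_upper_tail` is hypothetical, and the snippet is meant as a cross-check of the definition rather than a reproduction of the original analysis.

```python
from fractions import Fraction
from math import comb

def hypergeom_upper_tail(q, m, n, k):
    """P(X > q) for X ~ Hypergeometric(m successes, n failures, k draws).

    Follows the convention of R's phyper(q, m, n, k, lower.tail=FALSE),
    computed exactly with integer binomial coefficients and converted to
    float only at the end.
    """
    total = comb(m + n, k)
    tail = sum(comb(m, x) * comb(n, k - x) for x in range(q + 1, min(k, m) + 1))
    return float(Fraction(tail, total))

# Values from the text: 445 disease-related genes, 40,747 other SAT genes,
# 490 sampled genes, 47 of which are disease-related.
p_value = hypergeom_upper_tail(q=47, m=445, n=40747, k=490)
print(f"P = {p_value:.3e}")
```

Note that with lower.tail=FALSE the probability is for strictly more than q disease-related genes; to test for q or more, q - 1 would be passed instead.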
Supplemental Section S8 –Gain and Loss of Agronomically Important Genes across the
AA-genome Oryza Species
8.1 Computational identification of the gain and loss of agronomically important genes using whole
genome reads mapping
The gain and loss of functionally important genes across the genomes have attracted considerable
attention, since they may be related to the adaptation, divergence and speciation of plants. The high-quality
SAT genome, together with the five sequenced AA-genome sequences, provides an opportunity to shape a
picture of gene gain and loss across the AA-genome Oryza species and should provide novel perceptions
into their evolutionary consequences. To identify patterns of gene gain and loss in the AA-genome Oryza
species, we collected a total of thirty-one agronomically important genes that have been functionally
well-characterized in rice (Table S34). We combined the following two methods to detect their
presence or absence in different species. First, we used SOAPaligner with default parameters
(http://soap.genomics.org.cn/soapaligner.html) to map ~30× Illumina paired-end reads of the five
AA-genome Oryza species to these genes from SAT. To reflect a full evolutionary history of
protein-coding genes and reveal evolutionary consequences within all eight AA-genome Oryza species,
besides the five sequenced species, we obtained ~30× Illumina reads from the IND, O. sativa ssp. tropical
japonica (abbreviated as TRJ), O. rufipogon (abbreviated as RUF) and O. longistaminata (abbreviated as
LON) genomes, and mapped them to the SAT genome for comparison. We used Perl scripts to
filter reads and then performed reference-guided assembly. After calculating the lengths of consensus
sequences, we computed consensus coverage (consensus coverage% = consensus length/gene length ×
100). We considered a gene to be present in a target genome if consensus sequence coverage was ≥ 40%.
Second, we ran the genblastA software using these candidate genes of the SAT genome as query sequences
and searched them separately against all contigs of the five AA-genomes to identify homologs [107]. Based on
the output gene coverage, we regarded a gene as present in a target genome if gene coverage was ≥
40%. The combined analyses yielded patterns of gain and loss for the screened genes (Table
S35). Of these, we found that four genes, GW5, PROG1, S5 and SaF, exhibited a pattern of
gain and loss among the AA-genome Oryza species.
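The presence/absence call described above reduces to a coverage percentage compared against a 40% threshold. A minimal Python sketch of that calculation follows; the function names and the example consensus lengths are hypothetical, and how the read-mapping and genblastA calls were combined is not specified in the text, so the sketch treats a single method only.

```python
PRESENCE_THRESHOLD = 40.0  # percent, as stated in the text

def consensus_coverage(consensus_length, gene_length):
    """Consensus coverage (%) = consensus length / gene length x 100."""
    if gene_length <= 0:
        raise ValueError("gene_length must be positive")
    return consensus_length / gene_length * 100

def is_present(coverage_percent, threshold=PRESENCE_THRESHOLD):
    """A gene is called present when coverage is at least the threshold."""
    return coverage_percent >= threshold

# Hypothetical example: a 504-bp gene with 180 bp or 260 bp of consensus sequence.
low = consensus_coverage(180, 504)
high = consensus_coverage(260, 504)
```

Here `is_present(low)` is false (about 35.7% coverage) and `is_present(high)` is true (about 51.6% coverage).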
8.2 Evolutionary dynamics of rice speciation genes
The question of how two species originate from one has fascinated biologists since before Darwin's
iconic treatise on the subject [108]. The AA-genome Oryza species, including the widely cultivated Asian
rice O. sativa, provide a useful set of model species for studying many aspects of plant speciation. Many
speciation genes that underlie reproductive isolation (RI) in rice have been identified in recent years [109]. Because the
thirty-one agronomically important genes included five speciation genes (S5, SaM, SaF,
mtRPL27 and HWH1) of the six reported so far [109], we further investigated their evolutionary
dynamics across the six AA-genome species.
With nucleotide sequences of S5, SaM, SaF, mtRPL27 and HWH1 from the SAT genome, we first
performed BLAST searches against the five assembled rice genomes and the annotated gene set (Table
S11), and retrieved and aligned orthologous coding sequences using MEGA5 [60] (Figures S19 - S23). Of
them, S5 was found completely lost in African cultivated rice (GLA) and its wild progenitor (BAR).
Computational analyses of these five genes using whole genome reads mapping have validated the loss of
S5 in GLA and BAR (see Supplemental Section S8.1 for details) (Table S35). We further examined
expression patterns using RNA-Seq datasets from the four tissues of NIV, GLA, BAR, GLU and MER
(Tables S9 and S10). The loss of S5 in GLA and BAR was evidently supported by the absence of any
transcripts (Table S36).
Comparative transcriptome analysis revealed quite different patterns of transcription across the
AA-genome Oryza species (Table S36). mtRPL27 encodes a nuclear-encoded mitochondrial ribosomal protein
L27 that was previously reported to cause hybrid sterility between O. sativa and O. glumaepatula [110]. Here
we found that this gene is functional, as evidenced by high levels of expression in all four tissues from
these five rice species. HWH1 encodes a GMC oxidoreductase that causes hybrid necrosis in
inter-subspecific hybrids of O. sativa [111]. HWH1 also appeared to be highly expressed in all four tissues; SaF was
highly expressed in roots as well. Although the above-described analyses suggested that S5 is
present in the four rice species other than GLA and BAR, we found that it was only expressed in the
panicle of NIV and leaf of MER. Meanwhile, we failed to detect the expression of SaM in NIV and GLA,
although it is expressed in BAR, GLU and MER. The expression profiles of these two genes (S5 and
SaM) may differ from one sampled individual to another of the same rice species, or be affected by
tissue-specific expression. The S5 and Sa loci (SaF and SaM) can cause female and male sterility in O. sativa
subsp. indica-japonica hybrids, respectively [112, 113]. Thus, the recent loss of S5 in GLA and
BAR suggests that the rapid evolutionary dynamics of rice speciation genes may have influenced
the formation of reproductive barriers among rice species. Together with the different nucleotide
substitution rates and the occurrences of genomic structural variation in these
speciation genes (Figures S19 - S23), our results suggest that their rapid evolutionary dynamics may
have contributed to the hybridization across the AA-genome rice species at inter-subspecific or
inter-specific levels. It has been proposed that these reproduction-related genes are quite variable between
different species, subspecies, populations and phenotypes, and encompass a collection of divergent
systems. Further efforts should ideally combine experimental confirmation, phenotypic characterization,
functional experiments and population genetic analyses, based on extensive genetic and geographical samples,
to convincingly validate the evolutionary dynamics of these speciation genes and RI-inducing alleles
and examine their effects on RI in AA-genome Oryza species.
8.3 Computational validation and experimental confirmation of the PROG1 gene in the eight
AA-genome Oryza species
The PROG1 gene controls several important agronomic traits of prostrate growth, greater grain
number and higher grain yield in Asian cultivated rice. The gene encodes a single Cys2-His2 zinc-finger
protein and is located on the chromosome 7 with the length of 504 bp. PROG1 variants identified in O.
sativa disrupt the PROG1 function and inactivate prog1 expression, leading to erect growth, greater grain
number and higher grain yield in cultivated rice [68, 69]. Of the five genes obtained above that exhibit
gain and loss, we selected PROG1 as an example to further validate its presence and/or absence in the six
AA-genome Oryza species. First, we used total Illumina reads (including paired-end, mate-pair and single
reads) for each of these species (NIV: 72.78×; GLA: 55.96×; BAR: 51.11×; GLU: 86.19×; MER: 60.16×)
(Table S2), and mapped them to the PROG1 gene from the SAT genome. We also obtained 76.24×, 55.25×, 138.33×
and 70.63× Illumina reads from the IND, TRJ, RUF and LON genomes, respectively, and mapped them to the
SAT genome for comparison. The results showed that BAR, GLA and MER had few reads mapped to
the PROG1 gene of SAT. Most of these reads were single reads mapped to specific regions, which
leads us to assume that they are chance matches to homologous sequences unrelated to the candidate
gene (Table S37; Figure S24). Second, we directly adopted previously published primer pairs (F (5'-3'): atcatgattcgcagcttgca; R
(5'-3'): cggattcggaaataactagc) [69], performed PCR amplifications, and then
sequenced the products from IND, TRJ, NIV, RUF, GLA, BAR, GLU, LON and MER using an ABI-3730 sequencer.
The results validated the loss of the PROG1 gene in BAR, GLA and MER (Figure S25a). Third,
we retrieved 58,327 bp, 61,289 bp, 57,706 bp, 56,012 bp, 58,958 bp and 58,857 bp of the PROG1
orthologous genomic regions from the SAT, NIV, GLA, BAR, GLU and MER genomes, respectively. The
detailed sequence comparisons and annotation demonstrated the loss of the PROG1 gene in BAR, GLA
and MER (Figure S25b). Finally, we aligned the orthologous PROG1 nucleotide and amino acid sequences
of SAT, IND, TRJ, NIV, RUF, LON and GLU (Figure S25c).
Supplemental Section S9–Molecular Evolution of Protein-coding Genes
9.1 Accelerated evolution of protein-coding genes
We investigated the molecular evolution of protein-coding genes on the set of high confidence 1:1
orthologous genes (see Supplemental Section S3.1 for details) in the six AA-genome Oryza species. All
the 1:1 orthologs (orthologous families) were further aligned by ClustalW [39]. Multiple alignments 1) that
had frame-shift indels, unless compensated within 15 bp; 2) that had CDS whose lengths were not
multiples of three; and 3) that had in-frame stop codons, were discarded, leaving a total of 2,272
orthologous families for the detection of positive selection and evolutionary rate analyses.
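Two of the three alignment filters listed above (CDS length not a multiple of three, in-frame stop codons) can be checked per sequence with a few lines of code. The Python sketch below is a simplified illustration: the function names are hypothetical, a terminal stop codon is assumed to be acceptable, and the frame-shift indel filter (compensation within 15 bp) is not implemented because it requires the alignment columns.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def passes_cds_filters(cds):
    """Check one ungapped CDS against two of the three filters in the text.

    Rejects sequences whose length is not a multiple of three or that carry an
    in-frame stop codon before the final codon. The frame-shift indel filter
    (compensation within 15 bp) needs the alignment itself and is not covered.
    """
    seq = cds.upper()
    if len(seq) == 0 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Assumption: a stop codon is tolerated only as the last codon.
    return not any(codon in STOP_CODONS for codon in codons[:-1])

def keep_family(cds_by_species):
    """Keep an orthologous family only if every member passes the filters."""
    return all(passes_cds_filters(cds) for cds in cds_by_species.values())
```

A family is discarded as soon as any one of its members fails, which matches the all-or-nothing treatment of multiple alignments described above.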
We analyzed this set of well-aligned 1:1 orthologous gene families to obtain and compare the average
evolutionary rates of protein-coding genes along lineages and clades of the six-species phylogeny. Out of
the 2,272 alignments, 200 were selected randomly and concatenated. Branch-specific ω values
(nonsynonymous-synonymous rate ratio, dN/dS) were estimated using the codeml [114] program in the
PAML software (version 4.4) [115]. Two models, with 1) a single ω estimated for all branches and 2)
a branch-specific ω for each branch, were compared in a likelihood ratio test (LRT). The experiment was
replicated 10 times. The detailed tree structure files for each LRT are given in the "Performance Notes"
at the end of this section.
The assumption of a single ω among all branches was rejected in all LRTs with 10 degrees of freedom
at a highly significant level (P << 0.01). We observed that ω values varied among different Oryza lineages:
along the terminal branches, GLA and BAR had the largest ω estimates, MER showed the smallest, while
the remainder exhibited intermediate values. Compared to MER, the increased estimates of ω in SAT, NIV,
and GLU may primarily result from relaxed selective constraints, probably owing to reduced effective
population sizes or codon biases (Table S38; Figures S26 and S27a).
9.2 Genome-wide scan for positive selection
Positive selection was detected using the widely employed codon-based substitution
models and LRTs implemented in the program PAML version 4.4 [116]. In all analyses, codon
frequencies were estimated from nucleotide frequencies at each codon position (model F3×4).
To identify genes under positive selection, we ran codeml with a series of different LRTs to
evaluate the ratio of non-synonymous (dN) to synonymous (dS) changes at each codon or on particular
branches or clades of interest in the AA-genome Oryza phylogeny for each orthologous family. Briefly,
our LRT for selection on any branch of the phylogenetic tree compared site model M1a (nearly
neutral) against M2a (selection) [117], while the branch/clade-specific LRTs were based on branch-site
models [118, 119], which compared the modified model A with the corresponding null model with ω2 = 1
fixed (fix_omega = 1 and omega = 1). In the M1a-M2a comparison, df = 2 degrees of freedom were used,
with critical values of 5.99 and 9.21 at the 5% and 1% significance levels, respectively. For the lineage-
and clade-specific LRTs, P-values were computed assuming that the null distribution was a 50:50 mixture
of a point mass at 0 and χ² with df = 1. Sites under positive selection were identified using the Bayes
empirical Bayes (BEB) approach [120] to calculate the posterior probabilities for site classes. Multiple comparisons
were corrected following the method of Benjamini and Hochberg [121] to estimate the appropriate
P-value threshold for a false discovery rate (FDR) of < 0.05. The phylogenetic relationships were
determined based on the tree constructed above (see Supplemental Section S3.2 for details). In particular,
we tested for selection on any branch of the tree (Figure S27a); on each of the six individual branches
within the AA-genome clade (Figure S27b-e, h-i); on the branch leading to the Asian cultivated/wild rice
(Figure S27f); and on the branch leading to the African cultivated/wild rice (Figure S27g).
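The statistical steps described in this section (the LRT statistic, its P-value under a χ² or 50:50 mixture null, and the Benjamini-Hochberg correction) can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the scripts used in the study: all function names are hypothetical, the χ² tail is implemented in closed form for even degrees of freedom only (which covers df = 2 and df = 10), and log-likelihood values would come from codeml output.

```python
from math import erfc, exp, factorial, sqrt

def lrt_statistic(lnl_null, lnl_alt):
    """Likelihood ratio statistic 2 * (lnL_alt - lnL_null), floored at zero."""
    return max(0.0, 2 * (lnl_alt - lnl_null))

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable for even df (closed form)."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2
    return exp(-half) * sum(half ** i / factorial(i) for i in range(df // 2))

def branch_site_p(stat):
    """P-value under a 50:50 mixture of a point mass at 0 and chi-square (df = 1)."""
    if stat <= 0:
        return 1.0
    return 0.5 * erfc(sqrt(stat / 2))

def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking tests rejected at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Largest rank whose P-value lies under the BH line rank/m * fdr.
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected
```

With df = 2, `chi2_sf_even_df(5.99, 2)` and `chi2_sf_even_df(9.21, 2)` are close to 0.05 and 0.01, in line with the critical values quoted for the M1a-M2a comparison. The mixture null halves the χ² (df = 1) tail probability, which makes the branch-site test less conservative than a plain χ² test with one degree of freedom.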
Applying the likelihood method and a significance level of 5% (FDR < 0.05) to all
2,272 high-confidence orthologous families, we found a total of 537 non-redundant positively selected
genes (PSGs) across all tests (Table S39). These included 268 in the site model tests for all branches (Figure S27a)
and 20 (the ancestral Asian rice branch) to 234 (MER) in branch/clade-specific LRTs (Figure S27b-i). It is
likely that the detection power of the same method differs among the closely related species under
study. Several factors may account for the wide range of numbers along different branches, such
as their divergence times, expression levels, recombination rates, and effective population sizes. Population
genetic theory predicts an effect of a change in effective population size on the nucleotide
substitution rate [122]. However, the role of effective population size in natural selection
remains controversial [123, 124]. In our observation, the smaller numbers in cultivated rice when
compared with their wild progenitors (64 in SAT and 40 in GLA versus 122 in NIV and 74 in BAR,
respectively), is especially consistent with the positive relationship between effective population size and
selection. The candidate PSGs and detailed results of LRTs were listed in Dataset S11. To date, several
genome-wide scans for positive selection have been conducted in various organisms based on phylogenetic
or population genetic methods [125-127]. These results widely vary among the studied species, with the
evidence of extensive signs of adaptive evolution in Drosophila [128], rodents [129], and sunflowers [130],
but lower levels of positive selection in hominids [131, 132] and Arabidopsis species [133-135]. Recent
systematic analyses of positive selection revealed limited adaptive divergence in plants [123, 136, 137].
Our genome-wide survey of 2,272 high-confidence orthologous families showed a large proportion (23.6%,
537) of candidate PSGs in the Oryza phylogeny; however, because of the stringent filtering criteria we adopted in
the identification of orthologs, the level of rice adaptive evolution is likely to be underestimated.
We examined levels of expression of the PSGs that were detected by branch-specific LRTs using RNA-Seq
datasets from the four tissues of NIV, GLA, BAR, GLU and MER (Tables S9 and S10). Note that the SAT
gene expression data were downloaded from the MSU Rice Genome Annotation Project Database
(http://rice.plantbiology.msu.edu/expression.shtml). We found that most PSGs were expressed at lower
levels than non-PSGs (Figure S28), which is consistent with previous studies [138-140]. Nevertheless,
the detected genes or sites targeted by selection will provide more data and opportunities for further
functional and evolutionary analyses.
9.3 Gene function classification
Each of the identified PSGs (FDR < 0.05) was assigned to categories from Gene Ontology (GO) [141]
and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases [142]. The enrichment analyses of
functional annotation were performed by using agriGO [143]. Fisher's exact tests were adopted to measure
the gene-enrichment in annotation terms. The detailed functional information of these PSGs is given in
Dataset S12.
9.3.1 Functional analyses of positively selected genes
The identified PSGs were enriched in various GO functional categories. For all non-redundant PSGs
detected in at least one test, a group of significantly enriched categories was associated with developmental
processes, such as "anatomical structure development", "post-embryonic development", and "multicellular
organismal development" (Table S40). Additionally, we found that 97 genes involved in "response to
stimulus" and 47 in "reproduction" categories showed evidence of positive selection, proportions that were
higher than the genomic background levels (Figure S29). The PSGs identified by the branch- and clade-specific
LRTs displayed different patterns of functional classification and chromosomal distribution (Tables S40
and S41). For example, the over-represented functional categories varied considerably between the two clades
of Asian and African cultivated rice and their respective wild progenitors; the former was enriched for many
categories related to biological developmental process, reproductive process, and response to stimulus, while
the latter mainly included categories relevant to stress resistance. Several genome-wide studies confirmed that
positively selected genes were more likely to be associated with reproduction and immune defense in
humans and other mammals [140, 144]. A systematic study in higher plants also reported that more positive
selection was detected in self-incompatibility loci, pathogen-defense genes, and genes involved in adaptation
to specific environments such as cold acclimation [136]. In agreement with these findings, such
categories were also significantly over-represented in our dataset, with many candidate PSGs involved in
flower development, embryogenesis, reproduction and resistance-related processes.
9.3.2 Flower development and reproduction
Of the predicted PSGs, we detected a number of genes involved in biological processes related to
flower development and reproduction (Tables S40 – S42). A total of 14 genes are
specifically homologous to flowering-related genes and may be involved in the flowering pathway
and flower development (Table S42). For example, CRY2, LHY, and PFT1 may affect
photomorphogenesis to regulate flowering time; VIL1, ASHH2, MOS3/SAR3, and PEP may act to
regulate FLOWERING LOCUS C (FLC) and FLOWERING LOCUS M (FLM) to control flowering time; HEN1
and XTH may mediate floral organ development. Moreover, several PSGs likely participate in
gametogenesis and reproduction processes; for example, PG may be essential for pollen grain development, and
PDIL, TAF6, TMS1 and SEC5 are probably associated with pollen tube growth. It is likely that the selection
pressure on these reproduction- or fertility-related genes reflects arms races or adaptations to
environments after speciation in particular lineages of the studied AA-genome Oryza species.
9.3.3 Stress resistance
Similar to the enrichment for genes related to immunity and defense in animals, the identified PSGs
include many genes involved in resistance to pathogens or environmental stresses. Some of the
over-represented functional categories, such as protein degradation, nucleotide binding, and
oxidoreductase activity, were related to stress resistance. We identified seven PSGs that participate in
the protein ubiquitination pathway (Figure S30), a signal for 26S proteasome-dependent protein
degradation. Moreover, LOC_Os01g48390, LOC_Os03g61060, LOC_Os06g34040, LOC_Os03g53720,
LOC_Os10g01060, LOC_Os03g43850, LOC_Os12g08180, LOC_Os12g13170, LOC_Os11g16280,
LOC_Os01g70330, and LOC_Os04g40080 were assigned to categories of response to pathogens or abiotic
stress. Additionally, we identified many PSGs containing zinc finger domains, pentatricopeptide repeat
domains, F-box domains, leucine-rich repeat domains, or WD domains.
Performance Notes:
Detecting positive selection on any branch of the tree (site model- NSsites 0 1 2):
unrooted tree: (((SAT, NIV), (GLA, BAR)), GLU, MER);
Detecting positive selection on each of the six individual branches within the AA-genome clade
(branch-site model):
SAT-specific: unrooted tree: (((SAT #1, NIV), (GLA, BAR)), GLU, MER);
NIV-specific: (((SAT, NIV #1), (GLA, BAR)), GLU, MER)
GLA-specific: (((SAT, NIV), (GLA #1, BAR)), GLU, MER)
BAR-specific: (((SAT, NIV), (GLA, BAR #1)), GLU, MER)
GLU-specific: (((SAT, NIV), (GLA, BAR)), GLU #1, MER)
MER-specific: (((SAT, NIV), (GLA, BAR)), GLU, MER #1)
on the ancestral Asian rice:
(((SAT, NIV) #1, (GLA, BAR)), GLU, MER);
on the ancestral African rice:
(((SAT, NIV), (GLA, BAR) #1), GLU, MER);
on the clade of the Asian cultivated/wild rice:
(((SAT, NIV) $1, (GLA, BAR)), GLU, MER);
on the clade of the African cultivated/wild rice:
(((SAT, NIV), (GLA, BAR) $1), GLU, MER);
Estimating various rates among branches (branch-specific model- model = 0/model = 1):
unrooted tree: (((SAT, NIV), (GLA, BAR)), GLU, MER).
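The branch-labelled trees listed above all derive from one unrooted topology, differing only in where the foreground mark is placed. As a rough sketch, the terminal-branch variants could be generated as follows; the helper name `mark_foreground` is hypothetical and this is not the script used in the study.

```python
import re

# Unrooted species tree as given in the notes above.
BASE_TREE = "(((SAT, NIV), (GLA, BAR)), GLU, MER);"


def mark_foreground(tree, taxon, label="#1"):
    """Return a copy of `tree` with `label` appended to one terminal taxon."""
    pattern = r"\b" + re.escape(taxon) + r"\b"
    marked, count = re.subn(pattern, lambda m: f"{m.group(0)} {label}", tree)
    if count != 1:
        raise ValueError(f"expected exactly one occurrence of {taxon!r}, found {count}")
    return marked


# One labelled tree per terminal branch of the six species.
for species in ("SAT", "NIV", "GLA", "BAR", "GLU", "MER"):
    print(f"{species}-specific: {mark_foreground(BASE_TREE, species)}")
```

Internal-branch and clade labels (the `#1` and `$1` marks placed after a closing parenthesis) are not covered by this helper and would need to be written by hand or with a tree library.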
Supplemental Section S10 –Evolutionary Analyses of Non-coding RNAs across the
AA-genome Oryza Species
Changes in the number and sequence of ncRNA genes across the genomes
of closely related plants have attracted considerable interest, since their origin and evolution may have played
an important role in driving the adaptation and divergence of species. The six AA-genome
sequences thus provide an opportunity to perform a comprehensive analysis of the evolutionary dynamics and
sequence divergence of ncRNA genes at the genomic scale. In particular, we investigated how
variation in the copy number and nucleotide substitution rates of important miRNA gene families may affect
phenotypic variation and important pathways that are relevant to ecological adaptation and developmental
processes.
10.1 Evolutionary dynamics of non-coding RNA genes among the AA-genome Oryza species
We compared changes in the number and sequence length of the tRNA, rRNA, snoRNA,
snRNA and miRNA genes across the six AA-genome Oryza species (Table S13; Figure S6). Our results
showed that the average lengths of the different types of ncRNA genes are largely conserved from one species
to another. Probably owing to the homology-based search method, the numbers of tRNA, snoRNA and
miRNA genes identified tended to decrease with increasing divergence time from the SAT genome. In
addition, the number of rRNA genes identified in the assembled genomes may be underestimated, partly
because genomic regions that harbor repeat sequences are difficult to assemble. It is likely that
the numbers of most ncRNA genes identified in our sequenced
genomes were lower than those in the SAT genome because of gaps in these draft genomes. However, this
explanation does not hold for snRNA genes, whose numbers still varied among the species.
The target genes of the miRNAs identified above were predicted separately for each AA-genome Oryza
species using the psRNATarget server [145] with default parameters. Overall, we found that 203, 125,
124, 133, 125 and 123 miRNA families had 2,433, 1,711, 1,803, 1,795, 1,678 and 1,567 predicted target
genes in the SAT, NIV, GLA, BAR, GLU and MER genomes, respectively (Dataset S1; Table S43).
Among these predicted target genes, 1,927, 1,052, 1,085, 1,037, 921 and 902 were expressed in at least one
of the four tissues of SAT, NIV, GLA, BAR, GLU and MER, respectively, as supported by RNA-Seq data
sets (Tables S9 and S10). Note that the SAT gene expression data were downloaded from the MSU Rice
Genome Annotation Project Database (http://rice.plantbiology.msu.edu/expression.shtml).
To examine functional enrichment of the predicted miRNA target genes, we subsequently
performed Gene Ontology (GO) analyses in the six AA-genome Oryza species, using the
whole set of protein-coding genes of the SAT genome as the background, with the agriGO server [146] (Figure
S31). Target genes assigned to the biological processes death (GO: 0016265),
cellular process (GO: 0009987) and response to stimulus (GO: 0050896), or to the molecular function
binding (GO: 0005488), were significantly enriched in all six AA-genome Oryza species.
In addition, several GO terms, including cellular component organization (GO: 0016043),
macromolecular complex (GO: 0032991), organelle (GO: 0043226) and
organelle part (GO: 0044422), were enriched in MER; of these, organelle (GO: 0043226) was also
over-represented in SAT and GLU, probably reflecting frequent gain and loss during the evolution
of AA-genome Oryza species (Figure S32).
10.2 Sequence evolution of non-coding RNA genes
To understand the molecular evolution of ncRNA genes, we calculated the rates of
nucleotide substitution across the six AA-genome Oryza species. Four steps were taken as follows:
first, the ncRNA genes in the SAT genome were annotated using the methods mentioned above (see
Supplemental Section S2.3 for details), and miRNA target sites in protein-coding genes were identified using
the psRNATarget server with default parameters [145], followed by a Perl script to map these target sites
onto the SAT genome; second, for the purpose of comparison, gene annotation data sets of the SAT
genome were downloaded from the TIGR database (ftp://ftp.tigr.org); third, all ncRNA genes, protein-coding
genes and intergenic genomic regions in SAT that remain at least 90% intact in the other species were
selected using the alignment of orthologous segments constructed by MERCATOR and MAVID
(see Supplemental Section S4.1 for details); finally, nucleotide substitution rates of ncRNA
genes, and of genomic components as a background, were estimated using

\[ \mathrm{divergence\_value} = \frac{2}{n(n-1)} \sum_{i<j} d_{ij}, \]

where d_{ij} is an estimate of the number of nucleotide substitutions per site between DNA sequences i and j,
and n is the number of the examined sequences (n = 5 in this study).
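The averaging over all sequence pairs described above can be illustrated with a short sketch. It assumes the pairwise distances are already available as a symmetric matrix; the function name and the toy distances are hypothetical and are not values from this study.

```python
from itertools import combinations


def mean_pairwise_divergence(d):
    """Average of d[i][j] over all unordered pairs i < j.

    d is a symmetric n x n matrix of substitutions per site.
    Equivalent to 2 / (n * (n - 1)) times the sum over i < j.
    """
    n = len(d)
    if n < 2:
        raise ValueError("at least two sequences are required")
    total = sum(d[i][j] for i, j in combinations(range(n), 2))
    return 2.0 * total / (n * (n - 1))


# Toy distance matrix for three sequences (illustrative values only).
toy = [
    [0.00, 0.01, 0.02],
    [0.01, 0.00, 0.03],
    [0.02, 0.03, 0.00],
]
print(mean_pairwise_divergence(toy))
```

The factor 2 / (n(n - 1)) is simply the reciprocal of the number of unordered pairs, so the statistic is the mean pairwise distance.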
On the basis of these analyses, we obtained nucleotide substitution rates of ncRNA genes in
comparison with the average genomic background across the AA-genome Oryza species (Table S44; Figure
S33). Nucleotide substitution rates of tRNA and snoRNA genes were very low, at merely 0.60% and
1.03%, respectively, suggesting that they are quite conserved in the AA-genome Oryza species.
SnRNA genes, however, exhibited a relatively high nucleotide substitution rate (2.61%), which might be
explained by the possibility that snRNA genes regulate protein-coding genes [147]. The nucleotide substitution
rates of miRNA genes, mature miRNA sequences and miRNA target sequences were 1.91%,
1.47% and 1.61%, respectively. Among them, nucleotide divergence was highest in miRNA genes
and lowest in mature miRNA sequences, indicating that mature miRNA sequences and miRNA target
sequences are more conserved than miRNA genes. This result also suggests that negative selection is
weaker on miRNA genes and binding sites than on protein-coding regions of genes, but stronger than on the
surrounding introns and regulatory sequences.
To provide deeper insight into the evolutionary behavior of ncRNA genes, we further calculated
nucleotide substitution rates of ncRNA genes and genomic components for each of the six AA-genome
Oryza species (Figure S34). The nucleotide substitution rates of ncRNA genes and genomic components
for each AA-genome Oryza species were estimated by

\[ \mathrm{divergence\_value} = \frac{1}{n-1} \sum_{i=1}^{n-1} d_{i}, \]

where d_{i} is an estimate of the number of nucleotide substitutions per site between this sequence and
sequence i, and n is the number of the examined sequences (n = 5 in this study). For every kind of ncRNA
gene, the individual species-based analysis confirmed, as expected, the overall patterns of sequence evolution
observed across the six AA-genome Oryza species (Figure S33).
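The per-species statistic above averages the distances between one focal sequence and each of the others. A minimal sketch follows, assuming a symmetric distance matrix and taking the number of comparisons to be the number of other sequences; the function name and the toy values are hypothetical.

```python
def per_sequence_divergence(d, focal):
    """Mean distance between sequence `focal` and every other sequence.

    d is a symmetric matrix of substitutions per site; the diagonal
    entry d[focal][focal] is excluded from the average.
    """
    others = [d[focal][i] for i in range(len(d)) if i != focal]
    if not others:
        raise ValueError("at least two sequences are required")
    return sum(others) / len(others)


# Toy distance matrix for three sequences (illustrative values only).
toy = [
    [0.00, 0.01, 0.02],
    [0.01, 0.00, 0.03],
    [0.02, 0.03, 0.00],
]
for idx in range(len(toy)):
    print(idx, per_sequence_divergence(toy, idx))
```

Averaging these per-sequence values over all focal sequences recovers the mean pairwise distance, which is why the species-level results are expected to mirror the overall pattern.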
We calculated and compared nucleotide substitutions of miRNA genes and their target sites across the
six AA-genome Oryza species. Most of the mature miRNAs and target sequences were conserved,
whereas a small portion of them were variable (Figure S35). We next performed Gene Ontology (GO)
annotation of the miRNA genes and their targets in the six AA-genome Oryza species,
using GrameneMart and the WEGO program [148, 149] (Figure S36). The results showed that the
majority of non-conserved miRNA genes were associated with transcription regulator, biological
regulation and rhythmic process, whereas non-conserved target sites showed no enrichment of any GO terms.
Furthermore, conserved miRNA genes were associated with synapse, synapse part, virion, virion part,
nutrient reservoir and locomotion, while conserved target sites were associated with virion, virion part,
nutrient reservoir and translation regulator.
10.3 Exemplar studies of important miRNA genes in AA-genome Oryza species
MiRNA genes have emerged as master regulators of plant growth and thus play an important
role in plant developmental processes and stress responses. We first analyzed and compared the well-known
miRNA gene families that are related to adaptation to drought (miRNA169) [150] and
phosphate starvation (miRNA399) [151, 152]. The miRNA169 family had 17, 7, 9, 7, 8 and 6 members in
SAT, NIV, GLA, BAR, GLU and MER, respectively (Table S43). Their precursor sequences are highly
similar to each other, with a total of four nucleotide substitutions, of which three were between MER
and the orthologs from the other five species; the mature miRNA sequences are almost identical, with only one
nucleotide difference between the MER sequence and the orthologs from the other five species
(Figure S37). For the miRNA399 family, there were 11, 5, 5, 3, 3 and 6 members in SAT, NIV, GLA, BAR,
GLU and MER, respectively (Table S43). The precursor sequences had seven nucleotide substitutions and
three indels between MER and the orthologs from the other five species, and the mature miRNA sequences
were almost identical, with only one nucleotide difference between the MER sequence and the orthologs
from the other five species (Figure S38).
We systematically surveyed the twenty-one miRNA gene families that are related to flower
development [153-159]. Interestingly, we found that they varied considerably in number among the six rice
species (Table S45). The results suggest that the expansion and contraction of some miRNA gene families
may be partly related to the differences in flower development and flowering time commonly observed
among the six AA-genome rice species. We randomly sampled four miRNA genes (osa-miR159a,
osa-miR159e, osa-miR159f, osa-miR1847) belonging to two miRNA gene families (miRNA159,
miRNA1847) that appear to have increased in copy number in one or more of the
five non-SAT species. The miRNA159 family had 6, 5, 7, 6, 6 and 6 members, while miRNA1847 had 1, 2,
0, 6, 7 and 4 members in SAT, NIV, GLA, BAR, GLU and MER, respectively. For osa-miR159a, the
precursor sequences had only two nucleotide substitutions among the six species and the mature miRNA
sequences are identical to each other (Figure S39a); for osa-miR159e, the precursor sequences had two
nucleotide substitutions and one indel among the six species and the mature miRNA sequences are
identical to each other (Figure S39b); for osa-miR159f, the precursor sequences had two nucleotide
substitutions and two indels among the six species and the mature miRNA sequences are identical to each
other (Figure S39d); for osa-miR1847, the precursor sequences had four nucleotide substitutions and two
indels among the six rice species and the mature miRNA sequences are identical to each other (Figure S39c).
In addition to the dynamic changes in copy number, the nucleotide mutations detected in certain
lineages of AA-genome Oryza species may be related to phenotypic variation, including flower
development and adaptation to drought and phosphate starvation.
References
1.
Doyle JJ: A rapid DNA isolation procedure for small quantities of fresh leaf tissue.
Phytochem Bull 1987, 19:11-15.
2.
Otto FJ: Preparation and staining of cells for high-resolution DNA analysis. In:
Flow cytometry and cell sorting. Springer; 1992: 65-68.
3.
Doležel J, Göhde W: Sex determination in dioecious plants Melandrium album and M.
rubrum using high-resolution flow cytometry. Cytometry 1995, 19(2):103-106.
4.
Cavalier-Smith T: Cell volume and the evolution of eukaryotic genome size. The
Evolution of Genome Size 1985:105-184.
5.
Matsumoto T, Wu JZ, Kanamori H, Katayose Y, Fujisawa M, Namiki N, Mizuno H,
Yamamoto K, Antonio BA, Baba T: The map-based sequence of the rice genome.
Nature 2005, 436(7052):793-800.
6.
Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q, Cai Q, Li B, Bai Y: The sequence
and de novo assembly of the giant panda genome. Nature 2009,
463(7279):311-317.
7.
Varshney RK, Chen W, Li Y, Bharti AK, Saxena RK, Schlueter JA, Donoghue MTA,
Azam S, Fan G, Whaley AM: Draft genome sequence of pigeonpea (Cajanus cajan),
an orphan legume crop of resource-poor farmers. Nature Biotechnology 2012,
30(1):83-89.
8.
Bennetzen JL, Schmutz J, Wang H, Percifield R, Hawkins J, Pontaroli AC, Estep M,
Feng L, Vaughn JN, Grimwood J: Reference genome sequence of the model plant
Setaria. Nature Biotechnology 2012, 30(6):555-561.
9.
Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K: De
novo assembly of human genomes with massively parallel short read sequencing.
Genome Research 2010, 20(2):265-272.
10.
Koren S, Treangen TJ, Pop M: Bambus 2: scaffolding metagenomes. Bioinformatics
2011, 27(21):2964-2971.
11.
Kawahara Y, de la Bastide M, Hamilton JP, Kanamori H, McCombie WR, Ouyang S,
Schwartz DC, Tanaka T, Wu J, Zhou S: Improvement of the Oryza sativa
Nipponbare reference genome using next generation sequence and optical map
data. Rice 2013, 6(1):4.
12.
Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL:
Versatile and open software for comparing large genomes. Genome Biology 2004,
5(2):R12.
13.
Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller
W: Human-mouse alignments with BLASTZ. Genome Research 2003,
13(1):103-107.
14.
Smit AFA, Hubley R, Green P: RepeatMasker Open-3.0. URL:
http://www.repeatmasker.org. 1996.
15.
Stanke M, Steinkamp R, Waack S, Morgenstern B: AUGUSTUS: a web server for
gene finding in eukaryotes. Nucleic Acids Research 2004, 32(suppl 2):309-312.
16.
Majoros WH, Pertea M, Salzberg SL: TigrScan and GlimmerHMM: two open
source ab initio eukaryotic gene-finders. Bioinformatics 2004, 20(16):2878-2879.
17.
Lukashin AV, Borodovsky M: GeneMark.hmm: new solutions for gene finding.
Nucleic Acids Research 1998, 26(4):1107-1115.
18.
Birney E, Clamp M, Durbin R: GeneWise and genomewise. Genome Research 2004,
14(5):988-995.
19.
She R, Chu JSC, Wang K, Pei J, Chen N: GenBlastA: enabling BLAST to identify
homologous gene sequences. Genome Research 2009, 19(1):143-149.
20.
Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith Jr RK, Hannick LI, Maiti R,
Ronning CM, Rusch DB, Town CD: Improving the Arabidopsis genome
annotation using maximal transcript alignment assemblies. Nucleic Acids
Research 2003, 31(19):5654-5666.
21.
Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, White O, Buell CR,
Wortman JR: Automated eukaryotic gene structure annotation using
EVidenceModeler and the Program to Assemble Spliced Alignments. Genome
Biology 2008, 9(1):R7.
22.
Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R:
InterProScan: protein domains identifier. Nucleic Acids Research 2005, 33(suppl
2):116-120.
23.
Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with
RNA-Seq. Bioinformatics 2009, 25(9):1105-1111.
24.
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg
SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq
reveals unannotated transcripts and isoform switching during cell differentiation.
Nature Biotechnology 2010, 28(5):511-515.
25.
Lowe TM, Eddy SR: tRNAscan-SE: a program for improved detection of transfer
RNA genes in genomic sequence. Nucleic Acids Research 1997, 25(5):955-964.
26.
Lagesen K, Hallin P, Rodland EA, Staerfeldt HH, Rognes T, Ussery DW: RNAmmer:
consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Research
2007, 35(9):3100-3108.
27.
Lowe TM, Eddy SR: A computational screen for methylation guide snoRNAs in
yeast. Science 1999, 283(5405):1168-1171.
28.
Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A: Rfam:
annotating non-coding RNAs in complete genomes. Nucleic Acids Research 2005,
33(Database issue):D121-124.
29.
Nawrocki EP, Kolbe DL, Eddy SR: Infernal 1.0: inference of RNA alignments.
Bioinformatics 2009, 25(10):1335-1337.
30.
Kozomara A, Griffiths-Jones S: miRBase: integrating microRNA annotation and
deep-sequencing data. Nucleic Acids Research 2011, 39(Database issue):152-157.
31.
Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J:
Repbase Update, a database of eukaryotic repetitive elements. Cytogenetic and
Genome Research 2005, 110(1-4):462-467.
32.
Bao Z, Eddy SR: Automated de novo identification of repeat sequence families in
sequenced genomes. Genome Research 2002, 12(8):1269-1276.
33.
Price AL, Jones NC, Pevzner PA: De novo identification of repeat families in large
genomes. Bioinformatics 2005, 21:351-358.
34.
McCarthy EM, McDonald JF: LTR_STRUC: a novel search and identification
program for LTR retrotransposons. Bioinformatics 2003, 19(3):362-367.
35.
Xu Z, Wang H: LTR_FINDER: an efficient tool for the prediction of full-length
LTR retrotransposons. Nucleic Acids Research 2007, 35:265-268.
36.
Seberg O, Petersen G: A unified classification system for eukaryotic transposable
elements should reflect their phylogeny. Nature Reviews Genetics 2009, 10(4):276.
37.
Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K,
Eddy SR, Sonnhammer EL et al: The Pfam protein families database. Nucleic Acids
Research 2008, 36(Database issue):D281-288.
38.
Coffin JM, Hughes SH et al: Retroviruses. Cold Spring Harbor Laboratory Press;
1997.
39.
Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG: The
CLUSTAL_X windows interface: flexible strategies for multiple sequence
alignment aided by quality analysis tools. Nucleic Acids Research 1997,
25(24):4876-4882.
40.
Jurka J, Pethiyagoda C: Simple repetitive DNA sequences from primates:
compilation and analysis. Journal of Molecular Evolution 1995, 40(2):120-126.
41.
Ling HQ, Zhao S, Liu D, Wang J, Sun H, Zhang C, Fan H, Li D, Dong L, Tao Y et al:
Draft genome of the wheat A-genome progenitor Triticum urartu. Nature 2013,
496(7443):87-90.
42.
Li L, Stoeckert CJ, Roos DS: OrthoMCL: identification of ortholog groups for
eukaryotic genomes. Genome Research 2003, 13(9):2178-2189.
43.
Chen J, Huang Q, Gao D, Wang J, Lang Y, Liu T, Li B, Bai Z, Goicoechea JL, Liang
C: Whole-genome sequencing of Oryza brachyantha reveals mechanisms
underlying Oryza genome evolution. Nature Communications 2013, 4:1595.
44.
Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient
alignment of short DNA sequences to the human genome. Genome Biology 2009,
10(3):R25.
45.
Tang H, Lyons E, Pedersen B, Schnable JC, Paterson AH, Freeling M: Screening
synteny blocks in pairwise genome comparisons through integer programming.
BMC Bioinformatics 2011, 12(1):102.
46.
Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy
of multiple sequence alignment. Nucleic Acids Research 2005, 33(2):511-518.
47.
Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic
analyses with thousands of taxa and mixed models. Bioinformatics 2006,
22(21):2688-2690.
48.
Rokas A, Williams BL, King N, Carroll SB: Genome-scale approaches to resolving
incongruence in molecular phylogenies. Nature 2003, 425(6960):798-804.
49.
Pollard DA, Iyer VN, Moses AM, Eisen MB: Widespread discordance of gene trees
with species tree in Drosophila: evidence for incomplete lineage sorting. PLoS
Genetics 2006, 2(10):e173.
50.
Cranston KA, Hurwitz B, Ware D, Stein L, Wing RA: Species trees from highly
incongruent gene trees in rice. Systematic Biology 2009, 58(5):489-500.
51.
Holland BR, Huber KT, Moulton V, Lockhart PJ: Using consensus networks to
visualize contradictory evidence for species phylogeny. Molecular Biology and
Evolution 2004, 21(7):1459-1461.
52.
Gaut BS, Morton BR, McCaig BC, Clegg MT: Substitution rate comparisons
between grasses and palms: synonymous rate differences at the nuclear gene Adh
parallel rate differences at the plastid gene rbcL. Proceedings of the National
Academy of Sciences USA 1996, 93(19):10274-10279.
53.
Tang L, Zou X-h, Achoundong G, Potgieter C, Second G, Zhang D-y, Ge S:
Phylogeny and biogeography of the rice tribe (Oryzeae): evidence from
combined analysis of 20 chloroplast fragments. Molecular Phylogenetics and
Evolution 2010, 54(1):266-277.
54.
Ma J, Bennetzen JL: Rapid recent growth and divergence of rice nuclear genomes.
Proceedings of the National Academy of Sciences USA 2004, 101(34):12404-12410.
55.
Zhu Q, Ge S: Phylogenetic relationships among A-genome species of the genus
Oryza revealed by intron sequences of four nuclear genes. New Phytologist 2005,
167(1):249-265.
56.
Prasad V, Strömberg C, Leaché A, Samant B, Patnaik R, Tang L, Mohabey D, Ge S,
Sahni A: Late Cretaceous origin of the rice tribe provides evidence for early
diversification in Poaceae. Nature Communications 2011, 2:480.
57.
Bray N, Pachter L: MAVID: constrained ancestral alignment of multiple
sequences. Genome Research 2004, 14(4):693-699.
58.
Zhu T, Xu P-Z, Liu J-P, Peng S, Mo X-C, Gao L-Z: Phylogenetic relationships and
genome divergence among the AA-genome species of the genus Oryza as revealed
by 53 nuclear genes and 16 intergenic regions. Molecular Phylogenetics and
Evolution 2013, 70:348-361.
59.
Yang Z: PAML: a program package for phylogenetic analysis by maximum
likelihood. Computer applications in the biosciences: CABIOS 1997, 13(5):555-556.
60.
Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S: MEGA5: molecular
evolutionary genetics analysis using maximum likelihood, evolutionary distance,
and maximum parsimony methods. Molecular Biology and Evolution 2011,
28(10):2731-2739.
61.
Lu F, Ammiraju JSS, Sanyal A, Zhang S, Song R, Chen J, Li G, Sui Y, Song X, Cheng
Z: Comparative sequence analysis of MONOCULM1-orthologous regions in 14
Oryza genomes. Proceedings of the National Academy of Sciences USA 2009,
106(6):2071-2076.
62.
Ammiraju JSS, Zuccolo A, Yu Y, Song X, Piegu B, Chevalier F, Walling JG, Ma J,
Talag J, Brar DS: Evolutionary dynamics of an ancient retrotransposon family
provides insights into evolution of genome size in the genus Oryza. Plant Journal
2007, 52(2):342-351.
63.
Ammiraju JSS, Lu F, Sanyal A, Yu Y, Song X, Jiang N, Pontaroli AC, Rambo T,
Currie J, Collura K: Dynamic evolution of Oryza genomes is revealed by
comparative genomic analysis of a genus-wide vertical data set. Plant Cell 2008,
20(12):3191-3209.
64.
Zuccolo A, Sebastian A, Talag J, Yu Y, Kim H, Collura K, Kudrna D, Wing RA:
Transposable element distribution, abundance and role in genome size variation
in the genus Oryza. BMC Evolutionary Biology 2007, 7(1):152.
65.
Song XJ, Huang W, Shi M, Zhu MZ, Lin HX: A QTL for rice grain width and
weight encodes a previously unknown RING-type E3 ubiquitin ligase. Nature
Genetics 2007, 39(5):623-630.
66.
Li Y, Fan C, Xing Y, Jiang Y, Luo L, Sun L, Shao D, Xu C, Li X, Xiao J et al:
Natural variation in GS5 plays an important role in regulating grain size and
yield in rice. Nature Genetics 2011, 43(12):1266-1269.
67.
Yang Z: PAML 4: phylogenetic analysis by maximum likelihood. Molecular
Biology and Evolution 2007, 24(8):1586-1591.
68.
Jin J, Huang W, Gao J-P, Yang J, Shi M, Zhu M-Z, Luo D, Lin H-X: Genetic control
of rice plant architecture under domestication. Nature Genetics 2008,
40(11):1365-1369.
69.
Tan L, Li X, Liu F, Sun X, Li C, Zhu Z, Fu Y, Cai H, Wang X, Xie D: Control of a
key transition from prostrate to erect growth in rice domestication. Nature
Genetics 2008, 40(11):1360-1364.
70.
Harris RS: Improved pairwise alignment of genomic DNA. PhD thesis, Penn State
Univ 2007.
71.
Li YR, Zheng HC, Luo RB, Wu HL, Zhu HM, Li RQ, Cao HZ, Wu BX, Huang SJ,
Shao HJ et al: Structural variation in two human genomes mapped at
single-nucleotide resolution by whole genome de novo assembly. Nature
Biotechnology 2011, 29(8):723-730.
72.
Dewey CN: Aligning multiple whole genomes with Mercator and MAVID.
Methods in Molecular Biology 2007, 395:221-236.
73.
Hu TT, Pattyn P, Bakker EG, Cao J, Cheng JF, Clark RM, Fahlgren N, Fawcett JA,
Grimwood J, Gundlach H et al: The Arabidopsis lyrata genome sequence and the
basis of rapid genome size change. Nature Genetics 2011, 43(5):476-481.
74.
Han Y, Wessler SR: MITE-Hunter: a program for discovering miniature
inverted-repeat transposable elements from genomic sequences. Nucleic Acids
Research 2010: doi: 10.1093/nar/gkq862.
75.
Lu C, Chen J, Zhang Y, Hu Q, Su W, Kuang H: Miniature inverted-repeat
transposable elements (MITEs) have been accumulated through amplification
bursts and play important roles in gene expression and species diversity in Oryza
sativa. Molecular Biology and Evolution 2012, 29(3):1005-1017.
76.
Chen F-C, Chen C-J, Li W-H, Chuang T-J: Human-specific insertions and deletions
inferred from mammalian genome sequences. Genome Research 2007,
17(1):16-22.
77.
Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV, Schwartz S, Adams MD, Myers
EW, Li PW, Eichler EE: Recent segmental duplications in the human genome.
Science 2002, 297(5583):1003-1007.
78.
Alkan C, Kidd JM, Marques-Bonet T, Aksay G, Antonacci F, Hormozdiari F, Kitzman
JO, Baker C, Malig M, Mutlu O et al: Personalized copy number and segmental
duplication maps using next-generation sequencing. Nature Genetics 2009,
41(10):1061-1067.
79.
Marques-Bonet T, Kidd JM, Ventura M, Graves TA, Cheng Z, Hillier LW, Jiang Z,
Baker C, Malfavon-Borja R, Fulton LA et al: A burst of segmental duplications in
the genome of the African great ape ancestor. Nature 2009, 457(7231):877-881.
80.
Lohse M, Bolger AM, Nagel A, Fernie AR, Lunn JE, Stitt M, Usadel B: RobiNA: a
user-friendly, integrated software solution for RNA-Seq-based transcriptomics.
Nucleic Acids Research 2012, 40:622-627.
81.
Smit AFA, Hubley R, Green P: RepeatMasker Open-3.0. 1996-2010.
82.
Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic
Acids Research 1999, 27(2):573-580.
83.
Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics 2009, 25(14):1754-1760.
84.
Alkan C, Kidd JM, Marques-Bonet T, Aksay G, Antonacci F, Hormozdiari F, Kitzman
JO, Baker C, Malig M, Mutlu O et al: Personalized copy number and segmental
duplication maps using next-generation sequencing. Nature Genetics 2009,
41(10):1061-1067.
85.
Alkan C, Coe BP, Eichler EE: Genome structural variation discovery and
genotyping. Nature Reviews Genetics 2011, 12(5):363-376.
86.
Medvedev P, Stanciu M, Brudno M: Computational methods for discovering
structural variation with next-generation sequencing. Nature Methods 2009,
6(11):13-20.
87.
Marques-Bonet T, Kidd JM, Ventura M, Graves TA, Cheng Z, Hillier LW, Jiang Z,
Baker C, Malfavon-Borja R, Fulton LA et al: A burst of segmental duplications in
the genome of the African great ape ancestor. Nature 2009, 457(7231):877-881.
88.
Sudmant PH, Kitzman JO, Antonacci F et al: Diversity of human copy number
variation and multicopy genes. Science 2010, 330(6004):641-646.
89.
Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV, Schwartz S, Adams MD, Myers
EW, Li PW, Eichler EE: Recent segmental duplications in the human genome.
Science 2002, 297(5583):1003-1007.
90.
Alkan C, Kidd JM, Marques-Bonet T, Aksay G, Antonacci F, Hormozdiari F, Kitzman
JO, Baker C, Malig M, Mutlu O: Personalized copy number and segmental
duplication maps using next-generation sequencing. Nature Genetics 2009,
41(10):1061-1067.
91.
Hahn MW, De Bie T, Stajich JE, Nguyen C, Cristianini N: Estimating the tempo and
mode of gene family evolution from comparative genomic data. Genome Research
2005, 15(8):1153-1160.
92.
De Bie T, Cristianini N, Demuth JP, Hahn MW: CAFE: a computational tool for the
study of gene family evolution. Bioinformatics 2006, 22(10):1269-1271.
93.
Finn RD, Clements J, Eddy SR: HMMER web server: interactive sequence
similarity searching. Nucleic Acids Research 2011, 39(suppl 2):29-37.
94.
Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high
throughput. Nucleic Acids Research 2004, 32(5):1792-1797.
95.
Bailey TL, Elkan C: The value of prior knowledge in discovering motifs with
MEME. In: Ismb: 1995; 1995: 21-29.
96.
Shannon JC, Pien F-M, Liu K-C: Nucleotides and nucleotide sugars in developing
maize endosperms (synthesis of ADP-glucose in brittle-1). Plant Physiology 1996,
110(3):835-843.
97.
Lupas A, Van Dyke M, Stock J: Predicting coiled coils from protein sequences.
Science (New York, NY) 1991, 252(5009):1162-1164.
98.
Martin GB, Bogdanove AJ, Sessa G: Understanding the functions of plant disease
resistance proteins. Annual Review of Plant Biology 2003, 54(1):23-61.
99.
Shang J, Tao Y, Chen X, Zou Y, Lei C, Wang J, Li X, Zhao X, Zhang M, Lu Z:
Identification of a new rice blast resistance gene, Pid3, by genomewide
comparison of paired nucleotide-binding site-Leucine-rich repeat genes and their
pseudogene alleles between the two sequenced rice genomes. Genetics 2009,
182(4):1303-1311.
100.
Bryan GT, Wu K-S, Farrall L, Jia Y, Hershey HP, McAdams SA, Faulk KN,
Donaldson GK, Tarchini R, Valent B: A single amino acid difference distinguishes
resistant and susceptible alleles of the rice blast resistance gene Pi-ta. Plant Cell
2000, 12(11):2033-2045.
101.
Xu X, Liu X, Ge S, Jensen JD, Hu F, Li X, Dong Y, Gutenkunst RN, Fang L, Huang L:
Resequencing 50 accessions of cultivated and wild rice yields markers for
identifying agronomically important genes. Nature Biotechnology 2011,
30(1):105-111.
102.
Huang X, Qian Q, Liu Z, Sun H, He S, Luo D, Xia G, Chu C, Li J, Fu X: Natural
variation at the DEP1 locus enhances grain yield in rice. Nature Genetics 2009,
41(4):494-497.
103.
Yoshida K, Miyashita NT: DNA polymorphism in the blast disease resistance gene
Pita of the wild rice Oryza rufipogon and its related species. Genes Genetic
Systems 2009, 84(2):121-136.
104.
Arora R, Agarwal P, Ray S, Singh A, Singh V, Tyagi A, Kapoor S: MADS-box gene
family in rice: genome-wide identification, organization and expression profiling
during reproductive development and stress. BMC Genomics 2007, 8(1):242.
105.
Li R, Yu C, Li Y, Lam T-W, Yiu S-M, Kristiansen K, Wang J: SOAP2: an improved
ultrafast tool for short read alignment. Bioinformatics 2009, 25(15):1966-1967.
106.
Saika H, Matsumura H, Takano T, Tsutsumi N, Nakazono M: A point mutation of
Adh1 gene is involved in the repression of coleoptile elongation under
submergence in rice. Breeding Science 2006, 56(1):69-74.
107.
She R, Chu JS, Wang K, Pei J, Chen N: GenBlastA: enabling BLAST to identify
homologous gene sequences. Genome Research 2009, 19(1):143-149.
108.
Darwin C, Bynum WF: The origin of species by means of natural selection: or, the
preservation of favored races in the struggle for life: AL Burt; 2009.
109.
Rieseberg LH, Blackman BK: Speciation genes in plants. Annals of Botany 2010,
106(3):439-455.
110.
Yamagata Y, Yamamoto E, Aya K, Win KT, Doi K, Ito T, Kanamori H, Wu J,
Matsumoto T, Matsuoka M: Mitochondrial gene in the nuclear genome induces
reproductive barrier in rice. Proceedings of the National Academy of Sciences USA
2010, 107(4):1494-1499.
111.
Jiang W, Chu S-H, Piao R, Chin J-H, Jin Y-M, Lee J, Qiao Y, Han L, Piao Z, Koh H-J:
Fine mapping and candidate gene analysis of hwh1 and hwh2, a set of
complementary genes controlling hybrid breakdown in rice. Theoretical and
Applied Genetics 2008, 116(8):1117-1127.
112.
Chen J, Ding J, Ouyang Y, Du H, Yang J, Cheng K, Zhao J, Qiu S, Zhang X, Yao J: A
triallelic system of S5 is a major regulator of the reproductive barrier and
compatibility of indica-japonica hybrids in rice. Proceedings of the National
Academy of Sciences USA 2008, 105(32):11436-11441.
113.
Long Y, Zhao L, Niu B, Su J, Wu H, Chen Y, Zhang Q, Guo J, Zhuang C, Mei M:
Hybrid male sterility in rice controlled by interaction between divergent alleles
of two adjacent genes. Proceedings of the National Academy of Sciences USA 2008,
105(48):18871-18876.
114.
Yang Z: Likelihood Ratio Tests for Detecting Positive Selection and Application
to Primate Lysozyme Evolution. Molecular Biology and Evolution 1998,
15(5):568-573.
115.
Yang Z: PAML: a program package for phylogenetic analysis by maximum
likelihood. Computer applications in the biosciences: CABIOS 1997, 13(5):555-556.
116.
Yang Z: PAML 4: Phylogenetic Analysis by Maximum Likelihood. Mol Biol Evol
2007, 24(8):1586-1591.
117.
Nielsen R, Yang Z: Likelihood models for detecting positively selected amino acid
sites and applications to the HIV-1 envelope gene. Genetics 1998, 148(3):929-936.
118.
Yang Z, Nielsen R: Codon-substitution models for detecting molecular adaptation
at individual sites along specific lineages. Molecular Biology and Evolution 2002,
19(6):908-917.
119.
Zhang J, Nielsen R, Yang Z: Evaluation of an improved branch-site likelihood
method for detecting positive selection at the molecular level. Molecular Biology
and Evolution 2005, 22(12):2472-2479.
120.
Yang Z, Wong WS, Nielsen R: Bayes empirical Bayes inference of amino acid sites
under positive selection. Molecular Biology and Evolution 2005, 22(4):1107-1118.
121.
Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and
powerful approach to multiple testing. Journal of the Royal Statistical Society
Series B (Methodological) 1995:289-300.
122. Kimura M: The neutral theory of molecular evolution: Cambridge University Press;
1983.
123.
Gossmann TI, Song B-H, Windsor AJ, Mitchell-Olds T, Dixon CJ, Kapralov MV,
Filatov DA, Eyre-Walker A: Genome wide analyses reveal little evidence for
adaptive evolution in many plant species. Molecular Biology and Evolution 2010,
27(8):1822-1832.
124.
Strasburg JL, Kane NC, Raduski AR, Bonin A, Michelmore R, Rieseberg LH:
Effective population size is positively correlated with levels of adaptive
divergence among annual sunflowers. Molecular Biology and Evolution 2011,
28(5):1569-1580.
125.
Nielsen R, Bustamante C, Clark AG, Glanowski S, Sackton TB, Hubisz MJ,
Fledel-Alon A, Tanenbaum DM, Civello D, White TJ: A scan for positively selected
genes in the genomes of humans and chimpanzees. PLoS Biology 2005, 3(6):e170.
126.
Bakewell MA, Shi P, Zhang J: More genes underwent positive selection in
chimpanzee evolution than in human evolution. Proceedings of the National
Academy of Sciences USA 2007, 104(18):7489-7494.
127.
Kosiol C, Vinař T, Da Fonseca RR, Hubisz MJ, Bustamante CD, Nielsen R, Siepel A:
Patterns of positive selection in six Mammalian genomes. PLoS Genetics 2008,
4(8):e1000144.
128.
Bachtrog D: Similar rates of protein adaptation in Drosophila miranda and D.
melanogaster, two species with different current effective population sizes. BMC
Evolutionary Biology 2008, 8(1):334.
129.
Halligan DL, Oliver F, Eyre-Walker A, Harr B, Keightley PD: Evidence for
pervasive adaptive protein evolution in wild mice. PLoS Genetics 2010,
6(1):e1000825.
130.
Strasburg JL, Scotti-Saintagne C, Scotti I, Lai Z, Rieseberg LH: Genomic patterns of
adaptive divergence between chromosomally differentiated sunflower species.
Molecular Biology and Evolution 2009, 26(6):1341-1355.
131.
Bustamante CD, Fledel-Alon A, Williamson S, Nielsen R, Todd Hubisz M, Glanowski
S, Tanenbaum DM, White TJ, Sninsky JJ, Hernandez RD et al: Natural selection on
protein-coding genes in the human genome. Nature 2005, 437(7062):1153-1157.
132.
Boyko AR, Williamson SH, Indap AR, Degenhardt JD, Hernandez RD, Lohmueller
KE, Adams MD, Schmidt S, Sninsky JJ, Sunyaev SR: Assessing the evolutionary
impact of amino acid mutations in the human genome. PLoS Genetics 2008,
4(5):e1000083.
133.
Barrier M, Bustamante CD, Yu J, Purugganan MD: Selection on rapidly evolving
proteins in the Arabidopsis genome. Genetics 2003, 163(2):723-733.
134.
Bustamante CD, Nielsen R, Sawyer SA, Olsen KM, Purugganan MD, Hartl DL: The
cost of inbreeding in Arabidopsis. Nature 2002, 416(6880):531-534.
135.
Foxe JP, Dar V-u-N, Zheng H, Nordborg M, Gaut BS, Wright SI: Selection on amino
acid substitutions in Arabidopsis. Molecular Biology and Evolution 2008,
25(7):1375-1383.
136.
Roth C, Liberles DA: A systematic search for positive selection in higher plants
(Embryophytes). BMC Plant Biology 2006, 6(1):12.
137.
Pentony M, Winters P, Penfold-Brown D, Drew K, Narechania A, DeSalle R,
Bonneau R, Purugganan M: The plant proteome folding project: structure and
positive selection in plant protein families. Genome Biology and Evolution 2012,
4(3):360-371.
138.
Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH: Why highly expressed
proteins evolve slowly. Proceedings of the National Academy of Sciences USA 2005,
102(40):14338-14343.
139. Larracuente AM, Sackton TB, Greenberg AJ, Wong A, Singh ND, Sturgill D, Zhang Y,
Oliver B, Clark AG: Evolution of protein-coding genes in Drosophila. Trends in
Genetics 2008, 24(3):114-123.
140.
Kosiol C, Vinař T, Da Fonseca RR, Hubisz MJ, Bustamante CD, Nielsen R, Siepel A:
Patterns of positive selection in six Mammalian genomes. PLoS Genetics 2008,
4(8):e1000144.
141.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP,
Dolinski K, Dwight SS, Eppig JT: Gene Ontology: tool for the unification of
biology. Nature Genetics 2000, 25(1):25.
142.
Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M: KEGG for integration and
interpretation of large-scale molecular data sets. Nucleic Acids Research 2012,
40:109-114.
143.
Du Z, Zhou X, Ling Y, Zhang Z, Su Z: agriGO: a GO analysis toolkit for the
agricultural community. Nucleic Acids Research 2010, 38:64-70.
144.
Heger A, Ponting CP: Evolutionary rate analyses of orthologs and paralogs from
12 Drosophila genomes. Genome Research 2007, 17(12):1837-1849.
145.
Dai X, Zhao PX: psRNATarget: a plant small RNA target analysis server. Nucleic
Acids Research 2011, 39:155-159.
146.
Du Z, Zhou X, Ling Y, Zhang Z, Su Z: agriGO: a GO analysis toolkit for the
agricultural community. Nucleic Acids Research 2010, 38:64-70.
147.
O'Reilly D, Dienstbier M, Cowley SA, Vazquez P, Drozdz M, Taylor S, James WS,
Murphy S: Differentially expressed, variant U1 snRNAs regulate gene expression
in human cells. Genome Research 2013, 23(2):281-291.
148.
Spooner W, Youens-Clark K, Staines D, Ware D: GrameneMart: the BioMart data
portal for the Gramene project. Database (Oxford) 2012, 2012:bar056.
149.
Ye J, Fang L, Zheng H, Zhang Y, Chen J, Zhang Z, Wang J, Li S, Li R, Bolund L:
WEGO: a web tool for plotting GO annotations. Nucleic Acids Research 2006,
34:293-297.
150.
Zhao B, Liang R, Ge L, Li W, Xiao H, Lin H, Ruan K, Jin Y: Identification of
drought-induced microRNAs in rice. Biochemical and Biophysical Research
Communications 2007, 354(2):585-590.
151.
Franco-Zorrilla JM, Valli A, Todesco M, Mateos I, Puga MI, Rubio-Somoza I, Leyva
A, Weigel D, García JA, Paz-Ares J: Target mimicry provides a new mechanism for
regulation of microRNA activity. Nature Genetics 2007, 39(8):1033-1037.
152.
Kuo H-F, Chiou T-J: The role of microRNAs in phosphorus deficiency signaling.
Plant Physiology 2011, 156(3):1016-1024.
153.
Caicedo AL, Williamson SH, Hernandez RD, Boyko A, Fledel-Alon A, York TL,
Polato NR, Olsen KM, Nielsen R, McCouch SR et al: Genome-wide patterns of
nucleotide polymorphism in domesticated rice. PLoS Genetics 2007,
3(9):1745-1756.
154.
Guo X, Gui Y, Wang Y, Zhu QH, Helliwell C, Fan L: Selection and mutation on
microRNA target sequences during rice evolution. BMC Genomics 2008, 9:454.
155.
Li YF, Zheng Y, Addo-Quaye C, Zhang L, Saini A, Jagadeeswaran G, Axtell MJ,
Zhang W, Sunkar R: Transcriptome-wide identification of microRNA targets in
rice. Plant Journal 2010, 62(5):742-759.
156.
Lu C, Jeong DH, Kulkarni K, Pillay M, Nobuta K, German R, Thatcher SR, Maher C,
Zhang L, Ware D et al: Genome-wide analysis for discovery of rice microRNAs
reveals natural antisense microRNAs (nat-miRNAs). Proceedings of the National
Academy of Sciences USA 2008, 105(12):4951-4956.
157.
Luo YC, Zhou H, Li Y, Chen JY, Yang JH, Chen YQ, Qu LH: Rice embryogenic calli
express a unique set of microRNAs, suggesting regulatory roles of microRNAs in
plant post-embryogenic development. FEBS Letters 2006, 580(21):5111-5116.
158.
Xie K, Wu C, Xiong L: Genomic organization, differential expression, and
interaction of SQUAMOSA promoter-binding-like transcription factors and
microRNA156 in rice. Plant Physiology 2006, 142(1):280-293.
159.
Zhu QH, Upadhyaya NM, Gubler F, Helliwell CA: Over-expression of miR172
causes loss of spikelet determinacy and floral organ abnormalities in rice (Oryza
sativa). BMC Plant Biology 2009, 9:149.
Supporting Figures
Figure S1. Diagrams of two examples of library parameters. Sequencing depth was calculated by read
length distribution for GLA (GlaP01) (a) and 17-mer distribution for BAR (BarP03) (b). All other libraries
are given in Table S2.
Figure S2. Cytogram of fluorescence intensity of the six AA-genome Oryza species including O. sativa
ssp. japonica cv. Nipponbare nuclei isolated with an improved Otto buffer. Coefficient of variation
values (CVs): 3.75% (a); 3.37% (b); 3.25% (c); 3.51% (d); 4.07% (e); 3.36% (f). X: relative fluorescence;
Y: number of nuclei.
Figure S3. Length distributions of scaffolds for the five assembled rice genomes.
Figure S4. Flowchart for estimating the assembly quality of the five sequenced rice genomes.
[Flowchart with two panels.
Alignment to the SAT reference: (1) run RepeatMasker against all 6 rice genome sequences; (2) use MUMMER to map the 5 assembled genomes to SAT; (3) filter sequences with > 90% similarity and remove overlapping sequences; (4) calculate genome coverage and gene coverage; (5) estimate sequence similarity.
Alignment of BES data: (1) run RepeatMasker against both the BES data and the 5 assembled genomes; (2) use MUMMER to map BES to the 5 assembled genomes; (3) filter sequences with > 90% similarity and remove multi-hits; (4) count the rates of orientation errors.]
Figure S5. Examples of scatter plot displays of the assembled genome sequences that were mapped to
the short arms of Chromosomes 3 generated by OMAP. Scatter plot displays are shown for the GLA (a),
GLU (b) and MER (c) alignments.
Figure S6. Number of different types of non-coding RNA genes identified in the six AA-genome
Oryza species. (a) tRNA genes; (b) rRNA genes; (c) snoRNA genes; (d) snRNA genes; (e) miRNA genes.
Branch lengths are proportional to the divergence times indicated in the scale bar.
Figure S7. Number of simple sequence repeats (SSRs) detected in the six AA-genome Oryza species.
All SSRs were classified into one of two subgroups according to their lengths; ≥ 20 bp (red) and < 20 bp
(blue).
Figure S8. Phylogenetic relationship and divergence dates of the six AA-genome Oryza species
inferred from 2,305 orthologous gene sequences. All nodes were supported at 100% with O.
brachyantha as the outgroup.
Figure S9. Consensus networks constructed from 100 individual gene trees. The thresholds were (a)
0.1, (b) 0.15 and (c) 0.2. The planar graph was constructed with Splits-Tree 4.12.8.
Figure S10. Genomic alignment of the GS5 (LOC_Os05g06660) region across AA-genome Oryza
species. Gene models are shown in blue rectangles. TEs are shown in red for DNA transposons and green
for retrotransposons. Lines connect orthologous TEs.
Figure S11. Molecular evolution of the GS5 gene in six AA-genome Oryza species. dN/dS values were
estimated for each branch of the GS5 gene tree with the reconstructed sequences at ancestral nodes.
Numbers above the lineage indicate dN/dS values; branch lengths are proportional to the number of
substitutions.
Figure S12. Overall length distribution of insertions and deletions across the five rice genomes. SV:
structural variants.
[Plot: Percentage of SV (y-axis, 0 to 0.8) by Length (bp) (x-axis; bins 1-10, 10-50, 50-100, 100-500, 500-1k and >1k) for NIV, GLA, BAR, GLU and MER.]
Figure S13. TE annotation of indels of 140-160 bp (a) and 230-250 bp (b) in the five sequenced rice
genomes.
Figure S14. Size distribution of indels within different genomic regions among the five rice species
relative to SAT. Five bars from bottom to top for each length range shown are for NIV, BAR, GLA, GLU
and MER, respectively.
[Plot: Count (0 to 350,000) by Length (bp) (axis ticks 0 to 9, then 10-19, 20-29, 30-39 and 40-50) for Intergenic, Intron, 3' UTR, 5' UTR and Exon regions.]
Figure S15. Size distribution of insertions and deletions within protein-coding sequences of five rice
genomes. Insertions and deletions that are multiples of a single codon (3 bp) were overrepresented in
protein-coding regions.
[Plot: Count (0 to 20,000) of insertions and deletions by Length (bp), 1 to 9.]
Figure S16. Chromosomal distribution of NBS-encoding genes among the seven Oryza genomes. The
positions of NBS-encoding genes of NIV, GLA, BAR, GLU and MER were based on the BLAT
alignments using the SAT genome (MSU v7.0). The numbers below the figure indicate the number of
annotated NBS-encoding genes, showing (for instance) just over 480 such genes in NIV.
Figure S17. Comparative analysis of Pid3 homologues in seven rice species. (a) The average base
identity of all the Pid3 homologues (window: 10 aa). The x-axis indicates the longitudinal position as well
as domain structure of the Pid3 gene; the y-axis shows the average base identity, which was calculated as
the maximum number of identical amino acids in the alignment for each position divided by the number of
amino acids at each location; (b) detailed sequence information for aa 700-780 of the Pid3 homologues.
The amino acid at position 737 is indicated in red.
Figure S18. Summary of amino acid variation in the Pi-ta homologues among the six rice species.
The locations of domains are shown at the top of the figure. Numbers indicate the positions of amino acids
in the Pi-ta homologue protein (AF207842).
Figure S19. Alignment of protein-coding sequences of the S5 orthologues in AA-genome Oryza
species. Missing sequences represent either indels or assembly gaps.
Figure S20. Alignment of protein-coding sequences of SaM orthologues in AA-genome Oryza species.
Missing sequences represent either indels or assembly gaps.
Figure S21. Alignment of protein-coding sequences of SaF orthologues in AA-genome Oryza species.
Missing sequences represent either indels or assembly gaps.
Figure S22. Alignment of protein-coding sequences of mtRPL27 orthologues in AA-genome Oryza
species. The missing sequences are indels.
Figure S23. Alignment of protein-coding sequences of HWH1 orthologues in AA-genome Oryza
species.
Figure S24. Read mapping results for TRJ, IND, NIV, RUF, GLA, BAR, GLU, LON and MER using
LOC_Os07g05900 (PROG1) as the reference sequence. Of these species, BAR, GLA, MER and LON
have the fewest mapped reads, most of which were homologous to specific regions of PROG1. Red and
green lines indicate single reads, while blue lines represent paired-end reads. The size of the gene is
indicated by the bp numbers on the top lines.
Figure S25. Experimental validation and comparative genomic analysis of PROG1 orthologues in
AA-genome Oryza species. (a) Results for PCR amplification of PROG1 in the AA-genome Oryza species.
(b) Comparative genomic analysis of the PROG1-orthologous regions. PROG1 is absent in BAR, GLA,
and MER. The green, red, and light blue boxes indicate retroelement TEs, DNA-TEs and protein-coding
genes, respectively. Solid and dashed black lines show the orthologous relationships of genes and TEs,
respectively. (c) Alignment of predicted amino acid sequences of PROG1. The sequences SAT_HM149649,
IND_HM149647, RUF_HM149748, RUF_HM149763 and NIV_HM149768 were downloaded from
NCBI.
Figure S26. Branch-specific dN, dS and ω values in each terminal branch. Shown are values of (a) ω
(dN/dS); (b) dN and dS for each species, which were estimated in branch model tests.
Figure S27. Likelihood Ratio Tests (LRTs) used to detect positive selection in the studied Oryza
species. Panel (a) shows the test for selection on any branch of the phylogeny, and panels (b–i) show the
lineage-specific tests, with branches under positive selection highlighted in red. The numbers below each
subfigure represent the number of positively selected genes at a significance level of P < 0.05 (FDR <
0.05). Each branch is labeled with the corresponding estimate of ω from the branch model test (a).
Figure S28. Comparisons of expression levels between positively selected genes (PSGs) and non-PSGs. For
each tissue, expression levels of non-PSGs (blue) and PSGs (red) are indicated, and these PSGs were detected
by lineage-specific LRTs for each species. Values for expression levels are shown in log2 (FPKM). SAT tissues
are indicated by abbreviations as follows: LE, Leaves-20 days; PE-I, Pre-emergence inflorescence; PT-I,
Post-emergence inflorescence; AN, Anther; PI, Pistil; SE-5, Seed-5 DAP; SE-10, Seed-10 DAP; EM-25,
Embryo-25 DAP; SH, Shoots; SD, Seedling four-leaf stage; EN-25, Endosperm-25 DAP; DAP, Days after
pollination.
Figure S29. Comparisons of the percentages of PSGs for each species. The blue bars represent the
percentages of PSGs detected by lineage-specific LRTs among 2,272 orthologous gene families for each species.
The red and green bars represent the percentages of PSGs among the genes assigned to the categories of
"reproduction" and "response to stimulus", respectively.
Figure S30. Ubiquitin-mediated proteolysis pathway as presented in the KEGG database. Predicted PSGs
(FDR < 0.05) are indicated by red boxes. All arrows, connectors and nodes are graphically designated in
accordance with KEGG guidelines.
Figure S31. GO annotation of miRNA targets across the AA-genome Oryza species. (a) miRNA target
genes in SAT; (b) miRNA target genes in NIV; (c) miRNA target genes in GLA; (d) miRNA target genes
in BAR; (e) miRNA target genes in GLU; (f) miRNA target genes in MER. The whole non-TE transcript
data sets for the SAT genome were used as the background control.
Figure S32. GO annotation of miRNA target genes across the AA-genome Oryza species. pv1
represents the P-value for SAT, pv2 represents the P-value for NIV, pv3 represents the P-value for GLA,
pv4 represents the P-value for BAR, pv5 represents the P-value for GLU, and pv6 represents the P-value
for MER. The red to gray scale represents the degree of significance, with red being the most significant
and gray indicating a lack of significance.
Figure S33. Nucleotide substitution frequencies of non-coding RNA genes in comparisons with
genomic components across the AA-genome Oryza species. Nucleotide substitution frequencies for the
whole genome were calculated by using 1 kb sliding windows along each genome. The horizontal axis
indicates different genomic elements, while the vertical axis indicates nucleotide substitution frequencies.
Figure S34. Average nucleotide substitution frequencies of non-coding RNA genes in comparison
with genomic components across the AA-genome Oryza species. Branch lengths are proportional to the
divergence times as indicated by the scale bar. (a) tRNA genes; (b) snoRNA genes; (c) snRNA genes; (d)
mature miRNAs; (e) pre-miRNAs; (f) miRNA targets; (g) 1 kb upstream regions; (h) 5' UTRs; (i) CDS; (j)
introns; (k) 3' UTRs; (l) 1 kb downstream regions; (m) intergenic regions; (n) whole genome.
Figure S35. Nucleotide substitutions in mature miRNA sequences and their putative target sites
across the six AA-genome Oryza species. The horizontal axis represents the number of nucleotide
substitutions, while the vertical axis denotes the number of mature miRNA sequences (a) and miRNA
target sites inside miRNA target genes (b).
Figure S36. GO annotation of miRNA genes and their putative target sites in the six AA-genome
Oryza species. The conserved miRNAs and target sites refer to miRNAs and their target sites without
nucleotide substitutions; the non-conserved miRNAs and target sites designate miRNAs and their target
sites that had at least one nucleotide substitution. (a) miRNAs; (b) miRNA target sites.
Figure S37. Alignment of osa-miR169b orthologues across the six AA-genome Oryza species.
osa-miR169b has been associated with adaptation to drought. Sequences in the red rectangle show mature
miRNAs within the pre-miRNAs, while asterisks indicate nucleotide substitutions.
Figure S38. Alignment of osa-miR399f orthologues across the six AA-genome Oryza species.
osa-miR399f has been associated with adaptation to phosphate starvation. Sequences in the red rectangle
show predicted mature miRNAs within the predicted pre-miRNAs, while asterisks indicate nucleotide
substitutions or small indels.
Figure S39. Sequence alignments of miRNAs related to flower development across the six
AA-genome Oryza species. (a) osa-miR159a; (b) osa-miR159e; (c) osa-miR1847; (d) osa-miR159f.
Sequences within the red rectangles are predicted mature miRNAs within the predicted pre-miRNAs,
while asterisks indicate nucleotide substitutions or small indels.
Supporting Tables
Table S1: The five sequenced AA-genome Oryza species in this study.
Species          Geographical Origin  Accession No.a  Category
O. nivara        Laos                 88812           Wild
O. glaberrima    Ivory Coast          103486          Cultivated
O. barthii       Burkina Faso         101252          Wild
O. glumaepatula  Brazil               88793           Wild
O. meridionalis  Australia            105298          Wild
a All accessions were provided by the Genetic Resources Center (GRC), International Rice Research
Institute (IRRI).
Table S2: Libraries and read statistics for the sequence assemblies of the five AA-genome Oryza
species.
Species  Data Type    Lane Information  Insert Size (bp)  Read Length (bp)  Raw Data (Mb)  Clean Data (Mb)  Fold Sequence Coverage
NIV      Total                                                              52453.19       28386.21         72.78
         Paired-ends                                                        31865.10       21923.59         56.21
                      NivP01            240               120               9996.13        7163.28          18.37
                      NivP02            350               100               17031.44       10684.36         27.40
                      NivP03            350               100               4837.54        4075.94          10.45
         Mate pair                                                          20588.09       6462.62          16.57
                      NivM01            1980              85                5735.16        3083.62          7.91
                      NivM02            2200              75                2130.04        998.59           2.56
                      NivM03            2700              75                2205.38        188.69           0.48
                      NivM04            4070              85                1897.68        658.83           1.69
                      NivM05            4260              40                2341.51        189.00           0.48
                      NivM06            4360              75                893.70         195.24           0.50
                      NivM07            4460              60                925.62         472.73           1.21
                      NivM08            6260              75                733.81         100.78           0.26
                      NivM09            7460              60                1052.60        267.32           0.69
                      NivM10            7480              40                2342.59        271.59           0.70
                      NivM11            8580              75                330.00         36.24            0.09
GLA      Total                                                              32975.96       21265.67         55.96
         Paired-ends                                                        21649.96       16018.76         42.15
                      GlaP01            310               100               4908.23        4237.69          11.15
                      GlaP02            340               75                16741.74       11781.07         31.00
         Mate pair                                                          11326.00       5246.91          13.81
                      GlaM01            2150              75                2406.06        1139.25          3.00
                      GlaM02            2150              75                2500.57        1349.00          3.55
                      GlaM03            2670              75                1596.43        348.48           0.92
                      GlaM04            3940              60                1013.53        658.65           1.73
                      GlaM05            3990              85                1527.09        998.63           2.63
                      GlaM06            7320              40                2282.31        752.90           1.98
BAR      Total                                                              42062.01       18911.68         51.11
         Paired-ends                                                        24951.93       15167.43         40.99
                      BarP01            230               36                7191.85        4543.28          12.28
                      BarP02            230               75                8462.70        5145.40          13.91
                      BarP03            430               120               4107.11        1311.30          3.54
                      BarP04            460               100               5190.27        4167.46          11.26
         Mate pair                                                          17110.08       3744.25          10.12
                      BarM01            1920              75                2159.30        609.51           1.65
                      BarM02            2500              75                4792.98        1038.52          2.81
                      BarM03            3980              85                5706.63        918.11           2.48
                      BarM04            3980              40                1912.54        537.76           1.45
                      BarM05            7230              40                2538.63        640.34           1.73
GLU      Total                                                              46473.26       31891.82         86.19
         Paired-ends                                                        29608.96       23269.09         62.89
                      GluP01            260               80                10517.84       8591.10          23.22
                      GluP02            320               100               5787.14        4756.05          12.85
                      GluP03            400               100               13303.98       9921.94          26.82
         Mate pair                                                          16864.30       8622.72          23.30
                      GluM01            1820              40                210.18         98.28            0.27
                      GluM02            1850              55                4185.34        2974.98          8.04
                      GluM03            3500              40                2746.03        1755.62          4.74
                      GluM04            5460              40                2656.93        1045.23          2.82
                      GluM05            7700              40                2143.00        742.02           2.01
                      GluM06            7700              36                2314.39        718.17           1.94
                      GluM07            8620              50                2608.44        1288.42          3.48
MER      Total                                                              58674.11       22859.49         60.16
         Paired-ends                                                        24719.84       17142.28         45.11
                      MerP01            260               36                6043.28        3805.70          10.02
                      MerP02            260               75                13159.89       8783.62          23.11
                      MerP03            420               100               5516.66        4552.96          11.98
         Mate pair                                                          33954.27       5717.21          15.05
                      MerM01            2060              60                3767.06        1252.07          3.29
                      MerM02            2120              75                3375.84        1019.84          2.68
                      MerM03            2480              75                6520.76        427.69           1.13
                      MerM04            2960              50                3914.99        1139.04          3.00
                      MerM05            4070              85                1902.23        771.62           2.03
                      MerM06            4380              75                1933.27        311.90           0.82
                      MerM07            7250              50                7625.51        410.75           1.08
                      MerM08            7750              40                4914.62        384.29           1.01
Table S3: Genome sizes of the five sequenced AA-genome Oryza species estimated by flow cytometry
and k-mer analysis. SAT (estimated genome size = 389 Mb) was employed as a standard, and the
conversion factor used was 1 pg = 978 Mb (Doležel et al. 2003).
Species  2C-value (pg)  SD     Estimated Size (Mb),  Estimated Size (Mb),  Previously Reported  References
                               Flow Cytometry        17 k-mer              Size (Mb)
NIV      0.698          0.004  341.32                395.00                448                  Ammiraju, Luo et al. 2006
GLA      0.778          0.007  380.44                370.00                357                  Martinez, Arumuganathan et al. 1994; Uozu, Ikehashi et al. 1997
                                                                           420                  Ammiraju, Luo et al. 2006
BAR      0.757          0.008  370.17                376.00                401                  Miyabayashi, Nonomura et al. 2007
GLU      0.794          0.057  388.27                366.00                475                  Uozu, Ikehashi et al. 1997
MER      0.845          0.060  413.21                388.00                493                  Uozu, Ikehashi et al. 1997
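The flow cytometry sizes in Table S3 are numerically consistent with halving each 2C value (to obtain the 1C content) and applying the stated factor of 1 pg = 978 Mb. The short Python sketch below reproduces that arithmetic. The function name and the halving step are assumptions inferred from the tabulated numbers, not a description of the authors' own scripts.

```python
# Conversion factor stated in the Table S3 caption (Dolezel et al. 2003).
MB_PER_PG = 978.0


def genome_size_mb(two_c_pg: float) -> float:
    """Convert a 2C nuclear DNA content (pg) to a 1C genome size (Mb).

    Assumption: the tabulated sizes are 1C values, i.e. half the 2C content.
    """
    return two_c_pg / 2.0 * MB_PER_PG


# 2C values (pg) copied from Table S3.
two_c_values = {"NIV": 0.698, "GLA": 0.778, "BAR": 0.757, "GLU": 0.794, "MER": 0.845}

for species, two_c in two_c_values.items():
    print(f"{species}: {genome_size_mb(two_c):.2f} Mb")
```

Under this assumption, the computed sizes agree with the "Flow Cytometry" column to within rounding.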
Table S4: Assembly statistics for the five sequenced AA-genomes.
Species                        NIV                GLA                BAR        GLU        MER
Library Number                 13                 7                  7          9          11
Sequence Coverage (×)          59.72              55.96              51.11      86.19      74.7
  ~300 bp                      36.21              42.15              26.19      36.07      33.13
  ~500 bp                      -                  -                  14.8       26.82      11.98
  2 Kb                         10.95              7.47               4.46       8.31       10.1
  4 Kb                         3.88               4.36               3.93       4.74       2.85
  6 Kb                         0.36               -                  -          2.82       -
  8 Kb                         1.48               1.98               1.73       7.43       2.09
  BES*                         0.14               0.11               -          -          -
Assembled Length (bp)          375012451          344859159          335092604  334669472  340778775
Scaffold N50 (bp)              133015 (511541)**  241242 (722125)**  237573     129688     117674
Scaffold N90 (bp)              32133 (36897)**    65818 (76929)**    52825      34304      28477
Contig N50 (bp)                19023              25248              16126      17474      14633
Contig N90 (bp)                5090               6459               4019       5099       3695
Scaffold Number                6249               3831               9694       7016       9431
Contig Number                  36840              29972              45851      36810      49517
Longest Scaffold (bp)          13178141           5832171            1480305    848808     947518
Longest Contig (bp)            360430             129641             198278     121993     116901
>1Mb Scaffold Length (bp)      115463348          138287780          10206661   0          0
  0.1-1 Mb                     170895313          164370361          252911941  207495389  195595494
  10-100 Kb                    81246620           38726316           62745354   121128494  135247238
  1-10 Kb                      6204666            2393991            5880475    4875733    7935337
  1 bp-1 Kb                    5728704            6261546            14182849   4812289    8244828
Estimated Length (Mb)          395                370                376        366        388
No gap Length (bp)             339758745          327867452          318231159  315643039  321989115
Gap Number                     30591              26141              36157      29794      40086
Gap Length (bp)                35253706           16991707           16861445   19026433   18789660
Coverage (%)                   94.94              93.21              89.12      91.44      87.83
GC Content (%)                 39.23              41.63              42.14      42.35      42.11
* BES sequence data were downloaded from http://www.omap.org/;
** Assembled lengths of Scaffold N50 and Scaffold N90 after adding BES data for NIV and GLA.
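For readers unfamiliar with the N50 and N90 statistics reported in Table S4, the following Python sketch shows one common way to compute them from a list of sequence lengths, together with the coverage percentage (assembled length divided by estimated genome size). The function names and the toy scaffold lengths are hypothetical, and this is an illustrative sketch rather than the pipeline used for the assemblies.

```python
def n_stat(lengths, fraction=0.5):
    """Return the Nx length: the largest length L such that sequences of
    length >= L together cover at least `fraction` of the total length."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    threshold = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    # Only reachable through floating-point edge cases with fraction near 1.
    return min(lengths)


def coverage_percent(assembled_bp, estimated_mb):
    """Assembled length as a percentage of the estimated genome size."""
    return 100.0 * assembled_bp / (estimated_mb * 1_000_000)


# Hypothetical scaffold lengths (bp), total 300.
toy_scaffolds = [100, 80, 60, 40, 20]
toy_n50 = n_stat(toy_scaffolds, 0.5)  # cumulative 180 >= 150 at length 80
toy_n90 = n_stat(toy_scaffolds, 0.9)  # cumulative 280 >= 270 at length 40

# NIV values from Table S4: assembled length 375012451 bp, estimated 395 Mb.
niv_coverage = coverage_percent(375012451, 395)
print(toy_n50, toy_n90, round(niv_coverage, 2))
```

With the NIV figures from Table S4, the coverage calculation gives about 94.94%, which matches the tabulated "Coverage (%)" value.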
Table S5: Scaffold length distributions of the five assembled AA-genome Oryza species.
Species  Scaffold Length  Number  Total Length (bp)  Average Length (bp)  Percentage (%)
NIV      >1 Kb            4474    373771467          83543                98.49
         >10 Kb           2866    367524707          128236               96.84
         >50 Kb           1130    323373789          286171               85.21
         >100 Kb          599     286342198          478034               75.45
         >200 Kb          374     255609882          683449               67.35
         >300 Kb          270     230458927          853552               60.73
         >500 Kb          169     190971191          1130007              50.32
         >800 Kb          85      138631109          1630954              36.53
         >1 Mb            59      115457626          1956909              30.42
GLA      >1 Kb            2211    343970402          155572               98.21
         >10 Kb           1535    341583236          222530               97.53
         >50 Kb           881     323878164          367626               92.48
         >100 Kb          579     302793594          522960               86.46
         >200 Kb          379     274493226          724257               78.37
         >300 Kb          287     251316829          875668               71.76
         >500 Kb          170     206042109          1212012              58.83
         >800 Kb          100     162038725          1620387              46.27
         >1 Mb            73      138332401          1894964              39.50
BAR      >1 Kb            4870    331838005          68139                95.90
         >10 Kb           2371    325977159          137485               94.21
         >50 Kb           1548    303570597          196105               87.73
         >100 Kb          1001    263476074          263213               76.15
         >200 Kb          480     189113314          393986               54.65
         >300 Kb          267     136622094          511693               39.48
         >500 Kb          102     73858265           724101               21.35
         >800 Kb          27      27040229           1001490              7.81
         >1 Mb            8       10206541           1275818              2.95
GLU      >1 Kb            5301    333900144          62988                98.58
         >10 Kb           3946    328995299          83374                97.13
         >50 Kb           2086    279490598          133984               82.52
         >100 Kb          1091    207822351          190488               61.36
         >200 Kb          344     104409334          303516               30.83
         >300 Kb          121     49866729           412122               14.72
         >500 Kb          24      14053576           585566               4.15
         >800 Kb          1       850278             850278               0.25
MER      >1 Kb            6506    338842328          52082                97.62
         >10 Kb           4335    330916366          76336                95.34
         >50 Kb           2148    273346289          127256               78.75
         >100 Kb          1057    195435307          184896               56.31
         >200 Kb          315     92846733           294752               26.75
         >300 Kb          105     42794909           407571               12.33
         >500 Kb          17      11101312           653018               3.20
         >800 Kb          2       1783512            891756               0.51
Table S6: Comparisons of the assembled quantities and percentages of transposable elements in the
six AA-genome Oryza species.
Sequence Size (Mb) | SAT* | NIV | GLA | BAR | GLU | MER
Est. Genome Size by 17-mer Analysis | 389 | 395 | 370 | 376 | 366 | 388
Assembled Size (No Gaps) | 372.23 | 339.76 | 327.87 | 318.23 | 315.64 | 321.99
Gap Size | 16.77 | 55.24 | 42.13 | 57.77 | 50.36 | 66.01
LTR retrotransposons | 118.91 | 69.16 | 71.09 | 72.53 | 69.19 | 75.88
Other TEs | 34.22 | 40.07 | 37.52 | 39.63 | 40.28 | 39.64
* Assembled lengths and percentages of transposable elements were estimated by using the reference genome of SAT (International Rice Genome Sequencing Project 2005).
Table S7: Assessment of the assembly quality of the five sequenced AA-genome Oryza species.
Category | SAT | NIV | GLA | BAR | GLU | MER
Genome Masking
Masked Length of Repeat Sequences (Mb) | 170.3 | 181.8 | 148.5 | 138.5 | 142.1 | 147.5
Remaining Length of Genomic Sequences (Mb) | 202.9 | 197.7 | 201.8 | 207.5 | 196.6 | 199.6
Percentage of Unmasked Sequences (%) | 54.38 | 52.09 | 57.61 | 59.97 | 58.04 | 57.50
Statistics of Genome Alignments
Sequence Length Mapped to SAT (Mb) | - | 194.5 | 192.1 | 188.9 | 185.0 | 157.3
Mapped Percentage to the SAT Genome (%) | - | 52.1 | 51.5 | 50.6 | 49.6 | 42.1
Mapped Percentage to the Repeat Sequence-free SAT Genome (%) | - | 95.8 | 94.7 | 93.1 | 91.2 | 77.5
Percentage of Protein-coding Gene Regions (%) | - | 98.7 | 98.6 | 98.6 | 97.9 | 97.8
Statistics of Aligned Gene Numbers
50% (a) | 39988 | 37613 | 37793 | 37250 | 37081 | 35256
80% (a) | 39988 | 35486 | 36155 | 35076 | 35028 | 32062
90% (a) | 39988 | 33912 | 35045 | 33677 | 33696 | 30245
Truncated (b) | 0 | 154 | 80 | 136 | 152 | 180
(a) Coverage of genes.
(b) Numbers of genes that are cut into more than one contig.
Table S8: Assessment of the assembly quality of four sequenced AA-genome Oryza species by using
the orientations of BES data from OMAP.
Species | BES Pairs | BES Pairs in the Same Scaffolds | Mismatched BES Pairs | Number of Correct BES Pairs | Disorientation Percentage (%)
NIV | 13469 | 10652 | 97 | 10555 | 0.9106
GLA | 11036 | 9252 | 118 | 9134 | 1.2754
GLU | 2615 | 592 | 5 | 587 | 0.8446
MER | 3555 | 718 | 13 | 705 | 1.8106
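The disorientation percentages in Table S8 can be reproduced from the two count columns. The sketch below assumes the percentage is the number of mismatched BES pairs divided by the number of BES pairs placed in the same scaffold; that ratio is inferred from the tabulated values and is not stated explicitly in the text. The function name is hypothetical.

```python
def disorientation_percentage(mismatched_pairs: int, pairs_in_same_scaffold: int) -> float:
    """Percentage of same-scaffold BES pairs whose orientation is inconsistent.

    Assumption: percentage = mismatched / same-scaffold pairs * 100,
    inferred from the values in Table S8.
    """
    if pairs_in_same_scaffold <= 0:
        raise ValueError("pairs_in_same_scaffold must be positive")
    return 100.0 * mismatched_pairs / pairs_in_same_scaffold


# (same-scaffold pairs, mismatched pairs) copied from Table S8
bes_counts = {
    "NIV": (10652, 97),
    "GLA": (9252, 118),
    "GLU": (592, 5),
    "MER": (718, 13),
}

for species, (same_scaffold, mismatched) in bes_counts.items():
    pct = disorientation_percentage(mismatched, same_scaffold)
    print(f"{species}: {pct:.4f}%")
```

The number of correct BES pairs in the table is then simply the same-scaffold count minus the mismatched count.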
Table S9: RNA-Seq analysis of four tissues in five AA-genome Oryza species.
Species | RNA source tissues | Read length (bp) | Number of paired-end reads* | Clean data (Gb)
NIV | 30-d-roots | 100 | ~43M | ~8.61
NIV | 30-d-shoots | 100 | ~32M | ~6.56
NIV | panicles at booting stage | 100 | ~28M | ~5.65
NIV | flag leaves at booting stage | 100 | ~42M | ~8.44
GLA | 30-d-roots | 100 | ~36M | ~7.29
GLA | 30-d-shoots | 100 | ~33M | ~6.69
GLA | panicles at booting stage | 100 | ~50M | ~10.03
GLA | flag leaves at booting stage | 100 | ~32M | ~6.37
BAR | 30-d-roots | 100 | ~34M | ~6.86
BAR | 30-d-shoots | 100 | ~31M | ~6.29
BAR | panicles at booting stage | 100 | ~30M | ~6.09
BAR | flag leaves at booting stage | 100 | ~25M | ~4.99
GLU | 30-d-roots | 100 | ~37M | ~7.37
GLU | 30-d-shoots | 100 | ~40M | ~8.02
GLU | panicles at booting stage | 100 | ~46M | ~9.33
GLU | flag leaves at booting stage | 100 | ~38M | ~7.51
MER | 30-d-roots | 100 | ~36M | ~7.13
MER | 30-d-shoots | 100 | ~37M | ~7.35
MER | panicles at booting stage | 100 | ~35M | ~7.16
MER | flag leaves at booting stage | 100 | ~31M | ~6.27
*M indicates million.
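As a rough consistency check on Table S9, the clean data volume can be approximated from the number of read pairs and the read length. This is only a sketch: it assumes that every pair contributes two full-length 100-bp reads, which ignores any trimming, and the function name is hypothetical.

```python
def estimated_clean_gb(read_pairs: float, read_length_bp: int) -> float:
    """Approximate sequence yield in Gb for paired-end data.

    Assumption: each pair contributes two reads of the full read length.
    """
    return read_pairs * 2 * read_length_bp / 1e9


# NIV 30-d-roots: ~43M pairs of 100-bp reads, reported as ~8.61 Gb
print(f"{estimated_clean_gb(43e6, 100):.2f} Gb")
```

Because the read counts in the table are rounded to the nearest million, the estimate can only be expected to agree with the reported values to within a few percent.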
Table S10: Summary of transcript assemblies for five AA-genome Oryza species.
Species
30-d-roots
Panicles at
Flag leaves at
booting stage
booting stage
30-d-shoots
Total
NIV
35858
42172
41016
40394
52801
GLA
34928
36930
39274
34243
49453
BAR
27249
39342
34773
32471
46834
GLU
40184
42065
40460
35352
54602
MER
34391
39311
36380
29854
48949
Table S11: Summary of gene annotation for six AA-genome Oryza species.
Type | Feature | SAT | NIV | GLA | BAR | GLU | MER
Gene Features | Number of protein-coding genes | 39045 | 41490 | 41476 | 42283 | 41605 | 39106
Gene Features | Average gene length (bp) | 2853 | 2243 | 2258 | 2175 | 2208 | 2170
Gene Features | Average exon/intron length (nt) | 318/418 | 282/416 | 291/411 | 285/408 | 283/411 | 275/408
Gene Features | Average gene density (Kb per gene) | 9.5 | 9.0 | 8.0 | 8.4 | 8.1 | 8.7
Function annotation | # Pfam | 32102 | 25304 | 25862 | 26024 | 25862 | 24844
Function annotation | # Interpro | 32369 | 23125 | 23647 | 23881 | 23657 | 22827
Function annotation | # Gene Ontology | 23032 | 17541 | 17947 | 18189 | 17951 | 17268
Table S12: Validation of gene model predictions at transcript and gene levels.
Category | NIV Number | NIV % | GLA Number | GLA % | BAR Number | BAR % | GLU Number | GLU % | MER Number | MER %
Transcript Level
Total predicted gene models | 49432 | -- | 49285 | -- | 49885 | -- | 49002 | -- | 45159 | --
Protein supported (a) | 35360 | 71.5 | 35724 | 72.5 | 34637 | 69.4 | 34000 | 69.4 | 27403 | 60.7
EST supported (b) | 22713 | 46.0 | 22685 | 46.0 | 22031 | 44.2 | 22272 | 45.5 | 18159 | 40.2
RNA-Seq supported (c) | 26516 | 53.6 | 28153 | 57.1 | 26137 | 52.4 | 28969 | 59.1 | 24902 | 55.1
Protein, RNA-Seq, and EST supported | 18110 | 36.6 | 18991 | 38.5 | 7991 | 16.0 | 18791 | 38.4 | 7134 | 15.8
Protein, RNA-Seq or EST supported | 40177 | 81.3 | 40147 | 81.5 | 39088 | 78.4 | 39347 | 80.3 | 33578 | 74.4
Gene Level
Total predicted genes | 41490 | -- | 41479 | -- | 42283 | -- | 41605 | -- | 39106 | --
Protein supported | 28750 | 69.3 | 29057 | 70.0 | 28276 | 66.9 | 27708 | 66.6 | 22683 | 58.0
EST supported | 16952 | 40.9 | 17001 | 41.0 | 16609 | 39.3 | 16837 | 40.5 | 14115 | 36.1
RNA-Seq supported | 19967 | 48.1 | 21510 | 51.9 | 19760 | 46.7 | 22559 | 54.2 | 19813 | 50.7
Protein, RNA-Seq and EST supported | 13161 | 31.7 | 13912 | 33.5 | 7149 | 16.9 | 13930 | 33.5 | 11345 | 29.0
Protein, RNA-Seq or EST supported | 32680 | 78.8 | 32729 | 78.9 | 31917 | 75.5 | 32269 | 77.6 | 27957 | 71.5
(a) Protein supported criterion: identity ≥ 30%; coverage ≥ 90%.
(b) EST supported criterion: identity ≥ 95%; coverage ≥ 90%.
(c) RNA-Seq supported criterion is based on the Cufflinks prediction.
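The protein and EST support criteria in the footnotes of Table S12 are simple threshold tests on alignment identity and coverage. A minimal sketch of how such a filter could be expressed is given below; the `Alignment` class and function names are hypothetical, and the sketch says nothing about which aligner produced the identity and coverage values or how coverage was measured.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical container for one alignment of evidence to a gene model."""
    identity: float  # percent identity, 0-100
    coverage: float  # percent of the gene model covered, 0-100


# Thresholds taken from the footnotes of Table S12
PROTEIN_MIN_IDENTITY = 30.0
EST_MIN_IDENTITY = 95.0
MIN_COVERAGE = 90.0


def is_protein_supported(aln: Alignment) -> bool:
    return aln.identity >= PROTEIN_MIN_IDENTITY and aln.coverage >= MIN_COVERAGE


def is_est_supported(aln: Alignment) -> bool:
    return aln.identity >= EST_MIN_IDENTITY and aln.coverage >= MIN_COVERAGE


example = Alignment(identity=42.0, coverage=93.5)
print(is_protein_supported(example), is_est_supported(example))
```

A gene model counted under "Protein, RNA-Seq or EST supported" would pass at least one such test, while "Protein, RNA-Seq and EST supported" requires all three lines of evidence.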
Table S13: Summary of non-coding RNA genes in the six AA-genome Oryza species.
Type | Metric | SAT | NIV | GLA | BAR | GLU | MER
tRNA | Number | 726 | 551 | 491 | 588 | 621 | 598
tRNA | Total length (bp) | 53989 | 40949 | 36395 | 43625 | 46314 | 44227
tRNA | Average length (bp) | 74 | 74 | 74 | 74 | 75 | 74
tRNA | % of genome | 0.014% | 0.010% | 0.010% | 0.012% | 0.013% | 0.011%
rRNA(8S) | Number | 1 | 8 | 22 | 14 | 24 | 54
rRNA(8S) | Total length (bp) | 115 | 935 | 2485 | 1729 | 2645 | 6071
rRNA(8S) | Average length (bp) | 115 | 117 | 113 | 124 | 110 | 112
rRNA(8S) | % of genome | 0.000% | 0.000% | 0.000% | 0.000% | 0.001% | 0.002%
rRNA(18S) | Number | 17 | 1 | 1 | - | 3 | 5
rRNA(18S) | Total length (bp) | 27249 | 1421 | 1479 | - | 5868 | 8226
rRNA(18S) | Average length (bp) | 1603 | 1421 | 1479 | - | 1956 | 1645
rRNA(18S) | % of genome | 0.007% | 0.000% | 0.000% | - | 0.002% | 0.002%
rRNA(28S) | Number | 11 | 1 | - | 2 | 2 | 2
rRNA(28S) | Total length (bp) | 46696 | 6758 | - | 10261 | 5207 | 8194
rRNA(28S) | Average length (bp) | 4245 | 6728 | - | 5131 | 2604 | 4097
rRNA(28S) | % of genome | 0.012% | 0.002% | - | 0.003% | 0.002% | 0.002%
snoRNA | Number | 317 | 259 | 232 | 229 | 194 | 195
snoRNA | Total length (bp) | 33604 | 24762 | 22241 | 22159 | 18772 | 19179
snoRNA | Average length (bp) | 106 | 96 | 96 | 97 | 97 | 98
snoRNA | % of genome | 0.009% | 0.006% | 0.006% | 0.006% | 0.005% | 0.005%
snRNA | Number | 112 | 116 | 121 | 120 | 125 | 117
snRNA | Total length (bp) | 15797 | 16486 | 16949 | 16778 | 17690 | 16274
snRNA | Average length (bp) | 141 | 142 | 140 | 140 | 142 | 139
snRNA | % of genome | 0.004% | 0.004% | 0.005% | 0.004% | 0.005% | 0.004%
miRNA | Number | 366 | 276 | 271 | 276 | 263 | 251
miRNA | Total length (bp) | 53869 | 36663 | 35966 | 36133 | 34585 | 33169
miRNA | Average length (bp) | 147 | 133 | 133 | 131 | 132 | 132
miRNA | % of genome | 0.014% | 0.009% | 0.010% | 0.010% | 0.009% | 0.009%
Table S14: Summary of the annotated repeat sequences in the six AA-genome Oryza species.
Each cell gives the length in bp followed by the percentage (%) in parentheses, unless the row label states another unit.
Transposable Elements | SAT | NIV | GLA | BAR | GLU | MER
DNA transposons | 20090983 (5.16) | 24090389 (6.1) | 24505071 (6.62) | 24693858 (6.57) | 25371900 (6.93) | 24994918 (6.44)
  En-Spm | 2805006 (0.72) | 2584241 (0.65) | 2526039 (0.68) | 2636283 (0.7) | 2779125 (0.76) | 3029844 (0.78)
  Harbinger | 97011 (0.02) | 143956 (0.04) | 131518 (0.04) | 142526 (0.04) | 129170 (0.04) | 135463 (0.03)
  Helitron | 2148771 (0.55) | 2131002 (0.54) | 1913359 (0.52) | 2107114 (0.56) | 1982534 (0.54) | 2045802 (0.53)
  Maverick | 7905 (0.002) | 11769 (0.003) | 12137 (0.003) | 11168 (0.003) | 14128 (0.004) | 8684 (0.002)
  MuDR | 4559725 (1.17) | 5448015 (1.38) | 5738379 (1.55) | 5676209 (1.51) | 5828267 (1.59) | 5679693 (1.46)
  TcMar-Stowaway | 3936841 (1.01) | 5180210 (1.31) | 5414113 (1.46) | 5372359 (1.43) | 5513758 (1.51) | 5332976 (1.37)
  Tourist | 2566960 (0.66) | 3439971 (0.87) | 3502963 (0.95) | 3423993 (0.91) | 3545358 (0.97) | 3442465 (0.89)
  hAT | 1380939 (0.35) | 1794311 (0.45) | 1831556 (0.5) | 1828443 (0.49) | 2000529 (0.55) | 1846473 (0.48)
  z-other | 2587825 (0.67) | 3356914 (0.85) | 3435007 (0.93) | 3495763 (0.93) | 3579031 (0.98) | 3473518 (0.9)
RNA Transposons | 122536262 (31.5) | 75583313 (19.14) | 73877795 (19.97) | 77179173 (20.53) | 73824245 (20.17) | 80558927 (20.76)
Non-LTR Retrotransposons | 3629501 (0.93) | 4497448 (1.14) | 4714383 (1.27) | 4651013 (1.24) | 4636697 (1.27) | 4674008 (1.2)
  LINE | 2588878 (0.67) | 3152909 (0.8) | 3329039 (0.9) | 3278464 (0.87) | 3238994 (0.88) | 3258973 (0.84)
  SINE | 1040623 (0.27) | 1344539 (0.34) | 1385344 (0.37) | 1372549 (0.37) | 1397703 (0.38) | 1415035 (0.36)
LTR Retrotransposons | 118906761 (30.57) | 71085865 (18) | 69163412 (18.69) | 72528160 (19.29) | 69187548 (18.9) | 75884919 (19.56)
  Copia | 27060173 (6.96) | 15551090 (3.94) | 15320520 (4.14) | 16201531 (4.31) | 15029060 (4.11) | 18619349 (4.8)
  Gypsy | 38114939 (9.8) | 17698081 (4.48) | 16557837 (4.48) | 17591545 (4.68) | 17054732 (4.66) | 18149396 (4.68)
  Other* | 53731649 (13.81) | 37836694 (9.58) | 37285055 (10.08) | 38735084 (10.3) | 37103756 (10.14) | 39116174 (10.08)
Other Repeats | 10499690 (2.7) | 9562772 (2.42) | 10220690 (2.76) | 10288062 (2.74) | 10276229 (2.81) | 9972115 (2.57)
  Low complexity | 2302220 (0.59) | 1386516 (0.35) | 1430813 (0.39) | 1492140 (0.4) | 1376953 (0.38) | 1376072 (0.35)
  Satellite | 139144 (0.04) | 151338 (0.04) | 166224 (0.04) | 161311 (0.04) | 181810 (0.05) | 180798 (0.05)
  Simple repeat | 2598175 (0.67) | 1142940 (0.29) | 1308776 (0.35) | 1403625 (0.37) | 1371482 (0.37) | 1152710 (0.3)
  Unknown | 5460151 (1.4) | 6881978 (1.74) | 7314877 (1.98) | 7230986 (1.92) | 7345984 (2.01) | 7262535 (1.87)
Total | 153126935 (39.36) | 109236474 (27.65) | 108603556 (29.35) | 112161093 (29.83) | 109472374 (29.91) | 115525960 (29.77)
Genome Size (Mb) | 389 | 395 | 370 | 376 | 366 | 388
Assembled Length (No Gaps) (Mb) | 372.23 | 339.76 | 327.87 | 318.23 | 315.64 | 321.99
Gap Percentage (%) | 4.31 | 13.98 | 11.39 | 15.36 | 13.76 | 17.01
* Indicates LTR retrotransposons that could not be classified into Gypsy or Copia superfamilies.
Table S15: Summary of types and number of simple sequence repeats in the six AA-genome Oryza
species.
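Types | Metric | SAT | NIV | GLA | BAR | GLU | MER
Monomer | Subtype number | 2 | 2 | 2 | 2 | 2 | 2
Monomer | Number (n ≥ 12) | 16423 | 13559 | 16518 | 16176 | 13691 | 13416
Dimer | Subtype number | 4 | 4 | 4 | 4 | 4 | 4
Dimer | Number (n ≥ 6) | 37254 | 24429 | 25290 | 23925 | 26082 | 24057
Trimer | Subtype number | 10 | 10 | 10 | 10 | 10 | 10
Trimer | Number (n ≥ 4) | 82067 | 53647 | 60912 | 56947 | 59595 | 53871
Tetramer | Subtype number | 33 | 33 | 33 | 33 | 33 | 33
Tetramer | Number (n ≥ 3) | 49408 | 43198 | 44119 | 44041 | 43655 | 43520
Pentamer | Subtype number | 102 | 102 | 102 | 102 | 102 | 102
Pentamer | Number (n ≥ 3) | 17023 | 13432 | 13741 | 13413 | 13876 | 13016
Hexamer | Subtype number | 335 | 323 | 329 | 327 | 328 | 329
Hexamer | Number (n ≥ 3) | 10221 | 7086 | 7634 | 7140 | 7719 | 6948
Total Number | | 212396 | 155351 | 168214 | 161642 | 164618 | 154828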
Table S16: Summary of the 1:1 orthologous genes with high confidence among seven Oryza genomes.
 | OrthoMCL (Raw) | OrthoMCL (No change) | Reads mapping (coverage = 1; 34.2 ≤ depth ≤ 68.4) | Synteny (SAT vs. BRA) | Length (length ≥ 300 aa)
# 1:1 orthologous | 8276 | 7692 | 4224 | 3845 | 2305
# Filtered | --- | 584 (a) | 3468 (b) | 379 | 1540
Note: Raw: e-value ≤ 1e-5; coverage ≥ 50%; I = 1.5.
No change: the number of 1:1 orthologs that did not change even when e-values were varied from 1e-1 to 1e-30 and I parameters were varied from 1.5 to 5.
(a) Gene sets that change with the adjustment of e-values and MCL I parameters.
(b) Gene sets with read depth < 34.2 and coverage < 100%.
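The "# Filtered" values in Table S16 are the differences between successive filtering steps. The short sketch below recomputes them from the retained counts; the step labels are copied from the table, and the function name is hypothetical.

```python
# Number of 1:1 orthologs retained after each successive filter (Table S16)
retained = [
    ("OrthoMCL (Raw)", 8276),
    ("OrthoMCL (No change)", 7692),
    ("Reads mapping", 4224),
    ("Synteny (SAT vs. BRA)", 3845),
    ("Length (>= 300 aa)", 2305),
]


def filtered_per_step(steps):
    """Return the number of genes removed at each step after the first."""
    counts = [count for _, count in steps]
    return [previous - current for previous, current in zip(counts, counts[1:])]


for (name, _), removed in zip(retained[1:], filtered_per_step(retained)):
    print(f"{name}: {removed} filtered")
```

The final count of 2,305 genes corresponds to the 2,305 single-copy loci referred to in the footnote of Table S27.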
Table S17: Coverage of orthologous genomic regions in six AA-genome species.
Species | Size of genome (bp) | Length of aligned region (bp) | Coverage | Gene number | Average gene number per block
SAT | 372317567 | 317005864 | 85.1% | 40435 | 13.6
NIV | 375012451 | 283915139 | 75.7% | 39258 | 13.2
GLA | 344859159 | 276440309 | 80.2% | 39902 | 13.4
BAR | 335092604 | 270918152 | 80.8% | 39182 | 13.2
GLU | 334669472 | 267174953 | 79.8% | 39258 | 13.2
MER | 340778775 | 254573564 | 74.7% | 36106 | 12.2
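The coverage column of Table S17 is consistent with the aligned length divided by the genome size. A minimal sketch of that calculation follows, using values copied from the table; the function name is hypothetical and the definition of coverage is inferred from the numbers rather than stated in the text.

```python
def coverage_percent(aligned_bp: int, genome_bp: int) -> float:
    """Share of the genome covered by orthologous aligned regions, in percent."""
    if genome_bp <= 0:
        raise ValueError("genome_bp must be positive")
    return 100.0 * aligned_bp / genome_bp


# (genome size in bp, aligned length in bp) from Table S17
regions = {
    "SAT": (372317567, 317005864),
    "NIV": (375012451, 283915139),
    "MER": (340778775, 254573564),
}

for species, (genome_bp, aligned_bp) in regions.items():
    print(f"{species}: {coverage_percent(aligned_bp, genome_bp):.1f}%")
```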
Table S18: Proportions of sequence lengths of orthologous genomic segments across six AA-genome
Oryza species.
Length (Kb) | Number | Percentage (%) | Length (Kb) | Number | Percentage (%)
0-20 | 535 | 19.17 | 120-140 | 137 | 4.91
20-40 | 429 | 15.37 | 140-160 | 114 | 4.08
40-60 | 334 | 11.97 | 160-180 | 85 | 3.05
60-80 | 269 | 9.64 | 180-200 | 103 | 3.69
80-100 | 202 | 7.24 | > 200 | 591 | 21.18
100-120 | 172 | 6.16 | Total | 2971 | 100.00
Table S19: Genomic divergence between SAT and the other five AA-genome Oryza species.
Pairwise Species | Protein sequences: dS | Protein sequences: dN | Orthologous Singletons | Orthologous Genomic Segments
SAT - NIV | 0.0075 | 0.0031 | 0.0047 | 0.0114
SAT - GLA | 0.0102 | 0.0033 | 0.0055 | 0.0141
SAT - BAR | 0.0107 | 0.0037 | 0.0054 | 0.0129
SAT - GLU | 0.0114 | 0.0039 | 0.0065 | 0.0150
SAT - MER | 0.0316 | 0.0079 | 0.0154 | 0.0414
SAT - BRA | 0.2299 | 0.0494 | 0.1042 | 0.1452
Table S20: General features of GS5 genomic regions in the six AA-genome Oryza species.
Sequence Features | SAT | NIV | GLA | BAR | GLU | MER
Genome size (Mb) | 389 | 395 | 370 | 376 | 366 | 388
Sequence length (Kb) | 31.5 | 31.3 | 32.1 | 32.7 | 34.2 | 34.5
Number of intact genes | 3 | 3 | 3 | 3 | 3 | 3
Gene density (genes/10 Kb) | 1.0 | 1.0 | 0.9 | 0.9 | 0.9 | 0.9
GC content (%) | 37.1 | 36.3 | 36.0 | 35.6 | 35.6 | 36.0
DNA transposon length (Kb) | 1.8 | 2.3 | 2.7 | 2.5 | 3.4 | 2.5
Retrotransposon length (Kb) | 3.6 | 3.9 | 5.0 | 5.0 | 6.1 | 5.4
Genic region (%) (a) | 46.9 | 44.8 | 44.7 | 48.2 | 44.3 | 43.0
TE region (%) (b) | 17.1 | 15.7 | 24.0 | 22.9 | 27.8 | 22.9
(a) Introns and untranslated regions are included.
(b) Simple sequence repeats and low-complexity regions are excluded.
Table S21: Composition and classification of TEs in GS5 genomic regions in the six AA-genome Oryza species.
SAT
#
Length
(bp)
NIV
#
Length
(bp)
GLA
#
Length
(bp)
BAR
#
Length
(bp)
GLU
#
Length
(bp)
MER
#
Length
(bp)
Class I:
LTR
LTR-copia
LTR-gypsy
LINE
Class II:
DNA
DNA-tourist
DNA-stowaway
DNA-MuDR
DNA-hAT
DNA-hAT-Ac
Unknown
9
2
4
1
1595
667
921
443
14
2
4
1
1894
667
950
444
19
2
1
1
3323
1131
111
447
15
2
2
1
3219
1134
166
447
21
2
/
1
4525
1122
/
446
16
3
4
/
3475
999
919
/
/
2
6
1
1
/
1
/
130
1046
328
172
/
159
2
2
6
4
1
/
1
237
130
1004
640
172
/
170
2
1
10
1
/
/
2
240
154
1658
332
/
/
391
2
1
8
1
/
/
2
240
154
1397
332
/
/
391
2
2
10
2
/
/
4
240
505
1655
405
/
/
616
1
2
7
1
/
1
4
115
129
1148
81
/
333
736
Table S22: Sites under positive selection detected by Bayes Empirical Bayes (BEB) analysis of the
GS5 gene.
Site | Amino acid | P (ω>1) | Posterior mean±SE for ω
1 | V | 1.000** | 10.048±0.721
2 | Q | 1.000** | 10.047±0.727
3 | F | 0.999** | 10.043±0.752
6 | Y | 0.923 | 9.343±2.529
10 | D | 0.879 | 8.940±3.049
11 | E | 0.847 | 8.624±3.404
12 | R | 0.707 | 7.278±4.363
13 | H | 0.916 | 9.281±2.620
15 | R | 0.859 | 8.733±3.316
16 | A | 0.642 | 6.649±4.583
17 | L | 0.720 | 7.395±4.306
23 | E | 0.900 | 9.135±2.816
25 | L | 0.509 | 5.377±4.771
26 | W | 0.993** | 9.985±1.045
27 | L | 0.509 | 5.377±4.771
28 | N | 0.994** | 9.991±1.015
29 | G | 0.969 | 9.765±1.735
182 | I | 0.934 | 9.446±2.367
*: P > 95%; **: P > 99%.
Table S23: General features of PROG1 genomic regions in the six AA-genome Oryza species.
Sequence Features | SAT | NIV | GLA | BAR | GLU | MER
Genome size (Mb) | 389 | 395 | 370 | 376 | 366 | 388
Sequence length (Kb) | 279.0 | 253.3 | 201.6 | 192.5 | 226.8 | 196.3
Number of intact genes | 42 | 26 | 27 | 21 | 28 | 22
Gene density (genes/10 Kb) | 1.5 | 1.0 | 1.3 | 1.1 | 1.2 | 1.1
GC content (%) | 42.3 | 40.5 | 41.2 | 40.5 | 41.0 | 40.6
DNA transposon length (Kb) | 26.1 | 24.0 | 24.1 | 24.4 | 22.2 | 22.8
Retrotransposon length (Kb) | 39.0 | 31.0 | 19.9 | 17.1 | 24.1 | 10.9
Genic region (%) (a) | 38.67 | 28.65 | 32.43 | 28.45 | 30.53 | 26.70
TE region (%) (b) | 23.33 | 21.71 | 21.83 | 21.56 | 20.41 | 17.17
(a) Introns and untranslated regions are included.
(b) Simple sequence repeats and low-complexity regions are excluded.
Table S24: Composition and classification of TEs in PROG1 genomic regions in the six AA-genome
Oryza species.
SAT
NIV
#
Length
(bp)
#
Length
(bp)
LTR-copia
LTR-gypsy
LINE
SINE
unclassified
Class II:DNA TE
4
11
6
3
10
1144
22020
1310
692
12928
2
16
14
7
12
574
21588
2427
1520
4436
DNA-En-Spm
DNA-tourist
DNA-stowaway
DNA-MuDR
DNA-hAT
DNA-hAT-Ac
Unclassified
13
19
25
10
1
4
16
8582
4023
4974
4173
209
1035
3094
5
18
40
15
1
6
16
1909
3613
7024
5655
209
1976
2983
GLA
#
BAR
GLU
MER
Length
(bp)
#
Length
(bp)
#
Length
(bp)
#
Length
(bp)
3
7
6
3
9
1626
3771
1316
467
12715
2
10
7
5
8
872
10853
1343
1159
2867
7
15
12
6
8
1913
13954
4588
798
2744
3
2
6
2
11
932
3370
1061
503
4990
9
21
33
15
1
3
12
2997
4036
5166
8771
209
879
1923
7
23
35
11
1
5
12
2597
4487
5287
7942
209
1513
2266
4
18
36
20
1
5
18
2436
3999
6286
4471
199
1498
3326
12
11
40
16
1
4
10
4030
1799
7266
4537
207
1024
1569
Class I: RNA TE
Table S25: Statistics for structural variation in the five assembled rice genomes*.
Columns 2 to 5 give Number / Length (Mb); columns 6 and 7 give Total Length (Mb); columns 8 and 9 give Max. length (bp).
Genome | Putative Indels (≤ 50 bp): Insertion ** | Putative Indels (≤ 50 bp): Deletion | Putative Indels (> 50 bp): Insertion ** | Putative Indels (> 50 bp): Deletion | Total Length: Insertion *** | Total Length: Deletion | Max. length: Insertion | Max. length: Deletion
NIV | 194,145 (255) / 1.16 (0.05) | 211,676 / 1.33 | 38,755 (4320) / 28.26 (6.10) | 29,252 / 8.83 | 29.43 (6.15) | 10.16 | 43,019 | 19,347
GLA | 256,511 (361) / 1.53 (0.04) | 269,243 / 1.61 | 44,761 (4351) / 33.82 (5.33) | 31,993 / 10.37 | 35.36 (5.37) | 11.96 | 43,487 | 40,230
BAR | 253,728 (535) / 1.54 (0.10) | 259,575 / 1.55 | 42,382 (4853) / 31.60 (5.32) | 32,809 / 10.48 | 33.14 (5.42) | 12.02 | 41,786 | 20,658
GLU | 260,632 (714) / 1.58 (0.26) | 262,587 / 1.58 | 46,621 (5880) / 33.73 (7.82) | 31,472 / 10.92 | 35.31 (8.07) | 12.50 | 47,650 | 15,141
MER | 438,856 (448) / 2.73 (0.06) | 480,954 / 2.94 | 76,068 (10126) / 52.88 (9.28) | 58,072 / 19.18 | 55.61 (9.34) | 22.12 | 46,275 | 20,829
* Putative indels were extracted from the refined sequence alignments with the SAT genome as a reference. Minimum length of indels detected in all five genomes is
1 bp (not shown in this table). Insertions are considered as gap-related insertions if these loci overlapped with inter/intra-contig gaps. As the lengths of assembly gaps
cannot be exactly calculated, we excluded gaps and counted non-N sequences of all insertions including gap-related ones.
** Total No. of insertions (No. of gap-related insertions) / Total length of non-N insertion sequences (total predicted length of gaps).
*** Total length of non-N insertion sequences (total predicted length of gaps).
Table S26: Structural variants across the six AA-genome Oryza species.
Short ( ≤ 50bp)
Long ( > 50bp)
Pattern*
Regular**
Random***
Total
Deletion
Insertion
Deletion
Insertion
Deletion
Insertion
000001
37667
70690
383148
348299
420815
418989
000010
11504
36979
132876
121122
144380
158101
000100
7892
25008
45130
40123
53022
65131
001000
8499
22784
38522
39827
47021
62611
010000
12846
32439
118434
97675
131280
130114
100000
835
2323
18807
16123
19642
18446
001100
6209
10867
103392
83601
109601
94468
110000
646
2204
15229
13673
15875
15877
111100
1382
4470
20183
22948
21565
27418
001110
2444
2568
25866
30790
28310
33358
011110
1450
1323
18740
23074
20190
24397
011100
1347
983
12724
14429
14071
15412
010010
2048
939
11658
11186
13706
12125
000110
1456
667
4993
4807
6449
5474
011000
1610
319
2822
2615
4432
2934
011010
539
127
1652
1687
2191
1814
110010
689
2276
11027
13011
11716
15287
101110
661
3124
8320
12228
8981
15352
111010
213
1808
1989
3116
2202
4924
100010
262
909
4595
4776
4857
5685
101100
160
1025
3650
4514
3810
5539
110110
233
2093
1699
2688
1932
4781
111000
71
649
1221
1451
1292
2100
101010
31
549
678
852
709
1401
101011
387
1477
3255
3411
3642
4888
101000
48
395
869
1075
917
1470
100100
39
357
715
838
754
1195
110100
73
678
1129
1175
1202
1853
100110
43
635
550
695
593
1330
101001
171
593
2198
2074
2369
2667
110101
600
1397
4144
4070
4744
5467
* '1' stands for the orthologous presence in a species, and '0' signifies the absence. The order of the digits
follows the phylogenetic relationships in the species tree (SAT/NIV/BAR/GLA/GLU/MER). For
example, '111101' indicates that all species shared the same indel except GLU.
** The regular type indicates that the insertion/deletion event is clearly supported by robust phylogenetic
relationships of the six studied species.
*** The random type includes all other insertion and deletion events that cannot be located by any
phylogenetic relationships.
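The six-digit presence/absence patterns used in Table S26 can be decoded mechanically. The sketch below assumes the digit order given in the footnote (SAT/NIV/BAR/GLA/GLU/MER); the function name is hypothetical.

```python
# Digit order as stated in the footnote of Table S26
SPECIES_ORDER = ("SAT", "NIV", "BAR", "GLA", "GLU", "MER")


def decode_pattern(pattern: str) -> list:
    """Return the species in which the indel is present ('1') for a given pattern."""
    if len(pattern) != len(SPECIES_ORDER) or set(pattern) - {"0", "1"}:
        raise ValueError(f"expected a string of six 0/1 digits, got {pattern!r}")
    return [species for species, bit in zip(SPECIES_ORDER, pattern) if bit == "1"]


print(decode_pattern("111101"))  # the footnote's example: every species except GLU
print(decode_pattern("001100"))  # shared by BAR and GLA only
```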
Table S27: Summary statistics for the six rice genome libraries.
Genome | Genome size (Mb) | Raw data (Gbp) | Number of reads | Raw coverage (×) | Number of filtered reads * | Read-depth per 5-kb window: Average ** | Read-depth per 5-kb window: STDEV | Average insert size | Insert size: Standard Deviation | Discordant reads: Count | Discordant reads: %
SAT | 389 | 9.4 | 125,393,628 | 24.2 | 106,784,472 | 990.85 | 97.38 | 191.36 | 20.25 | 309,843 | 0.29
NIV | 395 | 7.3 | 72,738,382 | 18.4 | 58,145,400 | 692.4 | 129.48 | 350.9 | 17.84 | 95,022 | 0.16
GLA | 370 | 4.3 | 42,516,398 | 11.5 | 33,321,668 | 387.94 | 73.91 | 339.63 | 19.03 | 41,116 | 0.12
BAR | 376 | 5.2 | 51,388,836 | 13.8 | 40,317,714 | 413.11 | 84.32 | 451.82 | 47.67 | 176,069 | 0.44
GLU | 366 | 6.7 | 67,222,990 | 18.4 | 49,042,454 | 418.78 | 95.04 | 399.35 | 40.85 | 189,673 | 0.39
MER | 388 | 4.4 | 43,637,412 | 11.2 | 35,018,194 | 250.36 | 85.9 | 260.43 | 18.04 | 32,715 | 0.09
* Reads mapped against the SAT reference genome (MSU v7.0) using mrFAST; total number of mapped reads after filtering for quality scores (Phred quality > 20) and removing PCR duplicates.
** The average read depth and variance for 5-kb (unmasked) regions were determined using 2,305 single-copy loci.
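Two derived columns of Table S27 can be checked from the raw columns. The sketch below assumes that raw coverage is the raw data volume divided by the estimated genome size, and that the discordant percentage is the discordant read count divided by the number of filtered reads; both definitions are inferred from the tabulated values, and the function names are hypothetical.

```python
def raw_coverage(raw_data_gbp: float, genome_size_mb: float) -> float:
    """Sequencing depth (x) as raw bases divided by genome size."""
    return raw_data_gbp * 1000.0 / genome_size_mb


def discordant_percent(discordant_reads: int, filtered_reads: int) -> float:
    """Share of filtered reads flagged as discordant, in percent."""
    return 100.0 * discordant_reads / filtered_reads


# SAT row of Table S27: 9.4 Gbp raw data, 389 Mb genome,
# 309,843 discordant reads out of 106,784,472 filtered reads
print(f"coverage: {raw_coverage(9.4, 389):.1f}x")
print(f"discordant: {discordant_percent(309_843, 106_784_472):.2f}%")
```

Because the raw data volumes are rounded to one decimal place, recomputed coverages may differ from the tabulated ones in the last digit.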
Table S28: Number of gene models excluding TEs with the longest isoform* of the six AA-genome
Oryza species.
Datasets | SAT | NIV | GLA | BAR | GLU | MER | Total
Total number of annotated genes | 56101 | 41490 | 41476 | 42283 | 41605 | 39106 | 205963
TE-related genes | 17272 | 4515 | 4272 | 4353 | 4217 | 4309 | 38938
Total number of genes for gene family analysis | 38829 | 36975 | 37204 | 37930 | 37388 | 34797 | 223123
* Longest isoform means the longest protein isoform for each predicted gene.
Table S29: Number of genes and gene families for seven Oryza species.
Data | Families/Genes | SAT | NIV | GLA | BAR | GLU | MER | BRA | Total
Total (a) | Families | 30287 | 27628 | 29395 | 28767 | 27452 | 23645 | 17874 | 39293
Total (a) | Genes | 35819 | 31165 | 32873 | 32554 | 31047 | 26845 | 21415 | 211718
Putative annotation artifacts (b) | Families | 3010 | 5810 | 5376 | 4331 | 6341 | 7952 | 8963 | 41783
Lineage-specific (c) | Families | 126 | 68 | 34 | 71 | 75 | 134 | 311 | 21730
Likelihood Analysis | Families | 30161 | 27560 | 29361 | 28696 | 27377 | 23511 | 17563 | 17563
Likelihood Analysis | Genes | 35471 | 31019 | 32801 | 32202 | 30895 | 26571 | 20651 | 130636
Likelihood Analysis | Genes/Family | 1.12 | 1.13 | 1.12 | 1.12 | 1.13 | 1.13 | 1.18 | ---
(a) Total number of gene families inferred from OrthoMCL clustering.
(b) Gene families without either paralogous or orthologous genes among the seven rice genomes.
(c) Gene families that were not inferred to be present in the common ancestor of all seven rice genomes.
Table S30: Number of putative NBS-LRR genes identified in the seven Oryza species.
Domains | SAT | NIV | GLA | BAR | GLU | MER | BRA
CC-NBS | 56 | 64 | 63 | 61 | 43 | 53 | 22
CC-NBS-LRR | 252 | 133 | 121 | 136 | 129 | 116 | 104
NBS-LRR | 227 | 178 | 158 | 161 | 125 | 136 | 132
TIR-NBS | 1 | 1 | 1 | 1 | 1 | 1 | 1
NBS only | 95 | 113 | 107 | 117 | 94 | 110 | 48
Total | 631 | 489 | 450 | 476 | 392 | 416 | 307
Table S31: Number of WRKY domains and predicted genes of the seven Oryza species in WRKY
Groups I–III. Numbers of predicted genes are given in parentheses. Note: ATH indicates Arabidopsis
thaliana, SB indicates Sorghum bicolor, and ZM indicates Zea mays.
WRKY Group | SAT | NIV | GLA | BAR | GLU | MER | BRA | SB | ZM | ATH
I_N | 20 | 8 | 15 | 10 | 11 | 8 | 11 | 10 | 36 | 19
I_C | 20 | 8 | 15 | 10 | 11 | 8 | 11 | 10 | 36 | 19
II | 72 | 57 | 47 | 43 | 48 | 41 | 53 | 57 | 129 | 53
III | 36 | 27 | 26 | 28 | 34 | 34 | 23 | 27 | 37 | 16
Total | 149 (128) | 100 (92) | 103 (88) | 91 (81) | 104 (93) | 91 (83) | 98 (87) | 104 (94) | 239 (202) | 107 (88)
Table S32: Number of predicted MADS-box gene families in the seven Oryza species.
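Category | SAT | NIV | GLA | BAR | GLU | MER | BRA
Mα | 12 | 13 | 14 | 13 | 13 | 13 | 11
Mβ | 9 | 5 | 8 | 8 | 9 | 7 | 6
Mγ | 11 | 7 | 7 | 8 | 5 | 9 | 4
Type I (subtotal) | 32 | 25 | 29 | 29 | 27 | 29 | 21
MIKCC | 40 | 41 | 39 | 38 | 40 | 38 | 36
MIKC* | 5 | 4 | 5 | 5 | 5 | 3 | 3
Type II (subtotal) | 45 | 45 | 44 | 43 | 45 | 41 | 39
Total | 77 | 70 | 73 | 72 | 72 | 70 | 60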
Table S33: Statistics for novel genes identified from the five sequenced rice
genomes in comparisons with SAT, based on read mapping.
 | NIV | GLA | BAR | GLU | MER
Genome size (Mb) | 395 | 370 | 376 | 366 | 388
No. of reads used for mapping (M)* | 82055 | 58077 | 50973 | 152646 | 86927
Fraction of mapped reads (%) | 83.93 | 76.53 | 81.63 | 71.35 | 63.88
No. of contigs > 2 Kb | 532 | 572 | 662 | 137 | 1262
No. of contigs mapped | 160 | 135 | 155 | 42 | 575
No. of contigs unmapped | 372 | 437 | 507 | 95 | 687
GC content (%) | 41.74 | 41.34 | 41.08 | 41.57 | 41.80
No. of genes (AUGUSTUS/Blast 100) | 177/111 | 172/88 | 187/114 | 39/12 | 248/165
Average length of novel genes (bp) | 992 | 958 | 927 | 722 | 972
Functionally annotated (%) | 30.6 | 45.5 | 36.0 | 33.3 | 39.4
* (M) indicates million.
Table S34: List of the thirty-one agronomically important genes screened in this study.
Genes | ID | Function description | References
SE5 | LOC_Os06g40080 | Confers photoperiodic control of flowering | Izawa et al. 2000
SD1 | LOC_Os01g66100 | Encodes a mutant enzyme involved in gibberellin synthesis | Monna et al. 2002
Gn1a | LOC_Os01g10110 | Causes cytokinin accumulation in inflorescence meristems and increases the number of reproductive organs, resulting in enhanced grain yield; regulates the number of secondary branches on primary branches at the panicle base | Ashikari et al. 2005
Adh1 | LOC_Os11g10480 | Represses coleoptile elongation under submergence | Saika et al. 2006
GH2 | DQ234272* | Encodes a primarily multifunctional cinnamyl-alcohol dehydrogenase | Zhang et al. 2006
qSH1 | LOC_Os01g62920 | A major quantitative trait locus of seed shattering in rice | Konishi et al. 2006
GW2 | LOC_Os02g14720 | Loss of GW2 function increases cell numbers, resulting in a larger (wider) spikelet hull, and accelerates the grain milk filling rate, resulting in enhanced grain width, weight and yield | Song et al. 2007
Ghd7 | LOC_Os07g15770 | Major effects on several traits in rice, including number of grains per panicle, plant height and heading date | Xue et al. 2008
GIF1 | LOC_Os04g33740 | Grain filling; encodes a cell-wall invertase required for carbon partitioning during early grain filling | Wang et al. 2008
GW5 | DQ991205* | A major QTL underlying rice width and weight; likely acts in the ubiquitin-proteasome pathway to regulate cell division during seed development | Weng et al. 2008
HWH1 | LOC_Os02g40840 | Encodes a GMC oxidoreductase; causes inter-sub-specific hybrid necrosis | Jiang et al. 2008
Phr1 | LOC_Os04g53300 | Phenol reaction (PHR) phenotype | Yu et al. 2008
PROG1 | LOC_Os07g05900 | Inactivation of PROG1 expression leads to erect growth, greater grain number and higher grain yield | Jin et al. 2008; Tan et al. 2008
qSW5 | LOC_Os05g09520 | A deletion in qSW5 resulted in a significant increase in sink size owing to an increase in cell number in the outer glume of the rice flower | Shomura et al. 2008
S5 | EU889293* | Encodes an aspartate protease; causes inter-sub-specific hybrid female sterility (embryo sac sterility) | Chen et al. 2008
SaF | EU337974* | Encodes an F-box protein; causes inter-sub-specific hybrid male sterility (pollen abortion) | Long et al. 2008
SaM | EU337977* | SUMO E3 ligase-like gene; causes inter-sub-specific hybrid male sterility (pollen abortion) | Long et al. 2008
BADH2 | LOC_Os08g32870 | Controls fragrance in the grain | Kovach et al. 2009
DEP1 | LOC_Os09g26999 | Reduces the length of the inflorescence internode, increasing the number of grains per panicle and consequently grain yield; erect panicle; controls the number of both primary branches and secondary branches on primary branches at the panicle top | Huang et al. 2009
MOC1 | LOC_Os06g40780 | Controls tillering | Lu et al. 2009
Sh4 | LOC_Os04g57530 | Responsible for the reduction of grain shattering | Zhang et al. 2009
GS3 | LOC_Os03g30450 | Functions as a negative regulator of grain size and organ size; regulates stigma length and participates in stigma exertion | Mao et al. 2010; Takano-Kai et al. 2011
OsSPL14 | LOC_Os08g39890 | Pleiotropic gene, affecting unproductive tiller number, grain yield per panicle and lodging resistance | Jiao et al. 2010; Miura et al. 2010
mtRPL27 | AB496674* | Nuclear-encoded mitochondrial ribosomal protein L27; mutation causes inter-specific hybrid male sterility (pollen abortion) | Yamagata et al. 2010
Rc | LOC_Os07g11020 | A positive regulator of proanthocyanidin; controls the color of pericarp | Gross et al. 2010
GS5 | LOC_Os05g06660 | Controls grain size by regulating grain width, filling and weight | Li et al. 2011
LAX2 | LOC_Os04g32510 | Encodes a novel nuclear protein and regulates the formation of axillary meristems | Tabuchi et al. 2011
Wx | LOC_Os06g04200 | SNP in intron affecting mRNA splicing | Su et al. 2011
Hd1 | LOC_Os06g16370 | Photoperiod pathway gene | Huang et al. 2012
qGW8 | LOC_Os08g41940 | Affects grain width and yield; negative regulator of panicle branching | Wang et al. 2012
TAWAWA1 | LOC_Os10g33780 | A regulator of rice inflorescence architecture; functions through the suppression of meristem phase transition | Yoshida et al. 2013
* Accession number is named according to the NCBI database.
Table S35: Computational identification of gain and loss of thirty-one agronomically-important
genes in the AA-genome Oryza species through whole genome read mapping.
Gene
SE5
SD1
Gn1a
Adh1
GH2
qSH1
GW2
Ghd7
GIF1
GW5
HWH1
Phr1
PROG1
qSW5
S5
SaF
SaM
BADH2
DEP1
MOC1
Sh4
GS3
OsSPL14
mtRPL27
Rc
GS5
LAX2
Wx
Hd1
qGW8
TAWAWA1
IND
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
TRJ
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
RUF
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
NIV
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
BAR
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
GLA
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
GLU
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
LON
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
MER
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
'+' means that the gene is present, and '-' indicates that the gene appears to have been lost.
Table S36: Expression patterns of the thirty-one agronomically-important genes in four tissues across the five AA-genome Oryza species. The numbers shown
are the expression values (FPKM, fragments per kilobase of transcript per million mapped reads) of these genes in the four different tissues.
NIV
B*
GLA
C*
BAR
GLU
C
MER
Tissues
A*
D*
A
B
C
D
A
B
D
A
B
C
D
Adh1
BADH2
DEP1
GH2
Ghd7
GIF1
Gn1a
GS3
GS5
GW2
GW5
HD1
HWH1
IPA1
LAX2
MOC1
mtRPL27
phr1
PROG1
prog1
qGW8
qSH1
qSW5
Rc
S5
SaF
SaM
SD1
SE5
Sh4
TAWAWA1
10299
8576
1652
16170
7483
7958
10197
33848
24514
11744
5898
31646
48738
26541
6542
28691
A
8336
15074
B
C
3329
18897
D
9828
2713
3005
4596
9967
3353
4111
4512
3059
1970
2220
3090
6144
3548
5279
5981
6389
1988
2879
3697
192
3907
228
749
211
1206
98
402
0
1991
56
687
178
2621
256
174
581
4110
283
521
164356
21255
1587
94373
163304
25088
5827
148769
112585
6631
7640
162260
87135
32170
34174
99385
130792
5351
20828
234470
0
55
563
0
0
14
0
0
50
13
3095
105
34
89
223
23
0
25
691
65
4280
693
207
5407
6068
1059
430
3695
1024
486
309
7599
4670
1416
464
11970
6898
479
190
6674
321
205
744
363
623
1114
3011
202
97
334
2389
495
325
386
834
608
38
227
356
314
284
170
1247
254
216
481
1400
88
163
538
213
91
97
174
442
249
892
2070
3265
1568
1572
1012
861
852
1259
1106
2572
1190
1519
1418
347
865
1273
1572
514
1147
1469
2340
1111
1130
1646
1723
4351
2354
1060
1112
1099
1162
825
1200
1070
1484
1457
1509
1752
966
1356
1735
2280
1595
0
41
0
0
0
53
0
37
0
0
0
0
0
0
0
0
0
0
0
0
935
380
378
515
357
508
2479
344
249
3514
2507
147
422
1037
2644
374
218
7789
857
759
13882
2403
4897
11719
12281
3073
13825
24659
5433
1429
1029
4021
3868
2016
2861
3262
12781
2611
4967
11840
1388
1887
1422
1116
1580
975
1167
1048
1411
1878
1089
1990
2134
2084
1029
562
1358
1379
1166
1085
1930
1128
5862
3923
276
855
603
1738
828
698
636
1385
907
1275
1056
1204
1099
546
1763
1995
943
558
979
595
1749
583
701
746
627
588
555
504
753
751
1144
999
931
916
1365
918
590927
191602
187425
387943
220387
105172
196974
120095
171031
193060
157190
190673
116696
80106
99259
150896
129312
116224
166221
383693
2659
14998
22488
4974
1199
21871
18912
2637
4893
2663
21673
2823
722
3969
19689
15560
560
1188
19397
6031
169
1990
1492
651
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
374
1002
0
0
0
0
0
0
0
0
0
242
0
0
295
0
0
0
0
256
222
100
1253
150
136
156
863
187
128
26
426
148
256
145
394
218
304
105
619
4454
4519
3059
7419
2647
5295
1997
8764
3950
4817
2101
14811
6485
6232
4576
4948
10269
5761
2586
9951
106
403
841
285
147
150
161
215
47
208
201
81
177
311
70
326
152
141
35
256
232
332
136
386
179
240
303
192
188
194
182
160
147
110
144
245
421
515
982
262
0
0
0
53
0
0
0
0
0
0
0
0
0
0
0
0
0
0
167
0
358
125
0
119
288
0
154
154
105
0
51
29
586
128
208
356
141
0
0
0
0
0
0
0
0
0
0
0
2375
3713
432
926
967
2179
4386
2447
328
3059
1382
774
132
1559
5991
1590
736
3013
1207
1744
178
946
1637
1580
278
935
1074
1160
656
527
4145
1425
19709
10546
79422
6851
9505
14329
11482
4134
5321
6846
34952
5066
6015
11611
15232
5470
8521
6682
45375
5394
482
456
396
455
145
55
63
151
65
104
127
413
399
220
226
348
381
323
128
456
16182
15523
3358
3669
6556
4735
3338
2860
12383
9407
1283
1863
6813
7706
1515
3575
10810
6067
6861
5815
Wx
4280
2241
3345
*A: Root; B: Shoot; C: Leaf; D: Panicle.
4124
2856
2032
3544
2485
2041
2208
6015
1688
3211
2507
4360
9111
2586
2402
3191
3320
Table S37: Read mapping results for PROG1-orthologous genes in the
AA-genome Oryza species.
Species | Number of reads | Length of consensus sequences* | Mapped average coverage (×)
IND | 42 | 389 | 5.31
TRJ | 25 | 389 | 5.39
RUF | 40 | 457 | 3.60
NIV | 24 | 397 | 2.84
GLA | 7 | 67 | 0.43
BAR | 7 | 95 | 0.41
GLU | 66 | 474 | 4.87
LON | 15 | 260 | 1.12
MER | 11 | 124 | 0.66
* The nucleotide length of PROG1 is 504 bp in SAT.
Table S38: Summary of branch-specific ω, dN, and dS values in PAML.
Species/branch (a)   ω (dN/dS)   Std. Error   dN        Std. Error   dS        Std. Error
1# SAT               0.47986     0.02654      0.00138   0.00012      0.00290   0.00017
2# NIV               0.45181     0.01893      0.00166   0.00009      0.00363   0.00012
3# GLA               1.99500     0.28259      0.00069   0.00013      0.00035   0.00005
4# BAR               1.62731     0.17815      0.00057   0.00004      0.00040   0.00006
5# GLU               0.40689     0.01965      0.00193   0.00018      0.00473   0.00034
6# MER               0.28136     0.00475      0.00723   0.00022      0.02569   0.00057
7# (b)               0.36116     0.02817      0.00090   0.00019      0.00239   0.00029
8# (c)               0.33086     0.01451      0.00139   0.00009      0.00420   0.00015
9# (d)               0.64212     0.25295      0.00040   0.00005      0.00111   0.00015
a: # of each branch is consistent with that labeled in the phylogenetic tree shown in Figure S27a.
b: Common ancestor of SAT/NIV.
c: Common ancestor of GLA/BAR.
d: Common ancestor of SAT/NIV/GLA/BAR.
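As a reading aid for Table S38, the following minimal Python sketch recomputes the ratio dN/dS from the tabulated dN and dS values for three branches. It is only an arithmetic illustration: the names `BRANCH_RATES` and `omega` are hypothetical, and the recomputed ratios need not equal the ω column reported by PAML, for instance because of rounding of the tabulated values or because of how the estimates were summarized (the source does not say).

```python
# dN and dS values copied from Table S38 (SAT, GLA and MER branches).
BRANCH_RATES = {
    "SAT": (0.00138, 0.00290),
    "GLA": (0.00069, 0.00035),
    "MER": (0.00723, 0.02569),
}


def omega(dn: float, ds: float) -> float:
    """Return the ratio dN/dS; a zero dS has no defined ratio."""
    if ds == 0:
        raise ValueError("dS is zero, so dN/dS is undefined")
    return dn / ds


# Recompute the ratio for each branch from the tabulated values.
ratios = {branch: omega(dn, ds) for branch, (dn, ds) in BRANCH_RATES.items()}

for branch, ratio in ratios.items():
    print(f"{branch}: dN/dS = {ratio:.5f}")
```

A ratio above 1 (as for GLA) means the nonsynonymous rate exceeds the synonymous rate on that branch, while a ratio below 1 (as for SAT and MER) means the opposite.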
Table S39: Numbers of PSGs detected in PAML analysis.
Branch                                   All PSGs                  Lineage-specific PSGs (d)
                                         P < 0.05    FDR < 0.05    P < 0.05    FDR < 0.05
All branches (a)                         522         268           -           -
SAT                                      98          64            49          44
NIV                                      153         122           74          69
GLA                                      72          40            24          23
BAR                                      97          74            23          35
GLU                                      115         69            45          38
MER                                      338         234           208         165
Ancestral Asian cultivated rice (b)      55          20            15          8
Ancestral African cultivated rice (c)    93          37            36          18
Total (non-redundancy)                   797         537           474         400
a: The genes under selection along any branch of the phylogenetic tree based on the site model;
b: The branch leading to Asian cultivated rice and its wild progenitor;
c: The branch leading to African cultivated rice and its wild progenitor;
d: These genes showed significant evidence for positive selection only in one branch.
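Table S39 reports gene counts at an FDR < 0.05 threshold without naming the correction procedure here. As an illustration only, the sketch below applies the Benjamini-Hochberg step-up adjustment, a commonly used FDR method; that this is the procedure behind the table is an assumption, and the function name and the example P-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order.

    Assumption: BH is used purely as an example of an FDR procedure;
    the source does not state which method produced the FDR column.
    """
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene P-values, not taken from the study.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
significant = [q < 0.05 for q in example_q]
```

Because each adjusted value is at least as large as its raw P-value, the number of genes passing FDR < 0.05 under such a correction cannot exceed the number passing P < 0.05.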
Table S40: Enrichment analyses for functional categories of predicted PSGs.
Type   Category     Terms                                  Gene Number     P-values
                                                           PSGs    All
(A) All non-redundant PSGs
BP     GO:0048856   anatomical structure development       45      133     1.90E-02
BP     GO:0032501   multicellular organismal process       83      278     2.30E-02
BP     GO:0009791   post-embryonic development             50      154     2.40E-02
BP     GO:0032502   developmental process                  86      291     2.50E-02
BP     GO:0007275   multicellular organismal development   82      276     2.60E-02
BP     GO:0009653   anatomical structure morphogenesis     34      97      2.70E-02
BP     GO:0050896   response to stimulus                   97      343     4.20E-02
BP     GO:0000003   reproduction                           47      151     4.80E-02
(B) Asian-rice clade
BP     GO:0009791   post-embryonic development             26      154     6.30E-04
BP     GO:0032501   multicellular organismal process       38      278     1.40E-03
BP     GO:0000003   reproduction                           24      151     2.20E-03
BP     GO:0007275   multicellular organismal development   37      276     2.30E-03
BP     GO:0032502   developmental process                  38      291     3.10E-03
BP     GO:0050896   response to stimulus                   41      343     8.80E-03
BP     GO:0006950   response to stress                     29      226     1.20E-02
BP     GO:0048856   anatomical structure development       19      133     1.70E-02
BP     GO:0022414   reproductive process                   11      63      2.00E-02
CC     GO:0030312   external encapsulating structure       7       32      2.40E-02
CC     GO:0005618   cell wall                              7       32      2.40E-02
BP     GO:0009790   embryonic development                  11      72      4.30E-02
BP     GO:0009908   flower development                     8       47      5.00E-02
(C) African-rice clade
BP     GO:0006950   response to stress                     23      226     1.10E-02
BP     GO:0050896   response to stimulus                   30      343     2.40E-02
BP     GO:0009628   response to abiotic stimulus           15      141     2.70E-02
(D) SAT-lineage
BP     GO:0009653   anatomical structure morphogenesis     9       97      1.90E-03
BP     GO:0048856   anatomical structure development       10      133     4.40E-03
BP     GO:0050896   response to stimulus                   18      343     5.70E-03
MF     GO:0003677   DNA binding                            13      231     1.20E-02
BP     GO:0009056   catabolic process                      10      157     1.30E-02
BP     GO:0016043   cellular component organization        10      163     1.70E-02
BP     GO:0007275   multicellular organismal development   14      276     2.10E-02
BP     GO:0032501   multicellular organismal process       14      278     2.20E-02
BP     GO:0006950   response to stress                     12      226     2.40E-02
BP     GO:0032502   developmental process                  14      291     3.10E-02
(E) NIV-lineage
CC     GO:0030312   external encapsulating structure       7       32      2.80E-03
CC     GO:0005618   cell wall                              7       32      2.80E-03
(F) GLA-lineage
BP     GO:0030154   cell differentiation                   4       64      2.80E-02
BP     GO:0048869   cellular developmental process         4       64      2.80E-02
BP     GO:0032502   developmental process                  10      291     2.80E-02
BP     GO:0007275   multicellular organismal development   9       276     4.90E-02
(G) BAR-lineage
CC     GO:0005634   nucleus                                17      313     2.40E-02
BP     GO:0016043   cellular component organization        10      163     4.20E-02
MF     GO:0005515   protein binding                        15      287     4.50E-02
MF: molecular function; BP: biological process; CC: cellular component. P-values represent the
one-tail Fisher's Exact Probability Values.
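The one-tail Fisher's exact P-values in Table S40 can be illustrated with the upper tail of the hypergeometric distribution. The sketch below is a minimal, assumption-laden example: the function name is hypothetical, and the total number of PSGs and the size of the annotated background used in the call are placeholder values, since the table itself does not give them. The result is therefore not expected to reproduce any tabulated P-value.

```python
from math import comb


def fisher_one_tail(hits, term_total, sample_size, population):
    """Upper-tail P(X >= hits) for X ~ Hypergeometric(population, term_total, sample_size).

    hits        : genes in the sample (e.g. PSGs) annotated with the term
    term_total  : genes in the whole background annotated with the term
    sample_size : number of genes in the sample
    population  : number of genes in the background
    """
    upper = min(term_total, sample_size)
    total = comb(population, sample_size)
    tail = sum(
        comb(term_total, x) * comb(population - term_total, sample_size - x)
        for x in range(hits, upper + 1)
    )
    return tail / total


# 45 and 133 come from the first row of Table S40 (A);
# 500 and 2000 are HYPOTHETICAL totals chosen only for illustration.
p_example = fisher_one_tail(hits=45, term_total=133, sample_size=500, population=2000)
```

Using exact integer binomial coefficients keeps the sketch dependency-free; for large gene sets a statistics library would usually be preferred for speed.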
Table S41: Chromosomal distribution enrichment of the predicted PSGs.
Category             Chromosome   Gene Number                                P-values
                                  PSGs detected   All 2,272 gene families
Non-redundant PSGs   6            30              179                        3.90E-02
SAT                  7            10              177                        2.80E-02
GLA                  9            7               111                        3.00E-03
BAR                  10           8               100                        1.90E-02
Note: Total predicted non-redundant PSGs and PSGs for each lineage are considered, and the
over-represented results are given here. P-values indicate the one-tail Fisher's Exact Probability
Values.
Table S42: List of PSGs assigned to flower development and pollination.
Orthologous
Category
MSU Loci
MSU Functional Description
Gene
Names
LOC_Os01g07790.1
Polygalacturonase, putative, expressed
LOC_Os02g14730.1
Ubiquitin carboxyl-terminal hydrolase family
PG
protein, expressed
LOC_Os02g34850.1
Histone-lysine N-methyltransferase ASHH2,
ASHH2
putative, expressed
LOC_Os02g41550.4
FAD binding domain of DNA photolyase
CRY2
domain containing protein, expressed
LOC_Os03g07580.1
OsNucAP1 - Putative nucleoporin
MOS3/SAR3
Autopeptidase homologue, expressed
Flower
LOC_Os03g12660.1
Cytochrome P450, putative, expressed
LOC_Os04g49450.1
MYB family transcription factor, putative,
expressed
development
(GO:0009908)
LHY
LOC_Os05g05310.1
Fibronectin type III domain containing
VIL1
protein, expressed
LOC_Os07g06970.1
HEN1, putative, expressed
HEN1
LOC_Os09g13610.1
PFT1, putative, expressed
PFT1
LOC_Os10g02770.1
Glycosyl hydrolases family 16, putative,
XTH
expressed
LOC_Os10g27470.1
KH domain containing protein, putative,
PEP
expressed
LOC_Os10g35110.2
Alpha-galactosidase precursor, putative,
expressed
LOC_Os10g42640.1
Expressed protein
LOC_Os01g23740.1
OsPDIL2-2
protein disulfide isomerase
PDIL
PDIL2-2, expressed
LOC_Os01g32750.1
Pollination
(GO:0009856)
TATA binding protein-associated factor,
TAF6
putative, expressed
LOC_Os03g18200.1
Heat shock protein DnaJ, putative, expressed
TMS1
LOC_Os04g34450.1
Expressed protein
SEC5
Table S43: Summary of the conserved miRNA genes in the six AA-genome Oryza species.
miRNA
SAT
NIV
GLA
BAR
GLU
MER
MIR156
MIR159
MIR160
MIR162
MIR164
MIR166
MIR167
MIR168
MIR169
MIR171
MIR172
MIR319
MIR390
MIR393
MIR394
MIR395
MIR396
MIR397
MIR398
MIR399
MIR408
MIR413
MIR415
MIR416
MIR417
MIR418
MIR426
MIR435
MIR437
MIR438
MIR440
MIR443
MIR528
MIR529
MIR530
MIR531
MIR535
MIR815
MIR816
12
6
6
2
6
13
10
2
17
9
4
2
1
2
1
25
8
2
2
11
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
3
1
3
1
5
5
3
1
3
9
8
7
5
2
1
1
1
1
32
5
5
1
1
1
1
1
1
1
1
1
5
4
7
4
1
4
3
2
2
9
7
2
2
1
1
19
5
1
5
1
1
1
1
1
1
1
2
1
2
1
1
7
4
6
5
-*
1
5
3
1
7
3
2
1
2
1
28
5
1
2
3
1
1
1
1
1
1
1
2
1
1
5
8
6
2
4
6
1
1
8
6
1
2
1
1
26
4
1
2
3
1
1
1
1
4
3
3
6
1
3
4
7
1
6
4
2
2
1
2
1
26
3
2
6
1
1
1
1
1
1
1
2
MIR827
MIR1317
MIR1318
MIR1319
MIR1423
MIR1424
MIR1425
MIR1426
MIR1427
MIR1429
MIR1430
MIR1431
MIR1433
MIR1436
MIR1437
MIR1441
MIR1442
MIR1847
MIR1848
MIR1849
MIR1851
MIR1852
MIR1853
MIR1854
MIR1855
MIR1856
MIR1857
MIR1858
MIR1859
MIR1862
MIR1863
MIR1864
MIR1865
MIR1866
MIR1867
MIR1868
MIR1869
MIR1871
MIR1872
MIR1874
MIR1875
MIR1876
MIR1878
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
5
3
1
1
1
1
1
1
1
1
1
1
1
1
3
2
2
1
1
1
1
1
2
2
1
2
2
1
2
1
7
1
1
1
2
1
1
1
3
6
1
1
1
1
1
1
1
2
2
2
4
1
1
1
1
1
1
-
5
2
1
2
1
1
1
3
1
6
1
1
1
1
1
2
3
1
6
1
1
1
1
1
1
1
1
-
3
2
1
1
2
1
3
1
1
1
7
1
2
1
1
5
2
1
1
1
1
1
1
-
2
9
1
1
6
1
1
1
4
1
1
2
1
1
3
2
1
1
1
1
1
1
1
MIR1879
MIR1880
MIR1881
MIR1883
MIR2055
MIR2091
MIR2092
MIR2093
MIR2094
MIR2096
MIR2099
MIR2101
MIR2102
MIR2103
MIR2104
MIR2105
MIR2106
MIR2118
MIR2121
MIR2123
MIR2125
MIR2275
MIR2862
MIR2863
MIR2866
MIR2867
MIR2868
MIR2869
MIR2870
MIR2871
MIR2872
MIR2874
MIR2875
MIR2876
MIR2877
MIR2878
MIR2879
MIR2880
MIR2905
MIR2920
MIR2922
MIR2923
MIR2924
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
18
2
3
1
4
1
3
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
3
1
1
1
1
1
4
6
4
11
2
4
1
3
2
3
1
1
1
2
1
1
1
1
4
1
1
1
1
1
1
8
6
8
1
13
7
1
4
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
4
1
1
1
1
1
1
1
1
1
5
5
5
15
1
3
4
1
1
1
1
1
2
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
6
3
5
1
9
1
2
4
1
2
1
1
1
2
1
1
1
1
1
4
1
1
1
5
1
1
1
1
1
1
4
5
2
1
12
1
6
2
1
3
1
1
1
1
1
1
1
1
5
1
1
-
MIR2925
MIR2926
MIR2927
MIR2928
MIR2930
MIR2931
MIR3979
MIR3982
MIR5053
MIR5056
MIR5071
MIR5073
MIR5076
MIR5077
MIR5078
MIR5081
MIR5082
MIR5144
MIR5147
MIR5150
MIR5151
MIR5152
MIR5155
MIR5157
MIR5158
MIR5159
MIR5162
MIR5337
MIR5338
MIR5339
MIR5340
MIR5484
MIR5485
MIR5486
MIR5487
MIR5488
MIR5489
MIR5490
MIR5491
MIR5492
MIR5493
MIR5494
MIR5495
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
1
1
1
1
1
1
1
1
1
1
1
6
1
1
1
1
1
1
1
1
2
1
1
5
1
1
1
1
1
1
1
1
1
2
1
4
1
2
1
1
1
1
1
1
1
1
2
1
1
2
1
1
1
2
1
1
1
1
1
4
2
1
1
1
1
1
1
1
1
1
1
1
3
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
-
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
1
1
1
1
1
1
-
MIR5496
MIR5497
MIR5499
MIR5500
MIR5501
MIR5503
MIR5504
MIR5506
MIR5508
MIR5509
MIR5510
MIR5511
MIR5513
MIR5514
MIR5515
MIR5516
MIR5517
MIR5518
MIR5519
MIR5520
MIR5521
MIR5522
MIR5523
MIR5524
MIR5525
MIR5526
MIR5528
MIR5529
MIR5530
MIR5531
MIR5534
MIR5535
MIR5536
MIR5537
MIR5539
MIR5543
MIR5544
In Total
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
366
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
4
4
1
1
1
3
3
1
1
276
1
2
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
2
3
7
271
* '-' indicates that this miRNA gene was not detected.
1
1
3
1
1
1
1
1
1
1
1
1
1
1
3
5
1
1
1
1
5
3
276
2
1
1
1
1
1
1
1
1
1
1
3
5
1
3
7
5
1
263
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
4
3
1
251
Table S44: Nucleotide substitution of non-coding RNA genes across the AA-genome Oryza species.
Genome component         Total number (a)   Selected number (b)   Mutation frequency
tRNA                     726                314                   0.60%
snoRNA                   317                128                   1.03%
snRNA                    112                41                    2.61%
Mature miRNA             614                225                   1.47%
Pre-miRNA                546                165                   1.91%
miRNA target             3802               1593                  1.61%
1 Kb upstream region     41439              5117                  4.15%
5' UTR                   27248              23460                 2.47%
3' UTR                   26960              13165                 2.21%
CDS                      167120             87572                 0.96%
Intron                   134263             63231                 2.09%
1 Kb downstream region   41439              6741                  3.46%
Intergenic region        41450              3358                  3.76%
Whole genome             372311             60488                 2.77%
a: The number of genomic components in the SAT genome.
b: Genomic components were selected within orthologous genomic regions in five AA-genomes
that share at least 90% sequence length of SAT with the other five genomes.
Table S45: Copy numbers of miRNA genes that are related to flower development across the
six AA-genome Oryza species.
miRNAs
SAT
NIV
GLA
BAR
GLU
MER
MIR156
MIR156a
MIR156b
MIR156c
MIR156d
MIR156e
MIR156f
MIR156g
MIR156h
MIR156i
MIR156j
MIR156k
MIR156l
MIR159
MIR159a
MIR159b
MIR159c
MIR159d
MIR159e
MIR159f
MIR160
MIR160a
MIR160b
MIR160c
MIR160d
MIR160e
MIR160f
MIR162
MIR162a
MIR162b
MIR164
MIR164a
MIR164b
MIR164c
MIR164d
MIR164e
MIR164f
MIR166
12
1
1
1
1
1
1
1
1
1
1
1
1
6
1
1
1
1
1
1
6
1
1
1
1
1
1
2
1
1
6
1
1
1
1
1
1
14
5
1
1
1
1
1
5
2
1
2
3
1
1
1
1
1
3
1
1
1
9
4
1
-*
1
1
1
7
1
2
1
1
2
4
1
1
1
1
1
1
4
1
1
1
1
3
4
1
1
1
1
6
1
2
1
1
1
5
1
1
1
1
1
1
1
5
8
1
1
1
1
1
1
1
1
6
1
2
1
1
1
2
1
1
4
1
1
1
1
6
3
1
1
1
6
1
1
1
1
2
1
1
3
1
1
1
2
MIR166a
MIR166b
MIR166c
MIR166d
MIR166e
MIR166f
MIR166g
MIR166h
MIR166i
MIR166j
MIR166k
MIR166l
MIR166m
MIR166n
MIR172
MIR172a
MIR172b
MIR172c
MIR172d
MIR390
MIR408
MIR444
MIR529b
MIR1438
MIR1442
MIR1847
MIR1848
MIR1884b
MIR2123
MIR2123a
MIR2123b
MIR2123c
MIR3979
MIR5144
MIR5150
MIR5535
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
1
1
1
1
1
1
1
1
1
1
1
1
1
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
2
2
4
2
2
1
-
1
1
1
2
1
1
1
1
1
1
7
2
2
3
1
-
1
1
1
1
1
2
1
1
1
6
1
1
3
3
1
1
1
1
1
1
1
1
1
1
1
1
7
2
2
1
1
1
1
-
1
1
2
1
1
1
4
6
2
4
1
1
-
* '-' shows that this miRNA gene was not detected; miRNA gene families that vary in number are
indicated in bold.