Pangenomics for eukaryotes
Sandra Smit, Siavash Sheikhizadeh Anari, Eric Schranz, Dick de Ridder
Motivation
Because of large genome sequencing projects, such as the 150 tomato genomes project
(Aflitos et al. 2014), many plants and animals are no longer represented by a single
reference genome, but rather by a group of related genomes. Therefore, we need
approaches to efficiently compare all these genomes and to analyze novel data with
respect to all of them. This requires a shift from reference-centric analyses to pangenome
analyses.
Research questions
How can we condense multiple annotated genomes for a (group of) species into a single
representation that can be used to study genome variation and evolution? How can we
mine a pangenome for information relevant to plant and animal breeding?
In large genome projects, the DNA of
hundreds to thousands of members of an
evolutionary clade is sequenced to study
domesticated species and their wild
relatives.
Approach
We are developing a pangenome representation that is:
Complete: the pangenome representation is a graph structure that contains both the
genome sequences and the annotation. The sequences are condensed in a compressed
de Bruijn graph (green nodes), and annotations are added as nodes pointing to the start
and end position in the sequence. The nodes in the database are objects with properties,
such as the occurrences of the sequence in the original genomes.
Scalable and efficient: to overcome memory limitations,
the constructed graph is stored in a Neo4j graph database.
The graph can be built in memory or directly in the
database. Genomes can be added to or removed from the
graph. Our solution scales to eukaryotic genomes, as the
database grows linearly. The pangenome can be queried
efficiently using indices built on top of the graph database.
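The compressed de Bruijn graph described under "Complete" can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: the function name build_compressed_dbg, the dictionary layout of the nodes, and the toy sequences are assumptions made for illustration, reverse complements are ignored, and nothing is written to Neo4j. Each node carries its sequence and its occurrences in the original genomes, mirroring the node properties mentioned above.

```python
from collections import defaultdict


def build_compressed_dbg(genomes, k):
    """Build a toy compressed de Bruijn graph from {genome_name: sequence}.

    Assumptions: forward strand only, everything held in memory.
    """
    succ = defaultdict(set)   # k-mer -> k-mers that directly follow it
    pred = defaultdict(set)   # k-mer -> k-mers that directly precede it
    occ = defaultdict(list)   # k-mer -> [(genome_name, position), ...]
    seq_starts, seq_ends = set(), set()

    for name, seq in genomes.items():
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        if not kmers:
            continue  # sequence shorter than k
        seq_starts.add(kmers[0])
        seq_ends.add(kmers[-1])
        for i, kmer in enumerate(kmers):
            occ[kmer].append((name, i))
            if i > 0:
                succ[kmers[i - 1]].add(kmer)
                pred[kmer].add(kmers[i - 1])

    def starts_node(kmer):
        # A k-mer opens a new node at a sequence start, or wherever
        # paths branch, merge, or a sequence ends just before it.
        if kmer in seq_starts or len(pred[kmer]) != 1:
            return True
        (only_pred,) = pred[kmer]
        return len(succ[only_pred]) != 1 or only_pred in seq_ends

    nodes, node_of_first_kmer, last_kmers = [], {}, []
    for kmer in list(occ):
        if not starts_node(kmer):
            continue
        sequence, cur = kmer, kmer
        # Merge the non-branching run of k-mers into one node.
        while len(succ[cur]) == 1:
            (nxt,) = succ[cur]
            if starts_node(nxt):
                break
            sequence += nxt[-1]
            cur = nxt
        node_of_first_kmer[kmer] = len(nodes)
        last_kmers.append(cur)
        nodes.append({"sequence": sequence, "occurrences": list(occ[kmer])})

    # Connect each node to the nodes that can follow it.
    for node, last in zip(nodes, last_kmers):
        node["next"] = sorted(node_of_first_kmer[s] for s in succ[last])
    return nodes


# Toy sequences differing by one SNP (illustrative, not real HIV data).
genomes = {"g1": "ATGCGTACCTGA", "g2": "ATGCGTTCCTGA"}
for node in build_compressed_dbg(genomes, k=4):
    print(node)
```

In this sketch the shared flanks of the two toy sequences collapse into single nodes that list both genomes as occurrences, while the SNP produces two alternative nodes, which is the pattern shown in panel C of the figure.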
A pangenome representation of two HIV
genomes. The graph contains multiple
types of nodes. A) This pangenome
contains two genomes (blue), each with
one sequence (purple). B) Two vpu genes
(red) are annotated and grouped by gene
name (yellow). C) A SNP between two pol
genes. D) Two annotated pol genes (red),
which start and end at different k-mer
nodes (green).
References
Aflitos et al. (2014) Exploring genetic variation in
the tomato (Solanum section Lycopersicon)
clade by whole-genome sequencing. Plant J.
Usable for comparative genomics: the inclusion of the
annotation makes the graph usable for biological analyses.
Several relationships, such as homology, orthology, or
identity, can be used to group annotations (e.g. genes). We
have implemented the retrieval of genes and genomic
regions. For example, all FRIGIDA genes in the Arabidopsis
pangenome can be retrieved and aligned quickly.
Furthermore, we are working towards analyzing synteny
and structural variation, read mapping, and visualization.
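The grouping and retrieval of annotations can be sketched in a few lines. This is only an illustration under stated assumptions: the class GeneAnnotation, the functions group_by_name and retrieve_genes, and the toy coordinates are hypothetical, and sequences are sliced from plain strings rather than reconstructed from the graph database as in the actual system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneAnnotation:
    """Hypothetical annotation node pointing at a region of one genome."""
    name: str    # gene name used for grouping
    genome: str  # name of the genome the gene is annotated on
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive


def group_by_name(annotations):
    """Group annotations that share a gene name."""
    groups = defaultdict(list)
    for ann in annotations:
        groups[ann.name].append(ann)
    return dict(groups)


def retrieve_genes(genomes, groups, name):
    """Return (genome, sequence) pairs for every gene in the named group."""
    return [
        (ann.genome, genomes[ann.genome][ann.start:ann.end])
        for ann in groups.get(name, [])
    ]


# Toy data (illustrative only, not real HIV sequences or coordinates).
genomes = {"g1": "ATGCGTACCTGA", "g2": "ATGCGTTCCTGA"}
annotations = [
    GeneAnnotation("pol", "g1", 3, 10),
    GeneAnnotation("pol", "g2", 3, 10),
    GeneAnnotation("vpu", "g1", 0, 6),
]
groups = group_by_name(annotations)
for genome, sequence in retrieve_genes(genomes, groups, "pol"):
    print(genome, sequence)
```

The retrieved sequences could then be passed to any alignment tool. Grouping by name is the simplest of the relationships listed above; homology or orthology groups would need an external assignment of genes to groups.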
Performance on:
93 yeast genomes
•  2.5 hours
•  27 GB of memory
•  1.1 GB database
19 Arabidopsis genomes
•  5.5 hours
•  61 GB of memory
•  5.3 GB database
The database grows linearly.
Conclusions
Pangenome analyses will likely play an important role in future comparative genomics. Our
proposed pangenome representation, in combination with the appropriate database
indices, could facilitate many different analyses. The storage of the pangenome in a
graph database makes it scalable to eukaryotic genomes and allows for the inclusion of
genome annotations.
Contact
Sandra Smit
Bioinformatics, Plant Sciences Group
Wageningen University
Sandra.Smit@wur.nl