Lecture: Topic 3: Genome assembly
Transcription
Lecture: Topic 3: Genome assembly
Genome Assembly Sample Prepara1on Fragments Sequencing Reads Assembly ACGTAGAATACGTAGAA ACGTAGAATCGACCATG GGGACGTAGAATACGAC ACGTAGAATACGTAGAAACAGATTAGAGAG… Con1gs Paired-‐End Reads Genomic segment Genome fragments Get two reads from each segment ~100 bp ~100 bp Read Coverage C • Length of genomic segment: L • Number of reads: n • Length of each read: l Coverage C = n l / L Fragment Assembly • Cover region with ~7-‐fold redundancy • Overlap reads and extend to reconstruct the original genomic region Challenges in Fragment Assembly • Repeats: a major problem for fragment assembly • > 50% of human genome are repeats: -‐ over 1 million Alu repeats (about 300 bp) -‐ about 200,000 LINE repeats (1000 bp and longer) Repeat Repeat Green and blue fragments are interchangeable when assembling repe11ve DNA Repeat Triazzle: A Fun Example The puzzle looks simple BUT there are repeats!!! The repeats make it very difficult. Try it – only $7.99 at www.triazzle.com Repeat Types • Low-‐Complexity DNA (e.g. ATATATATACATA…) • Microsatellite repeats (a1…ak)N where k ~ 3-‐6 (e.g. CAGCAGTAGCAGCACCAG) • Transposons/retrotransposons – SINE Short Interspersed Nuclear Elements (e.g., Alu: ~300 bp long, 106 copies) – LINE Long Interspersed Nuclear Elements ~500 -‐ 5,000 bp long, 200,000 copies • Gene Families genes duplicate & then diverge • Segmental duplicaCons ~very long, very similar copies Fragment Assembly • ComputaConal Challenge: assemble individual short fragments (reads) into a single genomic sequence (“superstring”) • Un1l late 1990s the shotgun fragment assembly of human genome was viewed as intractable problem Shortest Superstring Problem • Problem: Given a set of strings, find a shortest string that contains all of them • Input: Strings s1, s2,…., sn • Output: A string s that contains all strings s1, s2,…., sn as substrings, such that the length of s is minimized • Complexity: NP-‐complete • Note: this formula1on does not take into account sequencing errors Whole Genome Shotgun Sequencing Genome Genome amplified and sliced into smaller fragments (>=600bp) Build consensus sequence from overlap Overlap-‐Layout-‐Consensus Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA Overlap: find poten1ally overlapping reads Layout: merge reads into con1gs and con1gs into supercon1gs Consensus: derive the DNA sequence and correct read errors ..ACGATTACAATAGGTT.. Overlap • Each read is compared to that of every other read, in both the forward and reverse complement orienta1ons. • As such, the overlap computa1on step is a very 1me intensive step – especially if the set of reads is very large. • For example, the whole genome shotgun assembly of Drosophila had about 3 x 10^6 reads of 500 bases, requiring roughly 10^13 comparisons (Deonier et al., 2010). • Even on today's computers, running that many comparisons is imprac1cal, so seeded algorithm are used Overlapping Reads • • • Sort all k-‐mers in reads Find pairs of reads sharing a k-‐mer Extend to full alignment – throw away if not >95% similar TACA TAGATTACACAGATTAC T GA || ||||||||||||||||| | || TAGT TAGATTACACAGATTAC TAGA Finding Overlapping Reads Create local mul1ple alignments from the overlapping reads. TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA Finding Overlapping Reads Correct errors using mul1ple alignment. • Find loca1ons where there is a devia1on in which 1% of the data diverge from the rest. • Make those posi1ons agree with the rest. TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA Build the Overlap Graph • Overlap graph: the nodes represent actual reads, and edges represent overlaps between these reads. • Thus, the genome assembly becomes equivalent to finding a path through the graph that visits each node exactly once (i.e., a Hamiltonian path). 17 An overlap graph. Nodes are complete reads and edges connect reads that overlap. Note that in an actual graph, reads and overlaps would be much larger. 18 Layout • Finding a Hamiltonian path through the overlap graph is not a trivial task. • In order to decrease the size of the graph, the OLC assembly graph is simplified in the layout stage, where segments of the graph are compressed into con1gs • Thus, we have to find a manner to decrease the complexity of the graph Graph Reduc1on • A con1g would be a subgraph, or a group of nodes, with many connec1ons among each other, as they all overlap with each other and refer to the same sequence (A and B). • Once a subgraph has been iden1fied, these nodes and edges are compressed into one node, or a con1g, thereby simplifying the graph (C) 20 21 Separa1ng Con1gs • There are two classes of con1gs, unique conCgs and repeat conCgs. • Unique con1gs are composed of reads that can be unambiguously assembled. • Repeat con1gs are con1gs with an abnormally high read coverage or connected to an abnormally large number of other con1gs 22 Separa1ng Con1gs Normal density Too dense: Overcollapsed? Inconsistent links: Overcollapsed? Crea1ng Scaffolds • Unique con1gs are joined into larger sequences, called scaffolds. • The most common way to piece con1gs into scaffolds is through mate-‐pair informaCon. • With mate-‐pair informa1on, assemblers can iden1fy how far reads and unique con1gs should be apart from each other. – e.g. if a 2kb fragment of a genome were sequenced 100bp on each end, then we know these reads and the unique con1gs they are in should be roughly 2kb apart. 24 Link Con1gs into Scaffolds Find all links between unique con1gs Connect con1gs incrementally, if ≥ 2 links Link Con1gs into Scaffolds Fill gaps in scaffolds with paths of overcollapsed contigs Consensus • A consensus sequence is derived from a profile of the assembled fragments • A sufficient number of reads is required to ensure a sta1s1cally significant consensus • Reading errors are corrected Derive Consensus Sequence TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive mul1ple alignment from pairwise read alignments Derive each consensus base by weighted voting Problems! • The main problem with this approach is that it is very, very, very slow and will only work on small genomes or low coverage. • Not commonly used for complete assembly, however, some sopware tools s1ll use this approach: – Celera: genome assembler for 454, PacBio, and Illumina data – LOCAS: Resequencing genomes. – HapAssembler: for sequencing highly polymorphic genomes Problems! Unfortunately, overlap-‐layout-‐consensus approach will not work for NGS data or significantly large genomes: – There is too much data. Calcula1ng the overlap for each pair of reads would take way to much 1me. – There has to be a new method for fragment assembly. De Bruijn Graph Approach to Assembly De Bruijn Graph for Assembly • Introduced in 1989. Pevzner. J Biomol Struct Dyn (1989) 7:63—73. Iduly & Waterman. J. Comput Biol (1995) 2:291—306. • Adapted for next genera1on sequencing data. Euler-‐SR: Chaisson & Pevzner. Genome Res. (2008) 18:324—30. Velvet: Zerbino & Birney. Genome Res. (2008) 18:821—29. ALLPATHS: Butler et al. Genome Res. (2008) 18(5):810—20. ABySS: Simpson et al. Genome Res (2009) 19:1117—1123. De Bruijn Graph Construc1on I. Choose a value of 𝑘. II. For each 𝑘-‐mer that exists in any sequence create an edge with one vertex labeled as the prefix and one vertex labeled as the suffix. III. Glue all ver1ces that have the same label. (Pevzner, Tang & Tesler, 2004) De Bruijn Graph Construc1on GTCTATTCGCTAATTCACTA ATTC ATTCG ATTC ATTCA TTCG TTCA (Pevzner, Tang & Tesler, 2004) De Bruijn Graph Construc1on GTCTATTCGCTAATTCACTA ATTC ATTCG ATTC ATTCA TTCG TTCA (Pevzner, Tang & Tesler, 2004) De Bruijn Graph Construc1on GTCTATTCGCTAATTCACTA TTCG ATTCG ATTC ATTCA TTCA (Pevzner, Tang & Tesler, 2004) De Bruijn Graph of a Genome Example Example GGenome: enome: A ABCDEFGHICDEFGKL BCDEFGHICDEFGKL 𝑘-‐mers ABCD BCDE CDEF DEFG EFGH GHIC HICD ICDE EFGK FGKL (𝑘 −1)-‐mers ABC BCD CDE DEF EFG GHI HIC ICD FGK GKL De Bruijn Graph of a Genome Example Example GGenome: enome: A ABCDEFGHICDEFGKL BCDEFGHICDEFGKL GHI HIC ICD ABC BCD FGH CDE DEF EFG FGK GKL De Bruijn Graph of a Genome Bulges (undirected cycles) and whirls (directed cycles) Example Genome: ABCDEFGHICDEFGKL occur because of sequencing errors or repeats in the genome. GHI HIC ICD ABC BCD FGH CDE DEF EFG FGK GKL De Bruijn Graph of a Genome Example Example GGenome: enome: A ABCDEFGHICDEFGKL BCDEFGHICDEFGKL GHI HIC 2 ICD ABC BCD 1 FGH CDE DEF EFG FGK 3 GKL Typical De Bruijn Graph However, this is over a billion ver1ces (for a very small bacteria genome). … De Bruijn Graph of a Genome Example Example GGenome: enome: A ABCDEFGHICDEFGKL BCDEFGHICDEFGKL GHI HIC 2 ICD ABC BCD 1 FGH CDE DEF EFG FGK 3 GKL De Bruijn Graph of a Genome Example Example GGenome: enome: A ABCDEFGHICDEFGKL BCDEFGHICDEFGKL ABC BCD CDE DEF EFG FGK GKL De Bruijn Graph of a Genome Example GEenome: AG BCDEFGHICDEFGKL Resul1ng rroneous enome: ABCDEFGKL ABC BCD 1 CDE DEF EFG FGK GKL Coverage Standard (Mul1-‐cell) Data (Chitsaz et al., 2011) Coverage Coverage Single-‐cell Data (Chitsaz et al., 2011) Detangling the de Bruijn Graph Even using mate-‐pair informa1on, the de Bruijn graph is highly tangled. There are the following op1ons for detangling the de Bruijn graph: 1. Error correc1on of reads. 2. Bulge and whirl removal. 47 Detangling the de Bruijn Graph Even using mate-‐pair informa1on, the de Bruijn graph is highly tangled. There are the following op1ons for detangling the de Bruijn graph: PROBLEM! 1. Error correc1on of reads Both inevitably end-‐up causing 2. Bulge and whirl removal errors rather than correcCng then. 48 What has been sequenced? • We’ve sequenced a number of genomes but several genomes remain difficult • Plant genomes are very hard because they are extremely long, contain huge repeat regions, and are polyploid • Note: we do not dis1nguish between genotypes… that is a separate problem Organism Type Genome Size No. of predicted genes Homo Sapiens Takifugu rubripes Human Puffer fish 3.2Gb 390Mb 20,251 22-‐29,000 Oryza sa1va Anopheles gambiae Rice Mosquito 420Mb 278Mb 32-‐50,000 13,700 Saccharomyces Baker’s cerevisiae yeast 12.1Mb 6,200 Cucumis sa1vus cucumber 367Mb 27,000 • kb (= kbp) = kilo base pairs = 1,000 bp • Mb = mega base pairs = 1,000,000 bp • Gb = giga base pairs = 1,000,000,000 bp. Assembly Evalua1on • How can we tell the difference between a good assembly and a bad assembly? – Answer: N50 staCsCc, which is a metric of the length of a set of sequences, with greater weight given to longer sequences. – Given a set of sequences of varying lengths, the N50 length is defined as the length N for which half of all bases in the sequences are in a sequence of length L < N. – There are some contradictory in the defini1on(s) of the N50 value. Other Evalua1ons • Number of inser1ons, dele1ons, and subs1tu1on errors in an assembly • misassembly of con1gs (chimeric indels) >=500 bp 52 Other Evalua1ons • Number of inser1ons, dele1ons, and subs1tu1on errors in an assembly • misassembly of con1gs (chimeric indels) >=500 bp 53