screen
Transcription
screen
Alignments BLAST, BLAT Genome vs. gene Genome Gene Built of... DNA DNA Describes Organism Protein Stored as... Circular/ linear Single molecule, or Part of genome a few of them Both (depending on the species) Linear “Life cycle” DNA-DNA-DNA-... DNA-RNA-protein Size 0.5Mb*) ... 3500Mb 100b ... 100000b Amount per cell 1 500 .. 50000 Information content 2% ... 100% ~30% The amount of genetic information in organisms Name Mycoplasma genitalium Escherichia coli Saccharomyces cerevisiae Drosophila melanogaster Caenorhabtitis elegans Homo sapiens Zea mays Genome size (Mb) # genes 0.5 4.5 470 4400 12 5500 120 18000 97 3000 2500 22000 23457 50000 The amount of genetic information in organisms http://www.cbs.dtu.dk/staff/dave/roanoke/genetics980313.html Largest genome: amoeba Chaos chaos http://www.lawrence.edu/dept/biology/animal/ (200x human genome) Sequence searching challenges Exponential growth of databases Sequence searching – definition Task: – Query: short, new sequence (~1000 letters) – Database (searching space): very many sequences – Goal: find seqs homologous to the query Sequence searching – definition We want: – fast tool – primarily a filter: most sequences will be unrelated to the query – fine-tune the alignment later Database Search Algorithms: Sensitivity, Selectivity • True Positive (TP) – a homology detected (positive) correctly (true) Signal Yes No Yes No Detected Name Yes True Positive No True Negative No False Negative Yes False Positive Database Search Algorithms: Sensitivity, Selectivity • Sensitivity =TP/(TP+FN) • Selectivity =TN/(TN+FP) – Sensitivity Selectivity Courtesy of Gary Benson (ISSCB 2003) What is BLAST Basic Local Alignment Search Tool Bad news: it is only a heuristics – Heuristics: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees. Perkins, DN (1981) The Mind's Best Work Basic idea: – High scoring segments have well conserved (almost identical) part – As well conserved part are identified, extend it to the real alignment - s e q s e q u e What means well conserved for BLAST? BLAST works with k-words (words of length k) – k is a parameter – different for DNA (>10) and proteins (2..4) word w1 is T-similar to w2 if the sum of pair scores is at least T (e.g. T=12) Similar 3-words W1: R K P W2: R R P Score: 9 –1 7 ∑ = 15 BLAST algorithm 3 basic steps 1)Preprocess 2)Scan 3)Extend 1)Preprocess the query: extract all the k-words 2)Scan for T-similar matches in database 3)Extend them to alignments BLAST, Step 1: Preprocess the query 1)Preprocess 2)Scan 3)Extend Take the query (e.g. LVNRKPVVP) Chop it into overlapping k-words (k=3 in this case) Query: LVNRKPVVP Word1: LVN Word2: VNR Word3: NRK … For each word find all similar words (scoring at least T) E.g. for RKP the following 3-words are similar: QKP KKP RQP REP RRP RKP Finite state machine 1)Preprocess 2)Scan 3)Extend abstract machine constant amount of memory (states) used in computation and languages recognizes regular expressions – cp dmt*.pdf /home/john AC*T|GGC BLAST, Step 2: Find ¨exact¨ matches with scanning 1)Preprocess 2)Scan 3)Extend Use all the T-similar k-words to build the Finite State Machine Scan for exact matches QKP KKP RQP movement REP RRP RKP ... ...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA... 1)Preprocess 2)Scan 3)Extend BLAST, Step 3: Extending ¨exact¨ matches Having the list of exact matches we extend alignment in both directions Query: L T-similar: Subject: G Score: -3 V N V C 4 -3 R R R 5 K R R 2 P P P 7 V V P L K C 1 -2 -3 …till the sum of scores drops below some level X (e.g. X=-100) from the best known - what with gaps? Gapped BLAST (now standard) 1)Preprocess 2)Scan 3)Extend gapped local alignments are computed: much, much, much slower therefore: modified “Hit criteria” Hit criteria 1)Preprocess 2)Scan 3)Extend Extends the alignment only if there are close two hits on the same diagonal – sensitivity would drop without lowering T – reduces extensions (90% time is spend on extensions) Gapped local alignments are computed – increased sensitivity allows us raise T – raising T speeds up the search cl os eh it, sa m ed ia g query pos dbpos Gapped BLAST v BLAST We end up with – same speed – gapped alignments! – much higher sensitivity BLAST flavours blastp: protein query, protein db blastn: DNA query, DNA db blastx: DNA query, protein db – in all reading frames. Used to find potential translation products of an unknown nucleotide sequence. tblastn: protein query, DNA db – database dynamically translated in all reading frames. tblastx: DNA query, DNA db – all translations of query against all translations of db PSI-BLAST Position-Specific Iterated BLAST A profile is derived from the result of the first search Database is searched against the profile (instead of a sequence) Up to 3 iterations Profile Profile is generalized form of sequence probabilities instead of a letter A C D . . . W Y 0.5 0 0 . . . 0 0.5 0.3 0.1 0 . . . 0.3 0.3 0.2 0.0 0.1 . . . 0.4 0.3 ... ... ... ... ... ... ... ... 0 0.5 0.2 . . . 0.1 0.2 Score of the profile scorep , i , A =∑B p[i ,B]scoreblosum62 A ,B profile position letter Constructing a profile Take significant BLAST results Make an alignment Assign weights to sequences Construct the profile A C D . . . W Y 0.5 0 0 . . . 0 0.5 0.3 0.1 0 . . . 0.3 0.3 0.2 0.0 0.1 . . . 0.4 0.3 ... ... ... ... ... ... ... ... 0 0.5 0.2 . . . 0.1 0.2 BLAT The Blast-Like Alignment Tool Large-scale genome comparison: – query can be large Preprocessing phase: – BLAST: query – BLAT: db BLAT, Step 1: Preprocess the database 1)Preprocess 2)Scan 3)Extend Index the database with k-words – k=8..16 for nucleotides – k=3..5 for proteins For each k-word store in which sequences it appears k-word: RKP Hashed DB: QKP: HUgn0151194, Gene14, IG0, ... KKP: haemoglobin, Gene134, IG_30, ... RQP: HSPHOSR1, GeneA22... RKP: galactosyltransferase, IG_1... REP: haemoglobin, Gene134, IG_30, ... RRP: Z17368, Creatine kinase, ... ... Hashing – associative arrays 1)Preprocess 2)Scan 3)Extend Indexing with the object Hash function: hash: small (fits in memory) possible objects - large – Objects should be “well spread” x Hashing - examples 1)Preprocess 2)Scan 3)Extend T9 Predictive Text in mobile phones – “hello” in Multitap: 4, 4, 3, 3, 5, 5, 5, (pause) 5, 5, 5, 6, 6, 6 – “hello” in T9: 4, 3, 5, 5, 6 – Collisions: 4, 6: “in”, “go” BLAT, Step 1: Index to find exact matches with hashing 1)Preprocess 2)Scan 3)Extend The database is preprocessed only once! (independent from the query) k-word: RKP Hashed DB: QKP: HUgn0151194, Gene14, IG0, ... KKP: haemoglobin, Gene134, IG_30, ... RQP: HSPHOSR1, GeneA22... RKP: galactosyltransferase, IG_1... REP: haemoglobin, Gene134, IG_30, ... RRP: Z17368, Creatine kinase, ... ... BLAT, Step 2: Hit criteria 1)Preprocess 2)Scan 3)Extend In a constant time we can get the sequences with a certain k-word relaxing hit definition -> improve sensitivity – allow imperfect hits • costly, huge hash grows a few times! ➔ shorten k (would lead to FP), but expect two hits (see BLAST) BLAT, Step 3: Identifying homologous regions 1)Preprocess 2)Scan 3)Extend Exclude common k-words For all k-words from query – find out the position in db For results (qpos, dbpos): – split into buckets (64kbp) – sort on the diagonal (diag=qposdbpos) BLAT, Step 3: Identifying homologous regions 1)Preprocess 2)Scan 3)Extend continued... – from diagonally close hits (gap limit) create “pre-clusters” – sort each “pre-cluster” on dbpos – create clusters from close hits – run Local Alignment for each cluster Seeds – improving sensitivity More general form of k-word is a seed The seed CT.GT.AT. gives “hits” with both sequences ...CTCGTTATA... ...CTAGTAATG... How to detect homology? Take the score of an maximal local alignment can it be obtained by chance? – any score can be obtained from comparing (long enough) random sequences What is a “chance”? Extracting local alignments from random sequences P-value (e.g. =0.01) – The probability of obtaining the result by pure chance – An alignment giving lower P-value than set by user is considered a hit. Best Local Alignments by chance Create random seqs, each 1000aa long Find the max local align Repeat 7000 6000 # Alignments 5000 4000 3000 2000 1000 0 25 30 35 40 45 50 Score 55 60 65 70 75 The Statistics of local alignment http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html Subst matrix must guarantee • E(score(a,b)) < 0 for random a, b Analytical solution – sum of i.i.d. variables -> normal distribution – max of i.i.d. -> extreme value distribution (EVD) Expected number of aligns E-value: the expected number of alignments scoring >= S −S E=K m n e 2x size of seq -> 2x number aligns 2x S -> E drops exponentially −S E=K m n e E-value depends on n, m Example – For comparing seqA with seqB: S=88 -> E = 0.001 – For comparing seqA with 1000 seqs: score 88 -> E=1 Important for db searching: – n – size of query, m - size of db Deriving K, L −S E=Kmne The above eq is theoretical result for gapless (g=inf) alignment – K, L can be derived from the subst table For gapped case – it seems that the equation holds – we can derive K, L from “experiments” 7000 6000 5000 # Alignments 4000 3000 2000 1000 0 25 30 35 40 45 50 Score 55 60 65 70 75 Bit score S' −S E=Kmne Score S depends on the substitution table What if we want table-independent score? E=mn2−S' where S'= S−ln K ln 2 Why does the BLAST work? Relevant riddle Are there at least 2 people in Amsterdam with the same number of hairs? – At most 500 000 hairs on each head – 700 000 people living in Amsterdam Why does the BLAST work? Pigeons pigeonhole principle: having 9 boxes and 10 pigeons, there is at least one box with more than 1 pigeon – n=9, k=7 case: http://en.wikipedia.org/wiki/Pigeonhole_principle Why does the BLAST work? Average case pigeonhole principle describes the worst case! On average we'll expect two pigeons in the same box much earlier Birthday paradox: among 23 people, probability that they have the same birthday is > 0.5 – note: 365 boxes and only 23 pigeons! Birthday paradox Why does the BLA[S]T work? Forget the T-similar words, now use only identities 2 sequences, 100 nucleotides each: – What's the minimal sequence identity for which there's a string of 3 consecutive identities? 0 identities, 100 mismatches: 67 identities, 33 mismatches: 68st ? but if seqs are 50% id, we'll detect it with prob. 99% 28% id -> we'll detect it with prob. 50% – how is it calculated? Expected sensitivity We assume that letters are independent I – identity between seqs, for human-mouse – 86% for DNA, – 89% for proteins pword id =Ik Expected sensitivity Q - query size number of non-overlapping words R= Q k prob. of a “hit” pdetect=1−1−pword id R Expected specificity How many matches by chance (C)? G – genome size G 1 C=Q−k1∗ ∗ k 4 k For h-m, to get 99% sensitivity we have to set k=7, and for Q=1000 C ~= 25,000,000 – 7h assuming 1/1000 per alignment