6th Lecture
Dr. Maged Wafy

A common word processor facility is to search for a given word in a document. More generally, the problem is to search for occurrences of a short string in a long string.

    word:      the
    document:  Do the first then do the other one

The brute force algorithm:
◦ invented in the dawn of computer history
◦ re-invented many times, still common
Knuth & Pratt invented a better one in 1970
◦ invented independently by Morris
◦ published 1977 as “Knuth-Morris-Pratt”
Boyer & Moore found a better one before 1976
◦ found independently by Gosper
Karp & Rabin found a “better” one in 1980

The obvious algorithm is to try the word at each possible place, and compare all the characters:

    for i := 0 to n-m do                      (doc length n)
        for j := 0 to m-1 do                  (word length m)
            compare word[j] with doc[i+j]
            if not equal, exit the inner loop

The complexity is at worst O(m*n) and at best O(n).

Surprisingly, there is a faster algorithm where you compare the last characters first:

    Do the first then do the other one
    the
    compare ‘e’ with ‘ ’: fail, so move along 3 places

At another position, the document character under the last letter of the word does occur in the word, so we can only move along 2 places.

In every case where the document character is not one of the characters in the word, we can move along m places. Sometimes it is less.

Let p be the pattern string.
Let t be the target string.
Let k be the index of the character in the target string that “lies over” the first character of the pattern.

Given two strings, p and t, over the alphabet Σ, determine whether p occurs as a substring of t. That is, determine whether there exists k such that p = Substring(t, k, |p|).

function SimpleStringSearch(string p, t): integer
{Find p in t; return its location or -1 if p is not a substring of t}
    for k from 0 to Length(t) - Length(p) do
        i <- 0
        while i < Length(p) and p[i] = t[k+i] do
            i <- i+1
        if i == Length(p) then return k
    return -1
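The SimpleStringSearch pseudocode above translates almost line for line into Python. This is a minimal sketch; the function name `simple_string_search` is chosen here for illustration and is not part of the lecture.

```python
def simple_string_search(p, t):
    """Return the first index k where p occurs in t, or -1 if it does not."""
    # Try every position k at which the whole pattern still fits in the target.
    for k in range(len(t) - len(p) + 1):
        i = 0
        # Advance i while the pattern and the target agree.
        while i < len(p) and p[i] == t[k + i]:
            i += 1
        if i == len(p):  # every pattern character matched
            return k
    return -1


print(simple_string_search("ABCD", "ABCEFGABCDE"))  # prints 6
```

For everyday use Python's built-in `str.find` gives the same answer; the explicit loops are kept here only to mirror the pseudocode.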
Example: searching for the pattern ABCD in the target ABCEFGABCDE (Y = characters match, N = mismatch).

    index:   0  1  2  3  4  5  6  7  8  9  10
    t:       A  B  C  E  F  G  A  B  C  D  E
    k = 0:   A  B  C  D                         Y Y Y N
    k = 6:                     A  B  C  D       Y Y Y Y

The pattern is slid one box to the right after each failure. At k = 0 the first three characters match and p[3] = D fails against t[3] = E. At k = 1 to 5 the first comparison fails. At k = 6 all four characters match, so the search returns 6.

Worst case:
◦ Pattern string always matches completely except for last character
◦ Example: search for XXXXXXY in target string of XXXXXXXXXXXXXXXXXXXX
◦ Outer loop executed once for every character in target string
◦ Inner loop executed once for every character in pattern
◦ Θ(|p| * |t|)
◦ Okay if patterns are short, but better algorithms exist

Key idea:
◦ if the pattern fails to match, slide the pattern to the right by as many boxes as possible without permitting a match to go unnoticed

Example: the pattern XYXYZ has matched t[0..3] = XYXY, and p[4] = Z is being compared with the character c at t[4].

    index:   0  1  2  3  4
    t:       X  Y  X  Y  c  ...
    p:       X  Y  X  Y  Z
             Y  Y  Y  Y  ?

The correct motion of the pattern depends on both the location of the mismatch and the mismatching character:
◦ If c == X: move 2 boxes to the right
◦ If c == E: move 5 boxes to the right
◦ If c == Z: pattern found; the algorithm terminates
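The three cases above can be checked mechanically. The sketch below works out the shift by brute force at each step rather than from a precomputed table, so it illustrates the sliding rule without being an efficient implementation. The names `safe_shift` and `search` are assumptions made for this example.

```python
def safe_shift(p, i, c):
    """Smallest shift d that cannot skip over a match.

    Assumes p[0..i-1] has already matched the target and c is the target
    character under p[i]. A result of 0 means c matches p[i].
    """
    seen = p[:i] + c  # the i+1 target characters examined so far
    for d in range(i + 1):
        # After shifting by d, p[0..i-d] must agree with seen[d..i].
        if all(p[j] == seen[j + d] for j in range(i - d + 1)):
            return d
    return i + 1  # nothing seen so far can start a match


def search(p, t):
    """Return the first index of p in t, or -1, sliding by safe_shift."""
    k = 0  # target index lying over p[0]
    i = 0  # number of pattern characters known to match at position k
    while k <= len(t) - len(p):
        if i == len(p):
            return k
        d = safe_shift(p, i, t[k + i])
        k += d
        i += 1 - d  # characters still known to match after the shift
    return -1


for c in "XEZ":
    print(c, safe_shift("XYXYZ", 4, c))  # prints X 2, then E 5, then Z 0
```

A shift of 0 for c == Z means the final pattern character matched, which is the "pattern found" case.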
Goal: determine d, the number of boxes to the right the pattern should move; the smallest d such that:

    p[0]   = t[k+d]
    p[1]   = t[k+d+1]
    p[2]   = t[k+d+2]
    ...
    p[i-d] = t[k+i]

Note: this can be stated largely in terms of the pattern alone. The value of d depends only on:
◦ The pattern
◦ The value of i
◦ The mismatching character c (at t[k+i])

We can define a function KMPskip(p,i,c) to give the correct d:
◦ Return the smallest integer d with 0 <= d <= i such that p[i-d] == c and p[j] == p[j+d] for each 0 <= j <= i-d-1
◦ Return i+1 if no such d exists
Calculate all values of KMPskip for pattern p, store them in KMPskiparray, and do a lookup at each mismatch.

For pattern ABCD (columns: pattern position; rows: character read from the target):

             A  B  C  D
    A        0  1  2  3
    B        1  0  3  4
    C        1  2  0  4
    D        1  2  3  0
    other    1  2  3  4

For pattern XYXYZ:

             X  Y  X  Y  Z
    X        0  1  0  3  2
    Y        1  0  3  0  5
    Z        1  2  3  4  0
    other    1  2  3  4  5

function KMPSearch(string p, t): integer
{Find p in t; return its location or -1 if p is not a substring of t}
    KMPskiparray <- ComputeKMPskiparray(p)
    k <- 0
    i <- 0
    while k <= Length(t) - Length(p) do
        if i == Length(p) then return k
        d <- KMPskiparray[i, t[k+i]]
        k <- k + d
        i <- i + 1 - d
    return -1

To work out how far to skip when the last character does not match, build a table. Care is needed with repeated letters:

    word cab     char:  a  b  c  d  e  ...
                 skip:  1  *  2  3  3  ...

    word abba    char:  a  b  c  d  e  ...
                 skip:  *  1  4  4  4  ...

skip[c] = distance of last occurrence of c from end
(* marks the last letter of the word, for which no skip is looked up)

The algorithm becomes:

    i := 0
    while i <= n-m do
        if word[m-1] = doc[i+m-1] then
            for j := 0 to m-1 do
                compare word[j] with doc[i+j]
            i := i + 1
        else
            i := i + skip[doc[i+m-1]]

This is still O(n*m) in the worst case, but now it is O(n/m) in the best case, because m characters may be skipped at each stage.

The last-character algorithm can be generalised by making the skip table work for partial matches, and by adding a secondary table. The result is the Boyer-Moore algorithm.
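The simpler last-character algorithm (not the full Boyer-Moore algorithm) can be sketched in Python as follows. The names `build_skip` and `last_char_search` are chosen for this example, and the skip table is stored as a dictionary, with any character missing from it treated as a skip of m.

```python
def build_skip(word):
    """Map each character to the distance from its last occurrence to the
    end of the word. The final position is left out, so the last letter
    only gets an entry if it also occurs earlier; that entry is never
    consulted, because a matching last character takes the other branch."""
    m = len(word)
    skip = {}
    for j, ch in enumerate(word[:-1]):
        skip[ch] = m - 1 - j  # later occurrences overwrite earlier ones
    return skip


def last_char_search(word, doc):
    """Return the first index of word in doc, or -1 if it is absent."""
    m, n = len(word), len(doc)
    if m == 0:
        return 0
    skip = build_skip(word)
    i = 0
    while i <= n - m:
        last = doc[i + m - 1]  # document character under the last letter
        if last == word[m - 1]:
            if doc[i:i + m] == word:  # compare the whole word
                return i
            i += 1
        else:
            i += skip.get(last, m)  # characters not in the word skip m
    return -1


print(build_skip("cab"))  # prints {'c': 2, 'a': 1}
print(last_char_search("the", "Do the first then do the other one"))  # prints 3
```

For the word abba the table gives a skip of 1 for b, matching the repeated-letter case in the table above.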
It is possible to show that the complexity of the Boyer-Moore algorithm is guaranteed to be only O(n) in the worst case, as well as O(n/m) in the best case. It has generally been regarded as too difficult to understand, and so has not been used much.

Karp & Rabin found an algorithm which is:
◦ almost as fast as Boyer-Moore
◦ simple enough to understand easily
◦ adaptable to 2-dimensional searches for patterns in pictures

Go back to the brute force idea, but now use a single number to represent the word you are searching for, and a single number for the current portion of the document you are comparing against.

Suppose we are searching for 4-letter words. Then the whole (English) word fits in one (computer) word w of 4 bytes. If the current 4 bytes of the document are also in one word d, a single comparison can match the two in one step. To move along the document, shift d and add in the next character.

For longer words, use hashing. The characters of the word and the document are combined into single hash numbers wh and dh. The hash number dh can be updated by doing a suitable sum and adding in the code for the next character.
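One common way to realise the hashing idea is a polynomial rolling hash. The sketch below keeps the names wh and dh from the text, but the function name `rabin_karp`, the base 256 and the modulus 1,000,003 are assumptions made for this example, not values given in the lecture.

```python
def rabin_karp(word, doc, base=256, mod=1_000_003):
    """Return the first index of word in doc, or -1, using a rolling hash."""
    m, n = len(word), len(doc)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)  # weight of the leading character
    wh = dh = 0
    for j in range(m):
        wh = (wh * base + ord(word[j])) % mod  # hash of the word
        dh = (dh * base + ord(doc[j])) % mod   # hash of the first window
    for i in range(n - m + 1):
        # Equal hashes can be a collision, so confirm with a real comparison.
        if wh == dh and doc[i:i + m] == word:
            return i
        if i < n - m:
            # Remove doc[i], shift by one place, add in the next character.
            dh = ((dh - ord(doc[i]) * high) * base + ord(doc[i + m])) % mod
    return -1


doc = "Do the first then do the other one"
print(rabin_karp("the", doc))  # prints 3
```

The explicit string comparison after a hash match is what keeps the result correct when two different strings happen to share a hash value.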