The Knuth-Morris-Pratt String Search

by Dinesh 2012-08-29 22:59:13

<h3>The Knuth-Morris-Pratt String Search</h3>
Around 1970, D.E. Knuth, J.H. Morris, and V.R. Pratt invented an algorithm that requires essentially in the order of N character comparisons only, even in the worst case [1-8]. The new algorithm is based on the observation that by starting the next pattern comparison at its beginning each time, we may be discarding valuable information gathered during previous comparisons. After a partial match of the beginning of the pattern with corresponding characters in the string, we indeed know the last part of the string, and perhaps could have precompiled some data (from the pattern) which could be used for a more rapid advance in the text string. The following example of a search for the word Hooligan illustrates the principle of the algorithm. Underlined characters are those which were compared. Note that each time two compared characters do not match, the pattern is shifted all the way, because a smaller shift could not possibly lead to a full match.
Hoola-Hoola girls like Hooligans.

Hooligan
Hooligan
Hooligan
Hooligan
Hooligan
Hooligan
......
Hooligan
Using the predicates P and Q, the KMP-algorithm is the following:
i := 0; j := 0;
WHILE (j < M) & (i < N) DO
(* Q(i-j) & P(i-j, j) *)
WHILE (j >= 0) & (s[i] # p[j]) DO j := D END ;
INC(i); INC(j)
END


This formulation is admittedly not quite complete, because it contains an unspecified shift value D. We shall return to it shortly, but first point out that the conditions Q(i-j) and P(i-j, j) are maintained as global invariants, to which we may add the relations 0
≤
i < N and 0
≤
j < M.
This suggests that we must abandon the notion that i always marks the current position of the first pattern character in the text. Rather, the alignment position of the pattern is now i-j.

If the algorithm terminates due to j = M, the term P(i-j, j) of the invariant implies P(i-M, M), that is, a match at position i-M. Otherwise it terminates with i = N, and since j < M, the invariant Q(i) implies that no match exists at all. We must now demonstrate that the algorithm never falsifies the invariant. It is easy to show that it is established at the beginning with the values i = j = 0. Let us first investigate the effect of the two statements incrementing i and j by 1. They apparently neither represent a shift of the pattern to the right, nor do they falsify Q(i-j), since the difference remains unchanged. But could they falsify P(i-j, j), the second factor of the invariant? We notice that at this point the negation of the inner while clause holds, i.e.
either j < 0 or si = pj. The latter extends the partial match and establishes P(i-j, j+1). In the former case, we postulate that P(i-j, j+1) hold as well. Hence, incrementing both i and j by 1 cannot falsify the invariant either. The only other assignment left in the algorithm is j := D. We shall simply postuate that the value D always be such that replacing j by D will maintain the invariant.
452
like
0
dislike
0
mail
flag

You must LOGIN to add comments