The Knuth-morris-pratt Algorithm

The Knuth-Morris-Pratt algorithm: We now present a linear-time string-matching algorithm due to Knuth, Morris, and Pratt. Their algorithm avoids the computation of the transition function δ altogether, and its matching time is Θ(n) using just an auxiliary function π[1 ‥ m] precomputed from the pattern in time Θ(m). The array π allows the transition function δ to be computed efficiently (in an amortized sense) "on the fly" as needed. Roughly speaking, for any state q = 0, 1, . . . , m and any character a ∈ Σ, the value π[q] contains the information that is independent of a and is needed to compute δ(q, a). (This remark will be clarified shortly.) Since the array π has only m entries, whereas δ has Θ(m |Σ|) entries, we save a factor of |Σ| in the preprocessing time by computing π rather than δ.

The prefix function for a pattern: The prefix function π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid testing useless shifts in the naive pattern-matching algorithm or to avoid the precomputation of δ for a string-matching automaton.

Consider the operation of the naive string matcher. Figure 32.10(a) shows a particular shift s of a template containing the pattern P = ababaca against a text T. For this example, q = 5 of the characters have matched successfully, but the 6th pattern character fails to match the corresponding text character. The information that q characters have matched successfully determines the corresponding text characters. Knowing these q text characters allows us to determine immediately that certain shifts are invalid. In the example of the figure, the shift s 1 is necessarily invalid, since the first pattern character (a) would be aligned with a text character that is known to match with the second pattern character (b). The shift s′ = s 2 shown in part (b) of the figure, however, aligns the first three pattern characters with three text characters that must necessarily match. In general, it is useful to know the answer to the following question:

Figure 32.10: The prefix function π. (a) The pattern P = ababaca is aligned with a text T so that the first q = 5 characters match. Matching characters, shown shaded, are connected by vertical lines. (b) Using only our knowledge of the 5 matched characters, we can deduce that a shift of s 1 is invalid, but that a shift of s′ = s 2 is consistent with everything we know about the text and therefore is potentially valid. (c) The useful information for such deductions can be precomputed by comparing the pattern with itself. Here, we see that the longest prefix of P that is also a proper suffix of P₅ is P₃. This information is precomputed and represented in the array π, so that π[5] = 3. Given that q characters have matched successfully at shift s, the next potentially valid shift is at s′ = s (q - π[q]).

Given that pattern characters P[1 ‥ q] match text characters T[s 1 ‥ s q], what is the least shift s′ > s such that
where s′ k = s q?

Such a shift s′ is the first shift greater than s that is not necessarily invalid due to our knowledge of T[s 1 ‥ s q]. In the best case, we have that s′ = s q, and shifts s 1, s 2, . . . , s q - 1 are all immediately ruled out. In any case, at the new shift s′ we don't need to compare the first k characters of P with the corresponding characters of T, since we are guaranteed that they match by equation (32.5).

The necessary information can be precomputed by comparing the pattern against itself, as illustrated in Figure 32.10(c). Since T[s′ 1 ‥ s′ k] is part of the known portion of the text, it is a suffix of the string P_q. Equation (32.5) can therefore be interpreted as asking for the largest k < q such that P_k ⊐ P_q. Then, s′ = s (q - k) is the next potentially valid shift. It turns out to be convenient to store the number k of matching characters at the new shift s′, rather than storing, say, s′ - s. This information can be used to speed up both the naive string-matching algorithm and the finite-automaton matcher.

We formalize the precomputation required as follows. Given a pattern P[1 ‥ m], the prefix function for the pattern P is the function π : {1, 2, . . . , m} → {0, 1, . . . , m - 1} such that

π[q] = max {k : k < q and P_k ⊐ P_q}.

That is, π[q] is the length of the longest prefix of P that is a proper suffix of P_q. As another example, Figure 32.11(a) gives the complete prefix function π for the pattern ababababca.

Figure 32.11: An illustration of Lemma 32.5 for the pattern P = ababababca and q = 8. (a) The π function for the given pattern. Since π[8] = 6, π[6] = 4, π[4] = 2, and π[2] = 0, by iterating π we obtain π*[8] = {6, 4, 2, 0}. (b) We slide the template containing the pattern P to the right and note when some prefix P_k of P matches up with some proper suffix of P₈; this happens for k = 6, 4, 2, and 0. In the figure, the first row gives P, and the dotted vertical line is drawn just after P₈. Successive rows show all the shifts of P that cause some prefix P_k of P to match some suffix of P₈. Successfully matched characters are shown shaded. Vertical lines connect aligned matching characters. Thus, {k : k < q and P_k ⊐ P_q} = {6, 4,2, 0}. The lemma claims that π*[q] = {k : k < q and P_k ⊐ P_q} for all q.