Design Analysis of Algorithm

The Rabin-karp Algorithm

The Rabin-Karp algorithm: Rabin and Karp have proposed a string-matching algorithm that performs well in practice and that also generalizes to other algorithms for related problems, such as two-dimensional pattern matching. The Rabin-Karp algorithm uses Θ(m) preprocessing time, and its worst-case running time is Θ((n - m 1)m). Based on certain assumptions, however, its average-case running time is better.

This algorithm makes use of elementary number-theoretic notions such as the equivalence of two numbers modulo a third number. You may want to refer to Section 31.1 for the relevant definitions.

For expository purposes, let us assume that Σ = {0, 1, 2, . . . , 9}, so that each character is a decimal digit. (In the general case, we can assume that each character is a digit in radix-d notation, where d = |Σ|.) We can then view a string of k consecutive characters as representing a length-k decimal number. The character string 31415 thus corresponds to the decimal number 31,415. Given the dual interpretation of the input characters as both graphical symbols and digits, we find it convenient in this section to denote them as we would digits, in our standard text font.

Given a pattern P[1 m], we let p denote its corresponding decimal value. In a similar manner, given a text T [1 n], we let ts denote the decimal value of the length-m substring T[s 1 s m], for s = 0, 1, . . . , n - m. Certainly, ts = p if and only if T [s 1 s m] = P[1 m]; thus, s is a valid shift if and only if ts = p. If we could compute p in time Θ(m) and all the ts values in a total of Θ(n - m 1) time,[1] then we could determine all valid shifts s in time Θ(m) Θ(n - m 1) = Θ(n) by comparing p with each of the ts's. (For the moment, let's not worry about the possibility that p and the ts's might be very large numbers.)

We can compute p in time Θ(m) using Horner's rule:

p = P[m] 10 (P[m - 1] 10(P[m - 2] · · · 10(P[2] 10P[1]) )).

The value t0 can be similarly computed from T [1 m] in time Θ(m).

To compute the remaining values t1, t2, . . . , tn-m in time Θ(n - m), it suffices to observe that ts 1 can be computed from ts in constant time, since

For example, if m = 5 and ts = 31415, then we wish to remove the high-order digit T [s 1] = 3 and bring in the new low-order digit (suppose it is T [s 5 1] = 2) to obtain

ts 1

=

10(31415 - 10000 · 3) 2

 

=

14152.

Subtracting 10m-1 T[s 1] removes the high-order digit from ts, multiplying the result by 10 shifts the number left one position, and adding T [s m 1] brings in the appropriate low-order digit. If the constant 10m-1 is precomputed (which can be done in time O(lg m) using the techniques of Elementary number-theoretic notions, although for this application a straightforward O(m)-time method is quite adequate), then each execution of equation (32.1) takes a constant number of arithmetic operations. Thus, we can compute p in time Θ(m) and compute t0, t1, . . . , tn-m in time Θ(n - m 1), and we can find all occurrences of the pattern P[1 m] in the text T [1 n] with Θ(m) preprocessing time and Θ(n - m 1) matching time.

The only difficulty with this procedure is that p and ts may be too large to work with conveniently. If P contains m characters, then assuming that each arithmetic operation on p (which is m digits long) takes "constant time" is unreasonable. Fortunately, there is a simple cure for this problem, as shown in Figure 32.5: compute p and the ts's modulo a suitable modulus q. Since the computation of p, t0, and the recurrence (32.1) can all be performed modulo q, we see that we can compute p modulo q in Θ(m) time and all the ts's modulo q in Θ(n - m 1) time. The modulus q is typically chosen as a prime such that 10q just fits within one computer word, which allows all the necessary computations to be performed with single-precision arithmetic. In general, with a d-ary alphabet {0, 1, . . . , d - 1}, we choose q so that dq fits within a computer word and adjust the recurrence equation (32.1) to work modulo q, so that it becomes

where h = dm-1 (mod q) is the value of the digit "1" in the high-order position of an m-digit text window.