Design Analysis of Algorithm

String Matching With Finite Automata

In order to specify the string-matching automaton corresponding to a given pattern P[1 m], we first define an auxiliary function σ , called the suffix function corresponding to P. The function σ is a mapping from Σ* to {0, 1, . . . , m} such that σ(x) is the length of the longest prefix of P that is a suffix of x:

σ(x) = max {k : Pk x}.

The suffix function σ is well defined since the empty string P0 = ε is a suffix of every string. As examples, for the pattern P = ab, we have σ(ε) = 0, σ(ccaca) = 1, and σ(ccab) = 2. For a pattern P of length m, we have σ(x) = m if and only if P x. It follows from the definition of the suffix function that if x y, then σ(x) σ(y).

We define the string-matching automaton that corresponds to a given pattern P[1 m] as follows.

  • The state set Q is {0, 1, . . . , m}. The start state q0 is state 0, and state m is the only accepting state.

  • The transition function δ is defined by the following equation, for any state q and character a:


Here is an intuitive rationale for defining δ(q, a) = σ(Pqa). The machine maintains as an invariant of its operation that

In words, this means that after scanning the first i characters of the text string T , the machine is in state φ(Ti) = q, where q = σ(Ti) is the length of the longest suffix of Ti that is also a prefix of the pattern P. If the next character scanned is T [i 1] = a, then the machine should make a transition to state σ(Ti 1) = σ(Tia). The proof of the theorem shows that σ(Tia) = σ(Pqa). That is, to compute the length of the longest suffix of Tia that is a prefix of P, we can compute the longest suffix of Pqa that is a prefix of P. At each state, the machine only needs to know the length of the longest prefix of P that is a suffix of what has been read so far. Therefore, setting δ(q, a) = σ(Pqa) maintains the desired invariant (32.4). This informal argument will be made rigorous shortly.

In the string-matching automaton of Figure 32.7, for example, δ(5, b) = 4. We make this transition because if the automaton reads a b in state q = 5, then Pq b = ababab, and the longest prefix of P that is also a suffix of ababab is P4 = abab.

To clarify the operation of a string-matching automaton, we now give a simple, efficient program for simulating the behavior of such an automaton (represented by its transition function δ) in finding occurrences of a pattern P of length m in an input text T [1 n]. As for any string-matching automaton for a pattern of length m, the state set Q is {0, 1, . . . , m}, the start state is 0, and the only accepting state is state m.

	FINITE-AUTOMATON-MATCHER(T, δ, m)
1  n  length[T]
2  q  0
3  for i  1 to n
4      do q  δ(q, T[i])
5         if q = m
6            then print "Pattern occurs with shift" i - m

The simple loop structure of FINITE-AUTOMATON-MATCHER implies that its matching time on a text string of length n is Θ(n). This matching time, however, does not include the preprocessing time required to compute the transition function δ. We address this problem later, after proving that the procedure FINITE-AUTOMATON-MATCHER operates correctly.

Consider the operation of the automaton on an input text T [1 n]. We shall prove that the automaton is in state σ(Ti) after scanning character T [i]. Since σ(Ti) = m if and only if P Ti, the machine is in the accepting state m if and only if the pattern P has just been scanned. To prove this result, we make use of the following two lemmas about the suffix function σ.