String Matching With Finite Automata
In order to specify the string-matching automaton corresponding to a given pattern P[1 ‥ m], we first define an auxiliary function σ , called the suffix function corresponding to P. The function σ is a mapping from Σ* to {0, 1, . . . , m} such that σ(x) is the length of the longest prefix of P that is a suffix of x:
σ(x) = max {k : Pk ⊐ x}.
The suffix function σ is well defined since the empty string P0 = ε is a suffix of every string. As examples, for the pattern P = ab, we have σ(ε) = 0, σ(ccaca) = 1, and σ(ccab) = 2. For a pattern P of length m, we have σ(x) = m if and only if P ⊐ x. It follows from the definition of the suffix function that if x ⊐ y, then σ(x) ≤ σ(y).
We define the string-matching automaton that corresponds to a given pattern P[1 ‥ m] as follows.
-
The state set Q is {0, 1, . . . , m}. The start state q0 is state 0, and state m is the only accepting state.
-
The transition function δ is defined by the following equation, for any state q and character a:
Here is an intuitive rationale for defining δ(q, a) = σ(Pqa). The machine maintains as an invariant of its operation that
In words, this means that after scanning the first i characters of the text string T , the machine is in state φ(Ti) = q, where q = σ(Ti) is the length of the longest suffix of Ti that is also a prefix of the pattern P. If the next character scanned is T [i 1] = a, then the machine should make a transition to state σ(Ti 1) = σ(Tia). The proof of the theorem shows that σ(Tia) = σ(Pqa). That is, to compute the length of the longest suffix of Tia that is a prefix of P, we can compute the longest suffix of Pqa that is a prefix of P. At each state, the machine only needs to know the length of the longest prefix of P that is a suffix of what has been read so far. Therefore, setting δ(q, a) = σ(Pqa) maintains the desired invariant (32.4). This informal argument will be made rigorous shortly.
In the string-matching automaton of Figure 32.7, for example, δ(5, b) = 4. We make this transition because if the automaton reads a b in state q = 5, then Pq b = ababab, and the longest prefix of P that is also a suffix of ababab is P4 = abab.
To clarify the operation of a string-matching automaton, we now give a simple, efficient program for simulating the behavior of such an automaton (represented by its transition function δ) in finding occurrences of a pattern P of length m in an input text T [1 ‥ n]. As for any string-matching automaton for a pattern of length m, the state set Q is {0, 1, . . . , m}, the start state is 0, and the only accepting state is state m.
FINITE-AUTOMATON-MATCHER(T, δ, m) 1 n ← length[T] 2 q ← 0 3 for i ← 1 to n 4 do q ← δ(q, T[i]) 5 if q = m 6 then print "Pattern occurs with shift" i - m
The simple loop structure of FINITE-AUTOMATON-MATCHER implies that its matching time on a text string of length n is Θ(n). This matching time, however, does not include the preprocessing time required to compute the transition function δ. We address this problem later, after proving that the procedure FINITE-AUTOMATON-MATCHER operates correctly.
Consider the operation of the automaton on an input text T [1 ‥ n]. We shall prove that the automaton is in state σ(Ti) after scanning character T [i]. Since σ(Ti) = m if and only if P ⊐ Ti, the machine is in the accepting state m if and only if the pattern P has just been scanned. To prove this result, we make use of the following two lemmas about the suffix function σ.