Prefix Span: Prefix-projected Sequential Pattern Growth

Introduction: Pattern growth is a method of frequent-pattern mining that does not require candidate generation. The technique originated in the FP-growth algorithm for transaction databases.

The general idea of this approach is as follows: it finds the frequent single items, and then compresses this information into a frequent-pattern

tree, or FP-tree. The FP-tree is used to generate a set of projected databases, each associated with one frequent item. Each of these databases is mined separately. The algorithm builds prefix patterns, which it concatenates with suffix patterns to find frequent patterns, avoiding candidate generation. Here, we look at Prefix Span, which extends the pattern-growth approach to instead mine sequential patterns.

Suppose that all the items within an event are listed alphabetically. For example, instead of listing the items in an event as, say, (bac), we list them as (abc) without loss of generality. Given a sequence a = {e1e2. . . . . en} (where each ei corresponds to a frequent event in a sequence database, S), a sequence β = {e’1e’2   e’m} (m ≤ n) is called a prefix of a if and only if (1) e’i = ei for (i ≤ m-1); (2) e’m ⊆ em; and (3) all the frequent items in (em – e’m) are alphabetically after those in e’m. Sequence γ = {e’’mem 1. . .  . .  .en} is called the suffix of a with respect to prefix β, denoted as γ = a/β, where e’’m = (em - e’m). We also denote a = β. γ. Note if b is not a subsequence of a, the suffix of a with respect to b is empty.