Data Mining & Data Warehousing

Mining Sequence Patterns In Biological Data

Introduction: A sequence database consists of sequences of ordered elements or events, recorded with or without a concrete notion of time. There are many applications involving sequence data. Typical examples include customer shopping sequences, Web click streams, biological sequences, sequences of events in science and engineering, and in natural and social developments.

Sequential Pattern Mining: Concepts and Primitives

“What is sequential pattern mining?” Sequential pattern mining is the mining of frequently occurring ordered events or subsequences as patterns. An example of a sequential pattern is “Customers who buy a Canon digital camera are likely to buy an HP color printer within a month.” For retail data, sequential patterns are useful for shelf placement and promotions. This industry, as well as telecommunications and other businesses, may also use sequential patterns for targeted marketing, customer retention, and many other tasks. Other areas in which sequential patterns can be applied include Web access pattern analysis, weather prediction, production processes, and network intrusion detection.

Notice that most studies of sequential pattern mining concentrate on categorical (or symbolic) patterns, whereas numerical curve analysis usually belongs to the scope of trend analysis and forecasting in statistical time-series analysis.

The sequential pattern mining problem was first introduced by Agrawal and Srikant in 1995 [AS95] based on their study of customer purchase sequences, as follows: “Given a set of sequences, where each sequence consists of a list of events (or elements) and each event consists of a set of items, and given a user-specified minimum support threshold of min sup, sequential pattern mining finds all frequent subsequences, that is, the subsequences whose occurrence frequency in the set of sequences is no less than min sup.”

Let’s establish some vocabulary for our discussion of sequential pattern mining. Let I = {I1, I2, : : : , Ip} be the set of all items. An itemset is a nonempty set of items.

A sequence is an ordered list of events. A sequence s is denoted {e1, e, 2e3, . . . el}, where event e1 occurs before e2, which occurs before e3, and so on. Event e j is also called an element of s. In the case of customer purchase data, an event refers to a shopping trip in which a customer bought items at a certain store. The event is thus an itemset, that is, an unordered list of items that the customer purchased during the trip. The itemset (or event) is denoted (x1x2. . . . .xq), where xk is an item. For brevity, the brackets are omitted if an element has only one item, that is, element (x) is written as x. Suppose that a customer made several shopping trips to the store. These ordered events form a sequence for the customer. That is, the customer first bought the items in s1, and then later bought the items in s2, and so on. An item can occur at most once in an event of a sequence, but can occur multiple times in different events of a sequence. The number of instances of items in a sequence is called the length of the sequence. A sequence with length l is called an l-sequence. A sequence α = {a1a2 . . . . an} is called a subsequence of another sequence β = {b1b2.. . . . .  .bm}, and b is a super sequence of a, denoted as α v β, if there exist integers 1 ≤ j1 < j2 <    < jn ≤ m such that a1 ⊆ bj1 , a2 ⊆ bj2 , . . . , an ⊆ bjn . For example, if α = {(ab), d} and β = {(abc), (de)}, where a, b, c, d, and e are items, then a is a subsequence of b and b is a super sequence of a.

A sequence database, S, is a set of tuples, {SID, s}, where SID is a sequence ID and s is a sequence. For our example, S contains sequences for all customers of the store. A tuple {SID, s] is said to contain a sequence a, if a is a subsequence of s. The support of a sequence α in a sequence database S is the number of tuples in the database containing a, that is, support S(α) = | (SID, s)|({SID, s} ∈ S)^(α v s)g j. It can be denoted as support (α) if the sequence database is clear from the context. Given a positive integer min sup as the minimum support threshold, a sequence a is frequent in sequence database S if supportS(α) ≥ min_sup. That is, for sequence a to be frequent, it must occur at least min sup times in S. A frequent sequence is called a sequential pattern. A sequential pattern with length l is called an l-pattern. The following example illustrates these concepts.