# Frequent-pattern Mining In Data Streams

**Introduction:** Frequent-pattern mining finds a set of patterns that occur frequently in a data set, where a pattern can be a set of items (called an itemset), a subsequence, or a substructure. A pattern is considered frequent if its count satisfies a minimum support. Scalable methods for mining frequent patterns have been extensively studied for static data sets. However, mining such patterns in dynamic data streams poses substantial new challenges.

Many existing frequent-pattern mining algorithms require the system to scan the whole data set more than once, but this is unrealistic for infinite data streams. How can we perform incremental updates of frequent itemsets for stream data since an infrequent itemset can become frequent later on, and hence cannot be ignored? Moreover, a frequent itemset can become infrequent as well. The number of infrequent itemsets is exponential and so it is impossible to keep track of all of them.

To overcome this difficulty, there are two possible approaches. One is to keep track of only a predefined, limited set of items and itemsets. This method has very limited usage and expressive power because it requires the system to confine the scope of examination to only the set of predefined itemsets beforehand. The second approach is to derive an approximate set of answers. In practice, approximate answers are often sufficient. A number of approximate item or itemset counting algorithms have been developed in recent research. Here we introduce one such algorithm: the Lossy Counting algorithm. It approximates the frequency of items or itemsets within a user-specified error bound, ε.

**Lossy Counting Algorithm**

- We first introduce the Lossy Counting algorithm for frequent items. This algorithm is fairly simple but quite efficient. We then look at how the method can be extended to find approximate frequent itemsets. “How does the Lossy Counting algorithm find frequent items?” A user first provides two input parameters: (1) the minimum support threshold, σ, and (2) the error bound mentioned previously, denoted as ε. The incoming stream is conceptually divided into buckets of width w = ⌈1/ε⌉. Let N be the current stream length, that is, the number of items seen so far. The algorithm uses a frequency-list data structure for all items with frequency greater than 0. For each item, the list maintains f, the approximate frequency count, and Δ, the maximum possible error of f.
- The algorithm processes buckets of items as follows. When a new bucket comes in, the items in the bucket are added to the frequency list. If a given item already exists in the list, we simply increase its frequency count, f. Otherwise, we insert it into the list with a frequency count of 1. If the new item is from the bth bucket, we set Δ, the maximum possible error on the frequency count of the item, to be b − 1. Based on our discussion so far, the item frequency counts hold the actual frequencies rather than approximations. They become approximations, however, because of the next step. Whenever a bucket “boundary” is reached (that is, N has reached a multiple of the width w, such as w, 2w, 3w, etc.), the frequency list is examined. Let b be the current bucket number. An item entry is deleted if, for that entry, f + Δ ≤ b. In this way, the algorithm keeps the frequency list small so that it may fit in main memory. The frequency count stored for each item will either be the true frequency of the item or an underestimate of it.
- “By how much can a frequency count be underestimated?” One of the most important factors in approximation algorithms is the approximation ratio (or error bound). Let’s look at the case where an item is deleted. This occurs when f + Δ ≤ b for an item, where b is the current bucket number. We know that b ≤ N/w, that is, b ≤ εN. The actual frequency of an item is at most f + Δ. Thus, the most that an item can be underestimated is εN. If the actual support of this item is σ (the minimum support for it to be considered frequent), then its actual frequency is at least σN, and the frequency, f, on the frequency list should be at least (σN − εN). Thus, if we output all of the items in the frequency list having an f value of at least (σN − εN), then all of the frequent items will be output. In addition, some subfrequent items (with an actual frequency of at least σN − εN but less than σN) will be output, too.
- The Lossy Counting algorithm has three nice properties: (1) there are no false negatives, that is, no truly frequent item is missing from the output; (2) false positives are quite “positive” as well, since the output items have a frequency of at least σN − εN; and (3) the frequency of a frequent item can be underestimated by at most εN. For frequent items, this underestimation is only a small fraction of the true frequency, so the approximation is acceptable.
- “How much space is needed to save the frequency list?” It has been shown that the algorithm takes at most (1/ε) log(εN) entries, where N is the stream length so far. If we assume that elements with very low frequency tend to occur more or less uniformly at random, then it has been shown that Lossy Counting requires no more than 7/ε entries. Thus, the space requirement for this algorithm is reasonable.
- It is much more difficult to find frequent itemsets than frequent items in data streams, because the number of possible itemsets grows exponentially with the number of distinct items. As a consequence, there will be many more frequent itemsets. If we still process the data bucket by bucket, we will probably run out of memory. An alternative is to process as many buckets as we can at a time.
- To find frequent itemsets, transactions are still divided into buckets of size w = ⌈1/ε⌉. This time, however, we read as many buckets as possible into main memory. As before, we maintain a frequency list, although now it pertains to itemsets rather than items. Suppose we can read β buckets into main memory. We then update the frequency list using all of these buckets as follows. If a given itemset already exists in the frequency list, we update f by counting the occurrences of the itemset among the current batch of β buckets. If the updated entry satisfies f + Δ ≤ b, where b is the current bucket number, we delete the entry. If an itemset has frequency f ≥ β in the batch and does not appear in the list, it is inserted as a new entry, where Δ is set to b − β as the maximum possible error of f.
- In practice, β will be large, such as greater than 30. This approach will save memory because all itemsets with frequency less than β will not be recorded in the frequency list anymore. For smaller values of β (such as 1 for the frequent item version of the algorithm described earlier), more spurious subsets will find their way into the frequency list. This would drastically increase the average size and refresh rate of the frequency list and harm the algorithm’s efficiency in both time and space.
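The item version of the algorithm can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it processes the stream item by item and prunes at each bucket boundary (equivalent to the bucket-at-a-time description above), and the class name `LossyCounter` and its method names are hypothetical.

```python
from math import ceil

class LossyCounter:
    """Sketch of Lossy Counting for frequent items.

    epsilon is the user-specified error bound: stored counts
    underestimate true frequencies by at most epsilon * N.
    """

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = ceil(1 / epsilon)  # bucket width w = ceil(1/epsilon)
        self.n = 0                      # stream length N seen so far
        self.counts = {}                # frequency list: item -> (f, delta)

    def add(self, item):
        self.n += 1
        b = ceil(self.n / self.width)   # current bucket number
        if item in self.counts:
            f, delta = self.counts[item]
            self.counts[item] = (f + 1, delta)
        else:
            # a new item from bucket b has maximum possible error b - 1
            self.counts[item] = (1, b - 1)
        # at each bucket boundary, delete entries with f + delta <= b
        if self.n % self.width == 0:
            self.counts = {it: (f, d) for it, (f, d) in self.counts.items()
                           if f + d > b}

    def frequent(self, sigma):
        """Output every item whose stored count f >= (sigma - epsilon) * N."""
        threshold = (sigma - self.epsilon) * self.n
        return {it for it, (f, _) in self.counts.items() if f >= threshold}
```

For example, with ε = 0.1 (so w = 10) and a stream of 100 items in which one item occurs 50 times and 50 other items occur once each, the singletons are pruned at bucket boundaries while the heavy item survives with its exact count.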

In general, Lossy Counting is a simple but effective algorithm for approximately finding frequent items and itemsets. Its limitations lie in three aspects: (1) the space bound is insufficient, as the frequency list may keep growing as the stream goes on; (2) for frequent itemsets, the algorithm scans each transaction many times, and the size of main memory greatly affects its efficiency; and (3) the output is based on all of the previous data, although users may be more interested in recent data than in data from the remote past. A tilted time frame model with different time granularities can be integrated with Lossy Counting in order to emphasize the recency of the data.