Data Mining & Data Warehousing

Frequent Pattern Based Clustering Methods

Introduction: Frequent pattern mining can be applied to clustering, resulting in frequent pattern–based cluster analysis. Frequent pattern mining, as the name implies, searches for patterns (such as sets of items or objects) that occur frequently in large data sets. Frequent pattern mining can lead to the discovery of interesting associations and correlations among data objects.

The idea behind frequent pattern–based cluster analysis is that the frequent patterns discovered may also indicate clusters. Frequent pattern–based cluster analysis is well suited to high-dimensional data. It can be viewed as an extension of the dimension-growth subspace clustering approach. However, the boundaries of different dimensions are not obvious, since here they are represented by sets of frequent itemsets. That is, rather than growing the clusters dimension by dimension, we grow sets of frequent itemsets, which eventually lead to cluster descriptions. Typical examples of frequent pattern–based cluster analysis include the clustering of text documents that contain thousands of distinct keywords, and the analysis of microarray data that contain tens of thousands of measured values or “features.” In this section, we examine two forms of frequent pattern–based cluster analysis: frequent term–based text clustering and clustering by pattern similarity in microarray data analysis.

In frequent term–based text clustering, text documents are clustered based on the frequent terms they contain. Using the vocabulary of text document analysis, a term is any sequence of characters separated from other terms by a delimiter. A term can be made up of a single word or several words. In general, we first remove non text information (such as HTML tags and punctuation) and stop words. Terms are then extracted. A stemming algorithm is then applied to reduce each term to its basic stem. In this way, each document can be represented as a set of terms. Each set is typically large. Collectively, a large set of documents will contain a very large set of distinct terms. If we treat each term as a dimension, the dimension space will be of very high dimensionality! This poses great challenges for document cluster analysis. The dimension space can be referred to as term vector space, where each document is represented by a term vector.

This difficulty can be overcome by frequent term–based analysis. That is, by using an efficient frequent itemset mining algorithm introduced in Section 5.2, we can mine a set of frequent terms from the set of text documents. Then, instead of clustering on high-dimensional term vector space, we need only consider the low-dimensional frequent term sets as “cluster candidates.” Notice that a frequent term set is not a cluster but rather the description of a cluster. The corresponding cluster consists of the set of documents containing all of the terms of the frequent term set. A well-selected subset the set of all frequent term sets can be considered as a clustering.

An advantage of frequent term–based text clustering is that it automatically generates a description for the generated clusters in terms of their frequent term sets. Traditional clustering methods produce only clusters—a description for the generated clusters requires an additional processing step.

Another interesting approach for clustering high-dimensional data is based on pattern similarity among the objects on a subset of dimensions. Here we introduce the pCluster method, which performs clustering by pattern similarity in microarray data analysis. In DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli or conditions. Under the pCluster model, two objects are similar if they exhibit a coherent pattern on a subset of dimensions. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very much alike. Discovery of such clusters of genes is essential in revealing significant connections in gene regulatory networks.