Rock: A Hierarchical Clustering Algorithm For Categorical Attributes

Introduction: ROCK (RObust Clustering using linKs) is a hierarchical clustering algorithm that explores the concept of links (the number of common neighbors between two objects) for data with categorical attributes. Traditional clustering algorithms for clustering data with Boolean and categorical attributes use distance functions. However, experiments show that such distance measures cannot lead to high-quality clusters when clustering categorical data. Furthermore, most clustering algorithms assess only the similarity between points when clustering; that is, at each step, points that are the most similar are merged into a single cluster.

This “localized” approach is prone to errors. For example, two distinct clusters may have a few points or outliers that are close; therefore, relying on the similarity between points to make clustering decisions could cause the two clusters to be merged. ROCK takes a more global approach to clustering by considering the neighborhoods of individual pairs of points. If two similar points also have similar neighborhoods, then the two points likely belong to the same cluster and so can be merged.

More formally, two points, pi and pj, are neighbors if sim(pi, pj) ≥ θ, where sim is a similarity function and θ is a user-specified threshold. We can choose sim to be a distance metric or even a non-metric that is normalized so that its values fall between 0 and 1, with larger values indicating that the points are more similar. The number of links between pi and pj is defined as the number of common neighbors between pi and pj. If the number of links between two points is large, then it is more likely that they belong to the same cluster. By considering neighboring data points in the relationship between individual pairs of points, ROCK is more robust than standard clustering methods that focus only on point similarity.

Transactions are considered records with Boolean attributes, each corresponding to an individual item, such as bread or cheese. In the record for a transaction, the attribute corresponding to an item is true if the transaction contains the item; otherwise, it is false. Other data sets with categorical attributes can be handled in a similar manner. ROCK’s concepts of neighbors and links are illustrated in the following example, where the similarity between two “points” or transactions, Ti and Tj, is defined with the Jaccard coefficient as

sim(Ti, Tj) = |Ti ⋂ Tj|/|Ti ⋃ Tj|