Stream: A K-medians-based Stream Clustering Algorithm

Introduction: STREAM is a single-pass, constant factor approximation algorithm that was developed for the k-medians problem. The k-medians problem is to cluster N data points into k clusters or groups such that the sum squared error (SSQ) between the points and the cluster center to which they are assigned is minimized. The idea is to assign similar points to the same cluster, where these points are dissimilar from points in other clusters.

Given a data stream model of points, X1; . . . .  ;XN, with timestamps, T1, . . . , TN, the objective of stream clustering is to maintain a consistently good clustering of the sequence seen so far using a small amount of memory and time.

Recall that in the stream data model, data points can only be seen once, and memory and time are limited. To achieve high-quality clustering, the STREAM algorithm processes data streams in buckets (or batches) of m points, with each bucket fitting in main memory. For each bucket, bi, STREAM clusters the bucket’s points into k clusters. It then summarizes the bucket information by retaining only the information regarding the k centers, with each cluster center being weighted by the number of points assigned to its cluster. STREAM then discards the points, retaining only the center information. Once enough centers have been collected, the weighted centers are again clustered to produce another set of O(k) cluster centers. This is repeated so that at every level, at most m points are retained. This approach results in a one-pass, O(kN)-time, O(Ne)-space (for some constant e < 1), constant-factor approximation algorithm for data stream k-medians. STREAM derives quality k-medians clusters with limited space and time. However, it considers neither the evolution of the data nor time granularity. The clustering can become dominated by the older, outdated data of the stream. In real life, the nature of the clusters may vary with both the moment at which they are computed, as well as the time horizon over which they are measured. For example, a user may wish to examine clusters occurring last week, last month, or last year. These may be considerably different.

Therefore, a data stream clustering algorithm should also provide the flexibility to compute clusters over user-defined time periods in an interactive manner. The following algorithm, CluStream, addresses these concerns.

CluStream: Clustering Evolving Data Streams

CluStream is an algorithm for the clustering of evolving data streams based on user specified, on-line clustering queries. It divides the clustering process into on-line and offline components. The on-line component computes and stores summary statistics about the data stream using microclusters, and performs incremental on-line computation and maintenance of the microclusters. The off-line component does macro clustering and answers various user questions using the stored summary statistics, which are based on the tilted time frame model.

To cluster evolving data streams based on both historical and current stream data information, the tilted time frame model (such as a progressive logarithmic model) is adopted, which stores the snapshots of a set of microclusters at different levels of granularity depending on recency. The intuition here is that more information will be needed for more recent events as opposed to older events. The stored information can be used for processing history-related, user-specific clustering queries.

A microcluster in CluStream is represented as a clustering feature. CluStream extends the concept of the clustering feature developed in BIRCH to include the temporal domain. As a temporal extension of the clustering feature, a microcluster for a set of d-dimensionalpoints,X1, . . . ,Xn, with timestamps,T1, : : : , Tn, is defined as the (2d 3) tuple (CF2x;CF1x;CF2t , CF1t , n), wherein CF2x and CF1x are d-dimensional vectors while CF2t , CF1t , and n are scalars. CF2x maintains the sum of the squares of the data values per dimension, that is, Σni =1X2i. Similarly, for each dimension, the sum of the data values is maintained in CF1x. From a statistical point of view, CF2x and CF1x represent the second- and first-order moments of the data, respectively. The sum of squares of the time stamps is maintained in CF2t .The sum of the time stamps is maintained in CF1t. Finally, the number of data points in the microcluster is maintained in n.

Clustering features have additive and subtractive properties that make them very useful for data stream cluster analysis. For example, two microclusters can be merged by adding their respective clustering features. Furthermore, a large number of microclusters can be maintained without using a great deal of memory. Snapshots of these microclusters are stored away at key points in time based on the tilted time frame.