Clustering Evolving Data Streams
Introduction: Imagine a huge volume of dynamic stream data. Many applications require the automated clustering of such data into groups based on their similarity. Examples include network intrusion detection, Web click-stream analysis, and stock market analysis. Although there are many powerful methods for clustering static data sets, clustering data streams places additional constraints on such algorithms. As we have seen, the data stream model of computation requires algorithms to make a single pass over the data with bounded memory and limited processing time, even though the stream may be highly dynamic and evolve over time.
For effective clustering of stream data, several new methodologies have been developed, as follows:
- Compute and store summaries of past data: Due to limited memory space and fast response requirements, compute summaries of the previously seen data, store the relevant results, and use such summaries to compute important statistics when required.
- Apply a divide-and-conquer strategy: Divide data streams into chunks based on order of arrival, compute summaries for these chunks, and then merge the summaries. In this way, larger models can be built out of smaller building blocks.
- Incremental clustering of incoming data streams: Because stream data enter the system continuously and incrementally, the clusters derived must be incrementally refined.
- Perform microclustering as well as macroclustering analysis: Stream clusters can be computed in two steps: (1) compute and store summaries at the microcluster level, where microclusters are formed by applying a hierarchical bottom-up clustering algorithm, and (2) compute macroclusters at the user-specified level (e.g., by using another clustering algorithm to group the microclusters). This two-step computation effectively compresses the data and often results in a smaller margin of error.
- Explore multiple time granularities for the analysis of cluster evolution: Because the more recent data often play a different role from that of the remote (i.e., older) data in stream data analysis, use a tilted time frame model to store snapshots of summarized data at different points in time.
- Divide stream clustering into on-line and off-line processes: While data are streaming in, basic summaries of data snapshots should be computed, stored, and incrementally updated. Therefore, an on-line process is needed to maintain such dynamically changing clusters. Meanwhile, a user may pose queries to ask about past, current, or evolving clusters. Such analysis can be performed off-line or as a process independent of on-line cluster maintenance.
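The first three methodologies above (summarize, merge, refine incrementally) can be sketched with a cluster feature (CF) vector, the additive summary popularized by BIRCH and adopted by many stream-clustering algorithms. The sketch below is minimal and the class and method names are illustrative, not from any particular library:

```python
import math

class ClusterFeature:
    """Additive summary (N, LS, SS) of a set of d-dimensional points:
    the count, the per-dimension linear sum, and the sum of squares."""

    def __init__(self, dim):
        self.n = 0             # number of points summarized
        self.ls = [0.0] * dim  # linear sum per dimension
        self.ss = [0.0] * dim  # sum of squares per dimension

    def add(self, point):
        """Incrementally absorb one incoming point (single pass, O(d))."""
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def merge(self, other):
        """Merge the summary of another chunk: the divide-and-conquer
        step, possible because all three components are additive."""
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        """RMS deviation of the summarized points from the centroid,
        derivable from (N, LS, SS) without revisiting the raw data."""
        d = len(self.ls)
        var = sum(self.ss[i] / self.n - (self.ls[i] / self.n) ** 2
                  for i in range(d)) / d
        return math.sqrt(max(var, 0.0))
```

Because centroid and radius follow from (N, LS, SS) alone, the raw points can be discarded after summarization, which is what makes the bounded-memory, single-pass requirement attainable.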
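The micro/macro split and the on-line/off-line division can be combined in one design: an on-line process absorbs each arriving point into the nearest microcluster summary, while an off-line query groups the microcluster centroids into macroclusters on demand. The following is a simplified sketch of that architecture (the class name, the distance threshold rule, and the use of weighted k-means for the off-line step are assumptions for illustration):

```python
import math
import random

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class StreamClusterer:
    """On-line microclustering with an off-line macroclustering query."""

    def __init__(self, radius_threshold):
        self.threshold = radius_threshold
        self.micro = []  # list of (count, linear_sum) summaries

    def insert(self, point):
        """On-line step: absorb a point into the nearest microcluster,
        or start a new one if none is within the distance threshold."""
        best, best_d = None, None
        for i, (n, ls) in enumerate(self.micro):
            d = _dist(point, [s / n for s in ls])
            if best_d is None or d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= self.threshold:
            n, ls = self.micro[best]
            self.micro[best] = (n + 1, [s + x for s, x in zip(ls, point)])
        else:
            self.micro.append((1, list(point)))

    def macroclusters(self, k, iters=10):
        """Off-line step: k-means over the microcluster centroids, each
        weighted by the number of points it summarizes."""
        cents = [[s / n for s in ls] for n, ls in self.micro]
        weights = [n for n, _ in self.micro]
        seeds = random.sample(cents, k)
        for _ in range(iters):
            groups = [[] for _ in range(k)]
            for c, w in zip(cents, weights):
                j = min(range(k), key=lambda j: _dist(c, seeds[j]))
                groups[j].append((c, w))
            for j, g in enumerate(groups):
                if g:
                    tot = sum(w for _, w in g)
                    seeds[j] = [sum(c[i] * w for c, w in g) / tot
                                for i in range(len(seeds[j]))]
        return seeds
```

The off-line step touches only the (much smaller) set of microcluster summaries, so a user can re-run it with different values of k without rescanning the stream.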
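One common realization of the tilted time frame idea is the pyramidal time frame used by CluStream: snapshots of the summarized data are kept at multiple orders of granularity, dense for recent history and sparse for old history. Below is a simplified sketch of that bookkeeping (the class name and the divisibility rule for assigning orders are illustrative; clock times are assumed to start at 1):

```python
class PyramidalSnapshots:
    """Pyramidal tilted time frame: a snapshot taken at time t is stored
    at order i when alpha**i is the highest power of alpha dividing t;
    each order keeps only the most recent `cap` snapshots."""

    def __init__(self, alpha=2, cap=3):
        self.alpha = alpha
        self.cap = cap
        self.levels = {}  # order -> list of (time, snapshot), newest last

    def store(self, t, snapshot):
        # Order = highest power of alpha that divides t (t >= 1 assumed).
        order = 0
        while t % (self.alpha ** (order + 1)) == 0:
            order += 1
        bucket = self.levels.setdefault(order, [])
        bucket.append((t, snapshot))
        if len(bucket) > self.cap:
            bucket.pop(0)  # drop the oldest snapshot at this order

    def times(self):
        """All retained snapshot times, oldest first."""
        return sorted(t for b in self.levels.values() for t, _ in b)
```

With alpha = 2 and cap = 2, after 16 time units only 8 snapshots survive, spaced closely near the present and sparsely in the past, which is exactly the storage profile needed to analyze recent evolution in fine detail while still answering coarse queries about older clusters.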