Data Mining & Data Warehousing

CluStream: Clustering Evolving Data Streams

  • The on-line microcluster processing is divided into two phases: (1) statistical data collection and (2) updating of microclusters. In the first phase, a total of q microclusters, M1, …, Mq, are maintained, where q is usually significantly larger than the number of natural clusters and is determined by the amount of available memory.
  • In the second phase, microclusters are updated. Each new data point is added to either an existing cluster or a new one. To decide whether a new cluster is required, a maximum boundary for each cluster is defined. If the new data point falls within the boundary, it is added to the cluster; otherwise, it is the first data point in a new cluster. When a data point is added to an existing cluster, it is “absorbed” because of the additive property of the microclusters.
  • When a data point is added to a new cluster, the least recently used existing cluster has to be removed or two existing clusters have to be merged, depending on certain criteria, in order to create memory space for the new cluster.
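The absorb-or-create logic above can be sketched as follows. This is a minimal illustration, not CluStream's exact bookkeeping: each microcluster is assumed to store a clustering-feature triple (count n, linear sum LS, squared sum SS), which makes both absorbing a point and merging two clusters simple element-wise addition. The names `MicroCluster` and `process_point`, and the `boundary_factor` and `singleton_radius` parameters, are illustrative assumptions (the paper derives a singleton's boundary from the distance to its nearest neighboring cluster instead).

```python
import math

class MicroCluster:
    """Illustrative microcluster: count n, linear sum LS, squared sum SS
    per dimension. These statistics are additive, so absorbing a point or
    merging two clusters is element-wise addition."""

    def __init__(self, point):
        self.n = 1
        self.ls = list(point)
        self.ss = [x * x for x in point]

    def absorb(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def merge(self, other):
        # Additive property again: merging just sums the statistics.
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def rms_radius(self):
        # RMS deviation of the cluster's points from its centroid.
        var = sum(self.ss[i] / self.n - (self.ls[i] / self.n) ** 2
                  for i in range(len(self.ls)))
        return math.sqrt(max(var, 0.0))

def process_point(clusters, point, boundary_factor=2.0, singleton_radius=1.0):
    """Absorb the point into the nearest cluster if it falls within that
    cluster's maximum boundary; otherwise start a new cluster. Evicting the
    least recently used cluster or merging two clusters is elided here."""
    if clusters:
        nearest = min(clusters, key=lambda c: math.dist(c.centroid(), point))
        # Assumption: a singleton cluster (radius 0) gets a default boundary.
        radius = nearest.rms_radius() or singleton_radius
        if math.dist(nearest.centroid(), point) <= boundary_factor * radius:
            nearest.absorb(point)
            return
    clusters.append(MicroCluster(point))
```

The maximum boundary here is a factor times the cluster's RMS radius, one common way to define it; the key point is that only the (n, LS, SS) statistics, never the raw points, need to be kept.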
  • The off-line component can perform user-directed macroclustering or cluster evolution analysis. Macroclustering allows a user to explore the stream clusters over different time horizons. A time horizon, h, is a history of length h of the stream. Given a user-specified time horizon, h, and the number of desired macroclusters, k, macroclustering finds k high-level clusters over h. This is done as follows: First, the snapshot at time tc − h is subtracted from the snapshot at the current time, tc, so that clusters older than the beginning of the horizon are not included. The microclusters in the horizon are considered as weighted pseudo-points and are reclustered in order to determine higher-level clusters.
  • Notice that the clustering process is similar to the method used in STREAM but requires only two snapshots (the beginning and end of the horizon) and is more flexible over a range of user queries.
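The two-snapshot subtraction can be sketched as below. The snapshot layout is an assumption for illustration: each snapshot maps a cluster id to its (n, LS, SS) statistics, and the subtractive property of these additive statistics isolates exactly the points that arrived within the horizon.

```python
def horizon_microclusters(snapshot_now, snapshot_past):
    """Subtract the snapshot taken at time tc - h from the snapshot at tc.
    Each snapshot maps a cluster id to a (n, LS, SS) triple; subtracting
    yields the statistics of only the points that arrived within the
    horizon h. Clusters absent from the past snapshot pass through whole."""
    result = {}
    for cid, (n, ls, ss) in snapshot_now.items():
        if cid in snapshot_past:
            pn, pls, pss = snapshot_past[cid]
            n -= pn
            ls = [a - b for a, b in zip(ls, pls)]
            ss = [a - b for a, b in zip(ss, pss)]
        if n > 0:
            result[cid] = (n, ls, ss)
    return result
```

Each resulting triple acts as a weighted pseudo-point (weight n, position LS/n) that can then be fed to a standard k-means- or k-medians-style reclustering to produce the k macroclusters.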
  • “What if a user wants to see how clusters have changed over, say, the last quarter or the last year?” Cluster evolution analysis looks at how clusters change over time. Given a user-specified time horizon, h, and two clock times, t1 and t2 (where t1 < t2), cluster evolution analysis examines the evolving nature of the data arriving between (t2 − h, t2) and that arriving between (t1 − h, t1). This involves answering questions like whether new clusters in the data at time t2 were not present at time t1, or whether some of the original clusters were lost. It also involves analyzing whether some of the original clusters at time t1 have shifted in position and nature. With the available microcluster information, this can be done by computing the net snapshots of the microclusters, N(t1, h) and N(t2, h), and then computing the snapshot changes over time. Such evolution analysis of the data over time can be used for network intrusion detection to identify new types of attacks within the network.
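Comparing the two net snapshots can be sketched as a set comparison over cluster ids. This is an illustrative simplification, assuming each net snapshot is reduced to a mapping from cluster id to centroid; the function name `compare_horizons` and the `shift_threshold` parameter are assumptions, not part of CluStream itself.

```python
import math

def compare_horizons(net_t1, net_t2, shift_threshold=1.0):
    """Given net snapshots N(t1, h) and N(t2, h), each assumed to map a
    cluster id to its centroid, report clusters that appeared, were lost,
    or shifted position by more than shift_threshold between horizons."""
    appeared = sorted(set(net_t2) - set(net_t1))
    lost = sorted(set(net_t1) - set(net_t2))
    shifted = sorted(cid for cid in set(net_t1) & set(net_t2)
                     if math.dist(net_t1[cid], net_t2[cid]) > shift_threshold)
    return appeared, lost, shifted
```

In an intrusion detection setting, a cluster in `appeared` would flag traffic behavior in the later horizon with no counterpart in the earlier one, i.e. a candidate new attack type.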
  • CluStream was shown to derive high-quality clusters, especially when the changes are dramatic. Moreover, it offers rich functionality to the user because it registers the essential historical information with respect to cluster evolution. The tilted time frame, along with the microclustering structure, allows for better accuracy and efficiency on real data. Finally, it maintains scalability in terms of stream size, dimensionality, and the number of clusters.
  • In general, stream data mining is still a fast-evolving research field. With the massive amount of data streams populating many applications, it is expected that many new stream data mining methods will be developed, especially for data streams containing additional semantic information, such as time-series streams, spatiotemporal data streams, and video and audio data streams.