Data Mining & Data Warehousing

Stream Olap And Stream Data Cubes

Introduction: Stream data are generated continuously in a dynamic environment, with huge volume, infinite flow, and fast-changing behavior. It is impossible to store such data streams completely in a data warehouse. Most stream data represent low-level information, consisting of various kinds of detailed temporal and other features.

To find interesting or unusual patterns, it is essential to perform multidimensional analysis on aggregate measures (such as sum and average). This would facilitate the discovery of critical changes in the data at higher levels of abstraction, from which users can drill down to examine more detailed levels, when needed. Thus multidimensional OLAP analysis is still needed in stream data analysis, but how can we implement it? Consider the following motivating example.

Time Dimension with Compressed Time Scale: Tilted Time Frame

In stream data analysis, people are usually interested in recent changes at a fine scale but in long-term changes at a coarse scale. Naturally, we can register time at different levels of granularity. The most recent time is registered at the finest granularity; the more distant time is registered at a coarser granularity; and the level of coarseness depends on the application requirements and on how old the time point is (from the current time). Such a time dimension model is called a tilted time frame. This model is sufficient for many analysis tasks and also ensures that the total amount of data to retain in memory or to be stored on disk is small.

There are many possible ways to design a titled time frame. Here we introduce three models, as illustrated in Figure 8.1: (1) natural tilted time frame model, (2) logarithmic tilted time frame model, and (3) progressive logarithmic tilted time frame model.

A natural tilted time frame model is shown in Figure 8.1(a), where the time frame (or window) is structured in multiple granularities based on the “natural” or usual time scale: the most recent 4 quarters (15 minutes), followed by the last 24 hours, then 31 days, and then 12 months (the actual scale used is determined by the application). Based on this model, we can compute frequent itemsets in the last hour with the precision of a quarter of an hour, or in the last day with the precision of an hour, and

so on until the whole year with the precision of a month.2 This model registers only 4 24 31 12 = 71 units of time for a year instead of 365244 = 35,040 units, with an acceptable trade-off of the grain of granularity at a distant time.

The second model is the logarithmic tilted time frame model, as shown in Figure 8.1(b), where the time frame is structured in multiple granularities according to a logarithmic scale. Suppose that the most recent slot holds the transactions of the current quarter. The remaining slots are for the last quarter, the next two quarters (ago), 4 quarters, 8 quarters, 16 quarters, and so on, growing at an exponential rate. According to this model, with one year of data and the finest precision at a quarter, we would need log2(365244) 1 =16:1 units of time instead of 365X24X4 =35,040 units. That is, we would just need 17 time frames to store the compressed information. The third method is the progressive logarithmic tilted time frame model, where snapshots are stored at differing levels of granularity depending on the recency. Let T be the clock time elapsed since the beginning of the stream. Snapshots are classified into different frame numbers, which can vary from0 to max frame, where log2(T)-max_capacity ≤ max frame ≤ log2(T), and max capacity is the maximal number of snapshots held in each frame.

Each snapshot is represented by its timestamp. The rules for insertion of a snapshot t (at time t) into the snapshot frame table are defined as follows: (1) if (t mod 2i) = 0 but (t mod 2i 1) 6= 0, t is inserted into frame number i if i ≤ max frame; otherwise (i.e., i > max frame), t is inserted into max frame; and (2) each slot has a max capacity. At the insertion of t into frame number i, if the slot already reaches its max capacity, the oldest snapshot in this frame is removed and the new snapshot inserted.