Methodologies For Stream Data Processing And Stream Data Systems
Introduction: As seen from the previous discussion, it is impractical to scan through an entire data stream more than once. Sometimes we cannot even “look” at every element of a stream because the stream flows in so fast and changes so quickly. The gigantic size of such data sets also implies that we generally cannot store the entire stream data set in main memory or even on disk. The problem is not just that there is a lot of data; it is that the universes that we are keeping track of are relatively large, where a universe is the domain of possible values for an attribute.
For example, if we were tracking the ages of millions of people, our universe would be relatively small, perhaps between zero and one hundred and twenty. We could easily maintain exact summaries of such data. In contrast, the universe corresponding to the set of all pairs of IP addresses on the Internet is very large, which makes exact storage intractable. A reasonable way of thinking about data streams is to actually think of a physical stream of water. Heraclitus once said that you can never step in the same stream twice,1 and so it is with stream data.
For effective processing of stream data, new data structures, techniques, and algorithms are needed. Because we do not have an infinite amount of space to store stream data, we often tradeoff between accuracy and storage. That is, we generally are willing to settle for approximate rather than exact answers. Synopses allow for this by providing summaries of the data, which typically can be used to return approximate answers to queries. Synopses use synopsis data structures, which are any data structures that are substantially smaller than their base data set (in this case, the stream data). From the algorithmic point of view, we want our algorithms to be efficient in both space and time.
Instead of storing all or most elements seen so far, using O(N) space, we often want to use poly logarithmic space, O(logk N), where N is the number of elements in the stream data. We may relax the requirement that our answers are exact, and ask for approximate answers within a small error range with high probability. That is, many data stream– based algorithms compute an approximate answer within a factor e of the actual answer, with high probability. Generally, as the approximation factor (1 e) goes down, the space requirements go up.
Random Sampling: Rather than deal with an entire data stream, we can think of sampling the stream at periodic intervals. “To obtain an unbiased sampling of the data, we need to know the length of the stream in advance. But what can we do if we do not know this length in advance?” In this case, we need to modify our approach.
A technique called reservoir sampling can be used to select an unbiased random sample of s elements without replacement. The idea behind reservoir sampling is relatively simple. We maintain a sample of size at least s, called the “reservoir,” from which a random sample of size s can be generated. However, generating this sample from the reservoir can be costly, especially when the reservoir is large. To avoid this step, we maintain a set of s candidates in the reservoir, which form a true random sample of the elements seen so far in the stream. As the data stream flows, every new element has a certain probability of replacing an old element in the reservoir. Let’s say we have seen N elements thus far in the stream. The probability that a new element replaces an old one, chosen at random, is then s=N. This maintains the invariant that the set of s candidates in our reservoir forms a random sample of the elements seen so far.
Sliding Windows: Instead of sampling the data stream randomly, we can use the sliding window model to analyze stream data. The basic idea is that rather than running computations on all of the data seen so far, or on some sample, we can make decisions based only on recent data. More formally, at every time t, a new data element arrives. This element “expires” at time t w, where w is the window “size” or length. The sliding window model is useful for stocks or sensor networks, where only recent events may be important. It also reduces memory requirements because only a small window of data is stored.
Histograms: The histogram is a synopsis data structure that can be used to approximate the frequency distribution of element values in a data stream. A histogram partitions the data into a set of contiguous buckets. Depending on the partitioning rule used, the width (bucket value range) and depth (number of elements per bucket) can vary. The equal-width partitioning rule is a simple way to construct histograms, where the range of each bucket is the same. Although easy to implement, this may not sample the probability distribution function well. A better approach is to use V-Optimal histograms . Similar to clustering, V-Optimal histograms define bucket sizes that minimize the frequency variance within each bucket, which better captures the distribution of the data. These histograms can then be used to approximate query answers rather than using sampling techniques.