Descriptive Data Summarization
Introduction: For data preprocessing to be successful, it is essential to have an overall picture of your data. Descriptive data summarization techniques can be used to identify the typical properties of your data and highlight which data values should be treated as noise or outliers. Thus, we first introduce the basic concepts of descriptive data summarization before getting into the concrete workings of data preprocessing techniques.
For many data preprocessing tasks, users would like to learn about data characteristics regarding both central tendency and dispersion of the data. Measures of central tendency include mean, median, mode, and midrange, while measures of data dispersion include quartiles, interquartile range (IQR), and variance. These descriptive statistics are of great help in understanding the distribution of the data. Such measures have been studied extensively in the statistical literature. From the data mining point of view, we need to examine how they can be computed efficiently in large databases. In particular, it is necessary to introduce the notions of distributive measure, algebraic measure, and holistic measure. Knowing what kind of measure we are dealing with can help us choose an efficient implementation for it.
Measuring the Central Tendency
In this section, we look at various ways to measure the central tendency of data. The most common and most effective numerical measure of the “center” of a set of data is the (arithmetic) mean. Let $x_1, x_2, \ldots, x_N$ be a set of $N$ values or observations, such as for some attribute, like salary. The mean of this set of values is

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}.$$
This corresponds to the built-in aggregate function, average (avg () in SQL), provided in relational database systems.
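As a minimal sketch, the mean can be computed exactly as the formula reads, with salary values chosen purely for illustration:

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count,
    analogous to SQL's avg() aggregate."""
    return sum(values) / len(values)

# Illustrative salaries (in thousands)
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(salaries))  # → 58.0
```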
A distributive measure is a measure (i.e., function) that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results in order to arrive at the measure’s value for the original (entire) data set. Both sum() and count() are distributive measures because they can be computed in this manner. Other examples include max() and min(). An algebraic measure is a measure that can be computed by applying an algebraic function to one or more distributive measures. Hence, average (or mean()) is an algebraic measure because it can be computed by sum()/count(). When computing data cubes, sum() and count() are typically saved in precomputation. Thus, the derivation of average for data cubes is straightforward.
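This partition-and-merge idea can be sketched as follows; the partition contents are illustrative, and in practice each partial aggregate would come from a separate subset of a large database:

```python
def partial_aggregates(partition):
    """sum() and count() are distributive: they can be computed
    independently on each subset of the data."""
    return (sum(partition), len(partition))

def merged_mean(partitions):
    """mean is algebraic: it is derived as sum()/count(), where both
    distributive measures are merged across partitions."""
    total, n = 0, 0
    for part in partitions:
        s, c = partial_aggregates(part)
        total += s
        n += c
    return total / n

partitions = [[2, 4, 6], [8, 10]]
print(merged_mean(partitions))  # → 6.0, same as the mean over all five values
```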
Sometimes, each value $x_i$ in a set may be associated with a weight $w_i$, for $i = 1, \ldots, N$. The weights reflect the significance, importance, or occurrence frequency attached to their respective values. In this case, we can compute

$$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N}.$$
This is called the weighted arithmetic mean or the weighted average. Note that the weighted average is another example of an algebraic measure.
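A short sketch of the weighted average, with exam scores and credit-hour weights as assumed, illustrative inputs:

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Illustrative: three exam scores weighted by credit hours
scores = [80, 90, 70]
credits = [3, 1, 2]
print(weighted_mean(scores, credits))  # (240 + 90 + 140) / 6 ≈ 78.33
```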
Although the mean is the single most useful quantity for describing a data set, it is not always the best way of measuring the center of the data. A major problem with the mean is its sensitivity to extreme (e.g., outlier) values. Even a small number of extreme values can corrupt the mean. For example, the mean salary at a company may be substantially pushed up by that of a few highly paid managers. Similarly, the average score of a class in an exam could be pulled down quite a bit by a few very low scores. To offset the effect caused by a small number of extreme values, we can instead use the trimmed mean, which is the mean obtained after chopping off values at the high and low extremes. For example, we can sort the values observed for salary and remove the top and bottom 2% before computing the mean. We should avoid trimming too large a portion (such as 20%) at both ends as this can result in the loss of valuable information.
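The effect of trimming can be sketched as below; the salary values and the one extreme outlier are illustrative:

```python
def trimmed_mean(values, fraction):
    """Mean after chopping off the given fraction of values
    at each of the high and low extremes."""
    vals = sorted(values)
    k = int(len(vals) * fraction)  # number of values trimmed from each end
    trimmed = vals[k:len(vals) - k] if k else vals
    return sum(trimmed) / len(trimmed)

# Illustrative salaries with one highly paid outlier
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 1000]
print(trimmed_mean(salaries, 0.0))   # ≈ 132.17, pushed up by the outlier
print(trimmed_mean(salaries, 0.10))  # → 55.6, top and bottom 10% removed
```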