Distance-based Outlier Detection
Introduction: The notion of distance-based outliers was introduced to counter the main limitations imposed by statistical methods. An object, o, in a data set, D, is a distance-based (DB) outlier with parameters pct and d_{min}, that is, a DB(pct, d_{min})-outlier, if at least a fraction, pct, of the objects in D lie at a distance greater than d_{min} from o. In other words, rather than relying on statistical tests, we can think of distance-based outliers as those objects that do not have “enough” neighbors, where neighbors are defined based on distance from the given object. In comparison with statistical methods, distance-based outlier detection generalizes the ideas behind discordancy testing for various standard distributions. It also avoids the excessive computation associated with fitting the observed data to some standard distribution and with selecting discordancy tests.
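The definition above can be checked directly with a naive scan over the data set. The following Python sketch is purely illustrative (the function name, the toy data, and the choice of Euclidean distance are assumptions, not part of the original formulation):

```python
import math

def is_db_outlier(o, data, pct, d_min):
    """Apply the DB(pct, d_min) definition directly: o is an outlier if at
    least a fraction pct of the objects in data lie at a distance greater
    than d_min from o. Illustrative sketch, not a library API."""
    far = sum(1 for x in data if x is not o and math.dist(o, x) > d_min)
    return far >= pct * len(data)

# Four tightly clustered points and one isolated point.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.05), (5.0, 5.0)]
print([is_db_outlier(o, data, pct=0.75, d_min=1.0) for o in data])
# Only the isolated point (5.0, 5.0) qualifies as a DB(0.75, 1.0)-outlier.
```

This brute-force check costs O(n^{2}) distance computations, which is exactly the expense the algorithms below try to reduce.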
For many discordancy tests, it can be shown that if an object, o, is an outlier according to the given test, then o is also a DB(pct, d_{min})-outlier for some suitably defined pct and d_{min}. For example, if objects that lie three or more standard deviations from the mean are considered to be outliers, then, assuming a normal distribution, this definition is captured by a DB(0.9988, 0.13σ)-outlier, where σ is the standard deviation. Several efficient algorithms for mining distance-based outliers have been developed.
These are outlined as follows.
Index-based algorithm: Given a data set, the index-based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius d_{min} around that object. Let M be the maximum number of objects within the d_{min}-neighborhood of an outlier. Therefore, once M + 1 neighbors of object o are found, it is clear that o is not an outlier. This algorithm has a worst-case complexity of O(n^{2}k), where n is the number of objects in the data set and k is the dimensionality. The index-based algorithm scales well as k increases. However, this complexity evaluation takes only the search time into account, even though the task of building an index can itself be computationally intensive.
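The key pruning idea, stopping the neighbor search as soon as M + 1 neighbors are found, can be sketched as follows. To stay self-contained, this hedged sketch replaces the index lookup with a linear scan; a real implementation would fetch candidate neighbors from an R-tree or k-d tree instead. All names and the toy data are illustrative assumptions:

```python
import math

def is_outlier_early_stop(o, data, M, d_min):
    """Count neighbors of o within d_min, terminating as soon as M + 1
    are found (o then cannot be an outlier). In the index-based
    algorithm, the scan below would be an index range query instead."""
    neighbors = 0
    for x in data:
        if x is not o and math.dist(o, x) <= d_min:
            neighbors += 1
            if neighbors > M:      # M + 1 neighbors found: prune o
                return False
    return True

data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.05), (5.0, 5.0)]
M = int(len(data) * (1 - 0.75))    # M = n(1 - pct) with pct = 0.75
print([is_outlier_early_stop(o, data, M, 1.0) for o in data])
```

Early termination is what makes the index-based approach attractive in practice: for dense regions, most objects are pruned after examining only a handful of neighbors.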
Nested-loop algorithm: The nested-loop algorithm has the same computational complexity as the index-based algorithm but avoids index structure construction and tries to minimize the number of I/Os. It divides the memory buffer space into two halves and the data set into several logical blocks. By carefully choosing the order in which blocks are loaded into each half, I/O efficiency can be achieved.
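The block-structured counting can be sketched as follows. The careful block-loading order that yields the I/O savings is omitted here, and the block size, helper names, and toy data are illustrative assumptions:

```python
import math

def nested_loop_outliers(data, M, d_min, block_size=2):
    """Block nested-loop sketch: compare each block of objects against
    every block, maintaining per-object neighbor counts and dropping an
    object as soon as it has more than M neighbors within d_min. A real
    implementation schedules which blocks share the memory buffer to
    minimize disk reads; this in-memory version keeps only the logic."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    outliers = []
    for a in blocks:                       # block whose objects are tested
        counts = {id(o): 0 for o in a}
        candidates = list(a)
        for b in blocks:                   # block providing potential neighbors
            remaining = []
            for o in candidates:
                for x in b:
                    if x is not o and math.dist(o, x) <= d_min:
                        counts[id(o)] += 1
                if counts[id(o)] <= M:     # still a candidate outlier
                    remaining.append(o)
            candidates = remaining
            if not candidates:             # every object in a is pruned
                break
        outliers.extend(candidates)
    return outliers

data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.05), (5.0, 5.0)]
print(nested_loop_outliers(data, M=1, d_min=1.0))
```

As with the index-based algorithm, objects are discarded the moment their neighbor count exceeds M, so densely populated blocks are abandoned early.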
Cell-based algorithm: To avoid O(n^{2}) computational complexity, a cell-based algorithm was developed for memory-resident data sets. Its complexity is O(c^{k} n), where c is a constant depending on the number of cells and k is the dimensionality. In this method, the data space is partitioned into cells with a side length equal to d_{min}/(2√k). Each cell has two layers surrounding it. The first layer is one cell thick, while the second is ⌈2√k − 1⌉ cells thick, that is, 2√k − 1 rounded up to the closest integer. The algorithm counts outliers on a cell-by-cell rather than an object-by-object basis. For a given cell, it accumulates three counts: the number of objects in the cell, in the cell and the first layer together, and in the cell and both layers together. Let’s refer to these counts as cell count, cell + 1 layer count, and cell + 2 layers count, respectively.
“How are outliers determined in this method?” Let M be the maximum number of objects that can exist in the d_{min}-neighborhood of an outlier; from the definition of a DB(pct, d_{min})-outlier, M = n(1 − pct).
- An object, o, in the current cell is considered a candidate outlier only if cell + 1 layer count is less than or equal to M. If this condition does not hold, then all of the objects in the cell can be removed from further investigation as they cannot be outliers.
- If cell + 2 layers count is less than or equal to M, then all of the objects in the cell are considered outliers. Otherwise, if this count is greater than M, then some of the objects in the cell may still be outliers. To detect them, object-by-object processing is used where, for each object, o, in the cell, objects in the second layer of o are examined. Of the objects in the cell, only those having no more than M points in their d_{min}-neighborhoods are outliers.
The d_{min}-neighborhood of an object consists of the object’s cell, all of its first layer, and some of its second layer.
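Putting the cell-based steps together, a two-dimensional sketch might look like the following. The bookkeeping and names are illustrative assumptions, and for simplicity the final object-by-object check scans the whole data set rather than only the relevant second-layer cells:

```python
import math
from collections import defaultdict
from itertools import product

def cell_based_outliers(data, M, d_min):
    """2-D sketch of the cell-based algorithm: cell side d_min/(2*sqrt(k))
    with k = 2, first layer one cell thick, second layer
    ceil(2*sqrt(k) - 1) cells thick. Illustrative, not a library API."""
    k = 2
    side = d_min / (2 * math.sqrt(k))
    l2 = math.ceil(2 * math.sqrt(k) - 1)           # second-layer thickness

    cells = defaultdict(list)                      # grid index -> objects
    for p in data:
        cells[(int(p[0] // side), int(p[1] // side))].append(p)

    def ring_count(c, lo, hi):
        # Objects in cells whose Chebyshev distance from c lies in [lo, hi].
        total = 0
        for dx, dy in product(range(-hi, hi + 1), repeat=2):
            if max(abs(dx), abs(dy)) >= lo:
                total += len(cells.get((c[0] + dx, c[1] + dy), ()))
        return total

    outliers = []
    for c, pts in cells.items():
        count1 = ring_count(c, 0, 1)               # cell + 1 layer count
        if count1 > M:
            continue                               # no object here is an outlier
        count2 = count1 + ring_count(c, 2, 1 + l2) # cell + 2 layers count
        if count2 <= M:
            outliers.extend(pts)                   # all objects are outliers
            continue
        for o in pts:                              # object-by-object check
            n = sum(1 for x in data
                    if x is not o and math.dist(o, x) <= d_min)
            if n <= M:
                outliers.append(o)
    return outliers

data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.05), (5.0, 5.0)]
print(cell_based_outliers(data, M=1, d_min=1.0))
```

Note how whole cells are accepted or rejected using only the three counts; the per-object distance computations are reserved for the ambiguous cells, which is where the near-linear behavior in n comes from.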
A variation of the algorithm is linear with respect to n and guarantees that no more than three passes over the data set are required. It can be used for large disk-resident data sets, yet it does not scale well to high dimensions.
Distance-based outlier detection requires the user to set both the pct and d_{min} parameters. Finding suitable settings for these parameters can involve much trial and error.