User-constrained Cluster Analysis

Introduction: Specifically, a package delivery company with n customers would like to determine locations for k service stations so as to minimize the traveling distance between customers and service stations. The company’s customers are regarded as either high-value customers (requiring frequent, regular services) or ordinary customers (requiring occasional services). The manager has stipulated two constraints: each station should serve (1) at least 100 high-value customers and (2) at least 5,000 ordinary customers.

This can be considered as a constrained optimization problem. We could consider using a mathematical programming approach to handle it. However, such a solution is difficult to scale to large data sets. To cluster n customers into k clusters, a mathematical programming approach will involve at least k n variables. As n can be as large as a few million, we could end up having to solve a few million simultaneous equations— a very expensive feat. A more efficient approach is proposed that explores the idea of micro clustering, as illustrated below.

The general idea of clustering a large data set into k clusters satisfying user-specified constraints goes as follows. First, we can find an initial “solution” by partitioning the data set into k groups, satisfying the user-specified constraints, such as the two constraints in our example. We then iteratively refine the solution by moving objects from one cluster to another, trying to satisfy the constraints. For example, we can move a set of m customers from cluster Ci to Cj if Ci has at least m surplus customers (under the specified constraints), or if the result of moving customers into Ci from some other clusters (including from Cj) would result in such a surplus. The movement is desirable if the total sum of the distances of the objects to their corresponding cluster centers is reduced. Such movement can be directed by selecting promising points to be moved, such as objects that are currently assigned to some cluster, Ci, but that are actually closer to a representative (e.g., centroid) of some other cluster, Cj. We need to watch out for and handle deadlock situations (where a constraint is impossible to satisfy), in which case, a deadlock resolution strategy can be employed.

To increase the clustering efficiency, data can first be preprocessed using the micro clustering idea to form micro clusters (groups of points that are close together), thereby avoiding the processing of all of the points individually. Object movement, deadlock detection, and constraint satisfaction can be tested at the micro cluster level, which reduces the number of points to be computed. Occasionally, such micro clusters may need to be broken up in order to resolve deadlocks under the constraints. This methodology ensures that the effective clustering can be performed in large data sets under the user-specified constraints with good efficiency and scalability.