Data Mining & Data Warehousing

Mining Variant And Constrained Substructure Patterns

Introduction: The frequent sub-graph mining discussed in the previous section handles only one special kind of graphs: labeled, undirected, connected simple graphs without any specific constraints. That is, we assume that the database to be mined contains a set of graphs, each consisting of a set of labeled vertices and labeled but undirected edges, with no other constraints. However, many applications or users may need to enforce various kinds of constraints on the patterns to be mined or seek variant substructure patterns. For example, we may like to mine patterns, each of which contains certain specific vertices/edges, or where the total number of vertices/edges is within a specified range.Or what if we seek patterns where the average density of the graph patterns is above a threshold? Although it is possible to develop customized algorithms for each such case, there are too many variant cases to consider. Instead, a general framework is needed—one that can classify constraints on the graph patterns. Efficient constraint-based methods can then be developed for mining substructure patterns and their variants. In this section, we study several variants and constrained substructure patterns and looks at how they can be mined.

Mining Closed Frequent Substructures: The first important variation of a frequent substructure is the closed frequent substructure. Take mining frequent sub-graphs as an example. As with frequent itemset mining and sequential pattern mining, mining graph patterns may generate an explosive number of patterns. This is particularly true for dense data sets, because all of the sub-graphs of a frequent graph are also frequent. This is an inherent problem, because according to the Apriori property, all the subgraphs of a frequent substructure must be frequent. A large graph pattern may generate an exponential number of frequent sub-graphs. For example, among 423 confirmed active chemical compounds in an AIDS antiviral screen data set, there are nearly 1 million frequent graph patterns whose support is at least 5%. This renders the further analysis on frequent graphs nearly impossible.One way to alleviate this problem is to mine only frequent closed graphs, where a frequent graph G is closed if and only if there is no proper supergraph G0 that has the same support as G. Alternatively, we can mine maximal subgraph patterns where a frequent pattern G is maximal if and only if there is no frequent super-pattern of G. A set of closed subgraph patterns has the same expressive power as the full set of subgraph patterns under the same minimum support threshold, because the latter can be generated by the derived set of closed graph patterns. On the other hand, the maximal pattern set is a subset of the closed pattern set. It is usually more compact than the closed pattern set. However, we cannot use it to reconstruct the entire set of frequent patterns—the support information of a pattern is lost if it is a proper sub pattern of a maximal pattern, yet carries a different support.Example: Maximal frequent graph. The second graph is not maximal because it has a frequent supergraph.

Mining closed graphs leads to a complete but more compact representation. For example, for the AIDS antiviral data set mentioned above, among the 1 million frequent graphs, only about 2,000 are closed frequent graphs. If further analysis, such as classification or clustering, is performed on closed frequent graphs instead of frequent graphs, it will achieve similar accuracy with less redundancy and higher efficiency. An efficient method, called Close Graph, was developed for mining closed frequent graphs by extension of the gSpan algorithm.

Extension of Pattern-Growth Approach: Mining Alternative Substructure Patterns: A typical pattern-growth graph mining algorithm, such as gSpan or Close Graph, mines labeled, connected, undirected frequent or closed subgraph patterns. Such a graph mining framework can easily be extended for mining alternative substructure patterns. Here we discuss a few such alternatives.

  • First, the method can be extended for mining unlabeled or partially labeled graphs. Each vertex and each edge in our previously discussed graphs contain labels. Alternatively, if none of the vertices and edges in a graph is labeled, the graph is unlabeled. A graph is partially labeled if only some of the edges and/or vertices are labeled. To handle such cases, we can build a label set that contains the original label set and a new empty label, f. Label f is assigned to vertices and edges that do not have labels. Notice that label f may match with any label or with f only, depending on the application semantics. With this transformation, gSpan (and CloseGraph) can directly mine unlabeled or partially labeled graphs.
  • Second, we examine whether gSpan can be extended to mining non simple graphs. A non simple graph may have a self-loop (i.e., an edge joins a vertex to itself) and multiple edges (i.e., several edges connecting two of the same vertices). In gSpan, we always first grow backward edges and then forward edges. In order to accommodate self-loops, the growing order should be changed to backward edges, self-loops, and forward edges. If we allow sharing of the same vertices in two neighboring edges in a DFS code, the definition of DFS lexicographic order can handle multiple edges smoothly. Thus gSpan can mine non simple graphs efficiently, too.
  • Third, we see how gSpan can be extended to handle mining directed graphs. In a directed graph, each edge of the graph has a defined direction. If we use a 5-tuple, (i; j; li; l(i; j); l j), to represent an undirected edge, then for directed edges, a new state is introduced to form a 6-tuple, (i; j;d; li; l(i; j); l j), where d represents the direction of an edge. Let d = 1 be the direction from i (vi) to j (v j), whereas d = -1 is that from j (v j) to i (vi). Notice that the sign of d is not related to the forwardness or backwardness of an edge. When extending a graph with one more edge, this edge may have two choices of d, which only introduces a new state in the growing procedure and need not change the framework of gSpan.
  • Fourth, the method can also be extended to mining disconnected graphs. There are two cases to be considered: (1) the graphs in the data set may be disconnected, and (2) the graph patterns may be disconnected. For the first case, we can transform the original data set by adding a virtual vertex to connect the disconnected graphs in each graph. We then apply gSpan on the new graph data set. For the second case, we redefine the DFS code. A disconnected graph pattern can be viewed as a set of connected graphs, r = {g0;g1; : : : ;gm}, where gi is a connected graph, 0 ≤ i ≤ m. Because each graph can be mapped to a minimum DFS code, a disconnected graph r can be translated into a code, g = (s0; s1; : : : ; sm), where si is the minimum DFS code of gi. The order of gi in r is irrelevant. Thus, we enforce an order in {si} such that s0 ≤ s1 ≤ : : : ≤ sm. γ can be extended by either adding one-edge sm 1 (sm ≤ sm 1) or by extending sm, . . . , and s0. When checking the frequency of γ in the graph data set, make sure that g0;g1; : : : ; and gm are disconnected with each other.
  • Finally, if we view a tree as a degenerated graph, it is straightforward to extend the method to mining frequent subtrees. In comparison with a general graph, a tree can be considered as a degenerated direct graph that does not contain any edges that can go back to its parent or ancestor nodes. Thus if we consider that our traversal always starts at the root (because the tree does not contain any backward edges), gSpan is ready to mine tree structures. Based on the mining efficiency of the pattern growth–based approach, it is expected that gSpan can achieve good performance in tree-structure mining.

Constraint-Based Mining of Substructure Patterns

  • As we have seen in previous chapters, various kinds of constraints can be associated with a user’s mining request. Rather than developing many case-specific substructure mining algorithms, it is more appropriate to set up a general framework of constraint-based substructure mining so that systematic strategies can be developed to push constraints deep into the mining process.
  • Constraint-based mining of frequent substructures can be developed systematically, similar to the constraint-based mining of frequent patterns and sequential patterns introduced in Chapters 5 and 8. Take graph mining as an example. As with the constraint based frequent pattern mining framework outlined in Chapter 5, graph constraints can be classified into a few categories, including anti monotonic, monotonic, and succinct. Efficient constraint-based mining methods can be developed in a similar way by extending efficient graph-pattern mining algorithms, such as gSpan and Close Graph.