Efficient and Scalable Frequent Itemset Mining Methods
Introduction: The efficiency of the attribute-oriented induction algorithm given below is analyzed step by step as follows:
Step 1 of the algorithm is essentially a relational query to collect the task-relevant data into the working relation, W. Its processing efficiency depends on the query processing methods used. Given the successful implementation and commercialization of database systems, this step is expected to have good performance.
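Step 1 amounts to a selection and projection over the database. As a toy sketch (the relation layout, function name, and predicate here are illustrative assumptions, not part of the algorithm's specification), the query can be modeled as filtering rows and projecting the requested attributes:

```python
def get_task_relevant_data(db, attributes, predicate):
    """Collect the task-relevant data into the working relation W:
    keep the rows satisfying the mining query's selection condition
    and project them onto the requested attributes.
    `db` is modeled as a list of dict rows (an illustrative stand-in
    for a relational table)."""
    return [tuple(row[a] for a in attributes)
            for row in db if predicate(row)]
```

In a real system this step would be pushed down to the DBMS as an SQL query rather than executed in application code, which is exactly why its performance tracks that of the underlying query processor.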
Algorithm: Attribute-oriented induction. Mining generalized characteristics in a relational database, given a user's data mining request.
Input:
- DB, a relational database;
- DMQuery, a data mining query;
- a_list, a list of attributes (containing attributes, ai);
- Gen(ai), a set of concept hierarchies or generalization operators on attributes, ai;
- a_gen_thresh(ai), attribute generalization thresholds for each ai.
Output: P, a prime generalized relation.
Method:
1. W <- get_task_relevant_data(DMQuery, DB); // Let W, the working relation, hold the task-relevant data.
2. prepare_for_generalization(W); // This is implemented as follows.
(a) Scan W and collect the distinct values for each attribute, ai. (Note: If W is very large, this may be done by examining a sample of W.)
(b) For each attribute ai, determine whether ai should be removed, and if not, compute its minimum desired level Li based on its given or default attribute threshold, and determine the mapping pairs (v, v'), where v is a distinct value of ai in W, and v' is its corresponding generalized value at level Li.
3. P <- generalization(W);
The prime generalized relation, P, is derived by replacing each value v in W by its corresponding v' in the mapping while accumulating count and computing any other aggregate values.
This step can be implemented efficiently using either of the two following variations:
(a) For each generalized tuple, insert the tuple into a sorted prime relation P by a binary search: if the tuple is already in P, simply increase its count and other aggregate values accordingly; otherwise, insert it into P.
(b) Since in most cases the number of distinct values at the prime relation level is small, the prime relation can be coded as an m-dimensional array where m is the number of attributes in P, and each dimension contains the corresponding generalized attribute values. Each array element holds the corresponding count and other aggregation values, if any. The insertion of a generalized tuple is performed by measure aggregation in the corresponding array element.
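Steps 2 and 3 above can be sketched in Python. This is a minimal illustration, not the book's implementation: the concept hierarchies are modeled as climb-one-level functions, the thresholds bound the number of distinct generalized values, and a dict of counts plays the role of variation (b)'s m-dimensional array (all names are assumptions for the sketch):

```python
from collections import Counter

def prepare_for_generalization(W, gen, thresholds):
    """For each attribute ai, build a mapping v -> v' that climbs the
    concept hierarchy one level at a time until the number of distinct
    generalized values falls within the attribute's threshold."""
    mappings = []
    for i, (climb, thresh) in enumerate(zip(gen, thresholds)):
        values = {t[i] for t in W}              # distinct values of attribute ai
        mapping = {v: v for v in values}
        while len(set(mapping.values())) > thresh:
            new_mapping = {v: climb(g) for v, g in mapping.items()}
            if new_mapping == mapping:          # hierarchy exhausted; stop
                break
            mapping = new_mapping
        mappings.append(mapping)
    return mappings

def generalization(W, mappings):
    """Variation (b): substitute each value by its generalized value and
    aggregate the count per generalized tuple in a single O(N) pass.
    The Counter stands in for the m-dimensional aggregation array."""
    P = Counter()
    for t in W:
        t_prime = tuple(m[v] for m, v in zip(mappings, t))
        P[t_prime] += 1
    return P
```

Only the count measure is accumulated here; other aggregates (sum, avg) would be carried alongside the count in each array element in the same way.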
Step 2 collects statistics on the working relation. This requires scanning the relation at most once. The cost for computing the minimum desired level and determining the mapping pairs, (v, v'), for each attribute is dependent on the number of distinct values for each attribute and is smaller than N, the number of tuples in the initial relation.
Step 3 derives the prime relation, P. This is performed by inserting generalized tuples into P. There are a total of N tuples in W and p tuples in P. For each tuple, t, in W, we substitute its attribute values based on the derived mapping pairs. This results in a generalized tuple, t'. If variation (a) is adopted, each t' takes O(log p) to find the location for count increment or tuple insertion. Thus the total time complexity is O(N × log p) for all of the generalized tuples. If variation (b) is adopted, each t' takes O(1) to find the tuple for count increment. Thus the overall time complexity is O(N) for all of the generalized tuples.
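The O(log p) lookup of variation (a) can be sketched with Python's `bisect` module over a sorted list of generalized tuples (the function and variable names are illustrative):

```python
import bisect

def insert_generalized(P_keys, P_counts, t_prime):
    """Variation (a): binary-search the sorted prime relation.
    P_keys holds the generalized tuples in sorted order; P_counts
    holds the matching counts. Lookup costs O(log p); on a miss the
    new generalized tuple is inserted at its sorted position."""
    i = bisect.bisect_left(P_keys, t_prime)
    if i < len(P_keys) and P_keys[i] == t_prime:
        P_counts[i] += 1                 # tuple already in P: bump its count
    else:
        P_keys.insert(i, t_prime)        # new generalized tuple
        P_counts.insert(i, 1)
```

Note that a Python list insertion is itself O(p), so this sketch attains the O(N × log p) bound only for the comparisons; a balanced search tree would achieve it for insertions as well.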
Many data analysis tasks need to examine a large number of dimensions or attributes. This may involve dynamically introducing and testing additional attributes rather than just those specified in the mining query. Moreover, a user with little knowledge of the truly relevant set of data may simply specify "in relevance to *" in the mining query, which includes all of the attributes in the analysis. Therefore, an advanced concept description mining process needs to perform attribute relevance analysis on large sets of attributes to select the most relevant ones.