Data Mining & Data Warehousing

Mining Frequent Itemsets Using Vertical Data Format

Introduction: Both the Apriori and FP-growth methods mine frequent patterns from a set of transactions in TID-itemset format (that is, {TID : itemsetg), where TID is a transaction-id and itemset is the set of items bought in transaction TID. This data format is known as horizontal data format. Alternatively, data can also be presented in item-TID set format (that is, fitem : TID setg), where item is an item name, and TID set is the set of transaction identifiers containing the item.

This format is known as vertical data format. In this section, we look at how frequent itemsets can also be mined efficiently using vertical data format, which is the essence of the ECLAT (Equivalence CLASS Transformation) algorithm developed by Zaki [Zak00].

Algorithm: FP growth. Mine frequent itemsets using an FP-tree by pattern fragment growth.

Input:

  • D, a transaction database;
  • min sup, the minimum support count threshold.

Output: The complete set of frequent patterns.

Method:

1. The FP-tree is constructed in the following steps:

(a) Scan the transaction database D once. Collect F, the set of frequent items, and their support counts. Sort F in support count descending order as L, the list of frequent items.

(b) Create the root of an FP-tree, and label it as “null.” For each transaction Trans in D do the following. Select and sort the frequent items in Trans according to the order of L. Let the sorted frequent item list in Trans be [pjP], where p is the first element and P is the remaining list. Call insert tree([pjP], T), which is performed as follows. If T has a child N such that N.item-name=p.item-name, then increment N’s count by 1; else create a new node N, and let its count be 1, its parent link be linked to T, and its node-link to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert tree(P, N) recursively.

2. The FP-tree is mined by calling FP growth(FP tree, null), which is implemented as follows.

procedure FP growth(Tree,α)

(1) if Tree contains a single path P then

(2) for each combination (denoted as β) of the nodes in the path P

(3) generate pattern β∪α with support count = minimum support count o f nodes in β;

(4) else for each ai in the header of Tree {

(5) generate pattern β = ai∪ a with support count = ai:support_count;

(6) construct β’s conditional pattern base and then β’s conditional FP tree Treeb;

(7) if Treeβ ≠0 then

(8) call FP growth(Treeβ, β); g