1
New ideas on FP-Growth and batch incremental mining with FP-Tree
Presenters - Fung Ho Long, Kwok Chung Hin
2
Review: Frequent pattern mining?
Goal: Discover frequent patterns within transactions
Given: A set of transactions over items
Target: Find itemsets whose occurrence count >= a support threshold
Applications: recommendations, outlier detection, classification, etc.
Well-known solutions:
Apriori: requires a tremendous number of database scans
FP-Tree (today's focus): ONLY two database scans
1st scan: generate the frequent item list
2nd scan: build the tree, then recursively mine patterns with FP-Growth
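To make the two-scan pipeline concrete, here is a minimal Python sketch of the first scan plus the transaction reordering that feeds the second scan; the function names and the toy database are illustrative, not part of the original slides.

```python
from collections import Counter

def first_scan(transactions, min_count):
    """1st database scan: build the frequent item list, most frequent first."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = [item for item, c in counts.most_common() if c >= min_count]
    return frequent, counts

def order_transaction(transaction, frequent):
    """Keep only frequent items, in global descending-frequency order,
    ready to be inserted into the FP-tree during the 2nd scan."""
    rank = {item: i for i, item in enumerate(frequent)}
    return sorted((it for it in set(transaction) if it in rank), key=rank.get)

if __name__ == "__main__":
    db = [["a", "b"], ["b", "c", "d"], ["a", "b", "c"], ["b", "d"]]
    freq, _ = first_scan(db, min_count=2)
    print(freq)                                     # e.g. ['b', 'a', 'c', 'd'] (ties arbitrary)
    print([order_transaction(t, freq) for t in db])
```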
3
Limitations of FP-Growth
Requires recursive generation of many conditional trees
Construction of conditional trees is time- and memory-consuming
Can we do better? Existing improvements:
FP-array – reduces one tree traversal for each conditional tree generation
QFP-Growth – avoids generating conditional FP-trees through a top-down approach
4
FP-Growth improvements: FP-array
Motivation: About 80% of CPU time is spent on tree traversal
Maintain a data structure called the FP-array, where each element [a, b] holds the frequency count of the itemset {a, b} over all transactions
By looking at the corresponding row, the frequency counts for a particular conditional tree can be read off directly (a minimal sketch follows)
Outcome: One tree traversal is saved when constructing each conditional tree
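A rough, self-contained illustration of the pair-count idea in Python. Note the simplification: in the actual FP-array technique the matrix is filled while the FP-tree is being built, whereas this sketch fills it directly from the transactions; names are illustrative.

```python
from collections import Counter
from itertools import combinations

def build_fp_array(transactions, min_count):
    """Count pairwise co-occurrences of frequent items (FP-array stand-in)."""
    # First scan: frequent single items
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, c in item_counts.items() if c >= min_count}

    # "FP-array": frequency count of every frequent item pair
    pair_counts = Counter()
    for t in transactions:
        kept = sorted(set(t) & frequent)
        for a, b in combinations(kept, 2):
            pair_counts[(a, b)] += 1
    return pair_counts

if __name__ == "__main__":
    db = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b", "c", "d"]]
    fp_array = build_fp_array(db, min_count=2)
    # The row for item "c" gives the counts needed for c's conditional mining
    # without walking the tree again.
    print({pair: n for pair, n in fp_array.items() if "c" in pair})
```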
5
FP-Growth improvements: FP-array
Results: Faster than the usual FP-Growth, especially at small support thresholds (memory usage similar)
Limitations: Extra memory overhead, and conditional tree construction is still required
Reference: G. Grahne and J. Zhu, "Fast algorithms for frequent itemset mining using FP-trees," IEEE Transactions on Knowledge and Data Engineering, vol. 17, 2005.
6
FP-Growth improvements: QFP-Growth
Motivation: Generating conditional trees is time- and memory-consuming
Avoid constructing conditional trees by using a temporary root and mining from the top (top-down approach)
7
FP-Growth improvements: QFP-Growth
Datasets: IBM Quest Generator
Results: Competitive in time and memory consumption
Reference: Y. Qiu, Y.-J. Lan and Q.-S. Xie, "An improved algorithm of mining from FP-tree," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 3, 2004.
8
FP-Growth improvements: Our work
We propose another way to run FP-Growth WITHOUT constructing conditional trees
Idea: Maintain an extra attribute called mcount in each tree node
Algorithm:
Tree node = {item, count, mcount, childs, pr}; frequency list F; link list L
Set all node.mcount = node.count
Begin FP-Growth2(link_list[i]):
  Go through link_list[i], which holds all nodes of item i
  Traverse the tree upwards from these nodes and count the item frequencies F'
  Maintain a conditional link list clink_list of the visited nodes
  For each item j in F' that is frequent, in global descending order:
    results.insert(FP-Growth2(clink_list[j]))
  For each result in results: result.push_back(i)  (append the suffix item i to each pattern)
  return results
Initial call: FP-Growth2(L[item]) for each frequent item in F, in descending order
(A hedged Python sketch of the same no-conditional-tree idea follows)
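The pseudocode above relies on in-tree link lists and mcount bookkeeping. As a rough, self-contained analogue of mining without ever materializing conditional trees, here is a projection-based Python sketch; it recurses over prefix-path lists instead of the link lists and mcount fields described above, so treat it as an illustration of the idea rather than the exact algorithm on the slide.

```python
from collections import Counter

def mine_no_conditional_trees(paths, min_count, suffix=(), results=None):
    """Mine frequent itemsets from (path, count) pairs without building
    conditional trees.  Each path lists items in global descending-frequency
    order; projections are kept as plain prefix-path lists."""
    if results is None:
        results = {}
    # Item frequencies within the current projection (the F' of the slide)
    freq = Counter()
    for path, count in paths:
        for item in path:
            freq[item] += count
    for item, count in freq.items():
        if count < min_count:
            continue
        pattern = suffix + (item,)
        results[frozenset(pattern)] = count
        # Project onto the prefixes that precede `item` -- no tree is built
        projected = []
        for path, c in paths:
            if item in path:
                prefix = path[:path.index(item)]
                if prefix:
                    projected.append((prefix, c))
        mine_no_conditional_trees(projected, min_count, pattern, results)
    return results

if __name__ == "__main__":
    # Toy database, already reordered by global item frequency (b > a > c > d)
    db = [(("b", "a"), 1), (("b", "c", "d"), 1), (("b", "a", "c"), 1), (("b", "d"), 1)]
    for itemset, support in mine_no_conditional_trees(db, 2).items():
        print(set(itemset), support)
```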
9
FP-Growth improvements: Our work
Outcome: No need to construct conditional trees
Tests: mushroom and chess datasets with different thresholds
Results: Runs faster and needs less memory
10
FP-Growth improvements: Our work
Outcome: No need to construct conditional trees
Tests: mushroom and chess datasets with different thresholds
Results: Runs faster and needs less memory
11
Incremental mining
12
Incremental mining
Given old transactions, new transactions arrive
Old frequent patterns can become infrequent and vice versa
How do we get the updated frequent patterns?
Naïve solution: redo everything from scratch
Practical? Consider an enterprise database where updates are frequent and enormous numbers of transactions exist
Can the FP-Tree be improved to suit this purpose?
13
Assumptions
Finite number of transactions
Global pattern mining (i.e. no decay for old transactions)
Similar data distributions in old and new transactions
Complete and exact solution (not probabilistic/approximate)
Note: The first three assumptions will not hold when mining stream data, which is a separate field called stream data mining (beyond this work's interest)
14
Stochastic vs. batch approaches
Stochastic approaches: On each new transaction, insert it and restructure the tree immediately
Examples: Adjusted FP-Tree for Incremental Mining (AFPIM), Compact Pattern Tree (CP-Tree)
Problems: Frequent tree modifications + database rescans
Batch approaches (solve the problems above): On each SET of new transactions, insert them TOGETHER and maintain the tree afterwards
Example: Fast Updated FP-Tree (FUFP)
New problem: The compactness of the FP-Tree is destroyed by violating the descending item order
15
Stochastic approach - AFPIM
Motivation: Dynamically adjust the tree structure on every arrival of new transactions
Idea: Maintain an extra parameter called the pre-threshold, allowing pre-frequent items to be kept in the tree
On arrival of new transactions, compute the new frequent item list
Reorder the nodes on each branch by bubble-sort-style swaps according to the new frequencies, then insert the new transactions (a simplified sketch follows)
A database rescan is needed when infrequent items become frequent; items whose frequency drops below the pre-threshold are removed
Results: Better than rerunning FP-Growth as the number of transactions increases
Limitation: The pre-threshold must be chosen carefully, otherwise database rescans and tree modifications become too frequent
Reference: J. Koh and S. Shieh, "An efficient approach for maintaining association rules based on adjusting FP-tree structures," in Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA), 2004.
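A list-level Python sketch of the adjacent-swap reordering step; this is a simplification, since real AFPIM swaps actual tree nodes and must also split and merge subtrees along the way. Names and the toy counts are illustrative.

```python
def reorder_branch(branch, new_counts):
    """Bubble-sort-style adjustment of one FP-tree branch.

    `branch` is the list of items on a root-to-leaf path; `new_counts` maps
    each item to its updated global frequency.  The adjacent-swap ordering
    step shown here is O(n^2) in the branch length.
    """
    items = list(branch)
    swaps = 0
    for i in range(len(items) - 1, 0, -1):
        for j in range(i):
            if new_counts[items[j]] < new_counts[items[j + 1]]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swaps += 1
    return items, swaps

if __name__ == "__main__":
    # After new transactions arrive, 'c' overtakes 'a' in frequency
    new_counts = {"b": 9, "c": 7, "a": 5, "d": 2}
    print(reorder_branch(["b", "a", "c", "d"], new_counts))  # (['b', 'c', 'a', 'd'], 1)
```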
16
Stochastic approach – CP-Tree
Motivation: Similar to AFPIM but uses a branch sorting method (BSM)
Idea: Instead of swapping each node into place separately [O(n²)], extract each branch and merge-sort it [O(n log₂ n)], then insert the sorted branch back (see the sketch below)
The difference from the previous approach depends on the degree of displacement (DD)
Results: Even better than AFPIM, especially in high-DD scenarios
Limitation: AFPIM is still better in low-DD scenarios; a hybrid approach is suggested
Reference: S. K. Tanbeer, C. F. Ahmed, B. Jeong and Y. Lee, "CP-tree: A tree structure for single-pass frequent pattern mining," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2008.
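Under the same list-level simplification as the AFPIM sketch above, the branch sorting method boils down to extracting a whole branch and sorting it in O(n log n); Python's built-in sort stands in for the explicit merge sort here.

```python
def sort_branch(branch, new_counts):
    """Branch sorting method at list level: sort the extracted branch by the
    new descending frequencies in O(n log n) instead of adjacent swaps."""
    return sorted(branch, key=lambda item: -new_counts[item])

if __name__ == "__main__":
    new_counts = {"b": 9, "c": 7, "a": 5, "d": 2}
    # A highly displaced branch is cheap to fix this way
    print(sort_branch(["d", "a", "c", "b"], new_counts))  # ['b', 'c', 'a', 'd']
```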
17
Batch approach - FUFP
Idea: Update the constructed FP-tree with the new transactions by pruning items that become infrequent and adding newly frequent items, attaching their nodes below the existing leaf nodes
Input: An old database, its FP-tree, and a new batch of transactions
Output: An updated FP-tree
Reference: T. Hong, C. Lin and Y. Wu, "Incrementally fast updated frequent pattern trees," Expert Systems with Applications, vol. 34, 2008.
18
Batch approach - FUFP Algorithm
Scan the new transactions to get all items and their counts
Check whether each item is large in the new transactions and in the original database by comparing its counts in both against the minimum count
19
Batch approach - FUFP Algorithm
c1: case 1 (the item is large in both the new transactions and the original database)
Update its count in the FP-tree
Put the item in the set Insert_items
c2: case 2 (the item is small in the new transactions but large in the original database)
If the item is still large in the updated database, update its count in the tree
Otherwise, remove the item and its corresponding nodes from the tree
c3: case 3 (the item is large in the new transactions but small in the original database)
Rescan the original database to find the transactions containing the item
Put the item in the sets Rescan_items and Insert_items
(A Python sketch of this case analysis follows)
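A minimal sketch of the case classification, assuming the original item counts are available and that "large" means reaching a minimum support ratio times the database size. The set names (Insert_items, Rescan_items) follow the slides; the function and parameter names are illustrative.

```python
from collections import Counter

def classify_items(old_counts, new_transactions, old_db_size, min_support):
    """FUFP-style case analysis for a batch of new transactions.

    old_counts:  item -> count in the original database
    min_support: minimum support ratio; "large" means count >= min_support * size
    Returns (insert_items, rescan_items, remove_items, updated_counts).
    """
    new_counts = Counter(item for t in new_transactions for item in set(t))
    new_db_size = old_db_size + len(new_transactions)

    insert_items, rescan_items, remove_items = set(), set(), set()
    updated_counts = dict(old_counts)

    for item in set(old_counts) | set(new_counts):
        total = old_counts.get(item, 0) + new_counts.get(item, 0)
        updated_counts[item] = total
        large_old = old_counts.get(item, 0) >= min_support * old_db_size
        large_new = new_counts.get(item, 0) >= min_support * len(new_transactions)

        if large_old and large_new:            # case 1: update count, insert later
            insert_items.add(item)
        elif large_old and not large_new:      # case 2: keep only if still large overall
            if total < min_support * new_db_size:
                remove_items.add(item)         # prune item and its nodes from the tree
        elif large_new and not large_old:      # case 3: needs a rescan of the old DB
            rescan_items.add(item)
            insert_items.add(item)

    return insert_items, rescan_items, remove_items, updated_counts
```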
20
Batch approach - FUFP Algorithm
Update the FP-tree with the transactions from the original database that contain items in Rescan_items, inserting only those items, at the end of each branch
Update the FP-tree with the new transactions, keeping only the items in Insert_items (see the sketch below)
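A simplified sketch of the lazy insertion step on a bare node class (a real FUFP tree also maintains a header table with node links; the class and function names are illustrative): items already known to the tree keep the order used when the tree was built, and newly frequent items are appended at the end of the branch instead of being re-sorted into descending-frequency order.

```python
class Node:
    """Bare FP-tree node for illustration."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def lazy_insert(root, transaction, keep_items, original_order):
    """Insert one transaction, keeping only `keep_items` (Insert_items or
    Rescan_items).  Known items follow the original construction order;
    newly frequent items are appended after them, so they end up near the
    leaves rather than forcing a branch reorder."""
    known = [it for it in original_order if it in transaction and it in keep_items]
    fresh = [it for it in transaction if it in keep_items and it not in original_order]
    node = root
    for item in known + sorted(fresh):
        child = node.children.get(item)
        if child is None:
            child = Node(item)
            node.children[item] = child
        child.count += 1
        node = child

if __name__ == "__main__":
    root = Node()
    original_order = ["b", "a", "c"]   # descending order at construction time
    keep = {"b", "a", "c", "e"}        # 'e' just became frequent (case 3)
    lazy_insert(root, ["e", "a", "b"], keep, original_order)
    lazy_insert(root, ["b", "c", "e"], keep, original_order)
```

The price of this laziness is the loss of compactness discussed in the limitations slide below.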
21
Batch approach - FUFP Results from the paper
An initial batch of transactions was used to construct the FP-tree
5,000 transactions were then used each time as the batch of new transactions
22
FUFP implementation: Results of our work
Tests: mushroom with different thresholds
Two kinds of partition:
95% old transactions, 5% new transactions
90% old transactions, 10% new transactions
Results: Faster in time, and less memory is needed
23
FUFP implementation: Results of our work
Tests: mushroom with different thresholds
Two kinds of partition:
95% old transactions, 5% new transactions
90% old transactions, 10% new transactions
Results: Faster in time, and less memory is needed
24
Limitations of FUFP
Lazy insertion destroys the tree structure
The structure is also destroyed when the frequencies of new items change
Loss of compactness of the FP-tree
Mining requires much more time when the set of new transactions is large
The gain over full reconstruction is too small when the old/new transaction split is not extreme
25
Summary of our results
Replace FP-Growth with our new idea (new FP-Growth):
15% ~ 20% improvement in time for small enough thresholds
54% ~ 65% improvement in memory for small enough thresholds
Replace naïve reconstruction with FUFP:
20% ~ 90% improvement in time for small enough thresholds
10% ~ 30% improvement in memory for small enough thresholds
Tested on the mushroom and chess datasets
26
Challenges/Future directions
Are there any other, faster techniques for FP-Growth that do not require conditional tree generation?
Should we use a lazy swapping threshold, allowing items to swap back into the correct descending order only when the frequency difference with the immediately preceding item becomes too large? How can we determine that threshold? (a good balance between compactness and time complexity)
Can we improve FUFP to detect, before it starts, whether a complete reconstruction would be the better choice?
27
Demo (if time allows)
28
Questions?