Mining Frequent Item Sets by Opportunistic Projection


Mining Frequent Item Sets by Opportunistic Projection
Junqiang Liu (1,4), Yunhe Pan (1), Ke Wang (2), Jiawei Han (3)
1 Institute of Artificial Intelligence, Zhejiang University, China
2 School of Computing Science, Simon Fraser University, Canada
3 Department of Computer Science, UIUC, USA
4 Dept. of CS, Hangzhou University of Commerce, China

Outline
- How to discover frequent item sets
- Previous works
- Our approach: Mining Frequent Item Sets by Opportunistic Projection
- Performance evaluations
- Conclusions

What Are Frequent Item Sets
What is a frequent item set? A set of items, X, that occurs together frequently in a database, i.e., support(X) ≥ a given threshold.
Example: given support threshold 3, the frequent item sets are:
a:3, b:3, c:4, f:4, m:3, p:3, ac:3, af:3, am:3, cf:3, cm:3, cp:3, fm:3, acf:3, acm:3, afm:3, cfm:3, acfm:3

tid  items
01   a c d f g i m p
02   a b c f l m o
03   b f h j o
04   b c k p s
05   a c e f l m n p
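The example can be verified with a tiny brute-force miner (illustrative only; the database and the threshold of 3 come from the slide, the function name is mine):

```python
from itertools import combinations

# The example database from the slide (tid 01-05).
transactions = [
    set("acdfgimp"),  # 01
    set("abcflmo"),   # 02
    set("bfhjo"),     # 03
    set("bckps"),     # 04
    set("acefmnlp"),  # 05: a c e f l m n p
]

def frequent_itemsets(db, min_sup):
    """Enumerate every item set with support >= min_sup (exponential; fine for tiny examples)."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, k):
            support = sum(1 for t in db if set(combo) <= t)
            if support >= min_sup:
                result["".join(combo)] = support
                found = True
        if not found:  # no frequent k-set means no frequent (k+1)-set (Apriori property)
            break
    return result

freq = frequent_itemsets(transactions, 3)
# freq["acfm"] == 3, and there are 18 frequent item sets in total, matching the slide
```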

How To Discover Frequent Item Sets
Frequent item sets can be represented by a tree, which is not necessarily materialized.
The mining process is a process of tree construction, accompanied by a process of projecting transaction subsets.

Frequent Item Set Tree - FIST
A FIST is an ordered tree in which each node is an (item, weight) pair, with the following ordering imposed:
- items are ordered on a path (top-down)
- items are ordered at the children of a node (left to right)
A frequent item set is a path starting from the FIST root; its support is the ending node's weight.
PTS - projected transaction subset: each FIST node has its own PTS, filtered or unfiltered, holding all transactions that support the frequent item set represented by the node.
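The node structure just described can be sketched as follows (an illustrative class, not the paper's implementation; names are mine):

```python
class FistNode:
    """A FIST node: an (item, weight) pair plus ordered children."""
    def __init__(self, item=None, weight=0, parent=None):
        self.item = item        # None for the null root
        self.weight = weight    # support of the item set ending at this node
        self.parent = parent
        self.children = []      # kept in the imposed item order (left to right)

    def itemset(self):
        """Read the frequent item set off the path from the root to this node."""
        items, node = [], self
        while node.parent is not None:   # stop at the null root
            items.append(node.item)
            node = node.parent
        return list(reversed(items))

# The path root -> a -> c -> f with ending weight 3 represents {a, c, f} with support 3.
root = FistNode()
a = FistNode("a", 3, root); root.children.append(a)
c = FistNode("c", 3, a);    a.children.append(c)
f = FistNode("f", 3, c);    c.children.append(f)
# f.itemset() == ["a", "c", "f"] and f.weight == 3
```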

Frequent Item Set Tree (example)

Factors Related to Mining Efficiency and Scalability
- The FIST construction strategy: breadth first vs. depth first
- The PTS representation: memory-based (array-based, tree-based, vertical bitmap, horizontal bitstring, etc.) or disk-based
- The PTS projecting method and item counting method

Previous Works
- Apriori and TreeProjection (breadth first; PTS: the original DB, projected on the fly): repetitive DB scans; huge FIST for dense data; expensive pattern matching.
- FPGrowth (depth first; PTS: FP-tree, recursively materializes conditional DBs/FP-trees): the number of conditional FP-trees is of the same order of magnitude as the number of frequent item sets.
- H-Mine (depth first; PTS: H-struct, partially materializes sub H-structs): not the most efficient for sparse data; calls FP-Growth for dense data; partitions for large data.
- DepthProject (depth first; PTS: horizontal bitstring, selective projection; mines maximal frequent item sets): less efficient than array-based representations for sparse & large data, and than tree-based representations for dense data.
- MAFIA (depth first; PTS: vertical bitmap, recursively materialized with compression; mines maximal frequent item sets).

Our Approach: Mining Frequent Item Sets by Opportunistic Projection
Philosophy: the algorithm must adapt the construction strategy of the FIST, the representation of PTSs, and the methods of item counting in and projection of PTSs to the features of the PTSs.
Main points:
- mining sparse data by projecting array-based PTSs
- intelligently projecting tree-based PTSs for dense data
- heuristics for opportunistic projection

Mining sparse data by projecting array-based PTS
TVLA - threaded varied-length array for sparse PTSs, consisting of:
- FIL - local frequent item list: each local frequent item has a FIL entry that consists of an item, a count, and a pointer
- LQ - linked queues
- arrays: each transaction is stored in an array that is threaded to the FIL by an LQ, according to its heading item in the imposed order

How to project TVLA for PTS
- Arrays (transactions) that support a node's first child are threaded by the LQ attached to the first entry of the FIL (see previous figure).
- The TVLA for a child node's PTS has its own FIL and LQ.
- A child TVLA is unfiltered if it shares arrays with its parent, and filtered otherwise.

How to project TVLA for PTS (cont.)
Get the next child's PTS by shifting the transactions threaded in the LQ currently being explored (the current child's PTS).
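The threading and shifting described in the two slides above can be sketched with plain Python lists standing in for the arrays and queues (a deliberate simplification: the real TVLA keeps counts and pointers in the FIL and can share arrays between parent and child; all names here are mine):

```python
from collections import defaultdict

def thread(arrays, order):
    """Thread each transaction array into the linked queue (LQ) of its heading item."""
    rank = {item: i for i, item in enumerate(order)}  # the imposed item order (the FIL)
    queues = defaultdict(list)
    for arr in arrays:
        arr = sorted((x for x in arr if x in rank), key=rank.get)
        if arr:
            queues[arr[0]].append(arr)
    return queues

def project(queues, order):
    """Visit children left to right; after exploring one child's PTS, shift its
    transactions to the queue of their next item (the next child's PTS)."""
    for item in order:
        pts = queues.get(item, [])
        yield item, [arr[1:] for arr in pts]   # the child's PTS, heading item removed
        for arr in pts:                        # shift: re-thread by the next heading item
            if len(arr) > 1:
                queues.setdefault(arr[1], []).append(arr[1:])

order = ["c", "f", "a", "m", "p", "b"]   # an illustrative imposed order
queues = thread([list("acdfgimp"), list("abcflmo"), list("bfhjo")], order)
for item, child_pts in project(dict(queues), order):
    pass  # len(child_pts) is the support of the item's node in this subtree
```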

Intelligently projecting tree-based PTS for dense data
- Tree-based representation of dense PTSs, inspired by FP-Growth
- Novel projecting methods that differ entirely from FP-Growth: bottom-up pseudo projection and top-down pseudo projection

Tree-based Representation of dense PTS
TTF - threaded transaction forest, consisting of:
- IL - item list: each entry consists of an item, a count, and a pointer; each local item in the PTS has an entry in the IL
- forest: each node is labeled by an item and associated with a weight; each transaction in the PTS is a path starting from a root in the forest, and the count is the number of transactions represented by the path
- all nodes carrying the same item are threaded by an IL entry
A TTF is filtered if only local frequent items appear in it, and unfiltered otherwise.
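A compact construction sketch of the TTF just described, with dicts standing in for child links (illustrative names; filtering is omitted, and node.count plays the role of the weight):

```python
class TtfNode:
    def __init__(self, item):
        self.item = item
        self.count = 0         # number of transactions sharing the path up to this node
        self.children = {}     # item -> TtfNode
        self.next = None       # thread to the next node carrying the same item

def build_ttf(transactions, order):
    """Insert each transaction as a path; thread same-item nodes from the item list (IL)."""
    rank = {item: i for i, item in enumerate(order)}
    root = TtfNode(None)                       # virtual root above the forest's roots
    il = {item: [0, None] for item in order}   # IL entry: [count, head of thread]
    for t in transactions:
        node = root
        for item in sorted((x for x in t if x in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = TtfNode(item)
                node.children[item] = child
                child.next = il[item][1]       # thread the new node into the IL entry
                il[item][1] = child
            child.count += 1
            il[item][0] += 1
            node = child
    return root, il

root, il = build_ttf([set("acfm"), set("acfp"), set("cfb")], ["c", "f", "a", "m", "p", "b"])
# all three transactions share the prefix c -> f, so that path carries count 3
```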

Bottom up pseudo projection of TTF (example)

Top down pseudo projection of TTF (example)

Opportunistic Projection: Observations and Heuristics
Observations:
- The upper portion of a FIST can fit in memory.
- The number of transactions that support length-k item sets decreases sharply when k is greater than 2.
Heuristic 1: Grow the upper portion of the FIST breadth first. Grow the lower portion, under level k, depth first whenever the reduced transaction set can be represented by a memory-based structure, either TVLA or TTF.

Opportunistic Projection: Observations and Heuristics (2)
Observations:
- TTF compresses well at lower levels and on denser branches, where there are fewer local frequent items in the PTSs and the relative support is larger.
- TTF is space expensive relative to TVLA if its compression ratio is less than 6 - t/n (t: number of transactions, n: number of items in the PTS).
Heuristic 2: Represent PTSs by TVLA at high levels of the FIST, unless the estimated compression ratio of TTF is sufficiently high.
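The 6 - t/n bound from the slide can be wrapped as a rule-of-thumb check (the bound is the slide's; the function name and the example numbers are mine):

```python
def prefer_ttf(compression_ratio, t, n):
    """Choose TTF over TVLA only when prefix sharing pays for the per-node
    overhead: TTF is space-expensive when its compression ratio (transaction
    items per forest node) falls below 6 - t/n.

    t: number of transactions in the PTS; n: number of items in the PTS."""
    return compression_ratio >= 6 - t / n

# Dense PTS: 10000 transactions over 50 items -> the bound 6 - 200 is negative,
# so TTF essentially always pays off there.
# Sparse PTS: 1000 transactions over 5000 items -> the bound is 5.8, so a
# compression ratio of 1.2 is nowhere near enough and TVLA is preferred.
```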

Opportunistic Projection: Observations and Heuristics (3)
Observations:
- PTSs shrink very quickly at high levels and on sparse branches of the FIST, where filtered PTSs are usually in the form of TVLA.
- PTSs at lower levels and on dense branches, where PTSs are represented by TTF, shrink slowly, and creating a filtered TTF involves expensive pattern matching.
Heuristic 3: When projecting a parent TVLA, make a filtered copy for the child TVLA as long as there is free memory. When projecting a parent TTF, delimit the pseudo child TTF first, and then make a filtered copy only if it shrinks substantially.

Algorithm OpportuneProject

OpportuneProject(Database: D)
begin
  create a null root for frequent item set tree T;
  D' = BreadthFirst(T, D);
  GuidedDepthFirst(root_of_T, D');
end
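A runnable sketch of this driver, heavily simplified and with my own helper logic: BreadthFirst is reduced to a single counting pass that keeps globally frequent items, GuidedDepthFirst to a plain depth-first pattern growth, and the opportunistic switching between TVLA and TTF representations is omitted entirely:

```python
from collections import Counter

def opportune_project(db, min_sup):
    """Sketch of OpportuneProject: one breadth-first pass, then depth-first growth."""
    tree = {}  # item set tuple -> support (stands in for the FIST)
    # "BreadthFirst": count items, keep the frequent ones, reduce the transactions.
    counts = Counter(item for t in db for item in t)
    order = sorted((i for i in counts if counts[i] >= min_sup),
                   key=counts.get, reverse=True)
    freq = set(order)
    reduced = [sorted((i for i in t if i in freq), key=order.index) for t in db]

    # "GuidedDepthFirst": grow each frequent item set by projecting transaction subsets.
    def grow(prefix, pts):
        local = Counter(item for t in pts for item in t)
        for item, sup in local.items():
            if sup >= min_sup:
                tree[prefix + (item,)] = sup
                grow(prefix + (item,),
                     [t[t.index(item) + 1:] for t in pts if item in t])

    grow((), reduced)
    return tree

db = [set("acdfgimp"), set("abcflmo"), set("bfhjo"), set("bckps"), set("acefmnlp")]
result = opportune_project(db, 3)
# recovers the 18 frequent item sets of the earlier example, e.g. support 3 for (c, f, a, m)
```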

Performance Evaluation: Efficiency on BMS-POS (sparse)

Performance Evaluation: Efficiency on BMS-WebView1 (sparse)

Performance Evaluation: Efficiency on BMS-WebView2 (sparse)

Performance Evaluation: Efficiency on Connect4 (dense)

Performance Evaluation: Efficiency on T25I20D100kN20kL5k

Performance Evaluation: Scalability on T25I20D1mN20kL5k

Performance Evaluation: Scalability on T25I20D10mN20kL5k

Performance Evaluation: Scalability on T25I20D100k~15mN20kL5k

Conclusions
OpportuneProject maximizes efficiency and scalability across all data features by combining:
- depth-first with breadth-first search strategies
- array-based and tree-based representations for projected transaction subsets
- unfiltered and filtered projections

Acknowledgement
We would like to thank Blue Martini Software, Inc. for providing us with the BMS datasets!

References
[1] R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 2000.
[2] R. Agarwal, C. Aggarwal, and V. V. V. Prasad. Depth first generation of long patterns. In Proceedings of the SIGKDD Conference, 2000.
[3] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD'93, Washington, D.C., May 1993.
[4] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB'94, pp. 487-499, Santiago, Chile, Sept. 1994.
[5] R. J. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98, pp. 85-93, Seattle, Washington, June 1998.
[6] D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In Proceedings of the 17th International Conference on Data Engineering, Heidelberg, Germany, April 2001.
[7] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. In SIGMOD'97, pp. 255-264, Tucson, AZ, May 1997.

References (2)
[8] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In VLDB'95, Zurich, Switzerland, Sept. 1995.
[9] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD'00, Dallas, TX, May 2000.
[10] D.-I. Lin and Z. M. Kedem. Pincer-search: A new algorithm for discovering the maximum frequent set. In 6th Intl. Conf. on Extending Database Technology, March 1998.
[11] J. S. Park, M. S. Chen, and P. S. Yu. An effective hash based algorithm for mining association rules. In Proc. 1995 ACM-SIGMOD, pp. 175-186, San Jose, CA, May 1995.
[12] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-Mine: Hyper-structure mining of frequent patterns in large databases. In Proc. 2001 Int. Conf. on Data Mining (ICDM'01), San Jose, CA, Nov. 2001.
[13] A. Sarasere, E. Omiecinsky, and S. Navathe. An efficient algorithm for mining association rules in large databases. In 21st Int'l Conf. on Very Large Databases (VLDB), Zurich, Switzerland, Sept. 1995.

References (3)
[14] H. Toivonen. Sampling large databases for association rules. In Proc. 1996 Int. Conf. on Very Large Data Bases (VLDB'96), pp. 134-145, Bombay, India, Sept. 1996.
[15] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In Proc. 2001 Int. Conf. on Knowledge Discovery in Databases (KDD'01), San Francisco, California, Aug. 2001.
[16] http://fuzzy.cs.uni-magdeburg.de/~borgelt/src/apriori.exe
[17] http://www.almaden.ibm.com/cs/quest/syndata.html
[18] http://www.ics.uci.edu/~mlearn/MLRepository.html

Thank you !!!