An Efficient Polynomial Delay Algorithm for Pseudo Frequent Itemset Mining 2/Oct/2007 Discovery Science 2007 Takeaki Uno (National Institute of Informatics)


An Efficient Polynomial Delay Algorithm for Pseudo Frequent Itemset Mining
2/Oct/2007, Discovery Science 2007
Takeaki Uno (National Institute of Informatics), Hiroki Arimura (Hokkaido University)

Frequent Pattern Mining

The problem of finding all frequently appearing patterns in a (large-scale) database.
- database: transaction, tree, string, graph, vector
- pattern: subset, tree, path, sequence, graph, geograph…

[Figure: example databases (genome sequences, experimental results) and the frequent patterns mined from them.]

This Research

We address transaction databases. A transaction database D over an itemset E is a database in which each record (transaction) T is a subset of E, i.e., ∀T ∈ D, T ⊆ E. A frequent itemset is a subset of E included in at least σ transactions, where σ is the minimum support threshold.

Problems:
- there are so many patterns that finding the valuable ones is hard
- inclusion is strict; to deal with errors, "patterns ambiguously included in many transactions" are important

We introduce an ambiguous inclusion relation and propose an efficient mining algorithm.

Related Works

Frequent itemset mining with ambiguity has been studied under the names fault-tolerant pattern, degenerate pattern, and soft occurrence.
- One approach to ambiguous inclusion: a pattern is included in a transaction if the ratio of included items is more than a threshold.
- Another approach: find combinations of an itemset and a transaction set such that few (item, transaction) pairs violate the inclusion relation.
- Similarity measures are used in string matching and homology search.

There is little "enumeration type" research with a completeness guarantee. We look at practical models and algorithms from the viewpoint of algorithm theory.

Notations for Frequent Itemset Mining

For an itemset K:
- an occurrence of K is a transaction of D that includes K
- Occ(K), the occurrence set of K, is the set of occurrences of K
- frq(K), the frequency of K, is |Occ(K)|

Example: for D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }:
Occ({1,2}) = { {1,2,5,6,7,9}, {1,2,7,8,9} }
Occ({2,7,9}) = { {1,2,5,6,7,9}, {1,2,7,8,9}, {2,7,9} }

Frequent Itemset

A frequent itemset is an itemset with frequency no less than σ (σ is called the minimum support threshold).

Example: for the database D above, the itemsets included in no less than 3 transactions are:
{1} {2} {7} {9} {1,7} {1,9} {2,7} {2,9} {7,9} {1,7,9} {2,7,9}

Frequent itemset mining is the problem of enumerating all frequent itemsets for a given database D and minimum support σ.
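As a sanity check on the definitions, a minimal brute-force enumerator (the function names are my own, not from the slides) reproduces the list above:

```python
from itertools import combinations

# The example database from the slides.
D = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]

def occ(P, D):
    """Occ(P): the transactions of D that include itemset P."""
    return [T for T in D if set(P) <= T]

def frequent_itemsets(D, sigma):
    """All frequent itemsets, by brute force over the subsets of E.
    Illustrative only; the enumeration algorithms in the talk avoid
    this exponential scan."""
    E = sorted(set().union(*D))
    return [set(P) for r in range(1, len(E) + 1)
                   for P in combinations(E, r)
                   if len(occ(P, D)) >= sigma]
```

Running `frequent_itemsets(D, 3)` yields exactly the eleven itemsets listed above.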

Inclusion with Ambiguity

An ambiguous inclusion relation between an itemset P and a transaction T. A popular definition: P is included in T iff |P∩T| / |P| ≥ θ, for a threshold θ < 1.

Examples for θ = 0.6:
- {1,2,3} is included in {1,2,4,5} (ratio 2/3)
- {1,2,3,4,5,6,7} is included in {1,3,5,6,7} (ratio 5/7)
- {1,2,3} is not included in {1,4,5} (ratio 1/3)

Drawbacks:
- Monotonicity of frequent itemsets is lost: there can be a frequent itemset all of whose proper subsets are infrequent. E.g., for D = { {1,2}, {2,3}, {1,3} } and θ = 0.6, {1,2,3} is included in all three transactions (ratio 2/3 each), but none of its proper subsets is included in all three.
- Much cost for computation.

k-Pseudo Inclusion

Instead, use a threshold on the number of non-included items: P is k-pseudo included in T iff |P \ T| ≤ k, for a threshold k ≥ 0 (k-pseudo occurrence, k-pseudo occurrence set, and k-pseudo frequency are defined analogously).
- monotonicity is kept
- able to find characterizations such as "many transactions include at least 3 items of P"

Examples for k = 1:
- {1,2,3} is 1-pseudo included in {1,2,4,5} (one item missing)
- {1,2,3,4,5,6,7} is not 1-pseudo included in {1,3,5,6,7} (two items missing)
- {1,2,3} is not 1-pseudo included in {1,4,5} (two items missing)
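A one-line sketch of the k-pseudo inclusion test (hypothetical function name), checked against the three examples above:

```python
def pseudo_included(P, T, k):
    """k-pseudo inclusion: P is k-pseudo included in T iff |P \\ T| <= k."""
    return len(set(P) - set(T)) <= k

# The three examples from the slide, with k = 1:
assert pseudo_included({1,2,3}, {1,2,4,5}, 1)                # item 3 missing: holds
assert not pseudo_included({1,2,3,4,5,6,7}, {1,3,5,6,7}, 1)  # items 2, 4 missing: fails
assert not pseudo_included({1,2,3}, {1,4,5}, 1)              # items 2, 3 missing: fails
```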

k-Pseudo Frequent Itemset

A k-pseudo frequent itemset is an itemset k-pseudo included in at least σ transactions of D.

Example: for the database D above, the 1-pseudo frequent itemsets of size at least 3 for σ = 3 are:
{1,2,3} {1,2,4} {1,2,5} {1,2,7} {1,2,9} {1,3,7} {1,3,9} {1,4,7} {1,4,9} {1,5,7} {1,5,9} {1,6,7} {1,6,9} {1,7,8} {1,7,9} {1,8,9} {2,3,7} {2,3,9} {2,4,7} {2,4,9} {2,5,7} {2,5,8} {2,5,9} {2,6,7} {2,6,9} {2,7,8} {2,7,9} {2,8,9} {3,7,9} {4,7,9} {5,7,9} {6,7,9} {7,8,9} {1,2,7,9} {1,3,7,9} {1,4,7,9} {1,5,7,9} {1,6,7,9} {1,7,8,9} {2,3,7,9} {2,4,7,9} {2,5,7,9} {2,6,7,9} {2,7,8,9}

There are many trivial patterns. How can we enumerate efficiently?

Enumeration Using Monotonicity

k-Pseudo frequent itemsets have the monotone property, so a simple backtracking algorithm works:
- for each k-pseudo frequent itemset P, compute the k-pseudo frequency of each P+e
- if the k-pseudo frequency of P+e is no less than σ, make a recursive call to enumerate the k-pseudo frequent itemsets including P+e

[Figure: the search lattice over the subsets of {1,2,3,4}, from the empty set up to {1,2,3,4}.]

This gives polynomial-time enumeration. How can we compute the frequencies efficiently?
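The backtracking scheme above can be sketched as follows. This naive version recomputes each k-pseudo frequency from scratch (the next slides replace that with incremental occurrence updates); the function names are my own:

```python
def pseudo_freq(P, D, k):
    """k-pseudo frequency: number of transactions T with |P \\ T| <= k."""
    return sum(1 for T in D if len(P - T) <= k)

def enumerate_pseudo_frequent(D, sigma, k):
    """Backtracking over all non-empty k-pseudo frequent itemsets.
    Monotonicity justifies the pruning: if P+e is infrequent, so is
    every superset of P+e, and that branch can be skipped entirely."""
    E = sorted(set().union(*D))
    out = []

    def backtrack(P, start):
        for i in range(start, len(E)):   # tail extension avoids duplicates
            Q = P | {E[i]}
            if pseudo_freq(Q, D, k) >= sigma:
                out.append(Q)
                backtrack(Q, i + 1)

    backtrack(set(), 0)
    return out

# The example database from the slides.
D = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]
```

With k = 0 this reduces to ordinary frequent itemset mining.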

Computing k-Pseudo Occurrences

Define Occ=h(P) = { T ∈ D : |P \ T| = h }, the set of transactions missing exactly h items of P. Then Occ≤k(P) = ∪h≤k Occ=h(P).

Update rule: Occ=h(P ∪ {e}) = ( Occ=h(P) ∩ Occ({e}) ) ∪ ( Occ=h−1(P) \ Occ({e}) )

So the pseudo occurrence sets can be updated by taking intersections: compute Occ=h(P) ∩ Occ({e}) for every pair of e and h.

[Figure: the sets Occ=0(P), Occ=1(P), Occ=2(P) of a pattern P on an example database.]
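A sketch of this update rule, representing each Occ=h(P) as a set of transaction indices (the helper names are my own):

```python
# The example database from the slides.
D = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]

def occ_levels(P, D, k):
    """Occ_{=h}(P) for h = 0..k, computed directly: the indices of the
    transactions missing exactly h items of P."""
    levels = [set() for _ in range(k + 1)]
    for i, T in enumerate(D):
        h = len(P - T)
        if h <= k:
            levels[h].add(i)
    return levels

def extend(levels, e, D, k):
    """Incremental update for P -> P + e using
    Occ_{=h}(P+e) = (Occ_{=h}(P) & Occ(e)) | (Occ_{=h-1}(P) - Occ(e))."""
    occ_e = {i for i, T in enumerate(D) if e in T}
    new = []
    for h in range(k + 1):
        stay = levels[h] & occ_e                          # e present: still missing h items
        drop = (levels[h-1] - occ_e) if h > 0 else set()  # e absent: one more item missing
        new.append(stay | drop)
    return new
```

For any P and e, `extend(occ_levels(P, D, k), e, D, k)` agrees with `occ_levels(P | {e}, D, k)`, but only touches the transactions already in the levels of P.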

Taking Intersections Efficiently

The update Occ=h(P ∪ {e}) = ( Occ=h(P) ∩ Occ({e}) ) ∪ ( Occ=h−1(P) \ Occ({e}) ) has the same properties as the usual occurrence computation, so many existing techniques for updating occurrence sets can be used (down-project, occurrence delivery, bitmaps…). Database reduction (as in FP-trees) is also available. In deeper levels of the recursion, few transactions remain to be scanned, so the computation is fast.

Example database (transaction → items):
A: 1,2,5,6,7,9   B: 2,3,4,5   C: 1,2,7,8,9   D: 1,7,9   E: 2,7,9   F: 2
and its transposed representation (item → transactions):
1: A,C,D   2: A,B,C,E,F   3: B   4: B   5: A,B   6: A   7: A,C,D,E   8: C   9: A,C,D,E

Using Bottom-Wideness

Backtracking (depth-first search) generates several recursive calls in each iteration, so the computation tree spreads exponentially going down, and the total computation time is dominated by the iterations at the bottom levels of the recursion tree. Since the occurrence sets to be computed are small in the lower levels, bottom-level iterations take a short time, and the amortized computation time per iteration is reduced to that of the bottom levels.

[Figure: recursion tree; iterations near the root take a long time, those near the leaves a short time.]

For Large Minimum Support

When σ is large, many transactions are accessed even in the bottom levels, so the improvement from bottom-wideness is not drastic. We reduce the database to speed up the bottom levels:
(1) delete the items less than the maximum item in P
(2) delete the items that are infrequent in the occurrence-set database (they can never be added in a recursive call)
(3) unify identical transactions

In practice the reduced database size is almost constant in the bottom levels, so there is no big difference from the small-σ case.

[Example: reduction for P = {1,3}, k = 1, σ = 4.]

Small & Trivial Patterns

Under k-pseudo inclusion, any itemset of size at most k is included in every transaction, and itemsets of size slightly greater than k are also included in many transactions. This yields many small, trivial frequent itemsets, which we want to ignore in practice. So we consider the problem of directly finding the pseudo frequent itemsets of size l.

Directly Finding Large Itemsets

Searching all itemsets of size l would take exponential time, so pruning unnecessary parts of the search is crucial: we take candidates according to partial structure.

Let P be a k-pseudo frequent itemset of size l. WLOG P = {1,…,l}, with the items sorted in increasing order of |Occ=k(P) \ Occ({e})|. Consider the (k−1)-pseudo frequency of the prefix itemset {1,…,y}: any transaction in Occ=k(P) \ Occ({e}) with e > y (k−1)-pseudo includes {1,…,y}, since such a transaction misses exactly k items of P, one of which is e ∉ {1,…,y}.

Search Route to an Itemset of Size l

Any transaction in Occ=k(P) \ Occ({e}) with e > y (k−1)-pseudo includes {1,…,y}, hence

  |Occ≤k−1({1,…,y})| ≥ | ∪e=y+1,…,|P| ( Occ=k(P) \ Occ({e}) ) |

Each transaction of Occ=k(P) misses exactly k items of P, so the average of |Occ=k(P) \ Occ({e})| over e ∈ P is no less than (k / |P|) |Occ=k(P)|, and each such transaction appears in at most k of the sets Occ=k(P) \ Occ({e}). Since 1,…,|P| are sorted in increasing order of |Occ=k(P) \ Occ({e})|, we obtain the partial frequency condition:

  |Occ≤k−1({1,…,y})| ≥ |Occ=k(P)| × (|P| − y) / |P|

Therefore there is a sequence of itemsets from the empty set to P composed only of itemsets satisfying the partial frequency condition.

Example of the Partial Frequency Condition

For the database D above with k = 1, σ = 3, l = 3, the itemsets satisfying the partial frequency condition are:
{1} {2} {5} {7} {9} {1,2} {1,5} {1,6} {1,7} {1,8} {1,9} {2,3} {2,4} {2,5} {2,6} {2,7} {2,8} {2,9} {3,5} {4,5} {5,6} {5,7} {5,9} {6,7} {6,9} {7,8} {7,9} {8,9}

The number of itemsets to be searched decreases, so a more efficient search is expected.

Restricted Search Route by the P.F.C.

Any k-pseudo frequent itemset of size l can be found by passing only through itemsets satisfying the partial frequency condition, so we perform a backtrack search over them. There always exists an item whose removal keeps the condition satisfied, but tail extension is not available (removal of the tail may violate the condition), and simple hill climbing generates duplicates. So we use a generation rule to avoid duplication (reverse search).

Reverse Search for the P.F.C.

Rule: generate itemset P from P \ {e}, where e is the item maximizing |Occ≤k−1(P \ {e})| (ties are broken by choosing the minimum index).

ReverseSearch(P)
1. if |P| = l then output P; return
2. for each e ∉ P do
3.    if P+e is a k-pseudo frequent itemset satisfying the P.F.C.
         and the rule generates P+e from P
      then ReverseSearch(P+e)
4. end for

|Occ≤k−1(P \ {e})| can be computed efficiently by existing methods; one iteration takes O(|P| × ||D||) time.
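A sketch of reverse search with this parent rule. For brevity it checks plain k-pseudo frequency instead of the full partial frequency condition (both prune the search; the P.F.C. prunes more), so the names and this simplification are my own:

```python
# The example database from the slides.
D = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]

def occ_le(P, D, k):
    """Indices of the transactions that k-pseudo include P (|P \\ T| <= k)."""
    return [i for i, T in enumerate(D) if len(P - T) <= k]

def parent(Q, D, k):
    """Parent rule: remove the item e maximizing |Occ_{<=k-1}(Q \\ {e})|.
    Iterating over sorted(Q) breaks ties toward the smallest item,
    because max() keeps the first maximal element it sees."""
    e = max(sorted(Q), key=lambda x: len(occ_le(Q - {x}, D, k - 1)))
    return Q - {e}

def reverse_search(P, D, k, sigma, l, E, out):
    """Enumerate the k-pseudo frequent itemsets of size l without duplicates:
    recurse into Q = P + e only when P is the parent of Q under the rule."""
    if len(P) == l:
        out.append(P)
        return
    for e in E:
        if e in P:
            continue
        Q = P | {e}
        if len(occ_le(Q, D, k)) >= sigma and parent(Q, D, k) == P:
            reverse_search(Q, D, k, sigma, l, E, out)

E = sorted(set().union(*D))
out = []
reverse_search(set(), D, 1, 3, 3, E, out)
```

Since every itemset has exactly one parent, each size-l itemset is generated exactly once; on this example, `out` matches the 33 size-3 itemsets listed earlier for k = 1, σ = 3.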

Conclusion

- Introduced an ambiguous inclusion relation in which at most k items of the pattern may be missing from a transaction
- Studied pseudo frequent itemset mining under this inclusion (monotonicity, intersection-based occurrence updates, many small trivial patterns)
- Gave a reverse search algorithm for directly finding the pseudo frequent itemsets of a fixed size

Future work:
- implementation and experiments
- extension of the technique to other pattern mining
- an approach to inclusion with "ratio r%"