An Efficient Polynomial Delay Algorithm for Pseudo Frequent Itemset Mining 2/Oct/2007 Discovery Science 2007 Takeaki Uno (National Institute of Informatics)

An Efficient Polynomial Delay Algorithm for Pseudo Frequent Itemset Mining
2/Oct/2007, Discovery Science 2007
Takeaki Uno (National Institute of Informatics), Hiroki Arimura (Hokkaido University)

Frequent Pattern Mining
The problem of finding all frequently appearing patterns in a (large-scale) database
database: transaction, tree, string, graph, vector
pattern: subset, tree, path, sequence, graph, geograph…
[Figure: two example databases, genome information (DNA sequences) and experiment results (ex1 to ex4), each shown with the frequent patterns extracted from it]

This Research Addresses Transaction Databases
transaction database D: each record (transaction) T of the database is a subset of the itemset E, i.e., ∀ T ∈ D, T ⊆ E
frequent itemset: a subset of E included in at least σ transactions (σ: minimum support threshold)
Problems
- too many patterns to pick out the valuable ones
- inclusion is strict; to deal with errors, "patterns ambiguously included in many transactions" are important
We introduce an ambiguous inclusion, and propose an efficient mining algorithm

Related Works
Frequent itemset mining with such ambiguity goes by the names fault-tolerant pattern, degenerate pattern, and soft occurrence
- ambiguity for inclusion: "a pattern is included if the ratio of included items is more than the threshold"
- another approach: find combinations of an itemset and a transaction set such that only a few item-transaction pairs violate the inclusion relation
- similarity is used for string matching and homology search
Few "enumeration type" studies guarantee completeness
We look at practical models and algorithms from the viewpoint of algorithm theory

Notations for F.I.M.
For an itemset K:
occurrence of K: a transaction of D including K
occurrence set of K, Occ(K): the set of occurrences of K
frequency of K, frq(K): the size of Occ(K)
D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
Occ({1,2}) = { {1,2,5,6,7,9}, {1,2,7,8,9} }
Occ({2,7,9}) = { {1,2,5,6,7,9}, {1,2,7,8,9}, {2,7,9} }
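The notation can be made concrete with a minimal Python sketch. The function names `occ` and `frq` simply mirror the slide's Occ(K) and frq(K); representing D as a list of Python sets is an assumption made for illustration.

```python
# Example database D from the slide, one set per transaction.
D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def occ(K, db):
    """Occ(K): the transactions of db that include every item of K."""
    K = set(K)
    return [T for T in db if K <= T]

def frq(K, db):
    """frq(K): the size of Occ(K)."""
    return len(occ(K, db))

print(occ({1, 2}, D))
print(frq({2, 7, 9}, D))
```

This linear scan is only meant to pin down the definitions; it is not the occurrence-set machinery discussed later in the talk.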

Frequent Itemset
Frequent itemset: an itemset with frequency no less than σ (σ is called the minimum support (threshold))
Ex.) D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
Itemsets included in no less than 3 transactions:
{1} {2} {7} {9} {1,7} {1,9} {2,7} {2,9} {7,9} {1,7,9} {2,7,9}
Frequent itemset mining: the problem of enumerating all frequent itemsets for a given database D and minimum support σ
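As a sanity check on the example, a brute-force sketch can enumerate the non-empty frequent itemsets for σ = 3. It tries every subset of the items, so it is exponential and only usable on toy data; the function name `frequent_itemsets` is hypothetical.

```python
from itertools import combinations

D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def frequent_itemsets(db, sigma):
    """All non-empty itemsets included in at least sigma transactions (brute force)."""
    items = sorted(set().union(*db))
    result = []
    for size in range(1, len(items) + 1):
        for K in combinations(items, size):
            # frq(K): count the transactions that contain every item of K
            if sum(1 for T in db if set(K) <= T) >= sigma:
                result.append(K)
    return result

for K in frequent_itemsets(D, 3):
    print(K)
```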

Inclusion with Ambiguity
Ambiguous inclusion relation for itemset P and transaction T
Popular definition: |P ∩ T| / |P| ≥ θ for a threshold θ < 1
Ex.) for θ = 0.6:
- {1,2,3} is included in {1,2,4,5} (2/3 ≥ 0.6)
- {1,2,3,4,5,6,7} is included in {1,3,5,6,7} (5/7 ≥ 0.6)
- {1,2,3} is not included in {1,4,5} (1/3 < 0.6)
Drawbacks
- the monotonicity of frequent itemsets is lost
- there can be a frequent itemset such that "every subset of it is infrequent"
- computation is costly
Ex.) θ = 0.6, D = { {1,2}, {2,3}, {1,3} }: {1,2,3} is included in all three transactions, but none of its non-empty proper subsets is
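The loss of monotonicity can be reproduced with a short sketch of the ratio-based definition. The names `ratio_included`, `ratio_frq` and `D_small` are hypothetical; the three-transaction database is the one from the slide's example.

```python
def ratio_included(P, T, theta):
    """Ratio-based inclusion: at least a fraction theta of P's items occur in T."""
    P, T = set(P), set(T)
    return len(P & T) / len(P) >= theta

def ratio_frq(P, db, theta):
    """Number of transactions that ratio-include P."""
    return sum(1 for T in db if ratio_included(P, T, theta))

D_small = [{1, 2}, {2, 3}, {1, 3}]
theta = 0.6

# {1,2,3} is included everywhere, yet every smaller itemset is included less often.
print(ratio_frq({1, 2, 3}, D_small, theta))
for S in [{1}, {2}, {3}, {1, 2}, {2, 3}, {1, 3}]:
    print(sorted(S), ratio_frq(S, D_small, theta))
```

With σ = 3 the itemset {1,2,3} is frequent while none of its non-empty proper subsets is, so a search that grows itemsets one item at a time cannot prune on infrequency.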

k-pseudo Inclusion
Use a threshold on the number of non-included items
k-pseudo inclusion: |P \ T| ≤ k for a threshold k ≥ 0
(k-pseudo occurrence / occurrence set / frequency are defined accordingly)
- monotonicity is kept
- able to find characterizations such as "many transactions include at least 3 items of P"
Ex.) for k = 1:
- {1,2,3} is 1-pseudo included in {1,2,4,5} (only 3 is missing)
- {1,2,3,4,5,6,7} is not 1-pseudo included in {1,3,5,6,7} (2 and 4 are missing)
- {1,2,3} is not 1-pseudo included in {1,4,5} (2 and 3 are missing)
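A minimal sketch of the k-pseudo inclusion test and the resulting k-pseudo frequency follows. The function names are hypothetical, and the database D is the running example used on the other slides.

```python
D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def pseudo_included(P, T, k):
    """k-pseudo inclusion: at most k items of P are missing from T."""
    return len(set(P) - set(T)) <= k

def pseudo_frq(P, db, k):
    """k-pseudo frequency: number of transactions that k-pseudo include P."""
    return sum(1 for T in db if pseudo_included(P, T, k))

print(pseudo_included({1, 2, 3}, {1, 2, 4, 5}, 1))
print(pseudo_frq({1, 2, 7, 9}, D, 1))
# Removing an item can never increase the number of missing items,
# so every subset is at least as frequent (monotonicity).
print(pseudo_frq({1, 2, 7}, D, 1) >= pseudo_frq({1, 2, 7, 9}, D, 1))
```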

k-Pseudo Frequent Itemset
k-pseudo frequent itemset: an itemset k-pseudo included in at least σ transactions of D
D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
1-pseudo frequent itemsets for σ=3:
{1,2,3} {1,2,4} {1,2,5} {1,2,7} {1,2,9} {1,3,7} {1,3,9} {1,4,7} {1,4,9} {1,5,7} {1,5,9} {1,6,7} {1,6,9} {1,7,8} {1,7,9} {1,8,9}
{2,3,7} {2,3,9} {2,4,7} {2,4,9} {2,5,7} {2,5,8} {2,5,9} {2,6,7} {2,6,9} {2,7,8} {2,7,9} {2,8,9}
{3,7,9} {4,7,9} {5,7,9} {6,7,9} {7,8,9}
{1,2,7,9} {1,3,7,9} {1,4,7,9} {1,5,7,9} {1,6,7,9} {1,7,8,9} {2,3,7,9} {2,4,7,9} {2,5,7,9} {2,6,7,9} {2,7,8,9}
Many trivial patterns
How to enumerate them efficiently?

Enumeration using Monotonicity
Pseudo frequent itemsets have the monotone property, so a simple backtrack algorithm works
For each k-pseudo frequent itemset P, compute the k-pseudo frequency of each P+e
If the k-pseudo frequency of P+e is no less than σ, make a recursive call to enumerate the k-pseudo frequent itemsets including P+e
[Figure: lattice of the subsets of {1,2,3,4}, from φ up to {1,2,3,4}, with the frequent itemsets forming the lower part]
Polynomial time enumeration
How to compute efficiently?
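The backtrack scheme can be sketched as below. Two points are assumptions rather than statements from the slide: items are only added above the current maximum item (the usual tail extension, which avoids duplicates), and the k-pseudo frequency is recomputed by a full scan instead of an incremental update. All function names are hypothetical.

```python
D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def pseudo_frq(P, db, k):
    """k-pseudo frequency: number of transactions missing at most k items of P."""
    P = set(P)
    return sum(1 for T in db if len(P - T) <= k)

def backtrack(P, db, k, sigma, items, out):
    """Record P, then recurse on every k-pseudo frequent P+e with e above P's maximum."""
    out.append(tuple(P))
    tail = P[-1] if P else None
    for e in items:
        if tail is not None and e <= tail:
            continue  # tail extension: each itemset is built in increasing item order only
        if pseudo_frq(P + [e], db, k) >= sigma:
            backtrack(P + [e], db, k, sigma, items, out)

def enumerate_pseudo_frequent(db, k, sigma):
    """All k-pseudo frequent itemsets of db, including the empty itemset."""
    items = sorted(set().union(*db))
    out = []
    if len(db) >= sigma:  # the empty itemset is included in every transaction
        backtrack([], db, k, sigma, items, out)
    return out

print(len(enumerate_pseudo_frequent(D, 0, 3)))
print((1, 2, 7, 9) in enumerate_pseudo_frequent(D, 1, 3))
```

Because every prefix of a frequent itemset (in increasing item order) is a subset and therefore also frequent, the recursion reaches each k-pseudo frequent itemset exactly once. With k = 0 it reduces to ordinary frequent itemset mining.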

Computing k-Pseudo Occurrences
Define Occ_{=h}(P) = { T ∈ D | |P \ T| = h }
- the set of transactions missing exactly h items of P
- Occ_{≤k}(P) = ∪_{h≤k} Occ_{=h}(P)
Occ_{=h}(P ∪ e) = (Occ_{=h}(P) ∩ Occ(e)) ∪ (Occ_{=h-1}(P) \ Occ(e))
- the pseudo occurrence set is updated by taking intersections
- compute Occ_{=h}(P) ∩ Occ(e) for all pairs of e and h
[Figure: the transactions of an example database grouped into Occ_0, Occ_1 and Occ_2 according to how many items of P they miss]
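The update rule can be sketched by keeping one bucket of transaction ids per value of h. Storing ids in Python sets and the names `extend` and `pseudo_occurrences` are assumptions made for illustration; transactions that would move past h = k are simply dropped.

```python
D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def extend(buckets, e, db):
    """Given buckets[h] = Occ_{=h}(P) for h = 0..k, return the buckets of P ∪ {e}."""
    k = len(buckets) - 1
    new = []
    for h in range(k + 1):
        # Occ_{=h}(P) ∩ Occ(e): transactions containing e keep their miss count
        kept = {t for t in buckets[h] if e in db[t]}
        # Occ_{=h-1}(P) \ Occ(e): transactions lacking e move up by one miss
        moved = {t for t in buckets[h - 1] if e not in db[t]} if h > 0 else set()
        new.append(kept | moved)
    return new

def pseudo_occurrences(P, db, k):
    """Build Occ_{=0}(P), ..., Occ_{=k}(P) by adding the items of P one at a time."""
    buckets = [set(range(len(db)))] + [set() for _ in range(k)]
    for e in P:
        buckets = extend(buckets, e, db)
    return buckets

buckets = pseudo_occurrences([1, 2], D, 1)
print(buckets)                        # transaction ids grouped by number of missing items
print(sum(len(b) for b in buckets))   # the 1-pseudo frequency of {1,2}
```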

Taking Intersections Efficiently
Occ_{=h}(P ∪ e) = (Occ_{=h}(P) ∩ Occ(e)) ∪ (Occ_{=h-1}(P) \ Occ(e))
- pseudo occurrence sets have the same properties as usual occurrence sets
- many existing techniques for updating occurrence sets can be used (down project, delivery, bitmap…)
Database reduction (FP-tree) is also available
In deeper levels of the recursion, fewer transactions have to be scanned, so the computation is fast
Transactions and their items: A: 1,2,5,6,7,9  B: 2,3,4,5  C: 1,2,7,8,9  D: 1,7,9  E: 2,7,9  F: 2
Items and their occurrences: 1: A,C,D  2: A,B,C,E,F  3: B  4: B  5: A,B  6: A  7: A,C,D,E  8: C  9: A,C,D,E
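The item-to-transactions table on this slide can be produced in a single pass that distributes each transaction to the buckets of its items. This is a sketch in the spirit of the "delivery" technique named above, not a reproduction of any particular implementation; the names `DB` and `vertical` are hypothetical.

```python
# The slide's database with its transaction labels.
DB = {
    "A": {1, 2, 5, 6, 7, 9},
    "B": {2, 3, 4, 5},
    "C": {1, 2, 7, 8, 9},
    "D": {1, 7, 9},
    "E": {2, 7, 9},
    "F": {2},
}

def vertical(db):
    """Map each item to the list of transactions containing it, in one scan of db."""
    occ = {}
    for name, T in db.items():
        for e in T:
            occ.setdefault(e, []).append(name)
    return occ

occ = vertical(DB)
for e in sorted(occ):
    print(e, occ[e])

# Occ({2,7}) as an intersection of the per-item occurrence lists.
print(sorted(set(occ[2]) & set(occ[7])))
```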

Using Bottom-wideness
Backtrack (depth-first search) generates several recursive calls in each iteration
- the computation tree spreads exponentially going down
- the computation time is dominated by the bottom-level iterations of the recursion tree
Since the occurrences to be computed are few in the lower levels, the amortized computation time is reduced to that of the bottom levels
[Figure: recursion tree; iterations near the root take a long time, iterations near the bottom take a short time]

For Large Minimum Support
When σ is large, we access many transactions even on the bottom levels
- the improvement from bottom-wideness is not drastic
Reduce the database to speed up the bottom levels
(1) delete items less than the maximum item in P
(2) delete items that are infrequent in the database of occurrences (they are never added in the recursive calls)
(3) unify identical transactions
In practice, the database size is constant on the bottom levels
No big difference from small σ
[Figure: database reduction example for P={1,3}, k=1, σ=4 on the transactions 135, 12346, 17, 23467, 34567, 23467]
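The three reduction steps can be sketched as follows. Several details are assumptions not stated on the slide: each occurrence is stored with its number h of missing items, identical transactions are merged only when they share the same h, merged transactions carry a multiplicity, and the slide's digit strings are read as sets of single-digit items. The name `reduce_db` is hypothetical.

```python
from collections import Counter

def reduce_db(db, P, k, sigma):
    """Reduce the k-pseudo occurrences of P; returns {(h, remaining items): multiplicity}."""
    P = set(P)
    occs = [(len(P - T), T) for T in db if len(P - T) <= k]
    tail = max(P)
    # (1) delete items not greater than the maximum item in P
    trimmed = [(h, {e for e in T if e > tail}) for h, T in occs]
    # (2) delete items e for which P+e cannot be k-pseudo frequent:
    #     occurrences with h < k survive adding any e, those with h = k must contain e
    base = sum(1 for h, _ in trimmed if h < k)
    count = Counter(e for h, T in trimmed if h == k for e in T)
    candidates = set().union(*(T for _, T in trimmed))
    keep = {e for e in candidates if base + count[e] >= sigma}
    # (3) unify identical transactions (same h, same remaining items)
    merged = Counter((h, frozenset(T & keep)) for h, T in trimmed)
    return dict(merged)

example = [{1, 3, 5}, {1, 2, 3, 4, 6}, {1, 7}, {2, 3, 4, 6, 7}, {3, 4, 5, 6, 7}, {2, 3, 4, 6, 7}]
for (h, items), weight in reduce_db(example, {1, 3}, 1, 4).items():
    print(h, sorted(items), weight)
```

On this example item 5 is dropped because {1,3,5} is 1-pseudo included in only three transactions, and three transactions collapse into a single weighted one.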

Small & Trivial Patterns
Under the k-pseudo inclusion, any itemset of size no more than k is included in any transaction
Itemsets of size a bit greater than k are also included in many transactions
- many small and trivial frequent itemsets
We want to ignore these itemsets in practice
- consider the problem of directly finding pseudo frequent itemsets of size l

Directly Finding Large Itemset
Searching all itemsets of size l needs exponential time
- pruning unnecessary search is crucial
- take candidates according to partial structure
Let P be a k-pseudo frequent itemset of size l
WLOG, P = {1,…,l}, sorted in increasing order of |Occ_{=k}(P) \ Occ({e})|
Consider the (k-1)-pseudo frequency of the itemset {1,…,y}
Any transaction in Occ_{=k}(P) \ Occ({e}), e > y, (k-1)-pseudo includes {1,…,y}

Search Route to Itemset of Size l
Any transaction in Occ_{=k}(P) \ Occ({e}), e > y, (k-1)-pseudo includes {1,…,y}
- |Occ_{≤k-1}({1,…,y})| ≥ | ∪_{e=y+1,…,|P|} (Occ_{=k}(P) \ Occ({e})) |
The average of |Occ_{=k}(P) \ Occ({e})| over e ∈ P is no less than (k / |P|) |Occ_{=k}(P)|
The items 1,…,|P| are sorted in increasing order of |Occ_{=k}(P) \ Occ({e})|
Transactions in Occ_{≤k-1}(P) also (k-1)-pseudo include {1,…,y}
- Partial frequency condition: |Occ_{≤k-1}({1,…,y})| ≥ |Occ_{≤k}(P)| × (|P|-y) / |P|
There is a sequence of itemsets from the empty set to P composed only of itemsets satisfying the partial frequency condition
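The inequality can be checked numerically on the running example. The sketch below reads the left-hand side as the (k-1)-pseudo frequency of the prefix and the right-hand side as the k-pseudo frequency of P scaled by (|P|-y)/|P|, with items sorted by how many transactions of Occ_{=k}(P) miss them. That reading, the database D and the name `bound_holds` are assumptions made for illustration; k ≥ 1 is required.

```python
from itertools import combinations

D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def missing(P, T):
    """Number of items of P that are not in T."""
    return len(set(P) - T)

def bound_holds(P, db, k):
    """Check |Occ_{<=k-1}(prefix_y)| * |P| >= |Occ_{<=k}(P)| * (|P| - y) for every y."""
    P = list(P)
    occ_eq_k = [T for T in db if missing(P, T) == k]
    occ_le_k = sum(1 for T in db if missing(P, T) <= k)
    # sort items in increasing order of |Occ_{=k}(P) \ Occ({e})|
    order = sorted(P, key=lambda e: sum(1 for T in occ_eq_k if e not in T))
    for y in range(len(P) + 1):
        prefix = order[:y]
        lhs = sum(1 for T in db if missing(prefix, T) <= k - 1)
        if lhs * len(P) < occ_le_k * (len(P) - y):  # integer form of the inequality
            return False
    return True

items = sorted(set().union(*D))
print(all(bound_holds(P, D, 1) for P in combinations(items, 3)))
```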

Example for Partial Frequency Condition
D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
Itemsets satisfying the partial frequency condition, for k=1, σ=3, l=3:
{1} {2} {5} {7} {9}
{1,2} {1,5} {1,6} {1,7} {1,8} {1,9} {2,3} {2,4} {2,5} {2,6} {2,7} {2,8} {2,9} {3,5} {4,5} {5,6} {5,7} {5,9} {6,7} {6,9} {7,8} {7,9} {8,9}
The number of frequent itemsets to be searched decreases, so an efficient search is expected

Restricted Search Route by P.F.C.
Any k-pseudo frequent itemset of size l can be found by passing only through itemsets satisfying the partial frequency condition
- let's do a backtrack search
There always exists an item whose removal gives an itemset satisfying the condition
Tail extension is not available (removal of the tail may violate the condition)
Simple hill climbing generates duplicates
So, use a generation rule to avoid duplication (reverse search)

Reverse Search for P.F.C.
Rule: generate itemset P from the P \ {e} maximizing |Occ_{≤k-1}(P \ {e})| (ties are broken by choosing the minimum index)
ReverseSearch(P)
1. if |P| = l then output P; return
2. for each e ∉ P do
     if P+e is a k-pseudo frequent itemset satisfying P.F.C. then
       if removing e maximizes |Occ_{≤k-1}((P+e) \ {e'})| over e' ∈ P+e then ReverseSearch(P+e)
3. end for
|Occ_{≤k-1}(P \ {e})| can be efficiently computed by existing methods
O(|P| × ||D||) time for one iteration
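A naive sketch of this search is given below. It is not the authors' implementation: frequencies are recomputed by full scans, so it does not achieve the stated time bound, and the concrete form of the partial frequency condition used here, |Occ_{≤k-1}(Q)| ≥ σ(l-|Q|)/l, is an assumption inferred from the surrounding slides. All function names are hypothetical, and k ≥ 1 is assumed.

```python
D = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]

def pseudo_frq(P, db, k):
    """Number of transactions missing at most k items of P."""
    P = set(P)
    return sum(1 for T in db if len(P - T) <= k)

def satisfies_pfc(Q, db, k, sigma, l):
    """Assumed P.F.C.: |Occ_{<=k-1}(Q)| >= sigma * (l - |Q|) / l, in integer arithmetic."""
    return pseudo_frq(Q, db, k - 1) * l >= sigma * (l - len(Q))

def parent(Q, db, k):
    """Remove the item whose removal maximizes the (k-1)-pseudo frequency (ties: minimum item)."""
    best = max(sorted(Q), key=lambda e: pseudo_frq(Q - {e}, db, k - 1))
    return Q - {best}

def reverse_search(P, db, k, sigma, l, items, out):
    if len(P) == l:
        out.append(tuple(sorted(P)))
        return
    for e in items:
        if e in P:
            continue
        Q = P | {e}
        # recurse only when Q passes the checks and P is the unique parent of Q
        if (pseudo_frq(Q, db, k) >= sigma
                and satisfies_pfc(Q, db, k, sigma, l)
                and parent(Q, db, k) == P):
            reverse_search(Q, db, k, sigma, l, items, out)

def find_size_l(db, k, sigma, l):
    """k-pseudo frequent itemsets of size exactly l, each reported once."""
    out = []
    reverse_search(frozenset(), db, k, sigma, l, sorted(set().union(*db)), out)
    return out

found = find_size_l(D, 1, 3, 3)
print(len(found), (1, 7, 9) in found)
```

Since every itemset has exactly one parent under the rule, no itemset is generated twice, and no table of visited itemsets is needed.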

Conclusion
- Introduced an ambiguous inclusion relation in which at most k items of the pattern are not included
- Pseudo frequent itemset mining under this inclusion (monotonicity, intersection, many small and trivial patterns)
- Reverse search for directly finding frequent itemsets of a fixed size
Future works
- implementation and experiments
- extension of the technique to other pattern mining
- approach to inclusion with "ratio r %"
