Ambiguous Frequent Itemset Mining and Polynomial Delay Enumeration May/25/2008 PAKDD 2008 Takeaki Uno (1), Hiroki Arimura (2) (1) National Institute of.

Ambiguous Frequent Itemset Mining and Polynomial Delay Enumeration
May/25/2008, PAKDD 2008
Takeaki Uno (1), Hiroki Arimura (2)
(1) National Institute of Informatics, JAPAN (The Graduate University for Advanced Studies)
(2) Hokkaido University, JAPAN

Frequent Pattern Mining
Problem of finding all frequently appearing patterns in a given database
database: transaction database (itemsets), trees, graphs, vectors
patterns: itemsets, trees, paths/cycles, graphs, geometric graphs, …
[Figure: two example databases, one of genome sequences and one of experiment records (Experiments 1 to 4 with ●/▲ outcomes), from which frequently appearing patterns are extracted: subsequences such as ATGCAT, CCCGGGTAA, GGCGTTA, ATAAGGG, and outcome combinations such as "Experiment 1 ●, Experiment 3 ▲", "Experiment 2 ●, Experiment 4 ●", "Experiment 2 ●, Experiment 3 ▲, Experiment 4 ●", "Experiment 2 ▲, Experiment 3 ▲".]

Research on Pattern Mining
There are many studies and applications on itemsets, sequences, trees, graphs, and geometric graphs.
Thanks to efficient algorithms, any simple structure can be enumerated in practically short time.
One of the next problems is how to handle noise, errors, and ambiguity
  → the usual “inclusion” is too strict
  → we want to find patterns that are “mostly” included in many records
We consider ambiguous appearance of patterns.

Related Works on Ambiguity
Detecting “ambiguous XXXX” is popular
  → dense substructures: clustering, community discovery, …
  → homology search on genome sequences
Heuristic search is popular because modeling and computation are difficult.
Advantage: usually works efficiently
Problem: it is not easy to understand “what is found”, and additional conditions cost much more (for each solution)
Here we look at the problem from an algorithmic point of view (efficient models arising from efficient computation).

Itemset Mining
In this talk, we focus on itemset mining.
transaction database D: each record, called a transaction, is a subset of the itemset E, that is, ∀ T ∈ D, T ⊆ E
Occ(P): set of transactions including P
frq(P) = |Occ(P)|: number of transactions including P
P is a frequent itemset ⇔ frq(P) ≥ σ (σ is the minimum support)
The problem is to enumerate all frequent itemsets in D.
We introduce ambiguous inclusion for frequent itemset mining.
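
The definitions of Occ(P) and frq(P) can be sketched in a few lines of Python. This is an illustrative brute-force check, not the enumeration method of the talk; the function names and the three-transaction database (borrowed from a later slide) are assumptions made for the example.

```python
from itertools import combinations

def occ(pattern, database):
    """Occ(P): positions of the transactions that include every item of P."""
    return [i for i, t in enumerate(database) if pattern <= t]

def frq(pattern, database):
    """frq(P) = |Occ(P)|."""
    return len(occ(pattern, database))

def frequent_itemsets(database, sigma):
    """Brute force over all non-empty itemsets; exponential, for illustration only."""
    items = sorted(set().union(*database))
    found = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            if frq(pattern, database) >= sigma:
                found.append(pattern)
    return found

# Hypothetical toy database (the three transactions used on a later slide).
D = [frozenset({1, 3, 4}), frozenset({2, 4, 5}), frozenset({1, 2})]
print(frequent_itemsets(D, sigma=2))
```

With σ = 2 only single items survive here, because no pair of items occurs together in two transactions.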

Related works
fault-tolerant patterns, degenerate patterns, soft occurrences, etc.
mainly two approaches
(1) generalize inclusion:
  (1-a) the ratio of included items ≥ θ ⇒ include
    → loses monotonicity; in the worst case no subset is frequent
    → several heuristic-search-based algorithms
  (1-b) at most k items are not included ⇒ include
    → satisfies monotonicity, but so many small itemsets are frequent
    → maximal enumeration, or complete enumeration with small k
[Figure: example transactions {1,2}, {2,3}, {1,3} with θ = 66%.]
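
The difference between the two generalized inclusions can be seen on the slide's three transactions. The sketch below is only an illustration: the function names are made up here, θ = 66% is read literally as 66/100, and k = 1 is an arbitrary choice.

```python
from fractions import Fraction

def ratio_included(pattern, transaction, theta):
    """(1-a): P counts as included if at least a theta fraction of its items is in t."""
    return Fraction(len(pattern & transaction), len(pattern)) >= theta

def k_miss_included(pattern, transaction, k):
    """(1-b): P counts as included if at most k of its items are missing from t."""
    return len(pattern - transaction) <= k

def support(pattern, database, included):
    """Number of transactions that include P under the given inclusion test."""
    return sum(1 for t in database if included(pattern, t))

D = [frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 3})]
THETA = Fraction(66, 100)   # assumption: 66% taken literally

def by_ratio(p, t):
    return ratio_included(p, t, THETA)

def by_misses(p, t):
    return k_miss_included(p, t, 1)

for p in (frozenset({1, 2, 3}), frozenset({1, 2}), frozenset({1})):
    print(sorted(p), support(p, D, by_ratio), support(p, D, by_misses))
```

Under (1-a) the full itemset {1,2,3} is supported by all three transactions while its subsets are not, which is the loss of monotonicity; under (1-b) with k = 1 every one of these itemsets is supported everywhere.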

Related works 2
(2) find pairs of an itemset and a transaction set such that few item-transaction pairs violate inclusion
  → equivalent to finding a dense submatrix, or dense bicluster, of the items × transactions matrix; so many equivalent patterns will be found
  → mainly, heuristic search for finding one such dense substructure
ambiguity on the transaction set
  → an itemset can have many partners
We introduce a new model for (2) to avoid redundancy, and propose an efficient depth-first-search type algorithm.

Average Inclusion
inclusion ratio of a transaction t for P ⇔ |t ∩ P| / |P|
average inclusion ratio of a transaction set T for P ⇔ average of the inclusion ratio over all transactions in T, that is, ∑_{t ∈ T} |t ∩ P| / ( |P| × |T| )
  → equivalent to the density of a submatrix/subgraph of the transaction-item inclusion matrix/graph
For a density threshold θ, the maximum co-occurrence size cov(P) of itemset P ⇔ maximum size of a transaction set whose average inclusion ratio is ≥ θ
Example: for the transactions {1,3,4}, {2,4,5}, {1,2}, the average inclusion ratio over all three is 50% for {2,3}, 50% for {4,5}, and 66% for {1,2}.
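
The slide's percentages can be reproduced directly from the two definitions. The following is a minimal sketch with exact fractions; the function names are chosen for this example.

```python
from fractions import Fraction

def inclusion_ratio(transaction, pattern):
    """|t ∩ P| / |P| for one transaction."""
    return Fraction(len(transaction & pattern), len(pattern))

def average_inclusion_ratio(transactions, pattern):
    """Sum of |t ∩ P| over T, divided by |P| * |T|."""
    total = sum(len(t & pattern) for t in transactions)
    return Fraction(total, len(pattern) * len(transactions))

T = [frozenset({1, 3, 4}), frozenset({2, 4, 5}), frozenset({1, 2})]
for p in (frozenset({2, 3}), frozenset({4, 5}), frozenset({1, 2})):
    print(sorted(p), average_inclusion_ratio(T, p))
```

The last value is exactly 2/3, which the slide rounds to 66%.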

Problem Definition
For a density threshold θ, the maximum co-occurrence size cov(P) of itemset P ⇔ maximum size of a transaction set whose average inclusion ratio is ≥ θ
Ambiguous frequent itemset: an itemset P such that cov(P) ≥ σ (σ: minimum support)
Ambiguous frequent itemsets are not monotone!
Example: for the transactions {1,3,4}, {2,4,5}, {1,2} and θ = 66%: cov({3}) = 1, cov({2}) = 3, cov({1,3}) = 2, cov({1,2}) = 3
Ambiguous frequent itemset enumeration: the problem of outputting all ambiguous frequent itemsets for a given database D, density threshold θ, and minimum support σ
The goal is to develop an efficient algorithm for this problem.
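
Because the average over the best k transactions can only fall as k grows, cov(P) can be found by sorting transactions by |t ∩ P| and taking the longest prefix that still meets θ. The sketch below assumes a non-empty P and reads θ = 66% as 66/100; the function name is illustrative.

```python
from fractions import Fraction

def cov(pattern, database, theta):
    """Largest k such that the k best-matching transactions average >= theta."""
    counts = sorted((len(t & pattern) for t in database), reverse=True)
    best, total = 0, 0
    for k, c in enumerate(counts, start=1):
        total += c
        if Fraction(total, len(pattern) * k) < theta:
            break          # prefix averages never increase again
        best = k
    return best

D = [frozenset({1, 3, 4}), frozenset({2, 4, 5}), frozenset({1, 2})]
THETA = Fraction(66, 100)   # assumption: 66% taken literally
for p in ({3}, {2}, {1, 3}, {1, 2}):
    print(sorted(p), cov(frozenset(p), D, THETA))
```

The values for {3} and {1,3} show the non-monotonicity: adding item 1 raises cov from 1 to 2.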

Hardness for Branch-and-Bound
A straightforward approach to this problem is branch-and-bound.
In each iteration, divide the problem into two non-empty subproblems by the inclusion of an item.
Checking the existence of an ambiguous frequent itemset is NP-complete (Theorem 1).
[Figure: branching on items i1, i2; node labels i1, v1.]

Is This Really Hard?
We proved NP-hardness for "very dense graphs"
  → unclear for graphs of middle density
  → polynomial time enumeration is not impossible
[Figure: axis from θ = 1 to θ = 0 with regions marked easy, hard, and unknown (??????????).]
polynomial time in (input size) + (output size)

Efficient Algorithm: Idea of Reverse Search
We do not use branch-and-bound, but reverse search.
Define an acyclic parent-child relation on all objects to be found.
Recursively find children to search; thus an algorithm for finding all children is sufficient.
Depth-first search on the rooted tree induced by the relation.
[Figure: objects connected by the parent-child relation.]

Neighboring Relation
AmbiOcc(P) of an ambiguous frequent itemset P ⇔ the lexicographically minimum one among the transaction sets whose average inclusion ratio is ≥ θ and whose size is cov(P)
e*(P): the item e in P such that the number of transactions in AmbiOcc(P) including e is the minimum (ties are broken by taking the minimum index)
the parent Prt(P) of P: P \ {e*(P)}
Example (θ = 66%, σ = 4):
  A: 1,3,4,7   B: 2,4,5   C: 1,2,7   D: 1,4,5,7   E: 2,3,6   F: 3,4,6
  P = {1,4,5}: transactions in decreasing order of inclusion ratio are D, A, B, C, F, E, so AmbiOcc({1,4,5}) = {D,A,B,C}
  e*(P) = 5, so Prt({1,4,5}) = {1,4}; AmbiOcc({1,4}) = {D,A,B,C,F}
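
The example can be replayed in code. The sketch below assumes that AmbiOcc(P) is obtained by the greedy choice mentioned later in the talk (decreasing inclusion ratio, ties broken by transaction name); whether this always coincides with the lexicographically minimum set in the formal definition is not checked here. Function names and the reading of θ = 66% as 66/100 are likewise assumptions.

```python
from fractions import Fraction

DB = {"A": {1, 3, 4, 7}, "B": {2, 4, 5}, "C": {1, 2, 7},
      "D": {1, 4, 5, 7}, "E": {2, 3, 6}, "F": {3, 4, 6}}
THETA = Fraction(66, 100)   # assumption: 66% taken literally

def ambi_occ(pattern, db, theta):
    """Greedy prefix: best-matching transactions first, name as tie-breaker."""
    ranked = sorted(db, key=lambda name: (-len(db[name] & pattern), name))
    chosen, total = [], 0
    for k, name in enumerate(ranked, start=1):
        total += len(db[name] & pattern)
        if Fraction(total, len(pattern) * k) < theta:
            break
        chosen.append(name)
    return chosen            # len(chosen) equals cov(pattern)

def e_star(pattern, db, theta):
    """Item of P contained in the fewest transactions of AmbiOcc(P); smallest item on ties."""
    occ = ambi_occ(pattern, db, theta)
    return min(sorted(pattern), key=lambda e: sum(1 for name in occ if e in db[name]))

def parent(pattern, db, theta):
    """Prt(P) = P \\ {e*(P)}."""
    return pattern - {e_star(pattern, db, theta)}

P = frozenset({1, 4, 5})
print(ambi_occ(P, DB, THETA), e_star(P, DB, THETA), sorted(parent(P, DB, THETA)))
```

Items 1 and 4 each occur in three transactions of {D,A,B,C} and item 5 in two, so item 5 is removed to form the parent.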

Properties of Parent
The parent Prt(P) of P: P \ {e*(P)}
  → uniquely defined
The average inclusion ratio of AmbiOcc(P) does not decrease when P is replaced by Prt(P)
  → Prt(P) is an ambiguous frequent itemset
|Prt(P)| < |P| (the parent is always smaller)
  → the relation is acyclic, and induces a tree (rooted at φ)
Example (θ = 66%, σ = 4):
  A: 1,3,4,7   B: 2,4,5   C: 1,2,7   D: 1,4,5,7   E: 2,3,6   F: 3,4,6
  P = {1,4,5}: transactions in decreasing order of inclusion ratio are D, A, B, C, F, E, so AmbiOcc({1,4,5}) = {D,A,B,C}
  e*(P) = 5, so Prt({1,4,5}) = {1,4}; AmbiOcc({1,4}) = {D,A,B,C,F}
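
The second property follows from a short averaging argument. In the derivation below, k = cov(P) and c_e is the number of transactions in AmbiOcc(P) that include item e; both symbols are notation introduced here rather than taken from the slides, and |P| ≥ 2 is assumed.

```latex
\begin{align*}
\sum_{e \in P} c_e &= \sum_{t \in \mathrm{AmbiOcc}(P)} |t \cap P| \;\ge\; \theta\,|P|\,k, \\
c_{e^*(P)} &= \min_{e \in P} c_e \;\le\; \frac{1}{|P|} \sum_{e \in P} c_e, \\
\frac{\sum_{e \in P} c_e - c_{e^*(P)}}{(|P|-1)\,k}
  &\ge \frac{\left(1 - \frac{1}{|P|}\right) \sum_{e \in P} c_e}{(|P|-1)\,k}
  = \frac{\sum_{e \in P} c_e}{|P|\,k} \;\ge\; \theta .
\end{align*}
```

So AmbiOcc(P) is a transaction set of size cov(P) whose average inclusion ratio for Prt(P) is at least θ, which gives cov(Prt(P)) ≥ cov(P) ≥ σ. In the slide's example the sum is 8 and c_5 = 2, so the ratio moves from 8/12 for {1,4,5} to 6/8 for {1,4}.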

Enumeration Tree
The relation is acyclic, and induces a tree (rooted at φ).
We call this tree the enumeration tree.
Example (θ = 66%, σ = 4):
  A: 1,3,4,7   B: 2,4,5   C: 1,2,7   D: 1,4,5,7   E: 2,3,6   F: 3,4,6
[Figure: enumeration tree rooted at φ. Nodes shown, by size: {1}, {2}, {3}, {4}, {7}; {1,7}, {3,4}, {4,5}, {1,4}, {4,7}; {1,4,7}, {1,4,5}, {1,3,4}, {3,4,7}, {4,5,7}, {1,2,7}, {1,3,7}, {1,5,7}; {1,3,4,7}, {1,4,5,7}.]

Listing Children
To perform a depth-first search on the enumeration tree, all we have to do is find all children of a given itemset.
P = Prt(P’) is obtained by removing an item from P’
  → a child P’ of P is obtained by adding an item to P
  → to find all children, we examine all possible items
[Figure: itemsets in the enumeration tree rooted at φ.]

Check Candidates
An item addition does not always yield a child
  → the results are just “candidates”
If the parent of a candidate P’ = P ∪ e is P (that is, e*(P’) = e), then P’ is a child of P
  → check by computing e*(P ∪ e) for each candidate P ∪ e
Theorem: Enumeration is done in O(||D||n) time for each ambiguous frequent itemset.
[Figure: itemsets in the enumeration tree rooted at φ.]

Algorithm Description
Algorithm AFIM ( P: pattern, D: database )
  output P
  compute cov(P ∪ e) for all items e not in P
  for each e s.t. cov(P ∪ e) ≥ σ do
    compute AmbiOcc(P ∪ e)
    compute e*(P ∪ e)
    if e*(P ∪ e) = e then call AFIM ( P ∪ e, D )
  done
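
A direct, unoptimized Python rendering of this recursion might look as follows. It is a sketch under stated assumptions, not the authors' C implementation: AmbiOcc is computed greedily over the whole database (decreasing inclusion ratio, ties by transaction index), θ = 66% is read as 66/100, and the helper names are invented for the example. None of the speed-ups discussed on the following slides are used.

```python
from fractions import Fraction

def ambi_occ(pattern, db, theta):
    """Greedy prefix of transactions (best match first, index as tie-breaker)."""
    ranked = sorted(range(len(db)), key=lambda i: (-len(db[i] & pattern), i))
    chosen, total = [], 0
    for k, i in enumerate(ranked, start=1):
        total += len(db[i] & pattern)
        if Fraction(total, len(pattern) * k) < theta:
            break
        chosen.append(i)
    return chosen                      # len(chosen) == cov(pattern)

def cov(pattern, db, theta):
    return len(ambi_occ(pattern, db, theta))

def e_star(pattern, occ, db):
    """Item of the pattern covered least often in occ; smallest item on ties."""
    return min(sorted(pattern), key=lambda e: sum(1 for i in occ if e in db[i]))

def afim(pattern, db, items, theta, sigma, out):
    """Reverse search: output P, then recurse into every child P ∪ {e}."""
    out.append(pattern)
    for e in items:
        if e in pattern:
            continue
        candidate = pattern | {e}
        occ = ambi_occ(candidate, db, theta)
        # candidate is a child of pattern iff it is frequent and e*(candidate) == e
        if len(occ) >= sigma and e_star(candidate, occ, db) == e:
            afim(candidate, db, items, theta, sigma, out)
    return out

DB = [frozenset(t) for t in ({1, 3, 4, 7}, {2, 4, 5}, {1, 2, 7},
                             {1, 4, 5, 7}, {2, 3, 6}, {3, 4, 6})]
ITEMS = sorted(set().union(*DB))
THETA, SIGMA = Fraction(66, 100), 4
found = afim(frozenset(), DB, ITEMS, THETA, SIGMA, [])
print(len(found), [sorted(p) for p in found if len(p) == 2])
```

The call starts from the empty itemset φ, which is treated as the root without being tested against θ. Since every frequent itemset has exactly one parent, each is output once and no duplicate check is needed.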

Computing cov(P ∪ e)
A transaction set whose size and average inclusion ratio are equal to those of AmbiOcc(P ∪ e) is obtained by choosing transactions in decreasing order of inclusion ratio.
cov(P) ≥ cov(P ∪ e) holds whenever P ∪ e is a child of P, because the average inclusion ratio of AmbiOcc(P ∪ e) does not decrease for the parent P (cov is not monotone in general, as the example cov({3}) = 1 < cov({1,3}) = 2 shows).
For any transactions T and T’ such that the inclusion ratio of T for P is larger than that of T’
  → the inclusion ratio of T for P ∪ e is no less than that of T’
  → to compute cov(P ∪ e), we can restrict the choice to transactions in AmbiOcc(P)

Example of Computing cov
Computation of cov(P ∪ e) for P = {1,4} and e = 5 (θ = 66%, σ = 4)
  A: 1,3,4,7   B: 2,4,5   C: 1,2,7   D: 1,4,5,7   E: 2,3,6   F: 3,4,6
  AmbiOcc({1,4}) = {D,A,B,C,F}, leaving out E: D and A include 2 items of {1,4}; B, C, F include 1 item; E includes no item
  AmbiOcc({1,4,5}) = {D,A,B,C}, leaving out F and E: D includes 3 items of {1,4,5}; A and B include 2 items; C and F include 1 item; E includes no item

Efficient Computation of cov’s
For efficient computation, we classify transactions by inclusion ratio, that is, by the number of missing items (0 miss, 1 miss, 2 miss, …).
When we compute cov(P ∪ e), we compute the intersection of each group and Occ(e)
  → the inclusion ratio increases for transactions included in Occ(e)
  → by moving such transactions, the classification for P ∪ e is obtained
This task is done efficiently for all items by Delivery, which takes O(||G||) time, where ||G|| is the sum of transaction sizes in group G
  → computation of cov(P ∪ e) can be done in linear time
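
One way to picture the grouping idea is the sketch below. It is a loose illustration rather than the Delivery routine of the authors' code: transactions are bucketed by the number of missing items, a single sweep counts for every candidate item e how many transactions of each bucket contain e, and cov(P ∪ e) is then read off the shifted bucket sizes. All names are hypothetical, the whole database is scanned instead of only AmbiOcc(P), and θ = 66% is read as 66/100.

```python
from collections import defaultdict
from fractions import Fraction

def classify(pattern, db):
    """groups[m] = indices of transactions missing exactly m items of the pattern."""
    groups = defaultdict(list)
    for i, t in enumerate(db):
        groups[len(pattern - t)].append(i)
    return groups

def deliver(groups, pattern, db):
    """One sweep: hits[e][m] = number of transactions in group m that contain e."""
    hits = defaultdict(lambda: defaultdict(int))
    for m, members in groups.items():
        for i in members:
            for e in db[i] - pattern:
                hits[e][m] += 1
    return hits

def cov_from_sizes(sizes, pattern_size, theta):
    """sizes[m] = number of transactions with m misses; fewest misses are taken first."""
    taken, total = 0, 0
    for m in range(pattern_size + 1):
        for _ in range(sizes.get(m, 0)):
            if Fraction(total + pattern_size - m, pattern_size * (taken + 1)) < theta:
                return taken
            taken, total = taken + 1, total + pattern_size - m
    return taken

def cov_all_extensions(pattern, db, theta):
    """cov(P ∪ e) for every item e that occurs outside P in some transaction."""
    groups = classify(pattern, db)
    hits = deliver(groups, pattern, db)
    result = {}
    for e, per_group in hits.items():
        sizes = defaultdict(int)
        for m, members in groups.items():
            stay = per_group.get(m, 0)
            sizes[m] += stay                      # contains e: still m misses
            sizes[m + 1] += len(members) - stay   # lacks e: one more miss
        result[e] = cov_from_sizes(sizes, len(pattern) + 1, theta)
    return result

DB = [frozenset(t) for t in ({1, 3, 4, 7}, {2, 4, 5}, {1, 2, 7},
                             {1, 4, 5, 7}, {2, 3, 6}, {3, 4, 6})]
print(cov_all_extensions(frozenset({1, 4}), DB, Fraction(66, 100)))
```

For P = {1,4} the sweep touches each transaction once, and the per-item work afterwards depends only on the number of groups, not on the transactions themselves.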

Computing AmbiOcc and e*
Computation of AmbiOcc(P ∪ e) needs a greedy choice of transactions, in decreasing order of (inclusion ratio, index).
Computation of e*(P ∪ e) needs the intersection of AmbiOcc(P ∪ e) and Occ(i) for each i ∈ P
  → done by Delivery
  → needs O(||D||) time in the worst case
However, when cov(P) is small, not many transactions need to be scanned, so we expect the average computation time to be short.

Bottom-wideness
DFS generates several recursive calls in each iteration
  → the recursion tree grows exponentially going down
  → computation time is dominated by the lowest levels
Computation time per iteration decreases going down.
Near the bottom levels, the number of transactions to scan may be close to σ, so an iteration may take O(σt) time, where t is the average size of transactions.
[Figure: recursion tree; iterations take a long time near the top and a short time near the bottom.]

Computational Experiments
CPU: Pentium M 1.1GHz, memory: 256MB
OS: Windows XP + Cygwin
Code: C
Compiler: gcc 2.3
Test instances are taken from benchmark datasets for frequent itemset mining.

BMS-WebView 2
Real-world web access data (sparse; average transaction size = 4.5)

Mushroom
Real-world machine learning data on mushrooms (density = 1/3)

Possibility for Further Improvements
Ratio of unnecessary operations, non-maximal patterns

Conclusion
Introduced a new model for frequent itemset mining with an ambiguous inclusion relation, which avoids redundancy.
Showed a hardness result for branch-and-bound.
Showed efficiency on practical (sparse) datasets.
Future Works:
  Reduce the time complexity and close the gap with practice
  Efficient models and computation for maximal ones
  Application of the technique to other problems (ambiguous pattern mining for graph, tree, vector data, etc.)
