
Ambiguous Frequent Itemset Mining and Polynomial Delay Enumeration
May/25/2008, PAKDD 2008
Takeaki Uno (1), Hiroki Arimura (2)
(1) National Institute of Informatics, JAPAN (The Graduate University for Advanced Studies)
(2) Hokkaido University, JAPAN

Frequent Pattern Mining
The problem of finding all frequently appearing patterns in a given database.
Databases: transaction databases (itemsets), trees, graphs, vectors
Patterns: itemsets, trees, paths/cycles, graphs, geometric graphs, ...
[Figure: a genome-experiment database of DNA sequences and experiment results (Experiment 1 through Experiment 4), from which frequently appearing patterns are extracted.]

Research on Pattern Mining
There are many studies and applications on itemsets, sequences, trees, graphs, geometric graphs, ... Thanks to efficient algorithms, it is fair to say that any simple structure can be enumerated in practically short time.
One of the next problems is how to handle noise, error, and ambiguity:
- the usual "inclusion" is too strict
- we want to find patterns "mostly" included in many records
We therefore consider ambiguous appearances of patterns.

Related Work on Ambiguity
Detecting "ambiguous XXXX" is popular:
- dense substructures: clustering, community discovery, ...
- homology search on genome sequences
Heuristic search is popular because of the difficulty of modeling and computation.
Advantage: usually works efficiently
Problem: not easy to understand what is found; much higher cost for additional conditions (per solution)
Here we look at the problem from the algorithmic point of view (efficient models arising from efficient computation).

Itemset Mining
In this talk we focus on itemset mining.
Transaction database D: each record, called a transaction, is a subset of the item set E; that is, ∀T ∈ D, T ⊆ E.
Occ(P): the set of transactions including P
frq(P) = |Occ(P)|: the number of transactions including P
P is a frequent itemset ⇔ frq(P) ≥ σ (σ: minimum support)
The problem is to enumerate all frequent itemsets in D. We introduce an ambiguous inclusion relation for frequent itemset mining.
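The standard definitions above can be sketched in a few lines. This is a toy illustration, not the authors' code; the function names `occ` and `frq` simply mirror the slide's notation.

```python
def occ(P, D):
    """Occ(P): indices of the transactions in D that include every item of P."""
    return [i for i, T in enumerate(D) if set(P) <= set(T)]

def frq(P, D):
    """frq(P) = |Occ(P)|: the number of transactions including P."""
    return len(occ(P, D))

# Toy transaction database over items E = {1, ..., 5}
D = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3}]
print(occ({1, 2}, D))   # transactions 0 and 3 contain both items
print(frq({1, 2}, D))   # 2
```

With minimum support σ = 2, the itemset {1, 2} would be frequent in this toy database.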

Related Work
Fault-tolerant patterns, degenerate patterns, soft occurrences, etc. There are mainly two approaches.
(1) Generalize inclusion:
(1-a) t "includes" P if the ratio of items of P included in t is ≥ θ
  - loses monotonicity; in the worst case no subset of a frequent itemset is frequent
  - several heuristic-search-based algorithms
(1-b) t "includes" P if at most k items of P are not included
  - satisfies monotonicity, but very many small itemsets become frequent
  - maximal enumeration, or complete enumeration with small k
[Example figure with itemsets {1,2}, {2,3}, {1,3} and θ = 66%.]

Related Work 2
(2) Find pairs of an itemset and a transaction set such that only few pairs violate inclusion:
  - equivalent to finding a dense submatrix, or a dense bicluster, of the transaction-item inclusion matrix
  - very many equivalent patterns will be found, so heuristic search is mainly used to find one such dense substructure
  - the ambiguity lies in the transaction set: an itemset can have many partner sets
We introduce a new model for (2) that avoids this redundancy, and propose an efficient depth-first-search-type algorithm.

Average Inclusion
Inclusion ratio of a transaction t for P ⇔ |t ∩ P| / |P|
Average inclusion ratio of a transaction set T for P ⇔ the average of the inclusion ratio over all transactions in T:
∑_{t∈T} |t ∩ P| / ( |P| × |T| )
This is equivalent to the density of a submatrix/subgraph of the transaction-item inclusion matrix/graph.
For a density threshold θ, the maximum co-occurrence size cov(P) of an itemset P ⇔ the maximum size of a transaction set whose average inclusion ratio for P is ≥ θ.
(Example: for D = {{1,3,4}, {2,4,5}, {1,2}}, the average inclusion ratio over all of D is 50% for {2,3}, 50% for {4,5}, and 66% for {1,2}.)
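These definitions can be checked on the slide's toy database. The sketch below is my own reconstruction; it exploits the fact that, when transactions are sorted by decreasing inclusion ratio, the running average of the top k ratios is non-increasing in k, so cov(P) is the largest k whose average still clears θ.

```python
def inclusion_ratio(t, P):
    """|t ∩ P| / |P| for a single transaction t."""
    return len(set(t) & set(P)) / len(P)

def cov(P, D, theta):
    """Maximum co-occurrence size: the largest k such that the k
    transactions with the highest inclusion ratio for P have an
    average inclusion ratio of at least theta."""
    ratios = sorted((inclusion_ratio(t, P) for t in D), reverse=True)
    best, total = 0, 0.0
    for k, r in enumerate(ratios, 1):
        total += r
        if total / k >= theta:
            best = k
    return best

D = [{1, 3, 4}, {2, 4, 5}, {1, 2}]
print(cov({1, 3}, D, 0.66))  # 2, matching the slides' example
print(cov({1, 2}, D, 0.66))  # 3
```

For θ = 66% this reproduces all four values quoted on the next slide: cov({3}) = 1, cov({2}) = 3, cov({1,3}) = 2, cov({1,2}) = 3.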

Problem Definition
For a density threshold θ, the maximum co-occurrence size cov(P) of an itemset P ⇔ the maximum size of a transaction set whose average inclusion ratio for P is ≥ θ.
Ambiguous frequent itemset: an itemset P such that cov(P) ≥ σ (σ: minimum support).
Note that ambiguous frequent itemsets are not monotone!
(Example: for D = {{1,3,4}, {2,4,5}, {1,2}} and θ = 66%: cov({3}) = 1, cov({2}) = 3, cov({1,3}) = 2, cov({1,2}) = 3.)
Ambiguous frequent itemset enumeration: the problem of outputting all ambiguous frequent itemsets for a given database D, density threshold θ, and minimum support σ. The goal is to develop an efficient algorithm for this problem.

Hardness for Branch-and-Bound
A straightforward approach to this problem is branch-and-bound: in each iteration, divide the problem into two non-empty subproblems according to the inclusion of an item.
However, checking the existence of an ambiguous frequent itemset is NP-complete (Theorem 1).

Is This Really Hard?
We proved NP-hardness for very dense graphs:
- the complexity is unclear for moderately dense graphs
- polynomial-time enumeration is not ruled out
(θ = 1 is easy; small θ is hard; the middle range is open.)
Here "polynomial time" means polynomial in (input size) + (output size).

Efficient Algorithm: the Idea of Reverse Search
We do not use branch-and-bound, but reverse search:
- define an acyclic parent-child relation on all objects to be found
- recursively find children, so an algorithm for finding all children of an object is sufficient
- this yields a depth-first search on the rooted tree induced by the relation.

Neighboring Relation
AmbiOcc(P) of an ambiguous frequent itemset P ⇔ the lexicographically minimum transaction set among those whose average inclusion ratio for P is ≥ θ and whose size equals cov(P).
e*(P): the item e in P included in the fewest transactions of AmbiOcc(P) (ties are broken by taking the minimum index).
The parent Prt(P) of P: P \ {e*(P)}.
(Example with θ = 66%, σ = 4 and D = {A: 1,3,4,7; B: 2,4,5; C: 1,2,7; D: 1,4,5,7; E: 2,3,6; F: 3,4,6}: for P = {1,4,5}, AmbiOcc({1,4,5}) = {D, A, B, C}, e*(P) = 5, Prt({1,4,5}) = {1,4}, and AmbiOcc({1,4}) = {D, A, B, C, F}.)
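A sketch of these three definitions, checked against the slide's running example. This is a reconstruction: it realizes AmbiOcc by a greedy choice in decreasing inclusion ratio with smaller-index tie-breaking, which reproduces the example here; whether that greedy choice always yields the lexicographically minimum set is an assumption of this sketch.

```python
def ambi_occ(P, D, theta):
    """Greedily pick transaction indices in decreasing inclusion ratio
    (ties: smaller index) while the running average stays >= theta.
    The running average is non-increasing, so we can stop at the first drop."""
    order = sorted(range(len(D)),
                   key=lambda i: (-len(D[i] & P) / len(P), i))
    chosen, total = [], 0.0
    for i in order:
        r = len(D[i] & P) / len(P)
        if (total + r) / (len(chosen) + 1) >= theta:
            chosen.append(i)
            total += r
        else:
            break
    return chosen

def e_star(P, D, theta):
    """The item of P contained in the fewest AmbiOcc(P) transactions
    (ties: smallest item)."""
    A = ambi_occ(P, D, theta)
    return min(P, key=lambda e: (sum(1 for i in A if e in D[i]), e))

def parent(P, D, theta):
    """Prt(P) = P \\ {e*(P)}."""
    return P - {e_star(P, D, theta)}

# The slide's database: indices 0..5 are transactions A..F
D = [{1,3,4,7}, {2,4,5}, {1,2,7}, {1,4,5,7}, {2,3,6}, {3,4,6}]
print(ambi_occ({1, 4, 5}, D, 0.66))  # [3, 0, 1, 2] = D, A, B, C
print(e_star({1, 4, 5}, D, 0.66))    # 5
print(parent({1, 4, 5}, D, 0.66))    # {1, 4}
```

The greedy stop is justified by the same observation as for cov: once the average of the sorted prefix drops below θ, it can never recover.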

Properties of the Parent
The parent Prt(P) = P \ {e*(P)} is uniquely defined.
The average inclusion ratio of AmbiOcc(P) for P does not decrease when e*(P) is removed, so Prt(P) is also an ambiguous frequent itemset.
|Prt(P)| < |P| (the parent is always smaller), so the relation is acyclic and induces a tree rooted at φ.

Enumeration Tree
The relation is acyclic and induces a tree rooted at φ; we call this tree the enumeration tree.
[Figure: the enumeration tree for the example database with θ = 66%, σ = 4, rooted at φ, containing among others {1,7}, {3,4}, {4,5}, {1,4}, {4,7}, {1,4,7}, {1,4,5}, {1,3,4}, {3,4,7}, {4,5,7}, {1,2,7}, {1,3,7}, {1,5,7}, {1,3,4,7}, and {1,4,5,7}.]

Listing Children
To perform a depth-first search on the enumeration tree, all we need is to find all children of a given itemset.
P = Prt(P') is obtained by removing an item from P', so a child P' of P is obtained by adding an item to P. To find all children, we examine all possible item additions.

Checking Candidates
An item addition does not always yield a child; the results are just candidates.
A candidate P' = P ∪ {e} is a child of P exactly when its parent is P, i.e., when e*(P') = e. We check this by computing e*(P ∪ {e}) for each candidate.
Theorem: the enumeration is done in O(||D|| n) time per ambiguous frequent itemset.

Algorithm Description

Algorithm AFIM ( P: pattern, D: database )
  output P
  compute cov(P ∪ {e}) for all items e not in P
  for each e s.t. cov(P ∪ {e}) ≥ σ do
    compute AmbiOcc(P ∪ {e})
    compute e*(P ∪ {e})
    if e*(P ∪ {e}) = e then call AFIM ( P ∪ {e}, D )
  done
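The pseudocode above can be turned into a small, self-contained sketch. This is a deliberately unoptimized reconstruction (it recomputes cov, AmbiOcc, and e* from scratch for every candidate, instead of the Delivery-based incremental updates described later); the names mirror the slides, but the code is not the authors' implementation.

```python
def ratio(t, P):
    # Inclusion ratio |t ∩ P| / |P|; the empty pattern is included trivially.
    return len(t & P) / len(P) if P else 1.0

def ambi_occ(P, D, theta):
    # Greedy choice in decreasing ratio (ties: smaller index);
    # len(ambi_occ(P, D, theta)) plays the role of cov(P).
    order = sorted(range(len(D)), key=lambda i: (-ratio(D[i], P), i))
    chosen, total = [], 0.0
    for i in order:
        r = ratio(D[i], P)
        if (total + r) / (len(chosen) + 1) >= theta:
            chosen.append(i)
            total += r
        else:
            break
    return chosen

def e_star(P, D, theta):
    A = ambi_occ(P, D, theta)
    return min(P, key=lambda e: (sum(1 for i in A if e in D[i]), e))

def afim(P, D, E, theta, sigma, out):
    """Reverse search: output P, then recurse into exactly the candidates
    P ∪ {e} whose parent is P (i.e. e*(P ∪ {e}) = e)."""
    out.append(frozenset(P))
    for e in sorted(E - P):
        Q = P | {e}
        if len(ambi_occ(Q, D, theta)) >= sigma and e_star(Q, D, theta) == e:
            afim(Q, D, E, theta, sigma, out)

D = [{1,3,4,7}, {2,4,5}, {1,2,7}, {1,4,5,7}, {2,3,6}, {3,4,6}]
E = set().union(*D)
result = []
afim(set(), D, E, theta=0.66, sigma=4, out=result)
print(len(result), "ambiguous frequent itemsets found")
```

On the slides' example database this enumerates, among others, the chain φ, {4}, {1,4}, {1,4,5} without ever visiting the same itemset twice.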

Computing cov(P ∪ {e})
A transaction set whose size and average inclusion ratio equal those of AmbiOcc(P ∪ {e}) is obtained by choosing transactions in decreasing order of inclusion ratio.
cov(P) ≥ cov(P ∪ {e}) always holds: for any transactions T and T' such that the inclusion ratio of T for P is larger than that of T', the inclusion ratio of T for P ∪ {e} is no less than that of T'. Hence we can restrict the choice to transactions in AmbiOcc(P) when computing cov(P ∪ {e}).

Example of Computing cov
Computation of cov(P ∪ {e}) for P = {1,4} and e = 5, with θ = 66%, σ = 4 and D = {A: 1,3,4,7; B: 2,4,5; C: 1,2,7; D: 1,4,5,7; E: 2,3,6; F: 3,4,6}:
AmbiOcc({1,4}) = {D, A, B, C, F}. Of these, D includes 3 items of {1,4,5}, A and B include 2 items, C and F include 1 item, and E includes none. Choosing greedily gives AmbiOcc({1,4,5}) = {D, A, B, C}.

Efficient Computation of cov's
For efficiency, we classify transactions by inclusion ratio (0 misses, 1 miss, 2 misses, ...).
When we compute cov(P ∪ {e}), we compute the intersection of each group with Occ({e}): the inclusion ratio increases exactly for the transactions in Occ({e}), so moving those transactions yields the classification for P ∪ {e}.
This task, for all items at once, is done efficiently by Delivery, which takes O(||G||) time where ||G|| is the sum of transaction sizes in group G. Thus cov(P ∪ {e}) can be computed in linear time.
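The Delivery step admits a very short sketch: one pass over a group's transactions fills, for every item simultaneously, the bucket of group members containing that item, for O(||G||) total work. This is an illustrative reconstruction (the function name `delivery` and the index-list representation of a group are my own choices, not the paper's code).

```python
def delivery(group, D):
    """Occurrence deliver: for each item e, collect the members of `group`
    (given as transaction indices into D) whose transaction contains e.
    Total work is the sum of the sizes of the group's transactions."""
    buckets = {}
    for i in group:
        for e in D[i]:
            buckets.setdefault(e, []).append(i)
    return buckets

D = [{1, 3, 4, 7}, {2, 4, 5}, {1, 2, 7}, {1, 4, 5, 7}, {2, 3, 6}, {3, 4, 6}]
print(delivery([0, 1, 3], D)[4])   # [0, 1, 3]: all three group members contain item 4
```

Intersecting a group with Occ({e}) then amounts to reading off `buckets[e]`, which is how one pass serves all candidate items at once.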

Computing AmbiOcc and e*
Computing AmbiOcc(P ∪ {e}) needs a greedy choice of transactions in decreasing order of (inclusion ratio, index). Computing e*(P ∪ {e}) needs the intersection of AmbiOcc(P ∪ {e}) and Occ({i}) for each i ∈ P, again via Delivery. Both need O(||D||) time in the worst case.
However, when cov(P) is small, few transactions are scanned, so we expect the average computation time to be much shorter.

Bottom-wideness
DFS generates several recursive calls in each iteration, so the recursion tree grows exponentially as we go down, and the total computation time is dominated by the lowest levels.
The computation time per iteration decreases as we go down: near the bottom levels it may be close to σ, so an iteration may take O(σ t) time, where t is the average transaction size.

Computational Experiments
CPU: Pentium M 1.1GHz, memory: 256MB
OS: Windows XP + Cygwin
Code: C, compiled with gcc 2.3
Test instances are taken from benchmark datasets for frequent itemset mining.

BMS-WebView-2
A real-world web access dataset (sparse; average transaction size = 4.5).

Mushroom
A real-world machine learning dataset of mushrooms (density = 1/3).

Possibility for Further Improvements
The ratio of unnecessary operations, and of non-maximal patterns, suggests room for further speedups.

Conclusion
We introduced a new model for frequent itemset mining with an ambiguous inclusion relation, which avoids redundancy.
We showed a hardness result for branch-and-bound, and efficiency on practical (sparse) datasets.
Future work:
- reduce the time complexity and close the gap to practice
- efficient models and computation for maximal patterns
- apply the technique to other problems (ambiguous pattern mining for graph, tree, vector data, etc.)