Efficient Closed Pattern Mining in Strongly Accessible Set Systems


Efficient Closed Pattern Mining in Strongly Accessible Set Systems
Mario Boley, Tamás Horváth, Axel Poigné, Stefan Wrobel
Fraunhofer IAIS, Sankt Augustin & University of Bonn, Germany

Closed Frequent Patterns
Data mining definition: frequent patterns that cannot be enlarged without changing their support.
Example: closed frequent itemsets
  a compact representation of the frequent itemsets
  the number of frequent itemsets can be exponentially larger than the number of closed frequent itemsets
[Figure: table over the items A, B, C, D, E]

The Closed Set Mining (CSM) Problem
In many cases, the closed frequent pattern mining problem is an instance of the Closed Set Mining problem:
Given a finite ground set E, a membership oracle M_F: 2^E → {0,1} defining a family F ⊆ 2^E with ∅ ∈ F, and a closure operator ρ: F → F, list the family ρ(F) of closed sets, i.e., ρ(F) = { ρ(X) : X ∈ F }.
Example: closed frequent itemsets
  E: set of items
  F: family of frequent itemsets
  the oracle M_F decides whether an itemset is frequent
  for every X ∈ F, ρ(X) is the intersection of the transactions containing X
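To make the problem statement concrete, here is a minimal Python sketch of the closed frequent itemset instance. The toy database, the threshold value, and all function names are assumptions made for illustration; the brute-force enumeration is exponential and only shows what the output family looks like, not how to compute it efficiently.

```python
from itertools import combinations

# Hypothetical toy database: each transaction is a set of items.
TRANSACTIONS = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")]
GROUND_SET = frozenset("abcd")  # the ground set E
THRESHOLD = 2                   # assumed minimum support


def support(itemset):
    """Number of transactions that contain the itemset."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def is_member(itemset):
    """Membership oracle: True iff the itemset is frequent."""
    return support(itemset) >= THRESHOLD


def closure(itemset):
    """Intersection of all transactions that contain the itemset."""
    result = GROUND_SET
    for t in TRANSACTIONS:
        if itemset <= t:
            result &= t
    return result


def closed_sets_brute_force():
    """Apply the closure to every member of F (illustration only)."""
    family = [
        frozenset(c)
        for k in range(len(GROUND_SET) + 1)
        for c in combinations(sorted(GROUND_SET), k)
        if is_member(frozenset(c))
    ]
    return {closure(x) for x in family}
```

On this toy database the frequent itemsets are ∅, a, b, c, d, and ab, and they collapse to four closed sets: ∅, ab, c, and d.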

Results on Mining Closed Sets
Several positive complexity results in different communities, e.g.,
Formal Concept Analysis (Wille, ’82; Ganter & Wille, ’99)
  e.g., polynomial delay algorithm (Ganter & Reuter, ’91)
  assumption: F is the power set of E
Closed Frequent Itemset Mining (Pasquier, Bastide, Taouil, & Lakhal, ’99)
  e.g., incremental polynomial time algorithm (Boros, Gurvich, Khachiyan, & Makino, ’03)
  assumption: F is an independence system (closed under taking subsets)
Question for this talk: What about closed frequent pattern mining problems with set systems that are not even closed under intersection?

Example: Track Mining
Given a database of GPS-based recordings of spatio-temporal movements (tracks), list the set of closed frequent connected subgraphs of movements of people or cars in a street network.
Closed frequent connected subgraphs: ‘homogeneous’ connected subnetworks
Model:
  street network: undirected graph G = (V,E)
  tracks: subsets of E
  embedding operator: subset relation, easy to decide
  underlying set system: F = { X ⊆ E : X is frequent and connected }
F is not closed under intersection.

Example
frequency threshold = 1
F = { {a,b,c,d,e,f,g,h,k}, {a,i,j,k} }
The intersection {a,k} is not connected.
[Figure: graph with edges labeled a to k]

Generators and Inductive Generators
C: ρ-closed element, i.e., C ∈ ρ(F)
generator of C: X ∈ F such that C = ρ(X)
inductive generator of C: C’ ∪ {e} ∈ F such that
  - C’ is ρ-closed,
  - C = ρ(C’ ∪ {e})
  for some e ∈ E \ C’
Example: ρ(∅) = ∅, ρ(a) = ρ(ac) = ac, ρ(ab) = ρ(abd) = ρ(abcd) = abcd
  abcd has a generator (e.g., ab), but no inductive generator
  ac has an inductive generator (namely a)
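The example on this slide can be checked mechanically. The sketch below writes the closure operator out as a lookup table and assumes that the family F consists of exactly the six sets on which the slide defines the closure; the names `RHO`, `generators`, and `inductive_generators` are hypothetical.

```python
fs = frozenset

# Closure operator of the slide's example as a table.
# Assumption: F is exactly the set of keys listed here.
RHO = {
    fs(): fs(),
    fs("a"): fs("ac"),
    fs("ac"): fs("ac"),
    fs("ab"): fs("abcd"),
    fs("abd"): fs("abcd"),
    fs("abcd"): fs("abcd"),
}
F = set(RHO)
E = fs("abcd")
CLOSED = set(RHO.values())


def generators(c):
    """All X in F whose closure is c."""
    return {x for x in F if RHO[x] == c}


def inductive_generators(c):
    """All sets C' + {e} in F with C' closed, e outside C', and closure c."""
    found = set()
    for c_prime in CLOSED:
        for e in E - c_prime:
            candidate = c_prime | {e}
            if candidate in F and RHO[candidate] == c:
                found.add(candidate)
    return found
```

Calling `inductive_generators(fs("abcd"))` returns the empty set, while `inductive_generators(fs("ac"))` returns only the set a, matching the two claims on the slide.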

The Closed Set Mining (CSM) Problem
Lemma: The CSM problem can be solved with polynomial delay if the membership oracle and the closure operator can be computed in polynomial time and every ρ-closed set except ρ(∅) has an inductive generator.
Proof sketch:
  traverse the digraph of ρ-closed sets in depth-first manner
  (C’,C) is an edge iff there is an e ∈ E \ C’ such that C’ ∪ {e} is an inductive generator of C
  ρ-closed sets are stored in prefix trees
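The traversal in the proof sketch might look roughly like the following Python sketch. It is not the authors' implementation: a hash set stands in for the prefix tree, an explicit stack replaces recursion, and the toy itemset instance (database, threshold, helper names) is assumed purely for illustration. The traversal finds every closed set only when each closed set other than the closure of ∅ has an inductive generator.

```python
def list_closed_sets(ground_set, is_member, closure):
    """Yield every closed set reachable through inductive generators."""
    start = closure(frozenset())
    seen = {start}   # a hash set stands in for the prefix tree
    stack = [start]  # explicit stack for the depth-first traversal
    while stack:
        current = stack.pop()
        yield current
        for e in sorted(ground_set - current):
            candidate = current | {e}
            if not is_member(candidate):
                continue
            closed = closure(candidate)
            if closed not in seen:
                seen.add(closed)
                stack.append(closed)


# Hypothetical toy instance: closed frequent itemsets.
TRANSACTIONS = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")]
GROUND_SET = frozenset("abcd")
THRESHOLD = 2


def is_frequent(itemset):
    """Membership oracle for the family of frequent itemsets."""
    return sum(1 for t in TRANSACTIONS if itemset <= t) >= THRESHOLD


def intersect_supporting(itemset):
    """Closure: intersection of the transactions containing the itemset."""
    result = GROUND_SET
    for t in TRANSACTIONS:
        if itemset <= t:
            result &= t
    return result


CLOSED = set(list_closed_sets(GROUND_SET, is_frequent, intersect_supporting))
```

Between two consecutive outputs the loop makes at most |E| oracle calls and |E| closure computations per stack entry it inspects, which is the intuition behind the delay bound; the sketch itself makes no attempt to prove it.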

Main Result for Strongly Accessible Set Systems
A set system (E,F) is strongly accessible if ∅ ∈ F and for every X, Y ∈ F satisfying X ⊂ Y, there exists an e ∈ Y \ X such that X ∪ {e} ∈ F.
  hence there is a sequence X = X0, X1, …, Xk = Y in F s.t. |Xi \ Xi−1| = 1 for i = 1,…,k
Thm: For any finite strongly accessible set system (E,F) (i) given by a polynomial membership oracle and (ii) for any polynomially computable closure operator ρ: F → F, ρ(F) can be listed with polynomial delay.
Proof sketch:
  show that every ρ-closed set has an inductive generator
  apply the previous lemma
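For small, explicitly listed families, the definition of strong accessibility can be tested by brute force. The sketch below does so for an assumed example that is not from the slides: the edge sets of connected subgraphs of a 4-cycle with edges a, b, c, d in cyclic order. The function names are hypothetical.

```python
from itertools import combinations


def is_strongly_accessible(family):
    """Check the definition directly on an explicitly given family."""
    family = set(family)
    if frozenset() not in family:
        return False
    for x in family:
        for y in family:
            # For every proper subset pair, some single element of y - x
            # must extend x to another member of the family.
            if x < y and not any(x | {e} in family for e in y - x):
                return False
    return True


def is_closed_under_intersection(family):
    """True iff the intersection of any two members is again a member."""
    family = set(family)
    return all((x & y) in family for x, y in combinations(family, 2))


# Assumed example: connected edge sets (arcs) of a 4-cycle, plus the empty set.
CYCLE = "abcd"
ARCS = {frozenset()} | {
    frozenset(CYCLE[(start + i) % 4] for i in range(length))
    for start in range(4)
    for length in range(1, 5)
}
```

Here `is_strongly_accessible(ARCS)` holds, while `is_closed_under_intersection(ARCS)` fails because the arcs abc and cda intersect in the disconnected set {a, c}. This is the kind of set system the theorem covers and the earlier results do not.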

Appl. 1: Closed Frequent Itemset Mining
Thm: The closed frequent itemset mining problem can be solved with polynomial delay.
Proof sketch:
  the family of frequent itemsets is an independence system ⇒ strongly accessible
  frequency can be decided in polynomial time ⇒ the set system is given by a polynomial membership oracle
  closure operator: ρ(X) = intersection of the transactions containing X; it can be computed in polynomial time

Appl. 2: Closed Frequent Connected Subgraph Mining
Given an undirected graph G = (V,E), a transaction database D of subgraphs of G, and an integer frequency threshold t > 0, list the family of closed frequent connected subgraphs of D.
Thm: The above problem can be solved with polynomial delay.
Proof sketch:
  F: set of frequent connected subgraphs of G; not closed under intersection
  F is strongly accessible
  membership is decidable in polynomial time
  closure of a frequent connected subgraph X: the largest connected supergraph of X in the intersection of the transactions containing X
  it is indeed a closure operator and can be computed in polynomial time
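The closure described in the proof sketch can be illustrated as follows. This is a sketch under assumptions: subgraphs are represented as sets of edges, each edge is a pair of vertex labels, the input X is non-empty, connected, and contained in at least one transaction, and the toy database `DB` is invented for the example.

```python
def connected_closure(x, transactions):
    """Largest connected edge set containing x inside the intersection
    of all transactions that contain x (x must be non-empty and supported)."""
    supporting = [t for t in transactions if x <= t]
    common = frozenset.intersection(*supporting)
    component = set(x)
    vertices = {v for edge in x for v in edge}
    changed = True
    while changed:
        changed = False
        # Add every common edge that touches the current component.
        for edge in common - component:
            if vertices & set(edge):
                component.add(edge)
                vertices |= set(edge)
                changed = True
    return frozenset(component)


# Hypothetical database: two transactions over the edges of a small graph.
DB = [
    frozenset({(1, 2), (2, 3), (5, 6)}),
    frozenset({(1, 2), (2, 3), (3, 4), (5, 6)}),
]
```

For X = {(1, 2)}, both transactions support X and share the edges (1, 2), (2, 3), and (5, 6); the closure keeps only (1, 2) and (2, 3), because (5, 6) lies in a different connected component.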

Closed Frequent Connected Subgraph Mining Example: frequency threshold = 2 …

Appl. 3: Closed Frequent Subpath Mining
Data mining definition: a path P is closed frequent if it is frequent and has strictly larger support than any path P’ containing P.
There is no closure operator corresponding to this definition.
Example: D = { abc }, frequency threshold = 1
  F = { ∅, a, b, c, ab, ac, bc }
  closed 1-frequent paths: C = { ab, ac, bc }
  suppose there is a closure operator ρ s.t. ρ(F) = C
  because of extensivity, ρ(a) must be ab or ac, say ab
  ρ(a) = ab is not a subset of ρ(ac) = ac, contradicting monotonicity
[Figure: graph with edges a, b, c]
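The non-existence argument can also be confirmed exhaustively, since there are only 3^7 maps from F to C. The following sketch enumerates them and tests the three closure axioms (extensivity, monotonicity, idempotence); the helper names are hypothetical.

```python
from itertools import product

fs = frozenset
F = [fs(), fs("a"), fs("b"), fs("c"), fs("ab"), fs("ac"), fs("bc")]
C = [fs("ab"), fs("ac"), fs("bc")]


def is_closure_operator(rho):
    """Check extensivity, monotonicity, and idempotence on F."""
    extensive = all(x <= rho[x] for x in F)
    monotone = all(rho[x] <= rho[y] for x in F for y in F if x <= y)
    idempotent = all(rho[rho[x]] == rho[x] for x in F)
    return extensive and monotone and idempotent


def closure_operators_onto(targets):
    """All closure operators on F whose image is exactly `targets`."""
    found = []
    for images in product(targets, repeat=len(F)):
        rho = dict(zip(F, images))
        if set(images) == set(targets) and is_closure_operator(rho):
            found.append(rho)
    return found
```

`closure_operators_onto(C)` comes back empty, in line with the slide's argument: whichever of ab or ac is chosen for the closure of a, monotonicity fails against the other.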

Appl. 3: Closed Frequent Subpath Mining
Alternative definition: let P be a path in G
  compute the intersection G_P of the transactions containing P
  return the intersection of the maximal paths in G_P that contain P
Example: D = { abc }, frequency threshold = 1
  F = { ∅, a, b, c, ab, ac, bc }
  closed 1-frequent paths: C’ = { a, b, c, ab, ac, bc }
Thm: The set of closed frequent paths w.r.t. the alternative definition can be listed with polynomial delay.
[Figure: graph with edges a, b, c]

An Open Problem
Accessible set systems: for all X ∈ F \ {∅} there is an e ∈ X such that X \ {e} ∈ F
  hence there is a sequence ∅ = X0, X1, …, Xk = X in F s.t. |Xi \ Xi−1| = 1 for i = 1,…,k
  in the earlier example, abcd has no inductive generator
Question: Can the positive result on strongly accessible set systems be generalized to accessible set systems?
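The gap between the two notions shows up already on the six-set example from the slide on generators. The sketch below assumes that F is exactly the family on which that slide defines the closure operator; the function names are hypothetical.

```python
fs = frozenset

# Assumption: F is the six-set family from the generator example.
F = {fs(), fs("a"), fs("ab"), fs("ac"), fs("abd"), fs("abcd")}


def is_accessible(family):
    """Every non-empty member can lose one element and stay in the family."""
    return all(any(x - {e} in family for e in x) for x in family if x)


def is_strongly_accessible(family):
    """Every member can grow by one element towards any larger member."""
    return fs() in family and all(
        any(x | {e} in family for e in y - x)
        for x in family
        for y in family
        if x < y
    )
```

The family is accessible but not strongly accessible: ac is contained in abcd, yet neither abc nor acd belongs to F. This is the situation in which abcd ends up without an inductive generator.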