Approach to Data Mining from Algorithm and Computation. Takeaki Uno (ETH, Switzerland / NII, Japan), Hiroki Arimura (Hokkaido University, Japan).

Similar presentations
Mining Association Rules

Indexing DNA Sequences Using q-Grams
gSpan: Graph-based substructure pattern mining
Frequent Closed Pattern Search By Row and Feature Enumeration
LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets (Linear time Closed itemset Miner). Takeaki Uno, Tatsuya Asai, Hiroaki Arimura, Yuzo.
FP-Growth algorithm Vasiljevic Vladica,
FP (FREQUENT PATTERN)-GROWTH ALGORITHM ERTAN LJAJIĆ, 3392/2013 Elektrotehnički fakultet Univerziteta u Beogradu.
1 of 25 1 of 45 Association Rule Mining CIT366: Data Mining & Data Warehousing Instructor: Bajuna Salehe The Institute of Finance Management: Computing.
Data Mining Association Analysis: Basic Concepts and Algorithms
Rakesh Agrawal Ramakrishnan Srikant
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
732A02 Data Mining - Clustering and Association Analysis ………………… Jose M. Peña Association rules Apriori algorithm FP grow algorithm.
Association Analysis (2). Example TIDList of item ID’s T1I1, I2, I5 T2I2, I4 T3I2, I3 T4I1, I2, I4 T5I1, I3 T6I2, I3 T7I1, I3 T8I1, I2, I3, I5 T9I1, I2,
FP-growth. Challenges of Frequent Pattern Mining Improving Apriori Fp-growth Fp-tree Mining frequent patterns with FP-tree Visualization of Association.
Data Mining Association Analysis: Basic Concepts and Algorithms
1 Mining Frequent Patterns Without Candidate Generation Apriori-like algorithm suffers from long patterns or quite low minimum support thresholds. Two.
Association Analysis: Basic Concepts and Algorithms.
Association Rule Mining. Generating assoc. rules from frequent itemsets  Assume that we have discovered the frequent itemsets and their support  How.
Data Mining Association Analysis: Basic Concepts and Algorithms
Efficient Data Mining for Path Traversal Patterns CS401 Paper Presentation Chaoqiang chen Guang Xu.
Fast Algorithms for Association Rule Mining
Mining Association Rules
Mining Association Rules
1 Efficiently Mining Frequent Trees in a Forest Mohammed J. Zaki.
FAST FREQUENT FREE TREE MINING IN GRAPH DATABASES Marko Lazić 3335/2011 Department of Computer Engineering and Computer Science,
Chapter 5 Mining Association Rules with FP Tree Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2010.
Mining Frequent Itemsets with Constraints Takeaki Uno Takeaki Uno National Institute of Informatics, JAPAN Nov/2005 FJWCP.
Ch5 Mining Frequent Patterns, Associations, and Correlations
VLDB 2012 Mining Frequent Itemsets over Uncertain Databases Yongxin Tong 1, Lei Chen 1, Yurong Cheng 2, Philip S. Yu 3 1 The Hong Kong University of Science.
Sequential PAttern Mining using A Bitmap Representation
Ambiguous Frequent Itemset Mining and Polynomial Delay Enumeration May/25/2008 PAKDD 2008 Takeaki Uno (1), Hiroki Arimura (2) (1) National Institute of.
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.
AR mining Implementation and comparison of three AR mining algorithms Xuehai Wang, Xiaobo Chen, Shen chen CSCI6405 class project.
Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
EFFICIENT ITEMSET EXTRACTION USING IMINE INDEX By By U.P.Pushpavalli U.P.Pushpavalli II Year ME(CSE) II Year ME(CSE)
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura
LCM ver.2: Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets Takeaki Uno Masashi Kiyomi Hiroki Arimura National Institute of Informatics,
Mining Frequent Patterns without Candidate Generation.
An Efficient Algorithm for Enumerating Pseudo Cliques Dec/18/2007 ISAAC, Sendai Takeaki Uno National Institute of Informatics & The Graduate University.
Mining Frequent Patterns without Candidate Generation : A Frequent-Pattern Tree Approach 指導教授:廖述賢博士 報 告 人:朱 佩 慧 班 級:管科所博一.
An Efficient Polynomial Delay Algorithm for Pseudo Frequent Itemset Mining 2/Oct/2007 Discovery Science 2007 Takeaki Uno (National Institute of Informatics)
Parallel Mining Frequent Patterns: A Sampling-based Approach Shengnan Cong.
Frequent Item Mining. What is data mining? =Pattern Mining? What patterns? Why are they useful?
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
Outline Introduction – Frequent patterns and the Rare Item Problem – Multiple Minimum Support Framework – Issues with Multiple Minimum Support Framework.
LCM ver.3: Collaboration of Array, Bitmap and Prefix Tree for Frequent Itemset Mining Takeaki Uno Masashi Kiyomi Hiroki Arimura National Institute of Informatics,
New Algorithms for Enumerating All Maximal Cliques
Speeding Up Enumeration Algorithms with Amortized Analysis Takeaki Uno (National Institute of Informatics, JAPAN)
Association Analysis (3)
Detailed Description of an Algorithm for Enumeration of Maximal Frequent Sets with Irredundant Dualization I rredundant B order E numerator Takeaki Uno.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Graph Indexing From managing and mining graph data.
1 Data Mining Lecture 6: Association Analysis. 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Mining Complex Data COMP Seminar Spring 2011.
Fast Algorithms for BIG DATA (title means “I make slides according to the interests of audience ) 14/Jan/2012 NII Shonan-meeting (open problem seminar)
Computational Challenges in BIG DATA 28/Apr/2012 China-Korea-Japan Workshop Takeaki Uno National Institute of Informatics & Graduated School for Advanced.
Gspan: Graph-based Substructure Pattern Mining
Mining Frequent Itemsets over Uncertain Databases
Association Rule Mining
Mining Complex Data COMP Seminar Spring 2011.
COMP5331 FP-Tree Prepared by Raymond Wong Presented by Raymond Wong
Mining Frequent Patterns without Candidate Generation
Frequent-Pattern Tree
Market Basket Analysis and Association Rules
FP-Growth Wenlong Zhang.
Output Sensitive Enumeration
Presentation transcript:

Approach to Data Mining from Algorithm and Computation. Takeaki Uno (ETH, Switzerland / NII, Japan), Hiroki Arimura (Hokkaido University, Japan)

Frequent Pattern Mining
・ Data mining is an important tool for the analysis of data in many scientific and industrial areas.
・ The aim of data mining is to find something interesting or valuable.
・ But we do not know in advance what is interesting or what is valuable.
・ So we give criteria that interesting or valuable things should satisfy, and find all patterns satisfying them.

Image of Pattern Mining
・ Pattern mining is the problem of finding all patterns in a given (possibly structured) database that satisfy the given constraints.
[figure: interesting patterns extracted from chemical compounds (C, H, O, N graphs) and from an XML database of person records (name, age, phone, family)]
・ Frequent pattern mining is the enumeration problem of finding all patterns appearing frequently, i.e., at least a given threshold number of times, in the database.

Approach from…
・ In the real world, the input database is usually huge, and the number of output patterns can be huge as well, so efficient computation is very important.
・ Much research has been done, but much of it is based on databases, data engineering, and modeling rather than on algorithms. Ex.) how to compress the data, how to execute queries fast, which model is good, etc.
・ Here we want to separate the problems: from the algorithmic view, what is important, and what can we do?

Distinguish the Focus, Problems
・ "My algorithm is very fast for these datasets", but the data is very artificial, or includes few items…
・ "The algorithm might not work for huge datasets": it is difficult to be fast for both small and huge inputs.
・ We would like to distinguish the techniques and problems:
- scalability
- I/O
- huge datasets
- data compression
・ These techniques should be "orthogonal".

Approach from…
・ Much research has been done, but much of it is based on databases, data engineering, and modeling rather than on algorithms. Ex.) how to compress the data, how to execute queries fast, which model is good, etc.
・ Here we view the problems as enumeration problems, and try to clarify what kinds of techniques are important for efficient computation, with examples from itemset mining.
[figure: the solvable models sit inside the good models; which techniques enlarge the solvable models toward the good models?]

From the Algorithm Theory…
・ Here we focus only on algorithms, so the topics are:
- output-sensitive computation time (bad if the algorithm takes a long time to produce a small output)
- memory use should depend only on the input size
- computation time per iteration
- reducing the input of each iteration (this is very important!)

From the Algorithm Theory…
・ Here we focus only on the case that the input fits in memory:
- scalability: output-sensitive computation time (bad if the algorithm takes a long time to produce a small output)
- memory use should depend only on the input size
- computation time per iteration
- reducing the input of each iteration, via bottom-wideness (this is very important!)
TIME = (#iterations) × (time per iteration) + I/O

Bottom Wideness
・ Enumeration algorithms usually have a recursive tree structure, with many iterations at the deeper levels. [figure: recursion tree; node size = time; a procedure reduces the input before the recursive calls]
・ The total computation time is halved by even a single reduction of the input before the recursive calls.
・ Recursively reducing the input reduces the computation time greatly.

Advantage of Bottom Wideness
・ Suppose the recursion tree has exponentially many iterations at the lower levels, e.g., 2 × (#iterations at level i) ≦ (#iterations at level i+1). [figure: recursion tree with O(n³) work at the root and O(1) at the leaves] Then the amortized computation time is O(1) per output!

Advantage of Bottom Wideness
・ Likewise, with O(n⁵) work at the root and O(n) work at the leaves, the amortized computation time is O(n) per output!
・ The computation time per output depends only on the bottom levels: reduce the computation time on the lower levels by reducing the input.
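To make the amortization explicit, here is a short derivation (our sketch, under the slide's assumption that each level has at least twice as many iterations as the one above):

```latex
% #level_{i+1} >= 2 * #level_i implies #level_i <= L / 2^{d-i}, where L is
% the number of bottom-level iterations (roughly the outputs) and d the depth.
\[
  \text{total iterations} \;=\; \sum_{i=0}^{d} \#\text{level}_i
  \;\le\; \sum_{i=0}^{d} \frac{L}{2^{\,d-i}} \;\le\; 2L .
\]
% So if every iteration is made as cheap as a bottom-level one (by shrinking
% the input while recursing), the total time is O(L * T_bottom), i.e.
% O(T_bottom) amortized per output (O(n) in the example above).
```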

Frequent Itemset Mining
・ A transaction database D is a database composed of transactions defined over an itemset E, i.e., ∀t ∈ D, t ⊆ E. Examples: basket data, links of web pages, words in documents.
Ex.) D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
・ A subset P of E is called an itemset.
・ An occurrence of P is a transaction in D including P; the denotation Occ(P) of P is the set of occurrences of P.
・ |Occ(P)| is called the frequency of P.
Ex.) the denotation of {1,2} is { {1,2,5,6,7,9}, {1,2,7,8,9} }
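To make the definitions concrete, here is a minimal Python sketch on the example database (the names occ and frequency are ours, not from the slides):

```python
# Toy transaction database from the slide: six transactions over items 1..9.
D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9},
     {1, 7, 9}, {2, 7, 9}, {2}]

def occ(P, D):
    """Denotation Occ(P): the transactions (here, their indices) including P."""
    P = set(P)
    return [i for i, t in enumerate(D) if P <= t]

def frequency(P, D):
    """Frequency of P = |Occ(P)|."""
    return len(occ(P, D))

print(occ({1, 2}, D))        # [0, 2]: transactions {1,2,5,6,7,9} and {1,2,7,8,9}
print(frequency({1, 2}, D))  # 2
```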

Frequent Itemset
・ Given a minimum support θ, a frequent itemset is an itemset whose frequency is ≧ θ (a subset of items included in at least θ transactions).
Ex.) for the database D above with θ = 3, the itemsets included in at least 3 transactions are:
{1}, {2}, {7}, {9}, {1,7}, {1,9}, {2,7}, {2,9}, {7,9}, {1,7,9}, {2,7,9}

Techniques for Efficient Mining
・ There are many techniques for fast mining:
- for the search strategy: apriori, backtracking
- for speeding up iterations: down project, pruning by infrequent subsets, bitmap, occurrence deliver
- for database reduction and bottom-wideness: FP-tree (trie, prefix tree), filtering (unification), conditional (projected) database, trimming of the database

Search Strategies
・ The frequent itemsets form a connected component of the itemset lattice. [figure: the lattice over {1,2,3,4} with the frequent itemsets marked]
- Apriori algorithms generate itemsets level by level: they allow pruning by infrequent subsets, but use much memory.
- Backtracking algorithms generate itemsets in a depth-first manner: they use little memory, and match down project, etc.
・ Apriori takes a long time and much memory when the output is large.

Backtracking
Apriori:
  k := 0;  O_k := {φ}
  while O_k ≠ φ:
    for each P ∪ e with P ∈ O_k:
      if all P ∪ e − f ∈ O_k then compute Occ(P ∪ e)
      if |Occ(P ∪ e)| ≧ θ then insert P ∪ e into O_{k+1}
    k := k + 1
Backtracking:
  Backtrack(P, Occ(P)):
    for each e > tail(P):
      compute Occ(P ∪ e)
      if |Occ(P ∪ e)| ≧ θ then Backtrack(P ∪ e, Occ(P ∪ e))
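A runnable Python rendering of the backtracking scheme above (a sketch under our conventions: items are integers, tail(P) is the largest item of P, and the empty itemset is reported too):

```python
def backtrack(P, occ_P, D, items, theta, out):
    """Output P, then extend it with items larger than tail(P)."""
    out.append(sorted(P))
    tail = max(P) if P else float("-inf")
    for e in items:
        if e <= tail:
            continue
        # down project: Occ(P ∪ {e}) = Occ(P) ∩ Occ({e})
        occ_Pe = [i for i in occ_P if e in D[i]]
        if len(occ_Pe) >= theta:
            backtrack(P | {e}, occ_Pe, D, items, theta, out)

D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9},
     {1, 7, 9}, {2, 7, 9}, {2}]
items = sorted(set().union(*D))
out = []
backtrack(set(), list(range(len(D))), D, items, 3, out)
print(out)  # the 11 frequent itemsets of the example (plus the empty set)
```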

Speeding Up Iterations
・ The bottleneck in an iteration is computing Occ(P ∪ e). For a database with n items and m transactions:
- down project: Occ(P ∪ e) = Occ(P) ∩ Occ(e), in O(||D_≧P||) time, where ||D_≧P|| is the size of the part of the database with items larger than tail(P)
- pruning by infrequent subsets: adds |P| search queries, giving O(c × ||D_≧P||) time
- bitmap: compute Occ(P) ∩ Occ(e) by AND operations, in (n − tail(P)) × m/32 word operations
- occurrence deliver: compute Occ(P ∪ e) for all e by one scan of D(P)_≧P, in O(||D(P)_≧P||) time, where D(P) is the set of transactions including P
・ Bitmap is slow if the database is sparse; pruning is slow for huge outputs; occurrence deliver is fast if the threshold (minimum support) is small.
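As one illustration, the bitmap technique maps naturally onto Python integers used as bitsets (a sketch; a real implementation would operate on 32-bit machine words as the slide's operation count suggests):

```python
D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9},
     {1, 7, 9}, {2, 7, 9}, {2}]

# bit i of bitmap[e] is set iff transaction i contains item e
bitmap = {}
for i, t in enumerate(D):
    for e in t:
        bitmap[e] = bitmap.get(e, 0) | (1 << i)

def occ_bits(P):
    """Occ(P) as a bitset: AND together the per-item bitmaps."""
    bits = (1 << len(D)) - 1          # start from all transactions
    for e in P:
        bits &= bitmap[e]
    return bits

print(bin(occ_bits({7, 9})))             # 0b11101: transactions 0, 2, 3, 4
print(bin(occ_bits({7, 9})).count("1"))  # frequency 4
```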

Occurrence Deliver
・ Compute the denotations of P ∪ {i} for all items i at once by one scan of the occurrences of P.
Ex.) D = { A: {1,2,5,6,7,9}, B: {2,3,4,5}, C: {1,2,7,8,9}, D: {1,7,9}, E: {2,7,9}, F: {2} }; for P = {1,7}, scanning its occurrences A, C, D delivers Occ({1,7,8}) = {C} and Occ({1,7,9}) = {A, C, D}.
・ This checks the frequency of every item that can be added, in time linear in the database size.
・ By generating the recursive calls in reverse direction, we can re-use the memory.
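A sketch of occurrence deliver in Python (names are ours): one scan over the occurrences of P fills a bucket for each item beyond tail(P), yielding all denotations Occ(P ∪ {e}) at once:

```python
def occurrence_deliver(P, occ_P, D):
    """One scan of D(P) computes Occ(P ∪ {e}) for every e > tail(P)."""
    tail = max(P) if P else float("-inf")
    buckets = {}                      # item e -> occurrences of P ∪ {e}
    for i in occ_P:
        for e in D[i]:
            if e > tail:
                buckets.setdefault(e, []).append(i)
    return buckets

D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9},
     {1, 7, 9}, {2, 7, 9}, {2}]
# P = {1,7} occurs in transactions 0 (A), 2 (C), 3 (D)
print(occurrence_deliver({1, 7}, [0, 2, 3], D))
# {9: [0, 2, 3], 8: [2]}: at θ = 3 only the extension by 9 stays frequent
```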

Database Reductions
・ A conditional database reduces the database passed to the deeper levels by removing unnecessary items and transactions:
- filtering: remove infrequent items and items included in all transactions (linear time)
- unification: unify identical transactions, keeping multiplicities (O(||D|| log ||D||) time)
- FP-tree (prefix tree): infrequent items removed, shared prefixes automatically unified; compact if the database is dense and large
Ex.) with θ = 3, the database { {1,3,5}, {1,2,5,6}, {1,4,6}, {1,2,6}, {3,5}, {5,6}, {6} } shrinks to a handful of weighted transactions after filtering and unification.
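A simplified conditional-database builder in Python (our sketch: it performs the filtering and unification steps, keeping multiplicities as weights, and omits the removal of items included in all transactions for brevity):

```python
from collections import Counter

def reduce_database(D, theta, weights=None):
    """Drop items with weighted frequency < theta, then merge equal transactions."""
    weights = weights or [1] * len(D)
    freq = Counter()
    for t, w in zip(D, weights):
        for e in t:
            freq[e] += w
    merged = Counter()
    for t, w in zip(D, weights):
        merged[frozenset(e for e in t if freq[e] >= theta)] += w
    return list(merged.keys()), list(merged.values())

D = [{1, 3, 5}, {1, 2, 5, 6}, {1, 4, 6}, {1, 2, 6}, {3, 5}, {5, 6}, {6}]
ts, ws = reduce_database(D, 3)
for t, w in zip(ts, ws):
    print(sorted(t), "x", w)   # e.g. [1, 6] x 2 after items 2, 3, 4 are dropped
```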

Summary of Techniques
・ The database stays dense and large even at the bottom levels of the computation when the support is large; the number of output solutions is huge when the support is small.
Prediction:
- apriori will be slow when the support is small
- conditional databases are fast when the support is small
- bitmap will be slow for sparse datasets
- FP-tree will be a bit slow for sparse datasets, and fast for large supports

Results from FIMI 04 (sparse datasets)
[figure: running times of apriori, bitmap, conditional-database, and FP-tree implementations; the gap between conditional database and FP-tree reflects O(n) vs. O(n log n) reduction cost]
・ Conditional database is good; bitmap is slow. FP-tree wins for large supports, occurrence deliver for small supports.

Results on Dense Datasets
[figure: running times of apriori, bitmap, FP-tree, and conditional-database implementations; #nodes in the FP-tree ≈ (||D|| after filtering) / 6]
・ Apriori is still slow for middle supports; FP-tree is good.

Summary on Computation
・ We can understand the reasons for efficiency from the algorithmic view:
- reduce the input of each iteration, exploiting bottom-wideness
- reduce the computation time per iteration (probably a combination of conditional database, Patricia tree, and occurrence deliver will be good)
・ We can make similar observations for other pattern mining problems: sequence mining, string mining, tree mining, graph mining, etc.
・ Next we look at closed patterns, which represent groups of similar patterns; we begin with itemsets.

Modeling: Closed Itemsets [Pasquier et al. 1999]
・ Usually the number of frequent itemsets is huge when we mine deeply, so we want to reduce the number of itemsets in some way.
・ There are many ways to do this, e.g., scoring, grouping similar itemsets, looking at other parameters, etc., but we would like an approach from theory; here we introduce closed patterns.
・ Consider the itemsets having the same denotation: we would say they carry the same information.
・ We focus only on the maximal one among them, called the closed pattern (= the intersection of the occurrences in the denotation).
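The closure operation is a direct transcription of this definition (a minimal sketch; the function name closure is ours):

```python
def closure(P, D):
    """The closed itemset with the same denotation as P: intersect Occ(P)."""
    occ_P = [t for t in D if set(P) <= t]
    result = set(occ_P[0])            # assumes P occurs at least once
    for t in occ_P[1:]:
        result &= t
    return result

D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9},
     {1, 7, 9}, {2, 7, 9}, {2}]
print(closure({1}, D))  # {1, 7, 9}: every transaction with 1 also contains 7 and 9
```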

Example of Closed Itemset
・ For the database D above with θ = 3, there are 11 frequent itemsets ({1}, {2}, {7}, {9}, {1,7}, {1,9}, {2,7}, {2,9}, {7,9}, {1,7,9}, {2,7,9}) but only 4 frequent closed itemsets ({2}, {7,9}, {1,7,9}, {2,7,9}).
・ In general, #frequent closed itemsets ≦ #frequent itemsets, and "<<" holds if the database has some structure. (Databases with structure tend to have huge numbers of frequent itemsets, so this is an advantage.)

Difference of #itemsets
[figure: #frequent closed itemsets << #frequent itemsets when the threshold θ is small]

Closure Extension of Itemset
・ Usual backtracking does not work for closed itemsets, because there can be big gaps between closed itemsets in the lattice. [figure: the itemset lattice with the closed itemsets marked]
・ On the other hand, any closed itemset is obtained from another one by adding an item and taking the closure (the maximal itemset with the same denotation); the closure of P is computed by intersecting the transactions in Occ(P).
・ This is an adjacency structure defined on the closed itemsets, so we can perform graph search on it, at the cost of using memory.

PPC Extension
・ Closure extension gives us an acyclic adjacency structure, but that is not enough for a memory-efficient algorithm (we would need to store the discovered itemsets in memory).
・ We introduce the PPC (prefix-preserving closure) extension to obtain a tree structure: a closure extension P' obtained from P + e is a PPC extension iff P' and P have the same prefix (the items smaller than e).
・ Any closed itemset is the PPC extension of exactly one other closed itemset.
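Putting the pieces together, here is a compact, unoptimized enumeration by PPC extensions in Python (our reading of the LCM scheme, not the authors' implementation; core is the item whose addition generated P, and the prefix comparison is the PPC condition):

```python
def closure_occ(P, D):
    """Closure of P together with its occurrence list."""
    occ_P = [t for t in D if P <= t]
    Q = set(occ_P[0])
    for t in occ_P[1:]:
        Q &= t
    return Q, occ_P

def lcm(P, core, D, items, theta, out):
    """Depth-first search over PPC extensions: each closed itemset is hit once."""
    out.append(sorted(P))
    for e in items:
        if e <= core or e in P:
            continue
        if sum(1 for t in D if (P | {e}) <= t) < theta:
            continue
        Q, _ = closure_occ(P | {e}, D)
        # PPC condition: the closure must not change the prefix below e
        if {i for i in Q if i < e} == {i for i in P if i < e}:
            lcm(Q, e, D, items, theta, out)

D = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9},
     {1, 7, 9}, {2, 7, 9}, {2}]
items = sorted(set().union(*D))
out = []
root, _ = closure_occ(set(), D)       # root of the tree: closure of the empty set
lcm(root, 0, D, items, 3, out)
print(out)  # [[], [1, 7, 9], [2], [2, 7, 9], [7, 9]]
```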

Example of PPC Extension
・ Closure extension gives an acyclic structure; PPC extension gives a tree.
[figure: for the database D above, the closed itemsets φ, {2}, {2,5}, {7,9}, {1,7,9}, {2,7,9}, {1,2,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9}, drawn with closure-extension edges (acyclic) and PPC-extension edges (a tree)]

Example of PPC Extension
・ (1,2,7,9), (1,2,7), and (1,2) are included in {1,2,5,6,7,9} and {1,2,7,8,9}, so they share the closure {1,2,7,9}; (1,7,9), (1,7), and (1) are included in {1,7,9}, {1,2,5,6,7,9}, and {1,2,7,8,9}, so they share the closure {1,7,9}.

For Efficient Computation
・ Computing the closure takes a long time.
・ We use database reduction, based on the fact that if P' is a PPC extension of P by P + e, and P'' is a PPC extension of P' by P' + f, then e < f; thus the prefix is used only for the intersection!
Ex.) extending with e = 5 in the database { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,5,7,8,9}, {1,5,6,7}, {2,5,7,9}, {2,3,5,6,7,8} }, the occurrences can be trimmed and unified with multiplicities (e.g., {2,5,7,9} ×2, {5,6,7} ×2).

Experiment: vs. Frequent Itemset (sparse)
[figure: computation time per itemset is very stable; there is no big difference in computation time between frequent and closed itemset mining]

Experiment: vs. Frequent Itemset (dense)
[figure: computation time per itemset is very stable; there is no big difference in computation time between frequent and closed itemset mining]

Compare to Other Methods
・ There are roughly two methods to enumerate closed patterns:
- frequent pattern base: enumerate all frequent patterns and output only the closed ones (with some pruning), checking closedness by keeping all discovered itemsets in memory
- closure base: compute closed patterns by the closure operation, avoiding duplication by keeping all discovered itemsets in memory
・ If the number of solutions is small, the frequent pattern base is fast, since the search for checking closedness takes very little time.

vs. Other Implementations (sparse)
[figure: for large minimum supports the frequent-pattern-based methods win; for small minimum supports PPC extension wins]

vs. Other Implementations (dense)
[figure: for small minimum supports, PPC extension and database reduction are good]

Extend Closed Patterns
・ There are several mining problems for which we can introduce closed patterns (the union of occurrences is unique!):
- ranked trees (labeled trees without siblings of the same label)
- motifs (strings with wildcards), e.g., the motif AB??EF?H matches ABCDEFGH and ABZZEFZH
・ For these problems, PPC extension also works, similarly combined with conditional databases and occurrence deliver.

Conclusion
・ We overviewed the techniques for frequent pattern mining as enumeration algorithms, and showed that the complexity of one iteration and bottom-wideness are important.
・ We showed that the closed pattern is probably a valuable model, and that closed patterns can be enumerated efficiently.
Future works:
・ Develop efficient algorithms and implementations for other basic mining problems.
・ Extend the class of problems in which closed patterns work well.