MINING COLOSSAL FREQUENT PATTERNS BY CORE PATTERN FUSION FEIDA ZHU, XIFENG YAN, JIAWEI HAN, PHILIP S. YU, HONG CHENG ICDE'07 Advisor: Koh JiaLing Speaker: Li HueiJyun Date: 2007/12/11 1

OUTLINE Introduction Pattern Fusion: Design and Overview Preliminaries Robustness of colossal patterns Pattern fusion overview Towards Colossal Patterns Why are colossal patterns favored? Catching the outliers Pattern-Fusion in Detail Evaluation Model Experimental Results Conclusions 2

INTRODUCTION The frequent pattern mining problem, even for frequent itemset mining, has not been completely solved, for the following reason: by the definition of frequent patterns, any subset of a frequent itemset is itself frequent. This downward-closure property makes the complete answer set exponentially large, and thus defeats any mining algorithm that attempts to report it in full. 3
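To see the scale of the problem, a quick sketch (not from the slides) counting how many frequent patterns a single size-20 frequent itemset forces into the complete answer set:

```python
from itertools import combinations

def count_frequent_subsets(n: int) -> int:
    # By downward closure, every nonempty subset of a size-n frequent
    # itemset is itself frequent, so that one itemset alone accounts for
    # 2**n - 1 patterns in the complete answer set.
    return sum(1 for k in range(1, n + 1) for _ in combinations(range(n), k))

print(count_frequent_subsets(20))  # 1048575, i.e. 2**20 - 1
```

This is why any algorithm that enumerates the full answer set is doomed once moderately large frequent itemsets exist.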

INTRODUCTION Mining tasks in practice usually attach much greater importance to patterns that are large in size. E.g., in bioinformatics, longer sequences are usually more significant than shorter ones. We call these large patterns colossal patterns, as distinguished from patterns that merely have a large support set. 4

INTRODUCTION When the complete mining result set is prohibitively large, yet only the colossal patterns are of real interest and there are, as in most cases, merely a few of them, it is inefficient to wait for the mining algorithm to finish running: it gets "trapped" enumerating the mid-sized patterns. 5

INTRODUCTION Example: Start with a 40 × 40 square table, each row being the integers from 1 to 40 in increasing order. Remove the integer on the diagonal from each row; this gives a 40 × 39 table, Diag40. Add to Diag40 20 identical rows, each being the integers 41 to 79 in increasing order, to get a 60 × 39 table. Take each row as a transaction and set the support threshold at 20. The table has an exponential number (i.e., C(40, 20) ≈ 1.4 × 10^11) of mid-sized closed/maximal frequent patterns of size 20, but only one that is colossal: α = (41, 42, …, 79) of size 39. 6
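The example can be checked directly; a minimal sketch of the construction:

```python
# A sketch of the slide's construction: Diag40 is the 40 x 40 table of
# rows {1..40} with the diagonal element removed, extended by 20 identical
# rows {41..79}; each row is one transaction.
def build_diag_dataset():
    diag40 = [set(range(1, 41)) - {i} for i in range(1, 41)]  # 40 rows, 39 items each
    extra = [set(range(41, 80)) for _ in range(20)]           # 20 rows, 39 items each
    return diag40 + extra

def support(dataset, pattern):
    # Support = number of transactions containing every item of the pattern.
    return sum(1 for t in dataset if pattern <= t)

data = build_diag_dataset()
alpha = set(range(41, 80))   # the single colossal pattern, |alpha| = 39
beta = set(range(1, 21))     # a typical mid-sized pattern of size 20
print(support(data, alpha), support(data, beta))  # both exactly 20
```

A size-20 subset of {1..40} is contained exactly in the 20 rows whose removed diagonal element lies outside it, so every one of the C(40, 20) such subsets meets the threshold, while adding any 21st item drops the support to 19.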

INTRODUCTION The inherent mining model examines candidates by implicitly or explicitly traversing a search tree in either breadth-first or depth-first order. When the search tree is exponential in size at some level, such exhaustive traversal runs with exponential time complexity. 7

PATTERN-FUSION: DESIGN AND OVERVIEW PRELIMINARIES I: a set of items {o_1, o_2, …, o_d}. A nonempty subset of I is called an itemset. A transaction dataset D is a collection of itemsets, D = {t_1, …, t_n}, where each t_i ⊆ I. The cardinality of an itemset α is the number of items it contains: |α| = |{o_i | o_i ∈ α}|. 8

PATTERN-FUSION: DESIGN AND OVERVIEW PRELIMINARIES Support set: the set of transactions that contain a pattern; D_α is the support set of α. A frequent itemset is called a frequent pattern. For two patterns α and α′, if α ⊆ α′, then α is a sub-pattern of α′, and α′ is a super-pattern of α. 9

PATTERN-FUSION: DESIGN AND OVERVIEW PRELIMINARIES For a pattern α, an itemset β ⊆ α is a τ-core pattern of α if |D_α| / |D_β| ≥ τ, for a core ratio 0 < τ ≤ 1. 10
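A sketch of the core-pattern test as defined in the paper (β ⊆ α is a τ-core pattern of α when the support ratio |D_α|/|D_β| ≥ τ); the toy dataset is hypothetical:

```python
def support_set(dataset, pattern):
    # D_alpha: the set of transactions containing the pattern.
    return [t for t in dataset if pattern <= t]

def is_core_pattern(dataset, alpha, beta, tau):
    # beta is a tau-core pattern of alpha if beta is a sub-pattern of
    # alpha and |D_alpha| / |D_beta| >= tau, with 0 < tau <= 1.
    if not beta <= alpha:
        return False
    d_alpha = len(support_set(dataset, alpha))
    d_beta = len(support_set(dataset, beta))
    return d_beta > 0 and d_alpha / d_beta >= tau

# Hypothetical toy dataset: {1, 2} appears in all three transactions,
# {1, 2, 3} in two of them, so the support ratio is 2/3.
data = [{1, 2, 3}, {1, 2, 3}, {1, 2}]
print(is_core_pattern(data, {1, 2, 3}, {1, 2}, 0.5))  # 2/3 >= 0.5 -> True
```

Intuitively, a core pattern of α is a sub-pattern whose support set is not much larger than α's, so it "points at" nearly the same transactions.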

PATTERN-FUSION: DESIGN AND OVERVIEW ROBUSTNESS OF COLOSSAL PATTERNS For a pattern α and a specified core ratio τ, let C_α be the set of all its τ-core patterns. 11


PATTERN-FUSION: DESIGN AND OVERVIEW ROBUSTNESS OF COLOSSAL PATTERNS Observation 1: A colossal pattern has far more core patterns than a smaller-sized pattern does; hence, for a given small size c, a colossal pattern has far more core descendants of size c. Observation 2: A colossal pattern can be generated by merging a proper set of its core patterns. 15

PATTERN-FUSION: DESIGN AND OVERVIEW PATTERN FUSION OVERVIEW First generate the complete set of frequent patterns up to a small size, then randomly pick a pattern β. With high probability, β is a core descendant of some colossal pattern α. Identify all of α's core descendants in this complete set and merge them. This generates a much larger core descendant of α, giving us the ability to leap along a path toward α in the core-pattern tree T_α. Pick K patterns in this way; the set of larger core descendants generated becomes the candidate pool for the next iteration. 16

PATTERN-FUSION: DESIGN AND OVERVIEW PATTERN FUSION OVERVIEW All core patterns of a pattern α are bounded in the metric space by a "ball" of diameter r(τ). 17

PATTERN-FUSION: DESIGN AND OVERVIEW PATTERN FUSION OVERVIEW Initial Pool: Pattern-Fusion assumes an initial pool of small frequent patterns is available: the complete set of frequent patterns up to a small size. This initial pool can be mined with any existing efficient mining algorithm. 18

PATTERN-FUSION: DESIGN AND OVERVIEW PATTERN FUSION OVERVIEW Iterative Pattern Fusion: Take as input a user-specified parameter K, the maximum number of patterns to be mined. At each iteration, K seed patterns are randomly picked from the current pool. For each of these K seeds, find all the patterns within a ball of a size specified by τ. All the patterns in each "ball" are then fused together to generate a set of super-patterns. All the super-patterns thus generated form a new pool. If this pool contains more than K patterns, the next iteration begins with a new round of random drawing from this pool. 19
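A minimal sketch of one such iteration, with the distance ball approximated by a core-ratio test (a simplification of the paper's metric-space ball); the dataset and parameter values are hypothetical:

```python
import random

def support(dataset, pattern):
    return sum(1 for t in dataset if pattern <= t)

def fusion_iteration(dataset, pool, k, tau, min_sup):
    # One Pattern-Fusion iteration: draw up to K seeds at random, collect
    # the pool patterns "close" to each seed, and fuse each group into a
    # larger super-pattern that survives the support threshold.
    seeds = random.sample(list(pool), min(k, len(pool)))
    next_pool = set()
    for seed in seeds:
        # A pattern is "in the ball" if fusing it with the seed keeps the
        # support ratio above tau (our simplified closeness test).
        ball = [p for p in pool
                if support(dataset, seed | p) / support(dataset, p) >= tau]
        fused = frozenset(seed.union(*ball)) if ball else frozenset(seed)
        if support(dataset, fused) >= min_sup:
            next_pool.add(fused)
    return next_pool

# Hypothetical dataset where {1, 2, 3, 4} is the large frequent pattern;
# the initial pool is the frequent singletons from any standard miner.
data = [{1, 2, 3, 4}] * 5 + [{1, 2}] * 5
pool = [frozenset({i}) for i in range(1, 5)]
print(fusion_iteration(data, pool, k=2, tau=0.5, min_sup=5))
```

Here every seed fuses its ball into {1, 2, 3, 4} in a single leap, illustrating how fusion skips the mid-sized patterns that a level-wise miner would enumerate.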

TOWARDS COLOSSAL PATTERNS WHY ARE COLOSSAL PATTERNS FAVORED? 20

TOWARDS COLOSSAL PATTERNS CATCHING THE OUTLIERS Combining Theorem 4 and Lemma 4, an outlier at edit distance d away from all other patterns has at least 2^(d-2) - 1 sets of complementary core patterns. The further away it lies, the greater the chance that it will be generated by Pattern-Fusion. 21

PATTERN-FUSION IN DETAIL 22


EVALUATION MODEL 24

EVALUATION MODEL Example: A complete set Q = {Q_1, …, Q_7}. The approximation set P = {P_1, P_2}, where P_1 = Q_4 and P_2 = Q_6. Edit(Q_1, P_1) = 2 and |P_1| = 5, so the approximation error of P_1 on Q_1 is 2/5. Edit(Q_5, P_2) = 1 and |P_2| = 3, so the approximation error of P_2 on Q_5 is 1/3. On average, any pattern in Q is at most 0.37 × 5 ≈ 2 items away from some pattern in P. 25
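The evaluation can be sketched as follows, reading the edit distance between itemsets as the size of their symmetric difference (an interpretation consistent with the slide's numbers); the concrete itemsets are hypothetical, chosen only to reproduce those numbers:

```python
def edit_distance(a, b):
    # Edit distance between itemsets: items that must be added or removed
    # to turn one into the other, i.e. the symmetric-difference size.
    return len(a ^ b)

def approximation_error(q, approximation_set):
    # Relative distance from pattern q to its closest approximant in P.
    return min(edit_distance(q, p) / len(p) for p in approximation_set)

# Hypothetical itemsets reproducing the slide: Edit(Q1, P1) = 2 with
# |P1| = 5 (error 2/5), and Edit(Q5, P2) = 1 with |P2| = 3 (error 1/3).
p1, p2 = {1, 2, 3, 4, 5}, {6, 7, 8}
q1, q5 = {1, 2, 3, 4, 6}, {6, 7, 8, 9}
errors = [approximation_error(q, [p1, p2]) for q in (q1, q5)]
print(errors, sum(errors) / len(errors))  # [0.4, 0.333...], average ~0.37
```

The average relative error (~0.37) times the largest pattern size (5) gives the slide's bound of roughly 2 items.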

EXPERIMENTAL RESULTS 26


CONCLUSIONS Studied the problem of efficient computation of a good approximation for the colossal frequent itemsets in the presence of an explosive number of frequent patterns. Based on the concept of core pattern, a new mining methodology, Pattern-Fusion, is introduced. An evaluation model is proposed to evaluate the approximation quality of the mining results of Pattern-Fusion against the complete answer set. This model also provides a general mechanism to compare the difference between two sets of frequent patterns. 31