Mining Compressed Frequent-Pattern Sets. Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng. Department of Computer Science, University of Illinois at Urbana-Champaign.

Similar presentations
Heuristic Search techniques

Data Mining Classification: Alternative Techniques
Recap: Mining association rules from large datasets
ADAPTIVE FASTEST PATH COMPUTATION ON A ROAD NETWORK: A TRAFFIC MINING APPROACH Hector Gonzalez, Jiawei Han, Xiaolei Li, Margaret Myslinska, John Paul Sondag.
Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
A distributed method for mining association rules
Swarm: Mining Relaxed Temporal Moving Object Clusters
Introduction to Computer Science 2 Lecture 7: Extended binary trees
Correlation Search in Graph Databases Yiping Ke James Cheng Wilfred Ng Presented By Phani Yarlagadda.
Query Optimization of Frequent Itemset Mining on Multiple Databases Mining on Multiple Databases David Fuhry Department of Computer Science Kent State.
Mining Multiple-level Association Rules in Large Databases
Frequent Closed Pattern Search By Row and Feature Enumeration
Resource-oriented Approximation for Frequent Itemset Mining from Bursty Data Streams SIGMOD’14 Toshitaka Yamamoto, Koji Iwanuma, Shoshi Fukuda.
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Reducing the collection of itemsets: alternative representations and combinatorial problems.
Principal Component Analysis
Finite State Machine State Assignment for Area and Power Minimization Aiman H. El-Maleh, Sadiq M. Sait and Faisal N. Khan Department of Computer Engineering.
Mining Frequent patterns without candidate generation Jiawei Han, Jian Pei and Yiwen Yin.
Data Mining Association Analysis: Basic Concepts and Algorithms
Finding Compact Structural Motifs Presented By: Xin Gao Authors: Jianbo Qian, Shuai Cheng Li, Dongbo Bu, Ming Li, and Jinbo Xu University of Waterloo,
Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees Radford M. Neal and Jianguo Zhang the winners.
SubSea: An Efficient Heuristic Algorithm for Subgraph Isomorphism Vladimir Lipets Ben-Gurion University of the Negev Joint work with Prof. Ehud Gudes.
Online Data Gathering for Maximizing Network Lifetime in Sensor Networks IEEE transactions on Mobile Computing Weifa Liang, YuZhen Liu.
CS 206 Introduction to Computer Science II 12 / 10 / 2008 Instructor: Michael Eckmann.
Fast Algorithms for Association Rule Mining
33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Towards Graph Containment Search and Indexing Chen Chen 1, Xifeng Yan 2, Philip.
AlgoDEEP 16/04/101 An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets Fabio Vandin DEI - Università di Padova CS.
DEXA 2005 Quality-Aware Replication of Multimedia Data Yicheng Tu, Jingfeng Yan and Sunil Prabhakar Department of Computer Sciences, Purdue University.
Efficient Model Selection for Support Vector Machines
Mining Optimal Decision Trees from Itemset Lattices Dr. Siegfried Nijssen Dr. Elisa Fromont KDD 2007.
Sequential PAttern Mining using A Bitmap Representation
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
AR mining Implementation and comparison of three AR mining algorithms Xuehai Wang, Xiaobo Chen, Shen chen CSCI6405 class project.
1 Verifying and Mining Frequent Patterns from Large Windows ICDE2008 Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo Date: 2008/9/25 Speaker: Li, HueiJyun.
Efficient Data Mining for Calling Path Patterns in GSM Networks Information Systems, accepted 5 December 2002 SPEAKER: YAO-TE WANG ( 王耀德 )
MINING FREQUENT ITEMSETS IN A STREAM TOON CALDERS, NELE DEXTERS, BART GOETHALS ICDM2007 Date: 5 June 2008 Speaker: Li, Huei-Jyun Advisor: Dr. Koh, Jia-Ling.
A Clustering Algorithm based on Graph Connectivity Balakrishna Thiagarajan Computer Science and Engineering State University of New York at Buffalo.
On Graph Query Optimization in Large Networks Alice Leung ICS 624 4/14/2011.
Mining High Utility Itemset in Big Data
Approximate XML Joins Huang-Chun Yu Li Xu. Introduction XML is widely used to integrate data from different sources. Perform join operation for XML documents:
Approximate Dynamic Programming Methods for Resource Constrained Sensor Management John W. Fisher III, Jason L. Williams and Alan S. Willsky MIT CSAIL.
CS 8751 ML & KDD: Support Vector Machines. Mining Association Rules. KDD from a DBMS point of view –The importance of efficiency Market basket analysis Association.
Detecting Group Differences: Mining Contrast Sets Author: Stephen D. Bay Advisor: Dr. Hsu Graduate: Yan-Cheng Lin.
MINING COLOSSAL FREQUENT PATTERNS BY CORE PATTERN FUSION FEIDA ZHU, XIFENG YAN, JIAWEI HAN, PHILIP S. YU, HONG CHENG ICDE07 Advisor: Koh JiaLing Speaker:
1 AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Hong.
Data Mining Practical Machine Learning Tools and Techniques By I. H. Witten, E. Frank and M. A. Hall Chapter 6.2: Classification Rules Rodney Nielsen Many.
 Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu Santhosh Kumar Saminathan.
Mining Graph Patterns Efficiently via Randomized Summaries Chen Chen, Cindy X. Lin, Matt Fredrikson, Mihai Christodorescu, Xifeng Yan, Jiawei Han VLDB’09.
1 Efficient Discovery of Frequent Approximate Sequential Patterns Feida Zhu, Xifeng Yan, Jiawei Han, Philip S. Yu ICDM 2007.
Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy Nurjahan BegumLiudmila Ulanova Jun Wang 1 Eamonn Keogh University.
Fast Query-Optimized Kernel Machine Classification Via Incremental Approximate Nearest Support Vectors by Dennis DeCoste and Dominic Mazzoni International.
M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining ARM: Improvements March 10, 2009 Slide.
Indexing and Mining Free Trees Yun Chi, Yirong Yang, Richard R. Muntz Department of Computer Science University of California, Los Angeles, CA {
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Andrew.
University at Buffalo, The State University of New York. Pattern-based Clustering. How to cluster the five objects? Hard to define a global similarity measure.
Data Mining CH6 Implementation: Real machine learning schemes(2) Reporter: H.C. Tsai.
Da Yan, Raymond Chi-Wing Wong, and Wilfred Ng The Hong Kong University of Science and Technology.
Xifeng Yan Philip S. Yu Jiawei Han SIGMOD 2005 Substructure Similarity Search in Graph Databases.
Presented by: Mi Tian, Deepan Sanghavi, Dhaval Dholakia
Rule Induction for Classification Using
RE-Tree: An Efficient Index Structure for Regular Expressions
Haim Kaplan and Uri Zwick
Frequent Pattern Mining
Jiawei Han Department of Computer Science
CARPENTER Find Closed Patterns in Long Biological Datasets
Objective of This Course
Integer Programming (정수계획법)
Coverage Approximation Algorithms
Integer Programming (정수계획법)
Presentation transcript:

Mining Compressed Frequent-Pattern Sets Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng Department of Computer Science University of Illinois at Urbana-Champaign

2 Outline
Introduction
Problem Statement and Analysis
Discovering Representative Patterns
Performance Study
Discussion and Conclusions

3 Introduction
Frequent Pattern Mining
–Minimum Support: 2
Transactions: (a, b, c, d), (a, b, d, e), (b, e, f)
Frequent patterns: (b): 3, (a): 2, (a, b): 2, (a, d): 2, (d): 2, (b, d): 2, (e): 2, (b, e): 2, (a, b, d): 2
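A minimal brute-force sketch of this example in Python (illustrative only, not the paper's algorithm); it enumerates every candidate itemset over the three transactions, counts supports, and prints exactly the nine frequent patterns listed above:

    from itertools import combinations

    transactions = [{'a', 'b', 'c', 'd'},
                    {'a', 'b', 'd', 'e'},
                    {'b', 'e', 'f'}]
    min_sup = 2

    def support(itemset):
        # Number of transactions containing every item of the itemset.
        return sum(1 for t in transactions if itemset <= t)

    items = sorted(set().union(*transactions))
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            sup = support(set(combo))
            if sup >= min_sup:
                print(combo, ':', sup)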

4 Challenge in Frequent Pattern Mining
Efficiency?
–Many scalable mining algorithms are available now
Usability? Yes
–High minimum support: common-sense patterns
–Low minimum support: explosive number of results

5 Existing Compressing Techniques
Lossless compression
–Closed frequent patterns
–Non-derivable frequent itemsets
–...
Lossy approximation
–Maximal frequent patterns
–Boundary cover sets
–...
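A small sketch, assuming the frequent patterns of the slide-3 example are given as a dict from itemset to support, showing the standard definitions behind the two baselines: a pattern is closed if no proper superset has the same support, and maximal if it has no proper frequent superset at all:

    def closed_and_maximal(frequent):
        # frequent: dict mapping frozenset -> support
        closed, maximal = [], []
        for p, s in frequent.items():
            supersets = [q for q in frequent if p < q]
            if not any(frequent[q] == s for q in supersets):
                closed.append(set(p))
            if not supersets:
                maximal.append(set(p))
        return closed, maximal

    frequent = {frozenset('b'): 3, frozenset('a'): 2, frozenset('ab'): 2,
                frozenset('ad'): 2, frozenset('d'): 2, frozenset('bd'): 2,
                frozenset('e'): 2, frozenset('be'): 2, frozenset('abd'): 2}
    closed, maximal = closed_and_maximal(frequent)
    print(closed)   # [{'b'}, {'b', 'e'}, {'a', 'b', 'd'}]
    print(maximal)  # [{'b', 'e'}, {'a', 'b', 'd'}]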

6 A Motivating Example
A subset of frequent itemsets in the accidents dataset (each row gives a pattern's expression, i.e., the itemset, and its support):

ID   Item-Sets               Support
P1   {38,16,18,12}           205227
P2   {38,16,18,12,17}        205211
P3   {39,38,16,18,12,17}     101758
P4   {39,16,18,12,17}        161563
P5   {39,16,18,12}           161576

High-quality compression needs to consider both expression and support.

7 A Motivating Example
Closed frequent patterns
–Report P1, P2, P3, P4, P5 (all five rows of the table above)
–Emphasize support too much: no compression
Maximal frequent patterns
–Report only P3
–Care only about the expression: lose the support information
A desirable output: P2, P3, P4

8 Compressing Frequent Patterns
Our compressing framework
–Cluster frequent patterns by pattern similarity
–Pick a representative pattern for each cluster
Key problems
–Need a distance function to measure the similarity between patterns
–The quality of the clustering needs to be controllable
–The representative pattern should describe both the expressions and the supports of the other patterns
–Efficiency is always desirable

9 Distance Measure
Let P1 and P2 be two closed frequent patterns, and let T(P) be the set of transactions that contain P. The distance between P1 and P2 is

D(P1, P2) = 1 − |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)|

Example: let T(P1) = {t1,t2,t3,t4,t5} and T(P2) = {t1,t2,t3,t4,t6}; then D(P1,P2) = 1 − 4/6 = 1/3.
D is a valid distance metric.
D characterizes the support but ignores the expression.
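This is the Jaccard distance on transaction sets; a direct transcription in Python, assuming transaction sets are plain Python sets, with the slide's example:

    def pattern_distance(t_p1, t_p2):
        # D(P1, P2) = 1 - |T(P1) intersect T(P2)| / |T(P1) union T(P2)|
        return 1 - len(t_p1 & t_p2) / len(t_p1 | t_p2)

    T_P1 = {'t1', 't2', 't3', 't4', 't5'}
    T_P2 = {'t1', 't2', 't3', 't4', 't6'}
    print(pattern_distance(T_P1, T_P2))  # 1 - 4/6 = 0.333...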

10 Representative Patterns
Incorporate expression into the representative pattern
–The representative pattern should be able to express all the other patterns in the same cluster, i.e., be their superset
–The representative pattern here: Pr = {38,16,18,12,17}
The representative pattern is also good w.r.t. distance
–D(Pr, P1) ≤ D(P1, P2) and D(Pr, P2) ≤ D(P1, P2)
–Since Pr ⊇ P implies T(Pr) ⊆ T(P), the distance can be computed using supports only: D(Pr, P) = 1 − sup(Pr)/sup(P)

ID   Item-Sets        Support
P1   {38,16,18,12}    205227
P2   {38,16,18,17}    205310
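A one-liner sketch of the support-only computation (the helper name is mine, chosen for illustration, not from the paper):

    def distance_to_representative(sup_p, sup_pr):
        # Valid when P is a subset of Pr, so T(Pr) is a subset of T(P):
        # D(Pr, P) = 1 - |T(Pr)| / |T(P)| = 1 - sup(Pr) / sup(P)
        return 1 - sup_pr / sup_p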

11 Clustering Criterion
General clustering approaches (e.g., k-means)
–Directly apply the distance measure
–No guarantee on the quality of the clusters
–The representative pattern may not exist in a cluster
δ-clustering
–For each pattern P, find all patterns that can be expressed by P and whose distance to P is within δ (δ-cover)
–All patterns in the cluster can be represented by P

12 Intuitions of δ-clustering
All patterns in the cluster are supported by almost the same set of transactions
–The distance from any pattern to the representative is bounded by δ
–The distance between any two patterns is bounded by 2δ
–The small differences between transaction sets could be noise, or negligible
The representative pattern has the most informative expression

13 Pattern Compression Problem
Pattern compression problem
–Find the minimum number of clusters (representative patterns)
–All frequent patterns must be δ-covered by at least one representative pattern
–Variation: may the support of a representative pattern be less than min_sup?
NP-hardness: reducible from the set-covering problem

Pattern Compression                               Set-Covering
Frequent patterns                                 Elements
Representative patterns                           Sets
Minimize the number of representative patterns    Minimize the number of covering sets

14 Discovering Representative Patterns
RPglobal
–Assumes all frequent patterns are already mined
–Directly applies the greedy set-covering algorithm
–Guaranteed bound w.r.t. the optimal solution
RPlocal
–Relaxes the constraints used in RPglobal
–Gains efficiency, loses the bound guarantee
–Mines directly from the raw data set
RPcombine
–Combines the two methods above
–Trades off efficiency against performance

15 RPglobal
Algorithm
–At each step, find the pattern Pr that δ-covers the maximum number of uncovered patterns
–Select Pr as a new representative pattern
–Mark the corresponding patterns as covered
–Continue until all patterns are covered
Bound
–|Cg| ≤ (1 + ln max_{P∈F} |Set(P)|) · |C*|, the standard greedy set-covering bound, where |Cg| (|C*|) is the number of patterns output by RPglobal (by the optimal solution)
–F is the set of frequent patterns
–Set(P): the set of patterns covered by P
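A rough sketch of this greedy procedure, assuming the closed patterns are given as frozensets with a support dict; this illustrates the idea, it is not the authors' implementation:

    def delta_covered(p, pr, sup, delta):
        # P is delta-covered by Pr iff Pr is a superset of P and
        # D(Pr, P) = 1 - sup(Pr)/sup(P) is at most delta.
        return p <= pr and 1 - sup[pr] / sup[p] <= delta

    def rp_global(patterns, sup, delta):
        # patterns: list of frozensets; sup: dict frozenset -> support
        uncovered = set(patterns)
        representatives = []
        while uncovered:
            # Greedy step: pick the pattern whose delta-cover set is largest.
            best = max(patterns,
                       key=lambda pr: sum(delta_covered(p, pr, sup, delta)
                                          for p in uncovered))
            representatives.append(best)
            uncovered = {p for p in uncovered
                         if not delta_covered(p, best, sup, delta)}
        return representatives

Every pattern δ-covers itself (distance 0), so each greedy step removes at least one uncovered pattern and the loop terminates.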

16 RPlocal
RPglobal is expensive
–Assumes all frequent patterns are pre-computed
–Must find the globally best representative pattern at each step
–Must compute pairwise distances between all frequent patterns
Relax the constraints: RPlocal
–Finds a locally good representative pattern at each step
–Mines directly from the raw data
–Does not compute distances pairwise

17 Local Greedy Method
Principle of the local method:

Global Greedy                          Local Greedy
Find each pattern Pr (not covered)     Probe a pattern P (not covered)
Find all patterns covered by Pr        Find all patterns Pr covering P
Select the Pr with largest coverage    Select the Pr with largest coverage that covers P

Bound
–|Cl|: the number of patterns output by the local method
–T: the optimal number of patterns covering all probe patterns
–Set(P): the set of patterns covered by P

18 Mine from Raw Data
Beneficial
–No storage of huge intermediate outputs
–More efficient pruning methods
Applicable
–Utilizes the internal relations during mining
–FP-growth method: depth-first search in the pattern space
–A pattern can only be covered by its sons or by patterns visited before
(Figure: the probe pattern P, P's sons, and the visited patterns covering P)

19 Integrate the Local Method into FP-Mining
Algorithm
–Follow the depth-first search in the pattern space
–Remember all previously discovered representative patterns
–For each pattern P that is not covered yet and is being visited for the second time (when the traversal returns from its sons), select a representative pattern using the local method, with P as the new probe pattern

20 Avoid Pairwise Comparisons
Find a good representative pattern for the probe pattern P
–There are strong correlations among pattern positions, coverage of uncovered patterns, and pattern length
–Simple but effective heuristic: select the longest itemset among P's sons as a new representative pattern to cover P (sketched below)
Example: 4952 is the position of the first visit of P and 5043 that of the second visit; the patterns between positions 4952 and 5043 are P's sons
(Figure: first and second visits of P, P's sons, and previously discovered patterns)
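A hedged sketch of that heuristic; the interface (probe, sons, sup) is assumed for illustration, and the fallback to the probe itself when no son qualifies is my reading, not stated on the slide:

    def pick_representative(probe, sons, sup, delta):
        # Each son is a superset of the probe, so the support-only
        # distance D(son, probe) = 1 - sup(son)/sup(probe) applies.
        candidates = [s for s in sons
                      if 1 - sup[s] / sup[probe] <= delta]
        # The longest qualifying son tends to cover the most patterns.
        return max(candidates, key=len, default=probe)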

21 Efficient Implementation
Non-closed pattern
–A super-pattern with the same support exists
Closed_index (N bits)
–Each bit remembers the consistency of one item
–Aggregate the closed_index along with the pattern
–The pattern is not closed if at least one out-of-pattern bit is set

Transactions: (f,c,a,m,p), (f,c,a,b,m), (f,b), (f,c,a,m,p)
Example: f does not belong to (c,a), yet the support of (c,a) is the same as the support of (f,c,a), so (c,a) is not closed.
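A sketch of the closed_index idea, assuming one full-width bitmask per transaction (the paper aggregates a limited number of bits during mining); the transactions are the four listed above:

    transactions = [{'f', 'c', 'a', 'm', 'p'},
                    {'f', 'c', 'a', 'b', 'm'},
                    {'f', 'b'},
                    {'f', 'c', 'a', 'm', 'p'}]
    items = sorted(set().union(*transactions))
    bit = {item: 1 << i for i, item in enumerate(items)}

    def mask(itemset):
        # Bitmask with one bit set per item of the itemset.
        m = 0
        for item in itemset:
            m |= bit[item]
        return m

    def is_closed(pattern):
        # AND together the masks of all transactions supporting the
        # pattern; any extra bit left set means some outside item occurs
        # in every such transaction, i.e. a superset has equal support.
        supporting = [mask(t) for t in transactions if pattern <= t]
        if not supporting:
            return False  # the pattern never occurs
        common = supporting[0]
        for m in supporting[1:]:
            common &= m
        return common == mask(pattern)

    print(is_closed({'c', 'a'}))            # False: f and m always co-occur
    print(is_closed({'f', 'c', 'a', 'm'}))  # True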

22 Efficient Implementation
Prune non-closed patterns
–Non-closed patterns are guaranteed to be covered
–Use a limited number of bits to remember a subset of the items
–Most non-closed patterns are pruned by the closed_index
–The few that remain are pruned by checking the coverage of the representative patterns

23 Experimental Setting
Data
–The frequent itemset mining dataset repository
Compared algorithms
–FPclose: an efficient algorithm that generates all closed itemsets; winner of the FIMI workshop 2003
–RPglobal: first uses FPclose to generate closed itemsets, then applies the global greedy method to find representative patterns
–RPlocal: directly applies the local method to find representative patterns from the raw data

24 Performance Study: Number of Representative Patterns

25 Performance Study: Running Time

26 Performance Study: Quality of Representative Patterns

27 Conclusions
Significant reduction of the output size
–Two orders of magnitude of reduction for δ = 0.1
–Captures both expressions and supports
–Easily extendable to compress sequential, graph, and structured data
RPglobal
–Has a theoretical bound
–Works well on small collections of patterns
RPlocal
–Much more efficient
–Still quite good compression quality

28 Future Work
Using representative patterns for association, correlation, and classification
Compressing frequent patterns over incrementally updated data (e.g., streams)
Further compressing the representative patterns with more advanced compression models (e.g., pattern profiles)