An Efficient Method for Projected Clustering

An Efficient Method for Projected Clustering
Hongyin Cui, Jiang Ye
School of Computing Science, Simon Fraser University

Introduction
Clustering is a widely used technique for data mining, indexing, and classification. Most clustering algorithms do not work efficiently or effectively in high-dimensional spaces because of the inherent sparsity of the data.

Clusters may exist in different subspaces, each comprising a different combination of attributes.

Related Work
CLIQUE: density-based and grid-based.
- Partitions each dimension into the same number of equal-length intervals, thereby partitioning the m-dimensional data space into non-overlapping rectangular units.
- A unit is dense if the fraction of the total data points it contains exceeds an input model parameter.
- A cluster is a maximal set of connected dense units within a subspace.
- A bottom-up greedy algorithm with an exponential dependency on the number of dimensions.

Related Work (cont.)
PROCLUS: finds the best set of medoids by a "hill climbing" process.
- Searches not just the space of possible medoids but also the space of possible dimensions associated with each medoid.
- It therefore relies on a locality analysis, and its result may be a local optimum.

Related Work (cont.)
DOC: a dense projective cluster is a pair (C, D), where C is a subset of the data set S and D is a subset of the full dimension set [d], such that:
- |C| is sufficiently large, i.e. |C| ≥ α|S|
- ∀i ∈ D: max_{p∈C} p_i − min_{q∈C} q_i ≤ w
- ∀i ∈ [d] − D: max_{p∈C} p_i − min_{q∈C} q_i > w
DOC repeatedly chooses p ∈ S and X ⊆ S via random sampling, computes the corresponding cluster (C, D), and reports the best cluster found: an approximation of the optimal projective cluster.

Problem Definition
Key observations:
- Often many records in a database share similar values for several attributes.
- Identifying and grouping together records that share similar values for some attributes can both give useful insight into the data (projected clusters) and yield a more parsimonious representation of the data.

Problem Definition (cont.)
The user can define the discretization criteria by specifying an interval width w_i for each attribute i, or by using a global interval width w for all attributes. We say a group of records shares a similar value on attribute i if they have the same discretized value on i (i.e., they fall in the same interval). For example:

Name          Position  Points  Played Mins  Penalty Mins
Blake         Defense       43          395            34
Borque                      80          430            22
Gullimore                    3           30            18
Gretzky       Centre        89          458            26
Konstantinov                10          560           120
May           Winger        35          290           180
Odjick                       9          115           245
Tkachuk       Center        82          475           160
Wotton                       5           38             6

Figure 1: A fragment of the NHL Players' Statistics Table (1996)

Problem Definition (cont.)
In Figure 1, suppose the discretization intervals imposed on the attributes are:
- Position: already discrete
- w_Points = 10, w_PlayedMins = 60, w_PenaltyMins = 20
We then find:
- {Borque, Gretzky, Tkachuk} on {Points, Played Mins}: played and scored a lot.
- {Gullimore, Wotton} on {Position, Points, Played Mins, Penalty Mins}: same position; played, scored, and were penalized sparingly.
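As a quick check of the first group, a few lines of Python (the values come from Figure 1; the floor-division rule is an assumption, consistent with Step 1 of FIPCLUS below):

    # Interval index of a value under interval width w (assumed rule).
    def interval(value, width):
        return value // width

    # Borque, Gretzky, and Tkachuk share intervals on Points and Played Mins.
    for points, mins in [(80, 430), (89, 458), (82, 475)]:
        print(interval(points, 10), interval(mins, 60))  # prints "8 7" each time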

Problem Definition (cont.)
Let p = (p_1, …, p_d) be a point in R^d, let [d] denote the set of the d dimensions, and let w_i ≥ 0 for 1 ≤ i ≤ d. ∀i ∈ [d], dimension i is partitioned and p_i is discretized by w_i. Let S be a set of points in R^d. For any 0 ≤ α ≤ 1, a projected cluster in S is a pair (C, D), C ⊆ S, D ⊆ [d], such that:
- |C| ≥ α|S|
- ∀j ∈ D, all points in C share an equivalent discretized value on attribute j (i.e., they fall in the same interval)
- No D' ⊃ D also satisfies the above two conditions
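The definition translates directly into a brute-force check; a minimal sketch, assuming points are tuples, positive interval widths, and the floor-division discretization (all names are illustrative):

    def shares_interval(C, j, widths):
        """Do all points in C fall in the same interval on dimension j?"""
        return len({int(p[j] // widths[j]) for p in C}) == 1

    def is_projected_cluster(C, D, S, alpha, widths):
        """Test the three conditions of the definition above."""
        if len(C) < alpha * len(S):
            return False
        if not all(shares_interval(C, j, widths) for j in D):
            return False
        # Maximality: no further dimension can be added to D.
        return not any(shares_interval(C, j, widths)
                       for j in range(len(widths)) if j not in D)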

FIPCLUS: Mining projected clusters via frequent closed itemsets
Basic steps:
- Step 1: discretize each point p on each attribute.
- Step 2: create a transaction database.
- Step 3: mine frequent closed itemsets with the CLOSET algorithm; each itemset identifies one subspace.
- Step 4: find the corresponding group of points for each subspace by scanning the DB once (see the sketch after this list).
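Step 4 gets no dedicated slide, so here is a minimal sketch of it, assuming transactions are Python sets of item ids and fcis holds the (itemset, support) pairs from Step 3 (the names are hypothetical):

    def clusters_from_itemsets(fcis, transactions):
        """Step 4 sketch: map each frequent closed itemset (a subspace with
        fixed discretized values) back to the records containing it."""
        return [(itemset, [rid for rid, t in enumerate(transactions) if itemset <= t])
                for itemset, support in fcis]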

FIPCLUS: Mining projected clusters via frequent closed itemsets (cont.)
Step 1
- ∀i ∈ [d], partition dimension i and discretize p_i by w_i; or
- discretize p_i using user-specified criteria; or
- skip Step 1 if the user provides already-discretized data.
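A minimal sketch of the default case of Step 1, assuming the fixed-width rule floor(p_i / w_i) (the slide leaves the exact rule to the user):

    def discretize(point, widths):
        """Map each coordinate to the index of its interval; width 0 marks
        an already-discrete attribute and passes the value through."""
        return [v if w == 0 else int(v // w) for v, w in zip(point, widths)]

    # e.g. discretize((89, 458, 26), (10, 60, 20)) -> [8, 7, 1]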

FIPCLUS: Mining projected clusters via frequent closed itemsets (cont.)
Step 2
- ∀i ∈ [d], enumerate the discretized values on dimension i and number each one with a distinct integer j, keeping the numbers consecutive. E.g., Position = {defense, center, winger} gives defense = 1, center = 2, and winger = 3.
- Substitute each discretized value on dimension i with the unique integer i*d + j. The original database is thereby transformed into a transaction database.
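A sketch of Step 2. The slide's item id i*d + j stays collision-free only while each dimension has at most d distinct values, so this sketch uses the largest per-dimension value count as the stride instead (an assumption; the idea is the same):

    def to_transactions(discretized_rows):
        """Renumber the distinct discretized values of each dimension
        consecutively, then give every (dimension, value) pair a globally
        unique item id via a per-dimension offset."""
        dims = range(len(discretized_rows[0]))
        codes = [{v: j for j, v in
                  enumerate(sorted({row[i] for row in discretized_rows}))}
                 for i in dims]
        stride = max(len(c) for c in codes)  # enough room per dimension
        return [{i * stride + codes[i][row[i]] for i in dims}
                for row in discretized_rows]

    # tdb = to_transactions([discretize(p, widths) for p in points])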

FIPCLUS: Mining projected clusters via frequent closed itemsets (cont.)
Step 3: based on CLOSET [Jian Pei, Jiawei Han]
Definition (frequent closed itemset): an itemset X is a closed itemset if there exists no itemset X' such that (1) X' is a proper superset of X, and (2) every transaction containing X also contains X'. A closed itemset X is frequent if its support passes the given support threshold.
CLOSET is based on the FP-tree and needs no candidate generation.
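Equivalently, X is closed exactly when it equals the intersection of all transactions that contain it. A brute-force illustration of the definition (not how CLOSET computes it):

    def is_closed(X, transactions):
        """X (a set) is closed iff it equals its own closure, i.e. no proper
        superset occurs in every transaction containing X."""
        containing = [t for t in transactions if X <= t]
        closure = set.intersection(*containing) if containing else set(X)
        return closure == X

    # is_closed({1}, [{1, 2}, {1, 2, 3}])    -> False (2 always co-occurs)
    # is_closed({1, 2}, [{1, 2}, {1, 2, 3}]) -> True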

FIPCLUS: Mining projected clusters via frequent closed itemsets (cont.)
CLOSET
Input: a transaction database TDB and a support threshold min_sup.
Output: the complete set of frequent closed itemsets.
Method:
1. Initialization: let FCI be the set of frequent closed itemsets found so far; initialize FCI = ∅.
2. Find frequent items: scan TDB and compute the frequent item list f_list.
3. Mine frequent closed itemsets recursively: call CLOSET(∅, TDB, f_list, FCI).

FIPCLUS: Mining projected clusters via frequent closed itemsets (cont.)
CLOSET(X, DB, f_list, FCI)
Parameters:
- X: the current frequent itemset.
- DB: the X-conditional database, i.e. the subset of transactions in TDB containing X.
- f_list: the frequent item list of DB.
- FCI: the set of frequent closed itemsets found so far.

FIPCLUS: Mining projected clusters via frequent closed itemsets (cont.)
CLOSET(X, DB, f_list, FCI) {
  1. Extract the set Y of items appearing in every transaction of DB; insert X ∪ Y into FCI if it is not a subset of some itemset already in FCI with the same support.
  2. Build the FP-tree for DB, excluding the items in Y, and directly extract frequent closed itemsets from the FP-tree.
  3. ∀i in the rest of f_list, form the conditional database DB|i and compute its local frequent item list f_list_i.
  4. ∀i in the rest of f_list, call CLOSET(X ∪ {i}, DB|i, f_list_i, FCI) if X ∪ {i} is not a subset of any frequent closed itemset in FCI with the same support.
}
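A compact, runnable Python rendering of this recursion. It drops the FP-tree (step 2 above is an optimization) and manipulates explicit conditional databases, so it is much slower than CLOSET proper but computes the same result on small inputs:

    from collections import Counter

    def closet(X, db, min_sup, fci):
        """X-conditional mining; db is a list of transactions (sets).
        Assumes min_sup >= 1 so db is non-empty past the first check."""
        if len(db) < min_sup:
            return
        # Items appearing in every transaction of db join the closure of X.
        closure = X | set.intersection(*db)
        support = len(db)
        # Insert unless absorbed by an itemset already found with equal support.
        if not any(closure <= Z and support == s for Z, s in fci):
            fci.append((closure, support))
        counts = Counter(i for t in db for i in t if i not in closure)
        for i, c in counts.items():
            if c >= min_sup:
                closet(closure | {i}, [t for t in db if i in t], min_sup, fci)

    # fci = []; closet(set(), [{1, 2, 3}, {1, 2}, {2, 3}], 2, fci)
    # fci -> [({2}, 3), ({1, 2}, 2), ({2, 3}, 2)]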

Evaluation & Comparison
Definition: ours is more flexible and meaningful.
- No assumption on the distribution of C in D.
- A different interval w_i on each dimension, or flexible discretization criteria.
CLIQUE: partitions each dimension into a fixed number of intervals; not flexible. Hard to determine the density threshold for each unit.
PROCLUS: distance-based, so it has all the distance-based flaws.
DOC: a very similar definition, but a global interval width w for every dimension; not flexible.

Evaluation & Comparison (cont.)
Algorithm: ours solves the clustering problem via mining frequent itemsets, and is more efficient, scalable, and faster on large databases. Runtime complexity is O(N), where N = |DB|, with typically 4-5 scans of the DB.
CLIQUE: the bottom-up construction generates a huge number of candidates, each of which needs one scan of the DB; not efficient.
PROCLUS: finds the best set of medoids by a "hill climbing" process, a locality analysis whose result may be a local optimum. Runtime complexity is O(Nkl + Nkd), where k is the number of clusters, l is the average dimensionality of the subspaces, and d is the full dimensionality; less efficient.

Evaluation & Comparison (cont.)
DOC: finds an approximation of the clusters via random sampling; it is not complete and its quality cannot be guaranteed. Runtime complexity is O(N · d^(c+1)), where c is a constant, d is the full dimensionality, and N = |DB|; less efficient.

Conclusion
We proposed FIPCLUS, which:
- Efficiently mines projected clusters via frequent closed itemsets.
- Applies a compressed FP-tree structure to mine frequent closed itemsets without candidate generation.
- Generates a much smaller set of frequent itemsets, leading to fewer and more interesting projected clusters.

Weakness & Future Work
Weakness: FIPCLUS may generate some overlapping clusters. E.g., for (C1, D1) and (C2, D2): C1 = {a, b, c}, D1 = {d1, d2, d3, d4}; C2 = {a, b, c, e, f}, D2 = {d1, d2}.
Future work:
- Modify FIPCLUS to mine maximal frequent itemsets, addressing the weakness above; in the example it would then output only (C1, D1). This is actually a tradeoff, since maximal frequent itemsets may lose some interesting clusters and information (see the sketch below).
- Evaluate its effectiveness.
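One illustrative way to reproduce the example's intended behavior as a post-processing step (not the proposed mining-level change) is to drop any cluster whose subspace is strictly contained in another's:

    def keep_maximal_subspaces(clusters):
        """Drop (C, D) whenever another cluster's subspace strictly contains D."""
        return [(C, D) for C, D in clusters
                if not any(D < D2 for _, D2 in clusters)]

    # With D1 = {'d1', 'd2', 'd3', 'd4'} and D2 = {'d1', 'd2'},
    # only (C1, D1) survives, matching the example above.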

References
[1] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications.
[2] C. Procopiuc, M. Jones, P. Agarwal, T. M. Murali. A Monte Carlo Algorithm for Fast Projective Clustering.
[3] C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, J. Park. Fast Algorithms for Projected Clustering.
[4] J. Pei, J. Han, R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets.
[5] K. Yip, D. Cheung, M. Ng. A Highly-Usable Projected Clustering Algorithm for Gene Expression Profiles.
[6] H. V. Jagadish, J. Madar, R. Ng. Semantic Compression and Pattern Extraction with Fascicles.