Published by Roberta Wilkins. Modified over 9 years ago.
DB Seminar Series: HARP: A Hierarchical Algorithm with Automatic Relevant Attribute Selection for Projected Clustering Presented by: Kevin Yip 20 September 2002
1 Short Summary Our own work (unpublished), supervised by Dr. Cheung and Dr. Ng Problem: to cluster datasets of very high dimensionality Assumption: clusters are formed in subspaces
2 Short Summary Previous approaches: either have special restrictions on the dataset or target clusters, or cannot determine the dimensionality of the clusters automatically Our approach: not restricted by these limitations
3 Presentation Outline Clustering Projected clustering Previous approaches to projected clustering Our approach: HARP –Concepts –Implementation: HARP.1 –Experiments Future work and conclusions
4 Clustering Goal: given a dataset D with N records and d attributes (dimensions), partition the records into k disjoint clusters such that –Intra-cluster similarity is maximized –Inter-cluster similarity is minimized
5 Clustering How to measure similarity? –Distance-based: Manhattan distance, Euclidean distance, etc. –Correlation-based: cosine correlation, Pearson correlation, etc. –Link-based (common neighbors) –Pattern-based
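The distance-based and correlation-based measures named on this slide can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical and are not taken from HARP, and the link-based and pattern-based measures are not covered.

```python
import math

def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # Straight-line distance.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    # Cosine of the angle between the two vectors (undefined for a zero vector).
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

def pearson(x, y):
    # Pearson correlation is the cosine of the mean-centred vectors.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])
```

Distance measures are small for similar records, while correlation measures are close to 1 for similar records, so the two families need opposite comparisons when picking the "most similar" pair.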
6 Clustering 2 common types of clustering algorithms: –Partitional: selects some representative points for each cluster, assigns all other points to their closest clusters, and then re-determines the new representative points –Hierarchical (agglomerative): repeatedly determines the two most similar clusters, and merges them
7 Clustering Partitional clustering (figure panels: Dataset, Representatives, Assignment, Replacement)
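The representative/assignment/replacement cycle can be sketched as below. This is a minimal k-means-style instance that assumes 1-D numeric values and uses the cluster mean as the representative; the function name `kmeans_1d` and its parameters are hypothetical, and other partitional algorithms (for example medoid-based ones) choose representatives differently.

```python
def kmeans_1d(values, reps, iters=10):
    # reps: initial representative points, one per cluster
    for _ in range(iters):
        # Assignment: each value goes to its closest representative.
        groups = [[] for _ in reps]
        for v in values:
            idx = min(range(len(reps)), key=lambda i: abs(v - reps[i]))
            groups[idx].append(v)
        # Replacement: the mean of each group becomes the new representative.
        new_reps = [sum(g) / len(g) if g else r for g, r in zip(groups, reps)]
        if new_reps == reps:
            break  # representatives stopped moving
        reps = new_reps
    return reps, groups
```

Starting from the poor representatives 1 and 2 on the values 1, 2, 3, 10, 11, 12, the loop settles on the means 2.0 and 11.0 after two replacements.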
8 Clustering Hierarchical clustering (figure panels: Dataset, Similarity calculation, Best merge determination, Merging)
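The similarity calculation, best merge determination and merging steps can be sketched as a simple agglomerative loop. This sketch assumes single-linkage (the distance between two clusters is the distance between their closest members) and stops at a target number of clusters `k`; both choices, and the function name, are assumptions for illustration and are not HARP's own merge criterion.

```python
def agglomerative(points, k, dist):
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        # Similarity calculation over every pair of clusters.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                # Best merge determination: keep the closest pair.
                if best is None or d < best[0]:
                    best = (d, i, j)
        # Merging: fold cluster j into cluster i.
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

For example, `agglomerative([0, 1, 10, 11], 2, lambda a, b: abs(a - b))` merges 0 with 1 and 10 with 11. The naive pair search is cubic overall, which is why practical implementations cache the pairwise scores.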
9 Projected Clustering Assumption (general case): each cluster is formed in a subspace. Assumption (special case): each cluster has a set of relevant attributes. Goal: determine the records and relevant attributes of each cluster, i.e. “select” the “relevant attributes” (which raises the question of how to define “relevance”). Source of figures: ORCLUS (SIGMOD 2000)
10 Projected Clustering A 3-D view: Source of figure: DOC (SIGMOD 2002)
11 Projected Clustering An example dataset:
Person | Age | Virus Level | Blood Type | Disease A
1 | 35 | 0.6 | AB | Uninfected
2 | 64 | 0.9 | AB | Uninfected
3 | 27 | 1.1 | AB | Uninfected
4 | 18 | 9.8 | O | Infected
5 | 42 | 8.6 | AB | Infected
6 | 53 | 11.3 | B | Infected
7 | 37 | 0.7 | O | Recovered
8 | 28 | 0.4 | A | Recovered
9 | 65 | 0.9 | B | Recovered
12 Projected Clustering Projected clustering vs. feature selection: –Feature selection selects one feature set for all records, but projected clustering selects an attribute set individually for each cluster –Feature selection is a preprocessing task, but projected clustering selects attributes during the clustering process
13 Projected Clustering Why is projected clustering important? –At high dimensionality, the data points are sparse and the distances between any two points are almost the same –There are many noise attributes that we are not interested in –High dimensionality implies high computational complexity
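The claim that distances become almost the same at high dimensionality can be checked with a small experiment. The sketch below assumes uniformly random points in the unit hypercube and measures the relative contrast (farthest minus nearest distance, divided by the nearest distance) from one query point; the function name and the choice of 100 points are arbitrary illustration choices.

```python
import math
import random

def relative_contrast(n_points, dim, seed=0):
    # (max distance - min distance) / min distance from one random query point.
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, p))))
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(100, 2)      # few dimensions: nearest and farthest differ a lot
high = relative_contrast(100, 1000)  # many dimensions: distances bunch together
```

In two dimensions the farthest point is many times farther away than the nearest one, whereas in 1000 dimensions the contrast drops well below 1, so "closest cluster" decisions based on all attributes carry little information.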
14 Previous Approaches (Refer to my previous DB seminar on 17 May 2002 titled “The Subspace Clustering Problem”) –Grid-based dimension selection (CLIQUE, ENCLUS, MAFIA) –Association rule hypergraph partitioning –Context-specific Bayesian clustering –Monte Carlo algorithm (DOC) –Projective clustering (PROCLUS, ORCLUS)
15 Previous Approaches PROCLUS: 1. Draw medoids 2. Determine neighbors 3. Select attributes 4. Assign records 5. Replace medoids 6. Go to 2 ORCLUS: 1. Draw medoids 2. Assign records 3. Select vectors 4. Merge (reselect vectors and determine centroid) 5. Go to 2
16 Previous Approaches Summary of the limitations of previous approaches (each approach has one or more of the following): –Produces non-disjoint clusters –Has exponential time complexity w.r.t. cluster dimensionality –Allows each attribute value to be selected by only one cluster –* Is unable to determine the dimensionality of each cluster automatically –Produces clusters all of the same dimensionality –* Considers only local statistical values in attribute selection –Is unable to handle datasets with mixed attribute types –* Assigns records to clusters regardless of their distances –Requires datasets to have a lot more records than attributes
17 Our Approach: HARP Motivations: –From datasets: we want to study gene expression profile datasets (usually with thousands of genes and fewer than a hundred samples) –From previous algorithms: we want to develop a new algorithm that does not have any of the above limitations
18 Our Approach: HARP HARP: a Hierarchical algorithm with Automatic Relevant attribute selection for Projected clustering Special features: –Automatic attribute selection –Customizable procedures –Mutual disagreement prevention
19 Our Approach: HARP Special implementation based on attribute value density, HARP.1: –Use of global statistics in attribute selection –Generic similarity calculations that can handle both categorical and numeric attributes –Implementing all mutual disagreement mechanisms defined by HARP –Reduced time complexity by pre-clustering
20 Our Approach: HARP Basic idea: –In the partitional approaches: at the beginning, each record is assigned to a cluster by calculating distances/similarities using all attributes; it is very likely that some assignments are incorrect; there is no clue for finding the dimensionality of the clusters –Our approach: allow only the “best merges” at any time
21 Our Approach: HARP Basic idea: –“Best”: a merge is permitted only if (1) each selected attribute of the resulting cluster has a relevance of at least dt, (2) the resulting cluster has more than mapc selected attributes, and (3) the two participating clusters have a mutual disagreement not larger than md –mapc, dt, md: threshold variables
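The three conditions for a permitted merge can be sketched as a single check. This is a hedged sketch, not the authors' code: the function name `merge_permitted` and its signature are hypothetical, attribute relevances and the mutual disagreement value are assumed to be computed elsewhere, and "selected" is assumed to mean "relevance at least dt".

```python
def merge_permitted(relevance, disagreement, mapc, dt, md):
    # relevance: attribute name -> relevance in the would-be merged cluster
    # disagreement: mutual disagreement of the two participating clusters
    # Condition 1: only attributes with relevance >= dt are selected.
    selected = [a for a, r in relevance.items() if r >= dt]
    # Condition 2: more than mapc attributes must be selected.
    # Condition 3: mutual disagreement must not exceed md.
    ok = len(selected) > mapc and disagreement <= md
    return ok, selected
```

For example, with relevances A1 = 0.9, A2 = 0.8, A3 = 0.2, dt = 0.7 and md = 10, a merge with disagreement 5 passes when mapc = 1 (two attributes are selected) and fails when mapc = 2.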
22 Our Approach: HARP Multi-step clustering (figure steps: Initial thresholds, Merge score calculations, Perform all possible merges, Threshold loosening)
Threshold ranges in the figure: mapc from d to 1, dt from 1 to g, md from imd to mmd
Example merge score list:
Cluster 1 | Cluster 2 | Merge Score
2 | 6 | 27.6
3 | 8 | 24.3
12 | 13 | 24.1
1 | 5 | 18.5
…
23 Our Approach: HARP Expected resulting clusters: –Have all relevant attributes selected (due to mapc ) –Selected attributes have high relevance to the cluster (due to dt ) –Not biased by the participating clusters (due to md and some other mechanisms)
24 Our Approach: HARP More details: attribute relevance –Depending on the definition of the similarity measure –E.g. the density-based measure defines the relevance of an attribute to a cluster by the “compactness” of its values in the cluster. Compactness can be reflected by the variance value
25 Our Approach: HARP More details: attribute relevance
Cluster (mean/variance) | A1 | A2 | A3 | A4
C1 | 4.9/0.1 | 5.1/0.1 | 2.8/0.5 | 3.1/0.5
C2 | 5.0/0.1 | 4.9/0.1 | 7.3/0.5 | 7.2/0.5
Which attributes are relevant to the clusters? –A1, A2: local statistics –A3, A4: global statistics
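The difference between the two answers can be worked through numerically. The sketch below makes two assumptions that the slide does not state: the dataset consists only of C1 and C2, and the two clusters have equal size, so the global variance of an attribute is the average within-cluster variance plus the variance of the cluster means. It then applies the numeric score 1 - Var_local / Var_global used by HARP.1.

```python
# (mean, variance) of each attribute in C1 and C2, copied from the table
stats = {
    "A1": [(4.9, 0.1), (5.0, 0.1)],
    "A2": [(5.1, 0.1), (4.9, 0.1)],
    "A3": [(2.8, 0.5), (7.3, 0.5)],
    "A4": [(3.1, 0.5), (7.2, 0.5)],
}

def global_variance(cluster_stats):
    # Assumes equal-size clusters: within-cluster part + between-cluster part.
    means = [m for m, _ in cluster_stats]
    variances = [v for _, v in cluster_stats]
    grand_mean = sum(means) / len(means)
    within = sum(variances) / len(variances)
    between = sum((m - grand_mean) ** 2 for m in means) / len(means)
    return within + between

def relevance(local_var, global_var):
    return 1 - local_var / global_var

# Relevance of each attribute to C1 (C2 has the same variances here).
scores = {a: relevance(cs[0][1], global_variance(cs)) for a, cs in stats.items()}
```

Judged by local variance alone, A1 and A2 look best (0.1 versus 0.5). Under these assumptions their scores are only about 0.02 and 0.09, because their values are just as compact in the whole dataset, while A3 and A4 score about 0.91 and 0.89 because the two clusters occupy clearly different ranges.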
26 Our Approach: HARP More details: mutual disagreement –The two clusters participating in a merge do not agree with each other
27 Our Approach: HARP More details: mutual disagreement –Case 1: one cluster dominates the selection of attributes. Example: 100 rec. {A1, A2} + 5 rec. {A3, A4} → 105 rec. {A1, A2}
28 Our Approach: HARP More details: mutual disagreement –Case 2: the clusters lose some information due to the merge. Example: 50 rec. {A1, A2} + 50 rec. {A1, A2, …, A6} → 100 rec. {A1, A2}
29 Our Approach: HARP More details: mutual disagreement –Mutual disagreement prevention: set up the md threshold to limit the maximum disagreement on the new set of attributes; get the statistics of the loss of information in all possible merges and discard those with extraordinarily high loss; add a punishment factor to the similarity score
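The second mechanism, discarding merges whose information loss is extraordinarily high, can be sketched as a simple outlier filter. This is an assumption-laden sketch: the slides do not give the loss formula or the cut-off, so each candidate merge is assumed to carry a precomputed loss value, and the cut-off of `max_sd` population standard deviations above the mean is a hypothetical parameter.

```python
from statistics import mean, pstdev

def filter_by_loss(merges, max_sd=2.0):
    # merges: list of (cluster_pair, loss) tuples for all candidate merges
    losses = [loss for _, loss in merges]
    mu, sd = mean(losses), pstdev(losses)
    # Keep merges whose loss is at most max_sd standard deviations above the mean.
    return [(pair, loss) for pair, loss in merges if loss <= mu + max_sd * sd]
```

With nine candidate merges of loss 1 and one of loss 10, the mean is 1.9 and the standard deviation 2.7, so the cut-off is 7.3 and only the loss-10 merge is discarded.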
30 Our Approach: HARP.1 HARP.1: an implementation of HARP that defines the relevance of an attribute to a cluster by its density improvement from the global density. Relevance score of an attribute to a cluster: –Categorical: 1 - (1 - Mode-ratio_local) / (1 - Mode-ratio_global) –Numeric: 1 - Var_local / Var_global –* When Mode-ratio_global = 1 or Var_global = 0, the score = 0 –If C1 and C2 merge into Cnew, we can use the records of C1 and C2 to evaluate their “agreement” on the selected attributes of Cnew in a similar way.
31 Our Approach: HARP.1 Mutual disagreement calculations: –Den(Ci, a): how good is attribute a in Ci –Den(Ci, Cnew, a): how good is the attribute a in Ci, evaluated by using the properties of a in Cnew –Both values are in the range [0, 1]
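The two relevance scores can be sketched directly from the formulas on this slide. Two details are assumptions, since the slides do not spell them out: the mode ratio is taken to be the fraction of records holding the most frequent value, and the variance is the population variance. The function names are hypothetical.

```python
from collections import Counter
from statistics import pvariance

def mode_ratio(values):
    # Fraction of records that take the most frequent value (assumed definition).
    return Counter(values).most_common(1)[0][1] / len(values)

def relevance_categorical(local, global_):
    g = mode_ratio(global_)
    if g == 1:
        return 0.0  # special case from the slide
    return 1 - (1 - mode_ratio(local)) / (1 - g)

def relevance_numeric(local, global_):
    g = pvariance(global_)
    if g == 0:
        return 0.0  # special case from the slide
    return 1 - pvariance(local) / g

# Example dataset from the projected clustering slide (persons 1 to 9).
age = [35, 64, 27, 18, 42, 53, 37, 28, 65]
virus = [0.6, 0.9, 1.1, 9.8, 8.6, 11.3, 0.7, 0.4, 0.9]
blood = ["AB", "AB", "AB", "O", "AB", "B", "O", "A", "B"]

infected_virus = relevance_numeric(virus[3:6], virus)        # high: compact values
infected_age = relevance_numeric(age[3:6], age)              # low: spread like the whole dataset
uninfected_blood = relevance_categorical(blood[0:3], blood)  # 1.0: all records are AB
```

On the example dataset, virus level scores above 0.9 for the Infected cluster while age scores below 0.2, which matches the intuition that only virus level is relevant there. Note that the numeric score becomes negative when a cluster's values are more spread out than the whole dataset.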
32 Our Approach: HARP.1 Similarity score:
33 Our Approach: HARP.1 Multi-step clustering (figure steps: Initial thresholds, Merge score calculations, Perform all possible merges, Threshold loosening)
Threshold ranges in the figure: mapc from d to 1, dt from 1 to g, md from imd to mmd
Example merge score list:
Cluster 1 | Cluster 2 | Merge Score
2 | 6 | 27.6
3 | 8 | 24.3
12 | 13 | 24.1
1 | 5 | 18.5
…
–Baseline value for each dt variable: the global statistical value
–Initial and baseline values for the md variable: user parameters, default 10 and 50
–With mutual disagreement prevention: 1. MD(C1, C2) <= md 2. Sum of and difference between ILoss(C1, Cnew) and ILoss(C2, Cnew) not more than a certain s.d. from mean 3. Punishment factor in similarity score
–Each cluster keeps a local score list (binary tree) containing merges with all other clusters. The best scores are propagated to a global score list
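Threshold loosening can be pictured as a schedule that moves each threshold from its strict initial value to its baseline. The slides give only the end points (mapc from d to 1, dt from 1 to g, md from its initial value to its baseline, 10 and 50 by default), so the linear interpolation, the fixed number of steps and the function name below are assumptions made purely for illustration.

```python
def threshold_schedule(d, g, imd=10.0, mmd=50.0, steps=5):
    # Yields (mapc, dt, md) from strictest to loosest; steps must be at least 2.
    for s in range(steps):
        t = s / (steps - 1)               # 0.0 at the start, 1.0 at the baseline
        mapc = round(d + t * (1 - d))     # from d selected attributes down to 1
        dt = 1.0 + t * (g - 1.0)          # from 1 towards the baseline value g
        md = imd + t * (mmd - imd)        # from the initial md to the baseline md
        yield mapc, dt, md

schedule = list(threshold_schedule(d=20, g=0.0))
```

At each step the algorithm would score the candidate merges, perform every merge the current thresholds permit, and only then loosen the thresholds, so that early merges are the most trustworthy ones.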
34 Our Approach: HARP.1 Time complexity: –Speeding up: use a fast projected clustering algorithm to pre-cluster the data Space complexity:
35 Our Approach: HARP.1 Accuracy experiments (datasets):
Name | Type | Rec. | Class | Cat./Num. Attr. | Avg. Rel. Attr. | Outlier (%)
Soybean | Real-life | 47 | 4 | 35/0 | 26 | ?
Voting | Real-life | 435 | 2 | 16/0 | 11 | ?
Mushroom | Real-life | 8124 | 2 | 22/0 | 15 | ?
SynCat1 | Synthetic | 500 | 5 | 20/0 | 12 | 5
SynMix1 | Synthetic | 500 | 5 | 10/10 | 12 | 5
SynNum1 | Synthetic | 500 | 5 | 0/20 | 12 | 5
SynCat2 | Synthetic | 500 | 5 | 20/0 | 7 | 5
SynMix2 | Synthetic | 500 | 5 | 10/10 | 7 | 5
SynNum2 | Synthetic | 500 | 5 | 0/20 | 7 | 5
36 Our Approach: HARP.1 Accuracy experiments (results 1):
Methods compared: HARP.1, PROCLUS, Traditional, ROCK. Entries are error% / outlier%; the slide's legend distinguishes best scores from averages. Values are listed in slide order.
Soybean: 0.0/0.0, 17.3/0.0, 2.1/0.0, 9.2/0.0, No published result
Voting: 6.4/13.6, 2.1/55.6, 13.8/7.9, 13.1/11.3, 13.1/1.9, 6.2/14.5
Mushroom: 1.4/0.0, 3.2/0.0, 9.0/0.0, 6.0/0.0, 5.2/0.0, 0.4/0.0
37 Our Approach: HARP.1 Accuracy experiments (results 2):
Methods compared: HARP.1, PROCLUS, Traditional, ORCLUS. Entries are error% / outlier%, listed in slide order.
SynCat1: 0.0/5.0, 3.6/1.4, 6.7/3.7, 2.6/26.4, 5.8/5.3, N/A
SynMix1: 0.4/6.8, 2.2/17.0, 6.8/10.1, 11.6/11.2, 7.9/4.6, N/A
SynNum1: 0.8/5.0, 1.8/21.4, 7.2/8.3, 4.4/32.0, 5.9/9.2, 0.4/23.8, 2.31/8.15
SynCat2: 4.0/8.4, 11.0/31.0, 25.0/14.4, 17.8/23.8, 28.5/5.4, N/A
SynMix2: 11.4/4.4, 16.6/62.2, 25.4/32.8, 17.6/38.6, 24.1/11.6, N/A
SynNum2: 18.8/4.4, 11.6/50.8, 18.7/20.7, 11.6/28.0, 23.3/10.9, 50.8/0.0, 57.2/0.0
38 Our Approach: HARP.1 Accuracy experiments (results3): –Dataset: 500 records, 200 attributes, on average 13 relevant, 5 classes –Pre-clustering: form 50 clusters
39 Our Approach: HARP.1 Scalability experiments (scaling N):
40 Our Approach: HARP.1 Scalability experiments (scaling d):
41 Our Approach: HARP.1 Scalability experiments (scaling average number of relevant attributes):
42 Our Approach: HARP.1 Scalability experiments (scaling N with pre-clustering):
43 Our Approach: HARP.1 Application: gene expression datasets –Lymphoma: Nature 403 (2000) –96 samples, 4026 genes, 9 classes
44 Our Approach: HARP.1 Application: gene expression datasets –Can also use genes as records and samples as attributes: e.g. use the dendrogram to produce an ordering of all genes; based on some domain knowledge, validate the ordering; if the ordering is valid, the positions of other genes of unknown function can be analyzed
45 Future Work –Produce more implementations based on other similarity measures –Study the definition of “relevance” in gene expression datasets –Consider very large datasets that cannot fit into main memory –Extend the approach to solve other problems, e.g. k-NN in high dimensional space
46 Conclusions A hierarchical projected clustering algorithm, HARP, is developed with –Dynamic selection of relevant attributes –Mutual disagreement prevention –Generic similarity calculation A density-based implementation called HARP.1 is developed with –Good accuracy –Reasonable time complexity –Real applications on gene expression datasets
47 References
C. C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for projected clustering. In ACM SIGMOD International Conference on Management of Data, 1999.
C. C. Aggarwal and P. S. Yu. Finding generalized projected clusters in high dimensional spaces. pages 70-81, 2000.
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In ACM SIGMOD International Conference on Management of Data, 1998.
A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet, T. Tran, X. Yu, J. I. Powell, L. Yang, G. E. Marti, T. Moore, J. Hudson, L. Lu, D. B. Lewis, R. Tibshirani, G. Sherlock, W. C. Chan, T. C. Greiner, D. D. Weisenburger, J. O. Armitage, R. Warnke, R. Levy, W. Wilson, M. R. Grever, J. C. Byrd, D. Botstein, P. O. Brown, and L. M. Staudt. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769):503-511, 2000.
Y. Barash and N. Friedman. Context-specific Bayesian clustering for gene expression data. In Annual Conference on Research in Computational Molecular Biology, 2001.
48 References
C. H. Cheng, A. W.-C. Fu, and Y. Zhang. Entropy-based subspace clustering for mining numerical data. In Knowledge Discovery and Data Mining, pages 84-93, 1999.
S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In 15th International Conference on Data Engineering, 1999.
E.-H. Han, G. Karypis, V. Kumar, and B. Mobasher. Clustering based on association rule hypergraphs. In 1997 SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.
G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar. Multilevel hypergraph partitioning: Applications in VLSI domain. In ACM/IEEE Design Automation Conference, 1997.
H. Nagesh, S. Goil, and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large data sets, 1999.
C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali. A Monte Carlo algorithm for fast projective clustering. In ACM SIGMOD International Conference on Management of Data, 2002.
H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. In ACM SIGMOD International Conference on Management of Data, 2002.