Clustering High-Dimensional Data
Elsayed Hemayed, Data Mining Course
Outline
- Introduction
- Solution Techniques
  - Feature/Attribute Transformation
  - Feature/Attribute Selection
  - Subspace Clustering
- CLIQUE: A Dimension-Growth Subspace Clustering Method
  - Major steps
  - Example
  - Strengths and Weaknesses
Introduction
Most clustering methods are designed for low-dimensional data and encounter challenges when the dimensionality of the data grows really high (say, over 10 dimensions, or even thousands of dimensions for some tasks).
Issues:
- Noise
- The distance measure becomes meaningless
What happens when dimensionality increases?
- Only a small number of dimensions may be relevant to a given cluster; the remaining dimensions produce noise and mask the real clusters.
- Data become increasingly sparse because the data points are likely located in different dimensional subspaces.
- Data points come to look almost equally distant from one another, so the distance measure, which is essential for cluster analysis, becomes meaningless.
Solution Techniques
- Feature/Attribute Transformation
- Feature/Attribute Selection
- Subspace Clustering
Feature Transformation
Feature transformation methods transform the data onto a smaller space while preserving the original relative distances between objects. They summarize the data by creating linear combinations of the attributes.
Examples (a minimal sketch follows):
- Principal component analysis (PCA)
- Singular value decomposition (SVD)
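A minimal sketch of PCA-style feature transformation in Python, built on NumPy's SVD; the data matrix X, its shape, and the choice of two output components are made up for illustration.

```python
import numpy as np

def pca_transform(X, k):
    """Project the rows of X onto their top-k principal components."""
    Xc = X - X.mean(axis=0)                # center each attribute
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                   # coordinates in the reduced space

X = np.random.rand(100, 50)                # 100 records, 50 attributes
X_low = pca_transform(X, 2)                # summarized by 2 linear combinations
print(X_low.shape)                         # (100, 2)
```

Each output column is a linear combination of all 50 original attributes, which is exactly why the result preserves distances well but is hard to interpret.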
Feature Transformation Issues
- These methods do not remove any of the original attributes from the analysis, so irrelevant information may still mask the real clusters, even after transformation.
- The transformed features (attributes) are often difficult to interpret, making the clustering results less useful.
Thus, feature transformation is only suited to data sets where most of the dimensions are relevant to the clustering task. Unfortunately, real-world data sets tend to have many highly correlated, or redundant, dimensions.
Feature Selection
It is commonly used for data reduction by removing irrelevant or redundant dimensions (attributes). Given a set of attributes, attribute subset selection finds the subset of attributes that are most relevant to the data mining task. It involves searching through various attribute subsets and evaluating each subset against some criterion.
- Supervised setting: the most relevant subset of attributes is found with respect to the given class labels.
- Unsupervised setting: methods such as entropy analysis, which is based on the property that entropy tends to be low for data that contain tight clusters (see the sketch below).
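One illustrative (not canonical) way to score an attribute subset by entropy: discretize each selected attribute into bins and measure the entropy of the resulting cell occupancy, where lower entropy hints at tighter clusters. The binning scheme, the assumption that attributes are scaled to [0, 1), and the name subset_entropy are all made up for this sketch.

```python
import numpy as np
from collections import Counter

def subset_entropy(X, attrs, bins=10):
    """Entropy of the cell occupancy over the chosen attribute subset."""
    cells = Counter()
    for row in X:
        # assumes every attribute is scaled to [0, 1)
        key = tuple(int(np.floor(bins * row[a])) for a in attrs)
        cells[key] += 1
    p = np.array(list(cells.values()), dtype=float) / len(X)
    return -np.sum(p * np.log2(p))

X = np.random.rand(500, 8)                  # toy data in [0, 1)
print(subset_entropy(X, attrs=(0, 3)))      # score for the subset {0, 3}
```

A search procedure would compare such scores across candidate subsets and keep the lowest-entropy ones.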
Subspace Clustering
- An extension of attribute subset selection that has shown its strength in high-dimensional clustering.
- It is based on the observation that different subspaces may contain different, meaningful clusters.
- Subspace clustering searches for groups of clusters within different subspaces of the same data set.
- The problem becomes how to find such subspace clusters effectively and efficiently.
High-Dimensional Data Clustering Approaches
- Dimension-growth subspace clustering: CLIQUE (CLustering InQUEst)
- Dimension-reduction projected clustering: PROCLUS (PROjected CLUStering)
- Frequent-pattern-based clustering: pCluster
CLIQUE: A Dimension-Growth Subspace Clustering Method
CLIQUE Overview
CLIQUE is used for the clustering of high-dimensional data present in large tables. By high-dimensional data we mean records that have many attributes. CLIQUE identifies the dense units in the subspaces of the high-dimensional data space and uses these subspaces to provide more efficient clustering.
Terminology
- Unit: after forming a grid structure on the space, each rectangular cell is called a unit.
- Dense: a unit is dense if the fraction of the total data points contained in it exceeds an input model parameter (the density threshold).
- Cluster: a cluster is a maximal set of connected dense units.
A small sketch of this vocabulary follows.
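A minimal sketch of these definitions in Python, assuming points with coordinates scaled to [0, 1). The parameter names xi (intervals per dimension) and tau (density threshold) follow common CLIQUE expositions; the values used later are made up.

```python
import numpy as np
from collections import Counter

def unit_of(point, xi):
    """Grid cell (unit) containing a point with coordinates in [0, 1)."""
    return tuple(min(int(xi * x), xi - 1) for x in point)

def dense_units(X, dims, xi=10, tau=0.02):
    """Units of the subspace `dims` whose fraction of points exceeds tau."""
    counts = Counter(unit_of(row[list(dims)], xi) for row in X)
    return {u for u, c in counts.items() if c / len(X) > tau}
```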
How Does CLIQUE Work?
The goal is to cluster a set of records in terms of n attributes (an n-dimensional space).
Major steps:
1. CLIQUE partitions each 1-dimensional subspace into the same number of equal-length intervals. Using this as a basis, it partitions the n-dimensional data space into non-overlapping rectangular units.
2. CLIQUE finds dense units of higher dimensionality by building on the dense units found in lower-dimensional subspaces (a sketch of step 1 follows).
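Continuing the sketch above (reusing dense_units and its imports), step 1 amounts to finding the dense units of every 1-dimensional subspace; the toy data and parameter values are illustrative.

```python
X = np.random.rand(1000, 3)                 # toy data: salary, vacation, age
dense_1d = {(d,): dense_units(X, dims=(d,), xi=10, tau=0.02)
            for d in range(X.shape[1])}
for dims, units in dense_1d.items():
    print(dims, sorted(units))              # dense intervals per dimension
```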
CLIQUE: Major Steps (cont.)
For example, in a 3-dimensional space, CLIQUE finds the dense units in the three related planes (the 2-dimensional subspaces). It then intersects the extensions of the subspaces representing the dense units to form a candidate search space in which dense units of higher dimensionality would exist.
CLIQUE: Major Steps (cont.)
Each maximal set of connected dense units is considered a cluster. Using this definition, the dense units in the subspaces are examined in order to find clusters in the subspaces, and the information about those subspaces is then used to find clusters in the n-dimensional space. Note that all cluster boundaries are either horizontal or vertical; this is due to the nature of the rectangular grid cells. A connected-components sketch follows.
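A small sketch of this last step, under the assumption that two units are connected when their cells differ by one interval in exactly one dimension; the breadth-first traversal below is an illustrative implementation, not the paper's.

```python
from collections import deque

def clusters_from_units(units):
    """Group dense units (tuples of interval indices) into connected components."""
    units, clusters = set(units), []
    while units:
        seed = units.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            u = queue.popleft()
            for d in range(len(u)):            # try both neighbors per dimension
                for step in (-1, 1):
                    v = u[:d] + (u[d] + step,) + u[d + 1:]
                    if v in units:
                        units.remove(v)
                        component.add(v)
                        queue.append(v)
        clusters.append(component)
    return clusters

print(clusters_from_units({(0, 0), (0, 1), (5, 5)}))  # two clusters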
Example
Say we want to cluster a set of records that have three attributes: salary, vacation, and age. The data space for this data would be 3-dimensional.
[Figure: 3-dimensional data space with axes salary, vacation, and age]
Example (cont.)
After plotting the data objects, each dimension (i.e., salary, vacation, and age) is split into intervals of equal length. We then form a 3-dimensional grid on the space, each unit of which is a 3-D rectangle. Our goal is to find the dense 3-D rectangular units.
Example (cont.)
To do this, we find the dense units of the 2-dimensional subspaces of this 3-D space. First we find the dense units with respect to salary and age: we look at the salary-age plane and find all the 2-D rectangular units that are dense. We also find the dense 2-D rectangular units for the vacation-age plane.
Example (cont.)
[Figure: the dense 2-D units found in the salary-age and vacation-age planes]
Example (cont.)
Now let us try to visualize the dense units of the two planes in a 3-D figure:
[Figure: 3-D view of the dense units from the salary-age and vacation-age planes]
Example (cont.)
We can extend the dense areas in the vacation-age plane inwards, and the dense areas in the salary-age plane upwards. The intersection of these two extensions gives us a candidate search space in which 3-dimensional dense units would exist. We then find the dense units in the salary-vacation plane and form the extension of the subspace that represents those dense units.
Example (cont.)
Finally, we intersect the candidate search space with the extension of the dense units of the salary-vacation plane to get all the 3-D dense units. So what was the main idea? We used the dense units in the subspaces to find the dense units in the 3-dimensional space. Once the dense units are found, identifying the clusters is easy.
Reflecting upon CLIQUE
Why does CLIQUE confine its search for dense units in high dimensions to the intersection of dense units in subspaces? Because of the Apriori property, which uses prior knowledge of the search space so that portions of it can be pruned: if a k-dimensional unit is dense, then so are its projections in (k-1)-dimensional space. Equivalently, a k-dimensional unit cannot be dense unless all of its (k-1)-dimensional projections are dense, so any candidate with a non-dense projection can be discarded (see the sketch below).
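A sketch of the pruning test this property licenses. The unit representation here (a dict mapping dimension index to interval index) and the function name are illustrative choices, not the paper's data structures.

```python
def all_projections_dense(candidate, dense_lower):
    """candidate: dict {dim: interval}. dense_lower: set of frozensets of
    (dim, interval) pairs, one per dense (k-1)-dimensional unit."""
    items = list(candidate.items())
    for i in range(len(items)):
        projection = frozenset(items[:i] + items[i + 1:])  # drop one dimension
        if projection not in dense_lower:
            return False            # some projection is not dense: prune
    return True

# All three 2-D projections of the 3-D candidate are dense, so it survives.
dense_2d = {frozenset({(0, 3), (1, 7)}), frozenset({(0, 3), (2, 1)}),
            frozenset({(1, 7), (2, 1)})}
print(all_projections_dense({0: 3, 1: 7, 2: 1}, dense_2d))  # True
```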
Strengths and Weaknesses of CLIQUE
Strengths:
- It automatically finds the subspaces of highest dimensionality in which high-density clusters exist.
- It is quite efficient.
- It is insensitive to the order of the input records and does not presume any canonical data distribution.
- It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases.
Weaknesses:
- Obtaining meaningful clustering results depends on proper tuning of the grid size (which is fixed in advance) and the density threshold.
- The accuracy of the clustering result may be degraded by this simplification.
Summary
- Introduction
- Solution Techniques
  - Feature/Attribute Transformation
  - Feature/Attribute Selection
  - Subspace Clustering
- CLIQUE: A Dimension-Growth Subspace Clustering Method
  - Major steps
  - Example
  - Strengths and Weaknesses