
Spatial Clustering Methods


1 Spatial Clustering Methods
In Data Mining GDM Ronald Treur 23 September 2003

2 Contents
Spatial Clustering
Considerations
Clustering Algorithms
Partitioning Methods
Hierarchical Methods
Density-based Methods
Grid-based Methods
Constraint-based Analysis
Conclusion

3 Spatial Clustering Spatial clustering is the process of grouping a set of objects into classes or clusters so that objects within a cluster have high similarity to one another but are dissimilar to objects in other clusters.

4 Considerations Cluster analysis has been studied for many years as a branch of statistics. In order to choose a clustering algorithm that is suitable for a particular application, many factors have to be considered, including: the application goal, quality versus speed, and the characteristics of the data

5 1. Application Goal Example:
Discovering good locations for setting up stores: a supermarket chain might like to cluster its customers such that the sum of the distances to the cluster centres is minimized (k-means, k-medoids).

6 1. Application Goal Example: Image recognition & raster data analysis
Find natural clusters, i.e. clusters which are perceived as crowded together by the human eye (density-based)

7 2. Quality versus Speed A suitable clustering algorithm for an application must satisfy both the quality and the speed requirements. Large datasets may require compression, which is lossy. Algorithms that produce good-quality clusters may be unable to handle large datasets.

8 3. Characteristics of the Data
Types of data attributes: The similarity between two data objects is judged by the difference in their data attributes. When these are numeric, Euclidean and Manhattan distances can be computed. Binary, categorical and ordinal values make things much more complicated

9 3. Characteristics of the Data
Dimensionality: The dimensionality of the data refers to the number of attributes in a data object. Many clustering algorithms which work well on low-dimensional data degenerate when the number of dimensions increases: the running time increases and the cluster quality decreases

10 3. Characteristics of the Data
Amount of noise in data: Some clustering algorithms are very sensitive to noise and outliers; a careful choice must be made if the data in the application contains a large amount of noise

11 Clustering Algorithms
Four general categories: partitioning methods, hierarchical methods, density-based methods and grid-based methods

12 Partitioning Methods Partitioning algorithms had long been popular clustering algorithms before the emergence of data mining: the k-means method, the k-medoids method and expectation maximization (EM)

13 Partitioning Algorithm
Algorithm: the generalized iterative relocation algorithm
Input: the number of clusters k and a database containing n objects
Output: a set of k clusters which minimizes a criterion function E
1. arbitrarily choose k centres/distributions as the initial solution
2. repeat
3. (re)compute the membership of the objects according to the present solution
4. update some or all cluster centres/distributions according to the new memberships of the objects
5. until no change to E

14 The k-means method
Uses the mean value of the objects in a cluster as the cluster centre
E = Σ_{i=1..k} Σ_{x ∈ Ci} |x − mi|², where x is the point in space representing a given object and mi is the mean of cluster Ci
Relatively scalable and efficient in processing large data sets because the computational complexity of the algorithm is O(nkt), where t is the number of iterations

15 The k-means method In step 3 of the algorithm, k-means assigns each object to its nearest centre, forming a new set of clusters. In step 4, the centres of these new clusters are recomputed by taking the mean of all the objects in each cluster. This is repeated until the criterion function E does not change after an iteration

16 Partitioning Algorithm - k-means
Algorithm: the generalised iterative relocation algorithm, specialised for k-means
Input: the number of clusters k and a database containing n objects
Output: a set of k clusters which minimizes a criterion function E
1. arbitrarily choose k centres as the initial solution
2. repeat
3. assign each object to its nearest cluster centre
4. compute all the centres of the clusters according to the new memberships of the objects
5. until no change to E
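As a concrete illustration of the relocation loop above, here is a minimal k-means sketch in Python with NumPy. It is only a sketch under simplifying assumptions: the function name kmeans, the random initialisation and the exact convergence test on E are choices made for this example, and it assumes no cluster becomes empty.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array of objects, k the number of clusters."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # step 1: arbitrary initial centres
    prev_e = np.inf
    for _ in range(max_iter):
        # step 3: assign each object to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # step 4: recompute each centre as the mean of its members (assumes no empty cluster)
        centres = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # criterion E: sum of squared distances from each object to its assigned centre
        e = ((X - centres[labels]) ** 2).sum()
        if e == prev_e:                                       # step 5: stop when E no longer changes
            break
        prev_e = e
    return labels, centres
```

Calling labels, centres = kmeans(points, k=3) on an (n, 2) array of points would return the cluster membership of each object and the final cluster means.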

17 The k-medoids method Unlike the k-means and EM algorithms, the k-medoids method uses the most centrally located object in a cluster (the medoid) as the cluster centre. Like k-means, each object is assigned to its nearest centre. Less sensitive to noise and outliers, but has a higher running time

18 The k-medoids method Initialisation: k objects are randomly selected to be cluster centres. Step 3 is not used since this step is already handled in step 4 of the algorithm. At most one centre will be changed in step 4 in each iteration, and this change must result in a decrease in the criterion function

19 The k-medoids method Replaces one medoid with one non-medoid as long as the quality of the resulting clustering improves. Replacement medoids are selected randomly. Uses Partitioning Around Medoids (PAM). CLARA and CLARANS are methods to handle larger datasets
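Below is a minimal sketch of the swap idea, assuming a precomputed pairwise distance matrix D; the randomized swap selection and the fixed number of trials are simplifications for illustration rather than the exact PAM procedure.

```python
import numpy as np

def total_cost(D, medoids):
    """Criterion for k-medoids: the sum of each object's distance to its nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def kmedoids(D, k, n_trials=500, seed=0):
    """Randomized swap heuristic: replace a medoid with a non-medoid whenever the cost decreases."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))   # initialisation: k random objects
    best = total_cost(D, medoids)
    for _ in range(n_trials):
        i = rng.integers(k)                                 # which medoid to try to swap out
        candidate = rng.integers(n)                         # random non-medoid replacement
        if candidate in medoids:
            continue
        trial = list(medoids)
        trial[i] = candidate
        cost = total_cost(D, trial)
        if cost < best:                                     # keep the swap only if quality improves
            medoids, best = trial, cost
    labels = D[:, medoids].argmin(axis=1)                   # assign each object to its nearest medoid
    return labels, medoids
```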

20 CLARANS

21 Expectation Maximization
Instead of representing each cluster using a single point, the EM algorithm represents each cluster using a probability distribution. A d-dimensional Gaussian distribution representing a cluster Ci is parameterized by the mean of the cluster μi and a d × d covariance matrix Mi
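To make "a cluster represented by a probability distribution" concrete, here is a small sketch of the membership (E-step) computation under the current parameters; the function name, the explicit mixing weights and the use of scipy.stats.multivariate_normal are assumptions for this example, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """Soft memberships of each object under k Gaussian clusters.

    X: (n, d) data; weights: (k,) mixing proportions; means: list of k mean vectors (the μi);
    covs: list of k (d, d) covariance matrices (the Mi).
    """
    k = len(weights)
    # density of every object under every cluster's Gaussian, scaled by the mixing weight
    dens = np.column_stack([
        weights[i] * multivariate_normal(means[i], covs[i]).pdf(X) for i in range(k)
    ])
    # normalise each row so it sums to 1: the responsibility of each cluster for each object
    return dens / dens.sum(axis=1, keepdims=True)
```

The M-step of EM would then re-estimate the weights, means and covariance matrices from these responsibilities, and the two steps alternate until the solution stabilises.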

22 Hierarchical Methods Create a hierarchical decomposition of the given set of data objects, forming a dendrogram: a tree which splits the database recursively into smaller subsets. The dendrogram can be formed "bottom-up" (agglomerative) or "top-down" (divisive)

23 Hierarchical Methods Early methods such as AGglomerative NESting (AGNES) and DIvisive ANAlysis (DIANA) often result in erroneous clusters. More recent methods, CURE and CHAMELEON, utilize a more complex principle, so fewer errors are made. Other approaches refine the results afterwards using iterative relocation

24 AGNES and DIANA AGNES: bottom-up; start by placing each object in its own cluster and then merge these into larger and larger clusters until all objects are in a single cluster. DIANA: top-down, the exact reverse of AGNES; start with a single cluster containing all objects and recursively split it
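For a feel of how the bottom-up merging produces a dendrogram, here is a short sketch using SciPy's hierarchical clustering routines; the toy data, the single-linkage criterion and the cut into three clusters are arbitrary choices for the example, not prescribed by AGNES itself.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(50, 2))   # toy 2-D data

# Bottom-up merging: each object starts in its own cluster, and the two closest
# clusters are merged repeatedly; Z encodes the resulting dendrogram.
Z = linkage(X, method='single')

# Cutting the dendrogram at a chosen level yields a flat clustering.
labels = fcluster(Z, t=3, criterion='maxclust')
```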

25 AGNES and DIANA The algorithms are simple but often encounter difficulties regarding the selection of merge and split points. Such a decision is critical because once a group of objects is merged or split, the process at the next step operates on the newly generated clusters. These methods do not scale well

26 BIRCH Balanced Iterative Reducing and Clustering using Hierarchies
Compress the data into many small subclusters, then perform clustering on the subclusters. Due to the compression, clustering can be performed in main memory and the algorithm only needs to scan the database once

27 CURE Clustering Using REpresentatives
Like AGNES, but uses a much more sophisticated principle when merging clusters. Instead of a single centroid, a fixed number of well-scattered objects is selected to represent each cluster. The selected representative objects are shrunk towards their cluster centres

28 CURE

29 CHAMELEON Similar to CURE, but:
Two clusters will be merged only if the inter-connectivity and closeness of the merged cluster are very similar to the inter-connectivity and closeness of the two individual clusters before merging

30 CHAMELEON To form initial subclusters, first create a graph G = (V, E) where each node v ∈ V represents a data object, and a weighted edge (vi, vj) exists between two nodes vi and vj if vj is one of the k nearest neighbours of vi. The weight of each edge in G represents the closeness between the two data objects it connects
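A minimal sketch of this k-nearest-neighbour graph construction follows; since the slide only states that edge weights represent closeness, the inverse-distance weighting used here is an illustrative assumption.

```python
import numpy as np

def knn_graph(X, k):
    """Build G = (V, E): node i gets a weighted edge to each of its k nearest neighbours."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                  # a point is not its own neighbour
    edges = {}
    for i in range(len(X)):
        for j in np.argsort(dists[i])[:k]:           # the k nearest neighbours of i
            # closeness weight: inverse distance (small epsilon guards against duplicate points)
            edges[(i, int(j))] = 1.0 / (dists[i, j] + 1e-12)
    return edges
```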

31 CHAMELEON Use a graph partitioning algorithm to recursively partition G into many small unconnected subgraphs by performing a min-cut on G at each level of recursion. Min-cut: a partitioning of G into two parts of roughly equal size such that the total weight of the edges being cut is minimized

32 CHAMELEON

33 CHAMELEON It has been shown that CHAMELEON is more effective than CURE
The processing cost for high-dimensional data may require O(n²) time for n objects in the worst case

34 Density-based Methods
Clustering methods that are based on a distance measure between objects have difficulties finding clusters with arbitrary shapes. Density-based methods regard clusters as dense regions of objects in the data space which are separated by regions of low density. Density-based methods can also be used to filter out noise

35 DBSCAN Density-based clustering algorithm that grows regions with sufficiently high density into clusters. Requires two parameters, ε and MinPts. The neighbourhood within a radius ε of a given object is called the ε-neighbourhood. An object with at least MinPts objects within its ε-neighbourhood is called a core object

36 DBSCAN The clustering follows these rules:
An object can belong to a cluster if and only if it lies within the ε-neighbourhood of some core object in the cluster
A core object o within the ε-neighbourhood of another core object p must belong to the same cluster as p
A non-core object q within the ε-neighbourhood of some core objects p1, …, pi, i > 0, must belong to the same cluster as at least one of the core objects p1, …, pi
A non-core object r which does not lie within the ε-neighbourhood of any core object is considered to be noise
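These rules can be turned into a compact procedure: identify the core objects, then grow clusters outward from them. Below is a minimal DBSCAN-style sketch; the brute-force neighbourhood search and the assignment of border points to the first cluster that reaches them are simplifications for illustration.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Label each object with a cluster id, or -1 for noise."""
    n = len(X)
    # pairwise Euclidean distances; a spatial index would be used for large datasets
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbours = [np.where(dists[i] <= eps)[0] for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbours]

    labels = np.full(n, -1)                  # -1 means noise / not yet assigned
    cluster_id = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster_id               # start a new cluster from an unassigned core object
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster_id       # any object in a core object's ε-neighbourhood joins
                if core[j]:
                    frontier.extend(neighbours[j])   # only core objects expand the cluster further
        cluster_id += 1
    return labels
```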

37 DBSCAN

38 DENCLUE Based on a set of density functions
Built on the following ideas: The influence of each data point can be formally modeled using a mathematical function (the influence function) which describes the impact of the data point within its neighbourhood

39 DENCLUE Built on the following ideas (cont.):
The overall density of the data space can be modeled as the sum of the influence functions of all data points
Clusters can then be determined mathematically by identifying density attractors, which are the local maxima of the overall density function
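A small sketch of these two ideas, assuming a Gaussian influence function; the bandwidth parameter sigma and the function names are choices made for this example.

```python
import numpy as np

def gaussian_influence(x, y, sigma=1.0):
    """Influence of data point y at location x: decays with their squared distance."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def overall_density(x, data, sigma=1.0):
    """Overall density at x: the sum of the influence functions of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

# Density attractors are the local maxima of overall_density; DENCLUE locates them
# by hill-climbing from each data point along the gradient of the density.
```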

40 Grid-based Methods Density-based methods like DBSCAN and OPTICS are index-based methods that face a breakdown in efficiency when the number of dimensions is high. To enhance the efficiency of clustering, a grid-based clustering approach uses a grid data structure: it quantizes the space into a finite number of cells

41 Grid-based Methods Main advantage: fast processing time, which is typically independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space. Examples: STatistical INformation Grid (STING) and CLustering In QUEst (CLIQUE)

42 STING Grid-based multiresolution data structure in which the spatial area is divided into rectangular cells. There are usually several levels of rectangular cells corresponding to different levels of resolution; these form a hierarchical structure. Statistical information about the attributes in each grid cell (such as the mean, maximum and minimum values) is precomputed and stored

43 STING

44 STING Statistical parameters of higher-level cells can easily be computed from the parameters of the lower-level cells. These parameters include: count (the number of objects); mean, standard deviation, minimum and maximum; the type of distribution the attribute values in the cell follow
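The following sketch shows how a parent cell's statistics can be derived purely from its children's; the Cell structure is hypothetical, and the sum of squares is stored here only so that the standard deviation can be aggregated as well.

```python
from dataclasses import dataclass
import math

@dataclass
class Cell:
    count: int       # number of objects falling into the cell
    mean: float      # mean attribute value
    sq_sum: float    # sum of squared attribute values (lets a parent derive the std)
    minimum: float
    maximum: float

def std(cell):
    """Standard deviation of a cell, derived from its stored summaries."""
    return math.sqrt(max(cell.sq_sum / cell.count - cell.mean ** 2, 0.0))

def aggregate(children):
    """Parent-cell statistics computed only from the children's statistics."""
    n = sum(c.count for c in children)
    return Cell(
        count=n,
        mean=sum(c.mean * c.count for c in children) / n,
        sq_sum=sum(c.sq_sum for c in children),
        minimum=min(c.minimum for c in children),
        maximum=max(c.maximum for c in children),
    )
```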

45 STING Data are loaded bottom-up, starting at the bottom-most level (the level with the highest resolution). To perform clustering, users must supply a density level as an input parameter. Using this parameter, a top-down, grid-based method is used to find regions with sufficient density by adopting the following procedure:

46 STING A layer within the hierarchical structure is determined from which the query-answering process is to start; this layer typically contains a small number of cells. For each cell in the current layer, we compute the confidence interval that the cell will be relevant to the result of the clustering. Cells that do not meet the confidence level are removed from further consideration

47 STING The relevant cells are then refined to a finer resolution by repeating this procedure at the next level of the structure. This process is repeated until the bottom layer is reached. At this point, if the query specification is met, the regions of relevant cells that satisfy the query are returned. Otherwise, the data that fall into the relevant cells are retrieved and further processed until they meet the requirements

48 STING Advantages of STING:
The grid-based computation is query-independent, since the statistical information stored in each cell represents summary information that does not depend on the query. The grid structure facilitates parallel processing and incremental updating. The method is very efficient

49 STING Disadvantages of STING:
The quality depends on the granularity of the lowest level: if it is very fine, the cost of processing will increase; if it is too coarse, the quality of the cluster analysis may be reduced. It does not consider the spatial relationship between children and their neighbouring cells when constructing the parent cell

50 WaveCluster Multiresolution clustering algorithm that first summarizes the data by imposing a multidimensional grid structure on the data space. It then uses a wavelet transformation to transform the original feature space, finding dense regions in the transformed space

51 WaveCluster Each grid cell summarizes the information of a group of points which map into the cell A wavelet transform is a signal processing technique that decomposes a signal into different frequency sub-bands. Natural clusters in the data become more distinguishable Clusters can then be identified by searching for dense regions in the new domain

52 WaveCluster Both a grid-based and a density-based algorithm.
It conforms to all of the requirements of a good clustering algorithm. In experimental studies, WaveCluster was found to outperform BIRCH, CLARANS and DBSCAN in terms of both efficiency and clustering quality

53 CLIQUE Integrates density-based and grid-based clustering
CLIQUE is able to discover clusters in subspaces of the data. This is useful for clustering high-dimensional data, which are usually very sparse and do not form clusters in the full-dimensional space

54 CLIQUE The data space is partitioned into non-overlapping rectangular units by equally spaced partitions along each dimension. A unit is dense if the fraction of the total data points contained in it exceeds an input model parameter. A cluster is defined as a maximal set of connected dense units
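A minimal sketch of the counting step for the full-dimensional grid: partition each dimension into equally spaced intervals and flag the units whose fraction of points exceeds the threshold. The unit representation (a tuple of interval indices) and the parameter names xi and tau are illustrative; full CLIQUE also searches for dense units in lower-dimensional subspaces and connects them into clusters.

```python
import numpy as np
from collections import Counter

def dense_units(X, xi=10, tau=0.02):
    """Flag dense grid units: xi equal-width intervals per dimension, density threshold tau."""
    n, d = X.shape
    mins, maxs = X.min(axis=0), X.max(axis=0)
    width = np.where(maxs > mins, (maxs - mins) / xi, 1.0)   # guard against constant dimensions
    # map each point to the tuple of interval indices it falls into (its unit)
    idx = np.clip(((X - mins) / width).astype(int), 0, xi - 1)
    counts = Counter(map(tuple, idx))
    return {unit for unit, c in counts.items() if c / n > tau}
```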

55 Constraint-based Cluster Analysis
Most spatial clustering algorithms provide very few avenues for users to specify real-life constraints. One approach that provides the ability to incorporate real-life constraints is clustering with obstructed distance (COD)

56 COD

57 COD A set P of n points {p1, …, pn} and a set O of m non-intersecting obstacles {o1, …, om} are given in a two-dimensional region R, with each obstacle oi represented by a simple polygon. The distance d(p, q) between any two points is defined as the length of the shortest Euclidean path from p to q that does not cut through any obstacle

58 Conclusion Partitioning methods make use of a technique called iterative relocation to improve the clustering quality from an initial solution. Such methods tend to find clusters that are of spherical shape and similar in size. They are useful for applications like facility allocation, where the objective is to minimize the sum of distances from the data objects to the cluster centres

59 Conclusion Hierarchical methods fix the membership of an object once it has been allocated to a cluster. Early methods like AGNES and DIANA tend to produce erroneous results; BIRCH, CURE and CHAMELEON are successful in improving the quality of clustering

60 Conclusion Instead of using distance to judge the membership of data objects, density-based clustering algorithms make use of the density of data points within a region to discover clusters. DBSCAN is very sensitive to its two input parameters; OPTICS tries to overcome this problem.

61 Conclusion Grid-based clustering methods increase the efficiency of clustering by approximating the dense regions of the clustering space. This is done by quantizing the space into a finite number of cells and identifying cells that contain more than a given number of points as dense. Clusters are then formed by connecting these dense cells

62 Conclusion Constraint-based clustering allows us to define real-life constraints. Two types of real-life constraints are physical constraints and operational constraints

63 Questions ?

