
1 Data Mining - Cluster Analysis

2 What is clustering? Clustering is the process of grouping data into classes, or clusters, so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters.

3 Cluster Analysis Terminology 1. Unsupervised classification. 2. Aims to decompose or partition a data set into clusters. 3. Inter-cluster (between-group) distance is maximized and intra-cluster (within-group) distance is minimized. 4. It differs from classification: there are no predefined classes, no training data, and the number of clusters may not even be known.

4 Cluster Analysis Features ✔ Not a new field; it is a branch of statistics ✔ Many old algorithms are available ✔ Renewed interest due to data mining and machine learning ✔ New algorithms are being developed, in particular to deal with large amounts of data ✔ Evaluating a clustering method is difficult

5 Cluster Analysis Desired Features 1.Scalability 2.Only one scan of the dataset 3.Ability to stop and resume 4.Minimal input parameters 5.Robustness 6.Ability to discover different cluster shapes 7.Different data types 8.Result independent of data input order

6 Types of Data ● Numerical data - continuous variables on a roughly linear scale, e.g. marks, age, weight. Units can affect the analysis: a distance in metres is obviously a larger number than the same distance in kilometres, so the data may need to be scaled or normalised (a small sketch follows this slide). ● Binary data - many variables are binary, e.g. gender, married or not, u/g or p/g. ● Nominal data - similar to binary but can take more than two states, e.g. colour, staff position. ● Ordinal data - similar to nominal but the different values are ordered in a meaningful sequence. ● Ratio-scaled data - data on a nonlinear scale.
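A minimal normalisation sketch in Python, assuming min-max scaling of each attribute to [0, 1]; this is only one common option (the slides do not prescribe a method), and the values are made up for illustration.

def min_max_scale(values):
    # Rescale a list of attribute values to the range [0, 1].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

distances_m = [1200.0, 300.0, 4500.0]   # metres; would dwarf other attributes if left unscaled
ages = [18.0, 23.0, 21.0]               # years
print(min_max_scale(distances_m))       # both attributes now lie in [0, 1]
print(min_max_scale(ages))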

7 Distance ● A simple, well-understood concept. ● Distance has the following properties (assume x and y to be vectors): ✔ Distance is always non-negative ✔ Distance from x to x is zero ✔ Distance from x to y cannot be greater than the sum of the distance from x to z and the distance from z to y ✔ Distance from x to y is the same as from y to x. ● Examples: the sum of absolute differences ∑ |x_i - y_i| (the Manhattan distance) or the square root of ∑ (x_i - y_i)² (the Euclidean distance).

8 Distance Measure Methods Three common distance measures are: 1. Manhattan distance, or the sum of absolute differences: D(x, y) = ∑ |x_i - y_i| 2. Euclidean distance: D(x, y) = ( ∑ (x_i - y_i)² )^½ 3. Maximum difference, or Chebychev distance: D(x, y) = max_i |x_i - y_i|
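A small Python sketch of the three measures above; the function names are mine, not from the slides.

import math

def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def chebyshev(x, y):
    # Largest absolute coordinate difference.
    return max(abs(a - b) for a, b in zip(x, y))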

9 Examples Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8): (a) Compute the Euclidean distance between the two objects. (b) Compute the Manhattan distance between the two objects.
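Worked out with the formulas above: the coordinate differences are (2, 1, 6, 2), so (a) the Euclidean distance is (2² + 1² + 6² + 2²)^½ = √45 ≈ 6.71, and (b) the Manhattan distance is 2 + 1 + 6 + 2 = 11.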

10 Clustering Methods 1. Partitioning methods - given n objects, make k (≤ n) partitions (clusters) of the data and use an iterative relocation method. It is assumed that each cluster has at least one object and each object belongs to only one cluster. 2. Hierarchical methods - start with one cluster and then split it into smaller clusters, or start with each object in an individual cluster and then try to merge similar clusters.

11 Contd… 3. Density-based methods - for each data point in a cluster, at least a minimum number of points must exist within a given radius. 4. Grid-based methods - the object space is divided into a grid. 5. Model-based methods - a model is assumed, perhaps based on a probability distribution. (K-Means clustering, covered next, is the most common partitioning method.)

12 The K-Means Clustering The most commonly used method. 1. Select the number of clusters; let this number be k. 2. Pick k seeds as the centroids of the k clusters. The seeds may be picked randomly. 3. Compute the Euclidean distance of each object in the dataset from each of the centroids. 4. Allocate each object to the cluster it is nearest to, based on the distances computed in the previous step.

13 The K-Means Clustering 5. Compute the centroids of the clusters by computing the means of the attribute values of the objects in each cluster. 6. Check whether the stopping criterion has been met (i.e. cluster membership is unchanged). If yes, go to step 7; if not, go to step 3. 7. (Optional) One may decide to stop at this stage, or to split a cluster or combine two clusters, until a stopping criterion is met.
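A compact Python sketch of the steps above, assuming the first k records are used as seeds and iteration stops when membership is unchanged; the records at the end are hypothetical, purely for illustration.

import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_means(data, k):
    # Steps 1-2: take the first k records as the initial seeds (centroids).
    centroids = [list(data[i]) for i in range(k)]
    assignment = None
    while True:
        # Steps 3-4: allocate each object to its nearest centroid.
        new_assignment = [min(range(k), key=lambda c: euclidean(obj, centroids[c]))
                          for obj in data]
        # Step 6: stop when cluster membership no longer changes.
        if new_assignment == assignment:
            return new_assignment, centroids
        assignment = new_assignment
        # Step 5: recompute each centroid as the mean of its members' attribute values.
        for c in range(k):
            members = [obj for obj, a in zip(data, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

records = [(19, 65, 70, 60), (22, 80, 88, 75), (20, 55, 50, 45), (23, 78, 82, 70)]  # made-up data
labels, centres = k_means(records, k=3)
print(labels, centres)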

14 Example

15 ● Let the three seeds be the first three records. ● Now we compute the distances between the objects based on the four attributes. ● K-Means requires Euclidean distance, but we use the sum of absolute differences as well, to show how the distance metric can change the results.
Student  Age  Mark1  Mark2  Mark3
S1       18   73     75     57
S2       18   79     85     75
S3       23   70            52

16 Distance

17 Distance Measuring

18 Comments ● Euclidean distance was used in the example, since that is what K-Means requires. ● When Manhattan distance is used instead, the method is called the K-Median method. ● The results of using Manhattan distance and Euclidean distance are quite different. ● The last iteration changed the cluster membership of S3 from C3 to C1.

19 Comments on the K-Means Method Strengths: 1. Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally k, t << n. 2. Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms. Weaknesses: 1. Applicable only when the mean is defined; what about categorical data? 2. Need to specify k, the number of clusters, in advance. 3. Unable to handle noisy data and outliers. 4. Not suitable for discovering clusters with non-convex shapes.

20 Hierarchical Clustering Quite different from partitioning. Involves gradually merging different objects into clusters (agglomerative) or dividing large clusters into smaller ones (divisive). ● Types: ● Agglomerative: gradually merging different objects into clusters (bottom-up approach). ● Divisive: dividing large clusters into smaller ones (top-down approach).

21 The algorithm normally starts with each cluster consisting of a single data point. Using a measure of distance, the algorithm merges two clusters that are nearest, thus reducing the number of clusters. The process continues until all the data points are in one cluster. The algorithm requires that we be able to compute distances between two objects and also between two clusters.

22 Distance between Clusters ● Single Link method – nearest neighbor - distance between two clusters is the distance between the two closest points, one from each cluster. ● Complete Linkage method – furthest neighbor – distance between two clusters is the distance between the two furthest points, one from each cluster.

23 Distance between Clusters ● Centroid method - distance between two clusters is the distance between the two centroids, or centres of gravity, of the two clusters. ● Unweighted pair-group average method - distance between two clusters is the average distance between all pairs of objects in the two clusters. This means p*n distances need to be computed if p and n are the numbers of objects in the two clusters.
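A short Python sketch of the four cluster-to-cluster distances described on these two slides; the helper names and the sample points are mine.

import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(A, B):
    # Distance between the two closest points, one from each cluster.
    return min(euclidean(a, b) for a, b in product(A, B))

def complete_link(A, B):
    # Distance between the two furthest points, one from each cluster.
    return max(euclidean(a, b) for a, b in product(A, B))

def centroid_link(A, B):
    # Distance between the two cluster centroids.
    centroid = lambda C: [sum(col) / len(C) for col in zip(*C)]
    return euclidean(centroid(A), centroid(B))

def average_link(A, B):
    # Average of all p*n pairwise distances between the two clusters.
    return sum(euclidean(a, b) for a, b in product(A, B)) / (len(A) * len(B))

A = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5)]
B = [(6.0, 7.0), (7.0, 6.0)]
print(single_link(A, B), complete_link(A, B), centroid_link(A, B), average_link(A, B))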

24 Single Link Consider two clusters, each with 5 objects. The smallest distance between a point in one cluster and a point in the other is computed. [Figure: two groups of points labelled A and B]

25 Complete Link Consider two clusters, each with 5 objects. The largest distance between a point in one cluster and a point in the other is computed. [Figure: two groups of points labelled A and B]

26 Centroid Link Consider two clusters. The distance between the two centroids is computed. [Figure: two groups of points labelled A and B, with their centroids marked C]

27 Average Link Consider two clusters, each with 4 objects. All 16 pairwise distances are computed and their average is found. [Figure: two groups of points labelled A and B]

28 Algorithm 1. Each object is assigned to a cluster of its own. 2. Build a distance matrix by computing distances between every pair of objects. 3. Find the two nearest clusters. 4. Merge them and remove them from the list. 5. Update the matrix by including the distance of all objects from the new cluster. 6. Continue until all objects belong to one cluster.
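A minimal Python sketch of this agglomerative procedure, assuming single-link (nearest-neighbour) distance between clusters; for brevity it recomputes distances on each pass rather than maintaining an explicit matrix, and the sample points are made up.

import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(A, B):
    # Smallest distance between a point in A and a point in B.
    return min(euclidean(a, b) for a in A for b in B)

def agglomerative(objects):
    # Step 1: each object starts in a cluster of its own.
    clusters = [[obj] for obj in objects]
    merges = []
    # Step 6: continue until all objects belong to one cluster.
    while len(clusters) > 1:
        # Steps 2-3: find the two nearest clusters.
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: single_link(clusters[p[0]], clusters[p[1]]))
        # Steps 4-5: merge them and replace the two old clusters with the new one.
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

points = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.5, 5.2), (9.0, 1.0)]
for a, b in agglomerative(points):
    print("merged", a, "with", b)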

29 Example

30 Distance Matrix

31 Example ● The matrix gives the distance of each object from every other object. ● The smallest distance, 8, is between objects S4 and S8. They are combined into a cluster, which is placed where S4 was. ● Compute distances from this cluster and update the distance matrix.

32 Updated Matrix

33 Contd.... ● The smallest distance now is 15, between objects S5 and S6. They are combined into a cluster, and S5 and S6 are removed. ● Compute distances from this cluster and update the distance matrix.

34 Updated Matrix

35 Contd... ● Look at the shortest distance again. ● S3 and S7 are a distance of 16 apart. We merge them and put C3 where S3 was. ● The updated distance matrix is given on the next slide. It shows that C2 and S1 have the smallest distance and are merged in the next step.

36 The result will look like this

37 Example 2 ● The application of AGNES (AGglomerative NESting), an agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method. ● A data set of five objects: a, b, c, d, e.

38 Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}.

39 More on Hierarchical Clustering Methods Major weaknesses of agglomerative clustering methods: 1. Do not scale well: time complexity of at least O(n²), where n is the total number of objects. 2. Can never undo (backtrack) what was done previously. Integration of hierarchical with distance-based clustering: ● BIRCH (1996): uses a CF-tree (clustering feature tree) and incrementally adjusts the quality of sub-clusters. ● CURE (1998): selects well-scattered (representative) points from the cluster and then shrinks them towards the centre of the cluster by a specified fraction. ● CHAMELEON (1999): hierarchical clustering using dynamic modelling.

40 BIRCH (1996) BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan and Livny (SIGMOD'96). Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering. Phase 1: scan the DB to build an initial in-memory CF-tree (a multi-level compression of the data that tries to preserve its inherent clustering structure). Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree. Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans. Weakness: handles only numeric data, and is sensitive to the order of the data records.

41 CF

42 Three values are stored to represent a cluster (its Clustering Feature, CF): ● N - the number of points in the cluster ● LS - the linear sum of the points in the cluster ● SS - the square sum of the points in the cluster ● CFs are additive: the CF of two merged clusters is the sum of their CFs.
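A small sketch of a CF triple and its additivity, assuming SS is kept per dimension (conventions vary; BIRCH is sometimes described with a scalar square sum); the sample points are made up.

def clustering_feature(points):
    # CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum.
    dims = range(len(points[0]))
    n = len(points)
    ls = [sum(p[d] for p in points) for d in dims]
    ss = [sum(p[d] ** 2 for p in points) for d in dims]
    return n, ls, ss

def merge_cf(cf1, cf2):
    # Additivity: the CF of the union of two clusters is the component-wise sum.
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, [a + b for a, b in zip(ls1, ls2)], [a + b for a, b in zip(ss1, ss2)]

a = [(1.0, 2.0), (2.0, 3.0)]
b = [(4.0, 4.0)]
print(merge_cf(clustering_feature(a), clustering_feature(b)))
print(clustering_feature(a + b))  # same result, without re-scanning the points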

43

44 Conclusion ● A clustering algorithm that takes into account I/O costs and memory limitations. ● Utilizes local information (each clustering decision is made without scanning all data points). ● Not every data point is equally important for the clustering purpose.

45

