Basic techniques for cluster detection

Chapter Four: Basic techniques for cluster detection

Chapter Overview
- The problem of cluster detection
- Measuring proximity between data objects
- The K-means cluster detection method
- The agglomeration cluster detection method
- Performance issues of the basic methods
- Cluster evaluation and interpretation
- Undertaking a clustering task in Weka

Problem of Cluster Detection: What is cluster detection?
- Cluster: a group of objects known as members
- The centre of a cluster is known as the centroid
- Members of a cluster are similar to each other
- Members of different clusters are different
- Clustering is the process of discovering clusters
[Figure: example clusters; the legend marks the centroids]

Problem of Cluster Detection: Outputs of the cluster detection process
- Assigned cluster tag for members of a cluster
- Cluster summary: size, centroid, variations, etc.
- Example:
  - Cluster 1: Size: 6; Centroid: (154, 90); Variation: bodyHeight = 5.16, bodyWeight = 5.32
  - Cluster 2: Size: 5; Centroid: (130, 51); Variation: bodyHeight = 10, bodyWeight = 14.48

Problem of Cluster Detection: Basic elements of a clustering solution
- A sensible measure for similarity, e.g. Euclidean
- An effective and efficient clustering algorithm, e.g. K-means
- A goodness-of-fit function for evaluating the quality of resulting clusters, e.g. SSE
[Figure: internal variation versus inter-cluster distance: good or bad?]

Problem of Cluster Detection: Requirements for clustering solutions
- Scalability
- Able to deal with different types of attributes
- Able to discover clusters of arbitrary shapes
- Minimal requirements for domain knowledge to determine input parameters
- Able to deal with noise and outliers
- Insensitive to order of input data records
- Able to deal with high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

Measures of Proximity: Basics
- Proximity between two data objects is represented by either similarity or dissimilarity
- Similarity: a numeric measure of the degree of alikeness between two objects
- Dissimilarity: a numeric measure of the degree of difference between two objects
- Similarity and dissimilarity measures are often convertible; normally dissimilarity is preferred
- Measure of dissimilarity:
  - Measuring the difference between values of the corresponding attributes
  - Combining the measures of the differences

Measures of Proximity: Distance function
- Metric properties of function d:
  - d(x, y) ≥ 0 and d(x, x) = 0, for all data objects x and y
  - d(x, y) = d(y, x), for all data objects x and y
  - d(x, y) ≤ d(x, z) + d(z, y), for all data objects x, y and z
- The difference of values for a single attribute is directly related to the domain type of the attribute
- It is important to consider which operations are applicable
- Some measure is better than no measure at all

Measures of Proximity: Difference between attribute values
- Difference between nominal values
  - If two names are the same, the difference is 0; otherwise it is the maximum, e.g. diff("John", "John") = 0, diff("John", "Mary") = ∞
  - The same applies to the difference between binary values, e.g. diff(Yes, No) = ∞
- Difference between ordinal values
  - Different degrees of proximity can be compared, e.g. diff(A, B) < diff(A, D)
  - Convert ordinal values to consecutive integers, e.g. A: 5, B: 4, C: 3, D: 2, E: 1, so that A − B = 1 and A − D = 3
- Distance measure for interval and ratio attributes
- Difference between values that may be unknown: diff(NULL, v) = |v|, diff(NULL, NULL) = ∞

Measures of Proximity: Distance between data objects
- Ratio of mismatched features for nominal attributes
- Given two data objects i and j with p nominal attributes, let m represent the number of attributes where the values of the two objects match. The distance is the ratio of mismatched attributes: $d(i, j) = \frac{p - m}{p}$
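The ratio of mismatched features can be sketched in a few lines of Python. This is a minimal illustration, not code from the chapter: the function name `nominal_distance` and the sample customer records are hypothetical, and it assumes the distance is (p − m) / p, where p is the number of nominal attributes and m the number of matching values.

```python
def nominal_distance(obj_i, obj_j):
    """Ratio of mismatched nominal attributes, (p - m) / p."""
    if len(obj_i) != len(obj_j):
        raise ValueError("both objects must have the same number of attributes")
    p = len(obj_i)
    # m counts the attributes on which the two objects agree.
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p


# Hypothetical records: (gender, city, occupation).
customer_a = ("female", "London", "student")
customer_b = ("female", "Leeds", "student")

# One mismatch out of three attributes.
print(nominal_distance(customer_a, customer_b))
```

Identical objects give a distance of 0 and objects that differ on every attribute give 1, so the measure stays within [0, 1].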

Measures of Proximity: Distance between data objects
- Minkowski function for interval/ratio attributes:
  $d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q \right)^{1/q}$
- Special cases:
  - Manhattan distance (q = 1)
  - Euclidean distance (q = 2)
  - Supremum/Chebyshev distance (q = ∞)
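As a rough sketch of the Minkowski function and its three special cases, the Python below computes the distance for a given q. The function name `minkowski` is an illustrative choice, and the q = ∞ case is handled as the maximum absolute difference, which is the limit of the general formula.

```python
import math


def minkowski(x, y, q):
    """Minkowski distance between two equal-length numeric vectors."""
    differences = [abs(a - b) for a, b in zip(x, y)]
    if q == math.inf:
        # Supremum/Chebyshev: the largest single-attribute difference.
        return max(differences)
    return sum(d ** q for d in differences) ** (1.0 / q)


point_i = (0, 0)
point_j = (3, 4)

print(minkowski(point_i, point_j, 1))         # Manhattan: 3 + 4
print(minkowski(point_i, point_j, 2))         # Euclidean: sqrt(9 + 16)
print(minkowski(point_i, point_j, math.inf))  # Chebyshev: max(3, 4)
```

For these two points the Manhattan distance is the largest and the Chebyshev distance the smallest, which holds in general because increasing q gives more weight to the single largest difference.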

Measures of Proximity: Distance between data objects
- Minkowski function for interval/ratio attributes (example)
[Figure: Manhattan, Euclidean and Chebyshev distances illustrated on the attributes No. of Trans, Tenure and Revenue]

Measures of Proximity: Distance between data objects
- For binary attributes, given two data objects i and j with p binary attributes:
  - f00: the number of attributes where i is 0 and j is 0
  - f01: the number of attributes where i is 0 and j is 1
  - f10: the number of attributes where i is 1 and j is 0
  - f11: the number of attributes where i is 1 and j is 1
- Simple mismatch coefficient (SMC) for symmetric values: $SMC(i, j) = \frac{f_{01} + f_{10}}{f_{00} + f_{01} + f_{10} + f_{11}}$
- Jaccard coefficient (JC), defined for asymmetric values: $JC(i, j) = \frac{f_{01} + f_{10}}{f_{01} + f_{10} + f_{11}}$
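The two binary measures can be sketched as follows. This is an illustration under an assumption: both coefficients are taken in their mismatch (distance) form, so SMC = (f01 + f10) / (f00 + f01 + f10 + f11) and the Jaccard measure drops f00 from the denominator. The function names and the sample word vectors are hypothetical.

```python
def binary_counts(i, j):
    """Return (f00, f01, f10, f11) for two equal-length 0/1 vectors."""
    f00 = f01 = f10 = f11 = 0
    for a, b in zip(i, j):
        if a == 0 and b == 0:
            f00 += 1
        elif a == 0 and b == 1:
            f01 += 1
        elif a == 1 and b == 0:
            f10 += 1
        else:
            f11 += 1
    return f00, f01, f10, f11


def simple_mismatch(i, j):
    """Mismatches over all attributes (symmetric binary values)."""
    f00, f01, f10, f11 = binary_counts(i, j)
    return (f01 + f10) / (f00 + f01 + f10 + f11)


def jaccard_distance(i, j):
    """Mismatches over attributes where at least one object is 1."""
    f00, f01, f10, f11 = binary_counts(i, j)
    denominator = f01 + f10 + f11
    # Assumption: two all-zero objects are treated as distance 0.
    return (f01 + f10) / denominator if denominator else 0.0


# Hypothetical word-occurrence vectors over a ten-word vocabulary.
doc_i = (1, 0, 0, 0, 0, 0, 0, 0, 0, 1)
doc_j = (0, 0, 0, 0, 0, 0, 0, 0, 1, 1)

print(simple_mismatch(doc_i, doc_j))
print(jaccard_distance(doc_i, doc_j))
```

Here the two documents differ on two of the three words that occur at all, so the Jaccard measure (about 0.67) reports them as far more different than SMC (0.2), which is diluted by the seven shared absences.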

Measures of Proximity: Distance between data objects
- For binary attributes (example)
  - First pair: SMC: not that different; JC: very different (two-word, out of 3, difference)
  - Second pair: SMC: very similar; JC: still quite different (one-word, out of 2, difference)

Measures of Proximity: Similarity between data objects
- Cosine similarity function
  - Treat the two data objects as vectors
  - Similarity is measured by the cosine of the angle θ between the two vectors
  - Similarity is 1 when θ = 0°, and 0 when θ = 90°
- Similarity function: $\cos(i, j) = \frac{i \cdot j}{\|i\| \, \|j\|}$

Measures of Proximity: Similarity between data objects
- Cosine similarity function (illustrated)
- Given two data objects: x = (3, 2, 0, 5) and y = (1, 0, 0, 0)
  - x · y = 3*1 + 2*0 + 0*0 + 5*0 = 3
  - ||x|| = sqrt(3² + 2² + 0² + 5²) ≈ 6.16
  - ||y|| = sqrt(1² + 0² + 0² + 0²) = 1
- Then the similarity between x and y: cos(x, y) = 3 / (6.16 * 1) ≈ 0.49
- The dissimilarity between x and y: 1 − cos(x, y) ≈ 0.51
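The worked cosine example can be checked with a short Python sketch. The function name `cosine_similarity` is an illustrative choice; the vectors are the ones used in the slide.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero numeric vectors."""
    dot_product = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot_product / (norm_x * norm_y)


x = (3, 2, 0, 5)
y = (1, 0, 0, 0)

similarity = cosine_similarity(x, y)
print(round(similarity, 2))      # 0.49
print(round(1 - similarity, 2))  # 0.51
```

Because the measure depends only on direction, scaling either vector by a positive constant leaves the similarity unchanged.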

Measures of Proximity: Distance between data objects
- Combining heterogeneous attributes, based on the principle of the ratio of mismatched features
- For the kth attribute, compute the dissimilarity d_k in [0, 1]
- Set the indicator variable δ_k as follows:
  - δ_k = 0, if the kth attribute is an asymmetric binary attribute and both objects have value 0 for the attribute
  - δ_k = 1, otherwise
- Compute the overall distance between i and j as: $d(i, j) = \frac{\sum_{k=1}^{p} \delta_k d_k}{\sum_{k=1}^{p} \delta_k}$
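A minimal sketch of combining heterogeneous attributes is given below. Several details are assumptions made for illustration only: the attribute kinds are labelled "nominal", "numeric" and "asymmetric_binary", numeric differences are scaled into [0, 1] by dividing by a supplied attribute range, and the function name `mixed_distance` is hypothetical.

```python
def mixed_distance(obj_i, obj_j, kinds, ranges):
    """Average of per-attribute dissimilarities d_k over attributes with delta_k = 1."""
    numerator = 0.0
    denominator = 0
    for k, kind in enumerate(kinds):
        a, b = obj_i[k], obj_j[k]
        if kind == "asymmetric_binary" and a == 0 and b == 0:
            # delta_k = 0: a shared absence carries no information.
            continue
        if kind == "numeric":
            d_k = abs(a - b) / ranges[k]   # scaled into [0, 1]
        else:
            d_k = 0.0 if a == b else 1.0   # match / mismatch
        numerator += d_k
        denominator += 1                   # delta_k = 1
    # Assumption: if every attribute is skipped, report distance 0.
    return numerator / denominator if denominator else 0.0


kinds = ("nominal", "numeric", "asymmetric_binary")
ranges = (None, 50.0, None)  # only the numeric attribute needs a range

obj_i = ("red", 20, 0)
obj_j = ("blue", 30, 0)

print(mixed_distance(obj_i, obj_j, kinds, ranges))
```

In this example the shared 0 on the asymmetric binary attribute is ignored, so the result is the average of the nominal mismatch (1) and the scaled numeric difference (0.2), about 0.6.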

Measures of Proximity: Distance between data objects
- Attribute scaling
  - When scaling is needed:
    - on the same attribute, when data from different data sources are merged
    - on different attributes, when data is projected into the N-space
  - Normalising variables into comparable ranges:
    - divide each value by the mean
    - divide each value by the range
    - z-score
- Attribute weighting: the overall dissimilarity function is weighted per attribute
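The three normalisation options can be sketched with the Python standard library. The function names are illustrative, and the z-score here uses the population standard deviation; whether the population or sample form is intended is an assumption.

```python
import statistics


def scale_by_mean(values):
    """Divide each value by the mean of the attribute."""
    mean = statistics.mean(values)
    return [v / mean for v in values]


def scale_by_range(values):
    """Divide each value by the range (max - min) of the attribute."""
    value_range = max(values) - min(values)
    return [v / value_range for v in values]


def z_scores(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std_dev = statistics.pstdev(values)
    return [(v - mean) / std_dev for v in values]


revenue = [10, 20, 30]  # hypothetical attribute values

print(scale_by_mean(revenue))
print(scale_by_range(revenue))
print(z_scores(revenue))
```

After any of these transformations, attributes measured in very different units (say revenue in pounds and tenure in years) contribute on a comparable scale to a distance function.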

K-means, a Basic Clustering Method: Outline of main steps
1. Define the number of clusters (k)
2. Choose k data objects randomly to serve as the initial centroids for the k clusters
3. Assign each data object to the cluster represented by its nearest centroid
4. Find a new centroid for each cluster by calculating the mean vector of its members
5. Undo the memberships of all data objects. Go back to Step 3 and repeat the process until cluster membership no longer changes or a maximum number of iterations is reached.
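The main steps of K-means can be sketched in plain Python as below. This is an illustrative implementation under stated assumptions rather than the chapter's own code: it uses Euclidean distance, a seeded random generator for reproducibility, and leaves a centroid where it is if its cluster becomes empty. The names `k_means`, `max_iterations` and `seed` are hypothetical. It needs Python 3.8 or later for `math.dist`.

```python
import math
import random


def k_means(points, k, max_iterations=100, seed=0):
    """Return (assignment, centroids) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 2: choose k data objects at random as the initial centroids.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Step 3: assign each object to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop when cluster membership no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recompute each centroid as the mean vector of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )
    return assignment, centroids


data = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
labels, centres = k_means(data, k=2)
print(labels)
print(centres)
```

Changing `seed` changes the initial centroids, which is a simple way to observe the sensitivity to initialisation on less clearly separated data.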

K-means, a Basic Clustering Method Illustration of the method:

K-means, a Basic Clustering Method: Strengths & weaknesses
- Strengths
  - Simple and easy to implement
  - Quite efficient
- Weaknesses
  - Need to specify the value of k, but we may not know what the value should be beforehand
  - Sensitive to the choice of the initial k centroids: the result can be non-deterministic
  - Sensitive to noise
  - Applicable only when the mean is meaningful for the given data set

K-means, a Basic Clustering Method: Overcoming the weaknesses
- Using cluster quality to determine the value of k
- Improving how the initial k centroids are chosen
  - Running the clustering a number of times and selecting the result with the highest quality
  - Using hierarchical clustering to locate the centres
  - Finding centres that are farther apart
- Dealing with noise
  - Removing outliers before clustering?
  - K-medoid method: using the data object nearest to the virtual centre as the centroid
- When the mean cannot be defined
  - K-mode method: calculating the mode instead of the mean for the centre of the cluster

K-means, a Basic Clustering Method: Value of k and cluster quality
[Figure: scree plot of cluster errors (e.g. SSE) against the number of clusters]

K-means, a Basic Clustering Method: Choosing initial k centroids
- Running the clustering many times (only trial and error)
- Using hierarchical clustering to locate the centres (why partition based?)
- Finding centres that are farther apart

K-means, a Basic Clustering Method K-medoid: Bisecting K-means

The Agglomeration Method: Outline of main steps
1. Take all n data objects as individual clusters and build an n x n dissimilarity matrix. The matrix stores the distance between any pair of data objects.
2. While the number of clusters > 1 do:
   - Find a pair of data objects/clusters with the minimum distance
   - Merge the two data objects/clusters into a bigger cluster
   - Replace the entries in the matrix for the original clusters or objects by the cluster tag of the newly formed cluster
   - Re-calculate relevant distances and update the matrix

The Agglomeration Method Illustration of the method

The Agglomeration Method: Illustration of the method (dendrogram)
[Figure: dendrogram, with the number of clusters (1 to 10) marked along the axis]

The Agglomeration Method: Agglomeration schemes
- Single link: the distance between the two closest points of the two clusters
- Complete link: the distance between the two farthest points of the two clusters
- Group average: the average of all pair-wise distances between members of the two clusters
- Centroids: the distance between the centroids of the two clusters
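The agglomeration loop and three of the schemes above can be sketched together in Python. This is a simplified illustration, not the chapter's algorithm verbatim: instead of maintaining and updating an n x n dissimilarity matrix, it recomputes cluster distances from the member points at every pass, which is easier to read but slower. The names `linkage_distance` and `agglomerate` are hypothetical, and Euclidean distance is assumed (Python 3.8 or later for `math.dist`).

```python
import math
from itertools import combinations


def linkage_distance(cluster_a, cluster_b, scheme):
    """Distance between two clusters under the chosen agglomeration scheme."""
    pairwise = [math.dist(a, b) for a in cluster_a for b in cluster_b]
    if scheme == "single":
        return min(pairwise)                  # two closest points
    if scheme == "complete":
        return max(pairwise)                  # two farthest points
    if scheme == "average":
        return sum(pairwise) / len(pairwise)  # group average
    raise ValueError(f"unknown scheme: {scheme}")


def agglomerate(points, scheme="single"):
    """Merge clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]  # every object starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the minimum distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage_distance(
                clusters[pair[0]], clusters[pair[1]], scheme
            ),
        )
        distance = linkage_distance(clusters[a], clusters[b], scheme)
        merges.append((clusters[a], clusters[b], distance))
        # Replace the two clusters by the merged cluster.
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges


for left, right, distance in agglomerate([(0, 0), (0, 1), (5, 5)], "single"):
    print(left, right, round(distance, 2))
```

The merge history is the information a dendrogram displays: each entry records which two clusters were joined and at what distance.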

The Agglomeration Method: Strengths and weaknesses
- Strengths
  - Deterministic results
  - Multiple possible versions of clustering
  - No need to specify the value of k beforehand
  - Can create clusters of arbitrary shapes (single-link)
- Weaknesses
  - Does not scale up for large data sets
  - Cannot undo membership like K-means
  - Problems with agglomeration schemes (see Chapter 5)

Cluster Evaluation & Interpretation: Cluster quality
- Principle:
  - High similarity (low variation) within a cluster
  - High dissimilarity between clusters
- The measures (where Ck is cluster k and rk is the centroid of Ck):
  - Cohesion: sum of squared errors (SSE) for a cluster, and the sum of SSEs over all clusters (WC)
  - Separation: sum of distances between clusters (BC)
- Combining cohesion and separation, the ratio BC/WC is a good indicator of overall quality
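The cohesion and separation measures can be sketched as follows. The exact formulas are assumptions for illustration: SSE is taken as the sum of squared Euclidean distances from each member to its cluster centroid, WC as the sum of SSEs over all clusters, and BC as the sum of squared Euclidean distances between every pair of centroids. The function names are hypothetical (Python 3.8 or later for `math.dist`).

```python
import math
from itertools import combinations


def centroid(cluster):
    """Mean vector of the members of a cluster."""
    return tuple(sum(column) / len(cluster) for column in zip(*cluster))


def sse(cluster):
    """Sum of squared distances from each member to the cluster centroid."""
    centre = centroid(cluster)
    return sum(math.dist(x, centre) ** 2 for x in cluster)


def within_cluster(clusters):
    """WC: total cohesion, the sum of SSEs over all clusters."""
    return sum(sse(c) for c in clusters)


def between_cluster(clusters):
    """BC: separation, summed over all pairs of centroids (assumed squared)."""
    centres = [centroid(c) for c in clusters]
    return sum(math.dist(a, b) ** 2 for a, b in combinations(centres, 2))


clusters = [[(0, 0), (2, 0)], [(10, 0), (12, 0)]]
wc = within_cluster(clusters)
bc = between_cluster(clusters)
print(wc, bc, bc / wc)
```

For these two tight, well-separated clusters WC is 4 and BC is 100, giving a ratio of 25; looser or closer clusters would lower the ratio.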

Cluster Evaluation & Interpretation: Cluster quality illustrated
[Figure: Cluster C1 and Cluster C2]
C1 is a better quality cluster than C2.

Cluster Evaluation & Interpretation: Using cluster quality for clustering
- With K-means:
  - Add an outer loop for different values of K (from low to high)
  - At each iteration, conduct K-means clustering using the current K
  - Measure the overall cluster quality and decide whether the resulting cluster quality is acceptable
  - If not, increase the value of K by 1 and repeat the process
- With agglomeration:
  - Traverse the hierarchy level by level from the root
  - At each level, evaluate the overall quality of the clusters
  - If the quality is acceptable, take the clusters at that level as the final result
  - If not, move to the next level and repeat the process

Cluster Evaluation & Interpretation: Cluster tendency
- Cluster tendency: do clusters really exist?
- Measures for tendency:
  - Quality measure: when BC and WC are similar, clusters do not exist
  - Hopkins statistic, defined in terms of:
    - P: a set of n randomly generated data points
    - S: a sample of n data points from the data set
    - tp: the nearest neighbour of point p in S
    - tm: the nearest neighbour of point m in P

Cluster Evaluation & Interpretation: Cluster interpretation
- Within cluster
  - How values of the clustering attributes are distributed
  - How values of supplementary attributes are distributed
- Outside cluster
  - Exceptions and anomalies
- Between clusters
  - Comparative view
[Figures: value distributions for the population compared with value distributions for the cluster]

K-means & Agglomeration in Weka: Clustering in Weka, Preprocess page
- Specify “No Class”
- Specify all attributes for clustering

K-means & Agglomeration in Weka: Clustering in Weka, Cluster page
1. Choose a clustering solution
2. Set parameters
3. Execute the chosen solution
4. Observe results
5. Select “Visualise Cluster Assignment”

K-means & Agglomeration in Weka: Clustering in Weka, SimpleKMeans
- Specify the distance function used
- Specify the value of K
- Specify the maximum number of iterations
- Specify the random seed affecting the initial random selection of K centroids

K-means & Agglomeration in Weka: Clustering in Weka, SimpleKMeans
- Save membership into a file
- Visualise cluster membership

K-means & Agglomeration in Weka: Clustering in Weka, Agglomeration
- Select Cobweb
- Tree-shaped dendrogram

Chapter Summary
- A clustering solution must provide a sensible proximity function, an effective algorithm and a cluster evaluation function
- Proximity is normally measured by a distance function that combines measures of value differences across attributes
- The K-means method continues to refine prototype partitions until membership changes no longer occur
- The agglomeration method constructs all possible groupings of individual data objects into a hierarchy of clusters
- Good clustering results mean high similarity among members of a cluster and low similarity between members of different clusters
- The normal procedure for clustering in Weka is explained

References
- Read Chapter 4 of Data Mining Techniques and Applications
- Useful further references:
  - Tan, P-N., Steinbach, M. and Kumar, V. (2006), Introduction to Data Mining, Addison-Wesley, Chapters 2 (section 2.4) and 8.