Clustering.

Clustering Techniques

- Partitioning methods
- Hierarchical methods
- Density-based methods
- Grid-based methods
- Model-based methods

Types of Data

Data Matrix (n objects described by p variables):

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Dissimilarity Matrix (n-by-n, lower triangular):

$$
\begin{bmatrix}
0      &        &        &        \\
d(2,1) & 0      &        &        \\
d(3,1) & d(3,2) & 0      &        \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & 0
\end{bmatrix}
$$

where d(i, j) is the difference or dissimilarity between objects i and j.

Interval-Scaled Variables

Standardization:
- mean absolute deviation: $s_f = \frac{1}{n}\big(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\big)$
- standardized measurement: $z_{if} = \dfrac{x_{if} - m_f}{s_f}$

Compute dissimilarity between objects:
- Euclidean distance: $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
- Manhattan (city-block) distance: $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
- Minkowski distance: $d(i, j) = \big(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\big)^{1/q}$, where q = 1 gives Manhattan and q = 2 gives Euclidean distance
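A minimal Python sketch of these formulas (not part of the slides; the function names and sample data are illustrative):

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional objects.
    q=1 gives Manhattan (city-block), q=2 gives Euclidean."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def standardize(X):
    """Standardize each column using the mean absolute deviation s_f."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                      # m_f
    s = np.mean(np.abs(X - m), axis=0)      # mean absolute deviation s_f
    return (X - m) / s                      # z_if

X = np.array([[1.0, 2.0], [4.0, 6.0], [0.0, 1.0]])
Z = standardize(X)
print(minkowski(Z[0], Z[1], q=1))   # Manhattan distance
print(minkowski(Z[0], Z[1], q=2))   # Euclidean distance
```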

Binary Variables

A binary variable has only two states: 0 (absent) or 1 (present), e.g. smoker: yes or no. Dissimilarity between two objects i and j described by binary variables is computed from a contingency table (assuming all attributes have the same weight):

                object j
                 1       0       sum
object i   1     q       r       q + r
           0     s       t       s + t
          sum   q + s   r + t      p

- asymmetric binary attributes: $d(i, j) = \dfrac{r + s}{q + r + s}$
- symmetric binary attributes: $d(i, j) = \dfrac{r + s}{q + r + s + t}$
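A short sketch of this computation in Python (illustrative, not from the slides):

```python
import numpy as np

def binary_dissimilarity(i, j, asymmetric=False):
    """Dissimilarity between two objects described by 0/1 attributes."""
    i, j = np.asarray(i), np.asarray(j)
    q = np.sum((i == 1) & (j == 1))   # 1-1 matches
    r = np.sum((i == 1) & (j == 0))
    s = np.sum((i == 0) & (j == 1))
    t = np.sum((i == 0) & (j == 0))   # 0-0 matches, ignored for asymmetric attributes
    denom = q + r + s if asymmetric else q + r + s + t
    return (r + s) / denom

print(binary_dissimilarity([1, 0, 1, 0], [1, 1, 0, 0]))                   # symmetric: 0.5
print(binary_dissimilarity([1, 0, 1, 0], [1, 1, 0, 0], asymmetric=True))  # asymmetric: ~0.667
```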

Nominal Variables

A generalization of the binary variable: it can take more than two states, e.g. color: red, green, blue.

$d(i, j) = \dfrac{p - m}{p}$, where m is the number of matches and p is the total number of attributes.

Weights can be used, e.g. assigning greater weight to matches on variables having a larger number of states.

Ordinal Variables

Resemble nominal variables, except that the states are ordered in a meaningful sequence, e.g. medal: gold, silver, bronze. The value of variable f for the i-th object is $x_{if}$, and f has $M_f$ ordered states representing the ranking $1, \ldots, M_f$. Replace each $x_{if}$ by its corresponding rank $r_{if} \in \{1, \ldots, M_f\}$.
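A small Python sketch of both measures (function names are my own, for illustration):

```python
def nominal_dissimilarity(i, j):
    """Simple matching dissimilarity for nominal attributes: (p - m) / p."""
    p = len(i)
    m = sum(a == b for a, b in zip(i, j))   # number of matching states
    return (p - m) / p

def ordinal_to_interval(rank, M):
    """Map a rank r in {1, ..., M} to [0, 1] so it can be treated as interval-scaled."""
    return (rank - 1) / (M - 1)

print(nominal_dissimilarity(["red", "green", "blue"], ["red", "blue", "blue"]))  # ~0.333
print(ordinal_to_interval(2, 3))   # silver out of gold/silver/bronze -> 0.5
```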

Variables of Mixed Types

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

where the indicator $\delta_{ij}^{(f)} = 0$ if either $x_{if}$ or $x_{jf}$ is missing, or if $x_{if} = x_{jf} = 0$ and variable f is asymmetric binary; otherwise $\delta_{ij}^{(f)} = 1$.

The contribution $d_{ij}^{(f)}$ of variable f to the dissimilarity depends on its type:
- If f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$; otherwise $d_{ij}^{(f)} = 1$.
- If f is interval-based: $d_{ij}^{(f)} = \dfrac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$, where h runs over all non-missing objects for variable f.
- If f is ordinal or ratio-scaled: compute the rank $r_{if}$ and $z_{if} = \dfrac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled.
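A condensed sketch of the mixed-type formula, assuming the attribute types, value ranges and rank counts are supplied by the caller (all names and sample values are illustrative):

```python
def mixed_dissimilarity(xi, xj, types, ranges=None, M=None):
    """d(i, j) for objects with mixed attribute types.
    types[f] is one of 'nominal', 'binary_asym', 'interval', 'ordinal'.
    ranges[f] = (min, max) over all objects for interval variables;
    M[f] = number of ordered states for ordinal variables."""
    num, den = 0.0, 0.0
    for f, t in enumerate(types):
        a, b = xi[f], xj[f]
        if a is None or b is None:                    # missing value: delta = 0
            continue
        if t == "binary_asym" and a == 0 and b == 0:  # negative match on asymmetric binary
            continue
        if t in ("nominal", "binary_asym"):
            d = 0.0 if a == b else 1.0
        elif t == "interval":
            lo, hi = ranges[f]
            d = abs(a - b) / (hi - lo)
        else:  # ordinal: map ranks to [0, 1] first, then treat as interval-scaled
            d = abs((a - 1) / (M[f] - 1) - (b - 1) / (M[f] - 1))
        num += d
        den += 1.0
    return num / den

# two objects: (color, smoker, age, medal rank)
print(mixed_dissimilarity(("red", 1, 30, 2), ("blue", 0, 45, 1),
                          types=["nominal", "binary_asym", "interval", "ordinal"],
                          ranges={2: (20, 60)}, M={3: 3}))
```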

Clustering Methods

1. Partitioning (k = number of clusters)
2. Hierarchical (hierarchical decomposition of objects)

TV-trees of order k
- Given: a set of N vectors.
- Goal: divide these points into disjoint clusters so that the points in each cluster are similar with respect to a maximal number of coordinates (called active dimensions).
- A TV-tree of order 2 has two clusters per node.
- Procedure: divide the set of N points into Z clusters maximizing the total number of active dimensions, then repeat the same procedure for each cluster.

Density-based methods
- Can find clusters of arbitrary shape.
- A given cluster can grow as long as the density in its neighborhood exceeds some threshold, i.e. for each point, the neighborhood of a given radius must contain at least a minimum number of points.
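The density rule above is the one DBSCAN implements. A minimal usage sketch with scikit-learn (the library and the parameter values are my own choice, not prescribed by the slides):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [1.1, 1.0],   # one dense group
              [8.0, 8.1], [8.2, 7.9], [8.1, 8.0],   # another dense group
              [4.0, 4.0]])                           # isolated point

# eps: neighborhood radius; min_samples: minimum number of points in that neighborhood
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # e.g. [0 0 0 1 1 1 -1]; -1 marks noise (too sparse to join any cluster)
```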

Partitioning Methods

1. K-means method (n objects into k clusters)

Cluster similarity is measured with respect to the mean value of the objects in a cluster (the cluster's center of gravity).
- Select k points at random (call them means).
- Assign each object to the nearest mean.
- Compute the new mean of each cluster.
- Repeat until the criterion function converges.

Squared error criterion (the function we try to minimize):

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$

This method is sensitive to outliers.

2. K-medoids method

Instead of a mean, take a medoid: the most centrally located object in a cluster.
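A plain k-means sketch following the steps above (illustrative code, not from the slides):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means: random initial means, assign, recompute, repeat."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # k random points as initial means
    for _ in range(max_iter):
        # assign each object to the nearest mean
        dist = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # recompute the mean of each cluster
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else means[i] for i in range(k)])
        if np.allclose(new_means, means):   # criterion function has converged
            break
        means = new_means
    E = sum(np.sum((X[labels == i] - means[i]) ** 2) for i in range(k))  # squared error
    return labels, means, E

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.3, 7.9], [0.9, 1.1]])
labels, means, E = kmeans(X, k=2)
print(labels, E)
```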

Hierarchical Methods

- Agglomerative hierarchical clustering (bottom-up strategy): each object is placed in a separate cluster, and these clusters are then merged until certain termination conditions are satisfied.
- Divisive hierarchical clustering (top-down strategy): start with all objects in one cluster and split clusters until a termination condition is met.

Distances between clusters $C_i$ and $C_j$ (with means $m_i$, $m_j$ and sizes $n_i$, $n_j$):
- Minimum distance: $d_{\min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$
- Maximum distance: $d_{\max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$
- Mean distance: $d_{\mathrm{mean}}(C_i, C_j) = |m_i - m_j|$
- Average distance: $d_{\mathrm{avg}}(C_i, C_j) = \dfrac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$
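A short Python sketch of the four inter-cluster distances (illustrative helper, not from the slides):

```python
import numpy as np

def cluster_distances(Ci, Cj):
    """Minimum, maximum, mean and average distances between two clusters."""
    Ci, Cj = np.asarray(Ci, float), np.asarray(Cj, float)
    pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)   # all |p - p'|
    return {
        "min":  pairwise.min(),                            # d_min
        "max":  pairwise.max(),                            # d_max
        "mean": np.linalg.norm(Ci.mean(0) - Cj.mean(0)),   # |m_i - m_j|
        "avg":  pairwise.mean(),                           # (1 / (n_i n_j)) * sum of |p - p'|
    }

print(cluster_distances([[0, 0], [1, 0]], [[4, 0], [6, 0]]))
```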

Cluster: $K_m = \{t_{m1}, \ldots, t_{mN}\}$
- Centroid: $C_m = \dfrac{\sum_{i=1}^{N} t_{mi}}{N}$
- Radius: $R_m = \sqrt{\dfrac{\sum_{i=1}^{N} (t_{mi} - C_m)^2}{N}}$
- Diameter: $D_m = \sqrt{\dfrac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{mi} - t_{mj})^2}{N(N-1)}}$

Distance Between Clusters
- Single link (smallest distance): $Dis(K_i, K_j) = \min\{Dis(t_i, t_j) : t_i \in K_i,\, t_j \in K_j\}$
- Complete link (largest distance): $Dis(K_i, K_j) = \max\{Dis(t_i, t_j) : t_i \in K_i,\, t_j \in K_j\}$
- Average: $Dis(K_i, K_j) = \mathrm{mean}\{Dis(t_i, t_j) : t_i \in K_i,\, t_j \in K_j\}$
- Centroid distance: $Dis(K_i, K_j) = Dis(C_i, C_j)$
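A minimal sketch of the centroid, radius and diameter formulas, interpreting the squared terms as squared Euclidean norms (the function names are illustrative):

```python
import numpy as np

def centroid(K):
    K = np.asarray(K, float)
    return K.mean(axis=0)                                         # C_m

def radius(K):
    K = np.asarray(K, float)
    return np.sqrt(np.sum((K - centroid(K)) ** 2) / len(K))       # R_m

def diameter(K):
    K = np.asarray(K, float)
    N = len(K)
    pair = np.sum((K[:, None, :] - K[None, :, :]) ** 2)           # sum over all pairs i, j
    return np.sqrt(pair / (N * (N - 1)))                          # D_m

K = [[0, 0], [2, 0], [1, 2]]
print(centroid(K), radius(K), diameter(K))
```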

Hierarchical Algorithms: Single Link Technique

The single link technique finds maximal connected components in a graph: for a given distance threshold, connect every pair of points whose distance does not exceed the threshold and take the connected components as clusters. Raising the threshold step by step merges the components, which is visualized as a dendrogram with threshold levels on the vertical axis.

(Slide figure: threshold graphs for points A-E at increasing thresholds, and the resulting dendrogram.)

Complete Link Technique

The complete link technique looks for cliques: maximal subgraphs in which there is an edge between any two vertices. Two clusters are merged only when every pair of points across them lies within the threshold.

(Slide figure: threshold graphs and the resulting dendrogram for points A-E.)

The clustering sequence recorded on the slide, written as (threshold, clusters):

(5, {EABCD})
(3, {EAB}, {DC})
(1, {AB}, {DC}, {E})
(0, {E}, {A}, {B}, {C}, {D})
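Single and complete link dendrograms can be built with SciPy; a minimal sketch (the library choice, the 1-D sample points and the cut threshold are my own, for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# five 1-D points standing in for A..E
X = np.array([[0.0], [1.0], [3.0], [4.0], [9.0]])

single   = linkage(X, method="single")     # merge on the smallest pairwise distance
complete = linkage(X, method="complete")   # merge only when all cross-pairs are close

# cut each tree at a distance threshold to obtain flat clusters
print(fcluster(single,   t=2.0, criterion="distance"))   # e.g. [1 1 1 1 2]
print(fcluster(complete, t=2.0, criterion="distance"))   # e.g. [1 1 2 2 3]
```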

Partitioning Algorithms: Minimum Spanning Tree (MST)

Given: n points, k clusters.

Algorithm:
- Start with the complete graph.
- Remove the largest inconsistent edge: an edge whose weight is much larger than the average weight of its adjacent edges.
- Repeat until the desired number of clusters is obtained.

(Slide figure: a graph with edge weights 5, 6, 10, 12 and 80; the edge of weight 80 is inconsistent and is removed first.)
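A simplified sketch of MST-based partitioning using SciPy; instead of the slide's "inconsistent edge" test it just cuts the globally heaviest MST edges, which is an assumption on my part:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_clusters(X, k):
    """Cut the k-1 heaviest MST edges to obtain k clusters."""
    D = squareform(pdist(X))                          # complete graph of pairwise distances
    mst = minimum_spanning_tree(D).toarray()
    if k > 1:
        for w in np.sort(mst[mst > 0])[-(k - 1):]:    # remove the k-1 heaviest edges
            mst[mst == w] = 0
    _, labels = connected_components(mst, directed=False)
    return labels

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]])
print(mst_clusters(X, k=2))   # e.g. [0 0 0 1 1]
```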

Squared Error

Cluster: $K_i = \{t_{i1}, \ldots, t_{iN}\}$ with center $C_i$.

Squared error of a cluster: $SE_{K_i} = \sum_{j=1}^{N} \|t_{ij} - C_i\|^2$

Squared error of a collection of clusters $K = \{K_1, \ldots, K_k\}$: $SE_K = \sum_{i=1}^{k} SE_{K_i}$

Given: k (the number of clusters) and a threshold.

Algorithm:
- Choose k points at random (called centers).
- Repeat:
  - assign each item to the cluster whose center is closest,
  - calculate the new center of each cluster,
  - calculate the squared error,
- until the difference between the old and the new squared error is below the specified threshold.
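A short sketch of the squared-error computation itself (the helper name and sample values are illustrative):

```python
import numpy as np

def squared_error(X, labels, centers):
    """SE_K: sum over clusters of the squared distances to the cluster center."""
    X, centers = np.asarray(X, float), np.asarray(centers, float)
    return sum(np.sum((X[labels == i] - centers[i]) ** 2)   # SE_{K_i}
               for i in range(len(centers)))

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
centers = np.array([[0.5, 0.0], [10.0, 10.0]])
print(squared_error(X, labels, centers))   # 0.25 + 0.25 + 0.0 = 0.5
```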

CURE (Clustering Using Representatives)

Idea: handle clusters of different shapes.

Algorithm:
- A constant number of points is chosen from each cluster as its representatives.
- These points are shrunk toward the cluster's centroid.
- The two clusters with the closest pair of representative points are merged.

(Slide figure: representative points being shrunk toward the cluster center.)
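A rough sketch of one CURE step, choosing scattered representatives and shrinking them toward the centroid; the selection heuristic, c and alpha are my own illustrative choices, not values prescribed by the slides:

```python
import numpy as np

def cure_representatives(cluster, c=4, alpha=0.3):
    """Pick c scattered points from a cluster and shrink them toward the centroid."""
    cluster = np.asarray(cluster, float)
    centroid = cluster.mean(axis=0)
    # start with the point farthest from the centroid
    reps = [cluster[np.argmax(np.linalg.norm(cluster - centroid, axis=1))]]
    while len(reps) < min(c, len(cluster)):
        # next representative: the point farthest from those already chosen
        d = np.min([np.linalg.norm(cluster - r, axis=1) for r in reps], axis=0)
        reps.append(cluster[np.argmax(d)])
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)   # shrink toward the centroid

cluster = np.array([[0, 0], [0, 2], [2, 0], [2, 2], [1, 1]])
print(cure_representatives(cluster, c=3, alpha=0.3))
```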

Examples related to clustering