Presentation transcript:

1 Partitioning Algorithms: Basic Concepts
- Partition n objects into k clusters
- Optimize the chosen partitioning criterion
- Example: minimize the squared error
  - Squared error of a cluster: Error(C_i) = Σ_{p ∈ C_i} |d(p, m_i)|^2, where m_i is the mean (centroid) of C_i
  - Squared error of a clustering: Error = Σ_{i=1..k} Error(C_i)

2 Example: Squared Error of Cluster C_i = {P1, P2, P3}
- P1 = (3, 7), P2 = (2, 3), P3 = (7, 5)
- m_i = (4, 5)
- |d(P1, m_i)|^2 = (3-4)^2 + (7-5)^2 = 5
- |d(P2, m_i)|^2 = (2-4)^2 + (3-5)^2 = 8
- |d(P3, m_i)|^2 = (7-4)^2 + (5-5)^2 = 9
- Error(C_i) = 5 + 8 + 9 = 22
(Figure: the points P1, P2, P3 plotted around the centroid m_i)

3 Example: Squared Error of Cluster C_j = {P4, P5, P6}
- P4 = (4, 6), P5 = (5, 5), P6 = (3, 4)
- m_j = (4, 5)
- |d(P4, m_j)|^2 = (4-4)^2 + (6-5)^2 = 1
- |d(P5, m_j)|^2 = (5-4)^2 + (5-5)^2 = 1
- |d(P6, m_j)|^2 = (3-4)^2 + (4-5)^2 = 2
- Error(C_j) = 1 + 1 + 2 = 4
(Figure: the points P4, P5, P6 plotted around the centroid m_j)
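The two worked examples above can be checked with a few lines of NumPy; this is a minimal sketch added for illustration (the helper name squared_error is ours, not from the slides).

```python
import numpy as np

def squared_error(cluster):
    """Squared error of one cluster: sum of squared Euclidean distances
    from each point to the cluster centroid (mean)."""
    pts = np.asarray(cluster, dtype=float)
    centroid = pts.mean(axis=0)
    return ((pts - centroid) ** 2).sum(), centroid

C_i = [(3, 7), (2, 3), (7, 5)]   # slide 2
C_j = [(4, 6), (5, 5), (3, 4)]   # slide 3

err_i, m_i = squared_error(C_i)
err_j, m_j = squared_error(C_j)
print(m_i, err_i)        # [4. 5.] 22.0
print(m_j, err_j)        # [4. 5.] 4.0
print(err_i + err_j)     # squared error of the whole clustering
```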

4 Partitioning Algorithms: Basic Concepts
- Global optimum: examine all possible partitions; there are on the order of k^n possible partitions, too expensive!
- Heuristic methods: k-means and k-medoids
  - k-means (MacQueen'67): each cluster is represented by the center (mean) of the cluster
  - k-medoids (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects (the medoid) in the cluster

5 K-means
- Initialization: arbitrarily choose k objects as the initial cluster centers (centroids)
- Iterate until no change:
  - For each object O_i:
    - Calculate the distances between O_i and the k centroids
    - (Re)assign O_i to the cluster whose centroid is closest to O_i
  - Update the cluster centroids based on the current assignment
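A minimal NumPy sketch of the loop above, added for illustration; the function name and parameters are ours, and corner cases such as empty clusters or ties are ignored.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means: assign each object to its nearest centroid, then
    recompute every centroid as the mean of the objects assigned to it."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialization: arbitrarily choose k objects as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Distances between every object and the k centroids
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                      # (re)assign to the closest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):      # no change -> stop
            break
        centroids = new_centroids
    return labels, centroids
```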

6 k-Means Clustering Method
(Figure: cluster means marked in the current clusters; objects are relocated to the nearest mean, yielding new clusters)

7 Example
- For simplicity, 1-dimensional objects and k = 2
- Objects: 1, 2, 5, 6, 7
- K-means:
  - Randomly select 5 and 6 as the initial centroids
  - => two clusters {1, 2, 5} and {6, 7}; mean_C1 = 8/3, mean_C2 = 6.5
  - => {1, 2} and {5, 6, 7}; mean_C1 = 1.5, mean_C2 = 6 => no change
  - Aggregate dissimilarity (squared error) = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5
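The numbers in this example can be reproduced with scikit-learn, assuming it is installed; the explicit init array below forces the initial centroids 5 and 6 used on the slide.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])
km = KMeans(n_clusters=2, init=np.array([[5.0], [6.0]]), n_init=1).fit(X)
print(km.labels_)           # e.g. [0 0 1 1 1] -> clusters {1, 2} and {5, 6, 7}
print(km.cluster_centers_)  # [[1.5], [6.0]]
print(km.inertia_)          # 2.5 (aggregate squared error)
```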

8 Variations of the k-Means Method
- Aspects in which variants of k-means differ:
  - Selection of the initial k centroids, e.g., choose the k farthest points
  - Dissimilarity calculations, e.g., use Manhattan distance
  - Strategies to calculate cluster means, e.g., update the means incrementally
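As an illustration of the first variation, here is a small sketch, not from the slides, of farthest-point initialization: start from an arbitrary object and repeatedly add the object farthest from the centroids chosen so far.

```python
import numpy as np

def farthest_point_init(X, k, seed=0):
    """Pick k spread-out initial centroids: the first at random, each
    subsequent one maximizing its distance to the already-chosen ones."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    chosen = [int(rng.integers(len(X)))]
    for _ in range(k - 1):
        # distance of every object to its nearest already-chosen centroid
        d = np.linalg.norm(X[:, None, :] - X[chosen][None, :, :], axis=2).min(axis=1)
        chosen.append(int(d.argmax()))
    return X[chosen]
```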

9 Strengths of the k-Means Method
- Relatively efficient for large datasets: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n
- Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms

10 Weaknesses of the k-Means Method
- Applicable only when the mean is defined; what about categorical data? → k-modes algorithm
- Unable to handle noisy data and outliers → k-medoids algorithms
- Need to specify k, the number of clusters, in advance → hierarchical algorithms, density-based algorithms

11 k-modes Algorithm
- Handling categorical data: k-modes (Huang'98)
  - Replaces the means of clusters with modes
    - Given n records in a cluster, the mode is the record made up of the most frequent attribute values
    - In the example cluster (table not shown here), mode = (<=30, medium, yes, fair)
  - Uses new dissimilarity measures to deal with categorical objects
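A tiny sketch of the mode computation on a hypothetical cluster of categorical records; the attribute values below are made up to mirror the example on the slide.

```python
from collections import Counter

# Hypothetical cluster of categorical records (the real example table is not shown on the slide)
cluster = [
    ("<=30", "medium", "yes", "fair"),
    ("<=30", "high",   "yes", "fair"),
    ("31..40", "medium", "yes", "excellent"),
    ("<=30", "medium", "no",  "fair"),
]

# The mode is the record built from the most frequent value of each attribute
mode = tuple(Counter(values).most_common(1)[0][0] for values in zip(*cluster))
print(mode)   # ('<=30', 'medium', 'yes', 'fair')
```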

12 A Problem of K-means
- Sensitive to outliers
  - Outlier: an object with extremely large (or small) values
  - Outliers may substantially distort the distribution of the data
(Figure: cluster means marked with '+'; an outlier pulls one of the means away from its cluster)

13 k-Medoids Clustering Method
- k-medoids: find k representative objects, called medoids
  - PAM (Partitioning Around Medoids, 1987)
  - CLARA (Kaufmann & Rousseeuw, 1990)
  - CLARANS (Ng & Han, 1994): randomized sampling
(Figure: the same data set partitioned by k-means vs. k-medoids)

14 PAM (Partitioning Around Medoids) (1987)
- PAM (Kaufman and Rousseeuw, 1987)
  - Arbitrarily choose k objects as the initial medoids
  - Until no change, do:
    - (Re)assign each object to the cluster with the nearest medoid
    - Improve the quality of the k medoids (randomly select a non-medoid object O_random and compute the total cost of swapping a medoid with O_random)
- Works for small data sets (e.g., 100 objects in 5 clusters)
- Not efficient for medium and large data sets

15 Swapping Cost
- For each pair of a medoid m and a non-medoid object h, measure whether h would be a better medoid than m
- Use the squared-error criterion:
  - Compute E_h - E_m
  - Negative: the swap brings a benefit
- Choose the swap with the minimum swapping cost

16 Four Swapping Cases
- When a medoid m is to be swapped with a non-medoid object h, check each of the other non-medoid objects j
  - j is currently in the cluster of m → j must be reassigned
    - Case 1: j is closer to some other medoid k than to h; after swapping m and h, j relocates to the cluster represented by k
    - Case 2: j is closer to h than to any other medoid k; after swapping m and h, j joins the cluster represented by h
  - j is currently in the cluster of some other medoid k, not m → compare k with h
    - Case 3: j is closer to k than to h; after swapping m and h, j remains in the cluster represented by k
    - Case 4: j is closer to h than to k; after swapping m and h, j moves to the cluster represented by h

17 PAM Clustering: Total Swapping Cost
- Total cost of swapping medoid m with non-medoid h: TC_mh = Σ_j C_jmh, where C_jmh is the change in cost contributed by object j
(Figure: the four cases of slide 16; for example, in Case 4, C_jmh = d(j, h) - d(j, k) < 0)
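Putting slides 14 to 17 together, here is a compact sketch, not from the slides, of the PAM loop; the names are ours, and the swapping cost TC_mh is computed directly as the change in total cost rather than case by case.

```python
import numpy as np
from itertools import product

def total_cost(D, medoids):
    """Sum of distances from every object to its nearest medoid."""
    dist = np.linalg.norm(D[:, None, :] - D[medoids][None, :, :], axis=2)
    return dist.min(axis=1).sum()

def pam(D, k, seed=0):
    """PAM sketch: start from arbitrary medoids and, while it helps,
    perform the swap (medoid m, non-medoid h) with the most negative
    swapping cost TC_mh = cost(after swap) - cost(before swap)."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    medoids = list(rng.choice(len(D), size=k, replace=False))
    current_cost = total_cost(D, medoids)
    while True:
        best_swap, best_cost = None, current_cost
        for m, h in product(medoids, range(len(D))):
            if h in medoids:
                continue
            candidate = [h if x == m else x for x in medoids]
            c = total_cost(D, candidate)
            if c < best_cost:                  # TC_mh = c - current_cost < 0
                best_swap, best_cost = (m, h), c
        if best_swap is None:                  # no beneficial swap -> stop
            break
        m, h = best_swap
        medoids = [h if x == m else x for x in medoids]
        current_cost = best_cost
    labels = np.linalg.norm(D[:, None, :] - D[medoids][None, :, :], axis=2).argmin(axis=1)
    return medoids, labels
```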

18 Complexity of PAM
- Arbitrarily choose k objects as the initial medoids: O(1)
- Until no change, do (each iteration costs O((n-k)^2 * k)):
  - (Re)assign each object to the cluster with the nearest medoid: O((n-k)*k)
  - Improve the quality of the k medoids: O((n-k)^2 * k)
    - For each of the (n-k)*k pairs of a medoid m and a non-medoid object h:
      - Calculate the swapping cost TC_mh = Σ_j C_jmh: O(n-k)

19 Strengths and Weaknesses of PAM
- PAM is more robust than k-means in the presence of outliers, because a medoid is less influenced by outliers or other extreme values than a mean
- PAM works efficiently for small data sets but does not scale well to large data sets
  - O(k(n-k)^2) per iteration, where n is the number of data objects and k the number of clusters
- Can we find the medoids faster?

20 CLARA (Clustering Large Applications) (1990)
- CLARA (Kaufmann and Rousseeuw, 1990)
  - Built into statistical analysis packages such as S+
- Draws multiple samples of the data set, applies PAM to each sample, and returns the best clustering as the output
- Handles larger data sets than PAM (e.g., 1,000 objects in 10 clusters)
- Efficiency and effectiveness depend on the sampling

21 CLARA - Algorithm
- Set mincost to MAXIMUM;
- Repeat q times:                      // draw q samples
  - Create S by drawing s objects randomly from D;
  - Generate the set of medoids M from S by applying the PAM algorithm;
  - Compute cost(M, D);
  - If cost(M, D) < mincost:
    - mincost = cost(M, D);
    - bestset = M;
  - Endif;
- Endrepeat;
- Return bestset;
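A minimal sketch of the CLARA loop above, added for illustration; it takes the PAM routine as a parameter (for instance the pam sketch shown after slide 17), and the defaults for the number of samples q and the sample size s are ours.

```python
import numpy as np

def clustering_cost(D, medoids):
    """cost(M, D): total distance from every object in D to its nearest medoid."""
    dist = np.linalg.norm(D[:, None, :] - D[medoids][None, :, :], axis=2)
    return dist.min(axis=1).sum()

def clara(D, k, pam, q=5, s=40, seed=0):
    """Draw q random samples of size s, run PAM on each sample, and keep
    the medoid set that is cheapest on the *whole* data set D."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    mincost, bestset = np.inf, None
    for _ in range(q):
        sample_idx = rng.choice(len(D), size=min(s, len(D)), replace=False)
        sample_medoids, _ = pam(D[sample_idx], k)               # medoid indices within the sample
        medoids = [int(sample_idx[i]) for i in sample_medoids]  # map back to indices in D
        cost = clustering_cost(D, medoids)
        if cost < mincost:
            mincost, bestset = cost, medoids
    return bestset, mincost
```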

22 Complexity of CLARA
- Set mincost to MAXIMUM: O(1)
- Each of the q repetitions costs O((s-k)^2 * k + (n-k)*k):
  - Create S by drawing s objects randomly from D
  - Generate the set of medoids M from S by applying the PAM algorithm: O((s-k)^2 * k)
  - Compute cost(M, D): O((n-k)*k)
  - Update mincost and bestset if cost(M, D) < mincost: O(1)
- Return bestset

23 Strengths and Weaknesses of CLARA
- Strength: handles larger data sets than PAM (e.g., 1,000 objects in 10 clusters)
- Weaknesses:
  - Efficiency depends on the sample size
  - A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the samples are biased

24 CLARANS ("Randomized" CLARA) (1994)
- CLARANS (A Clustering Algorithm based on RANdomized Search) (Ng and Han'94)
- CLARANS draws samples in the solution space dynamically
- A solution is a set of k medoids
- The solution space contains C(n, k) (i.e., "n choose k") solutions in total
- The solution space can be represented by a graph where every node is a potential solution, i.e., a set of k medoids

25 Graph Abstraction
- Every node is a potential solution (a set of k medoids)
- Every node is associated with a squared error
- Two nodes are adjacent if they differ by exactly one medoid
- Every node has k(n-k) adjacent nodes: replacing one fixed medoid gives n-k neighbors, and there are k medoids that can be replaced
(Figure: node {O_1, O_2, ..., O_k} and the neighbors obtained by replacing O_1 with each of O_{k+1}, ..., O_n)

26 Graph Abstraction: CLARANS
- Start with a randomly selected node and check at most m neighbors, chosen at random
- If a better adjacent node is found, move to that node and continue; otherwise, the current node is a local optimum, so restart from another randomly selected node to search for another local optimum
- When h local optima have been found, return the best result as the overall result

27 CLARANS
(Figure: from a current node C, at most maxneighbor randomly chosen neighbors N are compared; each descent ends in a local minimum, the search is restarted numlocal times, and the best node found is returned)

28 CLARANS - Algorithm
- Set mincost to MAXIMUM;
- For i = 1 to h do:                   // find h local optima
  - Randomly select a node as the current node C in the graph;
  - J = 1;                             // counter of examined neighbors
  - Repeat:
    - Randomly select a neighbor N of C;
    - If Cost(N, D) < Cost(C, D):
      - Assign N as the current node C;
      - J = 1;
    - Else:
      - J = J + 1;
    - Endif;
  - Until J > m;
  - Update mincost and bestnode with Cost(C, D) if applicable;
- End for;
- Return bestnode;
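A small sketch of the CLARANS search, added for illustration; numlocal and maxneighbor play the roles of h and m on the previous slides, the cost is the total distance to the nearest medoid, and the names are ours.

```python
import numpy as np

def cost(D, medoids):
    """Total distance from every object to its nearest medoid."""
    dist = np.linalg.norm(D[:, None, :] - D[medoids][None, :, :], axis=2)
    return dist.min(axis=1).sum()

def clarans(D, k, numlocal=2, maxneighbor=50, seed=0):
    """Randomized search over the graph of k-medoid solutions: from a random
    node, move to a cheaper randomly chosen neighbor; give up on a node after
    maxneighbor unsuccessful tries (local optimum), restart numlocal times,
    and return the best local optimum found."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    n = len(D)
    mincost, bestnode = np.inf, None
    for _ in range(numlocal):
        current = list(rng.choice(n, size=k, replace=False))    # random start node
        current_cost = cost(D, current)
        j = 1
        while j <= maxneighbor:
            # a random neighbor differs from the current node by one medoid
            out = int(rng.integers(k))
            candidates = [i for i in range(n) if i not in current]
            neighbor = current.copy()
            neighbor[out] = int(rng.choice(candidates))
            neighbor_cost = cost(D, neighbor)
            if neighbor_cost < current_cost:
                current, current_cost = neighbor, neighbor_cost
                j = 1
            else:
                j += 1
        if current_cost < mincost:                              # local optimum reached
            mincost, bestnode = current_cost, current
    return bestnode, mincost
```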

29 Graph Abstraction (k-means, k-modes, k-medoids)
- Each vertex is a set of k representative objects (means, modes, or medoids)
- Each iteration produces a new set of k representative objects with lower overall dissimilarity
- The iterations correspond to a hill-descent process in a landscape (graph) of vertices

30 Comparison with PAM
- PAM searches for the minimum in the graph (landscape)
  - At each step, all adjacent vertices are examined; the one giving the deepest descent is chosen as the next set of k medoids
  - The search continues until a minimum is reached
  - For large n and k (e.g., n = 1,000, k = 10), examining all k(n-k) adjacent vertices is time consuming; inefficient for large data sets
- CLARANS vs. PAM
  - For large and medium data sets, CLARANS is clearly much more efficient than PAM
  - For small data sets, CLARANS still outperforms PAM significantly

31 When n=80, CLARANS is 5 times faster than PAM, while the cluster quality is the same.

32 Comparison with CLARA
- CLARANS vs. CLARA
  - CLARANS is always able to find clusterings of better quality than those found by CLARA
  - CLARANS may use much more time than CLARA
  - When the time used is the same, CLARANS is still better than CLARA

33

34 Hierarchies of Co-expressed Genes and Coherent Patterns
- The interpretation of co-expressed genes and coherent patterns mainly depends on domain knowledge

35 A Subtle Situation
- To split or not to split? It's a question.
(Figure: group A and a possible split into sub-groups A1 and A2)