1 CLUSTERING ALGORITHMS

• Number of possible clusterings
Let X = {x_1, x_2, …, x_N}.
Question: In how many ways can the N points be assigned into m groups?
Answer: the Stirling number of the second kind,

S(N, m) = (1/m!) · Σ_{i=0}^{m} (−1)^(m−i) · C(m, i) · i^N

• Examples:
S(15, 3) = 2,375,101
S(20, 4) = 45,232,115,901
S(100, 5) ≈ 10^68
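Because these numbers grow explosively with N, exhaustively enumerating all clusterings is hopeless even for tiny data sets. A minimal Python check of the formula above (the helper name stirling2 is ours):

from math import comb, factorial

def stirling2(n, m):
    # Number of ways to partition n points into m non-empty groups
    # (Stirling number of the second kind).
    return sum((-1) ** (m - i) * comb(m, i) * i ** n
               for i in range(m + 1)) // factorial(m)

print(stirling2(15, 3))            # 2375101
print(stirling2(20, 4))            # 45232115901
print(f"{stirling2(100, 5):.1e}")  # ~6.6e+67, i.e., about 10^68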

2 • A way out:
Consider only a small fraction of the clusterings of X and select a "sensible" one among them.
Question 1: Which fraction of the clusterings is considered?
Question 2: What does "sensible" mean?
The answers depend on the specific clustering algorithm and the specific criteria adopted.

3 MAJOR CATEGORIES OF CLUSTERING ALGORITHMS
• Sequential: A single clustering is produced. One or a few sequential passes on the data.
• Hierarchical: A sequence of (nested) clusterings is produced.
  - Agglomerative
    - Matrix theory
    - Graph theory
  - Divisive
  - Combinations of the above (e.g., the Chameleon algorithm)

4 • Cost function optimization: For most cases a single clustering is obtained.
• Hard clustering (each point belongs exclusively to a single cluster):
  - Basic hard clustering algorithms (e.g., k-means)
  - k-medoids algorithms
  - Mixture decomposition
  - Branch and bound
  - Simulated annealing
  - Deterministic annealing
  - Boundary detection
  - Mode seeking
  - Genetic clustering algorithms
• Fuzzy clustering (each point belongs to more than one cluster simultaneously).
• Possibilistic clustering (based on the possibility that a point belongs to a cluster).

5 • Other schemes:
  - Algorithms based on graph theory (e.g., Minimum Spanning Tree, regions of influence, directed trees).
  - Competitive learning algorithms (the basic competitive learning scheme, Kohonen self-organizing maps).
  - Subspace clustering algorithms.
  - Binary morphology clustering algorithms.

6 SEQUENTIAL CLUSTERING ALGORITHMS
• The common traits shared by these algorithms are:
  - One or very few passes on the data are required.
  - The number of clusters is not known a priori, except (possibly) an upper bound, q.
  - The clusters are defined with the aid of:
    - an appropriately defined distance d(x, C) of a point from a cluster, and
    - a threshold Θ associated with this distance.

7 • Basic Sequential Algorithmic Scheme (BSAS)

m = 1  {number of clusters}
C_m = {x_1}
For i = 2 to N
  Find C_k: d(x_i, C_k) = min_{1≤j≤m} d(x_i, C_j)
  If (d(x_i, C_k) > Θ) AND (m < q) then
    m = m + 1
    C_m = {x_i}
  Else
    C_k = C_k ∪ {x_i}
    Where necessary, update representatives (*)
  End {if}
End {for}

(*) When the mean vector m_C is used as the representative of cluster C with n_C elements, the update in the light of a new vector x becomes
m_C^new = (n_C · m_C + x) / (n_C + 1)
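A minimal runnable sketch of BSAS in Python/NumPy, assuming Euclidean distance and mean-vector representatives (the function name bsas and its exact signature are our own choices, not from the slides):

import numpy as np

def bsas(X, theta, q):
    # Basic Sequential Algorithmic Scheme (sketch).
    # X: (N, d) data matrix; theta: dissimilarity threshold; q: max #clusters.
    X = np.asarray(X, dtype=float)
    reps = [X[0].copy()]                  # cluster representatives (mean vectors)
    counts = [1]                          # number of points per cluster
    labels = np.zeros(len(X), dtype=int)
    for i in range(1, len(X)):
        d = [np.linalg.norm(X[i] - r) for r in reps]
        k = int(np.argmin(d))
        if d[k] > theta and len(reps) < q:
            reps.append(X[i].copy())      # x_i starts a new cluster
            counts.append(1)
            labels[i] = len(reps) - 1
        else:
            # running-mean update: m_new = (n*m + x) / (n + 1)
            reps[k] = (counts[k] * reps[k] + X[i]) / (counts[k] + 1)
            counts[k] += 1
            labels[i] = k
    return labels, np.array(reps)

For example, labels, reps = bsas(np.random.rand(100, 2), theta=0.3, q=5) clusters 100 random 2-D points.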

8 • Remarks:
  - The order of presentation of the data plays an important role in the clustering results: different orders may lead to totally different clusterings, in terms of both the number of clusters and the clusters themselves.
  - In BSAS the decision for a vector x is reached prior to the final cluster formation.
  - BSAS performs a single pass on the data. Its complexity is O(N).
  - If clusters are represented by point representatives, compact clusters are favored.

9 • Estimating the number of clusters in the data set:
Let BSAS(Θ) denote the BSAS algorithm with dissimilarity threshold Θ.

For Θ = a to b step c
  Run BSAS(Θ) s times, each time presenting the data in a different order.
  Estimate the number of clusters m_Θ as the most frequent number resulting from the s runs of BSAS(Θ).
Next Θ

Plot m_Θ versus Θ and identify the number of clusters m as the one corresponding to the widest flat region of the graph.
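A sketch of this sweep in Python, reusing the bsas() function from the sketch above (the function name and the inclusive arange sweep are our choices); the returned pairs can then be plotted to locate the widest flat region:

import numpy as np
from collections import Counter

def estimate_num_clusters(X, a, b, c, s, q, seed=0):
    # For each threshold, run BSAS s times on differently ordered data
    # and keep the most frequent resulting cluster count m_theta.
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    thetas, m_thetas = [], []
    for theta in np.arange(a, b + c / 2, c):   # Theta = a to b step c
        runs = []
        for _ in range(s):
            order = rng.permutation(len(X))
            labels, _ = bsas(X[order], theta, q)
            runs.append(int(labels.max()) + 1)
        thetas.append(float(theta))
        m_thetas.append(Counter(runs).most_common(1)[0][0])
    return np.array(thetas), np.array(m_thetas)  # plot m_theta versus theta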

10 • MBSAS, a modification of BSAS
  - In BSAS a decision for a data vector x is reached prior to the final cluster formation, which is determined only after all vectors have been presented to the algorithm.
  - MBSAS deals with this drawback, at the cost of presenting the data twice. MBSAS consists of:
    - A cluster determination phase (first pass on the data), which is the same as BSAS except that no vector is assigned to an already formed cluster. At the end of this phase, each cluster consists of a single element.
    - A pattern classification phase (second pass on the data), where each unassigned vector is assigned to its closest cluster.
• Remarks:
  - In MBSAS, a decision for a vector x during the pattern classification phase is reached taking into account all clusters.
  - MBSAS is sensitive to the order of presentation of the vectors.
  - MBSAS requires two passes on the data. Its complexity is O(N).
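A two-pass MBSAS sketch in the same style (Euclidean distance and mean-vector representatives assumed; representatives are updated only during the second pass, which is one possible reading of the slide):

import numpy as np

def mbsas(X, theta, q):
    X = np.asarray(X, dtype=float)
    reps = [X[0].copy()]
    counts = [1]
    labels = np.full(len(X), -1, dtype=int)
    labels[0] = 0
    # Pass 1: cluster determination -- only create clusters, never assign.
    for i in range(1, len(X)):
        d = min(np.linalg.norm(X[i] - r) for r in reps)
        if d > theta and len(reps) < q:
            reps.append(X[i].copy())
            counts.append(1)
            labels[i] = len(reps) - 1
    # Pass 2: pattern classification of every unassigned vector.
    for i in range(len(X)):
        if labels[i] == -1:
            k = int(np.argmin([np.linalg.norm(X[i] - r) for r in reps]))
            reps[k] = (counts[k] * reps[k] + X[i]) / (counts[k] + 1)
            counts[k] += 1
            labels[i] = k
    return labels, np.array(reps)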

11 • The maxmin algorithm
Let W be the set of all points that have been chosen to form clusters up to the current iteration step. The formation of clusters is carried out as follows:

For each x ∈ X − W, determine d_x = min_{z∈W} d(x, z)
Determine y: d_y = max_{x∈X−W} d_x
If d_y is greater than a prespecified threshold then
  y forms a new cluster
Else
  the cluster determination phase of the algorithm terminates
End {if}

After the formation of the clusters, each unassigned vector is assigned to its closest cluster.
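A sketch of the maxmin cluster-determination phase (we seed W with the first point, a choice the slide leaves open; after the loop, each remaining vector would be assigned to its closest chosen point, e.g., with a pass like MBSAS's second phase):

import numpy as np

def maxmin_seeds(X, threshold):
    X = np.asarray(X, dtype=float)
    W = [0]                               # points chosen to form clusters so far
    while len(W) < len(X):
        # d_x = min_{z in W} d(x, z) for every x in X - W
        d_min = np.array([min(np.linalg.norm(x - X[w]) for w in W) for x in X])
        d_min[W] = -np.inf                # exclude points already in W
        y = int(np.argmax(d_min))         # y attains d_y = max_x d_x
        if d_min[y] > threshold:
            W.append(y)                   # y is far from all chosen points: new cluster
        else:
            break                         # cluster determination terminates
    return W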

12 • Remarks:
  - The maxmin algorithm is more computationally demanding than MBSAS, but it is expected to produce better clustering results.

13 • Refinement stages
The problem of closeness of clusters: in all the above algorithms it may happen that two formed clusters lie very close to each other.

• A simple merging procedure
(A) Find C_i, C_j (i < j) such that d(C_i, C_j) = min_{k,r=1,…,m, k≠r} d(C_k, C_r)
If d(C_i, C_j) ≤ M_1 then  {M_1 is a user-defined threshold}
  Merge C_i, C_j into C_i and eliminate C_j
  If necessary, update the cluster representative of C_i
  Rename the clusters C_{j+1}, …, C_m to C_j, …, C_{m−1}, respectively
  m = m − 1
  Go to (A)
Else
  Stop
End {if}
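A simplified sketch of the merging procedure, taking the distance between two clusters to be the distance between their mean-vector representatives (one common choice; the slide leaves d(C_i, C_j) abstract):

import numpy as np

def merge_close_clusters(reps, counts, M1):
    reps = [np.asarray(r, dtype=float) for r in reps]
    counts = list(counts)
    while len(reps) > 1:
        # (A) find the closest pair C_i, C_j with i < j
        d, i, j = min((np.linalg.norm(reps[i] - reps[j]), i, j)
                      for i in range(len(reps))
                      for j in range(i + 1, len(reps)))
        if d > M1:                        # M1 is the user-defined threshold
            break
        # merge C_j into C_i: weighted mean of the two representatives
        reps[i] = (counts[i] * reps[i] + counts[j] * reps[j]) / (counts[i] + counts[j])
        counts[i] += counts[j]
        del reps[j], counts[j]            # deleting C_j renames the later clusters
    return reps, counts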

14 • The problem of sensitivity to the order of data presentation: a vector x may have been assigned to a cluster C_i at the current stage, but another cluster C_j that lies closer to x may be formed at a later stage.

• A simple reassignment procedure
For i = 1 to N
  Find C_j such that d(x_i, C_j) = min_{k=1,…,m} d(x_i, C_k)
  Set b(i) = j  {b(i) is the index of the cluster that lies closest to x_i}
End {for}
For j = 1 to m
  Set C_j = {x_i ∈ X: b(i) = j}
  If necessary, update representatives
End {for}
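The reassignment pass is essentially one iteration of the k-means assignment-and-update step. A sketch with mean-vector representatives (empty clusters simply keep their old representative here, an edge case the slide does not address):

import numpy as np

def reassign(X, reps):
    X = np.asarray(X, dtype=float)
    reps = np.asarray(reps, dtype=float)
    # b[i] = index of the cluster whose representative lies closest to x_i
    b = np.array([int(np.argmin(np.linalg.norm(reps - x, axis=1))) for x in X])
    # rebuild each C_j and update its representative
    new_reps = np.array([X[b == j].mean(axis=0) if np.any(b == j) else reps[j]
                         for j in range(len(reps))])
    return b, new_reps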

15 • A two-threshold sequential scheme (TTSAS)
  - The formation of the clusters, as well as the assignment of vectors to clusters, is carried out concurrently (like BSAS and unlike MBSAS).
  - Two thresholds Θ_1 and Θ_2 (Θ_1 < Θ_2) are employed.
  - The general idea is the following:

If the distance d(x, C) of x from its closest cluster C is greater than Θ_2 then
  a new cluster represented by x is formed
Else if d(x, C) < Θ_1 then
  x is assigned to C
Else
  the decision is postponed to a later stage
End {if}

The unassigned vectors are presented iteratively to the algorithm until all of them are classified.
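A TTSAS sketch in the same style. The stall rule at the end of each pass (if nothing changed, force the first still-unassigned vector to seed a new cluster) is the usual textbook safeguard against an infinite loop; max_passes is our own extra cap:

import numpy as np

def ttsas(X, theta1, theta2, max_passes=50):
    X = np.asarray(X, dtype=float)
    reps, counts = [X[0].copy()], [1]
    labels = np.full(len(X), -1, dtype=int)
    labels[0] = 0
    for _ in range(max_passes):
        changed = False
        for i in np.flatnonzero(labels == -1):
            d = [np.linalg.norm(X[i] - r) for r in reps]
            k = int(np.argmin(d))
            if d[k] < theta1:                 # clearly close: assign x_i to C_k
                reps[k] = (counts[k] * reps[k] + X[i]) / (counts[k] + 1)
                counts[k] += 1
                labels[i] = k
                changed = True
            elif d[k] > theta2:               # clearly far: x_i forms a new cluster
                reps.append(X[i].copy())
                counts.append(1)
                labels[i] = len(reps) - 1
                changed = True
            # otherwise the decision is postponed to a later pass
        unassigned = np.flatnonzero(labels == -1)
        if unassigned.size == 0:
            break
        if not changed:                       # stall: force a decision
            i = int(unassigned[0])
            reps.append(X[i].copy())
            counts.append(1)
            labels[i] = len(reps) - 1
    return labels, np.array(reps)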

16 • Remarks:
  - In practice, a few passes (≥ 2) over the data set are required.
  - TTSAS is less sensitive to the order of data presentation than BSAS.