Data Clustering Methods

Presentation transcript:

Data Clustering Methods Docent Xiao-Zhi Gao, Department of Electrical Engineering and Automation

Data Clustering Data clustering is used for data organization, data compression, and model construction. Clustering partitions a data set into groups such that the similarity within a group is greater than the similarity between groups. Similarity needs to be defined, typically as a metric of the difference between two input vectors.

Clusters in Data Data need to be normalized into a hypercube beforehand.

Similarity?

Similarity Similarity can be defined in terms of the distance between two vectors in the data space. There are several choices: Euclidean distance (real values), Hamming distance (binary data or symbols), and Manhattan distance (any).

Euclidean Distance The Euclidean distance between two vectors is defined as:
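The formula image is not reproduced in the transcript; for two n-dimensional vectors x and y the standard definition is

\[ d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2 } . \]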

Hamming Distance The Hamming distance is the number of positions at which the corresponding symbols of two vectors differ. For example, the distance between "toned" and "roses" is 3, between "1011101" and "1001001" is 2, and between "2173896" and "2233796" is 3.
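As a small illustration (not part of the original slides), the following Python sketch computes the Hamming distance and reproduces the examples above:

```python
# Hamming distance for equal-length strings or symbol sequences.
def hamming(a, b):
    """Number of positions at which the corresponding symbols differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

# The slide's examples:
assert hamming("toned", "roses") == 3
assert hamming("1011101", "1001001") == 2
assert hamming("2173896", "2233796") == 3
```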

Manhattan Distance The Manhattan distance (city-block distance) is the length of a shortest path connecting the two vectors along axis-parallel segments (taxicab geometry).
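The corresponding formula (not shown in the transcript) is

\[ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert . \]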

K-Means Clustering Method The K-means clustering method partitions a collection of n vectors into c groups G_i, i = 1, 2, ..., c, and finds a cluster center for each group so as to minimize a given dissimilarity measure.

K-Means Clustering Method The dissimilarity measure (cost function) in the K-means clustering method can be calculated using the Euclidean distance:
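The cost function itself is not reproduced in the transcript; with cluster centers c_i and groups G_i it takes the standard form

\[ J = \sum_{i=1}^{c} \sum_{\mathbf{x}_k \in G_i} \lVert \mathbf{x}_k - \mathbf{c}_i \rVert^2 . \]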

K-Means Clustering Method The binary membership matrix U is a c x n matrix defined as follows: x_j belongs to group i if c_i is the closest center among all the centers.
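The definition referred to above (the formula image is not included in the transcript) can be written as

\[ u_{ij} = \begin{cases} 1 & \text{if } \lVert \mathbf{x}_j - \mathbf{c}_i \rVert^2 \le \lVert \mathbf{x}_j - \mathbf{c}_k \rVert^2 \text{ for every } k \ne i, \\ 0 & \text{otherwise.} \end{cases} \]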

K-Means Clustering Method To minimize the cost function J, the optimal center of a group should be the mean of all the vectors in that group:
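The formula image is not reproduced in the transcript; the group mean is

\[ \mathbf{c}_i = \frac{1}{|G_i|} \sum_{\mathbf{x}_k \in G_i} \mathbf{x}_k , \]

where |G_i| is the number of vectors in group G_i.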

K-Means Clustering Method The K-means clustering method is an iterative algorithm for finding the cluster centers: it alternates between assigning each vector to its closest center and recomputing each center as the mean of its group.
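A minimal NumPy sketch of this assign-and-update loop is given below; the function name, defaults, and random initialization are illustrative assumptions rather than the slides' own implementation.

```python
import numpy as np

def kmeans(X, c, max_iter=100, tol=1e-6, seed=0):
    """Partition the rows of X (n x d) into c groups; return centers and labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=c, replace=False)]  # random initial centers
    for _ in range(max_iter):
        # Assignment step: each vector joins the group of its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of the vectors in its group.
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(c)
        ])
        if np.linalg.norm(new_centers - centers) < tol:  # centers stopped moving
            return new_centers, labels
        centers = new_centers
    return centers, labels
```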

K-Means Clustering Method There is no guarantee that it converges to an optimal solution. General optimization methods might be used to minimize the cost function J instead. The performance of the K-means clustering method also depends on the initial cluster centers, so front-end methods should be employed to find good initial centers.

K-Means Clustering Method The K-means clustering method may have problems with clusters of different densities or non-globular shapes. It is also a 'hard' data clustering approach; when data should belong to clusters to different degrees, the fuzzy K-means method can be used.

Clusters of Different Densities

Clusters of Non-globular Shapes

Butterfly Data

Mountain Clustering Method The mountain clustering method (Yager, 1994) approximates clusters based on a density measure of the data. It can be used either as a stand-alone algorithm or for obtaining initial clusters for other data clustering approaches.

Mountain Clustering Method Step 1: Form a grid in the data space; the intersections of the grid lines are considered candidate cluster centers, denoted as a set V. The grid points are not necessarily evenly spaced. A fine grid is needed for accuracy, but it increases the computational burden.

Mountain Clustering Method Step 2: Construct mountain functions representing a data density measure. The height of the mountain function at a candidate point v is:
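The height formula is not reproduced in the transcript; in Yager's formulation it is a sum of Gaussian contributions from the input vectors x_i, with an application-specific width constant σ:

\[ m(\mathbf{v}) = \sum_{i=1}^{n} \exp\!\left( -\frac{\lVert \mathbf{v} - \mathbf{x}_i \rVert^2}{2\sigma^2} \right) . \]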

Mountain Clustering Method Each input vector x contributes to the height of the mountain function at v. The contribution decreases with the distance d(x, v). The mountain function is thus a measure of data density: it is higher where more data points are located nearby.

Mountain Clustering Method Step 3: Select cluster centers and destruct the mountain functions. The points with the largest mountain heights are selected as cluster centers.

Mountain Clustering Method The just-identified center is typically surrounded by input data of high density. Its effect must be eliminated before the next center is selected, so the mountain function is revised by subtracting a scaled Gaussian function:
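The revision formula is not shown in the transcript; it is commonly written as

\[ m_{\text{new}}(\mathbf{v}) = m(\mathbf{v}) - m(\mathbf{c}_1)\, \exp\!\left( -\frac{\lVert \mathbf{v} - \mathbf{c}_1 \rVert^2}{2\beta^2} \right) , \]

where c_1 is the just-identified center and β controls the size of the neighborhood that is suppressed.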

Mountain Functions The width constant σ affects the smoothness of the mountain functions (the slide apparently illustrates this for σ = 0.02, 0.1, and 0.2).

Mountain Destruction Cluster centers are selected and the mountains are destructed sequentially.

Subtractive Clustering The mountain clustering method is simple but becomes time consuming as the dimension of the data grows. Replacing the grid points with the data points themselves yields subtractive clustering (Chiu, 1994). Only the data points are considered as cluster center candidates.

Subtractive Clustering The density measure of a data point is computed from its distances to all the other data points. After a cluster center is selected, the density measure of each remaining data point is revised sequentially:
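The two formulas referred to above are not reproduced in the transcript; in Chiu's formulation the density measure of data point x_i and its revision after the first center x_{c_1} (with density D_{c_1}) has been selected are

\[ D_i = \sum_{j=1}^{n} \exp\!\left( -\frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2}{(r_a/2)^2} \right), \qquad D_i \leftarrow D_i - D_{c_1} \exp\!\left( -\frac{\lVert \mathbf{x}_i - \mathbf{x}_{c_1} \rVert^2}{(r_b/2)^2} \right) , \]

where r_a and r_b (typically r_b ≈ 1.5 r_a) are neighborhood radii.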

Conclusions Three typical off-line data clustering methods have been introduced. They often operate in batch mode. The prototypes found by these data clustering methods characterize the data sets and can be used as 'codebooks'.

An Application Example

Computer Exercises I

Computer Exercises II