K-means clustering CS281B Winter02 Yan Wang and Lihua Lin.


What are clustering algorithms? What is clustering? Clustering is a method by which a large set of data is grouped into clusters, that is, smaller sets of similar data. Example: balls of the same color are clustered into one group. Thus, clustering means grouping data, or dividing a large data set into smaller data sets whose members share some similarity.

What is a clustering algorithm? A clustering algorithm attempts to find natural groups of components (or data) based on some similarity. It also finds the centroid of each group. To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output of a clustering algorithm is essentially a statistical description of the cluster centroids, together with the number of components in each cluster. Definition: the centroid of a cluster is the point whose parameter values are the means of the parameter values of all the points in the cluster.
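
The centroid definition above can be illustrated with a small sketch. The function name `centroid` and the sample points are assumptions made for illustration; they do not come from the slides.

```python
def centroid(points):
    """Return the per-parameter mean of a non-empty list of equal-length points."""
    if not points:
        raise ValueError("the centroid of an empty cluster is undefined")
    n = len(points)
    dims = len(points[0])
    # Average each parameter (coordinate) separately over all points.
    return tuple(sum(p[d] for p in points) / n for d in range(dims))


# Hypothetical cluster of three 2-D points.
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
print(centroid(cluster))  # (3.0, 5.0)
```

Note that the centroid need not coincide with any actual data point; here (3.0, 5.0) is not a member of the cluster.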

What is the common metric for clustering techniques? Generally, the distance between two points is taken as the common metric for assessing the similarity among the components of a population. The most commonly used distance measure is the Euclidean metric, which defines the distance between two points p = (p1, p2, ...) and q = (q1, q2, ...) as: $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
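
As a minimal sketch, the Euclidean metric can be written directly from its definition. The Manhattan distance, which the slides mention later as an alternative measure, is included for comparison. The function names are assumptions for illustration.

```python
import math


def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of parameters")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of parameters")
    return sum(abs(a - b) for a, b in zip(p, q))


print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```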

Uses of clustering algorithms
Engineering sciences: pattern recognition, artificial intelligence, cybernetics, etc. Typical examples to which clustering has been applied include handwritten characters, samples of speech, fingerprints, and pictures.
Life sciences (biology, botany, zoology, entomology, cytology, microbiology): the objects of analysis are life forms such as plants, animals, and insects. The clustering analysis may range from developing complete taxonomies to classifying species into subspecies, which can in turn be subdivided further.
Information, policy and decision sciences: applications of clustering analysis to documents include votes on political issues, surveys of markets, surveys of products, surveys of sales programs, and R&D.

Types of clustering algorithms
The various clustering concepts available can be grouped into two broad categories:
Hierarchical methods (example: the Minimal Spanning Tree method). These are techniques in which the input data are not partitioned into the desired number of classes in a single step. Instead, a series of successive fusions of data is performed until the final number of clusters is obtained.
Nonhierarchical methods (example: the K-means algorithm). These are techniques in which a desired number of clusters is assumed at the start. Points are allocated among clusters so that a particular clustering criterion is optimized. A possible criterion is the minimization of the variability within clusters, as measured by the sum of the variance of each parameter that characterizes a point.
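
The within-cluster variability criterion can be sketched as follows. This version sums the squared deviations from the cluster mean over every parameter and every cluster, which is the variance scaled by cluster size; the slides do not say how the variance is normalized, so that choice, the function name, and the sample data are all assumptions.

```python
def within_cluster_variability(clusters):
    """Sum, over clusters and parameters, of squared deviations from the cluster mean."""
    total = 0.0
    for points in clusters:
        if not points:
            continue  # An empty cluster contributes nothing.
        n = len(points)
        for d in range(len(points[0])):
            mean = sum(p[d] for p in points) / n
            total += sum((p[d] - mean) ** 2 for p in points)
    return total


# Two hypothetical ways of splitting the same four points into two clusters.
tight = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
loose = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
print(within_cluster_variability(tight))  # 4.0
print(within_cluster_variability(loose))  # 100.0
```

Under this criterion the first partition is preferred, because its clusters are far more compact.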

K-Means Clustering Algorithm
Definition: This nonhierarchical method initially takes a number of components of the population equal to the final required number of clusters. These initial components are chosen so that they are mutually farthest apart. Next, it examines each remaining component in the population and assigns it to the cluster at minimum distance. The centroid's position is recalculated every time a component is added to a cluster, and this continues until all the components are grouped into the final required number of clusters.
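
A minimal sketch of the procedure described above, under stated assumptions: the slides do not say how the mutually farthest points are found, so `farthest_seeds` uses a greedy heuristic (start from the first point, then repeatedly add the point farthest from the seeds chosen so far). All function names and the sample data are hypothetical.

```python
import math


def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def farthest_seeds(points, k):
    """Greedy heuristic (an assumption): pick k indices of mutually distant points."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    seeds = [0]
    while len(seeds) < k:
        # Add the point whose nearest already-chosen seed is farthest away.
        best = max(
            (i for i in range(len(points)) if i not in seeds),
            key=lambda i: min(dist(points[i], points[s]) for s in seeds),
        )
        seeds.append(best)
    return seeds


def sequential_kmeans(points, k):
    """Assign each point once, updating the centroid after every addition."""
    seeds = farthest_seeds(points, k)
    centroids = [list(points[s]) for s in seeds]
    counts = [1] * k
    labels = [None] * len(points)
    for c, s in enumerate(seeds):
        labels[s] = c
    for i, p in enumerate(points):
        if labels[i] is not None:
            continue  # Seed points are already assigned.
        c = min(range(k), key=lambda j: dist(p, centroids[j]))
        labels[i] = c
        counts[c] += 1
        # Running-mean update: the centroid moves toward the new member.
        centroids[c] = [m + (x - m) / counts[c] for m, x in zip(centroids[c], p)]
    return labels, [tuple(c) for c in centroids]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = sequential_kmeans(points, 2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because each point is assigned only once, the result of this single pass can depend on the order of the points.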

K-Means Clustering Algorithm

The parameters and options for the k-means algorithm
Initialization: different init methods.
Distance measure: different distance measures can be used (Manhattan distance and Euclidean distance).
Termination: k-means should terminate when no more pixels are changing classes.
Quality: the quality of the results provided by k-means classification.
Parallelism: there are several ways to parallelize the k-means algorithm.
What to do with dead classes: a class is "dead" if no pixels belong to it.
Variants: one-pass, on-the-fly calculation of means.
Number of classes: the number of classes is usually given as an input variable.
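
Several of these options can be seen together in a short sketch of the standard batch (Lloyd-style) iteration. This is not necessarily the authors' implementation. The initial centroids and the distance function are passed in as parameters, iteration stops when no point changes class, and a dead class simply keeps its previous centroid. That dead-class policy, the `max_iter` safeguard, and all names are assumptions; the slides do not prescribe them.

```python
def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def kmeans(points, init_centroids, distance, max_iter=100):
    """Batch k-means with a pluggable distance and initialization."""
    centroids = [tuple(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # Termination: no point changed class.
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
            # Otherwise the class is dead; keep its previous centroid (assumed policy).
    return labels, centroids


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
init = [(0.0, 0.0), (10.0, 10.0), (100.0, 100.0)]  # The third class ends up dead.
labels, centroids = kmeans(points, init, manhattan)
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(1.0, 1.5), (8.5, 8.0), (100.0, 100.0)]
```

Other dead-class policies, such as dropping the class or reseeding it at a distant point, fit the same structure; which one is appropriate depends on the application.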

Comments on the K-means method
Strength of k-means: relatively efficient, O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
Often terminates at a local optimum rather than the global one.
Weaknesses of k-means:
Applicable only when the mean is defined, which leaves open how to handle categorical data.
Need to specify k, the number of clusters, in advance.
Unable to handle noisy data and outliers.
Not suitable for discovering clusters with non-convex shapes.

Direct k-means clustering algorithm

Demo (I) 2 Initial Clusters

Demo (I) 2-means Clustering

Demo (II) – Init Method: Random

Demo (II) – Init Method: Linear

Demo (II) – Init Method: Cube

Demo (II) – Init Method: Statistics

Demo (II) – Init Method: Possibility