Clustering
Slides by Eamonn Keogh (UC Riverside)

Squared Error (Objective Function)
[Figure: scatter of points on a 1 to 10 grid illustrating the squared-error objective.]
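
In standard notation (my reconstruction; the slide presents this graphically), the squared-error objective for clusters $C_1, \dots, C_k$ with centers $\mu_1, \dots, \mu_k$ is

$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

This is the quantity reported as "the objective function" in the later slides.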

Algorithm k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise go to step 3.
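
A minimal NumPy sketch of these five steps (an illustration of the slide's description, not Keogh's code; all names are my own):

```python
import numpy as np

def kmeans(X, k, rng=None):
    """Minimal k-means sketch. X: (N, d) array of objects, k: number of clusters."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Step 2: initialize the k cluster centers at k random data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    while True:
        # Step 3: assign each object to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: if no object changed membership, we are done.
        if np.array_equal(new_labels, labels):
            return centers, labels
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its current members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
```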

K-means Clustering: Step 1
Algorithm: k-means; Distance Metric: Euclidean Distance
[Figure: scatter of points on a 1 to 5 grid with the three initial centers k1, k2, k3.]

K-means Clustering: Step 2
Algorithm: k-means; Distance Metric: Euclidean Distance
[Figure: each point assigned to its nearest of the centers k1, k2, k3.]

K-means Clustering: Step 3
Algorithm: k-means; Distance Metric: Euclidean Distance
[Figure: the centers k1, k2, k3 moved to the means of their assigned points.]

K-means Clustering: Step 4
Algorithm: k-means; Distance Metric: Euclidean Distance
[Figure: points reassigned to the updated centers k1, k2, k3.]

K-means Clustering: Step 5
Algorithm: k-means; Distance Metric: Euclidean Distance
[Figure: the final, converged positions of k1, k2, k3.]

Comments on the K-Means Method
Strengths:
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally k, t << n.
- Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weaknesses:
- Applicable only when a mean is defined (so what about categorical data?).
- Need to specify k, the number of clusters, in advance.
- Unable to handle noisy data and outliers.
- Not suitable for discovering clusters with non-convex shapes.

The K-Medoids Clustering Method
- Find representative objects, called medoids, in clusters.
- PAM (Partitioning Around Medoids, 1987) starts from an initial set of medoids and iteratively replaces one of the medoids with one of the non-medoids if doing so improves the total distance of the resulting clustering.
- PAM works effectively for small data sets but does not scale well to large data sets.
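
A minimal sketch of the PAM swap loop described above (illustrative only; a practical implementation caches swap costs rather than recomputing them):

```python
import numpy as np

def pam(X, k, rng=None):
    """Minimal PAM sketch: swap a medoid for a non-medoid whenever it lowers total distance."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # all pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))

    def total_cost(meds):
        # Each object contributes its distance to its nearest medoid.
        return dist[:, meds].min(axis=1).sum()

    improved = True
    while improved:
        improved = False
        best = total_cost(medoids)
        for i in range(k):                      # try replacing each medoid...
            for h in range(n):                  # ...with each non-medoid
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                cost = total_cost(candidate)
                if cost < best:                 # keep the swap only if it improves
                    medoids, best = candidate, cost
                    improved = True
    return medoids, dist[:, medoids].argmin(axis=1)  # medoid indices and labels
```

The n-by-n distance matrix and the nested swap loop make the quadratic cost explicit, which is why PAM works for small data sets but does not scale to large ones.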

EM Algorithm (Mixture Model)
Initialize K cluster centers.
Iterate between two steps:
- Expectation step: assign points to clusters using Bayes' rule:
$$P(c_j \mid d_i) = \frac{P(d_i \mid c_j)\,P(c_j)}{\sum_k P(d_i \mid c_k)\,P(c_k)}$$
where $P(c_j \mid d_i)$ is the probability that $d_i$ is in class $c_j$, and $P(c_k)$ is the probability of class $c_k$.
- Maximization step: estimate the model parameters (optimization).
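
A compact sketch of these two steps for a mixture of spherical Gaussians (unit variance is assumed for brevity; a full implementation would also re-estimate the covariances, and all names here are illustrative):

```python
import numpy as np

def em_gmm(X, K, iters=25, rng=None):
    """EM sketch for a mixture of spherical Gaussians with fixed unit variance."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    means = X[rng.choice(n, size=K, replace=False)].astype(float)
    priors = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibilities via Bayes' rule, P(c_j | d_i) proportional to P(d_i | c_j) P(c_j).
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        resp = np.exp(-0.5 * sq) * priors           # unnormalized posterior per class
        resp /= resp.sum(axis=1, keepdims=True)     # normalize over the K classes
        # M-step: re-estimate means and class priors from the soft assignments.
        weight = resp.sum(axis=0)
        means = (resp.T @ X) / weight[:, None]
        priors = weight / n
    return means, priors, resp
```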

Iteration 1: the cluster means are randomly assigned.

Iteration 2

Iteration 5

Iteration 25

What happens if the data is streaming…
Nearest Neighbor Clustering (not to be confused with Nearest Neighbor Classification)
- Items are iteratively merged into the existing cluster that is closest.
- Incremental: a threshold, t, is used to determine whether an item is added to an existing cluster or a new cluster is created, as sketched below.
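
A minimal sketch of this incremental scheme, assuming each cluster is summarized by a running mean (the slide does not specify the cluster representation; all names are my own):

```python
import numpy as np

def nn_cluster(stream, t):
    """Incremental nearest-neighbor clustering with threshold t."""
    centers, counts, labels = [], [], []
    for point in stream:
        x = np.asarray(point, dtype=float)
        if centers:
            d = [np.linalg.norm(x - c) for c in centers]
            j = int(np.argmin(d))
            if d[j] <= t:
                # Within threshold: add to the nearest cluster, update its running mean.
                counts[j] += 1
                centers[j] += (x - centers[j]) / counts[j]
                labels.append(j)
                continue
        # No cluster within threshold (or no clusters yet): create a new one.
        centers.append(x.copy())
        counts.append(1)
        labels.append(len(centers) - 1)
    return centers, labels
```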

[Figure: points on a 1 to 10 grid grouped into clusters 1 and 2, each with the threshold t drawn around it.]

New data point arrives… It is within the threshold for cluster 1, so it is added to that cluster and the cluster center is updated.
[Figure: the new point falling inside cluster 1's threshold.]

New data point arrives… It is not within the threshold for cluster 1, so a new cluster is created, and so on…
[Figure: the new point lying outside every existing threshold, starting a new cluster.]
The algorithm is highly order dependent… It is difficult to determine t in advance… (see the example below).
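
Using the nn_cluster() sketch above, three collinear points make the order dependence concrete: the same points with the same threshold are partitioned differently depending on arrival order (coordinates and threshold are illustrative):

```python
p = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]

_, forward = nn_cluster(p, t=1.2)         # labels [0, 0, 1]: {p0, p1} together, p2 alone
_, backward = nn_cluster(p[::-1], t=1.2)  # labels [0, 0, 1]: {p2, p1} together, p0 alone
```

Because each merge drags the running mean toward the new point, the third point in each order lands more than t away from the drifted center and starts its own cluster.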

How can we tell the right number of clusters? In general, this is an unsolved problem. However, there are many approximate methods; in the next few slides we will see an example.
For our example, we will use the familiar katydid/grasshopper dataset. However, in this case we are imagining that we do NOT know the class labels. We are only clustering on the X and Y axis values.
[Figure: scatter of the unlabeled katydid/grasshopper data on a 1 to 10 grid.]

When k = 1, the objective function is 873.0.

When k = 2, the objective function is 173.1.

When k = 3, the objective function is 133.6.

We can plot the objective function values for k = 1 to 6. The abrupt change at k = 2 is highly suggestive of two clusters in the data. This technique for determining the number of clusters is known as "knee finding" or "elbow finding".
[Figure: objective function (0.00E+00 to 1.00E+03) plotted against k = 1 to 6, with a sharp elbow at k = 2.]
Note that the results are not always as clear-cut as in this toy example. A sketch of the procedure follows.
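
The following sketches the knee/elbow procedure, reusing the kmeans() sketch from earlier on synthetic stand-in data (the actual katydid/grasshopper values are not reproduced here; the two-blob data and all names are illustrative assumptions):

```python
import numpy as np

def sse(X, centers, labels):
    # Squared-error objective: sum of squared distances to assigned centers.
    return sum(((X[labels == j] - c) ** 2).sum() for j, c in enumerate(centers))

# Two well-separated synthetic blobs standing in for the katydid/grasshopper data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal((2.0, 2.0), 0.4, (50, 2)),
               rng.normal((7.0, 7.0), 0.4, (50, 2))])

for k in range(1, 7):
    centers, labels = kmeans(X, k)  # the kmeans() sketch defined earlier
    print(k, round(float(sse(X, centers, labels)), 1))
# The printed objective drops sharply from k = 1 to k = 2 and then levels off:
# the "elbow" at k = 2 matches the two clusters actually present.
```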