
Clustering (initial slides by Eamonn Keogh)

What is Clustering? Also called unsupervised learning; it is sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing. Clustering is:
– Organizing data into classes such that there is high intra-class similarity and low inter-class similarity.
– Finding the class labels and the number of classes directly from the data (in contrast to classification).
– More informally, finding natural groupings among objects.

Intuitions behind desirable distance measure properties:
– D(A,B) = D(B,A) (Symmetry). Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
– D(A,A) = 0 (Constancy of Self-Similarity). Otherwise you could claim "Alex looks more like Bob than Bob does."
– D(A,B) = 0 only if A = B (Positivity / Separation). Otherwise there are objects in your world that are different, but you cannot tell them apart.
– D(A,B) ≤ D(A,C) + D(B,C) (Triangular Inequality). Otherwise you could claim "Alex is very like Carl, and Bob is very like Carl, but Alex is very unlike Bob."

Edit Distance Example: How similar are the names "Peter" and "Piotr"? It is possible to transform any string Q into string C using only substitution, insertion, and deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can then be defined as the cost of the cheapest transformation from Q to C. (Note that for now we have ignored the issue of how to find this cheapest transformation.)
Assume the following cost function: Substitution = 1 unit, Insertion = 1 unit, Deletion = 1 unit. Then D(Peter, Piotr) = 3:
Peter → Piter (substitution, i for e) → Pioter (insertion, o) → Piotr (deletion, e).
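A minimal sketch of this computation using the standard dynamic-programming recurrence with unit costs (the function name edit_distance is just for illustration):

```python
def edit_distance(q: str, c: str) -> int:
    """Cost of the cheapest transformation of q into c using
    unit-cost substitution, insertion and deletion."""
    m, n = len(q), len(c)
    # d[i][j] = cheapest cost of turning q[:i] into c[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all i characters
    for j in range(n + 1):
        d[0][j] = j          # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if q[i - 1] == c[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[m][n]

print(edit_distance("Peter", "Piotr"))  # 3, matching the slide
```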

A Demonstration of Hierarchical Clustering using String Edit Distance. The names clustered (shown as the leaves of a dendrogram) are variants of three first names across languages:
– Pedro (Portuguese), Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian alternative), Petr (Czech), Pyotr (Russian)
– Cristovao (Portuguese), Christoph (German), Christophe (French), Cristobal (Spanish), Cristoforo (Italian), Kristoffer (Scandinavian), Krystof (Czech), Christopher (English)
– Miguel (Portuguese), Michalis (Greek), Michael (English), Mick (Irish!)
Since we cannot test all possible trees, we will have to use a heuristic search over the space of possible trees. We could do this (a sketch of the first option follows below):
– Bottom-Up (agglomerative): starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
– Top-Down (divisive): starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
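A minimal sketch of the bottom-up (agglomerative) option, assuming SciPy is available and reusing the edit_distance function from the previous sketch; the shortened name list and the choice of average linkage are illustrative, not taken from the slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

names = ["Peter", "Piotr", "Pedro", "Pierre", "Pietro",
         "Michael", "Miguel", "Mick",
         "Christopher", "Cristobal", "Kristoffer"]

# Pairwise edit distances (edit_distance as defined in the sketch above),
# converted to the condensed form that linkage() expects.
n = len(names)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = edit_distance(names[i], names[j])

Z = linkage(squareform(dist), method="average")   # agglomerative merge tree
tree = dendrogram(Z, labels=names, no_plot=True)  # draw with matplotlib if desired
```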

We can look at the dendrogram to determine the “correct” number of clusters. In this case, the two highly separated subtrees are highly suggestive of two clusters. (Things are rarely this clear cut, unfortunately)

Outliers: one potential use of a dendrogram is to detect outliers. A single isolated branch is suggestive of a data point that is very different from all the others.

Partitional Clustering Nonhierarchical, each instance is placed in exactly one of K nonoverlapping clusters. Since only one set of clusters is output, the user normally has to input the desired number of clusters K.

Squared Error Objective Function
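For reference, the squared-error objective that k-means minimizes can be written as the sum, over all clusters, of the squared distances from each point to its cluster center m_i:

```latex
SSE \;=\; \sum_{i=1}^{k} \sum_{x \in C_i} \left\lVert x - m_i \right\rVert^{2},
\qquad m_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
```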

Algorithm k-means:
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise go to step 3.
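A minimal NumPy sketch of these five steps (Lloyd's algorithm), with random initialization from the data points; the function name and defaults are illustrative:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's k-means on an (N, d) array X; returns (centers, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Steps 1-2: a value for k is given; initialize centers with k random points.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each object to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: exit when no object changed membership in the last iteration.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its current members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```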

K-means Clustering: Steps 1 through 5 (algorithm: k-means; distance metric: Euclidean distance). The figures on these slides show the data points and the three cluster centers k1, k2, k3 as the assignments and the centers are updated over successive iterations.

Comments on the K-Means Method
Strengths:
– Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally k, t << n.
– Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weaknesses:
– Applicable only when a mean is defined; what about categorical data? The distance measurement needs to be extended (see Ahmad & Dey, "A k-mean clustering algorithm for mixed numeric and categorical data," Data & Knowledge Engineering, Nov.).
– Need to specify k, the number of clusters, in advance.
– Unable to handle noisy data and outliers.
– Not suitable for discovering clusters with non-convex shapes.
– Tends to build clusters of equal size.

EM Algorithm

Processing: EM Initialization. Initialization: assign random values to the parameters.

Processing: the E-Step (Expectation). Pretend the parameters are known and assign each data point to a component of the mixture of Gaussians.

Processing: the M-Step (Maximization). Fit each component's parameters to its set of points (mixture of Gaussians).

Iteration 1 The cluster means are randomly assigned

Iteration 2

Iteration 5

Iteration 25

Comments on EM:
– K-means is a special form of EM.
– The EM algorithm maintains probabilistic assignments to clusters, instead of deterministic assignments, and multivariate Gaussian distributions instead of means.
– It does not tend to build clusters of equal size.
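A compact sketch of the E- and M-steps for a mixture of Gaussians, simplified to diagonal covariances for brevity (the slides describe full multivariate Gaussians); all names and defaults are illustrative:

```python
import numpy as np

def em_gmm(X, k, n_iter=25, seed=0):
    """EM for a mixture of k Gaussians with diagonal covariances
    (a simplification of the general case on the slides)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: random parameter values (means from random data points).
    means = X[rng.choice(n, size=k, replace=False)].copy()
    variances = np.full((k, d), X.var(axis=0))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: "pretend we know the parameters" and compute the
        # responsibility of each component for each data point.
        log_p = (-0.5 * (((X[:, None, :] - means) ** 2) / variances
                         + np.log(2 * np.pi * variances)).sum(axis=2)
                 + np.log(weights))
        log_p -= log_p.max(axis=1, keepdims=True)    # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)      # (n, k) responsibilities
        # M-step: re-fit each component's parameters to its (weighted) points.
        nk = resp.sum(axis=0)                        # effective counts
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ (X ** 2)) / nk[:, None] - means ** 2 + 1e-6
    return weights, means, variances
```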

Nearest Neighbor Clustering (not to be confused with Nearest Neighbor Classification):
– Items are iteratively merged into the existing cluster that is closest.
– Incremental: a threshold t is used to determine whether an item is added to an existing cluster or a new cluster is created.
– What happens if the data is streaming? (A sketch of the procedure follows the worked example below.)

(Figure: two initial clusters, 1 and 2, each drawn with a circle of radius t, the threshold.)

A new data point arrives… It is within the threshold for cluster 1, so it is added to that cluster and the cluster center is updated.

Another new data point arrives… It is not within the threshold for cluster 1, so a new cluster is created, and so on. The algorithm is highly order dependent… It is difficult to determine t in advance…
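A minimal sketch of the threshold-based incremental procedure illustrated above, with cluster centers maintained as running means; the function name and the use of Euclidean distance are assumptions for illustration:

```python
import numpy as np

def threshold_clustering(points, t):
    """Assign each arriving point to the nearest existing cluster if its
    distance to that cluster's center is within t; otherwise start a new
    cluster. Order dependent, as the slides note."""
    centers, counts, labels = [], [], []
    for x in points:
        x = np.asarray(x, dtype=float)
        if centers:
            dists = [np.linalg.norm(x - c) for c in centers]
            j = int(np.argmin(dists))
            if dists[j] <= t:
                # Within the threshold: add to cluster j, update its center.
                counts[j] += 1
                centers[j] += (x - centers[j]) / counts[j]
                labels.append(j)
                continue
        # Not within the threshold of any cluster (or no clusters yet): new cluster.
        centers.append(x.copy())
        counts.append(1)
        labels.append(len(centers) - 1)
    return centers, labels
```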

How can we tell the right number of clusters? In general, this is an unsolved problem. However, there are many approximate methods, and in the next few slides we will see an example. For our example, we will use the familiar katydid/grasshopper dataset; in this case, however, we imagine that we do NOT know the class labels and are clustering only on the X and Y axis values.

When k = 1, the objective function is 873.0

When k = 2, the objective function is 173.1

When k = 3, the objective function is 133.6

We can plot the objective function values for k = 1 to 6 (x-axis: k; y-axis: objective function). The abrupt change at k = 2 is highly suggestive of two clusters in the data. This technique for determining the number of clusters is known as "knee finding" or "elbow finding". Note that the results are not always as clear cut as in this toy example.
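A minimal sketch of elbow finding: run k-means for k = 1 to 6 on a toy two-blob dataset, record the objective, and look for the abrupt change (this reuses the kmeans function sketched earlier; the synthetic data is illustrative):

```python
import numpy as np

def sse(X, centers, labels):
    """Squared-error objective for a given clustering."""
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centers))

X = np.random.default_rng(1).normal(size=(60, 2))
X[30:] += 5.0                                  # two well-separated blobs

objective = []
for k in range(1, 7):
    centers, labels = kmeans(X, k)             # kmeans as sketched above
    objective.append(sse(X, np.asarray(centers), labels))

for k, v in zip(range(1, 7), objective):
    print(k, round(v, 1))                      # look for the "knee" at k = 2
```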

Goals: estimate class-conditional densities; estimate posterior probabilities.

Density Estimation: assume p(x) is continuous and the region R is small. Randomly take n samples and let k denote the number of samples that fall inside R. (Figure: the n samples, with a small region R around the point of interest.)

Density Estimation (continued): assume p(x) is continuous and R is small; let k_R denote the number of samples in R.

Density Estimation: what can be controlled, and how? Use the subscript n to take the sample size into account: with V_n the volume of the region and k_n the number of samples falling inside it, the estimate is p_n(x) = (k_n/n)/V_n, and we hope that p_n(x) converges to p(x) as n grows.
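For reference, a standard way to write this estimate and the conditions under which it converges (following the usual nonparametric density-estimation formulation):

```latex
p_n(x) \;=\; \frac{k_n / n}{V_n},
\qquad
\lim_{n\to\infty} V_n = 0,\quad
\lim_{n\to\infty} k_n = \infty,\quad
\lim_{n\to\infty} k_n / n = 0
\;\;\Longrightarrow\;\; p_n(x) \to p(x).
```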

Two Approaches:
– Parzen Windows: control V_n (fix the region's volume and count the samples that fall inside it).
– k_n-Nearest-Neighbor: control k_n (fix the number of samples and let the region's volume grow to capture them).

Two Approaches: Parzen Windows and k_n-Nearest-Neighbor (the corresponding estimators are shown on the slide).

Parzen Windows. (Figure: the hypercube window of unit edge length.)

Window Function. (Figure: a hypercube window of edge length h_n.)


Parzen-Window Estimation: k_n is the number of samples falling inside the hypercube of edge h_n centered at x.
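Written out, the Parzen-window estimate referred to here takes the standard form below, where φ is the window function (for the unit hypercube, φ(u) = 1 when every |u_j| ≤ 1/2 and 0 otherwise):

```latex
p_n(x) \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\,
\varphi\!\left(\frac{x - x_i}{h_n}\right),
\qquad V_n = h_n^{d},
\qquad k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h_n}\right).
```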

Generalization Requirement: set u = x/h_n. The window is not necessarily a hypercube. h_n is an important parameter, and it depends on the sample size.

Interpolation Parameter: as h_n → 0, δ_n(x) approaches a Dirac delta function.

Example Parzen-window estimations for five samples

Convergence Conditions: to assure convergence, i.e., that the mean of the estimate p_n(x) converges to p(x) and that its variance converges to zero, we have additional constraints on the window function and on the volume sequence (in particular, V_n → 0 while n·V_n → ∞).

Illustrations One dimension case:

Illustrations One dimension case:

Illustrations Two dimension case:

Classification Example: smaller window vs. larger window (the figures compare the resulting decision regions).

Choosing the Window Function: V_n must approach zero as n → ∞, but at a rate slower than 1/n (e.g., V_n = V_1/√n). The value of the initial volume V_1 is important. In some cases, a cell volume that is proper for one region may be unsuitable in a different region.

k_n-Nearest-Neighbor Estimation: let the cell volume depend on the training data. To estimate p(x), we can center a cell about x and let it grow until it captures k_n samples, where k_n is some specified function of n (e.g., k_n = √n).
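A minimal sketch of this estimator in one dimension: grow an interval about x until it captures k_n samples and divide k_n/n by the interval's length (the Gaussian test data and the choice k_n = √n are illustrative):

```python
import numpy as np

def knn_density(x, samples, k_n):
    """k_n-NN density estimate at x from 1-D samples:
    p(x) ~= (k_n / n) / V_n, where V_n is the length of the smallest
    interval centered at x that contains k_n samples."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    dists = np.sort(np.abs(samples - x))
    radius = dists[k_n - 1]            # distance to the k_n-th nearest sample
    volume = 2.0 * radius              # interval length in 1-D
    return (k_n / n) / volume

rng = np.random.default_rng(0)
data = rng.normal(size=400)
k_n = int(np.sqrt(len(data)))          # e.g. k_n = sqrt(n)
print(knn_density(0.0, data, k_n))     # should be near the N(0,1) density ~0.399
```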

Example: k_n = 5.

Example

Estimation of A Posteriori Probabilities: P_n(ω_i | x) = ? (Figure: a cell centered at x containing samples from several classes.)


Estimation of A Posteriori Probabilities: the value of V_n or k_n can be determined based on the Parzen-window or the k_n-nearest-neighbor technique.
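Written out: if the cell about x captures k samples in total, of which k_i are labeled ω_i, the resulting posterior estimate is simply the fraction of samples in the cell belonging to ω_i:

```latex
P_n(\omega_i \mid x) \;=\;
\frac{p_n(x, \omega_i)}{\sum_{j} p_n(x, \omega_j)}
\;=\; \frac{k_i}{k}.
```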

Classification: The Nearest-Neighbor Rule. Given a set of labeled prototypes, classify x with the label of its nearest prototype x′.

The Nearest-Neighbor Rule Voronoi Tessellation

Error Rate: the Bayesian (optimum) rule assigns x to the class with the largest posterior P(ω_i | x); its conditional error rate at x is the minimum attainable.

Error Rate (1-NN): suppose the true class of x is ω and x′ is its nearest labeled prototype; the 1-NN rule assigns x the label of x′.

Error Rate (1-NN): as n → ∞, the nearest prototype x′ converges to x, which allows the asymptotic 1-NN error at x to be compared with the Bayes error at x.

Error Bounds (1-NN vs. Bayesian): consider first the most complex classification case, in which every class is equally likely a posteriori.

Error Bounds (1-NN vs. Bayesian), the opposite case: to obtain the upper bound on the asymptotic 1-NN error, one term of the expression is maximized; the sum of squared posteriors over the non-maximal classes is minimized when all of those posteriors take the same value.

Error Bounds (1-NN vs. Bayesian), conclusion: the nearest-neighbor rule is a suboptimal procedure, but its error rate is never worse than twice the Bayes rate.
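For reference, the asymptotic bound this argument leads to (the Cover-Hart result, with P the 1-NN error rate, P* the Bayes rate, and c the number of classes):

```latex
P^{*} \;\le\; P \;\le\; P^{*}\left(2 - \frac{c}{c-1}\,P^{*}\right) \;\le\; 2P^{*}.
```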

Error Bounds

Classification: The k-Nearest-Neighbor Rule (k = 5 in the figure). Assign the pattern to the class that wins the majority vote among its k nearest neighbors.
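A minimal sketch of the k-NN rule with majority voting (k = 5 as in the figure; Euclidean distance and the function name are assumptions for illustration):

```python
import numpy as np
from collections import Counter

def knn_classify(x, prototypes, labels, k=5):
    """Assign x to the class that wins the majority vote among the
    k nearest labeled prototypes (Euclidean distance)."""
    prototypes = np.asarray(prototypes, dtype=float)
    dists = np.linalg.norm(prototypes - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```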

Error Bounds

Computational Complexity: the complexity of the nearest-neighbor algorithm (both in time and space) has received a great deal of analysis.
– It requires O(dn) space to store the n prototypes of a training set; this can be reduced by editing, pruning, or condensing.
– Searching for the nearest neighbor of a d-dimensional test point x takes O(dn) time; this can be reduced using partial distances or search trees.

Partial Distance: use the following fact to discard far-away prototypes early: the partial distance computed over the first r of the d dimensions, D_r(a, b) = (Σ_{k=1..r} (a_k − b_k)²)^{1/2}, is nondecreasing in r, so a candidate prototype can be abandoned as soon as its partial distance exceeds the distance to the best (nearest) prototype found so far.
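A minimal sketch of a linear nearest-neighbor search with partial-distance pruning (squared distances are used so no square roots are needed; the function name is illustrative):

```python
def nn_partial_distance(x, prototypes):
    """Linear nearest-neighbor search that abandons a prototype as soon as
    its partial squared distance exceeds the best full distance so far."""
    best_index, best_dist2 = -1, float("inf")
    for idx, p in enumerate(prototypes):
        partial = 0.0
        for a, b in zip(x, p):
            partial += (a - b) ** 2
            if partial >= best_dist2:   # partial distance only grows with r,
                break                   # so this prototype cannot win: stop early
        else:
            # Examined all d dimensions without exceeding the best: new nearest.
            best_index, best_dist2 = idx, partial
    return best_index, best_dist2
```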

Editing Nearest Neighbor Given a set of points, a Voronoi diagram is a partition of space into regions, within which all points are closer to some particular node than to any other node.

Delaunay Triangulation If two Voronoi regions share a boundary, the nodes of these regions are connected with an edge. Such nodes are called the Voronoi neighbors (or Delaunay neighbors).

The Decision Boundary The circled prototypes are redundant.

The Edited Training Set

Editing: The Voronoi Diagram Approach
1. Compute the Delaunay triangulation for the training set.
2. Visit each node, marking it if all its Delaunay neighbors are of the same class as the current node.
3. Delete all marked nodes, exiting with the remaining ones as the edited training set.
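A minimal sketch of this procedure, assuming SciPy is available; Delaunay neighbors are read off the triangulation's simplices, and the function name is illustrative:

```python
import numpy as np
from scipy.spatial import Delaunay
from collections import defaultdict

def voronoi_edit(X, y):
    """Keep only prototypes that have at least one Delaunay neighbor of a
    different class (those on or near the decision boundary)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    tri = Delaunay(X)                      # step 1: Delaunay triangulation
    neighbors = defaultdict(set)
    for simplex in tri.simplices:          # collect Delaunay neighbors
        for i in simplex:
            neighbors[i].update(j for j in simplex if j != i)
    # Step 2: mark nodes whose neighbors all share the node's class.
    marked = [i for i in range(len(X))
              if all(y[j] == y[i] for j in neighbors[i])]
    # Step 3: delete marked nodes; the rest form the edited training set.
    keep = np.setdiff1d(np.arange(len(X)), marked)
    return X[keep], y[keep]
```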

Spectral Clustering Acknowledgements for subsequent slides to Xiaoli Fern CS 534: Machine Learning /cs534/

Spectral Clustering

How to Create the Graph?

Motivations / Objectives

Graph Terminologies

Graph Cut

Min Cut Objective

Normalized Cut

Optimizing Ncut Objective

Solving Ncut

2-way Normalized Cuts

Creating Bi-Partition Using 2 nd Eigenvector

K-way Partition?

Spectral Clustering