Clustering Algorithms

Clustering Algorithms: Applications, Hierarchical Clustering, k-Means

The Problem of Clustering Given a set of points, with a notion of distance between points, group the points into some number of clusters such that inter-cluster distances are maximized and intra-cluster distances are minimized.

Example: Clustering CD’s Intuitively: music divides into categories, and customers prefer a few categories. But what are categories really? Represent a CD by the customers who bought it. Similar CD’s have similar sets of customers, and vice-versa.

The Space of CD’s Think of a space with one dimension for each customer. Values in a dimension may be 0 or 1 only. A CD’s point in this space is (x1, x2, …, xk), where xi = 1 iff the i-th customer bought the CD.

Space of CD’s – (2) For Amazon, the dimension count is tens of millions. An alternative: use minhashing/LSH to get Jaccard similarity between “close” CD’s. 1 minus Jaccard similarity can serve as a (non-Euclidean) distance.

Example: Clustering Documents Represent a document by a vector (x1, x2, …, xk), where xi = 1 iff the i-th word (in some order) appears in the document. Documents with similar sets of words may be about the same topic.

Example: DNA Sequences Objects are sequences of {C,A,T,G}. Distance between sequences is edit distance: the minimum number of inserts and deletes needed to turn one into the other. Note there is a “distance,” but no convenient space in which points “live.”

Distance Measures A Euclidean space has some number of real-valued dimensions; a Euclidean distance is based on the locations of points in such a space. A non-Euclidean distance is based on properties of points, but not their “location” in a space.

Axioms of a Distance Measure d is a distance measure if it is a function from pairs of objects to real numbers such that: d(x,y) ≥ 0. d(x,y) = 0 iff x = y. d(x,y) = d(y,x). d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality).

Some Euclidean Distances L2 norm: d(x,y) = square root of the sum of the squares of the differences between x and y in each dimension. L1 norm (Manhattan distance): sum of the absolute differences in each dimension, i.e., the distance if you had to travel along coordinates only. Example: x = (5,5), y = (9,8). L2 norm: dist(x,y) = √(4² + 3²) = 5. L1 norm: 4 + 3 = 7.
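
As a quick illustrative sketch (plain Python; the function names are my own, not from the slides), the two norms applied to the example points:

```python
import math

def l2_distance(x, y):
    # Square root of the sum of squared per-dimension differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1_distance(x, y):
    # Sum of absolute per-dimension differences (Manhattan distance).
    return sum(abs(a - b) for a, b in zip(x, y))

x, y = (5, 5), (9, 8)
print(l2_distance(x, y))  # 5.0
print(l1_distance(x, y))  # 7
```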

Non-Euclidean Distances Jaccard distance for sets = 1 minus ratio of sizes of intersection and union. Cosine distance = angle between vectors from the origin to the points in question. Edit distance = number of inserts and deletes to change one string into another.

Jaccard Distance for Sets (Bit-Vectors) Example: p1 = 10111; p2 = 10011. Size of intersection = 3; size of union = 4; Jaccard similarity (not distance) = 3/4. d(x,y) = 1 − (Jaccard similarity) = 1/4.
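
A small sketch of the same computation on bit-vectors (plain Python; the helper names are illustrative):

```python
def jaccard_similarity(p1, p2):
    # Treat bit-vectors as sets: intersection = positions where both bits are 1,
    # union = positions where at least one bit is 1.
    inter = sum(a & b for a, b in zip(p1, p2))
    union = sum(a | b for a, b in zip(p1, p2))
    return inter / union

def jaccard_distance(p1, p2):
    return 1 - jaccard_similarity(p1, p2)

p1 = [1, 0, 1, 1, 1]
p2 = [1, 0, 0, 1, 1]
print(jaccard_similarity(p1, p2))  # 0.75
print(jaccard_distance(p1, p2))    # 0.25
```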

Why J.D. Is a Distance Measure d(x,x) = 0 because x∩x = x∪x. d(x,y) = d(y,x) because union and intersection are symmetric. d(x,y) ≥ 0 because |x∩y| ≤ |x∪y|. d(x,y) ≤ d(x,z) + d(z,y) is trickier: we must show (1 − |x∩z|/|x∪z|) + (1 − |y∩z|/|y∪z|) ≥ 1 − |x∩y|/|x∪y|. Remember: |a∩b|/|a∪b| = probability that minhash(a) = minhash(b), so 1 − |a∩b|/|a∪b| = probability that minhash(a) ≠ minhash(b). Claim: prob[mh(x) ≠ mh(y)] ≤ prob[mh(x) ≠ mh(z)] + prob[mh(z) ≠ mh(y)]. Proof: whenever mh(x) ≠ mh(y), at least one of mh(x) ≠ mh(z) and mh(z) ≠ mh(y) must be true.
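
The argument hinges on the fact that the probability of a minhash collision equals the Jaccard similarity. A hedged Monte Carlo sketch (not part of the original slides) that checks this for the earlier example sets:

```python
import random

def minhash(s, rank):
    # Minhash of a set under a random permutation: its element of smallest rank.
    return min(rank[e] for e in s)

x, y = {0, 2, 3, 4}, {0, 3, 4}   # as bit-vectors: 10111 and 10011; Jaccard similarity = 3/4
universe = list(range(5))
trials, hits = 100_000, 0
for _ in range(trials):
    order = random.sample(universe, len(universe))   # a random permutation of the universe
    rank = {e: r for r, e in enumerate(order)}
    hits += minhash(x, rank) == minhash(y, rank)
print(hits / trials)  # should be close to 0.75
```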

Clustering with JD Example sets: {a,b,d,e}, {d,e,f}, {a,b,c}, {b,c,e,f}. Similarity threshold = 1/3, i.e., distance < 2/3.

Cosine Distance Think of a point as a vector from the origin (0,0,…,0) to its location. Two points’ vectors make an angle, whose cosine is the normalized dot-product of the vectors: p1·p2/(|p1||p2|). Example: p1 = 00111; p2 = 10011. p1·p2 = 2; |p1| = |p2| = √3. cos(θ) = 2/3; θ is about 48 degrees. d(p1, p2) = θ = arccos(p1·p2/(|p1||p2|)).
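
A sketch of the same calculation (plain Python, returning the angle in degrees to match the slide; names are illustrative):

```python
import math

def cosine_distance(p1, p2):
    # Angle between the vectors from the origin to p1 and p2.
    dot = sum(a * b for a, b in zip(p1, p2))
    norm1 = math.sqrt(sum(a * a for a in p1))
    norm2 = math.sqrt(sum(b * b for b in p2))
    return math.degrees(math.acos(dot / (norm1 * norm2)))

p1 = [0, 0, 1, 1, 1]
p2 = [1, 0, 0, 1, 1]
print(cosine_distance(p1, p2))  # about 48.19 degrees
```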

Why C.D. Is a Distance Measure d(x,x) = 0 because arccos(1) = 0. d(x,y) = d(y,x) by symmetry. d(x,y) ≥ 0 because angles are chosen to be in the range 0 to 180 degrees. Triangle inequality: physical reasoning. If I rotate an angle from x to z and then from z to y, I can’t rotate less than from x to y.

Edit Distance The edit distance of two strings is the number of inserts and deletes of characters needed to turn one into the other. Example: x = abcde; y = bcduve. Turn x into y by deleting a, then inserting u and v after d. d(x,y) = 3. Or, use the longest common subsequence: LCS(x,y) = bcde. Note: d(x,y) = |x| + |y| − 2|LCS(x,y)| = 5 + 6 − 2×4 = 3 = edit distance.
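
A sketch using the LCS relation above (insert/delete-only edit distance; plain Python, illustrative names):

```python
def lcs_length(x, y):
    # Classic dynamic program for the length of the longest common subsequence.
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x, 1):
        for j, b in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a == b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def edit_distance(x, y):
    # Inserts/deletes only: d(x,y) = |x| + |y| - 2*|LCS(x,y)|.
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(edit_distance("abcde", "bcduve"))  # 3
```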

Why Edit Distance Is a Distance Measure d(x,x) = 0 because 0 edits suffice. d(x,y) = d(y,x) because insert/delete are inverses of each other. d(x,y) ≥ 0: there is no notion of negative edits. Triangle inequality: changing x to z and then to y is one way to change x to y.

Methods of Clustering Hierarchical (agglomerative): Initially, each point is a cluster by itself. Repeatedly combine the two “nearest” clusters into one. Point assignment: Maintain a set of clusters. Place points into their “nearest” cluster.

Hierarchical Clustering Two important questions: How do you determine the “nearness” of clusters? How do you represent a cluster of more than one point?

Hierarchical Clustering – (2) Key problem: as you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest? Euclidean case: each cluster has a centroid = average of its points. Measure intercluster distances by distances of centroids.
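
A minimal sketch of the Euclidean case (naive quadratic merging, plain Python, illustrative names), run on the example points shown on the next slide:

```python
def centroid(cluster):
    # Average of the cluster's points, dimension by dimension.
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))

def hierarchical_clustering(points, k):
    # Start with each point in its own cluster; repeatedly merge the two
    # clusters whose centroids are closest, until only k clusters remain.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum((a - b) ** 2 for a, b in zip(centroid(clusters[i]), centroid(clusters[j])))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

points = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
print(hierarchical_clustering(points, 2))
```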

Example (figure): points (0,0), (1,2), (2,1), (4,1), (5,0), (5,3); the x marks (1.5,1.5), the centroid of the cluster formed by merging (1,2) and (2,1).

And in the Non-Euclidean Case? The only “locations” we can talk about are the points themselves. I.e., there is no “average” of two points. Approach 1: clustroid = point “closest” to other points. Treat clustroid as if it were centroid, when computing intercluster distances.

“Closest” Point? Possible meanings: Smallest maximum distance to the other points. Smallest average distance to other points. Smallest sum of squares of distances to other points. Etc., etc.
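
One way to code the “smallest sum of squares of distances” option (a sketch; the Jaccard sets below are the ones from the earlier example):

```python
def clustroid(cluster, dist):
    # Clustroid: the member minimizing the sum of squared distances to the others.
    return min(cluster, key=lambda c: sum(dist(c, p) ** 2 for p in cluster))

def jaccard_dist(a, b):
    return 1 - len(a & b) / len(a | b)

cluster = [{"a", "b", "d", "e"}, {"d", "e", "f"}, {"b", "c", "e", "f"}]
print(clustroid(cluster, jaccard_dist))  # {'d', 'e', 'f'}
```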

Example (figure): two clusters, each with its clustroid; the intercluster distance is measured between the two clustroids.

Other Approaches to Defining “Nearness” of Clusters Approach 2: intercluster distance = minimum of the distances between any two points, one from each cluster. Approach 3: Pick a notion of “cohesion” of clusters, e.g., maximum distance from the clustroid. Merge clusters whose union is most cohesive.

k-Means Algorithm(s) Assumes Euclidean space. Start by picking k, the number of clusters. Initialize clusters by picking one point per cluster. Example: pick one point at random, then k−1 other points, each as far away as possible from the previous points.
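
A sketch of that far-apart initialization (plain Python; sq_dist and init_centroids are illustrative names):

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(points, k):
    centroids = [random.choice(points)]          # first centroid: a random point
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest already-chosen centroid.
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids
```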

Populating Clusters For each point, place it in the cluster whose current centroid is nearest. After all points are assigned, fix the centroids of the k clusters. Optional: reassign all points to their closest centroid. This sometimes moves points between clusters.
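
A sketch of the assign/re-center loop (self-contained plain Python; a fixed number of rounds stands in for a convergence test, and all names are illustrative):

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    # Place each point in the cluster whose current centroid is nearest.
    clusters = [[] for _ in centroids]
    for p in points:
        clusters[min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))].append(p)
    return clusters

def recompute_centroids(clusters):
    # Fix each centroid as the average of the points now in its cluster.
    return [tuple(sum(p[d] for p in c) / len(c) for d in range(len(c[0]))) for c in clusters]

points = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
centroids = [(0, 0), (5, 3)]                   # one initial point per cluster
for _ in range(10):                            # reassign and re-center a few times
    clusters = assign(points, centroids)
    centroids = recompute_centroids(clusters)
print(clusters)
```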

Example: Assigning Clusters (figure): clusters after the first round, with a few points reassigned to the other centroid in the next round.

Getting k Right Try different k, looking at the change in the average distance to centroid as k increases. The average falls rapidly until the right k, then changes little. (Figure: average distance to centroid versus k; the best value of k is at the knee of the curve.)
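
A hedged sketch of that heuristic on synthetic data with three true clusters (self-contained plain Python with a small k-means inside; all names are illustrative):

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rounds=20):
    centroids = random.sample(points, k)
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: sq_dist(p, centroids[i]))].append(p)
        centroids = [tuple(sum(p[d] for p in c) / len(c) for d in range(len(c[0])))
                     if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

def avg_dist_to_centroid(points, k):
    centroids, clusters = kmeans(points, k)
    return sum(math.sqrt(sq_dist(p, centroids[i]))
               for i, c in enumerate(clusters) for p in c) / len(points)

# Synthetic data: three well-separated blobs of 30 points each.
random.seed(0)
points = [(random.gauss(cx, 0.5), random.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (8, 0), (0, 8)] for _ in range(30)]
for k in range(1, 7):
    print(k, round(avg_dist_to_centroid(points, k), 2))  # typically falls sharply up to k = 3, then flattens
```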

Example: Picking k. Too few clusters: many long distances to centroid.

Example: Picking k. Just right: distances rather short.

Example: Picking k. Too many clusters: little improvement in average distance.