Clustering: Preliminaries, Applications, Euclidean/Non-Euclidean Spaces, Distance Measures

Similar presentations
Distance Measures Hierarchical Clustering k -Means Algorithms
Similarity and Distance Sketching, Locality Sensitive Hashing
Clustering Francisco Moreno Extractos de Mining of Massive Datasets
High Dimensional Search Min-Hashing Locality Sensitive Hashing
Quadratic & Polynomial Functions
CLUSTERING PROXIMITY MEASURES
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
Data Mining Data quality
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
Clustering Shannon Quinn (with thanks to J. Leskovec, A. Rajaraman, and J. Ullman of Stanford University)
Computational Geometry & Collision detection
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model.
What is Cluster Analysis? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or.
Clustering.
6 6.1 © 2012 Pearson Education, Inc. Orthogonality and Least Squares INNER PRODUCT, LENGTH, AND ORTHOGONALITY.
Extension of DGIM to More Complex Problems
1 Clustering Distance Measures Hierarchical Clustering k -Means Algorithms.
My Favorite Algorithms for Large-Scale Data Mining
CURE Algorithm Clustering Streams
1 Machine Learning: Symbol-based 9c 9.0Introduction 9.1A Framework for Symbol-based Learning 9.2Version Space Search 9.3The ID3 Decision Tree Induction.
Distance Measures LS Families of Hash Functions S-Curves
1 Methods for High Degrees of Similarity Index-Based Methods Exploiting Prefixes and Suffixes Exploiting Length.
1 More Stream-Mining Counting How Many Elements Computing “Moments”
Distance Measures Tan et al. From Chapter 2.
Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.
Cluster Analysis (1).
1 Clustering. 2 The Problem of Clustering uGiven a set of points, with a notion of distance between points, group the points into some number of clusters,
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
1 Near-Neighbor Search Applications Matrix Formulation Minhashing.
Clustering Algorithms
Finding Similar Items.
1 More Clustering CURE Algorithm Non-Euclidean Approaches.
1 Locality-Sensitive Hashing Basic Technique Hamming-LSH Applications.
Memory Aid Help.  b 2 = c 2 - a 2  a 2 = c 2 - b 2  “c” must be the hypotenuse.  In a right triangle that has 30 o and 60 o angles, the longest.
Web search basics (Recap) The Web Web crawler Indexer Search User Indexes Query Engine 1 Ad indexes.
A Conjecture for the Packing Density of 2413 Walter Stromquist Swarthmore College (joint work with Cathleen Battiste Presutti, Ohio University at Lancaster)
Clustering 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 8: Clustering Mining Massive Datasets.
One of the positive or negative numbers 1, 2, 3, etc., or zero. Compare whole numbers.
1 Applications of LSH (Locality-Sensitive Hashing) Entity Resolution Fingerprints Similar News Articles.
Feature selection LING 572 Fei Xia Week 4: 1/29/08 1.
1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing.
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
DATA MINING LECTURE 5 Similarity and Distance Recommender Systems.
A rule that combines two vectors to produce a scalar.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach,
Multiple Sequence Alignment Vasileios Hatzivassiloglou University of Texas at Dallas.
Jeffrey D. Ullman Stanford University. 2  Generalized LSH is based on some kind of “distance” between points.  Similar points are “close.”  Example:
Optimization Indiana University July Geoffrey Fox
Pattern Recognition Mathematic Review Hamid R. Rabiee Jafar Muhammadi Ali Jalali.
1 Objective To provide background material in support of topics in Digital Image Processing that are based on matrices and/or vectors. Review Matrices.
Similarity Measures for Text Document Clustering
Clustering Hierarchical /Agglomerative and Point- Assignment Approaches Measures of “Goodness” for Clusters BFR Algorithm CURE Algorithm Jeffrey D. Ullman.
More LSH Application: Entity Resolution Application: Similar News Articles Distance Measures LS Families of Hash Functions LSH for Cosine Distance Jeffrey.
Similarity and Distance Recommender Systems
More LSH LS Families of Hash Functions LSH for Cosine Distance Special Approaches for High Jaccard Similarity Jeffrey D. Ullman Stanford University.
Clustering CS246: Mining Massive Datasets
Section 7.12: Similarity By: Ralucca Gera, NPS.
Game Programming Algorithms and Techniques
Near Neighbor Search in High Dimensional Data (1)
Presentation transcript:

1 Clustering: Preliminaries, Applications, Euclidean/Non-Euclidean Spaces, Distance Measures

2 The Problem of Clustering
- Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are in some sense as close to each other as possible.

3 Example
[Figure: a scatter of points marked "x" that fall visually into a few natural clusters.]

4 Problems With Clustering
- Clustering in two dimensions looks easy.
- Clustering small amounts of data looks easy.
- And in most cases, looks are not deceiving.

5 The Curse of Dimensionality
- Many applications involve not 2, but 10 or 10,000 dimensions.
- High-dimensional spaces look different: almost all pairs of points are at about the same distance.

6 Example: Curse of Dimensionality
- Assume random points within a bounding box, e.g., values between 0 and 1 in each dimension.
- In 2 dimensions: a variety of distances, ranging between 0 and √2 ≈ 1.41.
- In 10,000 dimensions, the difference in any one dimension is distributed as a triangle.
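A minimal simulation sketch (not part of the original slides) of this concentration effect; the point count, dimensions, and seed below are arbitrary illustrative choices:

```python
import random
import math

def pairwise_distance_spread(dim, n_points=50, seed=0):
    """Generate random points in the unit hypercube and return the
    ratio of the largest to the smallest pairwise Euclidean distance."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = []
    for i in range(n_points):
        for j in range(i + 1, n_points):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])))
            dists.append(d)
    return max(dists) / min(dists)

# In 2 dimensions the spread is large; in 10,000 dimensions the ratio
# approaches 1, i.e., almost all pairs are at about the same distance.
for dim in (2, 10, 10_000):
    print(dim, round(pairwise_distance_spread(dim), 2))
```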

7 Example – Continued
- The law of large numbers applies.
- Actual distance between two random points is the square root of the sum of squares of essentially the same set of differences.
- So that sum is close to its expectation for almost every pair, and almost all pairwise distances come out nearly equal.

8 Example High-Dimension Application: SkyCat
- A catalog of 2 billion "sky objects" represents objects by their radiation in 7 dimensions (frequency bands).
- Problem: cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.
- Sloan Sky Survey is a newer, better version.

9 Example: Clustering CD's (Collaborative Filtering)
- Intuitively: music divides into categories, and customers prefer a few categories.
  - But what are categories really?
- Represent a CD by the customers who bought it.
- Similar CD's have similar sets of customers, and vice-versa.

10 The Space of CD's
- Think of a space with one dimension for each customer.
  - Values in a dimension may be 0 or 1 only.
- A CD's point in this space is (x_1, x_2, …, x_k), where x_i = 1 iff the i-th customer bought the CD.
  - Compare with the boolean matrix: rows = customers; columns = CD's.

11 Space of CD's – (2)
- For Amazon, the dimension count is tens of millions.
- An alternative: use minhashing/LSH to get the Jaccard similarity between "close" CD's.
- 1 minus Jaccard similarity can serve as a (non-Euclidean) distance.
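A toy minhash sketch (not from the slides) of how Jaccard similarity can be estimated from short signatures instead of the full customer dimensions; the item ids, the prime, and the number of hash functions are arbitrary assumptions for illustration:

```python
import random

def minhash_signature(s, hash_funcs):
    """Signature of a set s: the minimum hash value under each hash function."""
    return [min(h(x) for x in s) for h in hash_funcs]

def estimated_jaccard(sig1, sig2):
    """Fraction of agreeing signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

# 100 random linear hash functions over small integer item ids.
rng = random.Random(42)
P = 10_007  # a prime larger than any item id used below
hash_funcs = [
    (lambda a, b: (lambda x: (a * x + b) % P))(rng.randrange(1, P), rng.randrange(P))
    for _ in range(100)
]

cd1 = {1, 2, 3, 4, 5}        # e.g., customers who bought CD 1
cd2 = {3, 4, 5, 6, 7, 8}     # customers who bought CD 2
sim = estimated_jaccard(minhash_signature(cd1, hash_funcs),
                        minhash_signature(cd2, hash_funcs))
print(round(sim, 2), "vs exact", len(cd1 & cd2) / len(cd1 | cd2))  # exact = 0.375
```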

12 Example: Clustering Documents
- Represent a document by a vector (x_1, x_2, …, x_k), where x_i = 1 iff the i-th word (in some order) appears in the document.
  - It actually doesn't matter if k is infinite; i.e., we don't limit the set of words.
- Documents with similar sets of words may be about the same topic.

13 Aside: Cosine, Jaccard, and Euclidean Distances
- As with CD's, we have a choice when we think of documents as sets of words or shingles:
  1. Sets as vectors: measure similarity by the cosine distance.
  2. Sets as sets: measure similarity by the Jaccard distance.
  3. Sets as points: measure similarity by Euclidean distance.

14 Example: DNA Sequences
- Objects are sequences of {C,A,T,G}.
- Distance between sequences is edit distance, the minimum number of inserts and deletes needed to turn one into the other.
- Note there is a "distance," but no convenient space in which points "live."

15 Distance Measures
- Each clustering problem is based on some kind of "distance" between points.
- Two major classes of distance measure:
  1. Euclidean
  2. Non-Euclidean

16 Euclidean vs. Non-Euclidean
- A Euclidean space has some number of real-valued dimensions and "dense" points.
  - There is a notion of the "average" of two points.
  - A Euclidean distance is based on the locations of points in such a space.
- A non-Euclidean distance is based on properties of points, but not their "location" in a space.

17 Axioms of a Distance Measure
- d is a distance measure if it is a function from pairs of points to real numbers such that:
  1. d(x,y) ≥ 0.
  2. d(x,y) = 0 iff x = y.
  3. d(x,y) = d(y,x).
  4. d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality).

18 Some Euclidean Distances
- L_2 norm: d(x,y) = square root of the sum of the squares of the differences between x and y in each dimension.
  - The most common notion of "distance."
- L_1 norm: sum of the absolute differences in each dimension.
  - Manhattan distance = distance if you had to travel along coordinates only.
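In symbols (a standard formulation, not spelled out on the slide), for points x = (x_1, …, x_n) and y = (y_1, …, y_n):

```latex
d_{L_2}(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},
\qquad
d_{L_1}(x,y) = \sum_{i=1}^{n} |x_i - y_i|
```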

19 Examples of Euclidean Distances
- x = (5,5); y = (9,8).
- L_2 norm: dist(x,y) = √(4² + 3²) = 5.
- L_1 norm: dist(x,y) = 4 + 3 = 7.

20 Another Euclidean Distance
- L_∞ norm: d(x,y) = the maximum (in magnitude) of the differences between x and y in any dimension.
- Note: the maximum is the limit, as n goes to ∞, of what you get by taking the n-th power of the differences, summing, and taking the n-th root.
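A small sketch (not from the slides) computing all three distances for the example points x = (5,5) and y = (9,8):

```python
import math

def l1(x, y):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def l2(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l_inf(x, y):
    """L-infinity distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (5, 5), (9, 8)
print(l1(x, y), l2(x, y), l_inf(x, y))  # 7 5.0 4
```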

21 Non-Euclidean Distances
- Jaccard distance for sets = 1 minus the ratio of the sizes of the intersection and union.
- Cosine distance = angle between the vectors from the origin to the points in question.
- Edit distance = number of inserts and deletes to change one string into another.

22 Jaccard Distance for Sets (Bit-Vectors)
- Example: p_1 = 10111; p_2 = 10011.
- Size of intersection = 3; size of union = 4; Jaccard similarity (not distance) = 3/4.
- d(x,y) = 1 – (Jaccard similarity) = 1/4.
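A minimal sketch (not from the slides) of the bit-vector computation above, treating each vector as the set of positions holding a 1:

```python
def jaccard_distance(p1, p2):
    """Jaccard distance between two equal-length 0/1 bit-vectors."""
    inter = sum(1 for a, b in zip(p1, p2) if a == "1" and b == "1")
    union = sum(1 for a, b in zip(p1, p2) if a == "1" or b == "1")
    return 1 - inter / union

print(jaccard_distance("10111", "10011"))  # 0.25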

23 Why J.D. Is a Distance Measure
- d(x,x) = 0 because x ∩ x = x ∪ x.
- d(x,y) = d(y,x) because union and intersection are symmetric.
- d(x,y) ≥ 0 because |x ∩ y| ≤ |x ∪ y|.
- d(x,y) ≤ d(x,z) + d(z,y) is trickier – next slide.

24 Triangle Inequality for J.D.
- Need: (1 – |x ∩ z|/|x ∪ z|) + (1 – |y ∩ z|/|y ∪ z|) ≥ 1 – |x ∩ y|/|x ∪ y|.
- Remember: |a ∩ b|/|a ∪ b| = probability that minhash(a) = minhash(b).
- Thus, 1 – |a ∩ b|/|a ∪ b| = probability that minhash(a) ≠ minhash(b).

25 Triangle Inequality – (2)
- Claim: prob[minhash(x) ≠ minhash(y)] ≤ prob[minhash(x) ≠ minhash(z)] + prob[minhash(z) ≠ minhash(y)].
- Proof: whenever minhash(x) ≠ minhash(y), at least one of minhash(x) ≠ minhash(z) and minhash(z) ≠ minhash(y) must be true.
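The step from the proof sentence to the claim is a union bound; written out (my phrasing, not the slide's), with h denoting minhash:

```latex
\{\,h(x)\neq h(y)\,\} \subseteq \{\,h(x)\neq h(z)\,\} \cup \{\,h(z)\neq h(y)\,\}
\;\Longrightarrow\;
\Pr[h(x)\neq h(y)] \le \Pr[h(x)\neq h(z)] + \Pr[h(z)\neq h(y)]
```

Since each of those probabilities equals the corresponding Jaccard distance, the triangle inequality follows.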

26 Similar Sets and Clustering
- We can use minhashing + LSH to find quickly those pairs of sets with low Jaccard distance.
- We can cluster sets (points) using J.D.
- But we only know some distances – the low ones.
- Thus, clusters are not always connected components.

27 Example: Clustering + J.D.
- Sets: {a,b,c}, {b,c,e,f}, {d,e,f}, {a,b,d,e}.
- Similarity threshold = 1/3; i.e., distance < 2/3.
[Diagram: the four sets, with edges joining the pairs whose Jaccard distance is below the threshold.]
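A short sketch (not from the slides) that computes the pairwise Jaccard distances for these four sets and applies the slide's threshold; it shows that not every pair in the resulting connected component is itself "close":

```python
from itertools import combinations

sets = [{"a", "b", "c"}, {"b", "c", "e", "f"}, {"d", "e", "f"}, {"a", "b", "d", "e"}]

def jaccard_distance(s, t):
    """1 minus the ratio of intersection size to union size."""
    return 1 - len(s & t) / len(s | t)

# Report which pairs are "close" under the slide's threshold (distance < 2/3).
for s, t in combinations(sets, 2):
    d = jaccard_distance(s, t)
    print(sorted(s), sorted(t), round(d, 2), "close" if d < 2 / 3 else "far")
```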

28 Cosine Distance
- Think of a point as a vector from the origin (0,0,…,0) to its location.
- Two points' vectors make an angle, whose cosine is the normalized dot-product of the vectors: p_1·p_2 / (|p_1||p_2|).
  - Example: p_1 = 00111; p_2 = 10011.
  - p_1·p_2 = 2; |p_1| = |p_2| = √3.
  - cos(θ) = 2/3; θ is about 48 degrees.
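A minimal sketch (not from the slides) that computes the angle for the example vectors above:

```python
import math

def cosine_distance_degrees(p1, p2):
    """Angle, in degrees, between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(p1, p2))
    norm1 = math.sqrt(sum(a * a for a in p1))
    norm2 = math.sqrt(sum(b * b for b in p2))
    return math.degrees(math.acos(dot / (norm1 * norm2)))

p1 = [0, 0, 1, 1, 1]
p2 = [1, 0, 0, 1, 1]
print(round(cosine_distance_degrees(p1, p2), 1))  # about 48.2
```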

29 Cosine-Measure Diagram
[Diagram: vectors p_1 and p_2 drawn from the origin, with the angle θ between them and the projection p_1·p_2/|p_2| marked.]
- d(p_1, p_2) = θ = arccos(p_1·p_2 / (|p_1||p_2|)).

30 Why C.D. Is a Distance Measure
- d(x,x) = 0 because arccos(1) = 0.
- d(x,y) = d(y,x) by symmetry.
- d(x,y) ≥ 0 because angles are chosen to be in the range 0 to 180 degrees.
- Triangle inequality: physical reasoning. If I rotate through an angle from x to z and then from z to y, I can't rotate less than the angle from x to y.

31 Edit Distance
- The edit distance of two strings is the number of inserts and deletes of characters needed to turn one into the other.
- Equivalently: d(x,y) = |x| + |y| – 2|LCS(x,y)|.
  - LCS = longest common subsequence = any longest string obtained both by deleting from x and by deleting from y.

32 Example: LCS
- x = abcde; y = bcduve.
- Turn x into y by deleting a, then inserting u and v after d.
  - Edit distance = 3.
- Or, LCS(x,y) = bcde.
- Note: |x| + |y| – 2|LCS(x,y)| = 5 + 6 – 2·4 = 3 = edit distance.
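A minimal dynamic-programming sketch (not from the slides) of this insert/delete-only edit distance via the LCS formula, checked on the example strings:

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x, 1):
        for j, b in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a == b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def edit_distance(x, y):
    """Insert/delete-only edit distance: |x| + |y| - 2*|LCS(x,y)|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(lcs_length("abcde", "bcduve"), edit_distance("abcde", "bcduve"))  # 4 3
```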

33 Why Edit Distance Is a Distance Measure
- d(x,x) = 0 because 0 edits suffice.
- d(x,y) = d(y,x) because insert and delete are inverses of each other.
- d(x,y) ≥ 0: there is no notion of negative edits.
- Triangle inequality: changing x to z and then to y is one way to change x to y.

34 Variant Edit Distances
- Allow insert, delete, and mutate.
  - Mutate = change one character into another.
- The minimum number of inserts, deletes, and mutates also forms a distance measure.
- Ditto for any set of operations on strings.
  - Example: substring reversal is OK for DNA sequences.