Distance and Similarity Measures

Similar presentations
Clustering Clustering of data is a method by which large sets of data is grouped into clusters of smaller sets of similar data. The example below demonstrates.
Prediction Modeling for Personalization & Recommender Systems Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.
Clustering Basic Concepts and Algorithms
Data Mining Feature Selection. Data reduction: Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same.
Data Mining Classification: Alternative Techniques
1 CS 391L: Machine Learning: Instance Based Learning Raymond J. Mooney University of Texas at Austin.
CLUSTERING PROXIMITY MEASURES
Jeff Howbert Introduction to Machine Learning Winter Collaborative Filtering Nearest Neighbor Approach.
Distance and Similarity Measures
Data Mining Techniques: Clustering
Navneet Goyal. Instance Based Learning  Rote Classifier  K- nearest neighbors (K-NN)  Case Based Resoning (CBR)
What is Cluster Analysis? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or.
Database Management Systems, R. Ramakrishnan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides.
Recommender systems Ram Akella February 23, 2011 Lecture 6b, i290 & 280I University of California at Berkeley Silicon Valley Center/SC.
Distance Measures Tan et al. From Chapter 2.
Cluster Analysis (1).
Recommender systems Ram Akella November 26 th 2008.
Distance Measures Tan et al. From Chapter 2. Similarity and Dissimilarity Similarity –Numerical measure of how alike two data objects are. –Is higher.
Advanced Multimedia Text Classification Tamara Berg.
Supervised Learning and k Nearest Neighbors Business Intelligence for Managers.
Copyright R. Weber Machine Learning, Data Mining ISYS370 Dr. R. Weber.
Distributed Networks & Systems Lab. Introduction Collaborative filtering Characteristics and challenges Memory-based CF Model-based CF Hybrid CF Recent.
K Nearest Neighborhood (KNNs)
1 Computing Relevance, Similarity: The Vector Space Model.
CPSC 404 Laks V.S. Lakshmanan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides at UC-Berkeley.
Chapter 2: Getting to Know Your Data
Recommender Systems Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata Credits to Bing Liu (UIC) and Angshul Majumdar.
Chapter 6 – Three Simple Classification Methods © Galit Shmueli and Peter Bruce 2008 Data Mining for Business Intelligence Shmueli, Patel & Bruce.
CIS 530 Lecture 2 From frequency to meaning: vector space models of semantics.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach,
CS Machine Learning Instance Based Learning (Adapted from various sources)
Tallahassee, Florida, 2016 CIS4930 Introduction to Data Mining Getting to Know Your Data Peixiang Zhao.
Matrix Factorization & Singular Value Decomposition Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.
Eick: kNN kNN: A Non-parametric Classification and Prediction Technique Goals of this set of transparencies: 1.Introduce kNN---a popular non-parameric.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Chapter 14 – Association Rules and Collaborative Filtering © Galit Shmueli and Peter Bruce 2016 Data Mining for Business Analytics (3rd ed.) Shmueli, Bruce.
Data Mining: Concepts and Techniques
Lecture 2-2 Data Exploration: Understanding Data
Data Mining – Algorithms: Instance-Based Learning
MIRA, SVM, k-NN Lirong Xia.
Instance Based Learning
Classification Nearest Neighbor
COP 6726: New Directions in Database Systems
Lecture Notes for Chapter 2 Introduction to Data Mining
Similarity and Dissimilarity
Instance Based Learning (Adapted from various sources)
School of Computer Science & Engineering
K Nearest Neighbor Classification
Collaborative Filtering Nearest Neighbor Approach
Section 7.12: Similarity By: Ralucca Gera, NPS.
Clustering and Multidimensional Scaling
Representation of documents and queries
Text Categorization Assigning documents to a fixed set of categories
Data Mining 資料探勘 分群分析 (Cluster Analysis) Min-Yuh Day 戴敏育
From frequency to meaning: vector space models of semantics
COSC 4335: Other Classification Techniques
Nearest Neighbor Classifiers
Data Mining: Concepts and Techniques — Chapter 2 —
What Is Good Clustering?
Data Transformations targeted at minimizing experimental variance
Nearest Neighbors CSC 576: Data Mining.
CSE4334/5334 Data Mining Lecture 7: Classification (4)
Presentation transcript:

Distance and Similarity Measures Bamshad Mobasher DePaul University

Distance or Similarity Measures
Many data mining and analytics tasks involve comparing objects and determining their similarities (or dissimilarities):
- Clustering
- Nearest-neighbor search, classification, and prediction
- Characterization and discrimination
- Automatic categorization
- Correlation analysis
Many of today's real-world applications rely on computing similarities or distances among objects:
- Personalization
- Recommender systems
- Document categorization
- Information retrieval
- Target marketing

Similarity and Dissimilarity
Similarity
- Numerical measure of how alike two data objects are
- Value is higher when objects are more alike
- Often falls in the range [0, 1]
Dissimilarity (e.g., distance)
- Numerical measure of how different two data objects are
- Lower when objects are more alike
- Minimum dissimilarity is often 0
- Upper limit varies
Proximity refers to either a similarity or a dissimilarity

Distance or Similarity Measures
Measuring distance
- In order to group similar items, we need a way to measure the distance between objects (e.g., records)
- This often requires representing objects as “feature vectors”
Examples (from an Employee DB table and a document-term frequency table):
- Feature vector corresponding to Employee 2: <M, 51, 64000.0>
- Feature vector corresponding to Document 4: <0, 1, 0, 3, 0, 0>

Distance or Similarity Measures
Properties of distance measures:
- For all objects A and B: dist(A, B) >= 0, and dist(A, B) = dist(B, A)
- For any object A: dist(A, A) = 0
- Triangle inequality: dist(A, C) <= dist(A, B) + dist(B, C)
Representation of objects as vectors:
- Each data object (item) can be viewed as an n-dimensional vector, where the dimensions are the attributes (features) in the data
- Example (employee DB): Emp. ID 2 = <M, 51, 64000>
- Example (documents): DOC2 = <3, 1, 4, 3, 1, 2>
- The vector representation allows us to compute distance or similarity between pairs of items using standard vector operations, e.g.:
  - Cosine of the angle between vectors
  - Manhattan distance
  - Euclidean distance
  - Hamming distance

Data Matrix and Distance Matrix
Data matrix
- Conceptual representation of a table: columns = features; rows = data objects
- n data points with p dimensions
- Each row in the matrix is the vector representation of a data object
Distance (or similarity) matrix
- n data points, but indicates only the pairwise distance (or similarity)
- A triangular matrix
- Symmetric

Proximity Measure for Nominal Attributes
If object attributes are all nominal (categorical), proximity measures such as the following are used to compare objects. A nominal attribute can take 2 or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute).
Method 1: Simple matching
- d(i, j) = (p - m) / p, where m = # of matches and p = total # of variables
Method 2: Convert to standard spreadsheet format
- For each attribute A, create M binary attributes for the M nominal states of A
- Then use standard vector-based similarity or distance metrics
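
A minimal Python sketch of the two methods above; the records, attribute names, and states are hypothetical, chosen only to illustrate the mechanics.

# Method 1: simple matching distance d(i, j) = (p - m) / p
def simple_matching_distance(x, y):
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)   # number of matching attributes
    return (p - m) / p

# Method 2: expand each nominal attribute into one binary attribute per state
def one_hot(record, states_per_attribute):
    vec = []
    for value, states in zip(record, states_per_attribute):
        vec.extend(1 if value == s else 0 for s in states)
    return vec

states = [("red", "yellow", "blue", "green"), ("circle", "square")]   # hypothetical attribute states
a = ("red", "circle")
b = ("red", "square")
print(simple_matching_distance(a, b))   # 0.5 (1 of 2 attributes differ)
print(one_hot(a, states))               # [1, 0, 0, 0, 1, 0]
print(one_hot(b, states))               # [1, 0, 0, 0, 0, 1]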

Proximity Measure for Binary Attributes
A contingency table for binary data compares objects i and j attribute by attribute: q = number of attributes where both are 1, r = where i is 1 and j is 0, s = where i is 0 and j is 1, t = where both are 0
- Distance measure for symmetric binary variables: d(i, j) = (r + s) / (q + r + s + t)
- Distance measure for asymmetric binary variables: d(i, j) = (r + s) / (q + r + s)
- Jaccard coefficient (similarity measure for asymmetric binary variables): sim(i, j) = q / (q + r + s)
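
A minimal Python sketch of the binary measures above, computing the q, r, s, t contingency counts for two hypothetical binary vectors.

def binary_counts(i, j):
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t

def symmetric_binary_distance(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s + t)

def asymmetric_binary_distance(i, j):
    q, r, s, _ = binary_counts(i, j)
    return (r + s) / (q + r + s)

def jaccard_similarity(i, j):
    q, r, s, _ = binary_counts(i, j)
    return q / (q + r + s)

i = [1, 0, 1, 1, 0, 0]                       # hypothetical binary objects
j = [1, 1, 1, 0, 0, 0]
print(symmetric_binary_distance(i, j))       # (1 + 1) / 6 = 0.33...
print(asymmetric_binary_distance(i, j))      # (1 + 1) / 4 = 0.5
print(jaccard_similarity(i, j))              # 2 / 4 = 0.5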

Normalizing or Standardizing Numeric Data
Z-score: z = (x - μ) / σ
- x: raw value to be standardized, μ: mean of the population, σ: standard deviation
- Measures the distance between the raw score and the population mean in units of the standard deviation
- Negative when the value is below the mean, positive when above
Min-max normalization: x' = (x - min) / (max - min), which maps values into [0, 1] (and can then be rescaled to any new range)
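
A minimal Python sketch of the two schemes above, using the population mean and standard deviation; the salary values are hypothetical.

import statistics

def z_scores(values):
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)        # population standard deviation
    return [(x - mu) / sigma for x in values]

def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in values]

salaries = [30000, 45000, 64000, 82000]
print(z_scores(salaries))                    # negative below the mean, positive above
print(min_max(salaries))                     # values mapped into [0, 1]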

Common Distance Measures for Numeric Data
Consider two vectors X = <x1, x2, ..., xn> and Y = <y1, y2, ..., yn> (rows in the data matrix)
Common distance measures:
- Manhattan distance: d(X, Y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|
- Euclidean distance: d(X, Y) = SQRT( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )
Distance can also be defined as a dual of a similarity measure

Example: Data Matrix and Distance Matrix
A small data matrix together with its distance matrix under Manhattan distance and under Euclidean distance.
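
A minimal Python sketch of building the two distance matrices from a data matrix; the four two-dimensional data objects are hypothetical, not the (missing) table from the slide.

import math

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_matrix(points, dist):
    return [[dist(p, q) for q in points] for p in points]

data = [(1, 2), (3, 5), (2, 0), (4, 5)]      # rows = data objects, columns = features
for row in distance_matrix(data, manhattan):
    print(["%.2f" % d for d in row])         # symmetric, with a zero diagonal
for row in distance_matrix(data, euclidean):
    print(["%.2f" % d for d in row])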

Distance on Numeric Data: Minkowski Distance
Minkowski distance is a popular distance measure:
d(i, j) = ( |xi1 - xj1|^h + |xi2 - xj2|^h + ... + |xip - xjp|^h )^(1/h)
where i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
Euclidean and Manhattan distances are special cases:
- h = 1: (L1 norm) Manhattan distance
- h = 2: (L2 norm) Euclidean distance
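
A minimal Python sketch of the Minkowski distance, checking that h = 1 and h = 2 reduce to the Manhattan and Euclidean distances; the two vectors are hypothetical.

def minkowski(x, y, h):
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

x, y = (0, 3, 4, 5), (7, 6, 3, -1)
print(minkowski(x, y, 1))                    # L1 / Manhattan: 7 + 3 + 1 + 6 = 17
print(minkowski(x, y, 2))                    # L2 / Euclidean: sqrt(49 + 9 + 1 + 36), about 9.75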

Vector-Based Similarity Measures
In some situations, distance measures provide a skewed view of the data, e.g., when the data is very sparse and 0’s in the vectors are not significant. In such cases, vector-based similarity measures are typically used.
Most common measure: cosine similarity
- Dot product of two vectors: X · Y = x1*y1 + x2*y2 + ... + xn*yn
- The norm of a vector X is: ||X|| = SQRT(x1^2 + x2^2 + ... + xn^2)
- Cosine similarity = normalized dot product: sim(X, Y) = (X · Y) / (||X|| * ||Y||)

Vector-Based Similarity Measures
Why divide by the norm?
Example: X = <2, 0, 3, 2, 1, 4>
- ||X|| = SQRT(4+0+9+4+1+16) = 5.83
- X* = X / ||X|| = <0.343, 0, 0.514, 0.343, 0.171, 0.686>
- Note that ||X*|| = 1, so dividing a vector by its norm turns it into a unit-length vector
Cosine similarity measures the cosine of the angle between two unit-length vectors (i.e., the magnitudes of the vectors are ignored).
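
A minimal Python sketch of cosine similarity as a normalized dot product, reproducing the ||X|| = 5.83 example above; doc2 and doc4 are the document vectors used in the worked example a few slides below.

import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def cosine(x, y):
    return dot(x, y) / (norm(x) * norm(y))

X = [2, 0, 3, 2, 1, 4]
print(round(norm(X), 2))                        # 5.83
print([round(v / norm(X), 3) for v in X])       # unit-length version of X

doc2 = [3, 1, 4, 3, 1, 2, 0, 1]
doc4 = [0, 1, 0, 3, 0, 0, 2, 0]
print(round(cosine(doc2, doc4), 2))             # 0.42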

Example Application: Information Retrieval
- Documents are represented as “bags of words”
- Represented as vectors when used computationally
- A vector is an array of floating-point values (or binary values in the case of bit maps)
- A vector has direction and magnitude
- Each vector has a place for every term in the collection (most vectors are sparse, i.e., most entries are zero)
Example document vectors over the terms nova, galaxy, heat, actor, film, role (each row gives a document ID followed by its nonzero term weights):
A: 1.0, 0.5, 0.3
B: 0.5, 1.0
C: 1.0, 0.8, 0.7
D: 0.9, 1.0, 0.5
E: 1.0, 1.0
F: 0.7
G: 0.5, 0.7, 0.9
H: 0.6, 1.0, 0.3, 0.2
I: 0.7, 0.5, 0.3

Documents & Query in n-dimensional Space
- Documents are represented as vectors in the term space
- Typically the value in each dimension corresponds to the frequency of the corresponding term in the document
- Queries are represented as vectors in the same vector space
- Cosine similarity between the query and documents is often used to rank retrieved documents
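
A minimal Python sketch of ranking documents by cosine similarity to a query represented in the same term space; the term-frequency vectors and document IDs are hypothetical.

import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

corpus = {
    "DOC1": [2, 0, 4, 3, 0, 1, 0, 2],
    "DOC2": [3, 1, 4, 3, 1, 2, 0, 1],
    "DOC3": [0, 2, 0, 1, 3, 0, 4, 1],
}
query = [0, 1, 0, 3, 0, 0, 2, 0]             # query vector in the same term space

ranked = sorted(corpus, key=lambda d: cosine(query, corpus[d]), reverse=True)
for doc_id in ranked:
    print(doc_id, round(cosine(query, corpus[doc_id]), 2))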

Example: Similarities among Documents
Consider two documents from a document-term matrix:
Doc2 = <3, 1, 4, 3, 1, 2, 0, 1> and Doc4 = <0, 1, 0, 3, 0, 0, 2, 0>
- Dot-Product(Doc2, Doc4) = 0 + 1 + 0 + 9 + 0 + 0 + 0 + 0 = 10
- Norm(Doc2) = SQRT(9+1+16+9+1+4+0+1) = 6.40
- Norm(Doc4) = SQRT(0+1+0+9+0+0+4+0) = 3.74
- Cosine(Doc2, Doc4) = 10 / (6.40 * 3.74) = 0.42

Correlation as Similarity
In cases where data objects can have very different mean values (e.g., users rating movies on different parts of the scale), the Pearson correlation coefficient is often a better choice.
Pearson correlation:
corr(X, Y) = Σ (xi - x̄)(yi - ȳ) / ( SQRT(Σ (xi - x̄)^2) * SQRT(Σ (yi - ȳ)^2) ), where x̄ and ȳ are the means of X and Y
Often used in recommender systems based on collaborative filtering

Distance-Based Classification
Basic idea: classify new instances based on their similarity to, or distance from, instances we have seen before; also called “instance-based learning” or memory-based reasoning (MBR)
Simplest form of MBR: rote learning
- Learning by memorization: save all previously encountered instances
- Given a new instance, find the memorized instance that most closely “resembles” it, and assign the new instance to the same class as this “nearest neighbor”
- More general methods try to find the k nearest neighbors rather than just one
- But how do we define “resembles”?
MBR is “lazy”
- It defers all of the real work until a new instance is obtained; no attempt is made to learn a generalized model from the training set
- Less data preprocessing and model evaluation, but more work has to be done at classification time

Nearest Neighbor Classifiers
Basic idea: if it walks like a duck and quacks like a duck, then it’s probably a duck
Given a test record: compute its distance to the training records, then choose the k “nearest” records

K-Nearest-Neighbor Strategy
Given an object x, find the k most similar objects to x (the k nearest neighbors)
- A variety of distance or similarity measures can be used to identify and rank the neighbors
- Note that this requires comparing x to all objects in the database
Classification:
- Find the class label for each of the k neighbors
- Use a voting or weighted voting approach to determine the majority class among the neighbors (a combination function); weighted voting means the closest neighbors count more
- Assign the majority class label to x
Prediction:
- Identify the value of the target attribute for each of the k neighbors
- Return the weighted average as the predicted value of the target attribute for x
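
A minimal Python sketch of the strategy above, covering classification with distance-weighted voting and prediction with a weighted average; the training records and the choice of Euclidean distance are illustrative assumptions.

import math
from collections import defaultdict

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_nearest(x, training, k):
    # training: list of (feature_vector, target) pairs
    return sorted(training, key=lambda rec: euclidean(x, rec[0]))[:k]

def knn_classify(x, training, k):
    votes = defaultdict(float)
    for features, label in k_nearest(x, training, k):
        votes[label] += 1.0 / (euclidean(x, features) + 1e-9)   # closer neighbors count more
    return max(votes, key=votes.get)

def knn_predict(x, training, k):
    neighbors = k_nearest(x, training, k)
    weights = [1.0 / (euclidean(x, f) + 1e-9) for f, _ in neighbors]
    return sum(w * t for w, (_, t) in zip(weights, neighbors)) / sum(weights)

labeled = [((1.0, 2.0), "Cat1"), ((1.2, 1.8), "Cat1"), ((4.0, 4.5), "Cat2"), ((4.2, 5.0), "Cat2")]
numeric = [((1.0, 2.0), 3.0), ((1.2, 1.8), 4.0), ((4.0, 4.5), 6.0), ((4.2, 5.0), 7.0)]
print(knn_classify((1.1, 2.1), labeled, k=3))           # "Cat1"
print(round(knn_predict((4.1, 4.8), numeric, k=3), 2))  # weighted average near 6.5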

K-Nearest-Neighbor Strategy
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

Combination Functions
Voting: the “democracy” approach
- Poll the neighbors for the answer and use the majority vote
- The number of neighbors (k) is often taken to be odd in order to avoid ties; this works when the number of classes is two
- If there are more than two classes, take k to be the number of classes plus 1
Impact of k on predictions
- In general, different values of k affect the outcome of classification
- We can associate a confidence level with predictions (this can be the % of neighbors that are in agreement)
- The problem is that no single category may get a majority vote
- If there are strong variations in results for different choices of k, this is an indication that the training set is not large enough

Voting Approach - Example
Will a new customer respond to solicitation?
- Using the voting method without confidence
- Using the voting method with a confidence measure

KNN for Document Categorization
A document-term frequency matrix with terms T1-T8 and documents DOC1-DOC7, each labeled Cat1 or Cat2 (e.g., DOC1 is Cat1, DOC3 is Cat2); DOC8 is the new document whose category (“?”) is to be determined.

KNN for Document Categorization
Using cosine similarity to find the K = 3 nearest neighbors of DOC8 (terms T1-T8; vector norms and similarities to DOC8):
DOC1: Norm = 5.83, Sim(D8, D1) = 0.61
DOC2: Norm = 5.74, Sim(D8, D2) = 0.12
DOC3: Norm = 5.29, Sim(D8, D3) = 0.84
DOC4: Norm = 2.45, Sim(D8, D4) = 0.79
DOC5: Norm = 4.47, Sim(D8, D5) = 0.00
DOC6: Norm = 4.12, Sim(D8, D6) = 0.73
DOC7: Norm = 6.16, Sim(D8, D7) = 0.72
DOC8: Norm = 5.66
E.g.: Sim(D8, D7) = (D8 · D7) / (Norm(D8) * Norm(D7)) = (3x2 + 1x1 + 0x3 + 4x4 + 1x0 + 0x2 + 2x0 + 1x2) / (5.66 x 6.16) = 25 / 34.87 = 0.72

KNN for Document Categorization
The K = 3 nearest neighbors of DOC8 are DOC3 (Sim = 0.84, Cat2), DOC4 (Sim = 0.79, Cat1), and DOC6 (Sim = 0.73, Cat2).
- Simple voting: Cat for DOC8 = Cat2, with confidence 2/3 = 0.67
- Weighted voting: Cat for DOC8 = Cat2, with confidence (0.84 + 0.73) / (0.84 + 0.79 + 0.73) = 0.66
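
A minimal Python sketch of the two voting rules applied to the three nearest neighbors of DOC8 and their similarities from the example above.

from collections import defaultdict

neighbors = [("DOC3", 0.84, "Cat2"), ("DOC4", 0.79, "Cat1"), ("DOC6", 0.73, "Cat2")]

# Simple voting: majority class, confidence = fraction of neighbors that agree
counts = defaultdict(int)
for _, _, cat in neighbors:
    counts[cat] += 1
winner = max(counts, key=counts.get)
print(winner, round(counts[winner] / len(neighbors), 2))             # Cat2 0.67

# Weighted voting: similarity-weighted scores, confidence = winner's share of total weight
weights = defaultdict(float)
for _, sim, cat in neighbors:
    weights[cat] += sim
winner = max(weights, key=weights.get)
print(winner, round(weights[winner] / sum(weights.values()), 3))     # Cat2 0.665 (the 0.66 above)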

Combination Functions
Weighted voting: not so “democratic”
- Similar to voting, but the votes of some neighbors count more (“shareholder democracy?”)
- The question is which neighbors’ votes should count more, i.e., how the weights can be obtained
Distance-based weights
- Closer neighbors get higher weights
- The “value” of a vote is the inverse of the distance (may need to add a small constant)
- The weighted sum for each class gives the combined score for that class
- To compute confidence, take the weighted average
Heuristic weights
- The weight for each neighbor is based on domain-specific characteristics of that neighbor
Advantages of weighted voting
- Introduces enough variation to prevent ties in most cases
- Helps distinguish between competing neighbors

KNN and Collaborative Filtering
Collaborative filtering example: a movie rating system
- Ratings scale: 1 = “hate it”; 7 = “love it”
- The historical DB of users includes ratings of movies by Sally, Bob, Chris, and Lynn
- Karen is a new user who has rated 3 movies but has not yet seen “Independence Day”; should we recommend it to her?
- Will Karen like “Independence Day”?

Collaborative Filtering (k Nearest Neighbor Example)
K is the number of nearest neighbors used to find the average predicted rating of Karen on Independence Day.
Example computation:
Pearson(Sally, Karen) = ( (7-5.33)*(7-4.67) + (6-5.33)*(4-4.67) + (3-5.33)*(3-4.67) ) / SQRT( ((7-5.33)^2 + (6-5.33)^2 + (3-5.33)^2) * ((7-4.67)^2 + (4-4.67)^2 + (3-4.67)^2) ) = 0.85
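
A minimal Python sketch of the Pearson correlation used above, reproducing Pearson(Sally, Karen) = 0.85 from the three co-rated movies (Sally: 7, 6, 3; Karen: 7, 4, 3).

import math

def pearson(x, y):
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

sally = [7, 6, 3]
karen = [7, 4, 3]
print(round(pearson(sally, karen), 2))       # 0.85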

Collaborative Filtering (k Nearest Neighbor)
In practice, a more sophisticated approach is used to generate the predictions based on the nearest neighbors.
To generate a prediction for a target user a on an item i:
pred(a, i) = ra + ( Σu sim(a, u) * (ru,i - ru) ) / ( Σu |sim(a, u)| )
where
- ra = mean rating for user a (and ru the mean rating for neighbor u)
- u1, …, uk are the k nearest neighbors to a (the sums run over these neighbors)
- ru,i = rating of user u on item i
- sim(a, u) = Pearson correlation between a and u
This is a weighted average of deviations from the neighbors’ mean ratings (and closer neighbors count more).
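
A minimal Python sketch of the prediction formula above. Each neighbor is supplied as (sim(a, u), mean rating of u, u’s rating on item i); Karen’s mean over her three rated movies (4.67) and the 0.85 similarity from the example are reused, while the neighbors’ means and item ratings here are hypothetical.

def predict_rating(target_mean, neighbors):
    # neighbors: list of (similarity, neighbor_mean, neighbor_rating_on_item)
    num = sum(sim * (r_ui - u_mean) for sim, u_mean, r_ui in neighbors)
    den = sum(abs(sim) for sim, _, _ in neighbors)
    return target_mean if den == 0 else target_mean + num / den

karen_mean = 4.67
neighbors = [(0.85, 5.33, 7), (0.60, 4.00, 5), (0.40, 6.00, 6)]   # (sim, neighbor mean, rating on item)
print(round(predict_rating(karen_mean, neighbors), 2))            # Karen's mean plus a weighted average of deviations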