The Yelp Dataset Challenge


The Yelp Dataset Challenge: Predicting User Ratings

Overview The Yelp Dataset Creating the Data Matrix Creating the Training Data Grouping and Completing Results and Problems Further Ideas

The Yelp Dataset Creating the Data Matrix Creating the Training Data Grouping and Completing Results and Problems Further Ideas

The Yelp Business Dataset
61,184 businesses in 10 cities
Extracted data: business IDs, city, categories, average rating (compare to the results later)

MATLAB code to find the restaurants in Phoenix:

    string = 'Phoenix';
    for i = 1:n                       % n is the total number of businesses
        if findstr(string, city_business{i})
            Phoenix = [Phoenix; i];   % indices of all Phoenix businesses
        end
    end
    % ... get the business IDs, stored in PhoenixBusID (13601x1)
    % Now find which of those businesses are 'Restaurants'
    string2 = '"Restaurants"';
    for i = 1:n
        if findstr(string2, categories_business2{i})
            Restaurants = [Restaurants; i];   % indices of all restaurants
        end
    end
    % Find the restaurants in Phoenix
    PhoenixRest = intersect(Phoenix, Restaurants);   % -> 2653 restaurants
    % Next, find all the business IDs of the restaurants in Phoenix
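For reference, a rough Python equivalent of the filtering above; the two lists are made-up toy stand-ins for the dataset fields `city_business` and `categories_business2`:

```python
# toy stand-ins for the dataset fields (hypothetical values)
city_business = ['Phoenix', 'Las Vegas', 'Phoenix', 'Phoenix']
categories_business2 = ['"Restaurants" "Pizza"', '"Restaurants"', '"Bars"', '"Restaurants"']

# indices of Phoenix businesses and of restaurants, then their intersection
phoenix = [i for i, c in enumerate(city_business) if 'Phoenix' in c]
restaurants = [i for i, cats in enumerate(categories_business2) if '"Restaurants"' in cats]
phoenix_rest = sorted(set(phoenix) & set(restaurants))   # -> [0, 3]
```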

The Yelp Review Dataset
1,562,264 reviews
Idea: filter out the restaurants in Phoenix using the business IDs
Extracted data: user IDs, user ratings

MATLAB: ratings of the restaurants in Phoenix

    for i = 1:2653
        % ...
        for k = j:1569111   % optimized: the reviews of the last Phoenix business end here
            if findstr(stringm, business_id_review{k})
                record1 = [record1; k];
                label   = [label; i];
            else
                flag = 1;
                break
            end
        end
    end

3 hours later: 141,200 ratings from 45,867 users (a 3:1 ratio → BAD)

Idea: every person is a linear combination of a group of similar people.

Given the user's ratings b = (2, 5, 0, 3, 0)^T and a group of similar users (one user per column)

    A = [0 4 4 0 2
         4 0 4 2 2
         2 1 0 4 0
         3 0 5 4 4
         1 0 0 1 4],

1. Delete all zero rows of b (here, rows 3 and 5): b_new = (2, 5, 3)^T.
2. Delete the same rows of A:

    A_new = [0 4 4 0 2
             4 0 4 2 2
             3 0 5 4 4].

3. Solve A_new * alpha = b_new using LASSOpos and get alpha = (1.16, 0.5, 0, 0, 0)^T.
4. Go back to the original matrix A: A * alpha = b_comp = (2.00, 4.64, 2.82, 3.48, 1.16)^T; round to b_comp = (2, 5, 3, 3, 1)^T.
5. Finally, if necessary, replace falsely created entries in b_comp.

Why LASSO? It gives a sparse linear combination of the columns of A that (approximately) reproduces b, which we can then use as a means for matrix completion.
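The completion step above can be sketched in NumPy. `lasso_pos` below is a minimal ISTA loop with a one-sided (positive) soft-threshold standing in for LASSOpos; the penalty and iteration count are assumptions for illustration, not the project's actual values:

```python
import numpy as np

def lasso_pos(A, b, lam=0.05, iters=20000):
    """Minimal ISTA sketch: min 0.5*||Ax-b||^2 + lam*||x||_1 with x >= 0."""
    dt = 1.0 / np.linalg.norm(A, 2) ** 2      # safe step size (1 / sigma_max^2)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - dt * (A.T @ (A @ x - b))      # gradient step
        x = np.maximum(z - lam * dt, 0.0)     # one-sided soft-threshold
    return x

b = np.array([2., 5., 0., 3., 0.])
A = np.array([[0, 4, 4, 0, 2],
              [4, 0, 4, 2, 2],
              [2, 1, 0, 4, 0],
              [3, 0, 5, 4, 4],
              [1, 0, 0, 1, 4]], dtype=float)
keep = b > 0                                  # rows the user actually rated
alpha = lasso_pos(A[keep], b[keep])           # sparse nonnegative weights
b_comp = np.rint(A @ alpha)                   # completed, rounded ratings
```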

The Yelp Dataset Creating the Data Matrix Creating the Training Data Grouping and Completing Results and Problems Further Ideas

Creating the Data Matrix
Challenge: never lose track of the indices; we need to know which business each rating belongs to in order to validate.
The 3:1 ratings-to-users ratio needs to get better.
Step 1: find all users with more than 20 reviews (596 users), corresponding to 2,234 businesses.
Result: less than 1% of the matrix is filled → BAD.
Note: some users had multiple reviews of the same restaurant; decision: overwrite the earlier ones, since the last review is the most recent.

Improving the Data Matrix
Trim 1: restaurants with > 20 reviews, users with > 15 reviews: 198 restaurants x 274 users; 86% zeros (sort of bad) → BAD results.
Trim 2: restaurants with > 45 reviews, users with > 26 reviews: 48 restaurants x 111 users; only 66% zeros → MUCH better results.
(Matrix layout: businesses as rows, users as columns.)
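A trimming step like this can be sketched in NumPy; the thresholds mirror the slide's idea, while the toy matrix is made up:

```python
import numpy as np

def trim(M, min_rest, min_user):
    """Keep restaurants (rows) and users (columns) with enough nonzero ratings."""
    rows = (M > 0).sum(axis=1) > min_rest
    cols = (M > 0).sum(axis=0) > min_user
    return M[np.ix_(rows, cols)]

# toy ratings matrix (0 = no rating), just to show the shape change
M = np.array([[1, 2, 0],
              [0, 3, 4],
              [5, 0, 0]])
M_trim = trim(M, min_rest=1, min_user=1)   # keeps rows/cols with >1 rating
```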

A small part of the Bad Matrix

A small part of the Good Matrix

The Yelp Dataset Creating the Data Matrix Creating the Training Data Grouping and Completing Results and Problems Further Ideas

Creating the Training Sets

    x1 = find(BestMatrix(:,k) > 0);     % indices of the ratings in column k
    x1sratings = BestMatrix(x1, k);     % the ratings user k actually gave
    % ... (collect the other users' ratings at the same places)
    x1_others_trim_index = find(x1_countotherreviews > 5);  % 5 or more ratings in common
    % find which users are most alike:
    if abs(x1sratings(j) - x1_others_trim(j,i)) == 0        % same rating at place j
        % count on how many reviews they agree
    end
    % ... then find the most un-like users -> the training set is unique to each user
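The same agreement counting can be sketched in NumPy on a made-up toy matrix (the values and the overlap threshold are illustrative only):

```python
import numpy as np

# toy ratings matrix: rows = restaurants, columns = users, 0 = no rating
M = np.array([[5, 5, 1, 0],
              [3, 3, 2, 3],
              [0, 4, 1, 4],
              [4, 4, 5, 4],
              [2, 2, 1, 0]])
k = 0                                      # the user we build a training set for
rated = M[:, k] > 0                        # places user k actually rated
others = M[rated]                          # everyone's ratings at those places
common = (others > 0).sum(axis=0)          # ratings in common with user k
agree = ((others == others[:, [k]]) & (others > 0)).sum(axis=0)
cand = np.where(common >= 2)[0]            # enough overlap (threshold is a choice)
cand = cand[cand != k]
most_like = cand[np.argmax(agree[cand])]   # a "YES" training point
least_like = cand[np.argmin(agree[cand])]  # a "NO" training point
```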

The Yelp Dataset Creating the Data Matrix Creating the Training Data Grouping and Completing Results and Problems Further Ideas

Grouping: Semi- and Unsupervised Spectral Clustering
Create the kernel matrix; use the training points (unique to each user); each user gets an individual binary classification.

    % Kernel matrix
    for i = 1:n
        for j = 1:n
            v = -norm(BestMatrix(:,i) - BestMatrix(:,j), 2)^2;
            W(i,j) = exp(v / (2*sigma^2));   % we have a choice of sigma
        end
    end
    % Construct the graph Laplacian L
    L = diag(sum(W,2), 0) - W;
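The same kernel matrix and unnormalized Laplacian, vectorized in NumPy (sigma remains a free choice; the tiny X is an arbitrary example):

```python
import numpy as np

def graph_laplacian(X, sigma=1.0):
    """Gaussian kernel matrix W and unnormalized Laplacian L = D - W.
    Columns of X are users, as in BestMatrix."""
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # pairwise squared distances
    W = np.exp(-sq / (2 * sigma ** 2))
    L = np.diag(W.sum(axis=1)) - W
    return W, L

X = np.array([[1., 2., 3.],
              [0., 1., 0.]])   # tiny example: 2 ratings, 3 users
W, L = graph_laplacian(X)
```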

Semi-supervised (one binary classification per user):

    for i = 1:column
        ingroup    = ...;   % the training data: YES
        inopposite = ...;   % the training data: NO
        b = zeros(n,1);
        b(ingroup)    =  1;
        b(inopposite) = -1;
        M = diag(abs(b), 0);
        lambda = 1000;
        g = (L + lambda*M) \ (lambda*M*b);
        Cest1 = double(g > 0) + 1;
        userfriends{i} = find(Cest1 == 2);
    end

Unsupervised:

    % Compute the second eigenvector
    [V,E] = eigs(L, 2, 'sm');
    f = V(:,1);
    Cest = double(f > 0) + 1;
    Group1 = find(Cest == 1);
    Group2 = find(Cest == 2);
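A NumPy sketch of the semi-supervised step; the λ value follows the slide, while the four-node graph with two obvious clusters is made up for illustration:

```python
import numpy as np

def semi_supervised_groups(L, yes_idx, no_idx, lam=1000.0):
    """Solve (L + lam*M) g = lam*M*b and threshold g, as on the slide."""
    n = L.shape[0]
    b = np.zeros(n)
    b[yes_idx] = 1.0
    b[no_idx] = -1.0
    M = np.diag(np.abs(b))                    # weights only the labeled points
    g = np.linalg.solve(L + lam * M, lam * (M @ b))
    return np.where(g > 0)[0]                 # indices grouped with "YES"

# four nodes: two tight pairs (0,1) and (2,3), weakly connected across
W = np.array([[0.00, 1.00, 0.01, 0.01],
              [1.00, 0.00, 0.01, 0.01],
              [0.01, 0.01, 0.00, 1.00],
              [0.01, 0.01, 1.00, 0.00]])
L = np.diag(W.sum(axis=1)) - W
friends = semi_supervised_groups(L, yes_idx=[0], no_idx=[2])
```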

LASSOpos.m
One change to the last sub-function of LASSO.m:

    function m = PROX(z, lambda, dt)
        c = lambda*dt;
        m = (z - c) .* double(z - c > 0);   % deleted: + (z+c).*double(z+c < 0)
    end

This keeps only positive entries in alpha. Perform LASSO on every user and assemble the results into a new matrix: the matrix is completed.
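The modified proximal step is just one-sided soft-thresholding; in NumPy:

```python
import numpy as np

def prox_pos(z, lam, dt):
    """One-sided soft-threshold: the negative branch is dropped, so outputs are >= 0."""
    c = lam * dt
    return np.maximum(z - c, 0.0)

out = prox_pos(np.array([3.0, -2.0, 0.5]), lam=1.0, dt=1.0)   # -> [2., 0., 0.]
```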

Summing up the Approach
1. Training points → semi-supervised clustering → LASSO
2. Unsupervised clustering → LASSO
3. LASSO alone (no grouping)
(Recall from the homework that unsupervised spectral clustering gave really bad results.)

The Yelp Dataset Creating the Data Matrix Creating the Training Data Grouping and Completing Results and Problems Further Ideas

Verifying Results
Note: unlike in class, there is no original matrix to compare to.
Check 1: pick a random user, delete their given entries, complete the user, and compare to the given entries.
Check 2: take the average rating of the completed data and compare it to
- the current average rating on Yelp for the business,
- the average rating of the data we used (before completion),
- the average rating of the user.
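Check 1 can be sketched as follows; `complete_fn` stands in for whichever completion routine is used (here a trivial fill-every-zero-with-3 placeholder, purely for illustration):

```python
import numpy as np

def holdout_check(M, complete_fn, k):
    """Hide user k's given ratings, complete, report the average points off."""
    given = np.where(M[:, k] > 0)[0]
    M2 = M.copy()
    M2[given, k] = 0
    pred = complete_fn(M2)[:, k]
    return np.abs(pred[given] - M[given, k]).mean()

# toy matrix and a placeholder "completion" that fills every zero with 3
M = np.array([[5, 4],
              [1, 2],
              [0, 3]])
err = holdout_check(M, lambda A: np.where(A > 0, A, 3), k=0)   # -> 2.0
```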

Errors for the BAD Matrix
Deleting the given entries of a random user and comparing actual vs. predicted entries (25 ratings):
Semi-supervised learning: on average 1.44 points off per given rating, vs. 1.52 points for RANDOM predictions.
LASSO alone: only slightly better results.
Recall: the BAD matrix has 86% zeros, i.e. only 14% entries (and note that every variant uses LASSO for the completion itself).
Over the 25 given reviews the total difference was 36 (random: 38); so a bunch of predictions are still correct, but many are VERY far off.

Errors for the GOOD Matrix
Deleting the given entries of a random user and comparing actual vs. predicted entries (28 ratings):
Semi-supervised learning: on average 0.7857 points off per given rating.
Control: RANDOM predictions are 1.5 points off.
Unsupervised learning: on average 0.9286 points off per given review.
LASSO alone: 0.92 points off per given review.
Recall: the GOOD matrix has 66% zeros, i.e. 34% entries.

Values from an example user (semi-supervised)
[Table: left column = true value, right column = predicted rating; cells color-coded as exact, off by 1, or off by more than 1. Visible values: 5 3 4 2 | 5 3 4 2.]

Errors for the GOOD Matrix: comparing averages of user ratings (semi-supervised)
Average of the original vs. predicted ratings over all users:
Before: 3.9370. After: 3.4746. Difference: 0.4642.
We noted that the predicted averages are always a little lower, partly because we overwrite out-of-range predictions (7's etc.).

Errors for the GOOD Matrix: comparing averages of business ratings (semi-supervised)
Example: Pane Bianco (a pizza place in Phoenix)
Current on Yelp (including all the users we threw out): 4.0
Before completion (using only our users): 4.12
After completion: 3.64
Differences: 0.36 and 0.48

The Yelp Dataset Creating the Data Matrix Creating the Training Data Grouping and Completing Results and Problems Further Ideas

Further Ideas
Take the predictions from the smaller matrix and use them to make the bigger matrix less sparse, then complete again.
Apply the method to different cities and compare results.
Use only good ratings; overwrite bad ratings with 0.
Revisit the points where subjective decisions were made (e.g. how to build the training set) and try different approaches.
Look at the quality of the reviews (available in the original Yelp data).

The End Presentation by Daniel Hallman & Maike Scherer