The Yelp Dataset Challenge

Predicting User Ratings

Overview
- The Yelp Dataset
- Creating the Data Matrix
- Creating the Training Data
- Grouping and Completing
- Results and Problems
- Further Ideas


The Yelp Business Dataset
- 61,184 businesses in 10 cities
- Extracted data: business IDs, city, categories, average rating (to compare with our results)

MATLAB code to find Restaurants in Phoenix
    string = 'Phoenix';
    Phoenix = [];
    for i = 1:n   % n is the total number of businesses
        if ~isempty(strfind(city_business{i}, string))
            Phoenix = [Phoenix; i];   % indices of all Phoenix businesses
        end
    end
    …
    % get the business IDs, stored in PhoenixBusID (13601x1)
    % now find which of those businesses are restaurants
    string2 = '"Restaurants"';
    Restaurants = [];
    for i = 1:n
        if ~isempty(strfind(categories_business2{i}, string2))
            Restaurants = [Restaurants; i];   % indices of all restaurants
        end
    end
    % find the restaurants in Phoenix
    PhoenixRest = intersect(Phoenix, Restaurants);   % → 2653 restaurants
    % next, find the business IDs of all restaurants in Phoenix

The Yelp Review Dataset
- 1,562,264 reviews
- Idea: use the business IDs to keep only the reviews of restaurants in Phoenix
- Extracted data: user IDs, user ratings

MATLAB: Ratings of Restaurants in Phoenix
    for i = 1:2653
        …
        for k = j:   % optimized for last business in Phoenix
            if findstr(stringm, business_id_review{k})
                record1 = [record1; k];
                label = [label; i];
            else
                flag = 1;
                break
            end
        end
    end
→ 3 hours later… ratings from 45,867 users (3:1 ratio → BAD)

Idea: every person is a linear combination of a group of similar people
Given user ratings $b$ and a group of similar users $A$:
1. Delete all zero rows in $b$ to get $b_{new}$ (here, rows 3 and 5).
2. Delete the same rows in $A$ to get $A_{new}$.
3. Solve $A_{new}\,\alpha = b_{new}$ using LASSOpos to get $\alpha$.
4. Go back to the original matrix $A$: $A\alpha = b_{comp}$.
5. Round the entries of $b_{comp}$.
6. Finally, if necessary, replace falsely created entries in $b_{comp}$.
Why LASSO: it gives us a linear combination of the columns of $A$ that produces $b$, which we can then use as a means of matrix completion.
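The steps above can be tried on a small example. This is only a sketch under stated assumptions: the matrix A and vector b are made up, and MATLAB's built-in lsqnonneg (nonnegative least squares, without an l1 penalty) stands in for LASSOpos, which is not shown in full in this talk. Reading step 6 as "restore the ratings that were actually given" is also an assumption.

```matlab
% Toy data (made up for illustration): rows = restaurants, columns = similar users.
A = [5 4; 3 2; 4 5; 1 2; 2 3];
b = [5; 3; 0; 1; 0];          % user to complete; 0 means "not rated"

rated = b > 0;                % rows where b has a rating
A_new = A(rated, :);          % drop the unrated rows (here rows 3 and 5)
b_new = b(rated);

% Stand-in solver (assumption): nonnegative least squares instead of LASSOpos.
alpha = lsqnonneg(A_new, b_new);

b_comp = round(A * alpha);    % predict every entry, then round
b_comp(rated) = b(rated);     % restore the ratings that were actually given
```

In this toy case b agrees with the first similar user on every rated restaurant, so the weight lands on that user and the two missing entries are filled from that user's column.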

Creating the Data Matrix
- Challenge: never lose track of indices; we need to know the business to be able to validate
- The 3:1 ratio needs to get better
- Step 1: find all users with more than 20 entries (596 users), corresponding to 2234 businesses
- Less than 1% filled → BAD
- Some users had multiple reviews for one restaurant. Decision: overwrite the earlier ones, since the last one is the latest.
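A minimal sketch of the overwrite decision, assuming the reviews arrive as (business, user, rating) triplets sorted from oldest to newest. The variable names biz, user, stars, M and fill are hypothetical and the data is made up.

```matlab
% Hypothetical review triplets, ordered oldest to newest.
biz   = [1 2 1 3 1];      % business index of each review
user  = [1 1 2 2 1];      % user index of each review
stars = [4 5 3 2 1];      % rating of each review

M = zeros(3, 2);          % businesses x users, 0 = no rating
for r = 1:numel(stars)
    M(biz(r), user(r)) = stars(r);   % a later review overwrites an earlier one
end

fill = nnz(M) / numel(M); % fraction of filled entries
```

User 1 reviewed business 1 twice (4 stars, later 1 star), so only the later rating survives in M(1,1). The fill value is the kind of number behind the "< 1% filled" remark.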

Improving the Data Matrix
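Trim: restaurants with more than 20 ratings, users with more than 15 ratings
- 198 restaurants, 274 users (198 × 274 matrix, businesses × users)
- 86% zeros (sort of bad)
- BAD results

Trim: restaurants with more than 45 ratings, users with more than 26 ratings
- 48 restaurants, 111 users (48 × 111 matrix, businesses × users)
- Only 66% zeros
- MUCH better results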
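The trimming can be sketched as follows, assuming a single pass in which both counts are taken on the untrimmed matrix (the talk does not say whether the trimming was repeated). The matrix and the thresholds minPerRestaurant and minPerUser are made up for illustration.

```matlab
% Toy ratings matrix (made up): rows = restaurants, columns = users, 0 = no rating.
M = [5 0 3 0;
     4 2 0 0;
     0 0 0 1;
     3 4 5 0];
minPerRestaurant = 1;   % hypothetical thresholds; the slides use 20/15 and 45/26
minPerUser       = 1;

keepRows = sum(M > 0, 2) > minPerRestaurant;   % restaurants with enough ratings
keepCols = sum(M > 0, 1) > minPerUser;         % users with enough ratings
T = M(keepRows, keepCols);

zeroShare = 1 - nnz(T) / numel(T);             % fraction of zeros after trimming
```

Because keepRows and keepCols are logical masks on the original matrix, the original business and user indices can be recovered with find(keepRows) and find(keepCols), which helps with the "never lose track of indices" challenge.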

A small part of the Bad Matrix

A small part of the Good Matrix

Creating the Training Sets
    x1 = find(BestMatrix(:,k) > 0);   % find the indices of ratings in column k
    % and find what those ratings are
    x1sratings = BestMatrix(x1,k);
    % create matrix with only the places user k rated
    …
    x1_others_trim_index = find(x1_countotherreviews > 5);   % more than 5 of the same ratings
    % find which users are most alike
    if abs(x1sratings(j) - x1_others_trim(j,i)) == 0   % find where the ratings are the same
    % count on how many reviews they agree
→ then find the most unlike users
→ the training points are unique to each user
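As a rough illustration of the agreement count, here is a vectorized sketch on a made-up matrix. BestMatrix, x1 and x1sratings follow the snippet above; others, agree and mostAlike are hypothetical names, and the minimum-overlap filter from the original code is left out.

```matlab
% Toy matrix (made up): rows = restaurants, columns = users, 0 = no rating.
BestMatrix = [5 5 1;
              3 3 5;
              4 0 1;
              0 2 2];
k = 1;                                   % the user we build training data for

x1 = find(BestMatrix(:, k) > 0);         % restaurants rated by user k
x1sratings = BestMatrix(x1, k);          % the ratings user k gave there

others = BestMatrix(x1, :);              % everyone's ratings on those restaurants
agree = sum(others == repmat(x1sratings, 1, size(others, 2)), 1);
agree(k) = -1;                           % never pick the user themselves

[~, mostAlike] = max(agree);             % candidate for the "YES" training point
```

The most unlike user could be picked analogously from the smallest agreement count among the other users.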

Grouping Semi- and Unsupervised Spectral Clustering
- Create kernel matrix
- Use training points (unique to each user): each user gets an individual binary classification

    % Kernel Matrix
    for i = 1:n
        for j = 1:n
            v = -(norm(BestMatrix(:,i) - BestMatrix(:,j), 2)^2);
            W(i,j) = exp(v/(2*sigma^2));   % → we have a choice of sigma
        end
    end
    % Construct L
    L = diag(sum(W,2), 0) - W;
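The same kernel matrix can also be built without loops. The sketch below uses a made-up matrix and an assumed sigma = 2 (the talk leaves the choice of sigma open); sq and D2 are hypothetical helper names.

```matlab
% Toy matrix (made up): columns are users.
BestMatrix = [5 5 1; 3 3 5; 4 4 1];
sigma = 2;                                 % assumed value; the talk leaves sigma open
n = size(BestMatrix, 2);

sq = sum(BestMatrix.^2, 1);                % squared norm of each column
D2 = repmat(sq', 1, n) + repmat(sq, n, 1) - 2 * (BestMatrix' * BestMatrix);
W = exp(-D2 / (2 * sigma^2));              % Gaussian kernel weights
L = diag(sum(W, 2)) - W;                   % graph Laplacian
```

Users 1 and 2 have identical columns here, so their weight is 1, while user 3 is far from both and gets a weight close to 0. Every row of L sums to zero, which is a quick sanity check on the construction.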

Supervised

    for i = 1:column
        ingroup = …;      % → the training data YES
        inopposite = …;   % → the training data NO
        b = zeros(n,1);
        b(ingroup) = 1;
        b(inopposite) = -1;
        M = diag(abs(b), 0);
        lambda = 1000;
        g = (L + lambda*M) \ (lambda*M*b);
        Cest1 = double(g>0) + 1;
        userfriends{i} = find(Cest1 == 2);
    end

Unsupervised

    % Compute second eigenvector
    [V,E] = eigs(L, 2, 'sm');
    f = V(:,1);
    Cest = double(f>0) + 1;
    Group1 = find(Cest == 1);
    Group2 = find(Cest == 2);
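To see what the semi-supervised solve does, here is a sketch on a made-up similarity graph with two obvious groups and one training point of each kind. The weights in W and the choice of users 1 and 6 as training points are assumptions for illustration; friends is a hypothetical name.

```matlab
% Toy similarity graph (made up): users 1-3 are alike, users 4-6 are alike.
W = [0 1 1   0   0 0;
     1 0 1   0   0 0;
     1 1 0   0.1 0 0;
     0 0 0.1 0   1 1;
     0 0 0   1   0 1;
     0 0 0   1   1 0];
n = size(W, 1);
L = diag(sum(W, 2)) - W;       % graph Laplacian

ingroup = 1;                   % hypothetical "YES" training point
inopposite = 6;                % hypothetical "NO" training point
b = zeros(n, 1);
b(ingroup) = 1;
b(inopposite) = -1;
M = diag(abs(b));              % penalize only the labeled users
lambda = 1000;

g = (L + lambda*M) \ (lambda*M*b);   % smooth the labels over the graph
friends = find(g > 0);               % users grouped with the "YES" point
```

With a large lambda the labeled users stay close to +1 and -1, and the unlabeled users take values between them according to how strongly they are connected to each side.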

LASSOpos.m
One change to the last sub-function of LASSO.m:

    function m = PROX(z, lambda, dt)
    c = lambda*dt;
    m = (z-c).*double(z-c > 0);   % + (z+c).*double(z+c < 0);  → delete

- Only positive entries in $\alpha$
- Perform Lasso on every user
- Assemble into a new matrix
- The matrix is completed
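LASSO.m itself is not shown in this talk, so the following is only a sketch of how a one-sided threshold like PROX could be used inside a plain proximal-gradient loop. The data, lambda, the step size dt and the iteration count are assumptions; prox_pos is a hypothetical name for the same operation as the modified PROX.

```matlab
% Toy problem (made up): b is exactly the first column of A.
A = [5 4; 3 2; 1 2];
b = [5; 3; 1];
lambda = 0.1;                          % assumed penalty weight
dt = 1 / norm(A)^2;                    % step size from the largest singular value

prox_pos = @(z, c) max(z - c, 0);      % one-sided soft threshold, as in PROX above

alpha = zeros(size(A, 2), 1);
for it = 1:5000
    grad = A' * (A*alpha - b);                     % gradient of 0.5*||A*alpha - b||^2
    alpha = prox_pos(alpha - dt*grad, lambda*dt);  % gradient step, then threshold
end
```

Since max(z - c, 0) can never return a negative number, every entry of alpha stays nonnegative. In this toy case the second entry is driven to zero and the first ends slightly below 1 because of the penalty.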

Summing up the Approach
- Training points → Semi-supervised → Lasso
- Unsupervised → Lasso
- Lasso (no grouping)
Remember from the homework that unsupervised spectral clustering gave really bad results.

Verifying Results
Note: unlike in class, there is no original matrix to compare to; we do not have the actual matrix that we complete.
- Pick a random user, delete the given entries, complete the user, and compare to the given entries
- Take the average rating of the completed matrix and compare it to:
  - the current average rating of the business in the given Yelp data
  - the average rating of the used data (before completion)
  - the average rating of the user
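The first check can be sketched like this. The held-out ratings and predictions below are made up, and drawing the control from a uniform distribution over 1 to 5 is an assumption about how the random predictions were generated.

```matlab
% Hypothetical held-out ratings and the values a completion produced for them.
actual    = [4 5 2 3];
predicted = [4 4 2 5];

absErr  = abs(actual - predicted);
meanOff = mean(absErr);               % average points off per rating
exact   = sum(absErr == 0);           % predictions that hit exactly
farOff  = sum(absErr > 1);            % predictions off by more than one point

% Control (assumption): uniformly random ratings from 1 to 5.
randomGuess = randi(5, size(actual));
randomOff = mean(abs(actual - randomGuess));
```

Reporting exact and farOff next to meanOff shows whether an average hides a mix of exact hits and large misses.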

Errors for BAD Matrix: Deleting Given Entries of a Random User
Calculate the difference: actual vs created entries (25 ratings)
- Semi-supervised learning: an average of 1.44 points off from each given rating, vs. RANDOM predictions, which are 1.52 points off
- LASSO only: slightly better results
- Recall: the BAD matrix has 86% zeros, i.e. 14% entries
- LASSO is always used for the completion, with or without grouping
- Given 25 reviews, the total difference was 36 (36/25 = 1.44); for random it is 38 (38/25 = 1.52). This means that a good number are still correct, but many are VERY far off.

Errors for GOOD Matrix: Deleting Given Entries of a Random User
Calculate the difference: actual vs created entries (28 ratings)
- Semi-supervised learning: average of … points off from each given rating
- Control: RANDOM predictions are 1.5 points off
- Unsupervised learning: average of … points off from each given review
- LASSO only: 0.92 points off from each given review
- Recall: the GOOD matrix has 66% zeros, i.e. 34% entries

Values from Example User (Semi-Supervised)
[Table: true value (left column) vs. predicted rating (right column); entries marked as exact, off by 1, or off by more than 1]

Errors for GOOD Matrix: Comparing Averages of User Ratings
Semi-supervised: average of original vs predicted ratings of all users
- Before: …
- After: …
- Difference: …
We noted that the averages are always a little bit lower after completion; we overwrite out-of-range predictions such as 7s.

Errors for GOOD Matrix: Comparing Averages of Business Ratings
Semi-supervised: original vs predicted average rating of a business
Example: Pane Bianco (pizza place in Phoenix)
- Yelp: 4.0 (the current Yelp rating includes all the users that we threw out)
- Before (using our users): 4.12
- After: 3.64
- Difference: 0.36 (vs. Yelp) and 0.48 (vs. before)


Further Ideas
- Take the predictions from the smaller matrix and use them to make the bigger matrix less sparse, then complete again
- Apply to different cities and compare results
- Use just the good ratings; overwrite bad ratings with 0
- Go back to points where subjective decisions were made and use a different approach, e.g. how to make the training set
- Look at the quality of the review (in the original Yelp data)

The End
Presentation by Daniel Hallman & Maike Scherer
