The Yelp Dataset Challenge: Predicting User Ratings
Overview
- The Yelp Dataset
- Creating the Data Matrix
- Creating the Training Data
- Grouping and Completing
- Results and Problems
- Further Ideas
The Yelp Business Dataset
- 61,184 businesses in 10 cities
- Extracted data:
  - Business IDs
  - City
  - Categories
  - Average rating (to compare with our results)
MATLAB code to find restaurants in Phoenix

    string = 'Phoenix';
    Phoenix = [];
    for i = 1:n   % n is the total number of businesses
        if ~isempty(strfind(city_business{i}, string))
            Phoenix = [Phoenix; i];   % indices of all Phoenix businesses
        end
    end
    …
    % the business IDs are stored in PhoenixBusID (13601x1)

    % Now we find which of those businesses are restaurants
    string2 = '"Restaurants"';
    Restaurants = [];
    for i = 1:n
        if ~isempty(strfind(categories_business2{i}, string2))
            Restaurants = [Restaurants; i];   % indices of all restaurants
        end
    end

    % Find the restaurants in Phoenix
    PhoenixRest = intersect(Phoenix, Restaurants);   % 2653 restaurants

    % Next, find the business IDs of all the restaurants in Phoenix
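The two loops above can also be written without explicit iteration. The following is only a sketch on made-up data: the four-element cell arrays are hypothetical stand-ins for the real `city_business` and `categories_business2`, and it relies on `strfind` accepting a cell array of character vectors.

```matlab
% Hypothetical toy data standing in for the real cell arrays
city_business = {'Phoenix'; 'Las Vegas'; 'Phoenix'; 'Madison'};
categories_business2 = {'["Restaurants", "Pizza"]'; '["Restaurants"]'; ...
                        '["Shopping"]'; '["Bars"]'};

% strfind on a cell array returns one (possibly empty) match list per cell
isPhoenix    = ~cellfun(@isempty, strfind(city_business, 'Phoenix'));
isRestaurant = ~cellfun(@isempty, strfind(categories_business2, '"Restaurants"'));

% Indices of businesses that are both in Phoenix and restaurants
PhoenixRest = find(isPhoenix & isRestaurant);
```

On the toy data only the first business satisfies both conditions. The logical masks replace the growing index vectors and the call to `intersect`.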
The Yelp Review Dataset
- 1,562,264 reviews
- Idea: select the reviews of restaurants in Phoenix using the business IDs
- Extracted data:
  - User IDs
  - User ratings
MATLAB: Ratings of Restaurants in Phoenix

    for i = 1:2653
        …
        for k = j:1569111   % optimized for last business in Phoenix
            if ~isempty(strfind(business_id_review{k}, stringm))
                record1 = [record1; k];   % index of the matching review
                label = [label; i];       % index of the restaurant it belongs to
            else
                flag = 1;
                break
            end
        end
    end

Result, 3 hours later: 141,200 ratings from 45,867 users (about 3 ratings per user, a 3:1 ratio: BAD)
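One possible way to avoid the long-running nested loop is a single set-membership lookup with `ismember`. This is a sketch, not the code that was used: `PhoenixRestBusID` is a hypothetical name for the list of Phoenix restaurant IDs, and the IDs themselves are made up.

```matlab
% Hypothetical toy data: two Phoenix restaurants, five reviews
PhoenixRestBusID   = {'b1'; 'b7'};
business_id_review = {'b7'; 'b3'; 'b1'; 'b7'; 'b9'};

% is_match(k) is true if review k belongs to a Phoenix restaurant;
% label_all(k) is the position of that restaurant in PhoenixRestBusID (0 if none)
[is_match, label_all] = ismember(business_id_review, PhoenixRestBusID);

record1 = find(is_match);       % indices of the matching reviews
label   = label_all(is_match);  % restaurant index for each matching review
```

Unlike the loop, this returns the matches ordered by review index rather than grouped by restaurant; sorting by `label` would restore the grouping if it is needed.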
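Idea: every person is a linear combination of a group of similar people

Given the user ratings
$$b = \begin{pmatrix} 2 \\ 5 \\ 0 \\ 3 \\ 0 \end{pmatrix}$$
and a group of similar users (one column per user)
$$A = \begin{pmatrix} 0 & 4 & 4 & 0 & 2 \\ 4 & 0 & 4 & 2 & 2 \\ 2 & 1 & 0 & 4 & 0 \\ 3 & 0 & 5 & 4 & 4 \\ 1 & 0 & 0 & 1 & 4 \end{pmatrix},$$
- Delete all zero rows in $b$ (here, rows 3 and 5): $b_{new} = \begin{pmatrix} 2 \\ 5 \\ 3 \end{pmatrix}$.
- Delete the same rows in $A$: $A_{new} = \begin{pmatrix} 0 & 4 & 4 & 0 & 2 \\ 4 & 0 & 4 & 2 & 2 \\ 3 & 0 & 5 & 4 & 4 \end{pmatrix}$.
- Solve $A_{new}\,\alpha = b_{new}$ using LASSOpos and get $\alpha = \begin{pmatrix} 1.1600 & 0.5000 & 0 & 0 & 0 \end{pmatrix}^T$.
- Go back to the original matrix $A$: $A\alpha = b_{comp}$, where $b_{comp} = \begin{pmatrix} 2.0000 & 4.6400 & 2.8200 & 3.4800 & 1.1600 \end{pmatrix}^T$; round to get $b_{comp} = \begin{pmatrix} 2 & 5 & 3 & 3 & 1 \end{pmatrix}^T$.
- Finally, if necessary, replace falsely created entries in $b_{comp}$.

Why LASSO: it gives us a linear combination of the columns of $A$ that reproduces $b$, which we can then use as a means for matrix completion.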
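The arithmetic of this worked example can be checked with a few lines of MATLAB. This sketch does not run LASSOpos; it takes the coefficient vector stated on the slide as given and only reproduces the row deletion and the completion step.

```matlab
A = [0 4 4 0 2;
     4 0 4 2 2;
     2 1 0 4 0;
     3 0 5 4 4;
     1 0 0 1 4];           % similar users, one column per user
b = [2; 5; 0; 3; 0];       % ratings of the user to complete (0 = missing)

known = find(b > 0);       % rows where the user has a rating
A_new = A(known, :);       % same rows of A
b_new = b(known);

% Coefficients as stated on the slide (not recomputed here)
alpha = [1.16; 0.5; 0; 0; 0];

b_raw  = A * alpha;        % completion before rounding
b_comp = round(b_raw);     % completed ratings
```

The rounded result fills rows 3 and 5 with the ratings 3 and 1 while the known rows stay at 2, 5 and 3.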
Creating the Data Matrix
- Challenge: never lose track of indices
  - We need to know which business an entry belongs to in order to validate
- The 3:1 ratio needs to get better
- Step 1: find all users with more than 20 entries (596 users)
  - Corresponding to 2234 businesses
  - Less than 1% filled: BAD
- Note: some users had multiple reviews for one restaurant. Decision: overwrite those, so that the latest review is the one that is kept
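A minimal sketch of how such a matrix could be assembled, assuming the reviews are available as (user, restaurant, rating) index triplets in chronological order. The variable names and the six toy reviews are hypothetical; the layout (restaurants as rows, users as columns) follows the later slides.

```matlab
% Hypothetical review triplets, assumed to be sorted from oldest to newest
user_idx = [1 1 2 3 3 3];
rest_idx = [1 2 1 2 3 2];
rating   = [5 3 4 2 4 5];

% Rows are restaurants, columns are users; 0 means "no rating"
R = zeros(max(rest_idx), max(user_idx));
for t = 1:numel(rating)
    % A later review of the same restaurant by the same user overwrites the earlier one
    R(rest_idx(t), user_idx(t)) = rating(t);
end

fill_fraction = nnz(R) / numel(R);   % share of entries that are filled
```

Here user 3 reviewed restaurant 2 twice (2, then 5), and only the later rating 5 survives. Keeping the index vectors around is one way to "never lose track of indices" when validating later.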
Improving the Data Matrix
- Trim: restaurants > 20 entries, users > 15 entries
  - 198 restaurants, 274 users (198 × 274 matrix)
  - 86% zeros (sort of bad)
  - BAD results
- Trim: restaurants > 45 entries, users > 26 entries
  - 48 restaurants, 111 users (48 × 111 matrix)
  - Only 66% zeros
  - MUCH better results
[Figure: the two trimmed matrices; axes: users and businesses]
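The trimming step might look like the following sketch. It is a single pass that counts entries on the untrimmed matrix, which is an assumption; the slides do not say whether the counts were recomputed after each trim. The toy matrix and thresholds are hypothetical.

```matlab
% Hypothetical 4 x 4 rating matrix (restaurants x users), 0 = no rating
R = [5 4 0 3;
     3 0 5 0;
     0 0 4 0;
     2 5 1 4];
min_rest = 2;   % keep restaurants with MORE than this many ratings
min_user = 2;   % keep users with MORE than this many ratings

keep_rest = sum(R > 0, 2) > min_rest;   % count ratings per row
keep_user = sum(R > 0, 1) > min_user;   % count ratings per column

T = R(keep_rest, keep_user);            % trimmed matrix
zero_fraction = 1 - nnz(T) / numel(T);  % sparsity after trimming
```

Because removing rows lowers the column counts (and vice versa), a trimmed matrix can still contain users or restaurants below the threshold; repeating the two steps until nothing changes would be one way to tighten this.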
A small part of the Bad Matrix
A small part of the Good Matrix
Creating the Training Sets

    x1 = find(BestMatrix(:,k) > 0);   % find the indices of the ratings in column k
    x1sratings = BestMatrix(x1,k);    % and find what those ratings are
    % create matrix with only the places that user k rated
    …
    x1_others_trim_index = find(x1_countotherreviews > 5);   % more than 5 ratings in common
    % find which users are most alike
    if abs(x1sratings(j) - x1_others_trim(j,i)) == 0   % find where the ratings are the same
        % count on how many reviews they agree
    end

Then find the most unlike users. Both groups are unique to each user.
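The idea of counting agreements can be sketched on a toy matrix as follows. This is not the original code: it uses the number of identical ratings on commonly rated restaurants as the similarity score, and treats the user with the fewest agreements as the "most unlike" one, which is an assumption about how the opposite group was chosen.

```matlab
% Hypothetical rating matrix (restaurants x users), 0 = no rating
R = [5 5 1 0;
     3 3 5 3;
     4 4 2 4;
     0 2 1 5];
k = 1;                                   % user we build the training set for

rated   = find(R(:,k) > 0);              % restaurants user k has rated
n_users = size(R, 2);
agree   = nan(1, n_users);               % NaN marks user k itself
for i = 1:n_users
    if i == k, continue; end
    both = rated(R(rated, i) > 0);       % restaurants rated by both users
    agree(i) = sum(R(both, i) == R(both, k));   % number of identical ratings
end

[~, most_like]   = max(agree);           % max and min skip NaN entries
[~, most_unlike] = min(agree);
```

For user 1 the agreement counts with users 2, 3 and 4 are 3, 0 and 2, so user 2 is the most similar and user 3 the most unlike.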
Grouping: Semi-Supervised and Unsupervised Spectral Clustering
- Create the kernel matrix
- Use the training points (unique to each user)
- Each user gets an individual binary classification

    % Kernel matrix
    for i = 1:n
        for j = 1:n
            v = -(norm(BestMatrix(:,i) - BestMatrix(:,j), 2)^2);
            W(i,j) = exp(v/(2*sigma^2));   % we have a choice of sigma
        end
    end
    % Construct L
    L = diag(sum(W,2), 0) - W;
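For larger matrices the double loop can be replaced by a vectorized computation of the squared distances. The sketch below uses a hypothetical 3 × 3 matrix `X` in place of `BestMatrix` and an arbitrary `sigma`; it is meant to show the same kernel and Laplacian, not to suggest a particular bandwidth.

```matlab
X = [5 5 1;
     3 3 5;
     4 4 2];        % hypothetical ratings, one column per user
sigma = 2;          % assumed bandwidth

% Squared Euclidean distances between all pairs of columns:
% ||xi - xj||^2 = ||xi||^2 + ||xj||^2 - 2*xi'*xj
sq = sum(X.^2, 1);
D  = bsxfun(@plus, sq', sq) - 2*(X'*X);
D  = max(D, 0);     % guard against tiny negative values from rounding

W = exp(-D / (2*sigma^2));    % Gaussian kernel matrix
L = diag(sum(W, 2)) - W;      % unnormalized graph Laplacian
```

Users 1 and 2 have identical columns, so their kernel weight is 1, and every row of `L` sums to zero, as it should for an unnormalized graph Laplacian.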
Supervised

    for i = 1:column
        ingroup = …;      % the training data: YES
        inopposite = …;   % the training data: NO
        b = zeros(n,1);
        b(ingroup) = 1;
        b(inopposite) = -1;
        M = diag(abs(b), 0);
        lambda = 1000;
        g = (L + lambda*M) \ (lambda*M*b);
        Cest1 = double(g>0) + 1;
        userfriends{i} = find(Cest1 == 2);
    end

Unsupervised

    % Compute second eigenvector
    [V,E] = eigs(L, 2, 'sm');
    f = V(:,1);   % check diag(E) to confirm this column belongs to the second-smallest eigenvalue
    Cest = double(f>0) + 1;
    Group1 = find(Cest == 1);
    Group2 = find(Cest == 2);
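To see what the supervised solve does, here is a small self-contained sketch on six hypothetical points on a line that form two well-separated clusters, with one labeled point in each. The data, `sigma`, and the choice of labeled points are assumptions made only for illustration; the linear system is the one from the slide.

```matlab
% Hypothetical 1-D data: two clusters around 0 and around 5
x = [0 0.1 0.2 5 5.1 5.2];
n = numel(x);
sigma = 1;                                   % assumed bandwidth

W = exp(-(bsxfun(@minus, x', x).^2) / (2*sigma^2));   % Gaussian kernel
L = diag(sum(W, 2)) - W;                              % graph Laplacian

b = zeros(n, 1);
b(1) = 1;          % training point: YES
b(4) = -1;         % training point: NO
M = diag(abs(b));  % selects the labeled points
lambda = 1000;

g = (L + lambda*M) \ (lambda*M*b);   % labels spread along the graph
Cest = double(g > 0) + 1;            % 2 = same group as the YES point
```

The two labels spread to the unlabeled points of their own cluster, so the first three points end up in group 2 and the last three in group 1.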
LASSOpos.m
One change to the last sub-function of LASSO.m:

    function m = PROX(z, lambda, dt)
    c = lambda*dt;
    m = (z-c).*double(z-c > 0);   % + (z+c).*double(z+c < 0);   <- this term is deleted

- Only nonnegative entries in 𝜶
- Perform Lasso on every user
- Assemble into the new matrix
- The matrix is completed
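LASSO.m itself is not shown on the slides. The sketch below assumes it is a proximal-gradient solver, which the `(z, lambda, dt)` signature of `PROX` suggests, and shows how the modified `PROX` would produce a nonnegative coefficient vector. The value of `lambda`, the step size, and the iteration count are assumptions, so the resulting `alpha` is not expected to match the one quoted earlier in the talk.

```matlab
A = [0 4 4 0 2;
     4 0 4 2 2;
     3 0 5 4 4];          % A_new from the worked example
b = [2; 5; 3];            % b_new from the worked example
lambda = 0.1;             % assumed regularization weight

% One-sided soft threshold: the PROX from the slide, as an anonymous function
prox = @(z, c) (z - c) .* double(z - c > 0);

dt = 1 / norm(A)^2;       % step size 1/sigma_max(A)^2
alpha = zeros(size(A, 2), 1);
for it = 1:2000
    z = alpha - dt * (A' * (A*alpha - b));   % gradient step on 0.5*||A*alpha - b||^2
    alpha = prox(z, lambda*dt);              % shrink and clip at zero
end

% Objective that the iteration decreases
obj = @(a) 0.5 * norm(A*a - b)^2 + lambda * sum(abs(a));
```

Because the second term of the usual soft threshold is removed, every entry that would have become negative is set to zero instead, which is what keeps the weights on the similar users nonnegative.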
Summing up the Approach
- Training points, then semi-supervised grouping, then Lasso
- Unsupervised grouping, then Lasso
- Lasso (no grouping)
Note: remember from the homework that unsupervised spectral clustering gave really bad results.
Verifying Results
- Note: there is no original matrix to compare to. Unlike in class, we do not have the actual matrix that we complete.
- Pick a random user, delete the given entries, complete the user, and compare to the given entries
- Take the average rating of the completed data and compare to:
  - Current average rating of the business in the given Yelp data
  - Average rating of the used data (before completion)
  - Average rating of the user
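The comparison step of the first check can be sketched as follows. The four true and predicted ratings are hypothetical, and the sketch covers only the scoring, not the deletion and completion that come before it.

```matlab
% Hypothetical held-out ratings of one user and the completed values
true_ratings = [5; 3; 4; 2];
predicted    = [4; 3; 5; 5];

err = abs(true_ratings - predicted);   % points off for each rating
mean_error = mean(err);                % average points off

n_exact    = sum(err == 0);            % predicted exactly
n_off_by_1 = sum(err == 1);            % off by one point
n_off_more = sum(err > 1);             % off by more than one point
```

The average error is the "points off from each given rating" figure reported in the results, and the three counts correspond to the exact / off by 1 / off by > 1 breakdown.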
Errors for BAD Matrix
Deleting the given entries of a random user
- Calculate the difference: actual vs. created entries (25 ratings)
- Semi-supervised learning: on average 1.44 points off from each given rating
- Control: RANDOM predictions are 1.52 points off
- LASSO only: slightly better results
- Recall: the BAD matrix has 86% zeros, i.e. 14% entries
Notes: LASSO is used in every variant. Over the 25 reviews the total difference was 36 points, compared to 38 for random; a number of predictions are still correct, but many are VERY far off.
Errors for GOOD Matrix
Deleting the given entries of a random user
- Calculate the difference: actual vs. created entries (28 ratings)
- Semi-supervised learning: on average 0.7857 points off from each given rating
- Control: RANDOM predictions are 1.5 points off
- Unsupervised learning: on average 0.9286 points off from each given review
- LASSO only: 0.92 points off from each given review
- Recall: the GOOD matrix has 66% zeros, i.e. 34% entries
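For context on the random control, the expected error of a random guess can be computed directly. This sketch assumes the random predictions are drawn uniformly from the ratings 1 to 5, which the slides do not state.

```matlab
ratings = 1:5;                      % possible star ratings
expected_err = zeros(1, 5);
for t = ratings
    % average distance between the true rating t and a uniform random guess
    expected_err(t) = mean(abs(ratings - t));
end
% expected_err is [2.0 1.4 1.2 1.4 2.0] for true ratings 1..5
```

Under that assumption a random guess is between 1.2 and 2.0 points off depending on the true rating, which is in line with the roughly 1.5 points observed for the control.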
Values from Example User (Semi-Supervised)
[Table: left column = true value, right column = predicted rating; entries marked as exact, off by 1, or off by > 1. Only the values "5 3 4 2 5 3 4 2" survive in this transcript.]
Errors for GOOD Matrix
Comparing averages of user ratings (semi-supervised)
- Average of original vs. predicted ratings of all users
  - Before: 3.9370
  - After: 3.4746
  - Difference: 0.4624
Note: the averages after completion are always a little lower; we overwrite out-of-range predictions such as 7.
Errors for GOOD Matrix
Comparing averages of business ratings (semi-supervised)
- Average of original vs. predicted ratings of each business
- Example: Pane Bianco (pizza place in Phoenix)
  - Yelp: 4.0
  - Before (using our users): 4.12
  - After: 3.64
  - Difference: 0.36 (to Yelp) and 0.48 (to before)
Note: the current Yelp average includes all the users that we threw out.
Further Ideas
- Take the predictions from the smaller matrix and use them to make the bigger matrix less sparse, then complete again
- Apply to different cities and compare the results
- Use just the good ratings; overwrite bad ratings with 0
- Go back to the points where subjective decisions were made and use a different approach, e.g. for how to make the training set
- Look at the quality of the reviews (in the original Yelp data)
The End Presentation by Daniel Hallman & Maike Scherer