HPVD using error = rating^2 - prediction^2. [Screenshot: feature-vector values for users A-T and the mse column; the specific numbers did not survive extraction.] Actual error = rating - prediction. The actual error rate ended up at ... while starting at ... How close is the ending feature vector to the ending feature vector we get when we learn with error = rating - prediction? The next slide shows the same number of rounds using this actual error and the ending vector for it. The ending actual error after the same number of rounds is ... That ending vector is -->
[Screenshots: ending feature vectors (users A-T, movies 1-8) for VPHD and HPVD, the actual-error VPHD run, and an lrate/MSE trace; the numbers did not survive extraction.] Using HPVD and error = r^2 - p^2 is not fruitful! Back to err = r - p, with a fixed-increment line search each round (using a .005 increment): error = ... in 12 steps. A binary search could take even fewer steps? These could be done in parallel too! Funk's method with L=.001 takes 82 steps. error = r^2 - p^2 with rounds 61 through 166: ... error = r - p with rounds 61 through 166: ... Next slide: continuing this line-search idea for more rounds.
Here is the result after 1 round when using a fixed-increment line search to find the LRATE that minimizes mse: [Table: feature vector a-t plus LRATE and MSE columns; numbers not recoverable.] Without the line search, using LRATE=.001, it takes 81 rounds to arrive at nearly the same mse (and a nearly identical feature vector). Going from the round-1 result (LRATE=.0525) shown here, we do a second round and again do a fixed-increment line search. Note that we come up with an approximately minimized mse at LRATE=.030. Going from this line-search result at LRATE=.03, we do another round, which gives LRATE=.02, and we do the same for the next round. At this point we might conclude that LRATE=.02 is a stable, near-optimal learning rate and just use it for the duration (no further line search). After 200 rounds at LRATE=.02, we arrive at the following (note that it took ~2000 rounds without line search, versus ~219 with): Comparing this resulting feature vector to the one we got when we did ~2000 rounds at LRATE=.001 (without line search), we see that we arrive at a very different feature vector. [Table: the LRATE=.001 (no line search) and LRATE=.020 (with line search) feature vectors, a-t; numbers not recoverable.] However, an interesting observation: the UserFeatureVector portions differ by a constant multiplier and the MovieFeatureVector portions differ by a different constant. If we divide the LR=.001 vector by the LR=.020 vector, we get the following multiplier vector (one is not a dilation of the other, but if we split the user portion from the movie portion, they are!!! What does that mean!?!?!?): ".001/.020" user portion: 1.80 avg, 0.04 std; movie portion: 0.54 avg, 0.01 std. Another interesting observation is that 1/1.8 = .55, that is, 1/AVGufv = AVGmfv. They are reciprocals of one another!!! This makes some sense, since it means that if you double the ufv you have to halve the mfv to get the same predictions. The bottom line is that the predictions are the same! What is the nature of the set of vectors that [nearly] minimize the mse? It is not a subspace (not closed under scalar multiplication), but it is clearly closed under "reciprocal scalar multiplication" (multiplying the mfv's by the reciprocal of the ufv's multiplier). What else can we say about it? So we get an order-of-magnitude speedup by doing line search. In fact it may be more than that, since we may be able to do all the LRATE calculations in parallel (without recalculating the error matrix or feature vectors????). Or there may be a better search mechanism than fixed-increment search. A binary-type search? Other?
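A compact way to state this reciprocal-scaling observation (my notation, not the slides'): for any constant c != 0, scaling the user portion of f by c and the movie portion by 1/c leaves every prediction, and hence the mse, unchanged:

p_{u,m} = f(u) f(m) = (c f(u)) ((1/c) f(m)),  for all c != 0.

So the [near-]minimizers come in one-parameter families {(c * ufv, (1/c) * mfv)}: not a subspace, but exactly the "reciprocal scalar multiplication" closure described above. The measured multipliers fit, since 1.80 x 0.55 ~ 1.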
Here we pause just a minute to see if there is a "calculus-based" closed-form formula which will immediately give us the LRATE producing the minimum mse. If so, that will clearly save time! Below, as you can see, I failed to produce a closed-form formula. There may be ways to cheat and get close enough? Go for it!

Setup: e_{u,m} = r_{u,m} - f(u)f(m), where f is the feature vector (its dimensions include all users (a-t) followed by all movies (1-8)), and

mse = (1/rc) \sum_{m, u \in sup(m)} (r_{u,m} - f(u)f(m))^2

One round updates f(u) += L*A and f(m) += L*B, where A = \sum_{n \in sup(u)} e(u,n) f(n) and B = \sum_{v \in sup(m)} e(v,m) f(v). As a function of the learning rate L, the post-update mse is

MSE(L) = (1/rc) \sum_{m, u \in sup(m)} ( r_{u,m} - (f(u)+LA)(f(m)+LB) )^2

Expanding the product, (f(u)+LA)(f(m)+LB) = f(u)f(m) + (f(m)A + f(u)B)L + ABL^2, so with C = f(m)A + f(u)B:

r_{u,m} - (f(u)+LA)(f(m)+LB) = e_{u,m} - CL - ABL^2

MSE'(L) = (2/rc) \sum_{m, u \in sup(m)} (e_{u,m} - CL - ABL^2)(-C - 2ABL) = 0

Looks hopeless to find a nice formula for L: A, B, C differ for every pair (u,m), and none of the squared terms vanish. So the line search will have to be a simple (maybe binary) search for the heuristically optimal L. On the next slide I show some success with a fixed-increment line search (increment = .005), always starting at .001 (the LRATE that Funk uses throughout).
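To make the fixed-increment search concrete, here is a minimal runnable sketch in C. The .001 start and .005 increment are the slides'; the 4x3 toy ratings matrix, the all-ones initial feature vector, and the 12-round cap are my own choices. Note that each candidate mse(L) is computed independently of the others, which is what makes the parallel evaluation mused about earlier plausible.

/* One-feature SVD training with a fixed-increment line search over L. */
#include <stdio.h>

#define NU 4
#define NM 3

double r[NU][NM] = {{5,0,1},{4,4,0},{0,2,5},{3,0,4}};  /* 0 = blank */
double fu[NU] = {1,1,1,1}, fm[NM] = {1,1,1};

/* mse of the trial vectors fu + L*A, fm + L*B over the nonblank cells */
double mse_at(double L, double A[NU], double B[NM]) {
    double se = 0; int n = 0;
    for (int u = 0; u < NU; u++)
        for (int m = 0; m < NM; m++)
            if (r[u][m] > 0) {
                double e = r[u][m] - (fu[u] + L*A[u]) * (fm[m] + L*B[m]);
                se += e*e; n++;
            }
    return se / n;
}

int main(void) {
    for (int round = 1; round <= 12; round++) {
        double A[NU] = {0}, B[NM] = {0};
        for (int u = 0; u < NU; u++)            /* gradient directions A, B */
            for (int m = 0; m < NM; m++)
                if (r[u][m] > 0) {
                    double e = r[u][m] - fu[u]*fm[m];
                    A[u] += e * fm[m];
                    B[m] += e * fu[u];
                }
        double L = 0.001, best = mse_at(L, A, B);
        while (mse_at(L + 0.005, A, B) < best)  /* step while mse decreases */
            best = mse_at(L += 0.005, A, B);
        for (int u = 0; u < NU; u++) fu[u] += L * A[u];   /* commit round */
        for (int m = 0; m < NM; m++) fm[m] += L * B[m];
        printf("round %2d  L=%.3f  mse=%.5f\n", round, L, best);
    }
    return 0;
}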
[Spreadsheet screenshot: columns A-AD; rows 1-20 hold the ratings and feature vector, rows 21-30 the working error and new feature vector (nfv), rows 51+ the square errors (SE); the LRATE, omse, L, and mse cells are at the right.]

Key cell formulas:
A22: +A2-A$10*$U2  /* error for u=a, m=1 */
A30: +A10+$L*(A$22*$U$2+A$24*$U$4+A$26*$U$6+A$29*$U$9)  /* updates f(u=a) */
U29: +U9+$L*(($A29*$A$30+$K29*$K$30+$N29*$N$30+$P29*$P$30)/4)  /* updates f(m=8) */
AB30: +U29  /* copies the f(m=8) feature update into the new feature vector, nfv */
X22: /* counts the number of actual ratings (users) for m=1; adding these counts over all 8 movies gives the training count */
AD30: /* averages the se's, giving the mse */
A52: +A22^2  /* squares each individual error */

The line-search macro, \a:
/rvnfv~fv~  value-copies the new feature vector nfv into fv
{goto}L~{edit}+.005~  increments L by .005
/XImse<omse ~/xg\a~  IF mse is still decreasing, recalc mse with the new L
.001~  resets L=.001 for the next round
/xg\a~  starts over with the next round
{goto}se~/rvfv~{end}{down}{down}~  "value copies" fv to the output list

Notes: In 2 rounds the mse is as low as Funk gets it in 2000 rounds. After 5 rounds the mse is lower than ever before (and appears to be bottoming out). I know I shouldn't hardcode parameters! Experiments should be done to optimize this line search (e.g., with some binary search for a low mse).

Since we have the resulting individual square errors for each training pair, we could run this, then mask the pairs with se(u,m) > Threshold, then do it again after masking out those that have already achieved a low se. But what do I do with the two resulting feature vectors? Do I treat it like a two-feature SVD, or do I use some linear combo of the resulting predictions of the two (or it could be more than two)? We need to test which works best (or other modifications) on Netflix data. Maybe on those test pairs for which the training row and column have some high errors, we apply the second feature vector instead of the first? Maybe we invoke CkNN for test pairs in this case (or use all 3 in a linear combo?). This is powerful! We need to optimize the calculations using pTrees!!!
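A sketch of that masking step: split the training pairs on a squared-error threshold (1.5x the current mse, per the next slide), so a second feature vector can be trained on just the poorly fit pairs. The 4x3 toy arrays and the easy/hard split are my reading of the note above, not the slides' data.

#include <stdio.h>

#define NU 4
#define NM 3

int main(void) {
    double r[NU][NM]  = {{5,0,1},{4,4,0},{0,2,5},{3,0,4}};   /* 0 = blank */
    double se[NU][NM] = {{.02,0,.9},{.1,.05,0},{0,1.2,.03},{.04,0,.7}};
    double mse = 0; int n = 0;
    for (int u = 0; u < NU; u++)                 /* current training mse */
        for (int m = 0; m < NM; m++)
            if (r[u][m] > 0) { mse += se[u][m]; n++; }
    mse /= n;
    /* r_easy keeps the already-fit pairs; r_hard keeps the poorly fit ones,
       to be retrained with a second feature vector */
    double r_easy[NU][NM], r_hard[NU][NM];
    for (int u = 0; u < NU; u++)
        for (int m = 0; m < NM; m++) {
            int hard = r[u][m] > 0 && se[u][m] > 1.5 * mse;
            r_hard[u][m] = hard ? r[u][m] : 0;
            r_easy[u][m] = hard ? 0 : r[u][m];
        }
    for (int u = 0; u < NU; u++) {               /* show both masked sets */
        for (int m = 0; m < NM; m++) printf("%4.1f", r_easy[u][m]);
        printf("   |");
        for (int m = 0; m < NM; m++) printf("%4.1f", r_hard[u][m]);
        printf("\n");
    }
    return 0;
}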
[Spreadsheet screenshot repeating the layout and formulas of the previous slide, plus the square-error block (SE).] Here I show the square errors to 2 decimal places. If we mask off all > 1.5 times our mse (= ...), then do it again on the next slide.
[Spreadsheet screenshot: the masked ratings (only 4 ratings remain unmasked), the same formulas as before, and the square errors after retraining on the masked set.] Notice that, for these 4 ratings, the mse was driven down from about 1 to .039. That's got to be good!!! A better method? Take each training rating as a separate mask. Combine the resulting predictions in a weighted average?
Take each training rating as a separate mask; combine the resulting [3] predictions in a weighted average? [Spreadsheet screenshots: separate line-search runs with a single training rating unmasked at a time; panels for rating=5, rating=4, rating=2, rating=2, and rating=1, each showing the LRATE/mse trace and the evolving feature vector. Most of the numbers did not survive extraction.]
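One plausible reading of "combine the resulting predictions in a weighted average" (the weighting rule and the numbers here are my assumption, not the slides'): weight each mask's prediction for a test pair by the reciprocal of that mask's training mse, so better-fit masks count more.

#include <stdio.h>

#define NMASK 3

int main(void) {
    double pred[NMASK] = {4.2, 3.8, 4.6};    /* per-mask predictions, one test pair */
    double mse[NMASK]  = {0.04, 0.20, 0.09}; /* each mask's training mse */
    double num = 0, den = 0;
    for (int i = 0; i < NMASK; i++) {
        double w = 1.0 / mse[i];             /* inverse-mse weight */
        num += w * pred[i];
        den += w;
    }
    printf("combined prediction = %.3f\n", num / den);
    return 0;
}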
SVD: follow gradients to minimize mse over the TrainSet.

Taxonomy of classifications, starting with the number of entities in the Training Set:
1-Entity TrainingSet (e.g., IRISes, Concrete, Wines, Seeds, ...): use FAUST or CkNN or ?
2-Entity TrainingSet (e.g., Netflix Cinematch (users, movies), MBR (users, items), TextMining (docs, terms))
3-Entity TrainingSet (e.g., Document Recommenders (users, docs, terms))

Recommender Taxonomy
Two-Entity Recommenders (2Es) (e.g., Users and Items):
User-ID-Yes (UY): 2 entity (users, items) s.t. users give identity at Check-Out (e.g., Sam's, Amazon?, ...)
Quantity-Yes (QY): matrix cells contain the quantity of that item bought by that user
Quantity-No (QN): matrix cells do not contain the quantity of the item
User-ID-No (UN): 2 entity (users, items) s.t. users do not give identity at Check-Out (e.g., Sunmart, ...)
User-ID-Maybe (UM): 2 entity (users, items) s.t. users may give identity (e.g., card carriers, ...)
Ratings-Yes (RY): matrix cells always contain a qualitative rating
Ratings-No (RN): matrix cells never contain a qualitative rating
Ratings-Maybe (RM): matrix cells may contain a rating of that item by that user
Three-Entity Recommenders (3Es) (e.g., Document Recommenders: Users, Documents, Terms)
Matrixes: Document-Term (tfIDFs?); User-Document (2E, Doc=Item); User-Term (user's liking of term). Cyclic 3-hop rolodex (DT, DU, UT).

Netflix (2E, 5-star, 85% blanks) is UY, QN, RM. pTreeSVD works best for RY. RN? (0 blanks; in RN, buy means rating=1, don't buy means rating=0.) In RM, ignore blanks! For a Document Recommender, there are no blanks!

Let uc=user_count, ic=item_count, fc=feature_count, bp=blanks_%; uc=500K, ic=17K, fc=40, bp=85%. HorizSVD converts the 100M non-blanks into 2 matrixes with 500Kx40=20M and 17Kx40=700K entries (total ~21M).

Offsets ARE the way of pTrees. No alternative! Level-1 predicate (>50% 1s) pTrees may work. pTree-mask the non-blanks, or eliminate the blanks and re-pTree-ize?

SVD loop (see the sketch after this slide):
1. Calc SqErrs for each nonblank trainer: e_{m,u} = r_{m,u} - p_{m,u} = r_{m,u} - UF_u o MF_m
2. Update feature vectors (nonblanks only): UF_u += Lrate * \sum_{n \in support(u)} e_{n,u} MF_n and MF_m += Lrate * \sum_{v \in support(m)} e_{m,v} UF_v

[Worksheet residue: columns a-t with m, u, r, r2, p2, e, pi, MFi, UFi, po, MFo, UFo headers; re-create pTrees (eliminate blanks).]
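A sketch of the "pTree-mask the non-blanks" loop above, with the non-blank mask kept as one bitmap per movie so the error and update sums only ever visit set bits. The toy sizes, the uint32 bitmap layout, and __builtin_ctz (GCC/Clang) are my assumptions, a stand-in for real pTree structures.

#include <stdio.h>
#include <stdint.h>

#define NU 20            /* users a..t */
#define NM 8             /* movies 1..8 */

double r[NM][NU];        /* ratings, valid only where the mask bit is set */
uint32_t mask[NM];       /* bit u of mask[m] = 1 iff user u rated movie m */
double UF[NU], MF[NM];

int main(void) {
    /* toy data: movie m is rated by users m..m+4, with rating ((u+m)%5)+1 */
    for (int m = 0; m < NM; m++)
        for (int u = m; u < m + 5; u++) {
            mask[m] |= 1u << u;
            r[m][u] = (u + m) % 5 + 1;
        }
    for (int u = 0; u < NU; u++) UF[u] = 1;
    for (int m = 0; m < NM; m++) MF[m] = 1;

    double Lrate = 0.001;
    for (int round = 0; round < 2000; round++) {
        double dU[NU] = {0}, dM[NM] = {0};
        for (int m = 0; m < NM; m++)
            for (uint32_t b = mask[m]; b; b &= b - 1) {  /* set bits only */
                int u = __builtin_ctz(b);                /* lowest set bit */
                double e = r[m][u] - UF[u] * MF[m];      /* step 1: errors */
                dU[u] += e * MF[m];
                dM[m] += e * UF[u];
            }
        for (int u = 0; u < NU; u++) UF[u] += Lrate * dU[u];  /* step 2 */
        for (int m = 0; m < NM; m++) MF[m] += Lrate * dM[m];
    }
    double se = 0; int n = 0;
    for (int m = 0; m < NM; m++)
        for (uint32_t b = mask[m]; b; b &= b - 1) {
            int u = __builtin_ctz(b);
            double e = r[m][u] - UF[u] * MF[m];
            se += e * e; n++;
        }
    printf("mse after 2000 rounds = %.5f\n", se / n);
    return 0;
}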
Comparison of pTree calculating (HPVD) versus VPHD. HPVD: using pTree calculations of errors (~900 iterations). [Table: feature vector a-t and square error; L=.01.] VPHD: using horizontal-data calculations of errors (~900 iterations). [Table: feature vector a-t and square error.] There should be no difference. The difference is because of the way I set things up in the program (additional steps in HPVD, with more approximations).
Simon Funk: Netflix provided a database of 100M ratings (1 to 5) of 17K movies by 500K users, each given as a triplet of numbers: (User, Movie, Rating). The challenge: for (User, Movie, ?) not in the database, predict how the given User would rate the given Movie.

Think of the data as a big sparsely filled matrix, with userIDs across the top and movieIDs down the side (or vice versa, then transpose everything), where each cell contains an observed rating (1-5) for that movie (row) by that user (column), or is blank, meaning you don't know. This matrix would have 8.5B entries, but you are only given values for 1/85th of those 8.5B cells (or 100M of them). The rest are all blank. Netflix posed a "quiz": a bunch of question marks plopped into previously blank slots, and your job is to fill in best-guess ratings in their place. Squared error (se) measures accuracy (you guess 1.5, actual is 2, you get docked (2-1.5)^2 = .25). They use root mean squared error (rmse), but if we minimize mse, we minimize rmse. There is a date for ratings and question marks (so a cell can potentially have >= 1 rating in it).

Any movie can be described in terms of some features (or aspects) such as quality, action, comedy, stars (e.g., Pitt), producer, etc. A user's preferences can be described in terms of how they rate the same features (quality/action/comedy/star/producer/etc.). Then ratings ought to be explainable by a lot less than 8.5 billion numbers (e.g., a single number specifying how much action a particular movie has may help explain why a few million action-buffs like that movie).

SVD: Assume 40 features. A movie, m, is described by mF[40] = how much that movie exemplifies each aspect. A user, u, is described by uF[40] = how much he likes each aspect. Then P_{u,m} = uF o mF and err_{u,m} = P_{u,m} - r_{u,m}, and the update is ua += lrate*(eps_{u,i} * ia^T - K*ua), where eps_{u,i} = p_{u,i} - r_{u,i}, r_{u,i} = the actual rating, and K*ua is the regularization term discussed two slides on.

SVD is a trick which finds U^T, M which minimize mse(k) (one k at a time). So the rank-40 SVD of the 8.5B Training matrix is the best (least-error) approximation we can get within the limits of our user-movie-rating model. I.e., the SVD has found the "best" feature generalizations. [Diagram: U^T (users u_1..u_500K by features a_1..a_40; row u is uF) composed with M (features a_1..a_40 by movies m_1..m_17K; column m is mF) = P (users by movies).]

To get the SVD matrixes we take the gradient of mse(k) and follow it. This has a bonus: we can ignore the unknown error on the 8.4B empty slots. Take the gradient of mse(k) (just the given values, not the empties), one k at a time:

userValue[user] += lrate*err*movieValue[movie];
movieValue[movie] += lrate*err*userValue[user];

More correctly (cache the user value, so the movie update uses the pre-update value):

uv = userValue[user];
userValue[user] += err * movieValue[movie];
movieValue[movie] += err * uv;

This finds the most prominent feature remaining (the one that most reduces error). When it's good, shift it onto the done features and start a new one (cache the residuals of the 100M. "What does that mean for us???"). This gradient descent has no local minima, which means it doesn't really matter how it's initialized. With horizontal data, the code is evaluated for each rating. So, to train for one sample:

real *userValue = userFeature[featureBeingTrained];
real *movieValue = movieFeature[featureBeingTrained];
real lrate = 0.001;

The derivation:

P_{u,m} = \sum_{k=1..40} uF_k mF_k

mse = (1/8.5B) \sum_{m=1..17K; u=1..500K} ( \sum_{k=1..40} uF_k mF_k - r_{u,m} )^2

\partial mse / \partial uF_h = (2/8.5B) \sum_{m,u} (err_{u,m}) \partial( \sum_k uF_k mF_k - r_{u,m} ) / \partial uF_h = (2/8.5B) \sum_{m,u} (err_{u,m}) mF_h

\partial mse / \partial mF_h = (2/8.5B) \sum_{m,u} (err_{u,m}) uF_h

So we increment each uF_h += 2 err_{u,m} mF_h and each mF_h += 2 err_{u,m} uF_h. This is a big move and may overshoot the minimum, so the 2 is replaced by a smaller learning rate, lrate (e.g., Funk takes lrate = 0.001).
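The fragments above aren't runnable on their own, so here is a minimal self-contained version of the per-sample update just derived (one feature; the triplet data, array sizes, and initial values are my own toy choices, not Netflix data):

#include <stdio.h>

typedef double real;

#define NTRIPLES 6
int  user[NTRIPLES]   = {0, 0, 1, 1, 2, 2};
int  movie[NTRIPLES]  = {0, 1, 0, 2, 1, 2};
real rating[NTRIPLES] = {5, 3, 4, 1, 2, 5};

real userValue[3]  = {0.1, 0.1, 0.1};
real movieValue[3] = {0.1, 0.1, 0.1};

int main(void) {
    real lrate = 0.001;
    for (int epoch = 0; epoch < 2000; epoch++)
        for (int i = 0; i < NTRIPLES; i++) {
            int  u = user[i], m = movie[i];
            real err = rating[i] - userValue[u] * movieValue[m];
            real uv  = userValue[u];              /* cache before updating */
            userValue[u]  += lrate * err * movieValue[m];
            movieValue[m] += lrate * err * uv;    /* uses pre-update value */
        }
    for (int i = 0; i < NTRIPLES; i++)
        printf("r=%g  p=%.3f\n", rating[i],
               userValue[user[i]] * movieValue[movie[i]]);
    return 0;
}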
Moving on: 20M free params is a lot for a 100M TrainSet. It seems neat to just ignore all blanks, but we have expectations about them. As-is, this modified SVD algorithm tends to make a mess of sparsely observed movies or users. Suppose a user has rated only 1 movie, say American Beauty = 2 while the avg is 4.5, and further their offset is only -1; then, prior to SVD, we'd expect them to rate it 3.5. So the error given to the SVD is -1.5 (the true rating is 1.5 less than we expect). Meanwhile m(Action) is training up to measure the amount of Action, say .01 for American Beauty (just slightly more than avg). SVD optimizes predictions, which it can do here by eventually setting our user's preference for Action to a huge negative value. I.e., the algorithm naively looks at the only example it has of this user's preferences and, in the context of the one feature it knows about so far (Action), determines that our user so hates action movies that even the tiniest bit of action in American Beauty makes it suck a lot more than it otherwise might. This is not a problem for users we have lots of observations for, because those random apparent correlations average out and the true trends dominate.

We need to account for priors. As with the average movie ratings, we should blend our sparse observations in with some sort of prior, but it's a little less clear how to do that with this incremental algorithm. But if you look at where the incremental algorithm theoretically converges, you get:

userValue[user] = [sum residual[user,movie]*movieValue[movie]] / [sum (movieValue[movie]^2)]

The numerator there will fall in a roughly zero-mean Gaussian distribution when charted over all users, which, through various gyrations, leads to regularizing the denominator:

userValue[user] = [sum residual[user,movie]*movieValue[movie]] / [sum (movieValue[movie]^2 + K)]

And finally back to the incremental form:

userValue[user] += lrate * (err * movieValue[movie] - K * userValue[user]);
movieValue[movie] += lrate * (err * userValue[user] - K * movieValue[movie]);

This is equivalent to penalizing the magnitude of the features. It cuts overfitting, allowing the use of more features.

If m appears only once, with r(m,u)=1 say, is AvgRating(m)=1? Probably not! View r(m,u)=1 as a draw from a true probability distribution whose avg you want. View that true avg as itself a draw from a probability distribution of avgs: the histogram of avg movie ratings. Assume both dists are Gaussian; then the best-guess mean is a linear combo of the observed mean and the apriori mean, with a blending ratio equal to the ratio of variances. If Ra and Va are the mean and variance (squared standard deviation) of all of the movies' average ratings (which defines your prior expectation for a new movie's average rating before you've observed any actual ratings) and Vb is the average variance of individual movie ratings (which tells you how indicative each new observation is of the true mean; e.g., if the average variance is low, then ratings tend to be near the movie's true mean, whereas if the avg variance is high, ratings tend to be more random and less indicative), then:

BogusMean = sum(ObservedRatings)/count(ObservedRatings)
K = Vb/Va
BetterMean = [GlobalAverage*K + sum(ObservedRatings)] / [K + count(ObservedRatings)]

The point here is simply that any time you're averaging a small number of examples, the true average is most likely nearer the apriori average than the sparsely observed average. Note that if the number of observed ratings for a particular movie is zero, the BetterMean (best guess) above defaults to the global average movie rating, as one would expect.
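The BetterMean blend as a function (Va, Vb, and the sample numbers are invented for illustration):

#include <stdio.h>

double better_mean(double global_avg, double K,
                   const double *ratings, int count) {
    double sum = 0;
    for (int i = 0; i < count; i++) sum += ratings[i];
    /* with count == 0 this correctly defaults to the global average */
    return (global_avg * K + sum) / (K + count);
}

int main(void) {
    double Va = 0.5, Vb = 1.2;     /* variance of avgs; avg rating variance */
    double K = Vb / Va;
    double one_rating[] = {1.0};   /* a movie observed only once, rated 1 */
    printf("BetterMean = %.3f\n", better_mean(3.6, K, one_rating, 1));
    printf("No ratings = %.3f\n", better_mean(3.6, K, one_rating, 0));
    return 0;
}

With these numbers, K = 2.4 and BetterMean = (3.6*2.4 + 1)/(2.4 + 1) = 2.835: the single observed 1 is pulled most of the way back toward the prior, exactly the point being made above.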
Refinements: Prior to starting SVD, compute AvgRating(movie) for every movie and AvgOffset(UserRating, MovieAvgRating) for every user. I.e.:

static inline real predictRating_Baseline(int movie, int user)
{ return averageRating[movie] + averageOffset[user]; }

That's the return of predictRating before the 1st SVD feature starts training. (You'd think a movie's average rating would just be its average rating, but the BetterMean blend above says otherwise!)

Two more refinements: 1. Clip the prediction to 1-5 after each component is added. 2. Introduce some functional non-linearity such as a sigmoid, i.e., G(x) = sigmoid(x).

Moving on: Despite the regularization term in the final incremental law above, overfitting remains a problem. Plotting the progress over time, the probe rmse eventually turns upward and starts getting worse (even though the training error is still inching down). We found that simply choosing a fixed number of training epochs appropriate to the learning rate and regularization constant gives the best overall performance. [Plots omitted: probe and training rmse for the first few features, with and without the regularization-term "decay" enabled; then just the probe rmse further along, where you can see the regularized version pulling ahead.]

Moving on: Add non-linear outputs, s.t. instead of predicting with

sum (userFeature[f][user] * movieFeature[f][movie]) for f from 1 to 40

we can use

sum G(userFeature[f][user] * movieFeature[f][movie]) for f from 1 to 40.
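A sketch of the two refinements together: clip the running prediction to [1,5] as each feature's contribution is added, and pass each contribution through a non-linearity G. The slides don't pin G down, so the centered logistic below (G(0)=0, so an all-zero feature contributes nothing) is just one plausible choice; the baseline and feature values are toy numbers.

#include <stdio.h>
#include <math.h>

#define NF 40

double G(double x) { return 2.0 / (1.0 + exp(-x)) - 1.0; }  /* assumed sigmoid */

double clip(double x) { return x < 1 ? 1 : (x > 5 ? 5 : x); }

double predict(double baseline, const double uF[NF], const double mF[NF]) {
    double p = baseline;
    for (int f = 0; f < NF; f++)
        p = clip(p + G(uF[f] * mF[f]));   /* clip after each component */
    return p;
}

int main(void) {
    double uF[NF], mF[NF];
    for (int f = 0; f < NF; f++) { uF[f] = 0.02; mF[f] = 0.02; }
    printf("prediction = %.3f\n", predict(3.6, uF, mF));  /* link with -lm */
    return 0;
}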