10_16_10 The Netflix Program

mpp-mpred.C reads PROBE and loops through the pairs (M_i, ProbeSup(M_i)), passing each to mpp-user.C. mpp-mpred.C can call separate instances of mpp-user.C for many users U in parallel (governed by the number of slots).

mpp-user.C loops through ProbeSup(M), reads the config file, and prints prediction(M,U) to predictions. For user votes, mpp-user.C calls user-vote.C. For movie votes, mpp-user.C calls movie-vote.C.

user-vote.C prunes, then loops through the user voters, V, calculating a V-vote for each. It combines the V-votes and returns vote. movie-vote.C is similar.

Program flow (from the slide diagram, which shows mpp-mpred.C, mpp-user.C, user-vote.C, movie-vote.C and prune.C): mpp-mpred.C passes (M_i, ProbeSup(M_i) = {U_i1, …, U_ik}) to mpp-user.C. For each U_ik in ProbeSup(M_i), mpp-user.C passes (M_i, Sup(M_i), U_ik, Sup(U_ik)) to user-vote.C, which returns vote(M_i, U_ik), and to movie-vote.C, which returns VOTE(M_i, U_ik). From the user vote and the movie VOTE it writes Predict(M_i, U_ik) to predictions.

We must loop through the V's (VPHD rather than HPVD) because the HP required by most correlation calculations is impossible using AND/OR/COMP.

Netflix Classification: use the RentsTrainingTable, Rents(MID,UID,Rating,Date), and the class label Rating to classify new (MID,UID,Date) tuples (i.e., predict ratings).

Nearest Neighbor User Voting: uid votes on rating(MID,UID) if it is near enough to UID in its ratings of the movies M = {mid_1, ..., mid_k} (i.e., near is based on a User-User correlation over M). Open choices: the User-User correlation (Pearson, Cosine?) and the set M = {mid_1, …, mid_k}.

Nearest Neighbor Movie Voting: mid votes on rating(MID,UID) if its ratings by the users U = {uid_1, ..., uid_k} are near enough to those of MID (i.e., near is based on a Movie-Movie correlation over U). Open choices: the Movie-Movie correlation (Pearson or Cosine or ?) and the set U = {uid_1, …, uid_k}.

Today we will take a close look at the data mining algorithms in movie-vote.C (first the Nearest Neighbor Classification code, then the ARM code, then ???). Similar (dual) code either exists or will exist in user-vote.C.
The file movie-vote-full.C contains the ARM attempts, the Boundary-based attempts and the Nearest Neighbor Classification attempts. The file movie-vote-justNN.C contains only the NN attempts (so we will start with that). A long-term goal: generalize the code away from the Netflix problem and toward a generic data mining system (e.g., for use by the Treeminer Corp. on, say, satellite imagery?).
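The Nearest Neighbor Movie Voting described on this slide can be sketched roughly as follows. This is a minimal Python illustration, not the code in movie-vote-justNN.C: the function names pearson and movie_vote, the nested-dictionary ratings layout, the 0.5 threshold, and the correlation-weighted average used to combine the votes are all assumptions made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length rating vectors (0.0 if undefined)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def movie_vote(ratings, mid, uid, threshold=0.5):
    """Predict rating(mid, uid) from movie voters N that uid has rated.

    ratings: {movie_id: {user_id: rating}}  (hypothetical layout)
    A voter N is kept only if its Movie-Movie correlation with mid,
    computed over the users who rated both movies, meets the threshold.
    Combining by correlation-weighted average is an assumption.
    """
    sup_m = ratings.get(mid, {})
    num = 0.0
    den = 0.0
    for n, sup_n in ratings.items():
        if n == mid or uid not in sup_n:
            continue  # N must be a different movie that uid rated
        cosup = [u for u in sup_m if u in sup_n and u != uid]
        corr = pearson([sup_m[u] for u in cosup], [sup_n[u] for u in cosup])
        if corr < threshold:
            continue  # internal pruning: voter is not near enough
        num += corr * sup_n[uid]
        den += corr
    return num / den if den else None
```

User voting would be the dual: swap the roles of movies and users and correlate over the set of co-rated movies instead.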
How does one specify prunings?

In a file (named config) there is a section for specifying the parameters for user voting and a separate section for specifying the parameters for movie voting. E.g., for movie voting, at the bottom, there are 3 external prunings possible (0 or more can be chosen):
1. an initial pruning of the dimensions to be used (since the dimensions are users, it prunes supM);
2. a pruning of the movie voters, N (in supU);
3. a final pruning of the dimensions (CoSupport(M,N)) for the specific movie voter, N.

Finally, note that internal to user-vote and movie-vote are "internal prunings", in which voters are rejected (during their loop pass) if they fail to meet certain correlation levels. This type of internal pruning is somewhat redundant with the external prunings.

E.g., the parameters for the final prune are specified as follows:

[movie_voting Prune_Users_in_CoSupMN]
method = UserCommonCoSupportPrune
leftside = 0
width = 8000
mstrt = 0
mstrt_mult = 0.0
ustrt = 0
ustrt_mult = 0.0
TSa = -100
TSb = -100
Tdvp = -1
Tdvs = -1
Tvdp = -1
Tvds = -1
TD = -1
TP = -1
PPm =.1
TV = -1
TSD = -1
Ch = 1
Ct = 2

The slide annotates the parameters in order:

method: specifies the type of prune. There are 3 types: UserPrune, with a full range of possibilities; UserFastPrune, with just PearsonCorrelation pruning; and CommonCoSupportPrune, which orders users, V, according to the size of their CommonCoSupport with U only (note that this is a correlation of sorts too).
leftside: the left side (from Uid) of an ID interval prune of supM.
width: the width of an ID interval prune of supM.
mstrt, mstrt_mult: the starting movie (intercept and slope) for the N loop.
ustrt, ustrt_mult: the starting movie (intercept and slope) for the V loop.
TSa: PearsonCorr threshold (a = Amal, meaning: use Amal's table lookup).
TSb: PearsonCorr threshold (b = bill, meaning: use bill's formula; note that after prior pruning this will have a different value than Amal's).
Tdvp: threshold for the "diff of vectors" population-based std_dev prune.
Tdvs: threshold for the "diff of vectors" sample-based std_dev prune.
Tvdp: threshold for the "vector of diffs" population-based std_dev prune.
Tvds: threshold for the "vector of diffs" sample-based std_dev prune.
TD: threshold for the (Gaussian of) Euclidean distance based prune.
TP: threshold for the (Gaussian of) 1perpendicular distance prune.
PPm: exponent for the (Gaussian of) 1perpendicular distance prune.
TV: threshold for the (Gaussian of) variation based prune.
TSD: threshold for the std_dev based prune.
Ch: picks the ordering for the count-based prune below: 1 = Amal_Pearson, 2 = Bill_Pearson, etc.
Ct: threshold for the count-based prune.

Note: all thresholds are for similarities, not distances; i.e., when we start with a distance we follow it with the Gaussian to make it a similarity or correlation.
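As a rough illustration of how such a config section and the distance-to-similarity convention might fit together, here is a minimal Python sketch. It is not the program's actual reader: the helper names load_prune_params, gaussian_similarity and passes are hypothetical, the Gaussian width sigma is an assumed parameter the slides do not give, and only a few of the parameters are included. It also assumes that a threshold of -1 works as "off" simply because a Gaussian similarity always lies in (0, 1].

```python
import configparser
from math import exp

# A shortened copy of the section shown on the slide.
CONFIG_TEXT = """
[movie_voting Prune_Users_in_CoSupMN]
method = UserCommonCoSupportPrune
leftside = 0
width = 8000
TD = -1
PPm =.1
Ct = 2
"""


def load_prune_params(text, section):
    """Read one pruning section; numeric values become floats."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep names such as TD and PPm case-sensitive
    parser.read_string(text)
    params = {}
    for key, value in parser[section].items():
        try:
            params[key] = float(value)
        except ValueError:
            params[key] = value  # non-numeric, e.g. the method name
    return params


def gaussian_similarity(xs, ys, sigma=1.0):
    """Turn a Euclidean distance into a similarity in (0, 1].

    sigma is an assumption; the slides only say a Gaussian is applied.
    """
    d2 = sum((x - y) ** 2 for x, y in zip(xs, ys))
    return exp(-d2 / (2.0 * sigma ** 2))


def passes(similarity, threshold):
    """A voter survives the prune if its similarity meets the threshold."""
    return similarity >= threshold
```

With TD = -1, every voter passes the Euclidean-distance prune under this reading, since the similarity can never fall below 0; raising TD toward 1 keeps only voters whose rating vectors are nearly identical.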