Slide 1: A fast algorithm for learning large scale preference relations. Vikas C. Raykar and Ramani Duraiswami, University of Maryland, College Park; Balaji Krishnapuram, Siemens Medical Solutions USA. AISTATS 2007.
Slide 2: Learning. Many learning tasks can be viewed as function estimation.
Slide 3: Learning from examples. (Figure: training data fed to a learning algorithm.) Not all supervised learning procedures fit the standard classification/regression framework. This talk is mainly concerned with ranking/ordering.
Slide 4: Ranking/Ordering. For some applications ordering is more important. Example 1: information retrieval. Sort in the order of relevance.
Slide 5: Ranking/Ordering. Example 2: recommender systems. Sort in the order of preference.
Slide 6: Ranking/Ordering. Example 3: medical decision making. Decide among different treatment options.
Slide 7: Plan of the talk. Ranking formulation; Algorithm; Fast algorithm; Results.
Slide 8: Preference relations. Given a preference relation we can order/rank a set of instances. Goal: learn a preference relation. Training data: a set of pairwise preferences.
Slide 9: Ranking function. Goal: learn a preference relation. New goal: learn a ranking function, which provides a numerical score (it is not unique). Why not use a classifier/ordinal regressor as the ranking function?
Slide 10: Why is ranking different? (Figure: the learning algorithm trains on pairwise preference relations, and the loss is measured by pairwise disagreements.)
Slide 11: Training data, more formally. From these two we can derive a set of pairwise preference relations.
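The derivation of pairwise preferences from labeled data can be sketched as follows (a minimal illustration; the function name and the convention that a larger label means "preferred" are assumptions, not from the slides):

```python
def preference_pairs(labels):
    # Derive pairwise preference relations from ordinal labels.
    # A pair (i, j) means instance i should be ranked above instance j.
    return [(i, j)
            for i, yi in enumerate(labels)
            for j, yj in enumerate(labels)
            if yi > yj]
```

Each pair becomes one training constraint, so the number of constraints grows quadratically with the number of instances.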
Slide 12: Loss function: the generalized Wilcoxon-Mann-Whitney (WMW) statistic. Minimize the fraction of pairwise disagreements, i.e., maximize the fraction of pairwise agreements: WMW = (total # of pairwise agreements) / (total # of pairwise preference relations).
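The WMW statistic as defined above can be computed directly (a minimal sketch; the pair convention "(i, j) means i should rank above j" is an assumption):

```python
def wmw_statistic(scores, prefs):
    # Generalized WMW: fraction of pairwise preference relations (i, j),
    # meaning i should rank above j, on which the scores agree.
    agree = sum(1 for i, j in prefs if scores[i] > scores[j])
    return agree / len(prefs)
```

A perfect ranking gives 1.0, a fully reversed one gives 0.0.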
Slide 13: Consider a two-class problem. (Figure: positive and negative instances.)
Slide 14: Function class: linear ranking functions. Different algorithms use different function classes: RankNet uses a neural network, RankSVM an RKHS, RankBoost boosted decision stumps.
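A linear ranking function simply scores each instance with an inner product and sorts by the score (a minimal sketch; the function name is an assumption):

```python
def linear_rank(w, X):
    # Score each instance with f(x) = w . x and return the scores
    # together with the indices sorted in decreasing order of score.
    scores = [sum(wk * xk for wk, xk in zip(w, x)) for x in X]
    order = sorted(range(len(X)), key=lambda i: -scores[i])
    return scores, order
```

Note the non-uniqueness mentioned on slide 9: any strictly increasing transform of the scores induces the same ordering.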
Slide 15: Plan of the talk.
Ranking formulation
– Training data: pairwise preference relations
– Ideal loss function: WMW statistic
– Function class: linear ranking functions
Algorithm
Fast algorithm
Results
Slide 16: The likelihood. Maximizing the WMW directly is a discrete optimization problem; instead use a sigmoid likelihood [Burges et al.]. Assumption: every pair is drawn independently. Choose w to maximize the log-likelihood.
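Under the independence assumption, the log-likelihood is a sum of log-sigmoids over the preferred pairs. A minimal sketch for a linear model (the function name is an assumption):

```python
import math

def pairwise_log_likelihood(w, X, prefs):
    # log L(w) = sum over preferred pairs (i, j) of log sigmoid(w . (x_i - x_j))
    ll = 0.0
    for i, j in prefs:
        z = sum(wk * (a - b) for wk, a, b in zip(w, X[i], X[j]))
        # log sigmoid(z) = min(z, 0) - log(1 + exp(-|z|)), computed stably
        ll += min(z, 0.0) - math.log1p(math.exp(-abs(z)))
    return ll
```

The stable form avoids overflow in exp for large negative z.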
Slide 17: The MAP estimator.
Slide 18: Another interpretation. What we want to maximize: the 0-1 indicator function. What we actually maximize: the log-sigmoid. The log-sigmoid is a lower bound for the indicator function.
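One common way to make the lower-bound statement concrete uses base-2 logs (the exact scaling constant is an assumption; the slide states only the qualitative bound): 1 + log2 sigmoid(z) <= indicator(z > 0) for all z, with equality at z = 0.

```python
import math

def log2_sigmoid(z):
    # log2(sigmoid(z)), computed stably
    return (min(z, 0.0) - math.log1p(math.exp(-abs(z)))) / math.log(2.0)

def indicator(z):
    # 0-1 indicator of a correctly ordered pair
    return 1.0 if z > 0 else 0.0
```

Since sigmoid(0) = 1/2, the bound touches the indicator exactly at z = 0 and falls away on either side.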
Slide 19: Lower bounding the WMW. Log-likelihood <= WMW.
Slide 20: Gradient-based learning. Use a nonlinear conjugate-gradient algorithm. It requires only gradient evaluations: no function evaluations and no second derivatives. The gradient is given by the expression on the slide.
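For the linear model, the gradient of the pairwise log-likelihood follows from d/dz log sigmoid(z) = 1 - sigmoid(z). A minimal sketch (function names are assumptions):

```python
import math

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_likelihood_gradient(w, X, prefs):
    # d/dw of sum log sigmoid(w . (x_i - x_j))
    #   = sum (1 - sigmoid(z_ij)) * (x_i - x_j),  with z_ij = w . (x_i - x_j)
    g = [0.0] * len(w)
    for i, j in prefs:
        d = [a - b for a, b in zip(X[i], X[j])]
        z = sum(wk * dk for wk, dk in zip(w, d))
        c = 1.0 - sigmoid(z)
        for k, dk in enumerate(d):
            g[k] += c * dk
    return g
```

Since the sum runs over all preference pairs, a direct evaluation is quadratic in the number of instances, which is exactly what the fast algorithm of the later slides attacks.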
Slide 21: RankNet. Pairwise preference relations; cross-entropy loss; neural net; trained by backpropagation.
Slide 22: RankSVM. Pairwise preference relations; pairwise disagreements; RKHS; SVM.
Slide 23: RankBoost. Pairwise preference relations; pairwise disagreements; decision stumps; boosting.
Slide 24: Plan of the talk.
Ranking formulation
– Training data: pairwise preference relations
– Loss function: WMW statistic
– Function class: linear ranking functions
Algorithm
– Maximize a lower bound on the WMW
– Use conjugate gradient
– Quadratic complexity
Fast algorithm
Results
Slide 25: Key idea. Use an approximate gradient that can be computed in linear time. The method converges to the same solution, though it may require a few more iterations.
Slide 26: Core computational primitive: weighted summation of erfc functions.
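The primitive can be stated as evaluating G(y_i) = sum_j q_j erfc(y_i - x_j) at M points y given N sources x. The direct evaluation is O(MN) (a minimal sketch; any scaling of the erfc argument is folded into x and y here):

```python
import math

def weighted_erfc_sum(x, q, y):
    # Direct O(M*N) evaluation of G(y_i) = sum_j q_j * erfc(y_i - x_j).
    return [sum(qj * math.erfc(yi - xj) for qj, xj in zip(q, x))
            for yi in y]
```

This quadratic cost is what the series-expansion tricks on the next slides reduce to linear time.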
Slide 27: Notion of approximation.
Slide 28: Example.
Slide 29: 1. Beaulieu's series expansion. Retain only the first few terms contributing to the desired accuracy; derive error bounds to choose the number of terms.
Slide 30: 2. Error bounds.
Slide 31: 3. Use the truncated series.
Slide 32: 4. Regrouping. The regrouped sums do not depend on y and can be computed in O(pN); once A and B are precomputed, evaluation at all M points costs O(pM). The total cost is reduced from O(MN) to O(p(M+N)).
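The regrouping can be sketched under a truncated Fourier-type expansion of erfc (the specific series form and the default h and p below are assumptions, chosen so the expansion is accurate when every difference y_i - x_j stays well inside (-pi/(2h), pi/(2h))):

```python
import math

def fast_erfc_sum(x, q, y, h=0.3, p=8):
    # Truncated expansion (assumed form):
    #   erfc(z) ~ 1 - (4/pi) * sum_{n odd, n <= 2p-1} exp(-n^2 h^2)/n * sin(2 n h z)
    # With z = y_i - x_j, the angle-addition identity
    #   sin(2nh(y - x)) = sin(2nh y) cos(2nh x) - cos(2nh y) sin(2nh x)
    # lets all x-dependent sums be precomputed once.
    odd = [2 * k - 1 for k in range(1, p + 1)]
    Q = sum(q)
    # A_n and B_n do not depend on y: O(pN) precomputation.
    A = [sum(qj * math.cos(2 * n * h * xj) for qj, xj in zip(q, x)) for n in odd]
    B = [sum(qj * math.sin(2 * n * h * xj) for qj, xj in zip(q, x)) for n in odd]
    out = []
    for yi in y:  # O(pM) evaluation once A and B are precomputed
        s = 0.0
        for n, An, Bn in zip(odd, A, B):
            c = math.exp(-n * n * h * h) / n
            s += c * (math.sin(2 * n * h * yi) * An - math.cos(2 * n * h * yi) * Bn)
        out.append(Q - 4.0 / math.pi * s)
    return out
```

The total cost is O(p(N + M)) instead of O(NM), at the price of a controlled approximation error set by h and p.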
Slide 33: 5. Other tricks. Rapid saturation of the erfc function; space subdivision; choosing the parameters to achieve the error bound. See the technical report.
Slide 34: Numerical experiments.
Slide 35: Precision vs. speedup.
Slide 36: Plan of the talk.
Ranking formulation
– Training data: pairwise preference relations
– Loss function: WMW statistic
– Function class: linear ranking functions
Algorithm
– Maximize a lower bound on the WMW
– Use conjugate gradient
– Quadratic complexity
Fast algorithm
– Use a fast approximate gradient
– Fast summation of erfc functions
Results
Slide 37: Datasets. 12 public benchmark datasets; five-fold cross-validation experiments; CG tolerance 1e-3; accuracy for the gradient computation 1e-6.
Slide 38: Direct vs. fast: WMW statistic.
Dataset 1: Direct 0.536, Fast 0.534
Dataset 2: 0.917
Dataset 3: 0.623
Dataset 4*: 0.979
WMW is similar for both the exact and the fast approximate version.
Slide 39: Direct vs. fast: time taken.
Dataset   Direct       Fast
1         1736 secs    2 secs
2         6731 secs    19 secs
3         2557 secs    4 secs
4*                     47 secs
Slide 40: Effect of gradient approximation.
Slide 41: Comparison with other methods: RankNet (neural network), RankSVM (SVM), RankBoost (boosting).
Slide 42: Comparison with other methods. WMW is similar for all the methods. The proposed method is faster than all the others; the next best time is achieved by RankBoost. Only the proposed method can handle large datasets.
Slide 43: Sample result. Dataset 8 (N=950, d=10, S=5).
Method              Time taken (secs)   WMW
RankNCG direct      333                 0.984
RankNCG fast        3                   0.984
RankNet linear      1264                0.951
RankNet two layer   2464                0.765
RankSVM linear      34                  0.984
RankSVM quadratic   1332                0.996
RankBoost           6                   0.958
Slide 44: Sample result. Dataset 11 (N=4177, d=9, S=3).
Method              Time taken (secs)   WMW
RankNCG direct      1736                0.536
RankNCG fast        2                   0.534
RankNet linear
RankNet two layer
RankSVM linear
RankSVM quadratic
RankBoost           63                  0.535
Slide 45: Application to collaborative filtering. Predict movie ratings for a user based on the ratings provided by other users. MovieLens dataset (www.grouplens.org): 1 million ratings (1-5), 3592 movies, 6040 users. Feature vector for each movie: the ratings provided by d other users.
Slide 46: Collaborative filtering results.
Slide 47: Collaborative filtering results (continued).
Slide 48: Plan/Conclusion of the talk.
Ranking formulation
– Training data: pairwise preference relations
– Loss function: WMW statistic
– Function class: linear ranking functions
Algorithm
– Maximize a lower bound on the WMW
– Use conjugate gradient
– Quadratic complexity
Fast algorithm
– Use a fast approximate gradient
– Fast summation of erfc functions
Results
– Similar accuracy to other methods, but much faster
Slide 49: Conclusion (continued). Future work: other applications (neural networks, probit regression). Code coming soon.
Slide 50: Conclusion (continued). Future work: other applications (neural networks, probit regression); nonlinear kernelized variants.
Slide 51: Thank you! Questions?