Announcements….
What’s left in this class?
4/17 (today): trees, matrix factorization, … I’m lecturing (also: the last assignment, due in 2 weeks, is up)
4/22 (Monday): scalable tensors. Guest lecture by Evangelos Papalexakis (student of Christos Faloutsos)
4/24 (Wed), 4/29 (Mon), 5/1 (Wed): project reports, in random order
each project: 9 min + 2 min for questions
submit slides by noon before your presentation
we understand about “future/ongoing work” at this point
it’s fine if not everyone in the group speaks, but make sure your partner’s talk is good
5/3 (Fri): project report due. I am extending this to 9am Tuesday, May 7.
Gradient Boosting and Decision Trees
(non-stochastic) Gradient Descent
Suppose you use m iterations of gradient descent to learn parameters θm: then
θm = θ0 + Δ1 + Δ2 + … + Δm
where Δj = -ηj ∇Loss(θj-1); Δ1 is the first gradient step and Δm is the m-th gradient step.
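(Aside, not from the slides: a minimal sketch of this view of gradient descent, where θm is θ0 plus a sum of gradient steps. The quadratic loss, step size, and function names are illustrative assumptions.)

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, m=100):
    """Run m gradient steps; return theta_m and the list of individual steps."""
    theta = theta0.copy()
    steps = []
    for _ in range(m):
        delta = -eta * grad(theta)    # one gradient step
        steps.append(delta)
        theta = theta + delta
    return theta, steps

# illustrative quadratic loss: L(theta) = ||theta - target||^2, grad = 2 (theta - target)
target = np.array([3.0, -1.0])
theta_m, steps = gradient_descent(lambda th: 2.0 * (th - target), theta0=np.zeros(2))

# theta_m is theta_0 plus the sum of all the gradient steps
assert np.allclose(theta_m, np.zeros(2) + np.sum(steps, axis=0))
```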
Functional Gradient Descent
Instead, let’s define the model as a sum of functions:
Ψm(x) = Ψ0(x) + Δ1(x) + … + Δm(x)
The functional gradient is how we want the function to change; each step Δm is approximately ηm times that gradient.
We can compute the desired change at each example from the gradient of log P(yi|xi; Ψm-1), which is yi - P(Y|xi; Ψm-1).
Put this together: we want to find a function Δm, and we know what value we’d like it to have on a bunch of examples… so….?
Learn the next gradient-step function Δm using a regression tree trained against the target value yi - P(Y|xi; Ψm-1), plus a line search to find η.
I.e.: the examples are (xi, ỹi) where ỹi = yi - P(Y|xi; Ψm-1), so we’re fitting regression trees to residuals.
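(Aside: a minimal sketch of that loop, assuming binary y in {0,1}, a sigmoid link, a fixed step size η instead of a line search, and scikit-learn’s DecisionTreeRegressor for the regression trees; it is an illustration, not the lecture’s exact algorithm.)

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def boost_log_loss(X, y, n_rounds=50, eta=0.1, max_depth=3):
    """Gradient boosting for binary y in {0,1}: fit trees to residuals y - P(Y|x)."""
    score = np.zeros(len(y))             # Psi_0(x) = 0 for every example
    trees = []
    for _ in range(n_rounds):
        residual = y - sigmoid(score)                  # desired change at each example
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        score += eta * tree.predict(X)                 # take a fixed-size functional gradient step
        trees.append(tree)
    return trees

def predict_proba(trees, X, eta=0.1):
    score = eta * sum(t.predict(X) for t in trees)     # sum of the learned step functions
    return sigmoid(score)
```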
Gradient Boosting Algorithm
Note: not the same as Schapire & Freund’s boosting algorithm, AdaBoost.
End result is a sum of many regression trees.
Advantages: all the advantages of regression trees (combinations of features, indifference to the scale of numeric values, …); flexibility about the loss function.
Disadvantages: the sequential nature of the boosting algorithm.
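(Aside: an off-the-shelf version of this algorithm is scikit-learn’s GradientBoostingClassifier; the data and hyperparameter values below are purely illustrative.)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# n_estimators = number of boosting rounds (regression trees),
# learning_rate = step size eta, max_depth = depth of each tree
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
clf.fit(X, y)
print(clf.score(X, y))
```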
Functional Gradient Descent
More generally, the log-likelihood above can be replaced by any loss of the previous classifier; the per-example target becomes the negative gradient of that loss, ỹi = -∂L(yi, Ψm-1(xi)) / ∂Ψm-1(xi).
Gradient boosting with arbitrary loss
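(Aside: a hedged sketch of the general recipe, where the only thing that changes with the loss is the per-example negative gradient; the function and argument names are assumptions, not from the slides.)

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, neg_gradient, n_rounds=50, eta=0.1, max_depth=3):
    """Generic gradient boosting: neg_gradient(y, F) returns -dLoss/dF per example."""
    F = np.zeros(len(y))                    # current model's scores
    trees = []
    for _ in range(n_rounds):
        target = neg_gradient(y, F)         # pseudo-residuals for this loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, target)
        F += eta * tree.predict(X)
        trees.append(tree)
    return trees

# e.g. absolute-error loss |y - F|: the negative gradient is sign(y - F)
lad_gradient = lambda y, F: np.sign(y - F)
```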
Gradient boosting with square loss
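(Aside: for square loss L(y, F) = ½ (y - F)^2 the negative gradient -∂L/∂F = y - F is the ordinary residual, so each tree is trained on the plain residuals. A self-contained sketch with made-up data:)

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# square loss L(y, F) = 0.5 * (y - F)**2, so -dL/dF = y - F:
# each round's regression tree is trained on the plain residuals.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

F, trees, eta = np.zeros(len(y)), [], 0.1
for _ in range(50):
    tree = DecisionTreeRegressor(max_depth=3).fit(X, y - F)   # fit the residuals
    F += eta * tree.predict(X)
    trees.append(tree)
```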
Gradient boosting with log loss
The per-example target is ỹi = yi - P(Y|xi; Fm-1).
The step size η can be computed with a line search (e.g.), or as a heuristically-sized step for each region of the learned tree.
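(Aside: one simple way to do the line search is a bounded scalar minimization of the log loss over η; this sketch uses scipy.optimize.minimize_scalar, and the upper bound of 10 is an arbitrary assumption. The per-region variant instead chooses a separate step for each leaf of the tree.)

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def log_loss(y, score):
    p = np.clip(sigmoid(score), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def line_search_eta(y, score, tree_pred, max_eta=10.0):
    """Pick the step size eta minimizing log loss of score + eta * tree_pred."""
    res = minimize_scalar(lambda eta: log_loss(y, score + eta * tree_pred),
                          bounds=(0.0, max_eta), method="bounded")
    return res.x
```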
Figure: a two-stage model with Pr(Z|x)=F(x) and Pr(Y|z,w)=G(w), with z fixed.
Bagging regression trees using a learning-to-rank loss function….. SIGIR 2011
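(Aside, not from the paper: bagging regression trees just means fitting each tree on a bootstrap resample and averaging the predictions; the sketch below uses plain squared error rather than a learning-to-rank loss.)

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_trees(X, y, n_trees=25, max_depth=None, seed=0):
    """Fit each tree on a bootstrap resample of the data; predict by averaging."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))   # sample with replacement
        trees.append(DecisionTreeRegressor(max_depth=max_depth).fit(X[idx], y[idx]))
    return trees

def bag_predict(trees, X):
    return np.mean([t.predict(X) for t in trees], axis=0)
```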