Web-Mining Agents: Classification with Ensemble Methods R. Möller Institute of Information Systems University of Luebeck.


1 Web-Mining Agents: Classification with Ensemble Methods R. Möller Institute of Information Systems University of Luebeck

2 This presentation is based on parts of "An Introduction to Ensemble Methods: Bagging, Boosting, Random Forests, and More" by Yisong Yue.

3 Supervised Learning. Goal: learn a predictor h(x) – with high accuracy (low error) – using training data {(x_1, y_1), …, (x_n, y_n)}.

4 Example training data (predicting Height > 55" from Age and Male?):

Person | Age | Male? | Height > 55"
Alice  | 14  | 0     | 1
Bob    | 10  | 1     | 1
Carol  | 13  | 0     | 1
Dave   |  8  | 1     | 0
Erin   | 11  | 0     | 0
Frank  |  9  | 1     | 1
Gena   |  8  | 0     | 0

[Figure] A decision tree for these data: root split on Male? (Yes/No), followed by splits on Age > 9? and Age > 10?, with 0/1 predictions at the leaves.
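As an illustration that is not part of the original slides, the following minimal sketch fits a decision tree to the toy table above; the use of scikit-learn and the chosen tree depth are assumptions.

```python
# Minimal sketch (not from the slides): fit a decision tree to the toy table above.
# Assumes scikit-learn is available; features are [Age, Male?], label is Height > 55".
from sklearn.tree import DecisionTreeClassifier

X = [[14, 0], [10, 1], [13, 0], [8, 1], [11, 0], [9, 1], [8, 0]]
y = [1, 1, 1, 0, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2)   # shallow tree, mirroring the two-level tree in the figure
tree.fit(X, y)
print(tree.predict([[12, 1]]))               # prediction for a hypothetical 12-year-old male
```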

5 Different Classifiers. Performance: no single classifier is perfect, and classifiers are complementary – examples misclassified by one classifier may be correctly classified by others. Potential improvement: exploit this complementarity. (CS 4700, Foundations of Artificial Intelligence, Carla P. Gomes)

6 Ensembles of Classifiers. Idea: combine several classifiers to improve performance. An ensemble combines the classification results of different classifiers to produce the final output, using unweighted or weighted voting (see the sketch below). (CS 4700, Foundations of Artificial Intelligence, Carla P. Gomes)
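The following is a minimal sketch of unweighted and weighted voting; it assumes the individual classifiers expose a scikit-learn-style `.predict` method returning 0/1 labels, which is not stated in the slides.

```python
# Minimal voting sketch (assumption: classifiers have .predict returning 0/1 labels).
import numpy as np

def vote(classifiers, X, weights=None):
    """Combine binary predictions by unweighted or weighted majority vote."""
    preds = np.array([clf.predict(X) for clf in classifiers])  # shape: (n_classifiers, n_samples)
    if weights is None:
        weights = np.ones(len(classifiers))                    # unweighted voting
    weights = np.asarray(weights, dtype=float)
    score = weights @ preds / weights.sum()                    # weighted fraction of classifiers voting "1"
    return (score >= 0.5).astype(int)
```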

7 Example: Weather Forecast. [Figure] Five individual forecasters each make some errors (marked X) relative to reality, but combining their forecasts by majority vote corrects most of these errors. (CS 4700, Foundations of Artificial Intelligence, Carla P. Gomes)

8 Outline Bias/Variance Tradeoff Ensemble methods that minimize variance – Bagging – Random Forests Ensemble methods that minimize bias – Functional Gradient Descent – Boosting – Ensemble Selection

9 Generalization Error. "True" distribution P(x, y) – unknown to us. Train h(x) = y using training data S = {(x_1, y_1), …, (x_n, y_n)} sampled from P(x, y). Generalization error: L(h) = E_{(x,y)~P(x,y)}[ f(h(x), y) ], e.g. with squared loss f(a, b) = (a - b)^2.
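As a small illustration not taken from the slides, the sketch below estimates L(h) by Monte Carlo on a held-out sample assumed to be drawn from P(x, y), using the squared loss named above.

```python
# Minimal sketch (assumption: X_test, y_test are drawn i.i.d. from P(x, y)).
import numpy as np

def estimate_generalization_error(h, X_test, y_test):
    """Monte Carlo estimate of L(h) = E[(h(x) - y)^2] over samples from P(x, y)."""
    preds = np.array([h(x) for x in X_test])
    return np.mean((preds - np.asarray(y_test)) ** 2)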

10 Generalization error: L(h) = E_{(x,y)~P(x,y)}[ f(h(x), y) ], comparing h(x) against y.

Training data S:
Person | Age | Male? | Height > 55"
Alice  | 14  | 0     | 1
Bob    | 10  | 1     | 1
Carol  | 13  | 0     | 1
Dave   |  8  | 1     | 0
Erin   | 11  | 0     | 0
Frank  |  9  | 1     | 1
Gena   |  8  | 0     | 0

A larger sample from P(x, y):
Person   | Age | Male? | Height > 55"
James    | 11  | 1     | 1
Jessica  | 14  | 0     | 1
Alice    | 14  | 0     | 1
Amy      | 12  | 0     | 1
Bob      | 10  | 1     | 1
Xavier   |  9  | 1     | 0
Cathy    |  9  | 0     | 1
Carol    | 13  | 0     | 1
Eugene   | 13  | 1     | 0
Rafael   | 12  | 1     | 1
Dave     |  8  | 1     | 0
Peter    |  9  | 1     | 0
Henry    | 13  | 1     | 0
Erin     | 11  | 0     | 0
Rose     |  7  | 0     | 0
Iain     |  8  | 1     | 1
Paulo    | 12  | 1     | 0
Margaret | 10  | 0     | 1
Frank    |  9  | 1     | 1
Jill     | 13  | 0     | 0
Leon     | 10  | 1     | 0
Sarah    | 12  | 0     | 0
Gena     |  8  | 0     | 0
Patrick  |  5  | 1     | 1
…

11 Bias/Variance Tradeoff. Treat h(x|S) as a random function – it depends on the training data S. L = E_S[ E_{(x,y)~P(x,y)}[ f(h(x|S), y) ] ] is the expected generalization error, taken over the randomness of S.

12 Bias/Variance Tradeoff. Squared loss: f(a, b) = (a - b)^2. Consider one data point (x, y). Notation: Z = h(x|S) - y, ž = E_S[Z], so Z - ž = h(x|S) - E_S[h(x|S)].

E_S[(Z - ž)^2] = E_S[Z^2 - 2Zž + ž^2] = E_S[Z^2] - 2 E_S[Z] ž + ž^2 = E_S[Z^2] - ž^2

Expected error: E_S[f(h(x|S), y)] = E_S[Z^2] = E_S[(Z - ž)^2] + ž^2, i.e. Variance + Bias.

Bias/variance for all (x, y) is the expectation over P(x, y); measurement noise can also be incorporated. (A similar flavor of analysis applies to other loss functions; a numerical sketch follows.)
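The sketch below estimates the two terms of the decomposition at a single test point by resampling training sets. The data-generating process (sin plus noise), the regressor, and all parameter choices are hypothetical illustrations, not part of the slides.

```python
# Minimal sketch (hypothetical P(x, y) and model): estimate bias^2 and variance of
# h(x0|S) at one test point x0 by drawing many training sets S.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def sample_S(n=50):                       # hypothetical P(x, y): y = sin(x) + noise
    x = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(x).ravel() + rng.normal(scale=0.3, size=n)
    return x, y

x0, y0 = np.array([[1.0]]), np.sin(1.0)   # fixed test point (noise-free target)
preds = []
for _ in range(200):                      # 200 independent training sets S
    X, y = sample_S()
    preds.append(DecisionTreeRegressor(max_depth=2).fit(X, y).predict(x0)[0])

Z = np.array(preds) - y0                  # Z = h(x0|S) - y0
print("bias^2   =", Z.mean() ** 2)        # ž^2
print("variance =", Z.var())              # E_S[(Z - ž)^2]
```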

13–16 [Figures] Example: data points (x, y), with predictors h(x|S) fit to different training samples S, illustrating how the learned function varies from sample to sample.

17 [Figure] For each fitted predictor, the expected error decomposes as E_S[(h(x|S) - y)^2] = E_S[(Z - ž)^2] + ž^2, with Z = h(x|S) - y and ž = E_S[Z]; the first term is the variance and the second the bias contribution.

18 Outline Bias/Variance Tradeoff Ensemble methods that minimize variance – Bagging – Random Forests Ensemble methods that minimize bias – Functional Gradient Descent – Boosting – Ensemble Selection

19 Bagging. Goal: reduce variance. Ideal setting: many training sets S', each sampled independently from P(x, y) – train a model on each S' and average the predictions. In the decomposition E_S[(h(x|S) - y)^2] = E_S[(Z - ž)^2] + ž^2 (with Z = h(x|S) - y, ž = E_S[Z]), the variance reduces linearly in the number of models while the bias is unchanged. "Bagging Predictors" [Leo Breiman, 1994], http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf

20 Bagging = Bootstrap Aggregation. Goal: reduce variance. In practice: resample S' with replacement from S – train a model on each S' and average the predictions (see the sketch below). The variance reduces sub-linearly (because the S' are correlated), and the bias often increases slightly. "Bagging Predictors" [Leo Breiman, 1994], http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf
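A minimal sketch of bagging follows; the choice of base model (a deep regression tree), the number of models, and the assumption that the inputs are NumPy arrays are mine, not the slides'.

```python
# Minimal bagging sketch: bootstrap resample S' from S, train one model per S',
# and average predictions. Assumes X_train, y_train, X_test are NumPy arrays.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, n_models=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)                       # resample S' with replacement from S
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_test))
    return np.mean(preds, axis=0)                              # averaging reduces variance
```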

21 [Figure] Bias and variance of a single decision tree vs. a bagged ensemble of decision trees: bagging markedly reduces the variance (lower is better). "An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants", Eric Bauer & Ron Kohavi, Machine Learning 36, 105–139 (1999)

22 Random Forests. Goal: reduce variance – bagging alone can only do so much, because resampling the training data asymptotes. Random Forests sample both data and features: draw a bootstrap sample S', train a decision tree on it, and at each node consider only a random subset of the features (roughly sqrt of the feature count); then average the predictions. Feature sampling further de-correlates the trees. "Random Forests – Random Features" [Leo Breiman, 1997], http://oz.berkeley.edu/~breiman/random-forests.pdf
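For reference, a minimal scikit-learn sketch of the recipe above; the toy dataset and hyperparameter values are assumptions for the sake of a runnable example.

```python
# Minimal random forest sketch (assumes scikit-learn; toy data is an assumption).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # placeholder data
forest = RandomForestClassifier(
    n_estimators=500,      # number of bagged trees
    max_features="sqrt",   # sample ~sqrt(#features) candidates at each split
    bootstrap=True,        # resample the training data for each tree
    n_jobs=-1,
).fit(X, y)
print(forest.predict(X[:5]))
```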

23 [Figure] Average performance over many datasets (higher is better): Random Forests perform the best. "An Empirical Evaluation of Supervised Learning in High Dimensions", Caruana, Karampatziakis & Yessenalina, ICML 2008

24 Structured Random Forests. Decision trees normally train on atomic labels y = 0/1; what about structured labels? One must define the information gain of structured labels. Edge detection example: the structured label is a 16x16 image patch; map structured labels into another space where entropy is well defined. "Structured Random Forests for Fast Edge Detection", Dollár & Zitnick, ICCV 2013

25 Outline Bias/Variance Tradeoff Ensemble methods that minimize variance – Bagging – Random Forests Ensemble methods that minimize bias – Functional Gradient Descent – Boosting – Ensemble Selection

26 Functional Gradient Descent. Build h(x) = h_1(x) + h_2(x) + … + h_n(x) stage-wise: train h_1 on S' = {(x, y)}, train h_2 on the residuals S' = {(x, y - h_1(x))}, …, train h_n on S' = {(x, y - h_1(x) - … - h_{n-1}(x))}. (See the sketch below.) http://statweb.stanford.edu/~jhf/ftp/trebst.pdf
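A minimal sketch of this residual-fitting scheme for squared loss follows; the base learner (a shallow regression tree), the number of rounds, and the NumPy-array inputs are assumptions, not the slides' exact formulation.

```python
# Minimal residual-boosting sketch for squared loss (assumptions: NumPy inputs,
# shallow regression trees as base learners).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_rounds=50, depth=2):
    models, residual = [], y.astype(float).copy()
    for _ in range(n_rounds):
        h = DecisionTreeRegressor(max_depth=depth).fit(X, residual)
        models.append(h)
        residual -= h.predict(X)                # S' = {(x, y - h_1(x) - ... - h_i(x))}
    return models

def predict(models, X):
    return sum(h.predict(X) for h in models)    # h(x) = h_1(x) + ... + h_n(x)
```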

27–32 Coordinate Gradient Descent (animated over several slides). Learn w so that h(x) = w^T x. Coordinate descent: initialize w = 0; choose the dimension with the highest gain; set that component of w; repeat. (A sketch follows.)
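The sketch below instantiates this procedure for squared loss with exact single-coordinate updates; the loss choice, update rule, and step count are assumptions made to turn the slide's outline into runnable code.

```python
# Minimal greedy coordinate descent sketch for squared loss, h(x) = w^T x.
import numpy as np

def coordinate_descent(X, y, n_steps=100):
    n, d = X.shape
    w = np.zeros(d)                              # init w = 0
    col_norm_sq = (X ** 2).sum(axis=0) + 1e-12   # per-column squared norms (guard against zero columns)
    for _ in range(n_steps):
        c = X.T @ (y - X @ w)                    # correlation of each column with the residual
        gains = c ** 2 / col_norm_sq             # loss reduction achievable by updating each coordinate
        j = int(np.argmax(gains))                # choose the dimension with highest gain
        w[j] += c[j] / col_norm_sq[j]            # set (update) that component of w
    return w
```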

33 Functional Gradient Descent is coordinate descent in "function space" (the space of all possible decision trees), with the weights restricted to 0, 1, 2, …: h(x) = h_1(x) + h_2(x) + … + h_n(x).

34 Boosting (AdaBoost). Build h(x) = a_1 h_1(x) + a_2 h_2(x) + … + a_n h_n(x): train h_1 on S' = {(x, y, u_1)}, h_2 on S' = {(x, y, u_2)}, …, h_n on S' = {(x, y, u_n)}, where u is a weighting on the data points and a is the weight in the linear combination. Stop when validation performance plateaus (discussed later). https://www.cs.princeton.edu/~schapire/papers/explaining-adaboost.pdf

35 Theorem: the training error drops exponentially fast. [Figure] The AdaBoost algorithm: initial distribution over the data; train a model; compute the error of the model; compute the coefficient of the model; update the distribution; output the final weighted average. (A sketch of the standard update rule follows.) https://www.cs.princeton.edu/~schapire/papers/explaining-adaboost.pdf
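The following sketch fills in the standard AdaBoost update rule corresponding to the figure's steps; the use of scikit-learn decision stumps, the labels being in {-1, +1}, and the round count are assumptions.

```python
# Minimal AdaBoost sketch with decision stumps (assumptions: NumPy arrays, y in {-1, +1}).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=50):
    n = len(y)
    u = np.full(n, 1.0 / n)                       # initial distribution over the data
    stumps, alphas = [], []
    for _ in range(n_rounds):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=u)   # train model
        pred = h.predict(X)
        eps = u[pred != y].sum()                  # error of the model under u
        if eps == 0 or eps >= 0.5:
            break
        a = 0.5 * np.log((1 - eps) / eps)         # coefficient of the model
        u *= np.exp(-a * y * pred)                # update distribution
        u /= u.sum()
        stumps.append(h); alphas.append(a)
    return stumps, alphas

def predict(stumps, alphas, X):
    return np.sign(sum(a * h.predict(X) for h, a in zip(stumps, alphas)))    # final weighted average
```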

36 [Figure] Bias and variance of a single decision tree, bagging, and AdaBoost (lower is better): boosting primarily reduces the bias. Boosting often uses weak models, e.g. "shallow" decision trees, because weak models have lower variance. "An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants", Eric Bauer & Ron Kohavi, Machine Learning 36, 105–139 (1999)

37 Ensemble Selection. Split the data S into a training set S' and a validation set V'. Let H = {2000 models trained using S'}. Maintain the ensemble as a combination of members of H, h(x) = h_1(x) + h_2(x) + … + h_n(x), and repeatedly add the model from H that maximizes performance on V' (denote it h_{n+1}, giving h(x) + h_{n+1}(x)). Models are trained on S'; the ensemble is built to optimize V'. (A sketch follows.) "Ensemble Selection from Libraries of Models", Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004
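A minimal sketch of the greedy selection loop follows; the model library, the 0/1 prediction format, accuracy as the validation metric, and the fixed number of rounds are assumptions rather than details given in the slide.

```python
# Minimal ensemble-selection sketch (assumptions: library models return 0/1
# predictions via .predict, y_val is a 0/1 NumPy array, metric is accuracy on V').
import numpy as np

def ensemble_selection(library, X_val, y_val, n_rounds=20):
    preds = [m.predict(X_val) for m in library]        # cache each model's predictions on V'
    chosen, running = [], np.zeros(len(y_val))
    for _ in range(n_rounds):
        def acc_with(i):
            avg = (running + preds[i]) / (len(chosen) + 1)   # averaged ensemble output with model i added
            return np.mean((avg >= 0.5) == y_val)            # validation accuracy
        i = max(range(len(library)), key=acc_with)     # model that maximizes performance on V'
        chosen.append(library[i])                      # models may be selected more than once
        running += preds[i]
    return chosen
```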

38 Summary of the methods:

Method | Minimize Bias? | Minimize Variance? | Other Comments
Bagging | Complex model class (deep DTs) | Bootstrap aggregation (resampling training data) | Does not work for simple models
Random Forests | Complex model class (deep DTs) | Bootstrap aggregation + bootstrapping features | Only for decision trees
Gradient Boosting (AdaBoost) | Optimize training performance | Simple model class (shallow DTs) | Determines which model to add at run-time
Ensemble Selection | Optimize validation performance | Optimize validation performance | Pre-specified dictionary of models learned on the training set

…and many other ensemble methods as well. Ensembles give state-of-the-art prediction performance – they won the Netflix Challenge and numerous KDD Cups, and are an industry standard.

39 References & Further Reading
"An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants", Bauer & Kohavi, Machine Learning 36, 105–139 (1999)
"Bagging Predictors", Leo Breiman, Tech Report #421, UC Berkeley, 1994, http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf
"An Empirical Comparison of Supervised Learning Algorithms", Caruana & Niculescu-Mizil, ICML 2006
"An Empirical Evaluation of Supervised Learning in High Dimensions", Caruana, Karampatziakis & Yessenalina, ICML 2008
"Ensemble Methods in Machine Learning", Thomas Dietterich, Multiple Classifier Systems, 2000
"Ensemble Selection from Libraries of Models", Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004
"Getting the Most Out of Ensemble Selection", Caruana, Munson & Niculescu-Mizil, ICDM 2006
"Explaining AdaBoost", Rob Schapire, https://www.cs.princeton.edu/~schapire/papers/explaining-adaboost.pdf
"Greedy Function Approximation: A Gradient Boosting Machine", Jerome Friedman, 2001, http://statweb.stanford.edu/~jhf/ftp/trebst.pdf
"Random Forests – Random Features", Leo Breiman, Tech Report #567, UC Berkeley, 1999
"Structured Random Forests for Fast Edge Detection", Dollár & Zitnick, ICCV 2013
"ABC-Boost: Adaptive Base Class Boost for Multi-class Classification", Ping Li, ICML 2009
"Additive Groves of Regression Trees", Sorokina, Caruana & Riedewald, ECML 2007, http://additivegroves.net/
"Winning the KDD Cup Orange Challenge with Ensemble Selection", Niculescu-Mizil et al., KDD 2009
"Lessons from the Netflix Prize Challenge", Bell & Koren, SIGKDD Explorations 9(2), 75–79, 2007

