1 Ensembles of Classifiers Evgueni Smirnov

2 Outline 1 Methods for Independently Constructing Ensembles 1.1 Bagging 1.2 Randomness Injection 1.3 Feature-Selection Ensembles 1.4 Error-Correcting Output Coding 2 Methods for Coordinated Construction of Ensembles 2.1 Boosting 2.2 Stacking 3 Reliable Classification 3.1 Meta-Classifier Approach 3.2 Version Spaces 4 Co-Training

3 Ensembles of Classifiers The basic idea is to learn a set of classifiers (experts) and to allow them to vote. Advantage: improvement in predictive accuracy. Disadvantage: it is difficult to understand an ensemble of classifiers.

4 Why do ensembles work? Dietterich (2002) showed that ensembles overcome three problems: The Statistical Problem arises when the hypothesis space is too large for the amount of available data. Hence, there are many hypotheses with the same accuracy on the data and the learning algorithm chooses only one of them! There is a risk that the accuracy of the chosen hypothesis is low on unseen data! The Computational Problem arises when the learning algorithm cannot guarantee finding the best hypothesis. The Representational Problem arises when the hypothesis space does not contain any good approximation of the target class(es). The statistical and computational problems result in the variance component of the error of the classifiers! The representational problem results in the bias component of the error of the classifiers!

5 Methods for Independently Constructing Ensembles One way to force a learning algorithm to construct multiple hypotheses is to run the algorithm several times and provide it with somewhat different data in each run. This idea is used in the following methods: Bagging, Randomness Injection, Feature-Selection Ensembles, and Error-Correcting Output Coding.

6 Bagging Employs the simplest way of combining predictions that belong to the same type. Combining can be realized with voting or averaging. Each model receives equal weight. “Idealized” version of bagging: –Sample several training sets of size n (instead of just having one training set of size n) –Build a classifier for each training set –Combine the classifiers’ predictions This improves performance in almost all cases if the learning scheme is unstable (e.g., decision trees)

7 Bagging classifiers
Classifier generation:
  Let n be the size of the training set.
  For each of t iterations:
    Sample n instances with replacement from the training set.
    Apply the learning algorithm to the sample.
    Store the resulting classifier.
Classification:
  For each of the t classifiers:
    Predict class of instance using classifier.
  Return class that was predicted most often.
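The pseudocode above translates almost directly into code. The sketch below is a minimal illustration, assuming NumPy arrays, integer-coded class labels, and scikit-learn decision trees as the (unstable) base learner; the function names and parameter values are placeholders, not part of the original slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, t=25, seed=0):
    """Classifier generation: t trees, each trained on a bootstrap sample of size n."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)      # sample n instances with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classification: return the class predicted most often (assumes integer labels)."""
    votes = np.array([m.predict(X) for m in models])          # shape (t, n_instances)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

With the t classifiers trained on bootstrap samples, `bagging_predict` simply returns the majority vote, exactly as in the classification step of the pseudocode.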

8 Why does bagging work? Bagging reduces variance by voting/averaging, thus reducing the overall expected error –In the case of classification there are pathological situations where the overall error might increase –Usually, the more classifiers the better

9 Randomness Injection Inject some randomization into a standard learning algorithm (usually easy): –Neural network: random initial weights –Decision tree: when splitting, choose one of the top N attributes at random (uniformly) Dietterich (2000) showed that 200 randomized trees are statistically significantly better than C4.5 across 33 datasets!
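As an illustration of the decision-tree variant mentioned above (choose one of the top N attributes at random when splitting), here is a hedged sketch of just the split-selection step; `information_gain`, `n_top`, and the discrete-attribute assumption are illustrative choices, and the surrounding tree-growing code is omitted.

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(X, y, attr):
    """Gain of splitting on a discrete attribute (column index attr)."""
    base = entropy(y)
    values, counts = np.unique(X[:, attr], return_counts=True)
    cond = sum((c / len(y)) * entropy(y[X[:, attr] == v])
               for v, c in zip(values, counts))
    return base - cond

def choose_split_attribute(X, y, n_top=3, rng=None):
    """Randomness injection: pick uniformly among the top-N attributes by gain."""
    rng = rng or np.random.default_rng()
    gains = [information_gain(X, y, a) for a in range(X.shape[1])]
    top = np.argsort(gains)[-n_top:]          # indices of the N best attributes
    return int(rng.choice(top))
```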

10 Feature-Selection Ensembles Key idea: Provide a different subset of the input features in each call of the learning algorithm. Example: Cherkauer (1996) trained an ensemble of 32 neural networks. The 32 networks were based on 8 different subsets of the 119 available features and 4 different algorithms. The ensemble was significantly better than any of the individual neural networks!
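A rough sketch in the spirit of this idea, assuming scikit-learn: each ensemble member is trained on a different random subset of the input features and the members vote. The use of MLPClassifier, 8 members, half-sized feature subsets, and integer class labels only loosely mirrors the example and is not taken from the slides.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def feature_subset_ensemble(X, y, n_members=8, subset_size=None, seed=0):
    """Train each member on a different random subset of the input features."""
    rng = np.random.default_rng(seed)
    subset_size = subset_size or X.shape[1] // 2
    members = []
    for _ in range(n_members):
        feats = rng.choice(X.shape[1], size=subset_size, replace=False)
        clf = MLPClassifier(max_iter=500).fit(X[:, feats], y)
        members.append((feats, clf))
    return members

def ensemble_predict(members, X):
    """Majority vote over members (assumes integer class labels)."""
    votes = np.array([clf.predict(X[:, feats]) for feats, clf in members])
    return np.array([np.bincount(col).argmax() for col in votes.T])
```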

11 Error-correcting output codes A very elegant method of transforming a multi-class problem into two-class problems –Simple scheme: as many binary class attributes as original classes, using one-per-class coding Idea: use error-correcting codes instead
class | class vector
a | 1000
b | 0100
c | 0010
d | 0001

12 Error-correcting output codes Example: –What’s the true class if the base classifiers predict 1011111?
class | class vector
a | 1111111
b | 0000111
c | 0011001
d | 0101010
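For the question above, the predicted string 1011111 is decoded to the class whose code word is nearest in Hamming distance: class a (distance 1, versus 3 for b and c and 5 for d). A tiny sketch of this decoding step, using the code words from the slide:

```python
codes = {"a": "1111111", "b": "0000111", "c": "0011001", "d": "0101010"}

def hamming(u, v):
    """Number of bit positions where the two code words differ."""
    return sum(a != b for a, b in zip(u, v))

def decode(predicted):
    """Return the class whose code word is closest to the predicted bit string."""
    return min(codes, key=lambda c: hamming(codes[c], predicted))

print(decode("1011111"))   # -> 'a' (distance 1; b and c are at 3, d at 5)
```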

13 Methods for Coordinated Construction of Ensembles The key idea is to learn complementary classifiers so that instance classification is realized by taking a weighted combination of the classifiers' predictions. This idea is used in two methods: Boosting and Stacking.

14 Boosting Also uses voting/averaging but models are weighted according to their performance Iterative procedure: new models are influenced by performance of previously built ones –New model is encouraged to become expert for instances classified incorrectly by earlier models –Intuitive justification: models should be experts that complement each other There are several variants of this algorithm

15 AdaBoost.M1
Classifier generation:
  Assign equal weight to each training instance.
  For each of t iterations:
    Learn a classifier from the weighted dataset.
    Compute the error e of the classifier on the weighted dataset.
    If e is equal to zero, or e is greater than or equal to 0.5:
      Terminate classifier generation.
    For each instance in the dataset:
      If the instance is classified correctly by the classifier:
        Multiply the weight of the instance by e / (1 - e).
    Normalize the weights of all instances.
Classification:
  Assign a weight of zero to all classes.
  For each of the t classifiers:
    Add -log(e / (1 - e)) to the weight of the class predicted by the classifier.
  Return the class with the highest weight.
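A compact sketch of the AdaBoost.M1 pseudocode above, assuming NumPy, scikit-learn decision stumps as the weight-aware base learner, and integer class labels. The termination rule (e = 0 or e ≥ 0.5), the weight update by e/(1−e), and the −log(e/(1−e)) class weights follow the slide; everything else is an illustrative choice.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_fit(X, y, t=20):
    n = len(X)
    w = np.full(n, 1.0 / n)                   # equal weight to each training instance
    models, betas = [], []
    for _ in range(t):
        clf = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        wrong = clf.predict(X) != y
        e = w[wrong].sum() / w.sum()
        if e == 0 or e >= 0.5:                # terminate classifier generation
            break
        beta = e / (1 - e)
        w[~wrong] *= beta                     # down-weight correctly classified instances
        w /= w.sum()                          # normalize weights
        models.append(clf)
        betas.append(beta)
    return models, betas

def adaboost_m1_predict(models, betas, X, classes):
    scores = {c: np.zeros(len(X)) for c in classes}
    for clf, beta in zip(models, betas):
        pred = clf.predict(X)
        for c in classes:                     # add -log(e/(1-e)) to the predicted class
            scores[c] += np.where(pred == c, -np.log(beta), 0.0)
    stacked = np.vstack([scores[c] for c in classes])
    return np.array(classes)[stacked.argmax(axis=0)]
```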

16 Remarks on Boosting Boosting can be applied without instance weights by re-sampling with probabilities determined by the weights; Boosting decreases the training error exponentially in the number of iterations; Boosting works well if the base classifiers are not too complex and their error doesn’t become too large too quickly! Boosting reduces the bias component of the error of simple classifiers!

17 Stacking Uses a meta learner instead of voting to combine the predictions of base learners –Predictions of the base learners (level-0 models) are used as input for the meta learner (level-1 model) The base learners are usually different learning schemes Hard to analyze theoretically: “black magic”

18 Stacking (illustration): base classifiers BC 1, BC 2, …, BC n classify training instance 1; their predictions, together with the instance’s true class, form the first meta instance.

19 Stacking (illustration): the same is done for instance 2 and the remaining training instances, building up the set of meta instances.

20 Stacking (illustration): a meta classifier is trained on the meta instances.

21 Stacking (illustration): to classify a new instance, the base classifiers’ predictions on it form a meta instance that is passed to the meta classifier.

22 More on stacking Predictions on the training data can’t be used to generate the data for the level-1 model! The reason is that the level-0 classifiers that better fit the training data would be favored by the level-1 model! Thus, a k-fold cross-validation-like scheme is employed! (Illustration for k = 3: each fold’s held-out test-set predictions supply part of the meta data.)
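A sketch of this cross-validation-like scheme, assuming scikit-learn: the level-1 training data are the out-of-fold predictions of the level-0 models, so no base learner is evaluated on data it was trained on. The particular base learners, the logistic-regression meta learner, and k are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

base_learners = [DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier()]
meta_learner = LogisticRegression(max_iter=1000)

def stacking_fit(X, y, k=3):
    # Level-1 inputs are out-of-fold predictions: no base learner sees its own training fold.
    meta_X = np.column_stack(
        [cross_val_predict(bl, X, y, cv=k) for bl in base_learners])
    meta_learner.fit(meta_X, y)
    for bl in base_learners:                  # refit level-0 models on the full training set
        bl.fit(X, y)
    return base_learners, meta_learner

def stacking_predict(X):
    meta_X = np.column_stack([bl.predict(X) for bl in base_learners])
    return meta_learner.predict(meta_X)
```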

23 More on stacking If the base learners can output probabilities, it’s better to use those as input to the meta learner Which algorithm should be used to generate the meta learner? –In principle, any learning scheme can be applied –David Wolpert: a “relatively global, smooth” model, since the base learners do most of the work and this reduces the risk of overfitting

24 Some Practical Advice If the classifier is unstable (high variance), then apply bagging! If the classifier is stable and simple (high bias), then apply boosting! If the classifier is stable and complex, then apply randomness injection! If you have many classes and a binary classifier, then try error-correcting output codes! If that does not work, then use a complex binary classifier!

25 Reliable Classification Classifiers applied in critical applications with high misclassification costs need to determine whether classifications they assign to individual instances are indeed correct. We consider two approaches that are related to ensembles of classifiers: –Meta-Classifier Approach –Version Spaces

26 The Task of Reliable Classification Given: Instance space X. Classifier space H. Class set Y. Training set D ⊆ X × Y. Find: Classifier h ∈ H, h: X → Y, that correctly classifies future, unseen instances. If h cannot classify an instance correctly, the symbol “?” is returned.

27 Meta Classifier Approach (illustration): for each training instance, the base classifier BC’s prediction is compared with the true class; the instance becomes a meta instance whose meta class is 1 if BC classified it correctly and 0 otherwise.

28 Meta Classifier Approach (illustration): a meta classifier MC is trained on these meta instances.

29 Meta Classifier Approach (combined classifier BC + MC): the classification of the base classifier BC is output only if the meta classifier MC decides that the instance is classified correctly. Theorem. The precision of the meta classifier equals the accuracy of the combined classifier on the instances it classifies.
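A hedged sketch of the combined classifier, under the assumption (not stated on the slides) that the meta classifier is trained on the base classifier's out-of-fold correctness, with the instance features plus BC's prediction as meta attributes; all learner choices are placeholders.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def fit_combined_classifier(X, y):
    base = DecisionTreeClassifier(max_depth=3)
    oof = cross_val_predict(base, X, y, cv=5)          # out-of-fold base predictions
    meta_X = np.column_stack([X, oof])                 # instance features + BC's prediction
    meta_y = (oof == y).astype(int)                    # meta class: 1 = classified correctly
    meta = DecisionTreeClassifier(max_depth=3).fit(meta_X, meta_y)
    base.fit(X, y)
    return base, meta

def reliable_predict(base, meta, X):
    """Output BC's classification only where MC predicts 'correct'; otherwise abstain."""
    pred = base.predict(X)
    ok = meta.predict(np.column_stack([X, pred])) == 1
    return np.where(ok, pred.astype(object), "?")
```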

30 Version Spaces (Mitchell, 1978) Definition 1. Given a classifier space H and training data D, the version space VS(D) is: VS(D) = {h ∈ H | cons(h, D)}, where cons(h, D) ≡ (∀(x, y) ∈ D)(y = h(x)).

31 Classification Rule of Version Spaces: the Unanimous-Voting Rule Definition 2. Given version space VS(D), an instance x ∈ X receives a classification VS(D)(x) defined as follows: VS(D)(x) = y, if VS(D) ≠ ∅ and (∀h ∈ VS(D)) y = h(x); VS(D)(x) = ?, otherwise. Definition 3. The volume V(VS(D)) of version space VS(D) is the set of all instances that are not classified by VS(D).

32 Unanimous Voting (illustration of the version space VS(D) inside the classifier space H).

33 Unanimous Voting (illustration continued).

34 Theorem 1. For any instance x ∈ X and class y ∈ Y: (∀h ∈ VS(D))(h(x) = y) ⇔ (∀y′ ∈ Y \ {y}) VS(D ∪ {(x, y′)}) = ∅. Theorem 1 states that the unanimous-voting rule can be implemented if we have an algorithm to test version spaces for collapse (emptiness). Unanimous Voting

35 (Illustration: the version space VS(I+, I−) inside the classifier space H.)

36 Step 1: Check that VS(D) ≠ ∅.

37 Step 2: Classify an instance x.

38 Step 3: Check VS(D ∪ {(x, −)}): here VS(D ∪ {(x, −)}) = ∅.

39 Step 4: Check VS(D ∪ {(x, +)}): here VS(D ∪ {(x, +)}) ≠ ∅.

40 VS(D) ≠ ∅, VS(D ∪ {(x, −)}) = ∅, and VS(D ∪ {(x, +)}) ≠ ∅ imply that x is positive.

41 Step 2: Classify another instance x.

42 Step 3: Check VS(D ∪ {(x, −)}): here VS(D ∪ {(x, −)}) ≠ ∅.

43 Step 4: Check VS(D ∪ {(x, +)}): here VS(D ∪ {(x, +)}) ≠ ∅.

44 VS(D) ≠ ∅, VS(D ∪ {(x, −)}) ≠ ∅, and VS(D ∪ {(x, +)}) ≠ ∅ imply that x is not classified.
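The procedure on slides 36-44 can be written down directly once a collapse test is available. The sketch below assumes a hypothetical routine is_empty(data) that returns True iff the version space of the labeled data collapses (for example, by checking whether any classifier in H is consistent with the data); D is assumed to be a list of (instance, label) pairs.

```python
def unanimous_vote(D, x, classes=(+1, -1), is_empty=None):
    """Classify x by the unanimous-voting rule, using collapse tests (Theorem 1).

    is_empty(data) must return True iff VS(data) is empty; it is assumed to be given.
    """
    if is_empty(D):                              # step 1: VS(D) must be non-empty
        raise ValueError("VS(D) is empty: the training data are inconsistent for H")
    survivors = [y for y in classes
                 if not is_empty(D + [(x, y)])]  # steps 3-4: which labels keep VS non-empty?
    if len(survivors) == 1:                      # all classifiers in VS(D) agree on x
        return survivors[0]
    return "?"                                   # disagreement: x stays unclassified
```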

45 When can we reach 100% Accuracy? Case 1: When the data are noise-free and the classifier space H contains the target classifier. (Illustration: VS(D) inside H.)

46 When is it not possible to reach 100% Accuracy? Case 2: When the classifier space H does not contain the target classifier. (Illustration: VS(D) inside H.)

47 When is it not possible to reach 100% Accuracy? Case 3: When the datasets are noisy. (Illustration: version space VS(I+, I−) inside H.)

48 When is it not possible to reach 100% Accuracy? Case 3: When the datasets are noisy. (Illustration continued.)

49 When is it not possible to reach 100% Accuracy? Case 3: When the datasets are noisy. (Illustration continued.)

50 When is it not possible to reach 100% Accuracy? Case 3: When the datasets are noisy. (Illustration: the resulting version space VS(D).)

51 Volume Extension Approach Theorem 2. Consider classifier spaces H and H′ such that: (∀D)((∃h ∈ H) cons(h, D) → (∃h′ ∈ H′) cons(h′, D)). Then, for any data set D: V(VS(D)) ⊆ V(VS′(D)).

52 Volume Extension Approach: Case 2 Case 2: H does not contain the target classifier.

53 Volume Extension Approach: Case 2 Case 2: We add a classifier that classifies the instance differently than the classifiers in VS(D).

54 Volume Extension Approach: Case 2 Case 2: This extends the volume of VS(D).

55 Volume Extension Approach: Case 3 Case 3: When the datasets are noisy.

56 Volume Extension Approach: Case 3 Case 3: When the datasets are noisy (illustration continued).

57 Volume Extension Approach: Case 3 Case 3: We add a classifier that classifies the instances differently than the classifiers in VS(D).

58 Volume Extension Approach: Case 3 Case 3: ... and we extend the volume of VS(D).

59 Version Space Support Vector Machines Version Space Support Vector Machines (VSSVMs) are version spaces whose classification algorithm is implemented with SVMs h(C, D).

60 Classifier Space for VSSVMs Definition 4. Given a space H of oriented hyperplanes and a data set D, if the SVM hyperplane h(C, D) is consistent with D, then the classifier space H(C, D) for D equals: {h(C, D)} ∪ {h(C, D ∪ {(x, y)}) ∈ H | (x, y) ∈ X × Y ∧ cons(h(C, D ∪ {(x, y)}), D ∪ {(x, y)})}; otherwise, H(C, D) is empty. SVMs become consistent for the classifier space H(C, D) if the property below holds. Definition 5. The classifier space H(C, D) is said to have the consistency-identification property if and only if for any instance x ∈ X: if the SVM h(C, D ∪ {(x, y)}) is inconsistent with the data set D ∪ {(x, y)}, then for any instance (x′, y′) ∈ X × Y there is no SVM h(C, D ∪ {(x′, y′)}) that is consistent with the data set D ∪ {(x, y)}.

61 VSSVMs Definition 6. Given a data set D and the classifier space H(C, D), for any D′ that is a superset of D the version space support vector machine VS(C, D′) is defined as follows: VS(C, D′) = {h ∈ H(C, D) | cons(h, D′)}. Note that the inductive bias of version space support vector machines is controlled by the parameter C.
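As a heavily simplified illustration (not the exact construction of Definitions 4-6), one can approximate the collapse test for VSSVM-style reliable classification by checking whether an SVM trained on D ∪ {(x, y)} separates the augmented data; x is labeled y only if y is the sole label that survives this test. scikit-learn's SVC is assumed, and the C and gamma values are placeholders echoing the experiment tables.

```python
import numpy as np
from sklearn.svm import SVC

def svm_consistent(X, y, C=1000.0, gamma=0.078):
    """True if the SVM hyperplane h(C, D) is consistent with (separates) the data D."""
    clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)
    return np.all(clf.predict(X) == y)

def vssvm_classify(X, y, x_new, C=1000.0, gamma=0.078):
    """Unanimous-voting-style reliable classification with SVM consistency as collapse test."""
    survivors = []
    for label in np.unique(y):
        X_aug = np.vstack([X, x_new])
        y_aug = np.append(y, label)
        if svm_consistent(X_aug, y_aug, C, gamma):     # VS(D ∪ {(x, label)}) treated as non-empty
            survivors.append(label)
    return survivors[0] if len(survivors) == 1 else "?"  # abstain unless exactly one label survives
```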

62 Extending the Volume of VSSVMs The probability that the SVM hyperplane h(C, D) is consistent with the data D increases with C. This implies that if C1 < C2, the probability of cons(h(C1, D), D) → cons(h(C2, D), D) increases. By Theorem 2, this implies that the probability of V(VS(C1, D)) ⊆ V(VS(C2, D)) increases.

63 The Volume-Extension Approach for VSSVMs

64 Experiments: VSSVM with RBF (cases 2 and 3)
Data Set          Parameters                  Coverage  Accuracy
Breast Cancer     G=0.078, C=103.9…147.5      53.4%     70.5%
Heart-Cleveland   G=0.078, C=1201.3…2453.2    56.4%     95.9%
Hepatitis         G=0.078, C=75.4…956.2       78.7%     89.3%
Horse Colic       G=0.078, C=719.2…956.2      34.0%     80.8%
Ionosphere        G=0.078, C=1670.2…1744.8    86.9%     91.1%
Labor             G=0.078, C=3.0…17.4         63.2%     91.7%
Sonar             G=0.078, C=20.7…41.8        69.2%     68.1%
W. Breast Cancer  G=0.156, C=3367.0…3789.1    93.7%     96.5%

65 Experiments: VSSVMs with RBF (cases 2 and 3: applying the volume extension approach)
Data Set          Parameters                  Ic     Coverage  Accuracy
Breast Cancer     G=0.078, C=103.9…147.5      35     9.32%     100%
Heart-Cleveland   G=0.078, C=1201.3…2453.2    500    15.8%     100%
Hepatitis         G=0.078, C=75.4…956.2       95     41.3%     100%
Horse Colic       G=0.078, C=719.2…956.2      1000   6.8%      100%
Ionosphere        G=0.078, C=1670.2…1744.8    1200   28.5%     100%
Labor             G=0.078, C=3.0…17.4         13     40.4%     100%
Sonar             G=0.078, C=20.7…41.8        15     33.7%     100%
W. Breast Cancer  G=0.156, C=3367.0…3789.1    700    79.5%     100%

66 Comparison of the Coverage for accuracy of 100%
Data Set          VSSVMs  TCMNN
Breast Cancer     9.32%   1.4%
Heart-Cleveland   15.8%   9.3%
Hepatitis         41.3%   34.9%
Horse Colic       6.8%    4.7%
Ionosphere        28.5%   45.3%
Labor             40.4%   54.4%
Sonar             33.7%   42.1%
W. Breast Cancer  79.5%   35.7%

67 Future Research: VSSVMs Extending version space support vector machines for multi-class classification tasks; Extending version space support vector machines for classification tasks for which it is not possible to find consistent solutions; Improving computational efficiency of version space support vector machines using incremental SVM.

68 Co-Training (WWW application) Consider the problem of learning to classify pages of hypertext from the WWW, given labeled training data consisting of individual web pages along with their correct classifications. The task of classifying a web page can be done by considering just the words on the web page itself, or just the words on the hyperlinks that point to the web page.

69 Co-Training (illustration: one example described by two redundant views — the text of the page itself, e.g. “Professor Faloutsos”, and the anchor text of a hyperlink pointing to it, e.g. “my advisor”).

70 The Co-Training algorithm Given: –Set L of labeled training examples –Set U of unlabeled examples Loop: –Learn hyperlink-based classifier H from L –Learn full-text classifier F from L –Allow H to label p positive and n negative examples from U –Allow F to label p positive and n negative examples from U –Add these self-labeled examples to L
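A sketch of this loop, assuming two NumPy feature matrices per example set (e.g., hyperlink-word counts and page-word counts), binary 0/1 labels, and scikit-learn's MultinomialNB for both classifiers; p, n, and the number of rounds are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(L_link, L_text, y, U_link, U_text, p=1, n=3, rounds=10):
    """Co-training sketch (binary 0/1 labels): each view's classifier labels its
    most confident unlabeled examples and adds them to the shared labeled set L."""
    unlabeled = np.arange(len(U_link))
    for _ in range(rounds):
        H = MultinomialNB().fit(L_link, y)        # hyperlink-based classifier
        F = MultinomialNB().fit(L_text, y)        # full-text classifier
        for clf, view in ((H, U_link), (F, U_text)):
            if len(unlabeled) < p + n:            # not enough unlabeled examples left
                break
            proba = clf.predict_proba(view[unlabeled])[:, 1]   # P(positive class)
            order = np.argsort(proba)
            picked = np.concatenate([order[:n], order[-p:]])   # n negatives, p positives
            labels = np.concatenate([np.zeros(n, int), np.ones(p, int)])
            idx = unlabeled[picked]
            L_link = np.vstack([L_link, U_link[idx]])
            L_text = np.vstack([L_text, U_text[idx]])
            y = np.append(y, labels)
            unlabeled = np.delete(unlabeled, picked)
    return H, F
```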

71 Learning to Classify Web using Co-Training Mitchell (1999) reported an experiment to co-train text classifiers that recognize course home pages. In the experiment, he used 16 labeled examples and 800 unlabeled pages. Mitchell (1999) found that the co-training algorithm does improve classification accuracy when learning to classify web pages.

72 Learning to Classify Web using Co-Training

73 When does Co-Training work? When examples are described by redundantly sufficient features; and When the hypothesis spaces corresponding to the sets of redundantly sufficient features contain different hypotheses or the learning algorithms are different.

