
1 Machine Learning in Practice Lecture 24 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute

2 Plan for the day
- Announcements / Questions?
- This lecture: Finish Chapter 7
- Next lecture: Cover Chapter 8
- Following lecture: Mid-term review
- Final 3 lectures: More applications of machine learning
Today's topics: Optimization Shortcut, Ensemble Methods, Semi-Supervised Learning

3 Optimization Shortcut!

4 Using CVParameterSelection

5 (screenshot only; no text on this slide)

6 You have to know what the command line options look like. You can find out online or in the Experimenter. Don't forget to click Add!

7 Using CVParameterSelection: best setting over the whole set

8 Using CVParameterSelection: tuned performance (see the sketch below)
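
The same shortcut can be scripted through the Weka Java API instead of the Explorer/Experimenter. This is only a sketch: the file name train.arff is a placeholder, J48 is just one possible base learner, and the parameter strings follow CVParameterSelection's "option lower upper steps" format, so check the option letters for whatever classifier you actually tune.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.CVParameterSelection;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class TuneJ48 {
    public static void main(String[] args) throws Exception {
        // Placeholder path: substitute your own ARFF file.
        Instances data = DataSource.read("train.arff");
        data.setClassIndex(data.numAttributes() - 1);

        CVParameterSelection tuner = new CVParameterSelection();
        tuner.setClassifier(new J48());
        tuner.setNumFolds(10);
        // Same strings you would type into the GUI: option letter,
        // lower bound, upper bound, number of steps.
        tuner.addCVParameter("C 0.1 0.5 5");   // J48 confidence factor
        tuner.addCVParameter("M 2 10 5");      // J48 minimum instances per leaf

        tuner.buildClassifier(data);
        System.out.println("Best options found: "
                + Utils.joinOptions(tuner.getBestClassifierOptions()));

        // Cross-validated estimate of the tuned performance.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tuner, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```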

9 Ensemble Methods

10 Is noise always bad?
- Note: if testing data will be noisy, it is helpful to add similar noise to the training data – the learner then learns which features to "trust"
- Noise also plays a key role in ensemble methods…
(image: http://www.city.vancouver.bc.ca/ctyclerk/cclerk/970513/citynoisereport/noise2.gif)

11 Simulated Annealing
(image: http://biology.st-andrews.ac.uk/vannesmithlab/simanneal.png)

12 Key idea: combine multiple views on the same data in order to increase reliability

13 Ensemble Methods: Combining Multiple Models
- A current area of active research in machine learning
- Bagging and Boosting are both ways of training a "committee" of classifiers and then combining their predictions: for classification they vote, for regression they average
- Stacking trains multiple classifiers on the same data and combines their predictions with a trained model

14 Ensemble Methods: Combining Multiple Models
- In Bagging, all models have equal weight
- In Boosting, more successful models are given more weight
- In Stacking, a trained classifier assigns the "weights"
- Weka has several meta classifiers for forms of boosting, one for bagging, and one for stacking (see the sketch below)
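
A rough sketch of two of those Weka meta classifiers called from the Java API rather than the GUI (Stacking appears in a separate sketch under the Stacking slides). The file name train.arff and the choice of J48 as the base learner are placeholder assumptions.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CommitteeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Bagging: every committee member gets an equal vote.
        Bagging bagger = new Bagging();
        bagger.setClassifier(new J48());
        bagger.setNumIterations(10);

        // Boosting: members that perform better get more weight in the vote.
        AdaBoostM1 booster = new AdaBoostM1();
        booster.setClassifier(new J48());
        booster.setNumIterations(10);

        for (Classifier committee : new Classifier[] {bagger, booster}) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(committee, data, 10, new Random(1));
            System.out.printf("%s: %.2f%% correct%n",
                    committee.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```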

15 Multiple Models from the Same Data
- Random selection with replacement: you can create as many data sets as you want, of the size you want (sort of!)
- Bagging = "Bootstrap Aggregating": create new datasets by resampling the data with replacement – from n datapoints, create t datasets of size n
- Trained models will differ in the places where the models depend on quirks of the data (see the resampling sketch below)
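
A minimal sketch of the resampling step itself, using Instances.resample, which draws n instances with replacement; the file name, base learner, and number of replicates are placeholder choices.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BootstrapDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        int t = 5;  // number of bootstrap replicates
        Classifier[] committee = new Classifier[t];
        for (int i = 0; i < t; i++) {
            // Each call draws numInstances() examples with replacement, so every
            // replicate has size n but a different mix of the data's quirks.
            Instances bag = data.resample(new Random(i));
            committee[i] = new J48();
            committee[i].buildClassifier(bag);
        }
        System.out.println("Trained " + t + " models on bootstrap replicates of "
                + data.numInstances() + " instances each.");
    }
}
```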

16 Multiple Models from the Same Data
- Reduces the effects of noise and avoids overfitting
- Bagging helps most with unstable learning algorithms
- Sometimes bagging has a better effect if you increase the level of instability in the learner (by adding noise to the data, turning off pruning, or reducing pruning)

17 Bagging and Probabilities
- Bagging works well when the output of the classifier is a probability estimate and the decision can be made probabilistically rather than by voting
- Voting approach: each model votes for one class
- Probability approach: each model contributes a distribution over the classes
(diagram: Set 1 … Set 5)

18 Bagging and Probabilities
- Bagging also produces good probability estimates as output, so it works well with cost-sensitive classification
- Even if the models only contribute one vote each, you can compute a probability from the proportion of models that voted the same way (see the sketch below)
(diagram: Set 1 … Set 5)
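
A sketch of the two combination schemes as helper methods meant to be dropped into a program like the bootstrap sketch above (the names are illustrative, not a Weka API): average each model's class distribution, or treat the vote proportions themselves as a probability estimate.

```java
import weka.classifiers.Classifier;
import weka.core.Instance;

public class Combine {
    // Probability approach: average the class distributions of all models.
    static double[] averageDistributions(Classifier[] committee, Instance x) throws Exception {
        double[] avg = null;
        for (Classifier c : committee) {
            double[] dist = c.distributionForInstance(x);
            if (avg == null) avg = new double[dist.length];
            for (int k = 0; k < dist.length; k++) avg[k] += dist[k] / committee.length;
        }
        return avg;
    }

    // Voting approach: one vote per model; the proportion of votes for each
    // class doubles as a (cruder) probability estimate.
    static double[] voteProportions(Classifier[] committee, Instance x, int numClasses) throws Exception {
        double[] votes = new double[numClasses];
        for (Classifier c : committee) {
            votes[(int) c.classifyInstance(x)] += 1.0 / committee.length;
        }
        return votes;
    }
}
```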

19 Bagging and Probabilities
- MetaCost is a similar idea – and is easier to analyze
- It does a cost-sensitive version of bagging and uses this to relabel the data, then trains a model on the relabeled data
- The new model inherits cost sensitivity from the labels it is trained on
- It tends to work better than standard cost-sensitive classification

20 Bagging and Probabilities
A slightly different option is to train Option Trees, which build a "packed shared forest" of decision trees that explicitly represents the choice points.
* A packed tree is really the same as a set of trees.

21 Randomness and Greedy Algorithms
- Randomization of greedy algorithms: rather than always selecting the best-looking next action, select one of the top N
- More randomness means the models are based less on the data, which means each individual model is less accurate
- But if you do it several times, the result will be different each time, so you can use this to do something like Bagging

22 Randomization and Nearest Neighbor Methods
- Standard bagging does not help much with nearest neighbor classifiers because they are not unstable in the right way
- Since predictions are based on k neighbors, small perturbations in the data don't have a big effect on decision making

23 Randomization and Nearest Neighbor Methods
- The trick is to randomize in a way that makes the classifier diverse without sacrificing accuracy
- With nearest neighbor methods, it works well to randomize the selection of a subset of features used to compute the distance between instances
- Each selection gives you a very different view of your data (see the sketch below)
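
One way to get that effect in Weka is the RandomSubSpace meta classifier, which trains each committee member on a random subset of the attributes; here it wraps a k-nearest-neighbor learner (IBk). A sketch with placeholder file name and parameter values.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.RandomSubSpace;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SubspaceKnnDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        RandomSubSpace ensemble = new RandomSubSpace();
        ensemble.setClassifier(new IBk(5));   // 5-nearest-neighbor base learner
        ensemble.setNumIterations(10);        // 10 different feature views
        ensemble.setSubSpaceSize(0.5);        // each view sees roughly half the attributes

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(ensemble, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```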

24 Boosting
- Boosting is similar to bagging in that it trains multiple models and then combines their predictions
- It specifically seeks to train multiple models that complement each other
- In boosting, a series of models is trained, and each model is influenced by the strengths and weaknesses of the previous one: new models should be experts on the examples the previous model got wrong
- In the final vote, each model's predictions are weighted based on that model's performance

25 AdaBoost
- Assigning weights to instances is a way to get a classifier to pay more attention to some instances than others
- Remember that in boosting, models are trained in a sequence
- Reweighting: Model x+1 gives higher weight to the examples that Model x and previous classifiers got wrong than to the ones that were treated correctly more often
- Resampling: errors affect the probability of selecting an example, but the classifier treats each instance in the selected sample with the same importance

26 AdaBoost
- The amount of reweighting depends on the extent of the errors
- With reweighting, you use each example once, but the examples are weighted differently
- With resampling, you do selection with replacement as in Bagging, but the probability is affected by the "weight" assigned to an example
(see the sketch below)
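
A sketch of that reweighting-versus-resampling choice as it is exposed by Weka's AdaBoostM1 (the setUseResampling switch); the file name and the DecisionStump base learner are placeholder choices.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.DecisionStump;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoostingDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        for (boolean resample : new boolean[] {false, true}) {
            AdaBoostM1 booster = new AdaBoostM1();
            booster.setClassifier(new DecisionStump());  // deliberately weak base learner
            booster.setNumIterations(50);
            // false = reweight every example; true = resample with replacement,
            // with the boosting weights acting as selection probabilities.
            booster.setUseResampling(resample);

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(booster, data, 10, new Random(1));
            System.out.printf("useResampling=%b: %.2f%% correct%n", resample, eval.pctCorrect());
        }
    }
}
```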

27 More about Boosting
- The more iterations, the more confident the trained classifier will be in its predictions (since it will have more experts voting); this is true even beyond the point where the error on the training data goes down to 0
- Because of that, it might be helpful to have a validation set for tuning
- On the other hand, sometimes Boosting overfits – another reason why it is helpful to have a validation set
- Boosting can turn a weak classifier into a strong classifier

28 Why does Boosting work?
- You can learn a very complex model all at once, or you can learn a sequence of simpler models
- When you combine the simple models, you get a more complex model
- The advantage is that at each stage, the search is more constrained
- Sort of like a "divide-and-conquer" approach

29 Boosting and Additive Regression
- Boosting is a form of forward, stagewise, additive modeling
- LogitBoost is like AdaBoost except that it uses a regression model as the base classifier, whereas AdaBoost uses a classification model

30 Boosting and Additive Regression
Additive regression is when you:
1. train a regression equation
2. then train another to predict the residuals
3. then another, and so on
4. and then add the predictions together
- With additive regression, the more iterations, the better you do on the training data – but you might overfit
- You can get around this with cross validation
- You can also reduce the chance of overfitting by decreasing the size of the increment each time – but the run time is slower
- Same idea as the momentum and learning rate parameters in multi-layer perceptrons
(see the sketch below)
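
A sketch of those steps using Weka's AdditiveRegression meta learner, which fits the base regressor to the residuals at each iteration and sums the predictions; the shrinkage parameter plays the "smaller increment" role described above. The file name is a placeholder for any dataset with a numeric class.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdditiveRegression;
import weka.classifiers.trees.DecisionStump;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AdditiveRegressionDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder path: any dataset whose class attribute is numeric.
        Instances data = DataSource.read("regression.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AdditiveRegression ar = new AdditiveRegression();
        ar.setClassifier(new DecisionStump()); // each stage fits the residuals with a stump
        ar.setNumIterations(50);
        ar.setShrinkage(0.3);                  // smaller increments: slower, but less overfitting

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(ar, data, 10, new Random(1));
        System.out.printf("RMSE: %.3f%n", eval.rootMeanSquaredError());
    }
}
```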

31 Stacking
- Stacking combines the predictions of multiple learning methods over the same data, rather than manipulating the training data as in bagging and boosting
- Use several different learners to add labels to your data using cross validation
- Then train a meta-learner to make an "intelligent guess" based on the pattern of predictions it sees
- The meta-learner can usually be a simple algorithm

32 Stacking
- A more careful option is to train the level-0 classifiers on the training data and train the meta-learner on validation data
- The trained model makes predictions about novel examples by first applying the level-0 classifiers to the test data and then applying the meta-learner to those labels
(see the sketch below)
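
A sketch of stacking in Weka: three level-0 learners whose cross-validated predictions feed a simple meta-learner (logistic regression here). The file name and the particular learners are placeholder choices.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class StackingDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        Stacking stacker = new Stacking();
        // Level-0 classifiers: different views of the same data.
        stacker.setClassifiers(new Classifier[] {new J48(), new NaiveBayes(), new IBk(3)});
        // Meta-learner: a simple model over the pattern of level-0 predictions.
        stacker.setMetaClassifier(new Logistic());
        // Internal cross-validation generates the level-0 predictions that the
        // meta-learner is trained on, so it never sees fitted-on-itself labels.
        stacker.setNumFolds(10);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(stacker, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```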

33 Error Correcting Output Codes
- Instead of training 4 classifiers, you train 7
- Look at the pattern of results and pick the class with the most similar pattern (avoids ad hoc tie breakers)
- So if one classifier makes a mistake, you can usually compensate for it with the others

Classes   One-versus-all   Error-correcting code
A         1 0 0 0          1 1 1 1 1 1 1
B         0 1 0 0          0 0 0 0 1 1 1
C         0 0 1 0          0 0 1 1 0 0 1
D         0 0 0 1          0 1 0 1 0 1 0

34 Error Correcting Output Codes
- Because the classifiers are making different comparisons, they will make errors in different places
- It's like training subclassifiers to make individual pairwise comparisons to resolve conflicts, but it always trains models on all of the data rather than part of it
- A good error-correcting code has good row separation and column separation (so you need at least 4 class distinctions before you can achieve this)
- Separation is computed using Hamming distance (see the decoding sketch below)
(code table repeated from the previous slide)
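
A minimal, Weka-free sketch of the decoding step: given the 7-bit code words from the table and one 0/1 output per trained classifier, pick the class whose code word is nearest in Hamming distance.

```java
public class EcocDecode {
    // The 7-bit error-correcting code words from the table on slide 33.
    static final String[] CLASSES = {"A", "B", "C", "D"};
    static final int[][] CODE = {
        {1, 1, 1, 1, 1, 1, 1},   // A
        {0, 0, 0, 0, 1, 1, 1},   // B
        {0, 0, 1, 1, 0, 0, 1},   // C
        {0, 1, 0, 1, 0, 1, 0},   // D
    };

    static int hamming(int[] a, int[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++) if (a[i] != b[i]) d++;
        return d;
    }

    // predictions[i] is the 0/1 output of the i-th binary classifier.
    static String decode(int[] predictions) {
        int best = 0;
        for (int c = 1; c < CODE.length; c++) {
            if (hamming(predictions, CODE[c]) < hamming(predictions, CODE[best])) best = c;
        }
        return CLASSES[best];
    }

    public static void main(String[] args) {
        // One classifier flipped the first bit on a true class-B example;
        // the nearest code word is still B, so the error is corrected.
        System.out.println(decode(new int[] {1, 0, 0, 0, 1, 1, 1}));  // prints B
    }
}
```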

35 Using Error Correcting Codes

36 (screenshot only; no text on this slide)

37 (screenshot only; no text on this slide)

38 (screenshot only; no text on this slide)

39 Semi-Supervised Learning

40 Key idea: avoid overfitting to a small amount of labeled data by leveraging a lot of unlabeled data

41 Using Unlabeled Data
If you have a small amount of labeled data and a large amount of unlabeled data:
- you can use a type of bootstrapping to learn a model that exploits regularities in the larger set of data
- the stable regularities might be easier to spot in the larger set than in the smaller set
- you are less likely to overfit your labeled data
Draws on concepts from Clustering! Clustering shows you where the natural breaks are in your data.

42 Expectation maximization approach
- Train a model on the labeled data
- Apply the model to the unlabeled data
- Train a model on the newly labeled data
- You can use a cross-validation approach to reassign labels to the same data from this new trained model
- You can keep doing this iteratively until the model converges
- The probabilities on the labels assign a weight to each training example
- If you consider hand-labeled data to have a score of 100%, then as your amount of hand-labeled data increases, your unlabeled data will have less and less influence over the target model
- This maximizes the expectation of correct classification
(see the sketch below)
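
A rough sketch of one such iteration in the Weka Java API, under the assumptions that labeled.arff and unlabeled.arff are placeholder files with identical headers (the unlabeled one has ? in the class column) and that Naive Bayes stands in for whatever classifier you actually use: train on the labeled data, give each unlabeled instance its most probable label, weight it by that probability, and retrain. Repeating steps 2-3 until the labels stop changing gives the iterative procedure described above.

```java
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelfTrainingDemo {
    public static void main(String[] args) throws Exception {
        Instances labeled = DataSource.read("labeled.arff");     // placeholder paths
        Instances unlabeled = DataSource.read("unlabeled.arff"); // class values are all ?
        labeled.setClassIndex(labeled.numAttributes() - 1);
        unlabeled.setClassIndex(unlabeled.numAttributes() - 1);

        // Step 1: train on the hand-labeled data.
        NaiveBayes model = new NaiveBayes();
        model.buildClassifier(labeled);

        // Step 2: label the unlabeled data, weighting each example by the
        // model's confidence so uncertain examples count for less.
        Instances combined = new Instances(labeled);
        for (int i = 0; i < unlabeled.numInstances(); i++) {
            double[] dist = model.distributionForInstance(unlabeled.instance(i));
            double predicted = model.classifyInstance(unlabeled.instance(i));

            Instance inst = (Instance) unlabeled.instance(i).copy();
            inst.setDataset(combined);
            inst.setClassValue(predicted);
            inst.setWeight(dist[(int) predicted]);  // soft label: probability as weight
            combined.add(inst);
        }

        // Step 3: retrain on the newly labeled data (loop 2-3 until convergence).
        model.buildClassifier(combined);
        System.out.println("Retrained on " + combined.numInstances() + " weighted instances.");
    }
}
```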

43 Doing Semi-Supervised Learning
Not built in to Weka!
- Set up the labeled data as usual
- In the unlabeled data, the class value is always ?
- Create one whole labeled set of data:
  - Set up the Explorer to output predictions
  - Run the classifier in the Explorer with "Use supplied test set"
  - You can then add the predictions to the unlabeled data
  - Make one large dataset with the original labeled data and the newly labeled data
- Then create train/test pairs so you can re-estimate the labels

44 Built into TagHelper Tools!
- Unlabeled examples have class ?
- Turn on Self-training

45 Co-training
- Train two different models based on a few labeled examples; each model is learning the same labels but using different features
- Use each of these models to label the unlabeled data
- For each approach, take the example most confidently labeled negative and the example most confidently labeled positive, and add them to the labeled data
- Repeat the process until all of the data is labeled

46 Co-training
- Co-training is better than EM for data that truly has two independent feature sets (like content versus links for web pages)
- Co-EM combines the two approaches: use the labeled data to train a model with approach A, then use approach B to learn those labels and assign them to the data, then use A again, and pass back and forth until convergence
- Co-EM probabilistically re-estimates the labels on all of the data on each iteration
(see the sketch below)
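
A rough co-training sketch in the Weka Java API. Everything concrete here is an assumption: labeled.arff and unlabeled.arff are placeholder files with identical headers, the attribute ranges "1-10" and "11-20" stand in for the two feature views (e.g. content vs. links), Naive Bayes stands in for the real learners, and for brevity each view promotes only its single most confident example per round rather than one positive and one negative as described above.

```java
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.Remove;

public class CoTrainingDemo {
    // A classifier that only "sees" one view of the features plus the class.
    // attributeRange is a 1-based Weka range string such as "1-10" (placeholder).
    static FilteredClassifier viewClassifier(String attributeRange, int classIndex) {
        Remove keepView = new Remove();
        keepView.setAttributeIndices(attributeRange + "," + (classIndex + 1));
        keepView.setInvertSelection(true);  // keep the listed attributes, drop the rest
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(keepView);
        fc.setClassifier(new NaiveBayes());
        return fc;
    }

    public static void main(String[] args) throws Exception {
        Instances labeled = DataSource.read("labeled.arff");     // placeholder paths
        Instances unlabeled = DataSource.read("unlabeled.arff"); // class values are all ?
        labeled.setClassIndex(labeled.numAttributes() - 1);
        unlabeled.setClassIndex(unlabeled.numAttributes() - 1);

        String[] views = {"1-10", "11-20"};  // placeholder feature views

        while (unlabeled.numInstances() > 0) {
            for (String view : views) {
                if (unlabeled.numInstances() == 0) break;
                FilteredClassifier model = viewClassifier(view, labeled.classIndex());
                model.buildClassifier(labeled);

                // Find the unlabeled example this view is most confident about.
                int best = 0;
                double bestConf = -1;
                for (int i = 0; i < unlabeled.numInstances(); i++) {
                    double conf = 0;
                    for (double p : model.distributionForInstance(unlabeled.instance(i))) {
                        conf = Math.max(conf, p);
                    }
                    if (conf > bestConf) { bestConf = conf; best = i; }
                }

                // Move it into the labeled pool with this view's predicted label,
                // so the other view learns from it on the next pass.
                Instance chosen = (Instance) unlabeled.instance(best).copy();
                chosen.setDataset(labeled);
                chosen.setClassValue(model.classifyInstance(unlabeled.instance(best)));
                labeled.add(chosen);
                unlabeled.delete(best);
            }
        }
        System.out.println("All " + labeled.numInstances() + " instances are now labeled.");
    }
}
```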

47 What Makes Good Applied Machine Learning Work Based on Bootstrapping and Co-training?
- Determining what are good "alternative views" on your data
- This involves all of the same issues as simply applying classifiers: What features do you have available? How will you select subsets of these? Where will you get your labeled data from? What is the quality of this labeling?

48 Take Home Message
- Noise and instability are not always bad!
- Increase stability in classification using "multiple views"
- Ensemble methods use noise to get a "broader" view of your data
- Semi-supervised learning gets a "broader view" of your data by leveraging regularities found in a larger, unlabeled set of data

