Machine Learning in Practice, Lecture 24. Carolyn Penstein Rosé, Language Technologies Institute / Human-Computer Interaction Institute
Plan for the day: Announcements. Questions? This lecture: finish Chapter 7. Next lecture: cover Chapter 8. Following lecture: mid-term review. Final 3 lectures: more applications of machine learning. Topics: Optimization Shortcut! Ensemble Methods. Semi-Supervised Learning.
Optimization Shortcut!
Using CVParameterSelection
You have to know what the command line options look like. You can find out online or in the Experimenter. Don’t forget to click Add!
Using CVParameterSelection Best setting over whole set
Using CVParameterSelection * Tuned performance.
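The idea behind this kind of parameter tuning can be sketched outside Weka. The Python below is only an illustration, not Weka's implementation: it tries each candidate value of one parameter, estimates accuracy by cross-validation, and keeps the best setting. The toy 1-D nearest neighbor learner, the data, and all function names are assumptions made up for this sketch.

```python
def knn_predict(train, x, k):
    """Classify x by majority vote among the k nearest 1-D training points."""
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    # Ties are broken by label order so the result is deterministic.
    return max(sorted(votes), key=lambda label: votes[label])


def cv_accuracy(data, k, folds):
    """Estimate the accuracy of k-NN for a given k by cross-validation."""
    correct = 0
    for f in range(folds):
        test = data[f::folds]  # every folds-th instance, starting at f
        train = [pair for i, pair in enumerate(data) if i % folds != f]
        correct += sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(data)


def select_parameter(data, candidates, folds=5):
    """Return the candidate value with the best cross-validated accuracy."""
    return max(candidates, key=lambda k: cv_accuracy(data, k, folds))


# Hypothetical toy data: class "a" below 10, class "b" from 10 upward.
data = [(x, "a" if x < 10 else "b") for x in range(20)]
best_k = select_parameter(data, candidates=[1, 3, 5])
print("selected k:", best_k)
```

The selected value would then be used to train one final model on the whole training set, which matches the "best setting over whole set" step on the earlier slide.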
Ensemble Methods
Is noise always bad? Note: if testing data will be noisy, it is helpful to add similar noise to the training data – the learner then learns which features to “trust” Noise also plays a key role in ensemble methods…
Simulated Annealing
Key idea: combine multiple views on the same data in order to increase reliability
Ensemble Methods: Combining Multiple Models. This is a current area of active research in machine learning. Bagging and Boosting are both ways of training a “committee” of classifiers and then combining their predictions: for classification, they vote; for regression, they average. Stacking is where you train multiple classifiers on the same data and combine their predictions with a trained model.
Ensemble Methods: Combining Multiple Models. In Bagging, all models have equal weight. In Boosting, more successful models are given more weight. In Stacking, a trained classifier assigns the “weights”. Weka has several meta classifiers for forms of boosting, one for bagging, and one for stacking.
Multiple Models from the Same Data. Random selection with replacement: you can create as many datasets as you want of the size you want (sort of!). Bagging = “Bootstrap Aggregating”: create new datasets by resampling the data with replacement. From n datapoints, create t datasets of size n. The trained models will differ in the places where they depend on quirks of the data.
Multiple Models from the Same Data. Bagging reduces the effects of noise and helps avoid overfitting. It helps most with unstable learning algorithms. Sometimes bagging has a better effect if you increase the level of instability in the learner (by adding noise to the data, turning off pruning, or reducing pruning).
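A minimal sketch of bagging in Python, assuming a toy 1-D dataset and a simple threshold "stump" as the base learner (both are hypothetical choices for illustration, not anything from Weka). From n datapoints it draws t bootstrap samples of size n, trains one model per sample, and lets the models vote with equal weight.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) instances from data, with replacement."""
    return [rng.choice(data) for _ in data]


def train_stump(sample):
    """Pick the threshold t that minimizes errors of the rule: x >= t means class 1."""
    best_t, best_err = None, None
    for t, _ in sample:
        err = sum((x >= t) != bool(y) for x, y in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagging(data, t, seed=0):
    """Train t stumps, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(t)]


def predict(thresholds, x):
    """Equal-weight vote over all bagged stumps."""
    votes = Counter(int(x >= th) for th in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: class 1 from x = 5 upward.
data = [(x, int(x >= 5)) for x in range(10)]
models = bagging(data, t=25)
print(sorted(set(models)))  # the thresholds differ from sample to sample
print(predict(models, 0), predict(models, 9))
```

The learned thresholds vary with which instances happen to land in each sample, which is the "quirks of the data" effect that the vote then smooths over.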
Bagging and Probabilities. Bagging works well when the output of the classifier is a probability estimate and the decision can be made probabilistically rather than by voting. Voting approach: each model votes for one class. Probability approach: each model contributes a distribution over predictions. [Figure: Set 1 through Set 5]
Bagging and Probabilities. Bagging also produces good probability estimates as output, so it works well with cost-sensitive classification. Even if each model only contributes one vote, you can compute a probability from the proportion of models that voted the same way.
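The two ways of combining bagged models can be sketched in a few lines of Python. The function names and the example votes are assumptions for illustration only: the first function turns hard votes into a probability estimate from the proportion of models that agree, and the second averages per-model distributions.

```python
from collections import Counter


def vote_distribution(votes):
    """Voting approach: probability = share of models that voted for each class."""
    counts = Counter(votes)
    return {label: n / len(votes) for label, n in counts.items()}


def average_distributions(distributions):
    """Probability approach: average the class distributions of all models."""
    labels = {label for d in distributions for label in d}
    return {
        label: sum(d.get(label, 0.0) for d in distributions) / len(distributions)
        for label in labels
    }


# Five hypothetical models, one per bootstrap sample.
hard = vote_distribution(["yes", "yes", "no", "yes", "no"])
soft = average_distributions([{"yes": 0.9, "no": 0.1}, {"yes": 0.4, "no": 0.6}])
print(hard, soft)
```

Either output is a class distribution, so it can be passed on to a cost-sensitive decision rule instead of simply taking the most likely class.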
Bagging and Probabilities. MetaCost is a similar idea and is easier to analyze. It does a cost-sensitive version of bagging and uses this to relabel the data, then trains a model on the relabeled data. The new model inherits cost sensitivity from the labels it is trained on. It tends to work better than standard cost-sensitive classification.
Bagging and Probabilities. A slightly different option is to train Option Trees, which build a “packed shared forest” of decision trees that explicitly represents the choice points. * A packed tree is really the same as a set of trees.
Randomness and Greedy Algorithms. Randomization of greedy algorithms: rather than always selecting the best-looking next action, select one of the top N. More randomness means models are based less on the data, which means each individual model is less accurate. But if you do it several times, the result will be different each time, so you can use this to do something like Bagging.
Randomization and Nearest Neighbor Methods. Standard bagging does not help much with nearest neighbor classifiers because they are not unstable in the right way. Since predictions are based on k neighbors, small perturbations in the data don’t have a big effect on decision making.
Randomization and Nearest Neighbor Methods. The trick is to randomize in a way that makes the classifiers diverse without sacrificing accuracy. With nearest neighbor methods, it works well to randomly select a subset of the features used to compute the distance between instances. Each selection gives you a very different view of your data.
Boosting. Boosting is similar to bagging in that it trains multiple models and then combines their predictions. It specifically seeks to train multiple models that complement each other. In boosting, a series of models is trained, and each trained model is influenced by the strengths and weaknesses of the previous model. New models should be experts in classifying examples that the previous model got wrong. In the final vote, model predictions are weighted based on each model’s performance.
AdaBoost. Assigning weights to instances is a way to get a classifier to pay more attention to some instances than others. Remember that in boosting, models are trained in a sequence. Reweighting: Model x+1 weights the examples that Model x and previous classifiers got wrong higher than the ones that were classified correctly more often. Resampling: errors affect the probability of selecting an example, but the classifier treats each instance in the selected sample with the same importance.
AdaBoost. The amount of reweighting depends on the extent of the errors. With reweighting, you use each example once, but the examples are weighted differently. With resampling, you do selection with replacement like in Bagging, but the probability is affected by the “weight” assigned to an example.
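One round of instance reweighting can be sketched as follows, in the style of the AdaBoost.M1 update. The function names and the four-instance example are assumptions for illustration. Correctly classified instances are scaled down by e / (1 - e), where e is the weighted error, and the weights are then renormalized, so the amount of reweighting depends on the extent of the errors. The resampling variant draws a sample with replacement using those weights as selection probabilities.

```python
import math
import random


def adaboost_reweight(weights, correct):
    """Return (new instance weights, voting weight of the model just trained)."""
    total = sum(weights)
    error = sum(w for w, ok in zip(weights, correct) if not ok) / total
    if error == 0 or error >= 0.5:
        raise ValueError("boosting stops when the error is 0 or at least 0.5")
    factor = error / (1 - error)
    # Shrink the weights of the instances this model got right ...
    updated = [w * factor if ok else w for w, ok in zip(weights, correct)]
    # ... then rescale so the total weight is unchanged.
    scale = total / sum(updated)
    return [w * scale for w in updated], -math.log(factor)


def resample(instances, weights, rng):
    """Resampling variant: select with replacement, proportionally to weight."""
    return rng.choices(instances, weights=weights, k=len(instances))


# Hypothetical round: four equally weighted instances, the last one misclassified.
weights, vote = adaboost_reweight([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
print(weights, vote)
sample = resample(["a", "b", "c", "d"], weights, random.Random(0))
```

After this round the single misclassified instance carries half of the total weight, so the next model in the sequence is pushed to get it right.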
More about Boosting. The more iterations, the more confident the trained classifier will be in its predictions (since it will have more experts voting). This is true even beyond the point where the error on the training data goes down to 0. Because of that, it might be helpful to have a validation set for tuning. On the other hand, sometimes Boosting overfits. That’s another reason why it is helpful to have a validation set. Boosting can turn a weak classifier into a strong classifier.
Why does Boosting work? You can learn a very complex model all at once, or you can learn a sequence of simpler models. When you combine the simple models, you get a more complex model. The advantage is that at each stage, the search is more constrained. It is sort of like a “divide-and-conquer” approach.
Boosting and Additive Regression. Boosting is a form of forward, stagewise, additive modeling. LogitBoost is like AdaBoost except that it uses a regression model as the base classifier, whereas AdaBoost uses a classification model.
Boosting and Additive Regression. Additive regression is when you: (1) train a regression equation, (2) then train another to predict the residuals, (3) then another, and so on, (4) and then add the predictions together. With additive regression, the more iterations, the better you do on the training data, but you might overfit. You can get around this with cross-validation. You can also reduce the chance of overfitting by decreasing the size of the increment each time, but the run time is slower. This is the same idea as the momentum and learning rate parameters in multi-layer perceptrons.
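The four steps can be sketched in Python with a regression stump as the base model. The stump learner, the `shrinkage` parameter name, and the toy data are assumptions for this illustration. Each new stump is fit to the current residuals, and only a fraction of its prediction is added, which is the "smaller increment" idea from the slide.

```python
def fit_stump(xs, residuals):
    """Regression stump: best single split, predicting the mean residual per side."""
    best = None
    for t in sorted(set(xs))[1:]:  # skipping the minimum keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm


def additive_regression(xs, ys, iterations, shrinkage=0.5):
    """Fit stumps to successive residuals and add their shrunken predictions."""
    models, residuals = [], list(ys)
    for _ in range(iterations):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        residuals = [r - shrinkage * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(shrinkage * m(x) for m in models)


def squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data: y = x squared on ten points.
xs = list(range(10))
ys = [x * x for x in xs]
one_round = additive_regression(xs, ys, iterations=1)
many_rounds = additive_regression(xs, ys, iterations=20)
print(squared_error(one_round, xs, ys), squared_error(many_rounds, xs, ys))
```

Training error keeps falling as iterations are added, which is why the number of iterations should be chosen with cross-validation or a held-out set rather than on the training data.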
Stacking. Stacking combines the predictions of multiple learning methods over the same data, rather than manipulating the training data as in bagging and boosting. Use several different learners to add labels to your data using cross-validation. Then train a meta-learner to make an “intelligent guess” based on the pattern of predictions it sees. The meta-learner can usually be a simple algorithm.
Stacking. A more careful option is to train the level 0 classifiers on the training data and train the meta-learner on validation data. The trained model will make predictions about novel examples by first applying the level 0 classifiers to the test data and then applying the meta-learner to those labels.
Error Correcting Output Codes. Instead of training 4 classifiers (one per class, as in one-versus-all), you train 7. Look at the pattern of results and pick the class with the most similar pattern (this avoids ad hoc tie breakers). So if one classifier makes a mistake, you can usually compensate for it with the others. [Table: classes A, B, C, D with their One Versus All codes and their Error Correcting Codes]
Error Correcting Output Codes. Because the classifiers are making different comparisons, they will make errors in different places. It’s like training subclassifiers to make individual pairwise comparisons to resolve conflicts, but it always trains models on all of the data rather than part. A good error correcting code has good row separation and column separation (so you need at least 4 class distinctions before you can achieve this). Separation is computed using Hamming distance.
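Decoding by Hamming distance can be sketched in Python. The code words below are an assumption: the one-versus-all code is the obvious one for four classes, and the 7-bit code is a commonly used example for four classes, not a transcription of the table on the slide. Each bit position corresponds to one trained binary classifier, and the predicted class is the one whose code word is closest to the classifiers' outputs.

```python
from itertools import combinations

# Assumed example code words for four classes (one bit per binary classifier).
ONE_VS_ALL = {"a": "1000", "b": "0100", "c": "0010", "d": "0001"}
ERROR_CORRECTING = {"a": "1111111", "b": "0000111", "c": "0011001", "d": "0101010"}


def hamming(u, v):
    """Number of positions where two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(u, v))


def min_row_separation(code):
    """Smallest Hamming distance between the code words of any two classes."""
    return min(hamming(code[p], code[q]) for p, q in combinations(code, 2))


def decode(outputs, code):
    """Pick the class whose code word is nearest to the classifier outputs."""
    return min(code, key=lambda cls: hamming(outputs, code[cls]))


print(min_row_separation(ONE_VS_ALL), min_row_separation(ERROR_CORRECTING))
# The second classifier is wrong here, but the nearest code word is still class a.
print(decode("1011111", ERROR_CORRECTING))
```

With a minimum row separation of d, up to (d - 1) // 2 wrong classifiers can be corrected, which is why the 7-bit code above tolerates one error while the one-versus-all code tolerates none.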
Using Error Correcting Codes
Semi-Supervised Learning
Key idea: avoid overfitting to a small amount of labeled data by leveraging a lot of unlabeled data
Using Unlabeled Data. If you have a small amount of labeled data and a large amount of unlabeled data, you can use a type of bootstrapping to learn a model that exploits regularities in the larger set of data. The stable regularities might be easier to spot in the larger set than in the smaller set, so you are less likely to overfit your labeled data. This draws on concepts from Clustering! Clustering shows you where the natural breaks are in your data.
Expectation maximization approach. Train a model on the labeled data. Apply the model to the unlabeled data. Train a model on the newly labeled data. You can use a cross-validation approach to reassign labels to the same data from this new trained model. You can keep doing this iteratively until the model converges. The probabilities on the labels assign a weight to each training example. If you consider hand labeled data to have a score of 100%, then as your amount of hand labeled data increases, your unlabeled data will have less and less influence over the target model. This maximizes the expectation of correct classification.
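The loop above can be sketched in Python. This is a simplified, hard-label version: it does not weight examples by label probability as the slide describes, and the nearest-centroid learner, the function names, and the toy data are all assumptions for illustration. It trains on the labeled data, labels the unlabeled data, retrains on everything, and repeats until the assigned labels stop changing.

```python
def train_centroids(examples):
    """Nearest-centroid model: the mean x value of each class."""
    sums, counts = {}, {}
    for x, label in examples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def classify(centroids, x):
    """Assign x to the class with the closest centroid."""
    return min(sorted(centroids), key=lambda label: abs(centroids[label] - x))


def self_train(labeled, unlabeled, max_iterations=20):
    """Iteratively relabel the unlabeled data and retrain until the labels converge."""
    model = train_centroids(labeled)
    guesses = None
    for _ in range(max_iterations):
        new_guesses = [classify(model, x) for x in unlabeled]
        if new_guesses == guesses:
            break  # labels are stable, so the model has converged
        guesses = new_guesses
        model = train_centroids(labeled + list(zip(unlabeled, guesses)))
    return model, guesses


# Hypothetical data: two hand labeled points and eight unlabeled ones.
labeled = [(1.0, "neg"), (9.0, "pos")]
unlabeled = [0.5, 1.5, 2.0, 2.5, 7.5, 8.0, 8.5, 9.5]
model, guesses = self_train(labeled, unlabeled)
print(model, guesses)
```

A probabilistic version would keep a class distribution for each unlabeled example and use it as an instance weight when retraining, with hand labeled examples fixed at full weight.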
Doing Semi-Supervised Learning. Not built into Weka! Set up the labeled data as usual. In the unlabeled data, the class value is always ?. Create one whole labeled set of data. Set up the Explorer to output predictions. Run the classifier in the Explorer with “Use supplied test set”. You can then add the predictions to the unlabeled data. Make one large dataset with the original labeled data and the newly labeled data. Then, create train/test pairs so you can re-estimate the labels.
Built into TagHelper Tools! Unlabeled examples have class ? Turn on Self-training
Co-training. Train two different models based on a few labeled examples. Each model is learning the same labels but using different features. Use each of these to label the unlabeled data. For each approach, take the example most confidently labeled negative and the example most confidently labeled positive and add them to the labeled data. Now repeat the process until all of the data is labeled.
Co-training. Co-training is better than EM for data that truly has two independent feature sets (like content versus links for web pages). Co-EM combines the two approaches: use labeled data to train a model with approach A, then use approach B to learn those labels and assign them to the data, then use A again, and pass back and forth until convergence. It probabilistically re-estimates the labels on all data on each iteration.
What Makes Good Applied Machine Learning Work Based on Bootstrapping and Co-training? Determining what are good “alternative views” on your data. It involves all of the same issues as simply applying classifiers: What features do you have available? How will you select subsets of these? Where will you get your labeled data from? What is the quality of this labeling?
Take Home Message. Noise and instability are not always bad! Increase stability in classification using “multiple views”. Ensemble methods use noise to get a “broader” view of your data. Semi-supervised learning gets a “broader view” of your data by leveraging regularities found in a larger, unlabeled set of data.