1
Machine Learning in Practice Lecture 26
Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute
2
Plan for the day
- Announcements
  - Readings for the next 3 lectures are on Blackboard
- Mid-term Review
- Questions?
3
Locally Optimal Solutions
4
What do we learn from this?
- No algorithm is guaranteed to find the globally optimal solution.
- Some algorithms, or variations on an algorithm, may do better on one data set simply because of where in the space they started.
- Instability can be exploited: noise can put you in a different starting place, and different views of the same data are useful.
- When you tune, you need to carefully avoid overfitting to flukes in your data.
5
Optimization
6
Optimizing Parameter Settings
[Diagram: folds 1-5 of the data, with portions labeled Train, Validation, and Test.]
This approach assumes that you want to estimate the generalization you will get from your learning and tuning approach together. If you just want to know the best performance you can get on *this* set by tuning, you can just use standard cross-validation.
7
Overview of Optimization
Stage 1: Estimate tuned performance
- On each fold, test all versions of the algorithm over the training data to find the optimal one for that fold
- Train a model with the optimal setting over the training data
- Apply that model to the testing data for that fold
- Do this for all folds and average across folds
Stage 2: Find the optimal settings over the whole set
- Test each version of the algorithm using cross-validation over the whole set
- Pick the one that works best
- But ignore the performance value you get!
Stage 3: Train the optimal model over the whole set
8
Overview of Optimization
Stage 1: Estimate tuned performance
- On each fold, test all versions of the algorithm over the training data to find the optimal one for that fold
- Train a model with the optimal setting over the training data
- Apply that model to the testing data for that fold
- Do this for all folds and average across folds
Stage 1 tells you how well the optimized model you will train in Stage 3 over the whole set will do on a new data set.
9
Overview of Optimization
Stage 2: Find the optimal settings over the whole set
- Test each version of the algorithm using cross-validation over the whole set
- Pick the one that works best
- But ignore the performance value you get!
Stage 3: Train the optimal model over the whole set
The result of Stage 3 is the trained, optimized model that you will use!
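To make the three stages concrete, here is a minimal sketch using the Weka Java API, with J48's confidence factor as the parameter being tuned. The data file name is a placeholder, and the candidate settings follow the running example in this lecture; treat this as an illustration of the procedure, not the exact setup shown on the slides.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NestedCV {
    // Candidate settings for J48's confidence factor (-C), as in the lecture example
    static final float[] SETTINGS = {0.1f, 0.25f, 0.5f, 0.75f};

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));
        data.stratify(10);

        // Stage 1: estimate tuned performance with an outer 10-fold cross-validation
        double sum = 0;
        for (int fold = 0; fold < 10; fold++) {
            Instances train = data.trainCV(10, fold, new Random(1));
            Instances test  = data.testCV(10, fold);

            // Inner loop: pick the best setting using cross-validation on the training data only
            float best = SETTINGS[0];
            double bestAcc = -1;
            for (float c : SETTINGS) {
                J48 candidate = new J48();
                candidate.setConfidenceFactor(c);
                Evaluation inner = new Evaluation(train);
                inner.crossValidateModel(candidate, train, 10, new Random(1));
                if (inner.pctCorrect() > bestAcc) { bestAcc = inner.pctCorrect(); best = c; }
            }

            // Train with the winning setting on this fold's training data, test on its held-out data
            J48 tuned = new J48();
            tuned.setConfidenceFactor(best);
            tuned.buildClassifier(train);
            Evaluation outer = new Evaluation(train);
            outer.evaluateModel(tuned, test);
            sum += outer.pctCorrect();
        }
        System.out.println("Stage 1 estimate of tuned performance: " + (sum / 10) + "%");

        // Stage 2: pick the best setting over the WHOLE set (ignore the performance number here)
        float best = SETTINGS[0];
        double bestAcc = -1;
        for (float c : SETTINGS) {
            J48 candidate = new J48();
            candidate.setConfidenceFactor(c);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(candidate, data, 10, new Random(1));
            if (eval.pctCorrect() > bestAcc) { bestAcc = eval.pctCorrect(); best = c; }
        }

        // Stage 3: train the final, optimized model over the whole set -- this is the model you use
        J48 finalModel = new J48();
        finalModel.setConfidenceFactor(best);
        finalModel.buildClassifier(data);
    }
}
```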
10
Optimization in Weka
- Divide your data into 10 train/test pairs
- Tune parameters using cross-validation on the training set (this is the inner loop)
- Use those optimized settings on the corresponding test set
- Note that you may have a different set of parameter settings for each of the 10 train/test pairs
- You can do the optimization in the Experimenter
11
Train/Test Pairs
Use the StratifiedRemoveFolds filter
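The slide shows this step in the Explorer GUI; the same train/test pairs can be produced with the Weka API. The sketch below is illustrative (the file name, seed, and choice of 10 folds are assumptions), not a transcription of the slide's clicks.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.StratifiedRemoveFolds;

public class MakeTrainTestPairs {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");    // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        for (int fold = 1; fold <= 10; fold++) {             // this filter numbers folds from 1
            Instances test  = selectFold(data, fold, false); // keep only fold N
            Instances train = selectFold(data, fold, true);  // invert: keep everything except fold N
            // ... save train/test out as e.g. trainN.arff / testN.arff for use in the Experimenter
        }
    }

    static Instances selectFold(Instances data, int fold, boolean invert) throws Exception {
        StratifiedRemoveFolds filter = new StratifiedRemoveFolds();
        filter.setNumFolds(10);
        filter.setFold(fold);
        filter.setInvertSelection(invert);
        filter.setSeed(1);               // same seed for every fold so the pairs line up
        filter.setInputFormat(data);
        return Filter.useFilter(data, filter);
    }
}
```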
12
Setting Up for Optimization
Prepare to save the results. Load in the training sets for all folds. We’ll use cross-validation within the training folds to do the optimization.
13
What are we optimizing? Let’s optimize the confidence factor.
Let’s try 0.1, 0.25, 0.5, and 0.75.
14
Add Each Algorithm to Experimenter Interface
15
Look at the Results
Note that the optimal setting varies across folds.
16
Apply the optimized settings on each fold
Performance on Test1 using the optimized settings from Train1
17
Using CVParameterSelection
18
Using CVParameterSelection
19
Using CVParameterSelection
You have to know what the command-line options look like. You can find them online or in the Experimenter. Don’t forget to click Add!
20
Using CVParameterSelection
Best setting over whole set
21
Using CVParameterSelection
Tuned performance.
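The same CVParameterSelection setup can also be scripted against the Weka API. The exact parameter string used on the slides is in a screenshot we cannot reproduce here, so the one below ("C 0.1 0.75 4", sweeping J48's -C option from 0.1 to 0.75 in 4 steps) is an illustrative guess; the data file name is a placeholder.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.CVParameterSelection;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class TuneConfidenceFactor {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");    // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // Sweep J48's -C option from 0.1 to 0.75 in 4 evenly spaced steps
        CVParameterSelection tuner = new CVParameterSelection();
        tuner.setClassifier(new J48());
        tuner.addCVParameter("C 0.1 0.75 4");

        // "Best setting over the whole set": build on all the data and report the winning options
        tuner.buildClassifier(data);
        System.out.println("Best options: " + Utils.joinOptions(tuner.getBestClassifierOptions()));

        // "Tuned performance": cross-validate the whole tuning procedure on a fresh wrapper,
        // so the reported number is not biased by the selection step above
        CVParameterSelection wrapped = new CVParameterSelection();
        wrapped.setClassifier(new J48());
        wrapped.addCVParameter("C 0.1 0.75 4");
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(wrapped, data, 10, new Random(1));
        System.out.println("Tuned performance estimate: " + eval.pctCorrect() + "%");
    }
}
```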
22
Non-linearity in Support Vector Machines
23
Maximum Margin Hyperplanes
The maximum margin hyperplane is computed by taking the perpendicular bisector of the shortest line that connects the two classes’ convex hulls.
24
Maximum Margin Hyperplanes
The maximum margin hyperplane is computed by taking the perpendicular bisector of the shortest line that connects the two classes’ convex hulls. Note that the maximum margin hyperplane depends only on the support vectors, which should be relatively few in comparison with the total set of data points.
25
“The Kernel Trick”: if your data is not linearly separable, a kernel function implicitly maps it into a higher-dimensional feature space where it may become separable.
Note that “the kernel trick” can be applied to other algorithms, like perceptron learners, but they will not necessarily learn the maximum margin hyperplane.
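In Weka this amounts to giving SMO a non-linear kernel. A minimal sketch, assuming a recent Weka version where SMO exposes setKernel and PolyKernel exposes setExponent; the data file name is a placeholder.

```java
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NonLinearSMO {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // Exponent 1 is a plain linear SVM; an exponent > 1 gives a non-linear decision boundary
        PolyKernel kernel = new PolyKernel();
        kernel.setExponent(2.0);

        SMO svm = new SMO();
        svm.setKernel(kernel);
        svm.buildClassifier(data);
    }
}
```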
26
An example of a polynomial kernel function
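The slide’s formula does not survive in this text version, so here is one commonly used polynomial kernel, K(x, y) = (x · y + 1)^d, written out as code; it may not be the exact function shown on the slide.

```java
public class PolyKernelDemo {
    // One common polynomial kernel: K(x, y) = (x . y + 1)^d
    static double polyKernel(double[] x, double[] y, int degree) {
        double dot = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
        }
        return Math.pow(dot + 1, degree);
    }

    public static void main(String[] args) {
        double[] a = {1, 2};
        double[] b = {3, 4};
        // Degree 2 corresponds to an implicit feature space containing the original
        // attributes, their squares, and their pairwise products
        System.out.println(polyKernel(a, b, 2));   // (1*3 + 2*4 + 1)^2 = 144.0
    }
}
```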
27
Thought Question! What is the connection between the meta-features we have been talking about under feature space design and kernel functions?
28
Remember: Use just as much power as you need, and no more
29
Similarity
30
What does it mean for two vectors to be similar?
31
What does it mean for two vectors to be similar?
Euclidean distance, if there are k attributes:
sqrt((a1 - b1)^2 + (a2 - b2)^2 + ... + (ak - bk)^2)
For nominal attributes, the difference is 0 when the values are the same and 1 otherwise.
A common policy for missing values is that if either or both of the values being compared are missing, they are treated as different.
32
What does it mean for two vectors to be similar?
Cosine similarity = Dot(A, B) / (Len(A) * Len(B))
= (a1*b1 + a2*b2 + ... + an*bn) / (sqrt(a1^2 + a2^2 + ... + an^2) * sqrt(b1^2 + b2^2 + ... + bn^2))
33
What does it mean for two vectors to be similar?
Cosine similarity rates B and A as more similar than C and A. Euclidean distance rates C and A as closer than B and A.
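The diagram with the vectors A, B, and C is lost in this text version, so here is a small made-up example of the same situation: B points in the same direction as A but is much longer, while C is close to A but points in a different direction.

```java
public class SimilarityDemo {
    // Euclidean distance over k numeric attributes
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity = Dot(A, B) / (Len(A) * Len(B))
    static double cosine(double[] a, double[] b) {
        double dot = 0, lenA = 0, lenB = 0;
        for (int i = 0; i < a.length; i++) {
            dot  += a[i] * b[i];
            lenA += a[i] * a[i];
            lenB += b[i] * b[i];
        }
        return dot / (Math.sqrt(lenA) * Math.sqrt(lenB));
    }

    public static void main(String[] args) {
        double[] a = {1, 1};
        double[] b = {3, 3};   // same direction as A, but much longer
        double[] c = {1, 2};   // close to A, but pointing in a different direction

        System.out.printf("cosine(A,B)=%.3f  cosine(A,C)=%.3f%n", cosine(a, b), cosine(a, c));
        System.out.printf("euclid(A,B)=%.3f  euclid(A,C)=%.3f%n", euclidean(a, b), euclidean(a, c));
        // Cosine similarity rates B and A as more similar (1.000 vs. 0.949),
        // while Euclidean distance rates C and A as closer (1.000 vs. 2.828).
    }
}
```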
34
What does it mean for two vectors to be similar?
Cosine similarity rates B and A as more similar than C and A. Euclidean distance also rates B and A as closer than C and A.
35
Remember! Different similarity metrics will lead to a different grouping of your instances! Think in terms of neighborhoods of instances…
36
Feature Selection
37
Why do irrelevant features hurt performance?
- Divide-and-conquer approaches have the problem that the further down in the tree you get, the less data you are paying attention to, so it’s easy for the classifier to get confused.
- Naïve Bayes does not have this problem, but it has other problems, as we have discussed.
- SVM is relatively good at ignoring irrelevant attributes, but it can still suffer. Also, it’s very computationally expensive with large attribute spaces.
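The slide only explains why irrelevant attributes hurt; it does not show a fix. One common remedy in Weka (my addition, not from the slides) is to wrap the learner in AttributeSelectedClassifier so that selection is redone inside each training fold; the evaluator, search method, and number of attributes kept below are illustrative choices.

```java
import java.util.Random;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FeatureSelectionDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // Rank attributes by information gain and keep the top 20 (illustrative numbers)
        Ranker search = new Ranker();
        search.setNumToSelect(20);

        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(new InfoGainAttributeEval());
        asc.setSearch(search);
        asc.setClassifier(new J48());

        // Selection happens inside each training fold, so the estimate is not optimistically biased
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(asc, data, 10, new Random(1));
        System.out.println("Accuracy with feature selection: " + eval.pctCorrect() + "%");
    }
}
```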
38
Take Home Message Good Luck!