1
Machine Learning in Practice Lecture 26
Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute
2
Plan for the day
- Announcements
  - Readings for the next 3 lectures are on Blackboard
- Mid-term Review
- Questions?
3
Locally Optimal Solutions
4
What do we learn from this?
- No algorithm is guaranteed to find the globally optimal solution.
- Some algorithms, or variations on an algorithm, may do better on one data set simply because of where in the space they started.
- Instability can be exploited: noise can put you in a different starting place, and different views of the same data are useful.
- When you tune, you need to carefully avoid overfitting to flukes in your data.
5
Optimization
6
Optimizing Parameter Settings
[Diagram: folds 1-5 of the data, with portions labeled Train, Validation, and Test.]
This approach assumes that you want to estimate the generalization you will get from your learning and tuning approach together. If you just want to know the best performance you can get on *this* set by tuning, you can just use standard cross-validation.
7
Overview of Optimization
Stage 1: Estimate tuned performance
- On each fold, test all versions of the algorithm over the training data to find the optimal one for that fold
- Train a model with the optimal setting over the training data
- Apply that model to the testing data for that fold
- Do this for all folds and average across folds
Stage 2: Find the optimal settings over the whole set
- Test each version of the algorithm using cross-validation over the whole set
- Pick the one that works best
- But ignore the performance value you get!
Stage 3: Train the optimal model over the whole set
8
Overview of Optimization
Stage 1: Estimate tuned performance
- On each fold, test all versions of the algorithm over the training data to find the optimal one for that fold
- Train a model with the optimal setting over the training data
- Apply that model to the testing data for that fold
- Do this for all folds and average across folds
Stage 1 tells you how well the optimized model you will train in Stage 3 over the whole set will do on a new data set.
9
Overview of Optimization
Stage 2: Find the optimal settings over the whole set
- Test each version of the algorithm using cross-validation over the whole set
- Pick the one that works best
- But ignore the performance value you get!
Stage 3: Train the optimal model over the whole set
The result of Stage 3 is the trained, optimized model that you will use!
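To make the three stages concrete, here is a minimal sketch using the Weka Java API, with J48's confidence factor as the parameter being tuned. The data file name is a placeholder, and the candidate settings follow the running example in this lecture; treat this as an illustration of the procedure, not the exact setup shown on the slides.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NestedCV {
    // Candidate settings for J48's confidence factor (-C), as in the lecture example
    static final float[] SETTINGS = {0.1f, 0.25f, 0.5f, 0.75f};

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));
        data.stratify(10);

        // Stage 1: estimate tuned performance with an outer 10-fold cross-validation
        double sum = 0;
        for (int fold = 0; fold < 10; fold++) {
            Instances train = data.trainCV(10, fold, new Random(1));
            Instances test  = data.testCV(10, fold);

            // Inner loop: pick the best setting using cross-validation on the training data only
            float best = SETTINGS[0];
            double bestAcc = -1;
            for (float c : SETTINGS) {
                J48 candidate = new J48();
                candidate.setConfidenceFactor(c);
                Evaluation inner = new Evaluation(train);
                inner.crossValidateModel(candidate, train, 10, new Random(1));
                if (inner.pctCorrect() > bestAcc) { bestAcc = inner.pctCorrect(); best = c; }
            }

            // Train with the winning setting on this fold's training data, test on its held-out data
            J48 tuned = new J48();
            tuned.setConfidenceFactor(best);
            tuned.buildClassifier(train);
            Evaluation outer = new Evaluation(train);
            outer.evaluateModel(tuned, test);
            sum += outer.pctCorrect();
        }
        System.out.println("Stage 1 estimate of tuned performance: " + (sum / 10) + "%");

        // Stage 2: pick the best setting over the WHOLE set (ignore the performance number here)
        float best = SETTINGS[0];
        double bestAcc = -1;
        for (float c : SETTINGS) {
            J48 candidate = new J48();
            candidate.setConfidenceFactor(c);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(candidate, data, 10, new Random(1));
            if (eval.pctCorrect() > bestAcc) { bestAcc = eval.pctCorrect(); best = c; }
        }

        // Stage 3: train the final, optimized model over the whole set -- this is the model you use
        J48 finalModel = new J48();
        finalModel.setConfidenceFactor(best);
        finalModel.buildClassifier(data);
    }
}
```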
10
Optimization in Weka
- Divide your data into 10 train/test pairs
- Tune parameters using cross-validation on the training set (this is the inner loop)
- Use those optimized settings on the corresponding test set
- Note that you may have a different set of parameter settings for each of the 10 train/test pairs
- You can do the optimization in the Experimenter
11
Train/Test Pairs
Use the StratifiedRemoveFolds filter
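The slide shows this step in the Explorer GUI; the same train/test pairs can be produced with the Weka API. The sketch below is illustrative (the file name, seed, and choice of 10 folds are assumptions), not a transcription of the slide's clicks.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.StratifiedRemoveFolds;

public class MakeTrainTestPairs {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");    // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        for (int fold = 1; fold <= 10; fold++) {             // this filter numbers folds from 1
            Instances test  = selectFold(data, fold, false); // keep only fold N
            Instances train = selectFold(data, fold, true);  // invert: keep everything except fold N
            // ... save train/test out as e.g. trainN.arff / testN.arff for use in the Experimenter
        }
    }

    static Instances selectFold(Instances data, int fold, boolean invert) throws Exception {
        StratifiedRemoveFolds filter = new StratifiedRemoveFolds();
        filter.setNumFolds(10);
        filter.setFold(fold);
        filter.setInvertSelection(invert);
        filter.setSeed(1);               // same seed for every fold so the pairs line up
        filter.setInputFormat(data);
        return Filter.useFilter(data, filter);
    }
}
```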
12
Setting Up for Optimization
Prepare to save the results. Load in the training sets for all folds. We’ll use cross-validation within the training folds to do the optimization.
13
What are we optimizing? Let’s optimize the confidence factor.
Let’s try 0.1, 0.25, 0.5, and 0.75.
14
Add Each Algorithm to Experimenter Interface
15
Look at the Results
Note that the optimal setting varies across folds.
16
Apply the optimized settings on each fold
Performance on Test1 using the optimized settings from Train1
17
Using CVParameterSelection
18
Using CVParameterSelection
19
Using CVParameterSelection
You have to know what the command-line options look like. You can find them online or in the Experimenter. Don’t forget to click Add!
20
Using CVParameterSelection
Best setting over whole set
21
Using CVParameterSelection
Tuned performance.
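The same CVParameterSelection setup can also be scripted against the Weka API. The exact parameter string used on the slides is in a screenshot we cannot reproduce here, so the one below ("C 0.1 0.75 4", sweeping J48's -C option from 0.1 to 0.75 in 4 steps) is an illustrative guess; the data file name is a placeholder.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.CVParameterSelection;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class TuneConfidenceFactor {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");    // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // Sweep J48's -C option from 0.1 to 0.75 in 4 evenly spaced steps
        CVParameterSelection tuner = new CVParameterSelection();
        tuner.setClassifier(new J48());
        tuner.addCVParameter("C 0.1 0.75 4");

        // "Best setting over the whole set": build on all the data and report the winning options
        tuner.buildClassifier(data);
        System.out.println("Best options: " + Utils.joinOptions(tuner.getBestClassifierOptions()));

        // "Tuned performance": cross-validate the whole tuning procedure on a fresh wrapper,
        // so the reported number is not biased by the selection step above
        CVParameterSelection wrapped = new CVParameterSelection();
        wrapped.setClassifier(new J48());
        wrapped.addCVParameter("C 0.1 0.75 4");
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(wrapped, data, 10, new Random(1));
        System.out.println("Tuned performance estimate: " + eval.pctCorrect() + "%");
    }
}
```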
22
Non-linearity in Support Vector Machines
23
Maximum Margin Hyperplanes
The maximum margin hyperplane is computed by taking the perpendicular bisector of the shortest line that connects the two classes’ convex hulls.
24
Maximum Margin Hyperplanes
The maximum margin hyperplane is computed by taking the perpendicular bisector of the shortest line that connects the two classes’ convex hulls. Note that the maximum margin hyperplane depends only on the support vectors, which should be relatively few in comparison with the total set of data points.
25
“The Kernel Trick”: if your data is not linearly separable, a kernel function implicitly maps it into a higher-dimensional feature space where it may become separable.
Note that “the kernel trick” can be applied to other algorithms, like perceptron learners, but they will not necessarily learn the maximum margin hyperplane.
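In Weka this amounts to giving SMO a non-linear kernel. A minimal sketch, assuming a recent Weka version where SMO exposes setKernel and PolyKernel exposes setExponent; the data file name is a placeholder.

```java
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NonLinearSMO {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // Exponent 1 is a plain linear SVM; an exponent > 1 gives a non-linear decision boundary
        PolyKernel kernel = new PolyKernel();
        kernel.setExponent(2.0);

        SMO svm = new SMO();
        svm.setKernel(kernel);
        svm.buildClassifier(data);
    }
}
```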
26
An example of a polynomial kernel function
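The slide’s formula does not survive in this text version, so here is one commonly used polynomial kernel, K(x, y) = (x · y + 1)^d, written out as code; it may not be the exact function shown on the slide.

```java
public class PolyKernelDemo {
    // One common polynomial kernel: K(x, y) = (x . y + 1)^d
    static double polyKernel(double[] x, double[] y, int degree) {
        double dot = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
        }
        return Math.pow(dot + 1, degree);
    }

    public static void main(String[] args) {
        double[] a = {1, 2};
        double[] b = {3, 4};
        // Degree 2 corresponds to an implicit feature space containing the original
        // attributes, their squares, and their pairwise products
        System.out.println(polyKernel(a, b, 2));   // (1*3 + 2*4 + 1)^2 = 144.0
    }
}
```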
27
Thought Question! What is the connection between the meta-features we have been talking about under feature space design and kernel functions?
28
Remember: Use just as much power as you need, and no more
29
Similarity
30
What does it mean for two vectors to be similar?
31
What does it mean for two vectors to be similar?
Euclidean distance, if there are k attributes:
sqrt((a1 - b1)^2 + (a2 - b2)^2 + ... + (ak - bk)^2)
For nominal attributes, the difference is 0 when the values are the same and 1 otherwise.
A common policy for missing values is that if either or both of the values being compared are missing, they are treated as different.
32
What does it mean for two vectors to be similar?
Cosine similarity = Dot(A, B) / (Len(A) * Len(B))
= (a1*b1 + a2*b2 + ... + an*bn) / (sqrt(a1^2 + a2^2 + ... + an^2) * sqrt(b1^2 + b2^2 + ... + bn^2))
33
What does it mean for two vectors to be similar?
Cosine similarity rates B and A as more similar than C and A. Euclidean distance rates C and A as closer than B and A.
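The diagram with the vectors A, B, and C is lost in this text version, so here is a small made-up example of the same situation: B points in the same direction as A but is much longer, while C is close to A but points in a different direction.

```java
public class SimilarityDemo {
    // Euclidean distance over k numeric attributes
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity = Dot(A, B) / (Len(A) * Len(B))
    static double cosine(double[] a, double[] b) {
        double dot = 0, lenA = 0, lenB = 0;
        for (int i = 0; i < a.length; i++) {
            dot  += a[i] * b[i];
            lenA += a[i] * a[i];
            lenB += b[i] * b[i];
        }
        return dot / (Math.sqrt(lenA) * Math.sqrt(lenB));
    }

    public static void main(String[] args) {
        double[] a = {1, 1};
        double[] b = {3, 3};   // same direction as A, but much longer
        double[] c = {1, 2};   // close to A, but pointing in a different direction

        System.out.printf("cosine(A,B)=%.3f  cosine(A,C)=%.3f%n", cosine(a, b), cosine(a, c));
        System.out.printf("euclid(A,B)=%.3f  euclid(A,C)=%.3f%n", euclidean(a, b), euclidean(a, c));
        // Cosine similarity rates B and A as more similar (1.000 vs. 0.949),
        // while Euclidean distance rates C and A as closer (1.000 vs. 2.828).
    }
}
```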
34
What does it mean for two vectors to be similar?
Cosine similarity rates B and A as more similar than C and A. Euclidean distance also rates B and A as closer than C and A.
35
Remember! Different similarity metrics will lead to a different grouping of your instances! Think in terms of neighborhoods of instances…
36
Feature Selection
37
Why do irrelevant features hurt performance?
- Divide-and-conquer approaches have the problem that the further down in the tree you get, the less data you are paying attention to, so it’s easy for the classifier to get confused.
- Naïve Bayes does not have this problem, but it has other problems, as we have discussed.
- SVM is relatively good at ignoring irrelevant attributes, but it can still suffer. Also, it’s very computationally expensive with large attribute spaces.
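The slide only explains why irrelevant attributes hurt; it does not show a fix. One common remedy in Weka (my addition, not from the slides) is to wrap the learner in AttributeSelectedClassifier so that selection is redone inside each training fold; the evaluator, search method, and number of attributes kept below are illustrative choices.

```java
import java.util.Random;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FeatureSelectionDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // Rank attributes by information gain and keep the top 20 (illustrative numbers)
        Ranker search = new Ranker();
        search.setNumToSelect(20);

        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(new InfoGainAttributeEval());
        asc.setSearch(search);
        asc.setClassifier(new J48());

        // Selection happens inside each training fold, so the estimate is not optimistically biased
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(asc, data, 10, new Random(1));
        System.out.println("Accuracy with feature selection: " + eval.pctCorrect() + "%");
    }
}
```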
38
Take Home Message Good Luck!