1 Machine Learning in Practice Lecture 21 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute

2 Plan for the Day Announcements (Questions? No quiz and no new assignment today); Weka helpful hints; Clustering; Advanced Statistical Models; More on Optimization and Tuning

3 Weka Helpful Hints

4 Remember SMOreg vs SMO…

5 Setting the Exponent in SMO * Note that an exponent larger than 1.0 means you are using a non-linear kernel.
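
A minimal sketch of the same setting through the Weka Java API (the file name data.arff and the class being the last attribute are assumptions; the Explorer's object editor exposes the same options):

```java
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SmoKernelDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical dataset; the class is assumed to be the last attribute.
        Instances data = DataSource.read("data.arff");
        data.setClassIndex(data.numAttributes() - 1);

        SMO smo = new SMO();
        PolyKernel kernel = new PolyKernel();
        kernel.setExponent(2.0);   // exponent > 1.0 means a non-linear (polynomial) kernel
        smo.setKernel(kernel);
        smo.buildClassifier(data);
        System.out.println(smo);
    }
}
```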

6 Clustering

7 What is clustering? Finding natural groupings of your data Not supervised! No class attribute. Usually only works well if you have a huge amount of data!

8 InfoMagnets: Interactive Text Clustering

9 What does clustering do? Finds natural breaks in your data If there are obvious clusters, you can do this with a small amount of data If you have lots of weak predictors, you need a huge amount of data to make it work

11 Clustering in Weka * You can pick which clustering algorithm you want to use and how many clusters you want.
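
As a rough sketch, the same choices (which clusterer, how many clusters) look like this through the Weka Java API; the file name and k = 3 are assumptions:

```java
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");  // assumed file; no class index is set here

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(3);    // you pick the number of clusters
        km.buildClusterer(data);

        System.out.println(km);  // prints the centroids and cluster sizes
    }
}
```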

12 Clustering in Weka (screenshot callouts: “Click here”; “Select the class attribute”) * Clustering is unsupervised, so you want it to ignore your class attribute!

13 Clustering in Weka * You can evaluate the clustering in comparison with class attribute assignments
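
One way to do that comparison programmatically is the classes-to-clusters pattern sketched below; it assumes data.arff with the class as the last attribute, and it builds the clusterer on a copy of the data with the class removed:

```java
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClassesToClusters {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");       // assumed file
        data.setClassIndex(data.numAttributes() - 1);

        // Build the clusterer on the data WITHOUT the class attribute.
        Remove remove = new Remove();
        remove.setAttributeIndices("" + (data.classIndex() + 1));
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(3);
        km.buildClusterer(noClass);

        // Compare the clusters against the class labels (classes-to-clusters evaluation).
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}
```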

14 Adding a Cluster Feature

15 * You should set it explicitly to ignore the class attribute * Set the pulldown menu to No Class
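
A sketch of the same idea via the AddCluster filter in the Java API (the file name, k = 3, and the class being the last attribute are assumptions); the key point is telling the filter to ignore the class attribute:

```java
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddCluster;

public class AddClusterFeature {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");   // assumed file; class is the last attribute

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(3);

        AddCluster addCluster = new AddCluster();
        addCluster.setClusterer(km);
        addCluster.setIgnoredAttributeIndices("last");    // keep the clusterer from seeing the class
        addCluster.setInputFormat(data);

        Instances withCluster = Filter.useFilter(data, addCluster);
        System.out.println("New attribute: " + withCluster.attribute(withCluster.numAttributes() - 1));
    }
}
```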

16 Why add cluster features? (figure-only slides: instances labeled Class 1 and Class 2 grouped into clusters)

18 Clustering with Weka K-means and FarthestFirst: disjoint flat clusters EM: statistical approach Cobweb: hierarchical clustering

19 K-Means You choose the number of clusters you want (you might need to play with this by looking at what kind of clusters you get out) K initial points are chosen randomly as cluster centroids All points are assigned to the centroid they are closest to Once the data is clustered, a new centroid is computed for each cluster as the mean of the points assigned to it

20 K-Means Then clustering occurs again using the new centroids This continues until no changes in clustering take place Clusters are flat and disjoint
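
To make those steps concrete, here is a small from-scratch sketch for 2-D points (the toy data and k = 2 are made up); in practice Weka's SimpleKMeans does this for you:

```java
import java.util.Arrays;

public class TinyKMeans {
    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.5, 2}, {1, 1.5}, {8, 8}, {8.5, 9}, {9, 8}};  // toy data
        int k = 2;

        // 1. Choose k initial centroids (here simply the first k points; Weka picks them randomly).
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++) centroids[i] = points[i].clone();

        int[] assign = new int[points.length];
        boolean changed = true;
        while (changed) {                          // 4. Repeat until no assignments change.
            changed = false;
            // 2. Assign every point to the closest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist(points[p], centroids[c]) < dist(points[p], centroids[best])) best = c;
                if (assign[p] != best) { assign[p] = best; changed = true; }
            }
            // 3. Recompute each centroid as the mean of the points assigned to it.
            for (int c = 0; c < k; c++) {
                double sx = 0, sy = 0; int n = 0;
                for (int p = 0; p < points.length; p++)
                    if (assign[p] == c) { sx += points[p][0]; sy += points[p][1]; n++; }
                if (n > 0) { centroids[c][0] = sx / n; centroids[c][1] = sy / n; }
            }
        }
        System.out.println("assignments: " + Arrays.toString(assign));
        System.out.println("centroids:   " + Arrays.deepToString(centroids));
    }

    // Squared Euclidean distance (enough for finding the nearest centroid).
    static double dist(double[] a, double[] b) {
        return (a[0] - b[0]) * (a[0] - b[0]) + (a[1] - b[1]) * (a[1] - b[1]);
    }
}
```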

21 K-Means

22-25 (figure-only slides: step-by-step illustration of k-means assigning points to the nearest centroid and re-computing the centroids)

26 EM: Expectation Maximization Does not base clustering on distance from a centroid Instead clusters based on the probability of cluster membership Overlapping clusters rather than disjoint clusters: every instance belongs to every cluster with some probability

27 EM: Expectation Maximization Two important kinds of probability distributions: each cluster has an associated distribution of attribute values for each attribute (based on the extent to which instances are in the cluster), and each instance has a certain probability of being in each cluster (based on how close its attribute values are to typical attribute values for the cluster)

28 Probabilities of Cluster Membership Initialized 65% B 35% A 25% B 75% A

29 Central Tendencies Computed Based on Cluster Membership A B 65% B 35% A 25% B 75% A

30 Cluster Membership Re-Assigned Probabilistically A B 75% B 25% A 35% B 65% A

31 Central tendencies Re-Assigned Based on Membership A B 75% B 25% A 35% B 65% A

32 Cluster Membership Reassigned A B 60% B 40% A 45% B 55% A

33 EM: Expectation Maximization Iterative like k-means – but guided by a different computation Considered more principled than k-means, but much more computationally expensive Like k-means, you pick the number of clusters you want
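
A minimal sketch of EM through the Weka API (the file name and k = 2 are assumptions; the data is assumed to have no class attribute, or to have it ignored as above):

```java
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EmDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");   // assumed file

        EM em = new EM();
        em.setNumClusters(2);      // like k-means you pick k; with -1 EM picks it by cross-validation
        em.buildClusterer(data);

        // Every instance belongs to every cluster with some probability.
        double[] probs = em.distributionForInstance(data.instance(0));
        for (int c = 0; c < probs.length; c++)
            System.out.printf("cluster %d: %.3f%n", c, probs[c]);
    }
}
```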

34 Advanced Statistical Models

35 Quick View of Bayesian Networks Normally with Naïve Bayes you have simple conditional probabilities, one per attribute: P[Humidity = high | Play = yes] (figure: network over Windy, Play, Outlook, Humidity, Temperature)

36 Quick View of Bayesian Networks With Bayes Nets, there are interactions between attributes: P[Humidity = high | Play = yes, Temperature = hot] Similar likelihood computation for an instance: you will still have one conditional probability per attribute to multiply together, but they won’t all be simple, because Humidity is related jointly to temperature and play

37 Quick View of Bayesian Networks Learning algorithm needs to find the shape of the network Probabilities come from counts Two stages – similar idea to “kernel methods”
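
A hedged sketch comparing Naïve Bayes with a Bayes net in the Weka API (the dataset path, the K2 structure search, and the two-parent limit are assumptions rather than the lecture's exact setup):

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.bayes.net.search.local.K2;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BayesNetVsNaiveBayes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");  // assumed path to the weather data
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();   // each attribute depends on the class only

        BayesNet bn = new BayesNet();       // stage 1: search for the shape of the network...
        K2 search = new K2();
        search.setMaxNrOfParents(2);        // e.g. Humidity may depend on Play AND Temperature
        bn.setSearchAlgorithm(search);      // ...stage 2: the probabilities come from counts

        for (Classifier clf : new Classifier[]{nb, bn}) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(clf, data, 10, new Random(1));
            System.out.printf("%s: %.1f%% correct%n", clf.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```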

38 Doing Optimization in Weka

39 Optimizing Parameter Settings (figure: each fold split into Train / Validation / Test sets) Iterate over settings; compare performance over the validation set and pick the optimal setting; then test on the test set. Use a modified form of cross-validation, or you can have a hold-out validation set you use for all folds. Still N folds, but each fold has less training data than with standard cross-validation.

40 Remember! Cross-validation is for estimating your performance If you want the model that achieves that estimated performance, train over the whole set Same principle for optimization: estimate your tuned performance using cross-validation with an inner loop for optimization; when you build the model over the whole set, use the settings that work best in cross-validation over the whole set

41 Optimization in Weka Divide your data into 10 train/test pairs: tune parameters using cross-validation on the training set (this is the inner loop), then use those optimized settings on the corresponding test set. Note that you may have a different set of parameter settings for each of the 10 train/test pairs. You can do the optimization in the Experimenter.

42 Train/Test Pairs * Use the StratifiedRemoveFolds filter
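
A sketch of generating the ten pairs with that filter via the API (the file name and the 10-fold setup are assumptions):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.StratifiedRemoveFolds;

public class MakeTrainTestPairs {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");   // assumed file
        data.setClassIndex(data.numAttributes() - 1);

        for (int fold = 1; fold <= 10; fold++) {
            Instances test  = split(data, fold, false);   // keep only the selected fold  -> TestN
            Instances train = split(data, fold, true);    // keep everything else         -> TrainN
            System.out.printf("fold %d: %d train / %d test%n",
                    fold, train.numInstances(), test.numInstances());
            // Typically you would save these as TrainN.arff / TestN.arff for the Experimenter.
        }
    }

    static Instances split(Instances data, int fold, boolean invert) throws Exception {
        StratifiedRemoveFolds f = new StratifiedRemoveFolds();
        f.setNumFolds(10);
        f.setFold(fold);
        f.setInvertSelection(invert);   // true = output all folds EXCEPT the selected one
        f.setInputFormat(data);
        return Filter.useFilter(data, f);
    }
}
```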

43 Setting Up for Optimization * Prepare to save the results * Load in training sets for all folds * We’ll use cross-validation within the training folds to do the optimization

44 What are we optimizing? Let’s optimize the confidence factor. Let’s try .1, .25, .5, and .75
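
One way to run that sweep as an inner loop on a single training fold, sketched with J48's confidence factor (Train1.arff is an assumed name for one of the training folds):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TuneConfidenceFactor {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("Train1.arff");   // assumed name of one training fold
        train.setClassIndex(train.numAttributes() - 1);

        double[] candidates = {0.1, 0.25, 0.5, 0.75};
        double bestC = candidates[0], bestAcc = -1;
        for (double c : candidates) {
            J48 tree = new J48();
            tree.setConfidenceFactor((float) c);
            // Inner loop: cross-validation WITHIN the training fold only.
            Evaluation eval = new Evaluation(train);
            eval.crossValidateModel(tree, train, 10, new Random(1));
            System.out.printf("C = %.2f -> %.1f%% correct%n", c, eval.pctCorrect());
            if (eval.pctCorrect() > bestAcc) { bestAcc = eval.pctCorrect(); bestC = c; }
        }
        System.out.println("Best confidence factor on Train1: " + bestC);
    }
}
```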

45 Add Each Algorithm to Experimenter Interface

46 Look at the Results * Note that optimal setting varies across folds.

47 Apply the optimized settings on each fold * Performance on Test1 using optimized settings from Train1
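
A sketch of that last step (the file names and the winning setting of 0.5 are assumptions; use whatever value the inner loop picked for this fold):

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ApplyTunedSetting {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("Train1.arff");   // assumed fold file names
        Instances test  = DataSource.read("Test1.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        J48 tree = new J48();
        tree.setConfidenceFactor(0.5f);      // the setting that won the inner loop on Train1
        tree.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);      // performance on Test1 with Train1's tuned setting
        System.out.printf("Test1 accuracy: %.1f%%%n", eval.pctCorrect());
    }
}
```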

48 What if the optimization requires work by hand? Do you see a problem with the following? Do feature selection over the whole set to see which words are highly ranked; create user-defined features with subsets of these to see which ones look good; add those to your feature space and do the classification

49 What if the optimization requires work by hand? The problem is that this is just like doing feature selection over your whole data set You will over-estimate your performance So what’s a better way of doing that?

50 What if the optimization requires work by hand? You could set aside a small subset of data. Using that small subset, do the same process. Then use those user-defined features with the other part of the data.

51 Take Home Message Instance based learning and clustering both make use of similarity metrics Clustering can be used to help you understand your data or to add new features to your data Weka provides opportunities to tune all of its algorithms through the object editor You can use the Experimenter to tune the parameter settings when you are estimating your performance using cross-validation

