
1 Machine Learning in Practice Lecture 20
Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute

2 Plan for the Day
Announcements (questions? quiz feedback)
Similarity
Instance Based Learning

3 Similarity

4 Why should we care? So far we have been investigating machine learning approaches that locate and use important attributes. Now we're going to focus on algorithms that make comparisons between vectors, considering all attributes. Today we'll talk about instance based learning and clustering.

5 What does it mean for two objects to be similar?
How many ways can you group these objects?

6 What does it mean for two objects to be similar?
Clustering by color

7 What does it mean for two objects to be similar?
Clustering by shape

8 What does it mean for two objects to be similar?
Clustering by shape, more fine grained

9 What does it mean for two vectors to be similar?

10 What does it mean for two vectors to be similar?
Remember that just like shapes, vectors are multi-dimensional.

11 Similarity: Things to consider
The distance between any two different values of the same nominal attribute is 1; the distance is 0 if the values are the same. If features are numeric and their ranges differ, the features with bigger ranges will have more influence. Similarity metrics that compare vectors consider all features.

12 What does it mean for two vectors to be similar?
If there are n attributes: sqrt((a1 – b1)^2 + (a2 – b2)^2 + … + (an – bn)^2). For nominal attributes, the difference is 0 when the values are the same and 1 otherwise. A common policy for missing values is that if either or both of the values being compared are missing, they are treated as different.
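As a rough illustration of the distance policy above, here is a minimal sketch in Python; the function name, attribute types, and example values are made up for illustration, not taken from the lecture:

    import math

    def mixed_distance(a, b, nominal):
        # Euclidean-style distance over mixed attributes; None marks a missing value,
        # and `nominal` is the set of attribute indices that are nominal.
        total = 0.0
        for i, (x, y) in enumerate(zip(a, b)):
            if x is None or y is None:
                d = 1.0                      # missing on either side: treat as different
            elif i in nominal:
                d = 0.0 if x == y else 1.0   # nominal: 0 if equal, 1 otherwise
            else:
                d = x - y                    # numeric: plain difference
            total += d * d
        return math.sqrt(total)

    # Example: attributes are (age, color, height), with color (index 1) nominal
    print(mixed_distance([25, "red", 1.8], [30, "blue", None], nominal={1}))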

13 What does it mean for two vectors to be similar?
Cosine similarity = Dot(A,B) / (Len(A) * Len(B)) = (a1*b1 + a2*b2 + … + an*bn) / (sqrt(a1^2 + a2^2 + … + an^2) * sqrt(b1^2 + b2^2 + … + bn^2))
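A small sketch of the same formula in Python (illustrative only):

    import math

    def cosine_similarity(a, b):
        # Dot(A, B) / (Len(A) * Len(B)) for two numeric vectors
        dot = sum(x * y for x, y in zip(a, b))
        len_a = math.sqrt(sum(x * x for x in a))
        len_b = math.sqrt(sum(y * y for y in b))
        return dot / (len_a * len_b)

    print(cosine_similarity([1, 0, 2], [2, 1, 2]))  # 6 / (sqrt(5) * 3), about 0.894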

14 What does it mean for two vectors to be similar?
Cosine similarity rates B and A as more similar than C and A; Euclidean distance rates C and A as closer than B and A. [Figure: vectors A, B, and C]

15 What does it mean for two vectors to be similar?
Cosine similarity rates B and A as more similar than C and A; Euclidean distance also rates B and A as closer than C and A. [Figure: vectors A, B, and C]

16 What does it mean for two vectors to be similar?
Manhattan distance is the sum of the absolute values of the differences: Abs(a1 – b1) + Abs(a2 – b2) + … + Abs(an – bn)
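To make the contrast on slides 14 through 16 concrete, here is a small demonstration using SciPy's distance functions. The vectors A, B, and C are made-up values chosen to reproduce the pattern described above; they are not the vectors from the lecture figures:

    from scipy.spatial.distance import euclidean, cityblock, cosine

    A = [1.0, 0.0]
    B = [10.0, 1.0]   # points in roughly the same direction as A, but much longer
    C = [0.5, 0.9]    # nearby in space, but points in a different direction

    print(1 - cosine(A, B), 1 - cosine(A, C))   # ~0.995 vs ~0.486: cosine says B is closer to A
    print(euclidean(A, B), euclidean(A, C))     # ~9.06 vs ~1.03: Euclidean says C is closer to A
    print(cityblock(A, B), cityblock(A, C))     # 10.0 vs 1.4: Manhattan agrees with Euclidean here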

17 Remember! Different similarity metrics will lead to a different grouping of your instances! Think in terms of neighborhoods of instances…

18 Instance Based Learning

19 Instance Based Learning
Class membership is computed based on similarity. This may be similarity with a previously seen example, or similarity with a “prototypical example” related to a class, called a “centroid”. Sometimes this approach is called “nearest neighbor classification”.

20 Instance Based Learning

21 Instance Based Learning
Rote learning is at the extreme end of instance based representations. A more general form of instance based representation is one where membership is computed based on a similarity measure between a centroid vector and the vector of the example instance. Advantage: it is possible to learn incrementally.

22 Why “Lazy”? Instance based learning is called “lazy” because essentially no work is done when training instances are stored; the effort of generalizing is deferred until a new instance has to be classified.

23 Finding Nearest Neighbors Efficiently
Brute force method: compute the distance between the new vector and every vector in the training set, and pick the one with the smallest distance. Better method: divide the search space and strategically select relevant regions, so you only have to compare against a subset of instances.

24 kD-trees kD-trees partition the space so that nearest neighbors can be found more efficiently Each split takes place along one attribute and splits the examples at the parent node roughly in half Split points chosen in such a way as to keep the tree as balanced as possible (*not* to optimize for accuracy or information gain) Can you guess why you would want the tree to be as balanced as possible? Hint – think about computational complexity of search algorithms

25 Algorithm
[Figure: a kD-tree whose leaf regions are labeled A through G, with a new instance and its approximate nearest neighbor marked]

26 Algorithm
Tweak: sometimes you average over the k nearest neighbors rather than taking the single nearest neighbor.
Tweak: use ball shaped regions rather than rectangles to keep the number of overlapping regions down.
[Figure: the same kD-tree search example as the previous slide]

27 Locally Weighted Learning
Base predictions on models trained specifically for regions within the vector space. (Caveat: this is an over-simplification of what’s happening.) Weighting of examples is accomplished as in cost sensitive classification. This is a similar idea to M5P (learning separate regressions for different regions of the vector space, determined by a path through a decision tree).

28 Locally Weighted Learning
LBR is Bayesian classification that relaxes the independence assumptions using the similarity between training and test instances; it only assumes independence within a neighborhood. LWL is a general locally weighted learning approach. Note that Bayesian networks are another way of taking non-independence into account with probabilistic models, by explicitly modeling interactions (see the last section of Chapter 6).

29 Problems with Nearest-Neighbor Classification
Slow for large numbers of exemplars. Performs poorly with noisy data if only the single closest exemplar is used for classification. All attributes contribute equally to the distance comparison: if no normalization is done, the attributes with the biggest ranges have the biggest effect, regardless of their importance for classification (see the sketch below).
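A minimal sketch of the normalization idea, with made-up numbers: min-max scaling puts every attribute into [0, 1] so that a wide-range attribute does not dominate the distance:

    import numpy as np

    X = np.array([[30000.0, 0.2],    # attribute 0 (say, income) has a huge range
                  [90000.0, 0.9],    # attribute 1 is already in [0, 1]
                  [55000.0, 0.5]])

    mins, maxs = X.min(axis=0), X.max(axis=0)
    X_norm = (X - mins) / (maxs - mins)   # every attribute now lies in [0, 1]
    print(X_norm)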

30 Problems with Nearest-Neighbor Classification
Even if you normalize, you still have the problem that attributes are not weighted by importance. Nearest-neighbor classification also normally does not do any sort of explicit generalization.

31 Reducing the Number of Exemplars
It is normally unnecessary to retain every example ever seen; ideally, only one important example per section of instance space is needed. One strategy that works reasonably well is to keep only the exemplars that were initially classified wrong (see the sketch below). Over time the number of exemplars kept increases, and the error rate goes down.
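Here is an illustrative sketch (with hypothetical helper names, not code from any toolkit) of that retention strategy: an example is stored only if the exemplars kept so far would have misclassified it:

    import math

    def nearest_label(exemplars, x, distance):
        # Predict using the single closest stored exemplar (None if nothing is stored yet)
        if not exemplars:
            return None
        _, label = min(exemplars, key=lambda e: distance(e[0], x))
        return label

    def train_incrementally(stream, distance):
        # Process (vector, label) pairs one at a time, keeping an example only
        # when the current exemplar set would have gotten it wrong
        exemplars = []
        for x, y in stream:
            if nearest_label(exemplars, x, distance) != y:
                exemplars.append((x, y))
        return exemplars

    euclid = lambda a, b: math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    data = [([0, 0], "A"), ([0, 1], "A"), ([5, 5], "B"), ([5, 6], "B")]
    print(train_incrementally(data, euclid))   # only the misclassified-on-arrival examples are kept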

32 Reducing the Number of Exemplars
One problem is that sometimes it is not clear that an exemplar is important until some time after it has been thrown away. Also, the strategy of keeping just the exemplars that are classified wrong is bad for noisy data, because it will tend to keep the noisy examples.

33 Tuning K for K-Nearest Neighbors
Using more than the single nearest neighbor compensates for noise.
* Tune for the optimal value of K (see the sketch below).
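A sketch of the same idea in scikit-learn (analogous to, but not the same code as, the crossValidate option of Weka's IBK): pick the K with the best leave-one-out cross-validation accuracy:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, LeaveOneOut
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    search = GridSearchCV(KNeighborsClassifier(),
                          param_grid={"n_neighbors": list(range(1, 16))},
                          cv=LeaveOneOut())                 # leave-one-out cross-validation
    search.fit(X, y)
    print("best K:", search.best_params_["n_neighbors"])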

36 Pruning Noisy Examples
Using success ratios, it is possible to reduce the number of examples you are paying attention to, based on their observed reliability. You can compute a success ratio for every instance within range K of the new instance, based on the accuracy of its predictions, computed over the examples seen since it was added to the space.

37 Pruning Noisy Examples
Keep an upper and a lower threshold. Throw out exemplars whose success ratio falls below the lower threshold, and only use exemplars that are above the upper threshold. But keep updating the success ratio of all exemplars (see the sketch below).
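A minimal sketch of that bookkeeping (a hypothetical class, not Weka code): each stored exemplar tracks how often it has predicted correctly since it was added, and only sufficiently reliable exemplars are consulted:

    class Exemplar:
        def __init__(self, vector, label):
            self.vector, self.label = vector, label
            self.correct, self.tries = 0, 0   # updated every time this exemplar makes a prediction

        def success_ratio(self):
            return self.correct / self.tries if self.tries else 0.5   # no evidence yet

    def usable_exemplars(exemplars, lower=0.4, upper=0.6):
        # Drop exemplars below the lower threshold, but only consult the ones
        # above the upper threshold; everything in between is kept and watched.
        exemplars[:] = [e for e in exemplars if e.success_ratio() >= lower]
        return [e for e in exemplars if e.success_ratio() >= upper]

A fuller version would use confidence intervals on the success ratios rather than the raw ratios, which is exactly what the next slide describes.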

38 Don’t do anything rash! We can compute confidence intervals on the success ratios, based on the number of observations we have made. That way you won’t pay attention to an exemplar that just happens to look good at first, and you won’t throw instances away carelessly; exemplars that really are unreliable will eventually be thrown out.

39 What do we do about irrelevant attributes?
You can compensate for irrelevant attributes by scaling attribute values based on importance. Attribute weights are modified after a new example is added to the space, using the exemplar most similar to the new training instance.

40 What do we do about irrelevant attributes?
Adjust the weights so that the new instance comes closer to the most similar exemplar if that exemplar classified it correctly, or farther away if it was wrong (see the sketch below). Weights are usually renormalized after this adjustment. Over time the weights are trained to emphasize attributes that lead to useful generalizations.
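A rough sketch of that update, simplified and with made-up numbers, assuming numeric attributes already scaled to [0, 1]: attributes on which the new instance and its nearest exemplar agree gain relative weight when the prediction was correct and lose it when it was wrong, and the weights are then renormalized:

    def update_attribute_weights(weights, new_x, exemplar_x, correct, rate=0.1):
        updated = []
        for w, a, b in zip(weights, new_x, exemplar_x):
            agreement = 1.0 - abs(a - b)              # assumes attributes scaled to [0, 1]
            w += rate * agreement if correct else -rate * agreement
            updated.append(max(w, 0.0))
        total = sum(updated)
        return [w / total for w in updated]           # renormalize, as the slide notes

    # Attribute 0 (where the vectors agree) gains relative weight after a correct prediction
    print(update_attribute_weights([0.5, 0.5], [0.2, 0.9], [0.25, 0.1], correct=True))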

41 Instance Based Learning with Generalization
Instances are generalized to regions. This allows instance based learning algorithms to behave like other machine learning algorithms (just another complex decision boundary). The key idea is determining how far to generalize from each instance.

42 IB1: Plain Vanilla Nearest Neighbor Algorithm
Keeps all training instances and doesn’t normalize. Uses Euclidean distance. Bases its prediction on the first instance found with the shortest distance. There is nothing to optimize. Published in 1991 by my AI programming professor from UCI!

43 IBK: More general than IB1
kNN: how many neighbors to pay attention to. crossValidate: use leave-one-out cross-validation to select the optimal K. distanceWeighting: allows you to select the method for weighting neighbors based on distance. meanSquared: if it’s true, use mean squared error rather than absolute error for regression problems.

44 IBK: More general than IB1
noNormalization: turns off normalization. windowSize: sets the maximum number of instances to keep, pruning off older instances when necessary; 0 means no limit.

45 K* Uses an entropy based distance metric rather than Euclidean distance. Much slower than IBK! Its optimizations are related to concepts we aren’t covering in this course. Allows you to choose what to do with missing values.

46 What is special about K*?
Distance is computed based on how many transformation operations it would take to map one vector onto another. There may be multiple transformation paths, and all of them are taken into account, so the distance is an average over all possible transformation paths (randomly generated, so the branching factor matters!). That’s why it’s slow!!! This allows for a more natural way of handling distance when your attribute space has many different types of attributes.

47 What is special about K*?
Also allows a natural way of handling unknown values (by probabilistically imputing values). K* is likely to do better than other approaches if you have lots of unknown values or a very heterogeneous feature space (in terms of types of features).

48 Locally Weighted Numeric Prediction
Two main types of trees are used for numeric prediction. Regression trees: average values are computed at the leaf nodes. Model trees: regression functions are trained at the leaf nodes. Rather than maximizing information gain, these algorithms minimize the variation within the subsets at the leaf nodes.

49 Locally Weighted Numeric Prediction
Locally weighted regression is an alternative to regression trees where the regression is computed at testing time rather than training time: compute a regression over the instances that are close to the testing instance (see the sketch below).
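A sketch of locally weighted regression; the Gaussian kernel and the bandwidth are assumptions for illustration, not the lecture's choices. For each query point, weight the training instances by their closeness to the query and fit a linear model with those weights at prediction time:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def locally_weighted_predict(X_train, y_train, x_query, bandwidth=1.0):
        dists = np.linalg.norm(X_train - x_query, axis=1)
        weights = np.exp(-(dists ** 2) / (2 * bandwidth ** 2))   # nearby instances count more
        model = LinearRegression()
        model.fit(X_train, y_train, sample_weight=weights)       # the fit happens at test time
        return model.predict(x_query.reshape(1, -1))[0]

    X = np.linspace(0, 10, 50).reshape(-1, 1)
    y = np.sin(X).ravel()
    print(locally_weighted_predict(X, y, np.array([3.0]), bandwidth=0.5))  # close to sin(3), about 0.14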

50 Summary of Locally Weighted Learning
Use instance based learning together with a base classifier, almost like a wrapper: learn a model within a neighborhood. Basic idea: approximate non-linear function learning with simple linear algorithms.

51 Summary of Locally Weighted Learning
Big advantage: allows for incremental learning, whereas approaches like SVMs do not. If you don’t need the incrementality, then it is probably better not to go with instance based learning.

52 Take Home Message There are many ways of evaluating the similarity of instances, and they lead to different results. Instance based learning and clustering both make use of these approaches. Locally weighted learning is another way (besides the “kernel trick”) to get nonlinearity into otherwise linear approaches.

