Instance Based Learning
Soongsil University, Intelligent Systems Lab.
Content
Motivation: Eager Learning, Lazy Learning, Instance-Based Learning
k-Nearest Neighbour Learning (kNN)
Distance-Weighted kNN
Locally Weighted Regression (LWR)
Radial Basis Functions (RBF)
Case-Based Reasoning (CBR)
Summary
Instance-based learning
One way of approximating discrete- or real-valued target functions.
We have training examples (x_n, f(x_n)), n = 1..N.
Key idea: simply store the training examples; when a test example is given, find the closest matches.
Motivation: Eager Learning
The learning task: approximate a target function through a hypothesis, on the basis of training examples.
Eager learning: as soon as the training examples and the hypothesis space are received, the search for the best hypothesis begins.
Training phase: given the training examples D and the hypothesis space H, search for the best hypothesis.
Processing phase: for every new instance x_q, return the value predicted by that hypothesis.
Examples: eager learners include ID3 and RBF networks (listed later in the Lazy and Eager Learning slide).
Motivation: Lazy Algorithms
Lazy algorithms: the training examples are simply stored ("sleeping").
Generalisation beyond these examples is postponed until new instances must be classified.
Every time a new query instance is encountered, its relationship to the previously stored examples is examined in order to compute the value of the target function for this new instance.
Motivation: Instance-Based Learning
Instance-based algorithms can establish a new local approximation for every new instance.
Training phase: given a training sample D, simply store it.
Processing phase: given an instance x_q, search for the best local hypothesis and return \hat{f}(x_q).
Examples: Nearest Neighbour algorithm, Distance-Weighted Nearest Neighbour, Locally Weighted Regression, ...
Motivation: Instance-Based Learning
How are the instances represented?
How can we measure the similarity of the instances?
How can \hat{f}(x_q) be computed?
Nearest Neighbour Algorithm
Idea: all instances correspond to points in the n-dimensional space R^n. Assign to the new instance the value of its nearest neighbouring training instance.
Representation: let x = (a_1(x), a_2(x), ..., a_n(x)) be an instance, where a_r(x) denotes the value of the r-th attribute of instance x.
Target function: discrete-valued or real-valued.
We may also write x_{ir} instead of a_r(x_i).
1-Nearest Neighbour
Four things make a memory-based learner:
1. A distance metric: Euclidean.
2. How many nearby neighbours to look at? One.
3. A weighting function (optional): unused.
4. How to fit with the local points? Just predict the same output as the nearest neighbour.
Nearest Neighbour Algorithm
How is \hat{f} formed?
Discrete-valued target function: \hat{f}: R^n -> V, where V = {v_1, ..., v_s} is a finite set of classes (e.g., red, black, yellow, ...).
Continuous (real-valued) target function: \hat{f}: R^n -> R.
In either case, let x_n be the nearest neighbour of x_q; then \hat{f}(x_q) := f(x_n).
Nearest Neighbour Algorithm
1-Nearest Neighbour: given a query instance x_q, first locate the nearest training example x_n, then set \hat{f}(x_q) := f(x_n).
k-Nearest Neighbour: given a query instance x_q, first locate the k nearest training examples.
If the target function is discrete-valued, take a vote among the k nearest neighbours (e.g., the votes X, X, O, O, X, O, X, X give X).
If the target function is real-valued, take the mean of the f values of the k nearest neighbours.
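As a concrete illustration of the two cases above, here is a minimal Python sketch (the function names and the Euclidean metric are illustrative choices, not prescribed by the slides):

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(query, examples, k=1, discrete=True):
    """examples: list of (attribute_vector, target_value) pairs.
    Discrete targets -> majority vote; real-valued targets -> mean."""
    neighbours = sorted(examples, key=lambda ex: euclidean(query, ex[0]))[:k]
    values = [f for _, f in neighbours]
    if discrete:
        return Counter(values).most_common(1)[0][0]   # vote among the k neighbours
    return sum(values) / len(values)                  # mean of the k f-values

# e.g. the vote X, X, O, O, X, O, X, X from the slide would return 'X' for k = 8
```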
How to choose k
Averaging over k points is more reliable when there is noise in the attributes, noise in the class labels, or the classes partially overlap.
Large k: less sensitive to noise (particularly class noise); better probability estimates for discrete classes; larger training sets allow larger values of k.
Small k: captures the fine structure of the problem space better; may be necessary with small training sets.
A balance must be struck between large and small k.
As the training set approaches infinity and k grows large, kNN becomes Bayes optimal (predict 1 if p(x) > 0.5, else 0).
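The slides leave the choice of k as a trade-off; one common way to pick it in practice is the leave-one-out error on the training set, sketched below (this reuses the hypothetical knn_predict helper from the previous sketch and is an illustration, not part of the original slides):

```python
def loo_error(examples, k):
    """Leave-one-out misclassification rate for a given k (discrete targets)."""
    errors = 0
    for i, (x, f) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]        # hold out example i
        if knn_predict(x, rest, k=k, discrete=True) != f:
            errors += 1
    return errors / len(examples)

def choose_k(examples, candidates=(1, 3, 5, 7)):
    """Pick the candidate k with the smallest leave-one-out error."""
    return min(candidates, key=lambda k: loo_error(examples, k))
```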
k-Nearest Neighbour
Four things make a memory-based learner:
1. A distance metric: Euclidean.
2. How many nearby neighbours to look at? k.
3. A weighting function (optional): unused.
4. How to fit with the local points? Just predict the average output among the k nearest neighbours.
k-Nearest Neighbour
Idea: if we choose k = 1, the algorithm assigns to \hat{f}(x_q) the value f(x_n), where x_n is the training instance nearest to x_q. For larger values of k the algorithm assigns the most common value among the k nearest training examples.
How can \hat{f}(x_q) be established? Let x_1, ..., x_k denote the k instances from the training examples that are nearest to x_q; then
\hat{f}(x_q) <- \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i)), where \delta(a, b) = 1 if a = b, and 0 otherwise.
The distance between examples
We need a measure of distance in order to know which examples are the neighbours.
Assume that we have T attributes for the learning problem. Then one example point x_i has elements x_{it}, t = 1, ..., T.
The distance between two points x_i, x_j is often defined as the Euclidean distance:
d(x_i, x_j) = \sqrt{\sum_{t=1}^{T} (x_{it} - x_{jt})^2}
Similarity and Dissimilarity Between Objects
Distances are the normally used measures.
Minkowski distance, a generalisation: d(x_i, x_j) = (\sum_{t=1}^{T} |x_{it} - x_{jt}|^q)^{1/q}
If q = 2, d is the Euclidean distance; if q = 1, d is the Manhattan distance.
Weighted distance: d(x_i, x_j) = (\sum_{t=1}^{T} w_t |x_{it} - x_{jt}|^q)^{1/q}
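A small sketch of the Minkowski family with optional per-attribute weights (function and parameter names are illustrative):

```python
def minkowski(a, b, q=2, weights=None):
    """Minkowski distance between attribute vectors a and b.
    q=2 gives the Euclidean distance, q=1 the Manhattan distance.
    Optional per-attribute weights give the weighted distance."""
    if weights is None:
        weights = [1.0] * len(a)
    return sum(w * abs(ai - bi) ** q
               for w, ai, bi in zip(weights, a, b)) ** (1.0 / q)
```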
k-Nearest Neighbour
Refinement: the neighbours are weighted according to their distance to the query point; the farther away a neighbour is, the less its influence.
\hat{f}(x_q) <- \arg\max_{v \in V} \sum_{i=1}^{k} w_i \delta(v, f(x_i)), where w_i = 1 / d(x_q, x_i)^2 and \delta(a, b) = 1 if a = b, 0 otherwise.
To accommodate the case where the query point exactly matches one of the training instances x_i, and the denominator d(x_q, x_i)^2 is therefore zero, we assign \hat{f}(x_q) := f(x_i) in this case.
Distance weighting for a real-valued target function: \hat{f}(x_q) = \sum_{i=1}^{k} w_i f(x_i) / \sum_{i=1}^{k} w_i
Voronoi Diagram
Example: decision surfaces for 1-NN and 5-NN (figures).
Voronoi diagram: the decision surface induced by a 1-Nearest Neighbour algorithm for a typical set of training examples. The convex polygon surrounding each training example indicates the region of query points whose classification will be completely determined by that training example.
Characteristics of Instance-Based Learning
An instance-based learner is a lazy learner and does all the work when the test example is presented. This is opposed to so-called eager learners, which build a parameterised, compact model of the target.
It produces a local approximation to the target function (a different one for each test instance).
When to consider Nearest Neighbour algorithms?
Instances map to points in R^n; not more than, say, 20 attributes per instance; lots of training data.
Advantages: training is very fast; can learn complex target functions; don't lose information.
Disadvantages: ? (we will see them shortly...)
(Figure: training examples labelled "one" through "eight" and an unlabelled query instance "?".)
Training data and test instance
Number  Lines  Line types  Rectangles  Colours  Mondrian?
1       6      1           10          4        No
2       4      2           8           5        (not captured)
3       5      2           7           4        Yes
4       5      1           8           4        (not captured)
5       5      1           10          5        No
6       6      1           8           6        Yes
7       7      1           14          5        No
(The test instance's attribute values are not captured in the transcript.)
Keep data in normalised form
One way to normalise the data a_r(x) to a'_r(x) is to rescale each attribute to [0, 1]:
a'_r(x) = (a_r(x) - min_x a_r(x)) / (max_x a_r(x) - min_x a_r(x))
(one common choice; standardising to zero mean and unit variance is another).
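A minimal sketch of this kind of normalisation, assuming min-max rescaling (the exact scheme used on the original slide is not shown in the transcript):

```python
def min_max_normalise(examples):
    """Rescale every attribute to [0, 1], column by column.
    examples: list of (attribute_vector, target_value) pairs.
    Returns the normalised examples and a scaler for new (test) instances."""
    columns = list(zip(*[x for x, _ in examples]))
    lows  = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    def scale(x):
        # constant columns are mapped to 0.0 to avoid division by zero
        return [(v - lo) / (hi - lo) if hi > lo else 0.0
                for v, lo, hi in zip(x, lows, highs)]
    return [(scale(x), f) for x, f in examples], scale
```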
Normalised training data; normalised test instance.
Distances of the test instance from the training data
Classification  Before normalisation  After normalisation
1-NN            Yes                   No
3-NN            Yes                   Yes
5-NN            No                    Yes
7-NN            No                    No
What if the target function is real-valued?
The k-nearest neighbour algorithm would then simply calculate the mean of the f values of the k nearest neighbours: \hat{f}(x_q) = (1/k) \sum_{i=1}^{k} f(x_i)
Distance-Weighted kNN
We might want to give nearer neighbours a heavier weight, w_i = 1 / d(x_q, x_i)^2.
For discrete-valued target functions: \hat{f}(x_q) <- \arg\max_{v \in V} \sum_{i=1}^{k} w_i \delta(v, f(x_i))
For real-valued target functions (Shepard's method): \hat{f}(x_q) = \sum_{i=1}^{k} w_i f(x_i) / \sum_{i=1}^{k} w_i
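A hedged sketch of distance-weighted kNN following the formulas above; the exact-match shortcut handles the zero-denominator case mentioned earlier, and the helper names are illustrative:

```python
import math

def euclidean(a, b):
    """Euclidean distance (same helper as in the earlier kNN sketch)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def distance_weighted_knn(query, examples, k, discrete=True):
    """Shepard-style distance weighting: w_i = 1 / d(x_q, x_i)^2.
    If the query coincides with a stored example, return its value directly."""
    neighbours = sorted(examples, key=lambda ex: euclidean(query, ex[0]))[:k]
    weights, values = [], []
    for x, f in neighbours:
        d = euclidean(query, x)
        if d == 0.0:                      # exact match: denominator would be zero
            return f
        weights.append(1.0 / d ** 2)
        values.append(f)
    if discrete:
        votes = {}
        for w, v in zip(weights, values):
            votes[v] = votes.get(v, 0.0) + w
        return max(votes, key=votes.get)  # weighted vote
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```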
Remarks on the k-Nearest Neighbour Algorithm
Problem: the distance between two instances is computed over every attribute, so even irrelevant attributes can influence the approximation. Example: n = 20 attributes, but only 2 are relevant.
Solution: weight each attribute differently when calculating the distance between two neighbours, i.e. stretch the relevant axes in Euclidean space: shorten the axes that correspond to less relevant attributes and lengthen the axes that correspond to more relevant attributes.
Problem: how can the weight for each attribute be determined automatically? Cross-validation, leave-one-out (see the next lecture).
Remarks on the k-Nearest Neighbour Algorithm (2)
Advantages: the training phase is very fast; can learn complex target functions; robust to noisy training data; quite effective when a sufficiently large set of training data is provided.
Disadvantages: the algorithm delays all processing until a new query is received, so significant computation can be required per query (efficient memory indexing helps); processing is slow; sensitive to irrelevant attributes (the curse of dimensionality).
Inductive bias: the assumption that the classification of an instance will be most similar to the classification of other instances that are nearby in Euclidean distance.
Locally Weighted Regression
Idea: a generalisation of the Nearest Neighbour algorithm. It constructs an explicit approximation to f over a local region surrounding x_q, using nearby or distance-weighted training examples to form the local approximation.
Local: the function is approximated based solely on the training data near the query point.
Weighted: the contribution of each training example is weighted by its distance from the query point.
Regression: approximating a real-valued target function.
How Locally Weighted Regression works (figure).
Locally Weighted Regression
Procedure: given a new query x_q, construct an approximation \hat{f} that fits the training examples in the neighbourhood surrounding x_q.
This approximation is used to calculate \hat{f}(x_q), the estimated target value assigned to the query instance.
The description of \hat{f} may change, because a different local approximation is calculated for each query instance.
Locally Weighted Linear Regression
A special case of LWR with simple computation.
Linear hypothesis space: \hat{f}(x) = w_0 + w_1 a_1(x) + ... + w_n a_n(x), where a_r(x) is the r-th attribute of instance x.
Define the error criterion E so as to emphasise fitting the local training examples. Three possibilities:
E_1(x_q) = 1/2 \sum_{x \in kNN(x_q)} (f(x) - \hat{f}(x))^2  (squared error over just the k nearest neighbours)
E_2(x_q) = 1/2 \sum_{x \in D} (f(x) - \hat{f}(x))^2 K(d(x_q, x))  (squared error over the entire set D, using a kernel function K that decreases the contribution with distance)
E_3(x_q) = 1/2 \sum_{x \in kNN(x_q)} (f(x) - \hat{f}(x))^2 K(d(x_q, x))  (a combination of E_1 and E_2)
Locally Weighted Linear Regression
Using a linear function to approximate f, recall the gradient descent rule from chapter 4:
\Delta w_r = \eta \sum_{x \in D} (f(x) - \hat{f}(x)) a_r(x)
Locally Weighted Linear Regression (2)
The third error criterion E_3 is a good approximation to the second one, and it has the advantage that the computational cost is independent of the total number of training examples.
If E_3 is chosen and the gradient descent rule is re-derived (as for neural networks), the following training rule is obtained:
\Delta w_r = \eta \sum_{x \in kNN(x_q)} K(d(x_q, x)) (f(x) - \hat{f}(x)) a_r(x)
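The following sketch applies the E_3 training rule in a stochastic, per-example form with a Gaussian kernel. The learning rate, kernel width and epoch count are illustrative assumptions, not values from the slides:

```python
import math

def euclidean(a, b):
    """Euclidean distance (same helper as in the earlier kNN sketch)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def lwlr_predict(query, examples, k=5, eta=0.01, sigma=1.0, epochs=200):
    """Locally weighted linear regression for a single query x_q, trained with a
    per-example version of the E_3 gradient rule and a Gaussian kernel."""
    nbrs = sorted(examples, key=lambda ex: euclidean(query, ex[0]))[:k]
    n = len(query)
    w = [0.0] * (n + 1)                           # w[0] is the bias term w_0
    def f_hat(x):
        return w[0] + sum(wr * xr for wr, xr in zip(w[1:], x))
    def kernel(d):
        return math.exp(-d * d / (2 * sigma ** 2))
    for _ in range(epochs):
        for x, f in nbrs:
            kd = kernel(euclidean(query, x))      # K(d(x_q, x))
            err = f - f_hat(x)
            w[0] += eta * kd * err                # a_0(x) = 1
            for r in range(n):
                w[r + 1] += eta * kd * err * x[r]
    return f_hat(query)
```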
Evaluation of Locally Weighted Regression
Advantages: pointwise approximation of a complex target function; each local approximation is built independently, so earlier queries do not influence later ones.
Disadvantages: the quality of the result depends on the choice of the error criterion E, the kernel function K, and the hypothesis space H; sensitive to irrelevant attributes.
Difficulties with k-nearest neighbour algorithms
We have to calculate the distance of the test case from all training cases.
There may be irrelevant attributes amongst the attributes (the curse of dimensionality).
Radial Basis Function (RBF) Networks
A blend of the instance-based method and the neural network method.
Each prototype node computes a distance-based kernel function (a Gaussian is common).
The prototype nodes form a hidden layer in a neural network.
The top layer is trained with the simple delta rule to produce the outputs; thus weightings of the prototype nodes are learned for each class.
Radial Basis Functions
Function to be learned: \hat{f}(x) = w_0 + \sum_{u=1}^{k} w_u K_u(d(x_u, x))
One common choice for K_u is the Gaussian kernel: K_u(d(x_u, x)) = exp(-d^2(x_u, x) / (2 \sigma_u^2))
This is a global approximation to the target function, expressed as a linear combination of local approximations.
Related to distance-weighted regression, but "eager" instead of "lazy".
Training RBF Networks
Stage one: define the hidden units by choosing k, the centres x_u and the widths \sigma_u^2. One can allocate a Gaussian kernel function for each training example, or choose a set of kernel functions smaller than the number of training examples, scattered uniformly (or non-uniformly) over the instance space.
Stage two: train the weights w_u, e.g. by gradient descent on a global error criterion.
Radial Basis Function Networks
The a_i(x) are the attributes describing instance x. The first layer computes the various K_u(d(x_u, x)); the second layer computes a linear combination of the first-layer unit values.
A hidden unit's activation is close to 0 if x is not near its centre x_u.
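A compact numpy sketch of such a two-layer RBF network. It fixes the centres and width in stage one and fits the output weights by least squares, which is a closed-form stand-in for the delta-rule/gradient-descent training described above; all names and defaults are illustrative assumptions:

```python
import numpy as np

def train_rbf(X, y, centres, sigma=1.0):
    """X: (N, n) training inputs, y: (N,) targets, centres: (k, n) prototypes x_u.
    Returns a predictor \\hat{f}(x) = w_0 + sum_u w_u K_u(d(x_u, x))."""
    def design(Xq):
        # squared distances to every centre, then Gaussian kernel activations
        d2 = ((Xq[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        phi = np.exp(-d2 / (2 * sigma ** 2))
        return np.hstack([np.ones((len(Xq), 1)), phi])   # column of ones for w_0
    W, *_ = np.linalg.lstsq(design(X), y, rcond=None)    # fit output weights
    return lambda Xq: design(np.atleast_2d(Xq)) @ W

# usage sketch: predict = train_rbf(X_train, y_train, centres=X_train[:10])
#               y_hat = predict(x_new)
```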
Case-Based Reasoning (CBR)
Instance-based methods and locally weighted regression are lazy learning methods: they classify new query instances by analysing similar instances while ignoring very different ones, and they represent instances as real-valued points in an n-dimensional Euclidean space.
CBR keeps the first two principles, but instances are represented using a richer symbolic description, and the methods used to retrieve them are correspondingly more elaborate.
CBR is an advanced form of instance-based learning applied to more complex instance objects; objects may include complex structural descriptions of cases and adaptation rules.
CBR cannot use Euclidean distance measures; distance measures must instead be defined for these complex objects (e.g. semantic nets).
CBR tries to model human problem-solving: it uses past experience (cases) to solve new problems and retains the solutions to those new problems.
CBR is an ongoing area of machine learning research with many applications.
Applications of CBR
Design: landscape, building, mechanical, and conceptual design of aircraft sub-systems.
Planning: repair schedules.
Diagnosis: medical.
Adversarial reasoning: legal.
CBR process
A new case is matched against the case base to retrieve the closest matching cases. If the closest case needs no adaptation, its solution is suggested directly; otherwise the case is adapted using knowledge and adaptation rules before a solution is suggested. The revised solution is then retained (learned) in the case base for reuse.
CBR example: property pricing (case base and test instance).
How rules are generated
There is no unique way of doing it. Here is one possibility: examine cases and look for pairs that are almost identical.
Cases 1 and 2 give R1: if recep-rooms changes from 2 to 1, then reduce the price by £5,000.
Cases 3 and 4 give R2: if Type changes from semi to terraced, then reduce the price by £7,000.
Matching
Comparing the test instance (case 5) with the stored cases: matches(5,1) = 3, matches(5,2) = 3, matches(5,3) = 2, matches(5,4) = 1.
The estimated price of case 5 is £25,000.
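A small sketch of this attribute-matching retrieval; the representation of cases as Python dicts and the attribute names are hypothetical, since the slide's actual table is not captured in the transcript:

```python
def count_matches(case_a, case_b, attributes):
    """Number of attributes on which two cases agree."""
    return sum(case_a[a] == case_b[a] for a in attributes)

def closest_case(query, case_base, attributes):
    """Retrieve the stored case with the most matching attribute values."""
    return max(case_base, key=lambda c: count_matches(query, c, attributes))

# hypothetical usage with made-up attributes:
# attrs = ["type", "recep_rooms", "bedrooms", "location"]
# best = closest_case(case_5, [case_1, case_2, case_3, case_4], attrs)
```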
Adapting
Reverse rule R2: if Type changes from terraced to semi, then increase the price by £7,000.
Applying the reversed rule, the new estimate of the price of property 5 is £32,000.
Learning
So far we have a new case and an estimated price; nothing has been added to the case base yet.
If we later find the house sold for £35,000, the case would be added, and we could add a new rule: if location changes from 8 to 7, increase the price by £3,000.
Problems with CBR
How should cases be represented?
How should cases be indexed for fast retrieval?
How can good adaptation heuristics be developed?
When should old cases be removed?
Advantages
A local approximation is found for each test case.
Knowledge is in a form understandable to human beings.
Fast to train.
Lazy and Eager Learning
Lazy: wait for the query before generalising (kNN, locally weighted regression, CBR).
Eager: generalise before seeing the query (RBF networks).
Differences: computation time; global versus local approximations to the target function. Using the same hypothesis space H, a lazy learner can effectively represent more complex functions (e.g., with H = linear functions, it fits a different linear function near each query).
Summary
Differences and advantages:
kNN algorithm: the most basic instance-based method.
Locally weighted regression: a generalisation of kNN.
RBF networks: a blend of the instance-based method and the neural network method.
Case-based reasoning: instance-based learning over richer symbolic case descriptions.
Lazy and Eager Learning
Lazy: wait for the query before generalising (k-Nearest Neighbour, case-based reasoning).
Eager: generalise before seeing the query (RBF networks, ID3, ...).
Does it matter? An eager learner must create a single global approximation; a lazy learner can create many local approximations.
The End