
1 Università di Milano-Bicocca, Laurea Magistrale in Informatica. Course: APPRENDIMENTO E APPROSSIMAZIONE. Lecture 8 - Instance-based learning. Prof. Giancarlo Mauri

2 Instance-based Learning
Key idea: just store all training examples; classify a query instance by retrieving a set of "similar" instances.
Advantages:
- Training is very fast - just store the examples
- Can learn complex target functions
- Don't lose information
Disadvantages:
- Slow at query time - needs efficient indexing of training examples
- Easily fooled by irrelevant attributes

3 Instance-based Learning
Main approaches:
- k-Nearest Neighbor
- Locally weighted regression
- Radial basis functions
- Case-based reasoning
- Lazy and eager learning

4 K-Nearest Neighbor Learning
Instances: points in R^n, with Euclidean distance to measure similarity.
Target function f: R^n → V.
The algorithm: given a query instance x_q, first locate the nearest training example x_n, then
- estimate f(x_q) = f(x_n) (for k=1), or
- take a vote among its k nearest neighbors (k>1 and f discrete-valued), or
- take the mean of the f values of its k nearest neighbors (k>1 and f real-valued).
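
A minimal sketch of both prediction rules, assuming NumPy arrays for the training data; the function names knn_classify and knn_regress are illustrative, not from the lecture.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_q, k=1):
    # distances from the query to every stored training example
    dists = np.linalg.norm(X_train - x_q, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]          # majority vote among the neighbors

def knn_regress(X_train, y_train, x_q, k=1):
    dists = np.linalg.norm(X_train - x_q, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean([y_train[i] for i in nearest]))   # mean of the k target values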

5 K-Nearest Neighbor Learning
When to consider nearest neighbor:
- Instances map to points in R^n
- Fewer than 20 attributes per instance
- Lots of training data

6 An example
A Boolean target function in 2D: the query x is classified + for k=1 and - for k=5 (left diagram). The right diagram (a Voronoi diagram) shows the decision surface induced by the 1-nearest-neighbor rule for a given set of training examples (black dots).
(Figure: training points labeled + and - scattered around the query point x.)

7 Voronoi Diagram
(Figure: a query point q_f and its nearest neighbor q_i.)

8 3-Nearest Neighbors
(Figure: a query point q_f and its 3 nearest neighbors: 2 labeled x, 1 labeled o.)

9 7-Nearest Neighbors
(Figure: a query point q_f and its 7 nearest neighbors: 3 labeled x, 4 labeled o.)

10 Nearest Neighbor (continuous) 1-nearest neighbor

11 Nearest Neighbor (continuous) 3-nearest neighbor

12 Nearest Neighbor (continuous) 5-nearest neighbor

13 Behavior in the Limit
Let p(x) be the probability that instance x is labeled 1 (positive) versus 0 (negative).
Nearest neighbor: as the number of training examples → ∞, it approaches the Gibbs algorithm (with probability p(x) predict 1, else 0).
k-Nearest neighbor: as the number of training examples → ∞ and k grows large, it approaches the Bayes optimal classifier (if p(x) > 0.5 then predict 1, else 0).
Note: Gibbs has at most twice the expected error of the Bayes optimal classifier.

14 Distance-Weighted k-NN
Might want to weight nearer neighbors more heavily:
f^(x_q) = Σ_{i=1..k} w_i f(x_i) / Σ_{i=1..k} w_i, where w_i = 1 / d(x_q, x_i)^2
and d(x_q, x_i) is the distance between x_q and x_i (if x_q = x_i, i.e. d(x_q, x_i) = 0, then f^(x_q) = f(x_i)).
Note: now it makes sense to use all training examples instead of just k (Shepard's method; global, slow).
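
A sketch of this weighted estimate, assuming NumPy arrays; passing k=None uses all training examples, as in Shepard's method. The inverse-square weights follow the formula above.

import numpy as np

def weighted_knn_regress(X_train, y_train, x_q, k=None):
    dists = np.linalg.norm(X_train - np.asarray(x_q, dtype=float), axis=1)
    if np.any(dists == 0):                      # exact match: return f(x_i) directly
        return float(np.mean(y_train[dists == 0]))
    order = np.argsort(dists)
    idx = order if k is None else order[:k]     # k=None -> all examples (global, slow)
    w = 1.0 / dists[idx] ** 2                   # w_i = 1 / d(x_q, x_i)^2
    return float(np.sum(w * y_train[idx]) / np.sum(w))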

15 Curse of Dimensionality
Imagine instances described by 20 attributes, of which only 2 are relevant to the target function: the nearest neighbor is easily misled when X is high-dimensional.
One approach: stretch the j-th axis by weight z_j, where z_1, …, z_n are chosen to minimize prediction error.
Use cross-validation to choose the weights z_1, …, z_n automatically.
Note: setting z_j to zero eliminates that dimension altogether.
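
One hedged sketch of choosing the axis weights: score a candidate z by leave-one-out 1-NN error and keep the best of a random search. The slide only says cross-validation is used, not which search procedure; integer class labels and NumPy arrays are assumed.

import numpy as np

def loo_error(X, y, z, k=1):
    Xs = X * z                                  # stretch axis j by z_j
    errors = 0
    for i in range(len(X)):
        d = np.linalg.norm(Xs - Xs[i], axis=1)
        d[i] = np.inf                           # leave example i out
        nearest = np.argsort(d)[:k]
        pred = np.bincount(y[nearest]).argmax()
        errors += int(pred != y[i])
    return errors / len(X)

def choose_axis_weights(X, y, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    best_z = np.ones(X.shape[1])
    best_e = loo_error(X, y, best_z)
    for _ in range(trials):
        z = rng.uniform(0.0, 1.0, X.shape[1])   # z_j = 0 drops dimension j entirely
        e = loo_error(X, y, z)
        if e < best_e:
            best_z, best_e = z, e
    return best_z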

16 Locally Weighted Regression
Note that kNN forms a local approximation to f for each query point x_q: why not form an explicit approximation f'(x) for the region surrounding x_q?
- Fit a linear function to the k nearest neighbors
- Fit a quadratic, ...
- Produces a "piecewise approximation" to f
Several choices of error to minimize:
- Squared error over the k nearest neighbors: E_1(x_q) = (1/2) Σ_{x ∈ kNN(x_q)} (f(x) - f'(x))^2
- Distance-weighted squared error over all neighbors: E_2(x_q) = (1/2) Σ_{x ∈ D} (f(x) - f'(x))^2 K(d(x_q, x))

17 Radial Basis Function Networks
Global approximation to the target function, in terms of a linear combination of local approximations.
Used, e.g., for image classification.
A different kind of neural network.
Closely related to distance-weighted regression, but "eager" instead of "lazy".

18 Radial Basis Function Networks
f(x) = w_0 + Σ_{u=1..k} w_u K_u(d(x_u, x))
where the a_i(x) are the attributes describing instance x.
One common choice for K_u(d(x_u, x)) is the Gaussian kernel
K_u(d(x_u, x)) = exp( -d(x_u, x)^2 / (2 σ_u^2) )

19 Training RBF Networks
Q1: which x_u to use for each kernel function K_u(d(x_u, x))?
- Scatter uniformly throughout instance space
- Or use training instances (reflects the instance distribution)
Q2: how to train the weights (assume here Gaussian K_u)?
- First choose the variance (and perhaps the mean) for each K_u - e.g., use EM
- Then hold K_u fixed and train the linear output layer - there are efficient methods to fit linear functions

20 Radial Basis Function Network
Global approximation to the target function in terms of a linear combination of local approximations.
Used, e.g., for image classification.
Similar to a back-propagation neural network, but the activation function is Gaussian rather than sigmoid.
Closely related to distance-weighted regression, but "eager" instead of "lazy".

21 Radial Basis Function Network
(Diagram: input layer x_i, kernel functions K_n, linear output f(x) with weights w_n.)
K_n(d(x_n, x)) = exp( -d(x_n, x)^2 / (2 σ^2) )
f(x) = w_0 + Σ_{n=1..k} w_n K_n(d(x_n, x))

22 Training Radial Basis Function Networks
How to choose the center x_n for each kernel function K_n?
- Scatter uniformly across instance space
- Or use the distribution of the training instances (clustering)
How to train the weights?
- Choose the mean x_n and variance σ_n for each K_n (nonlinear optimization or EM)
- Hold K_n fixed and use local linear regression to compute the optimal weights w_n
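
A minimal RBF-network sketch under simplifying assumptions: centers drawn at random from the training inputs, a single fixed Gaussian width sigma, and the output weights fit by ordinary least squares instead of local linear regression. Function names are illustrative.

import numpy as np

def gaussian_kernel(X, centers, sigma):
    # K[i, u] = exp(-d(x_u, x_i)^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_rbf(X, y, n_centers=10, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=min(n_centers, len(X)), replace=False)]
    K = gaussian_kernel(X, centers, sigma)
    Phi = np.hstack([np.ones((len(X), 1)), K])       # column of ones for w_0
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # linear output layer, least squares
    return centers, sigma, w

def predict_rbf(X, centers, sigma, w):
    K = gaussian_kernel(X, centers, sigma)
    return np.hstack([np.ones((len(X), 1)), K]) @ w  # f(x) = w_0 + sum_n w_n K_n(d(x_n, x))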

23 Radial Basis Network Example
K_1(d(x_1, x)) = exp( -d(x_1, x)^2 / (2 σ^2) )
f^(x) = K_1 (w_1 x + w_0) + K_2 (w_3 x + w_2)
(Figure: two Gaussian kernels, each gating its own local linear model.)

24 Case-based reasoning
Can apply instance-based learning even when X ≠ R^n: a different "distance" metric is needed.
Case-based reasoning is instance-based learning applied to instances with symbolic logic descriptions:
((user-complaint error53-on-shutdown) (cpu-model PowerPC) (operating-system Windows) (network-connection PCIA) (memory 48meg) (installed-applications Excel Netscape VirusScan) (disk 1gig) (likely-cause ???))
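
As one hedged illustration of such a metric (not the one used by any particular CBR system), symbolic cases can be compared by counting mismatching attribute-value pairs; the dictionaries below reuse a few attributes from the example above.

def case_distance(case_a, case_b):
    # number of attributes on which the two case descriptions disagree
    keys = set(case_a) | set(case_b)
    return sum(case_a.get(k) != case_b.get(k) for k in keys)

query = {"user-complaint": "error53-on-shutdown", "cpu-model": "PowerPC",
         "operating-system": "Windows", "memory": "48meg"}
stored = {"user-complaint": "error53-on-shutdown", "cpu-model": "PowerPC",
          "operating-system": "Windows", "memory": "32meg"}
print(case_distance(query, stored))   # 1: the cases differ only in memory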

25 Case-based reasoning in CADET
CADET: 75 stored examples of mechanical devices.
Each training example: ⟨qualitative function, mechanical structure⟩.
New query: desired function; target value: mechanical structure for this function.
Distance metric: match qualitative function descriptions.

26 Case-based Reasoning in CADET

27 Instances are represented by rich structural descriptions. Multiple cases are retrieved (and combined) to form a solution to the new problem. There is tight coupling between case retrieval and problem solving.
Bottom line: simple matching of cases is useful for tasks such as answering help-desk queries; this remains an area of ongoing research.

28 Lazy and Eager Learning
Lazy: wait for the query before generalizing (k-Nearest Neighbor, case-based reasoning).
Eager: generalize before seeing the query (radial basis function networks, ID3, Backpropagation, Naive Bayes, ...).
Does it matter?
- The eager learner must create a global approximation.
- The lazy learner can create many local approximations.
- If they use the same hypothesis space H, the lazy learner can represent more complex functions (e.g., consider H = linear functions).

29 Machine Learning 2D5362 Instance Based Learning

30 Distance Weighted k-NN
Give more weight to neighbors closer to the query point:
f^(x_q) = Σ_{i=1..k} w_i f(x_i) / Σ_{i=1..k} w_i, where w_i = K(d(x_q, x_i))
and d(x_q, x_i) is the distance between x_q and x_i.
Instead of only the k nearest neighbors, all training examples may be used (Shepard's method).

31 Distance Weighted Average
Weighting the data:
f^(x_q) = Σ_i f(x_i) K(d(x_i, x_q)) / Σ_i K(d(x_i, x_q))
The relevance of a data point (x_i, f(x_i)) is measured by the distance d(x_i, x_q) between the query x_q and the input vector x_i.
Weighting the error criterion:
E(x_q) = Σ_i (f^(x_q) - f(x_i))^2 K(d(x_i, x_q))
The best estimate f^(x_q) minimizes the cost E(x_q), therefore ∂E(x_q)/∂f^(x_q) = 0.
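
For completeness, the step from the weighted error criterion to the weighted-average estimate is one line of calculus (written here in LaTeX notation):

\frac{\partial E(x_q)}{\partial \hat f(x_q)} = 2 \sum_i \bigl(\hat f(x_q) - f(x_i)\bigr) K(d(x_i, x_q)) = 0
\;\Rightarrow\; \hat f(x_q) \sum_i K(d(x_i, x_q)) = \sum_i f(x_i) K(d(x_i, x_q))
\;\Rightarrow\; \hat f(x_q) = \frac{\sum_i f(x_i) K(d(x_i, x_q))}{\sum_i K(d(x_i, x_q))}

which is exactly the "weighting the data" estimate above.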

32 Kernel Functions

33 Distance Weighted NN
K(d(x_q, x_i)) = 1 / d(x_q, x_i)^2

34 Distance Weighted NN
K(d(x_q, x_i)) = 1 / (d_0 + d(x_q, x_i))^2

35 Distance Weighted NN
K(d(x_q, x_i)) = exp( -(d(x_q, x_i) / σ_0)^2 )
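
The three weighting kernels from slides 33-35, written out as plain functions for reference; the defaults for d_0 and σ_0 are placeholders, not values from the lecture.

import numpy as np

def k_inverse_square(d):
    return 1.0 / d ** 2                      # K(d) = 1 / d^2 (undefined at d = 0)

def k_shifted_inverse_square(d, d0=0.1):
    return 1.0 / (d0 + d) ** 2               # K(d) = 1 / (d0 + d)^2, finite at d = 0

def k_gaussian(d, sigma0=1.0):
    return np.exp(-(d / sigma0) ** 2)        # K(d) = exp(-(d / sigma0)^2)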

36 Linear Global Models
The model is linear in the parameters β_k, which can be estimated using a least-squares algorithm:
f^(x_i) = Σ_{k=1..D} β_k x_ki, or in matrix form F(X) = X β,
where x_i = (x_1, …, x_D)_i, i = 1..N, with D the input dimension and N the number of data points.
Estimate the β_k by minimizing the error criterion E = Σ_{i=1..N} (f^(x_i) - y_i)^2.
Normal equations: (X^T X) β = X^T F(X), so β = (X^T X)^{-1} X^T F(X),
i.e. β_k = Σ_{m=1..D} Σ_{n=1..N} (Σ_{l=1..D} x^T_kl x_lm)^{-1} x^T_mn f(x_n).
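
A sketch of the global linear fit, using np.linalg.lstsq instead of forming (X^T X)^{-1} explicitly (a numerically safer equivalent of the normal equations above); X is N x D, y has length N.

import numpy as np

def fit_global_linear(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves min_beta ||X beta - y||^2
    return beta

def predict_linear(X, beta):
    return X @ beta                                # f^(x_i) = sum_k beta_k x_ik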

37 Linear Regression Example

38 Linear Local Models
Estimate the parameters β_k so that they locally (near the query point x_q) match the training data, either
- by weighting the data: w_i = K(d(x_i, x_q))^(1/2), transforming z_i = w_i x_i and v_i = w_i y_i, or
- by weighting the error criterion: E = Σ_{i=1..N} (x_i^T β - y_i)^2 K(d(x_i, x_q)).
The problem is still linear in β, with least-squares solution β = ((WX)^T WX)^{-1} (WX)^T W F(X).
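
A sketch of the "weighting the data" route: weight each example by the square root of the kernel, then solve an ordinary least-squares problem in the transformed variables. The Gaussian kernel and the bandwidth h are assumed choices; the slide leaves the kernel unspecified here.

import numpy as np

def lwr_predict(X, y, x_q, h=0.2):
    x_q = np.asarray(x_q, dtype=float)
    Xb = np.hstack([X, np.ones((len(X), 1))])     # append 1 for the intercept b_0
    d = np.linalg.norm(X - x_q, axis=1)
    w = np.exp(-(d / h) ** 2) ** 0.5              # w_i = K(d(x_i, x_q))^(1/2)
    Z = w[:, None] * Xb                           # z_i = w_i x_i
    v = w * y                                     # v_i = w_i y_i
    beta, *_ = np.linalg.lstsq(Z, v, rcond=None)  # local linear model near x_q
    return float(np.append(x_q, 1.0) @ beta)      # evaluate the local model at x_q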

39 Linear Local Model Example
(Figure: query point x_q = 0.35, kernel K(x, x_q), local linear model f^(x) = b_1 x + b_0, prediction f^(x_q) = 0.266.)

40 Linear Local Model Example

41 Design Issues in Local Regression
- Local model order (constant, linear, quadratic)
- Distance function d with feature scaling: d(x, q) = ( Σ_{j=1..d} m_j (x_j - q_j)^2 )^(1/2); irrelevant dimensions get m_j = 0
- Kernel function K
- Smoothing parameter (bandwidth) h in K(d(x, q)/h):
  - h = |m|: global bandwidth
  - h = distance to the k-th nearest neighbor
  - h = h(q): depending on the query point
  - h = h_i: depending on the stored data points
See the paper by Atkeson [1996], "Locally Weighted Learning".

42 Local Linear Models

43 Local Linear Model Tree (LOLIMOT)
Incremental tree-construction algorithm: it partitions the input space by axis-orthogonal splits and adds one local linear model (LLM) per iteration.
1. Start with an initial model (e.g. a single LLM)
2. Identify the LLM with the worst model error E_i
3. Check all divisions: split the worst LLM's hyper-rectangle in halves along each possible dimension
4. Find the best (smallest-error) of the possible divisions
5. Add the new validity function and LLM
6. Repeat from step 2 until the termination criterion is met
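
A simplified sketch of this loop, assuming NumPy arrays: hard axis-aligned partitions stand in for LOLIMOT's smooth normalized-Gaussian validity functions, and each hyper-rectangle carries a local linear model fit by least squares, so this illustrates the construction only.

import numpy as np

def fit_local(X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    resid = y - Xb @ beta
    return beta, float(resid @ resid)                         # model and its squared error

def lolimot(X, y, n_models=4, min_points=5):
    lo, hi = X.min(axis=0), X.max(axis=0)
    boxes = [(lo, hi, *fit_local(X, y))]                      # step 1: one global LLM
    while len(boxes) < n_models:
        worst = max(range(len(boxes)), key=lambda i: boxes[i][3])   # step 2: worst LLM
        lo_w, hi_w, _, _ = boxes[worst]
        best = None
        for dim in range(X.shape[1]):                         # step 3: try every split
            cut = 0.5 * (lo_w[dim] + hi_w[dim])
            in_box = np.all((X >= lo_w) & (X <= hi_w), axis=1)
            left = in_box & (X[:, dim] <= cut)
            right = in_box & (X[:, dim] > cut)
            if left.sum() < min_points or right.sum() < min_points:
                continue
            bl, el = fit_local(X[left], y[left])
            br, er = fit_local(X[right], y[right])
            if best is None or el + er < best[0]:             # step 4: smallest error
                hi_l, lo_r = hi_w.copy(), lo_w.copy()
                hi_l[dim], lo_r[dim] = cut, cut
                best = (el + er, (lo_w, hi_l, bl, el), (lo_r, hi_w, br, er))
        if best is None:
            break                                             # no admissible split left
        boxes[worst:worst + 1] = [best[1], best[2]]           # step 5: add the two new LLMs
    return boxes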

44 LOLIMOT
(Figure: start from the initial global linear model, try a split along x1 or x2, and pick the split that minimizes the model error (residual).)

45 LOLIMOT Example

46

47 Lazy and Eager Learning
Lazy: wait for the query before generalizing (k-nearest neighbors, weighted linear regression).
Eager: generalize before seeing the query (radial basis function networks, decision trees, back-propagation, LOLIMOT).
The eager learner must create a global approximation; the lazy learner can create local approximations.
If they use the same hypothesis space, the lazy learner can represent more complex functions (H = linear functions).

