
1 Machine Learning: k-Nearest Neighbor and Support Vector Machines (skim 20.4, 20.6-20.7) CMSC 471

2 Revised End-of-Semester Schedule
Wed 11/21  Machine Learning IV
Mon 11/26  Philosophy of AI (You must read the three articles!)
Wed 11/28  Special Topics
Mon 12/3   Special Topics
Wed 12/5   Review / Tournament dry run #2 (HW6 due)
Mon 12/10  Tournament
Wed 12/19  FINAL EXAM (1:00pm - 3:00pm) (Project and final report due)
NO LATE SUBMISSIONS ALLOWED!
Special Topics: Robotics, AI in Games, Natural language processing, Multi-agent systems

3 k-Nearest Neighbor: Instance-Based Learning. Some material adapted from slides by Andrew Moore, CMU. Visit http://www.autonlab.org/tutorials/ for Andrew’s repository of Data Mining tutorials.

4 1-Nearest Neighbor One of the simplest of all machine learning classifiers. Simple idea: label a new point the same as the closest known point. Label it red.

5 1-Nearest Neighbor A type of instance-based learning, also known as “memory-based” learning. Forms a Voronoi tessellation of the instance space.

6 Distance Metrics Different metrics can change the decision surface. Standard Euclidean distance metric: Two-dimensional: Dist(a,b) = sqrt((a1 - b1)^2 + (a2 - b2)^2). Multivariate: Dist(a,b) = sqrt(∑ (ai - bi)^2). Other metrics weight the features differently, e.g. Dist(a,b) = (a1 - b1)^2 + (a2 - b2)^2 versus Dist(a,b) = (a1 - b1)^2 + (3a2 - 3b2)^2, which scales the second feature by 3. Adapted from “Instance-Based Learning” lecture slides by Andrew Moore, CMU.
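A minimal Python sketch (mine, not from the slides) of these metrics; the scaled_euclidean weights mirror the (a1 - b1)^2 + (3a2 - 3b2)^2 example above and show how rescaling one feature can change which stored point is nearest, and therefore which label 1-NN predicts:

```python
import math

def euclidean(a, b):
    # Standard multivariate Euclidean distance: sqrt(sum_i (a_i - b_i)^2)
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def scaled_euclidean(a, b, weights):
    # Same idea with rescaled axes; weights = (1, 3) mirrors the
    # (a1 - b1)^2 + (3a2 - 3b2)^2 example (before the square root).
    return math.sqrt(sum((w * (ai - bi)) ** 2 for ai, bi, w in zip(a, b, weights)))

stored = [((0.0, 0.0), "red"), ((2.0, 0.6), "blue")]
query = (1.2, 0.0)

# The nearest stored point (and hence the 1-NN label) flips when the metric changes.
print(min(stored, key=lambda p: euclidean(p[0], query))[1])                 # blue
print(min(stored, key=lambda p: scaled_euclidean(p[0], query, (1, 3)))[1])  # red
```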

7 Four Aspects of an Instance-Based Learner:
- A distance metric
- How many nearby neighbors to look at?
- A weighting function (optional)
- How to fit with the local points?
Adapted from “Instance-Based Learning” lecture slides by Andrew Moore, CMU.

8 1-NN’s Four Aspects as an Instance-Based Learner:
- A distance metric: Euclidean
- How many nearby neighbors to look at? One
- A weighting function (optional): Unused
- How to fit with the local points? Just predict the same output as the nearest neighbor.
Adapted from “Instance-Based Learning” lecture slides by Andrew Moore, CMU.

9 Zen Gardens Mystery of renowned zen garden revealed [CNN Article] Thursday, September 26, 2002 Posted: 10:11 AM EDT (1411 GMT) LONDON (Reuters) -- For centuries visitors to the renowned Ryoanji Temple garden in Kyoto, Japan have been entranced and mystified by the simple arrangement of rocks. The five sparse clusters on a rectangle of raked gravel are said to be pleasing to the eyes of the hundreds of thousands of tourists who visit the garden each year. Scientists in Japan said on Wednesday they now believe they have discovered its mysterious appeal. "We have uncovered the implicit structure of the Ryoanji garden's visual ground and have shown that it includes an abstract, minimalist depiction of natural scenery," said Gert Van Tonder of Kyoto University. The researchers discovered that the empty space of the garden evokes a hidden image of a branching tree that is sensed by the unconscious mind. "We believe that the unconscious perception of this pattern contributes to the enigmatic appeal of the garden," Van Tonder added. He and his colleagues believe that whoever created the garden during the Muromachi era between 1333-1573 knew exactly what they were doing and placed the rocks around the tree image. By using a concept called medial-axis transformation, the scientists showed that the hidden branched tree converges on the main area from which the garden is viewed. The trunk leads to the prime viewing site in the ancient temple that once overlooked the garden. It is thought that abstract art may have a similar impact. "There is a growing realisation that scientific analysis can reveal unexpected structural features hidden in controversial abstract paintings," Van Tonder said. Adapted from “Instance-Based Learning” lecture slides by Andrew Moore, CMU.

10 k-Nearest Neighbor Generalizes 1-NN to smooth away noise in the labels. A new point is now assigned the most frequent label of its k nearest neighbors. Label it red when k = 3; label it blue when k = 7.
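A short k-NN sketch in Python (illustrative code, not from the slides; knn_classify is a made-up name) that reproduces the "red when k = 3, blue when k = 7" behavior on a toy training set:

```python
from collections import Counter
import math

def knn_classify(query, training, k=3):
    """Label `query` with the most frequent label among its k nearest
    training points (1-NN is the special case k = 1)."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(training, key=lambda point: dist(point[0], query))[:k]
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

# Three red points near the query, four blue points farther away.
training = [((0.0, 0.0), "red"), ((0.5, 0.1), "red"), ((0.2, 0.4), "red"),
            ((1.0, 1.0), "blue"), ((1.1, 0.9), "blue"), ((0.9, 1.2), "blue"),
            ((1.2, 1.1), "blue")]
print(knn_classify((0.3, 0.2), training, k=3))  # red  (3 nearest are all red)
print(knn_classify((0.3, 0.2), training, k=7))  # blue (4 blue outvote 3 red)
```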

11 k-Nearest Neighbor (k = 9) A magnificent job of noise smoothing. Three cheers for 9-nearest-neighbor. But the lack of gradients and the jerkiness isn’t good. Appalling behavior! Loses all the detail that 1-nearest-neighbor would give. The tails are horrible! Fits much less of the noise, captures trends. But still, frankly, pathetic compared with linear regression. Adapted from “Instance-Based Learning” lecture slides by Andrew Moore, CMU.

12 Support Vector Machines and Kernels: Doing Really Well with Linear Decision Surfaces. Adapted from slides by Tim Oates, Cognition, Robotics, and Learning (CORAL) Lab, University of Maryland, Baltimore County.

13 Outline
- Prediction: why might predictions be wrong?
- Support vector machines: doing really well with linear models
- Kernels: making the non-linear linear

14 Supervised ML = Prediction Given training instances (x, y), learn a model f such that f(x) = y, then use f to predict y for new x. Many variations on this basic theme.

15 Why might predictions be wrong? True non-determinism: flip a biased coin with p(heads) = θ. Estimate θ; if θ > 0.5 predict heads, else tails. Lots of ML research on problems like this: learn a model and do the best you can in expectation.

16 Why might predictions be wrong? Partial observability: something needed to predict y is missing from observation x. Example: the N-bit parity problem, where x contains only N-1 bits (hard PO), or x contains all N bits but the learner ignores some of them (soft PO).
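A tiny Python illustration (my own, under the parity setup just described) of hard partial observability: two instances that agree on the observable N-1 bits carry opposite labels, so no learner can predict parity from x alone.

```python
def parity(bits):
    # Label y = 1 if the number of 1-bits is odd, else 0.
    return sum(bits) % 2

a = (1, 0, 1, 1)   # parity 1
b = (1, 0, 1, 0)   # parity 0
# The observable x is only the first N-1 bits, which are identical here,
# so the two labels are indistinguishable from x.
print(a[:-1] == b[:-1], parity(a), parity(b))  # True 1 0
```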

17 Why might predictions be wrong?
- True non-determinism
- Partial observability (hard, soft)
- Representational bias
- Algorithmic bias
- Bounded resources

18 Representational Bias Having the right features (x) is crucial. (Figure: the same X and O points shown under two different feature representations; with the right features the classes become linearly separable.)

19 Support Vector Machines Doing Really Well with Linear Decision Surfaces

20 Strengths of SVMs
- Good generalization in theory
- Good generalization in practice
- Work well with few training instances
- Find globally best model
- Efficient algorithms
- Amenable to the kernel trick

21 Linear Separators
Training instances: x ∈ ℝ^n, y ∈ {-1, 1}; parameters w ∈ ℝ^n, b ∈ ℝ.
Hyperplane: ⟨w, x⟩ + b = 0, i.e. w1 x1 + w2 x2 + … + wn xn + b = 0.
Decision function: f(x) = sign(⟨w, x⟩ + b).
Math review. Inner (dot) product: ⟨a, b⟩ = a · b = ∑ ai bi = a1 b1 + a2 b2 + … + an bn.
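A direct Python transcription of the decision function above (a sketch; the weights and points are arbitrary illustrative values):

```python
def dot(a, b):
    # Inner product <a, b> = sum_i a_i * b_i
    return sum(ai * bi for ai, bi in zip(a, b))

def decision(w, b, x):
    # f(x) = sign(<w, x> + b), returning labels in {-1, +1}
    return 1 if dot(w, x) + b >= 0 else -1

w, b = (2.0, -1.0), -0.5           # the hyperplane 2*x1 - x2 - 0.5 = 0
print(decision(w, b, (1.0, 0.0)))  # +1 (on the positive side)
print(decision(w, b, (0.0, 1.0)))  # -1 (on the negative side)
```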

22 Intuitions (figure: X and O training points)

23 Intuitions (figure: X and O training points)

24 Intuitions (figure: X and O training points)

25 Intuitions (figure: X and O training points)

26 A “Good” Separator (figure)

27 Noise in the Observations (figure)

28 Ruling Out Some Separators (figure)

29 Lots of Noise (figure)

30 Maximizing the Margin (figure)

31 “Fat” Separators (figure)

32 Why Maximize Margin? Increasing the margin reduces capacity, and we must restrict capacity to generalize. With m training instances there are 2^m ways to label them. What if the function class can separate all of them? Then it shatters the training instances. The VC dimension is the largest m such that the function class can shatter some set of m points.

33 VC Dimension Example (figure: all 2^3 = 8 ways to label three points with X and O: XXX, OXX, XOX, XXO, OOX, OXO, XOO, OOO. Linear separators can realize every one of these labelings for three points in general position, so their VC dimension is at least 3.)

34 Bounding Generalization Error
R[f] = risk (test error); R_emp[f] = empirical risk (training error); h = VC dimension; m = number of training instances; δ = probability that the bound does not hold.
With probability at least 1 - δ:
R[f] ≤ R_emp[f] + sqrt( (h (ln(2m/h) + 1) + ln(4/δ)) / m )
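A small numeric sketch of the bound as reconstructed above (vc_bound is my own helper name and the numbers are purely illustrative), showing that a larger VC dimension h loosens the guarantee:

```python
import math

def vc_bound(train_error, h, m, delta):
    # R[f] <= R_emp[f] + sqrt( (h*(ln(2m/h) + 1) + ln(4/delta)) / m )
    slack = math.sqrt((h * (math.log(2 * m / h) + 1) + math.log(4 / delta)) / m)
    return train_error + slack

# Lower VC dimension (e.g. obtained by a larger margin) gives a tighter bound.
print(vc_bound(0.05, h=10, m=1000, delta=0.05))   # ~0.31
print(vc_bound(0.05, h=100, m=1000, delta=0.05))  # ~0.69
```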

35 Support Vectors (figure: X and O training points; the points nearest the separating hyperplane are the support vectors)

36 The Math Training instances: x ∈ ℝ^n, y ∈ {-1, 1}. Decision function: f(x) = sign(⟨w, x⟩ + b), with w ∈ ℝ^n, b ∈ ℝ. Find w and b that perfectly classify the training instances (assuming linear separability) and maximize the margin.

37 The Math For perfect classification, we want y_i (⟨w, x_i⟩ + b) ≥ 0 for all i. Why? To maximize the margin, we want the w that minimizes |w|^2.
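Written out in the usual normalized form (a standard formulation, not copied from these slides), the resulting hard-margin problem is:

```latex
\begin{aligned}
\min_{w,\,b}\quad & \tfrac{1}{2}\lVert w \rVert^2 \\
\text{subject to}\quad & y_i\left(\langle w, x_i\rangle + b\right) \ge 1, \qquad i = 1,\dots,m
\end{aligned}
```

The "≥ 1" normalization fixes the scale of w, so minimizing |w|^2 is the same as maximizing the geometric margin 1/|w|.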

38 Dual Optimization Problem
Maximize over α:
W(α) = Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩
Subject to:
α_i ≥ 0
Σ_i α_i y_i = 0
Decision function: f(x) = sign(Σ_i α_i y_i ⟨x_i, x⟩ + b)

39 What if Data Are Not Perfectly Linearly Separable? We cannot find w and b that satisfy y_i (⟨w, x_i⟩ + b) ≥ 1 for all i. Introduce slack variables ξ_i and require y_i (⟨w, x_i⟩ + b) ≥ 1 - ξ_i for all i. Minimize |w|^2 + C Σ_i ξ_i.
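For completeness, the soft-margin problem these constraints lead to, in its conventional statement (not the slides' own wording):

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w \rVert^2 + C\sum_{i=1}^{m}\xi_i \\
\text{subject to}\quad & y_i\left(\langle w, x_i\rangle + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0, \qquad i = 1,\dots,m
\end{aligned}
```

Larger C penalizes margin violations more heavily; smaller C tolerates more slack in exchange for a wider margin.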

40 Strengths of SVMs
- Good generalization in theory
- Good generalization in practice
- Work well with few training instances
- Find globally best model
- Efficient algorithms
- Amenable to the kernel trick …

41 What if the Surface is Non-Linear? (figure: X and O points that no straight line can separate) Image from http://www.atrandomresearch.com/iclass/

42 Kernel Methods Making the Non-Linear Linear

43 When Linear Separators Fail (figure: X and O points plotted as x1 vs. x2 are not linearly separable, but plotting x1 vs. x1^2 makes them separable)

44 Mapping into a New Feature Space Rather than run the SVM on x_i, run it on Φ(x_i). Find a non-linear separator in input space. What if Φ(x_i) is really big? Use kernels to compute it implicitly! Φ: x → X = Φ(x), e.g. Φ(x1, x2) = (x1, x2, x1^2, x2^2, x1 x2). Image from http://web.engr.oregonstate.edu/~afern/classes/cs534/

45 Kernels Find a kernel K such that K(x1, x2) = ⟨Φ(x1), Φ(x2)⟩. Computing K(x1, x2) should be efficient, much more so than computing Φ(x1) and Φ(x2). Use K(x1, x2) in the SVM algorithm rather than ⟨Φ(x1), Φ(x2)⟩. Remarkably, this is possible.

46 The Polynomial Kernel
K(x1, x2) = ⟨x1, x2⟩^2
x1 = (x11, x12), x2 = (x21, x22)
⟨x1, x2⟩ = (x11 x21 + x12 x22)
⟨x1, x2⟩^2 = (x11^2 x21^2 + x12^2 x22^2 + 2 x11 x12 x21 x22)
Φ(x1) = (x11^2, x12^2, √2 x11 x12)
Φ(x2) = (x21^2, x22^2, √2 x21 x22)
K(x1, x2) = ⟨Φ(x1), Φ(x2)⟩
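A quick numeric check in Python (my own sketch) that the degree-2 polynomial kernel really equals the inner product in the expanded feature space:

```python
import math

def phi(x):
    # Feature map for the degree-2 polynomial kernel on 2-D inputs:
    # (x1^2, x2^2, sqrt(2)*x1*x2)
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly2_kernel(x1, x2):
    # K(x1, x2) = <x1, x2>^2, computed without ever forming phi(x)
    return dot(x1, x2) ** 2

x1, x2 = (1.0, 2.0), (3.0, -1.0)
print(poly2_kernel(x1, x2))   # 1.0
print(dot(phi(x1), phi(x2)))  # ~1.0 (equal up to floating-point rounding)
```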

47 The Polynomial Kernel Φ(x) contains all monomials of degree d. Useful in visual pattern recognition: a 16x16 pixel image has roughly 10^10 monomials of degree 5. Never explicitly compute Φ(x)! Variation: K(x1, x2) = (⟨x1, x2⟩ + 1)^2.

48 Kernels What does it mean to be a kernel? K(x1, x2) = ⟨Φ(x1), Φ(x2)⟩ for some Φ. What does it take to be a kernel? Consider the Gram matrix G_ij = K(x_i, x_j). Positive definite matrix: Σ_ij c_i c_j G_ij ≥ 0 for all c_i, c_j ∈ ℝ. Positive definite kernel: for all samples of size m, it induces a positive definite Gram matrix.

49 A Few Good Kernels
- Dot product kernel: K(x1, x2) = ⟨x1, x2⟩
- Polynomial kernel: K(x1, x2) = ⟨x1, x2⟩^d (monomials of degree d); K(x1, x2) = (⟨x1, x2⟩ + 1)^d (all monomials of degree 1, 2, …, d)
- Gaussian kernel: K(x1, x2) = exp(-|x1 - x2|^2 / 2σ^2) (radial basis functions)
- Sigmoid kernel: K(x1, x2) = tanh(a⟨x1, x2⟩ + b), for constants a, b (neural networks)
Establishing “kernel-hood” from first principles is non-trivial.
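A hedged sketch (using numpy; the sample points and parameter names like sigma are illustrative choices, not from the slides) implementing two of these kernels and numerically checking that the Gram matrices they induce have no negative eigenvalues:

```python
import numpy as np

def poly_kernel(x1, x2, d=2):
    # (<x1, x2> + 1)^d: all monomials of degree 1, ..., d
    return (np.dot(x1, x2) + 1) ** d

def gaussian_kernel(x1, x2, sigma=1.0):
    # exp(-|x1 - x2|^2 / (2 sigma^2))
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

def gram(X, k):
    # Gram matrix G_ij = K(x_i, x_j) for the sample X
    return np.array([[k(xi, xj) for xj in X] for xi in X])

X = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, -1.0], [0.5, 2.0]])
for kernel in (poly_kernel, gaussian_kernel):
    eigenvalues = np.linalg.eigvalsh(gram(X, kernel))
    # A valid (positive definite) kernel never produces a negative eigenvalue.
    print(kernel.__name__, eigenvalues.min() >= -1e-9)
```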

50 The Kernel Trick “Given an algorithm which is formulated in terms of a positive definite kernel K1, one can construct an alternative algorithm by replacing K1 with another positive definite kernel K2.” SVMs can use the kernel trick.

51 Using a Different Kernel in the Dual Optimization Problem
For example, use the polynomial kernel with d = 4 (including lower-order terms).
Maximize over α:
W(α) = Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩
Subject to:
α_i ≥ 0
Σ_i α_i y_i = 0
Decision function: f(x) = sign(Σ_i α_i y_i ⟨x_i, x⟩ + b)
The inner products ⟨x_i, x_j⟩ and ⟨x_i, x⟩ are kernels! So by the kernel trick, we just replace them with (⟨x_i, x_j⟩ + 1)^4 and (⟨x_i, x⟩ + 1)^4.
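A sketch of how the kernelized decision function is evaluated (the multipliers, labels, and bias below are placeholder values chosen only to show the mechanics; in practice they come from solving the dual):

```python
def poly4_kernel(x1, x2):
    # The (<x1, x2> + 1)^4 kernel used in the example above
    return (sum(a * b for a, b in zip(x1, x2)) + 1) ** 4

def decision(x, support, alphas, labels, b, kernel):
    # f(x) = sign( sum_i alpha_i * y_i * K(x_i, x) + b )
    s = sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, labels, support)) + b
    return 1 if s >= 0 else -1

# Placeholder support vectors, multipliers, and labels.
support = [(0.0, 1.0), (1.0, 0.0)]
alphas, labels, b = [0.5, 0.5], [+1, -1], 0.0
print(decision((0.2, 0.8), support, alphas, labels, b, poly4_kernel))  # +1
```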

52 Exotic Kernels Strings, trees, graphs, … The hard part is establishing kernel-hood.

53 Conclusion SVMs find an optimal linear separator. The kernel trick makes SVMs non-linear learning algorithms.

