1
Support Vector Machines, Kernels, and Development of Representations
Tim Oates
Cognition, Robotics, and Learning (CORAL) Lab
University of Maryland Baltimore County
2
Outline
Prediction
Support vector machines
Kernels
Development of representations
3
Outline
Prediction: why might predictions be wrong?
Support vector machines: doing really well with linear models
Kernels: making the non-linear linear
Development of representations: learning representations by learning kernels; beyond the veil of perception
4
Prediction
5
Supervised ML = Prediction
Given training instances (x, y)
Learn a model f
Such that f(x) = y
Use f to predict y for new x
Many variations on this basic theme
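A minimal sketch of this loop in Python with scikit-learn; the synthetic dataset and the choice of an SVM as the model f are illustrative stand-ins, not part of the talk.

```python
# Learn a model f from labeled (x, y) pairs, then predict y for unseen x.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

f = SVC(kernel="linear")   # the model f
f.fit(X_train, y_train)    # learn f such that f(x) ~ y on the training data
print("accuracy on new x:", f.score(X_test, y_test))
```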
6
Why might predictions be wrong?
True non-determinism
Partial observability (hard, soft)
Representational bias
Algorithmic bias
Bounded resources
7
True Non-Determinism
Flip a biased coin with p(heads) = θ
Estimate θ from data
If the estimate is > 0.5 predict heads, else tails
Lots of ML research on problems like this
Learn a model
Do the best you can in expectation
8
Partial Observability
Something needed to predict y is missing from observation x
N-bit parity problem
x contains N-1 bits (hard PO)
x contains N bits but the learner ignores some of them (soft PO)
9
Representational Bias
Having the right features (x) is crucial
TD-Gammon
(figure: points labeled X O O O O X X X along a single axis, which no threshold on that axis can separate)
10
Development of Representations
The agent (e.g., a robot)
Fixed sensor suite
Multiple tasks
Changing environment
Interaction with humans
Sensors, concepts, beliefs, desires, intentions
Representations
Expressed solely in terms of observables
Hand-coded, thus fixed
11
Other Reasons for Wrong Predictions
Algorithmic bias
Bounded resources
12
Support Vector Machines
Doing Really Well with Linear Decision Surfaces
13
Strengths of SVMs
Good generalization in theory
Good generalization in practice
Work well with few training instances
Find globally best model
Efficient algorithms
Amenable to the kernel trick
14
Linear Separators
Training instances: x ∈ ℝⁿ, y ∈ {-1, 1}
Parameters: w ∈ ℝⁿ, b ∈ ℝ
Hyperplane
⟨w, x⟩ + b = 0
w₁x₁ + w₂x₂ + … + wₙxₙ + b = 0
Decision function
f(x) = sign(⟨w, x⟩ + b)
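The decision function can be written directly in a few lines of NumPy; in this sketch, w, b, and the test points are made-up values for illustration.

```python
# f(x) = sign(<w, x> + b): which side of the hyperplane does x fall on?
import numpy as np

def decide(w, b, x):
    return int(np.sign(np.dot(w, x) + b))

w = np.array([2.0, -1.0])   # normal vector of the hyperplane <w, x> + b = 0
b = -0.5
print(decide(w, b, np.array([1.0, 0.0])))   # prints  1
print(decide(w, b, np.array([0.0, 1.0])))   # prints -1
```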
15
Intuitions
(figure: X and O points in the plane with a candidate separating line)
16
Intuitions
(figure: the same points with a different candidate separator)
17
Intuitions
(figure: the same points with another candidate separator)
18
Intuitions
(figure: the same points with yet another candidate separator)
19
A “Good” Separator
(figure: a separator with room to spare between the X and O points)
20
Noise in the Observations
(figure: the same data with noise regions drawn around each point)
21
Ruling Out Some Separators
(figure: separators that pass through the noise regions are ruled out)
22
Lots of Noise
(figure: larger noise regions rule out still more separators)
23
Maximizing the Margin
(figure: the separator that maximizes the distance to the nearest X and O points)
24
“Fat” Separators
(figure: the separator drawn as a thick band between the two classes)
25
Why Maximize Margin?
Increasing the margin reduces capacity
Must restrict capacity to generalize
m training instances
2^m ways to label them
What if a function class can separate them all?
It shatters the training instances
VC dimension is the largest m such that the function class can shatter some set of m points
26
VC Dimension Example
(figure: three points in the plane shown with all 2³ = 8 possible X/O labelings, each one separable by a line)
27
Bounding Generalization Error
R[f] = risk, test error
R_emp[f] = empirical risk, training error
h = VC dimension
m = number of training instances
δ = probability that the bound does not hold
R[f] ≤ R_emp[f] + √( (h(ln(2m/h) + 1) + ln(4/δ)) / m )
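A small helper that evaluates this bound numerically; the values of h, m, and δ plugged in below are arbitrary examples, included only to show how the slack term shrinks as m grows.

```python
# R[f] <= R_emp[f] + sqrt((h*(ln(2m/h) + 1) + ln(4/delta)) / m)
import math

def vc_bound(train_error, h, m, delta):
    slack = math.sqrt((h * (math.log(2 * m / h) + 1) + math.log(4 / delta)) / m)
    return train_error + slack

print(vc_bound(train_error=0.05, h=10, m=1_000, delta=0.05))    # looser
print(vc_bound(train_error=0.05, h=10, m=100_000, delta=0.05))  # tighter
```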
28
Support Vectors
(figure: the maximum-margin separator with the X and O points that lie on the margin, the support vectors, highlighted)
29
The Math
Training instances: x ∈ ℝⁿ, y ∈ {-1, 1}
Decision function: f(x) = sign(⟨w, x⟩ + b), with w ∈ ℝⁿ, b ∈ ℝ
Find w and b that
Perfectly classify the training instances (assuming linear separability)
Maximize the margin
30
The Math
For perfect classification, we want
yᵢ(⟨w, xᵢ⟩ + b) ≥ 0 for all i
Why? The product is non-negative exactly when the sign of ⟨w, xᵢ⟩ + b matches the label yᵢ
To maximize the margin, we want the w that minimizes |w|²
31
Dual Optimization Problem
Maximize over α
W(α) = Σᵢ αᵢ - ½ Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ ⟨xᵢ, xⱼ⟩
Subject to
αᵢ ≥ 0
Σᵢ αᵢ yᵢ = 0
Decision function
f(x) = sign(Σᵢ αᵢ yᵢ ⟨xᵢ, x⟩ + b)
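A sketch checking this dual form of the decision function against scikit-learn's linear SVC; in scikit-learn, dual_coef_ stores αᵢyᵢ for the support vectors, and the blob dataset below is synthetic.

```python
# f(x) = sign(sum_i alpha_i y_i <x_i, x> + b), evaluated by hand over the
# support vectors and compared with the library's decision_function.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
svm = SVC(kernel="linear", C=1.0).fit(X, y)

manual = X @ svm.support_vectors_.T @ svm.dual_coef_.ravel() + svm.intercept_[0]
print(np.allclose(manual, svm.decision_function(X)))   # expect True
```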
32
What if Data Are Not Perfectly Linearly Separable?
Cannot find w and b that satisfy
yᵢ(⟨w, xᵢ⟩ + b) ≥ 1 for all i
Introduce slack variables ξᵢ
yᵢ(⟨w, xᵢ⟩ + b) ≥ 1 - ξᵢ for all i
Minimize
|w|² + C Σᵢ ξᵢ
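In libraries such as scikit-learn, this trade-off is exposed as the C parameter; a small sketch on synthetic overlapping data (the dataset and the values of C are illustrative).

```python
# Small C tolerates more slack (wider margin, more violations);
# large C penalizes slack heavily (narrower margin, fewer violations).
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=0)

for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {len(svm.support_)} support vectors")
```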
33
Strengths of SVMs
Good generalization in theory
Good generalization in practice
Work well with few training instances
Find globally best model
Efficient algorithms
Amenable to the kernel trick …
34
What if the Surface is Non-Linear?
(figure: a cluster of X points surrounded by O points, so no single line separates the two classes)
35
Kernel Methods
Making the Non-Linear Linear
36
When Linear Separators Fail
(figure: points labeled X O O O O X X X on the x₁ axis cannot be separated by any threshold; replotting them in the space (x₁, x₁²) makes them linearly separable)
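A sketch reproducing this example with assumed coordinates 1 through 8 for the eight points: a linear SVM cannot fit the raw x₁ values perfectly, but it separates the data once x₁² is added as a second feature.

```python
import numpy as np
from sklearn.svm import SVC

x = np.arange(1, 9, dtype=float)            # positions 1..8 (assumed)
y = np.array([1, -1, -1, -1, -1, 1, 1, 1])  # labels X O O O O X X X

X_1d = x.reshape(-1, 1)                     # original representation: x1 only
X_2d = np.column_stack([x, x ** 2])         # new representation: (x1, x1^2)

print(SVC(kernel="linear", C=1e6).fit(X_1d, y).score(X_1d, y))  # below 1.0
print(SVC(kernel="linear", C=1e6).fit(X_2d, y).score(X_2d, y))  # 1.0
```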
37
Mapping into a New Feature Space
Rather than run the SVM on xᵢ, run it on Φ(xᵢ)
A linear separator in the new space is a non-linear separator in the input space
What if Φ(xᵢ) is really big?
Use kernels to compute it implicitly!
Φ: X → F, x ↦ Φ(x)
Φ(x₁, x₂) = (x₁, x₂, x₁², x₂², x₁x₂)
38
Kernels
Find a kernel K such that
K(x₁, x₂) = ⟨Φ(x₁), Φ(x₂)⟩
Computing K(x₁, x₂) should be efficient, much more so than computing Φ(x₁) and Φ(x₂)
Use K(x₁, x₂) in the SVM algorithm rather than ⟨Φ(x₁), Φ(x₂)⟩
Remarkably, this is possible
39
The Polynomial Kernel
K(x₁, x₂) = ⟨x₁, x₂⟩²
x₁ = (x₁₁, x₁₂)
x₂ = (x₂₁, x₂₂)
⟨x₁, x₂⟩ = (x₁₁x₂₁ + x₁₂x₂₂)
⟨x₁, x₂⟩² = (x₁₁²x₂₁² + x₁₂²x₂₂² + 2x₁₁x₁₂x₂₁x₂₂)
Φ(x₁) = (x₁₁², x₁₂², √2·x₁₁x₁₂)
Φ(x₂) = (x₂₁², x₂₂², √2·x₂₁x₂₂)
K(x₁, x₂) = ⟨Φ(x₁), Φ(x₂)⟩
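A quick numerical check of this identity; the two example vectors are arbitrary.

```python
# <x1, x2>^2 computed in the 2-D input space equals <phi(x1), phi(x2)>
# computed in the 3-D feature space.
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x1 = np.array([3.0, -1.0])
x2 = np.array([2.0, 5.0])

print(np.dot(x1, x2) ** 2)        # kernel value, no feature map needed
print(np.dot(phi(x1), phi(x2)))   # same value via the explicit feature map
```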
40
The Polynomial Kernel
Φ(x) contains all monomials of degree d
Useful in visual pattern recognition
Number of monomials
16x16 pixel image
~10¹⁰ monomials of degree 5
Never explicitly compute Φ(x)!
Variation: K(x₁, x₂) = (⟨x₁, x₂⟩ + 1)²
41
Kernels
What does it mean to be a kernel?
K(x₁, x₂) = ⟨Φ(x₁), Φ(x₂)⟩ for some Φ
What does it take to be a kernel?
The Gram matrix Gᵢⱼ = K(xᵢ, xⱼ)
Positive definite matrix: Σᵢⱼ cᵢcⱼGᵢⱼ ≥ 0 for all cᵢ, cⱼ ∈ ℝ
Positive definite kernel: for all samples of size m, it induces a positive definite Gram matrix
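A sketch of this test: build the Gram matrix for a random sample under a Gaussian kernel and confirm its eigenvalues are non-negative up to numerical error (the sample and kernel choice are illustrative).

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
G = np.array([[gaussian_kernel(a, b) for b in X] for a in X])  # Gram matrix

print("smallest eigenvalue:", np.linalg.eigvalsh(G).min())  # ~0 or positive
```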
42
A Few Kernels
Dot product kernel: K(x₁, x₂) = ⟨x₁, x₂⟩
Polynomial kernel: K(x₁, x₂) = ⟨x₁, x₂⟩^d (monomials of degree d)
Gaussian kernel: K(x₁, x₂) = exp(-|x₁ - x₂|²/2σ²) (radial basis functions)
Sigmoid kernel: K(x₁, x₂) = tanh(κ⟨x₁, x₂⟩ + θ) (neural networks)
Establishing “kernel-hood” from first principles is non-trivial
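The same kernels written as plain Python functions for experimentation; κ and θ in the sigmoid kernel are free parameters, and that kernel is positive definite only for some settings of them.

```python
import numpy as np

def dot_kernel(x1, x2):
    return np.dot(x1, x2)

def polynomial_kernel(x1, x2, d=2):
    return np.dot(x1, x2) ** d

def gaussian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x1, x2, kappa=1.0, theta=0.0):
    return np.tanh(kappa * np.dot(x1, x2) + theta)

x1, x2 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x1, x2), gaussian_kernel(x1, x2))
```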
43
The Kernel Trick
“Given an algorithm which is formulated in terms of a positive definite kernel K₁, one can construct an alternative algorithm by replacing K₁ with another positive definite kernel K₂”
SVMs can use the kernel trick
44
Exotic Kernels
Strings
Trees
Graphs
The hard part is establishing kernel-hood
45
Development of Representations
46
Motivation, Again
The agent (e.g., a robot)
Fixed sensor suite
Multiple tasks
Changing environment
Interaction with humans
Sensors, concepts, beliefs, desires, intentions
Representations
Expressed solely in terms of observables
Hand-coded, thus fixed
47
An Old Idea
Constructive induction
Search over feature space
Generate new features from existing ones
f_new = f₁ * f₂
Test new features by running the learning algorithm
Retain those that improve performance
48
Making an Old Idea New
Co-chaired Workshop on Development of Representations at ICML 2002
Conclusions
Work in this area is vitally important
Hand-coded representations
Keep humans in the loop
Limited time and ingenuity
TD-Gammon
Need big success to draw more attention from community
49
Change of Feature Space
K(x,y) = αK₁(x,y) + (1 - α)K₂(x,y)
K(x,y) = aK₁(x,y), a > 0
K(x,y) = K₁(x,y) K₂(x,y)
K(x,y) = f(x) f(y)
K(x,y) = K₃(Φ(x), Φ(y))
… and so on
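A sketch of these combination rules as higher-order functions that take kernels and return new kernels; the function names are mine, not from the talk.

```python
import numpy as np

def convex_combination(k1, k2, alpha):      # alpha * K1 + (1 - alpha) * K2
    return lambda x, y: alpha * k1(x, y) + (1 - alpha) * k2(x, y)

def scale(k1, a):                           # a * K1, for a > 0
    return lambda x, y: a * k1(x, y)

def product(k1, k2):                        # K1 * K2
    return lambda x, y: k1(x, y) * k2(x, y)

def from_function(f):                       # f(x) * f(y)
    return lambda x, y: f(x) * f(y)

def compose(k3, phi):                       # K3(phi(x), phi(y))
    return lambda x, y: k3(phi(x), phi(y))

# Example: mix a linear and a squared-distance-based kernel.
linear = lambda x, y: float(np.dot(x, y))
rbf = lambda x, y: float(np.exp(-np.sum((x - y) ** 2)))
mixed = convex_combination(linear, rbf, alpha=0.3)
print(mixed(np.array([1.0, 0.0]), np.array([0.0, 1.0])))
```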
50
Searching Through Feature Spaces (not Feature Space)
Each kernel corresponds to a distinct feature space
Search over kernel space (feature spaces)
Start with known kernels
Operators generate new (composite) kernels
Evaluate according to performance
Keep the learning algorithm constant, vary the representation
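A rough sketch of such a search: hold the SVM fixed, enumerate candidate (composite) kernels, and score each by cross-validation. The candidate set and dataset are placeholders; it relies on scikit-learn accepting a callable kernel that maps two data matrices to a Gram matrix.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

def linear(A, B):
    return A @ B.T

def poly2(A, B):
    return (A @ B.T) ** 2

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

candidates = {
    "linear": linear,
    "poly2": poly2,
    "rbf": rbf,
    "0.5*poly2 + 0.5*rbf": lambda A, B: 0.5 * poly2(A, B) + 0.5 * rbf(A, B),
}

# Fixed learner, varying representation (kernel); keep the best-scoring kernel.
for name, k in candidates.items():
    score = cross_val_score(SVC(kernel=k), X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```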
51
Current Work in Lab
Identify/construct datasets where existing kernels are insufficient
Determine conditions under which this occurs
Search kernel space by hand
Implement automated search methods
Developing naturally occurring datasets
Robotic applications
Methods for triggering search over kernel space
52
Beyond the Veil of Perception
Human perception is limited
Sight, hearing, taste, smell, touch
We overcome this limitation
Genes, atoms, gravity, tectonic plates, germs, dark matter, electricity, black holes
We can’t see, hear, taste, smell, or touch a black hole
Most physicists believe they exist
Theoretical entities
Causally efficacious entities of the world that cannot be sensed directly
53
Our Claim
Theoretical entities are of fundamental importance to the development of knowledge in individuals
Core concepts in which all others ground out
Weight, distance, velocity, orientation, animacy
Beliefs, desires, intentions - theory of mind
Robots need algorithms for discovering, validating, and using theoretical entities if they are to move beyond the veil of perception
54
Discovering Theoretical Entities
Actively explore environment
Child/robot as scientist
Posit TEs to explain non-determinism in action outcomes
Keep those that are mutually predictive
Flipping a coin
Posit a bit somewhere in the universe whose value determines the outcome
Flip the coin to determine the current value of that bit
Use the bit’s value to make other predictions
55
The Concept “Weight”
Do you have a weight sensor?
Are children born with knowledge that objects have a property called weight?
How does this concept arise?
We claim that “weight” is a theoretical entity, just like dark matter and black holes
56
The Crate-Stacking Robot
(figure: diagram linking an unobserved weight variable (heavy/light) to observable action outcomes: how fast a pushed crate slides (fast/slow), the force required (high/low), and whether a stack topples or is stable)
57
What Did the Robot Accomplish?
Prediction
Push crate and predict outcome of stacking
Control
Build taller, more stable stacks
Understanding
Acquired fundamental concept about the world, grounded in action
Extended innate sensory endowment
“Weight” cannot be represented as a simple combination of sensor data
58
Triggering Search for Kernels
Why might predictions about action outcomes be wrong?
Recall first few slides
TEs directly address partial observability
If the representation is wrong, action outcomes that are deterministic may appear to be random
Use TEs to trigger search for a new feature space that renders these outcomes (nearly) deterministic
59
Conclusion
SVMs find optimal linear separator
The kernel trick makes SVMs non-linear learning algorithms
Development of representations
Searching through feature spaces
Triggering search with theoretical entities