Neural networks and support vector machines


Perceptron: inputs x1, …, xD with weights w1, …, wD; output sgn(w·x + b). The bias b can be incorporated as a component of the weight vector by always including a feature with value fixed to 1.
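
A minimal sketch of this prediction rule in NumPy; the function names and the bias-as-extra-feature convention are illustrative, not taken from the slides:

```python
import numpy as np

def augment(x):
    """Append a constant 1 so the bias becomes the last weight component."""
    return np.append(x, 1.0)

def perceptron_predict(w, x):
    """Linear threshold unit: predict +1 or -1 as sgn(w . x)."""
    return 1 if np.dot(w, augment(x)) >= 0 else -1
```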

Loose inspiration: Human neurons From Wikipedia: At the majority of synapses, signals are sent from the axon of one neuron to a dendrite of another... All neurons are electrically excitable, maintaining voltage gradients across their membranes… If the voltage changes by a large enough amount, an all-or-none electrochemical pulse called an action potential is generated, which travels rapidly along the cell's axon, and activates synaptic connections with other cells when it arrives.

Linear separability

Perceptron training algorithm: initialize the weights; cycle through the training examples in multiple passes (epochs); for each training example, if it is classified correctly, do nothing; if it is classified incorrectly, update the weights.

Perceptron update rule. For each training instance x with label y: classify with the current weights: y’ = sgn(w·x). Update the weights: w ← w + α(y − y’)x, where α is a learning rate that should decay as a function of epoch t, e.g., 1000/(1000 + t). If y’ is correct, the update is zero. Otherwise, consider what happens to individual weights: wi ← wi + α(y − y’)xi. If y = 1 and y’ = −1, wi is increased if xi is positive and decreased if xi is negative, so w·x gets bigger. If y = −1 and y’ = 1, wi is decreased if xi is positive and increased if xi is negative, so w·x gets smaller.
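
A sketch of the full training loop in NumPy, assuming labels in {−1, +1} and a decaying learning rate of the form above; the array layout and the fixed number of epochs are illustrative choices:

```python
import numpy as np

def train_perceptron(X, y, n_epochs=100):
    """X: (n_samples, n_features) array; y: labels in {-1, +1}."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])   # bias as an extra feature
    w = np.zeros(X.shape[1])                        # initialize weights to zero
    for t in range(n_epochs):
        alpha = 1000.0 / (1000.0 + t)               # learning rate decays with epoch
        for i in np.random.permutation(len(y)):     # random order each epoch
            y_pred = 1 if np.dot(w, X[i]) >= 0 else -1
            if y_pred != y[i]:                      # update only on mistakes
                w += alpha * (y[i] - y_pred) * X[i]
    return w
```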

Convergence of the perceptron update rule. Linearly separable data: converges to a perfect solution. Non-separable data: converges to a minimum-error solution, assuming the learning rate decays as O(1/t) and examples are presented in random sequence.

Implementation details Bias (add feature dimension with value fixed to 1) vs. no bias Initialization of weights: all zeros vs. random Learning rate decay function Number of epochs (passes through the training data) Order of cycling through training examples (random)

Multi-class perceptrons. One-vs-others framework: keep a weight vector wc for each class c. Decision rule: c = argmax_c wc·x. Update rule: suppose an example from class c gets misclassified as c’. Update for c: wc ← wc + αx. Update for c’: wc’ ← wc’ − αx.
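
A sketch of the multi-class decision and update, storing one weight vector per class as the rows of a matrix; the matrix layout and function names are illustrative:

```python
import numpy as np

def predict_class(W, x):
    """W: (n_classes, n_features) matrix, one weight vector per class."""
    return int(np.argmax(W @ x))

def multiclass_update(W, x, c_true, alpha=1.0):
    """Perceptron update when an example from class c_true is misclassified."""
    c_pred = predict_class(W, x)
    if c_pred != c_true:
        W[c_true] += alpha * x   # pull the true class's score up
        W[c_pred] -= alpha * x   # push the wrongly predicted class's score down
    return W
```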

Multi-class perceptrons. One-vs-others framework: keep a weight vector wc for each class c. Decision rule: c = argmax_c wc·x. Softmax: [figure of inputs feeding perceptrons with weights wc, followed by a max/softmax layer].

Visualizing linear classifiers Source: Andrej Karpathy, http://cs231n.github.io/linear-classify/

Differentiable perceptron: inputs x1, …, xd with weights w1, …, wd; output σ(w·x + b), where σ is the sigmoid function σ(t) = 1 / (1 + e^(−t)).

Update rule for differentiable perceptron. Define the total classification error or loss on the training set; update the weights by gradient descent. [Figure: the loss surface over weights w1, w2.]

Update rule for differentiable perceptron. Define the total classification error or loss on the training set; update the weights by gradient descent. For a single training point, the update is a gradient step on that point's loss alone.

Update rule for differentiable perceptron. For a single training point, the update is a gradient step on that point's loss alone; this is called stochastic gradient descent. Compare with the update rule for the non-differentiable perceptron above.
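
A sketch of the stochastic gradient descent update for a sigmoid unit, assuming a squared-error loss L = (y − σ(w·x))²; the loss choice and learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgd_step(w, x, y, lr=0.1):
    """One stochastic gradient step, assuming squared-error loss L = (y - sigmoid(w.x))^2."""
    p = sigmoid(np.dot(w, x))
    grad = -2.0 * (y - p) * p * (1.0 - p) * x   # dL/dw via the chain rule
    return w - lr * grad
```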

Network with a hidden layer. Can learn nonlinear functions (provided each perceptron is nonlinear and differentiable). [Figure: input units feeding hidden units with weights wj, which feed the output through weights w’.]

Network with a hidden layer Hidden layer size and network capacity: Source: http://cs231n.github.io/neural-networks-1/

Training of multi-layer networks. Find network weights to minimize the error between the true and estimated labels of the training examples; update the weights by gradient descent. This requires perceptrons with a differentiable nonlinearity. Sigmoid: g(t) = 1 / (1 + e^(−t)). Rectified linear unit (ReLU): g(t) = max(0, t).

Training of multi-layer networks. Find network weights to minimize the error between the true and estimated labels of the training examples; update the weights by gradient descent. Back-propagation: gradients are computed in the direction from the output to the input layers and combined using the chain rule. Stochastic gradient descent: compute the weight update w.r.t. one training example (or a small batch of examples) at a time; cycle through the training examples in random order over multiple epochs.
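
A minimal sketch of back-propagation for a network with one hidden layer, assuming sigmoid hidden units, a single sigmoid output, and squared-error loss; these choices and the variable names are illustrative assumptions:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def backprop_step(W1, w2, x, y, lr=0.1):
    """One SGD step for a 1-hidden-layer network with squared-error loss."""
    # Forward pass
    h = sigmoid(W1 @ x)               # hidden activations, shape (H,)
    out = sigmoid(np.dot(w2, h))      # scalar network output
    # Backward pass: chain rule from the output back toward the input
    d_out = -2.0 * (y - out) * out * (1.0 - out)   # dL/d(pre-activation of output)
    grad_w2 = d_out * h
    d_h = d_out * w2 * h * (1.0 - h)               # dL/d(pre-activation of hidden units)
    grad_W1 = np.outer(d_h, x)
    # Gradient descent update
    return W1 - lr * grad_W1, w2 - lr * grad_w2
```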

Regularization. It is common to add a penalty on weight magnitudes to the objective function, e.g., an L2 penalty proportional to Σ w². This encourages the network to use all of its inputs “a little” rather than a few inputs “a lot”. Source: http://cs231n.github.io/neural-networks-1/
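
A sketch of how an L2 penalty changes the gradient update, assuming the penalty (λ/2)‖w‖² so that its gradient is simply λw; the coefficient value and names are illustrative:

```python
import numpy as np

def sgd_step_l2(w, grad_loss, lr=0.1, lam=1e-4):
    """Gradient step on loss + (lam/2) * ||w||^2; the penalty adds lam * w to the gradient."""
    return w - lr * (grad_loss + lam * w)
```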

Multi-Layer Network Demo http://playground.tensorflow.org/

Neural networks: Pros and cons. Pros: Flexible and general function approximation framework. Can build extremely powerful models by adding more layers. Cons: Hard to analyze theoretically (e.g., training is prone to local optima). Huge amounts of training data and computing power are required to get good performance. The space of implementation choices is huge (network architectures, parameters).

Support vector machines When the data is linearly separable, there may be more than one separator (hyperplane) Which separator is best?

Support vector machines. Find the hyperplane that maximizes the margin between the positive and negative examples. [Figure: the margin and the support vectors that define it.] C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

SVM parameter learning. Separable data: maximize the margin while classifying all training data correctly. Non-separable data: maximize the margin while minimizing classification mistakes.

SVM parameter learning. [Figure: the margin between the +1 and −1 classes.] Demo: http://cs.stanford.edu/people/karpathy/svmjs/demo
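
A quick way to reproduce the demo's behavior is to fit a soft-margin linear SVM with scikit-learn, assuming scikit-learn is available; the toy data and the value of C below are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data: two classes labeled -1 / +1
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

# C trades off margin size against training mistakes (soft margin)
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors:", clf.support_vectors_)
print("prediction for (4, 4):", clf.predict([[4.0, 4.0]]))
```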

Nonlinear SVMs. General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x). [Figure: a nonlinear map from the input space to a feature space in which the classes become linearly separable.]

Nonlinear SVMs. A dataset in 1D may be linearly separable or non-separable; in the non-separable case, we can map the data to a higher-dimensional space, e.g., x → (x, x²), where it becomes separable. Slide credit: Andrew Moore
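
A small sketch of this 1D example: points that no single threshold can separate become linearly separable after the lift x → (x, x²); the specific data values are illustrative:

```python
import numpy as np

# Non-separable in 1D: the negative class sits between the positives
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([+1, +1, -1, -1, -1, +1, +1])

# Lift to 2D with phi(x) = (x, x^2); now the label is +1 exactly when x^2 is large,
# so a horizontal line in the lifted space separates the classes.
phi = np.stack([x, x**2], axis=1)
print(phi)
print("separable by the rule x^2 > 2:", np.all((phi[:, 1] > 2) == (y > 0)))
```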

The kernel trick General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that K(x , y) = φ(x) · φ(y) (to be valid, the kernel function must satisfy Mercer’s condition)

The kernel trick. Linear SVM decision function: w·x + b = Σ_i α_i y_i (x_i · x) + b, where the x_i are the support vectors and the α_i are the learned weights. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

The kernel trick. Linear SVM decision function: w·x + b = Σ_i α_i y_i (x_i · x) + b. Kernel SVM decision function: Σ_i α_i y_i K(x_i, x) + b. This gives a nonlinear decision boundary in the original feature space. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
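
A sketch of evaluating this kernel decision function directly from stored support vectors, their labels, and learned coefficients; the array names and bias handling are illustrative:

```python
import numpy as np

def kernel_svm_decision(x, support_vectors, labels, alphas, b, kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b; classify by the sign of f(x)."""
    scores = np.array([kernel(sv, x) for sv in support_vectors])
    return np.dot(alphas * labels, scores) + b

def linear_kernel(u, v):
    """Plain inner product: recovers the linear SVM decision function."""
    return np.dot(u, v)
```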

Polynomial kernel: K(x, y) = (c + x · y)^d

Gaussian kernel. Also known as the radial basis function (RBF) kernel: K(x, y) = exp(−‖x − y‖² / (2σ²)). [Figure: K(x, y) as a function of the distance ‖x − y‖.]
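
A sketch of the RBF kernel as code, with σ as an assumed bandwidth parameter; it can be plugged into the kernel decision function sketched earlier:

```python
import numpy as np

def rbf_kernel(u, v, sigma=1.0):
    """Gaussian / RBF kernel: exp(-||u - v||^2 / (2 * sigma^2))."""
    diff = np.asarray(u) - np.asarray(v)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))
```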

Gaussian kernel. [Figure: a nonlinear decision boundary in the original space, with the support vectors (SV’s) marked.]

SVMs: Pros and cons. Pros: Kernel-based framework is very powerful and flexible. Training is convex optimization, so a globally optimal solution can be found. Amenable to theoretical analysis. SVMs work very well in practice, even with very small training sample sizes. Cons: No “direct” multi-class SVM; must combine two-class SVMs (e.g., with one-vs-others). Computation and memory cost (especially for nonlinear SVMs).

Neural networks vs. SVMs (a.k.a. “deep” vs. “shallow” learning)