Data Mining 2011 - Volinsky - Columbia University
Topic 9: Advanced Classification - Neural Networks, Support Vector Machines
Credits: Shawndra Hill, Andrew Moore lecture notes



Outline
Special Topics:
–Neural Networks
–Support Vector Machines

Neural Networks Agenda
–The biological inspiration
–Structure of neural net models
–Using neural net models
–Training neural net models
–Strengths and weaknesses
–An example

What the heck are neural nets?
–A data mining algorithm, inspired by biological processes
–A type of non-linear regression/classification
–An ensemble method, although not usually thought of as such
–A black box!

Inspiration from Biology
Information processing inspired by biological nervous systems. Structure of the nervous system:
–A large number of neurons (information processing units) connected together
–A neuron’s response depends on the states of the other neurons it is connected to and on the ‘strength’ of those connections
–The ‘strengths’ are learned based on experience

From Real to Artificial

Nodes: A Closer Look
[Diagram: input values x_1, …, x_m with weights w_1, …, w_m, a summing function with bias b, an activation function, and the output y]

Nodes: A Closer Look
A node (neuron) is the basic information processing unit of a neural net. It has:
–A set of inputs with weights w_1, w_2, …, w_m, along with a default input called the bias b
–An adder function (linear combiner) that computes the weighted sum of the inputs, v
–An activation function (squashing function) that transforms v, usually non-linearly, into the output
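As a concrete illustration, here is a minimal R sketch of a single node; the function name and the example numbers are illustrative, not from the slides:

```r
# A single artificial node: weighted sum of inputs plus bias,
# passed through an activation function (here the logistic sigmoid).
node_output <- function(x, w, b, activation = function(v) 1 / (1 + exp(-v))) {
  v <- sum(w * x) + b      # adder / linear combiner
  activation(v)            # squashing function gives the output y
}

node_output(x = c(0.5, -1, 2), w = c(0.1, 0.4, -0.3), b = 0.2)
```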

A Simple Node: A Perceptron
A simple activation function: a sign threshold.
[Diagram: inputs x_1, …, x_n with weights w_1, …, w_n and bias b feed the weighted sum v, and the activation φ(v) produces the output y]

Common Activation Functions
–Step function
–Sigmoid (logistic) function
–Hyperbolic tangent (tanh) function
The s-shape adds non-linearity. Hornik (1989): combining many of these simple functions is sufficient to approximate any continuous function arbitrarily well over a compact interval.
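For reference, a small R sketch of these three activations (names are illustrative):

```r
# The three activation functions named above.
step_fn <- function(v) ifelse(v >= 0, 1, 0)     # hard threshold
sigmoid <- function(v) 1 / (1 + exp(-v))        # logistic, output in (0, 1)
tanh_fn <- function(v) tanh(v)                  # hyperbolic tangent, output in (-1, 1)

v <- seq(-4, 4, by = 0.1)
plot(v, sigmoid(v), type = "l", ylim = c(-1, 1), ylab = "activation")
lines(v, tanh_fn(v), lty = 2)
lines(v, step_fn(v), lty = 3)
```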

Neural Network: Architecture
Big idea: a combination of simple non-linear models working together to model a complex function.
How many layers? How many nodes? What function?
–Magic
–Luckily, defaults do well
[Diagram: input layer, hidden layer(s), output layer]

Neural Networks: The Model
The model has two components:
–A particular architecture: the number of hidden layers; the number of nodes in the input, output and hidden layers; the specification of the activation function(s)
–The associated set of weights
Weights and complexity are “learned” from the data:
–Supervised learning, applied iteratively
–Out-of-sample methods; cross-validation

Fitting a Neural Net: Feed Forward
–Supply attribute values at the input nodes
–Obtain predictions from the output node(s)
Predicting classes:
–Two classes: a single output node with a threshold
–Multiple classes: use multiple outputs, one for each class; the predicted class is the output node with the highest value
Multi-class problems are one of the main uses of NNs!

A Simple NN: Regression
A one-node neural network:
–Called a ‘perceptron’
–Uses the identity function as the activation function
–What’s the output? The weighted sum of the inputs
Logistic regression just changes the activation function to the logistic function.
[Diagram: inputs x_1, …, x_n, weights w_1, …, w_n, bias b, weighted sum v, activation φ(v), output y]
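To make the correspondence concrete, the logistic-regression special case can be fit directly in R with glm(); the data frame `train`, test set `test`, and column names here are assumptions for illustration:

```r
# A one-node "network" with a logistic activation is just logistic regression.
# glm() fits the same weights (coefficients) and bias (intercept) by maximum likelihood.
fit <- glm(y ~ x1 + x2, family = binomial, data = train)
coef(fit)                                         # bias (intercept) and weights
predict(fit, newdata = test, type = "response")   # outputs in (0, 1)
```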

Training a NN: What does it learn?
It fits/learns the weights that best translate inputs into outputs given its architecture. Hidden units can be thought of as learning higher-order regularities, or features, of the inputs that can be used to predict the outputs. A network with hidden layers is called a “multi-layer perceptron.”

Perceptron Training Rule
Perceptron = adder + threshold.
1. Start with a random set of small weights.
2. Run an example through the perceptron.
3. Change each weight by an amount proportional to the difference between the desired output and the actual output:
ΔW_i = η (D - Y) I_i
where η is the learning rate (step size), D the desired output, Y the actual output, and I_i the i-th input. (A sketch of this rule appears below.)
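A minimal R sketch of this training rule, under the assumption that the inputs are given as a matrix X with desired 0/1 outputs d (all names are illustrative):

```r
# Perceptron training rule: adjust weights by eta * (D - Y) * input.
# X: matrix of inputs (one row per example), d: desired outputs coded as 0/1.
train_perceptron <- function(X, d, eta = 0.1, epochs = 50) {
  w <- runif(ncol(X), -0.05, 0.05)   # 1. small random weights
  b <- 0
  for (e in seq_len(epochs)) {
    for (i in seq_len(nrow(X))) {
      y <- as.numeric(sum(w * X[i, ]) + b >= 0)   # 2. adder + threshold
      w <- w + eta * (d[i] - y) * X[i, ]          # 3. Delta W = eta * (D - Y) * I
      b <- b + eta * (d[i] - y)
    }
  }
  list(weights = w, bias = b)
}
```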

Training NNs: Back Propagation
How to train a neural net (find the optimal weights):
–Present a training sample to the neural network.
–Calculate the error in each output neuron.
–For each neuron, calculate what the output should have been, and a scaling factor: how much lower or higher the output must be adjusted to match the desired output. This is the local error.
–Adjust the weights of each neuron to lower the local error.
–Assign “blame” for the local error to neurons at the previous level, giving greater responsibility to neurons connected by stronger weights.
–Repeat on the neurons at the previous level, using each one’s “blame” as its error. This propagates the error backward.
The sequence of forward and backward passes is called ‘back propagation’.

Training NNs: How to do it
A “gradient descent” algorithm is typically used to fit the weights during back propagation. You can imagine a surface in an n-dimensional space such that:
–Each dimension is a weight
–Each point in this space is a particular combination of weights
–The height of the “surface” at each point is the output error that corresponds to that combination of weights
–You want to minimize error, i.e. find the “valleys” on this surface
–Note the potential for ‘local minima’

Training NNs: Gradient Descent
Find the gradient in each direction; moving against these gradients gives the step of ‘steepest descent’. Note the potential problem with ‘local minima’.

Gradient Descent
The direction of steepest descent can be found mathematically or via computational estimation. [Figure via A. Moore]
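A minimal sketch of gradient descent on a simple two-weight error surface (entirely illustrative; in a real network the gradients come from back propagation rather than a hand-written formula):

```r
# Gradient descent on E(w) = (w1 - 3)^2 + (w2 + 1)^2, whose minimum is at (3, -1).
grad <- function(w) c(2 * (w[1] - 3), 2 * (w[2] + 1))  # analytic gradient

w   <- c(0, 0)      # starting weights
eta <- 0.1          # learning rate / step size
for (step in 1:100) {
  w <- w - eta * grad(w)    # move against the gradient: steepest descent
}
w   # close to c(3, -1)
```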

Neural Nets: Strengths
–Can model very complex functions very accurately; non-linearity is built into the model
–Handles noisy data quite well
–Provides fast predictions
–Good for multiple-category problems: many-class classification, image detection, speech recognition, financial models
–Good for multiple-stage problems

Neural Nets: Weaknesses
–A black box: hard to explain or gain intuition from
–For complex problems, training time can be quite high
–Many, many training parameters: layers, neurons per layer, output layers, bias, training algorithms, learning rate
–Highly prone to overfitting; the balance between complexity and parsimony can be tuned through cross-validation

Example: Face Detection
Architecture of the complete system: another neural net estimates the orientation of the face, which is then rectified; the system searches over scales to find bigger/smaller faces. Figure from “Rotation invariant neural-network based face detection,” H. A. Rowley, S. Baluja and T. Kanade, Proc. Computer Vision and Pattern Recognition, 1998, copyright 1998 IEEE.

Rowley, Baluja and Kanade’s (1998) network
–Image size: 20 x 20
–Input layer: 400 units
–Hidden layer: 15 units

Neural Nets: Face Detection
Goal: detect “face or no face”

Face Detection: Results

Face Detection Results: A Few Misses

Neural Nets
Face detection in action (link in the original slides).
For more:
–See Hastie et al., Chapter 11
R packages:
–Basic: nnet
–Better: AMORE
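A minimal usage sketch of the nnet package mentioned above; the data frame `train`, test set `test`, and the factor column `class` are assumptions for illustration:

```r
library(nnet)   # basic single-hidden-layer feed-forward networks

set.seed(1)
fit <- nnet(class ~ ., data = train,   # 'class' is a factor; 'train' is assumed
            size  = 5,                 # nodes in the hidden layer
            decay = 5e-4,              # weight decay, helps control overfitting
            maxit = 200)               # optimizer iterations
pred <- predict(fit, newdata = test, type = "class")
table(pred, test$class)                # confusion matrix
```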

Support Vector Machines

SVM
A classification technique. Start with a BIG assumption:
–The classes can be separated linearly

Linear Classifiers
A linear classifier maps an input x to an estimated label y_est (+1 or -1) via f(x, w, b) = sign(w · x - b). Given the two classes of points in the figure, how would you classify this data?
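A tiny R sketch of this decision rule; the weights and inputs below are arbitrary illustrative values:

```r
# Linear classifier: f(x, w, b) = sign(w . x - b), returning +1 or -1.
linear_classify <- function(x, w, b) {
  ifelse(sum(w * x) - b >= 0, +1, -1)
}

w <- c(1, 2); b <- 0.5
linear_classify(c(0.3, 0.4), w, b)   # +1
linear_classify(c(-1, -1), w, b)     # -1
```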


Linear Classifiers
Many different lines separate the data perfectly. Any of these would be fine... but which is best?

Classifier Margin
Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint.

Maximum Margin
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).

Maximum Margin
Support vectors are the datapoints that the margin pushes up against.

Why Maximum Margin?
1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. Leave-one-out cross-validation (LOOCV) is easy, since the model is immune to removal of any non-support-vector datapoint.
4. There’s some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very, very well.

Specifying a Line and Margin
How do we represent this mathematically, in m input dimensions? [Diagram: the classifier boundary with a plus-plane and a minus-plane on either side, separating a “Predict Class = +1” zone from a “Predict Class = -1” zone]

Specifying a Line and Margin
Plus-plane = { x : w · x + b = +1 }
Minus-plane = { x : w · x + b = -1 }
Classify as:
+1 if w · x + b >= 1
-1 if w · x + b <= -1
Universe explodes if -1 < w · x + b < 1
[Diagram: the parallel lines w · x + b = 1, w · x + b = 0, and w · x + b = -1]

Computing the Margin Width
Plus-plane = { x : w · x + b = +1 }
Minus-plane = { x : w · x + b = -1 }
Claim: the vector w is perpendicular to the plus-plane.
Let M = the margin width. How do we compute M in terms of w and b?

Computing the Margin Width
The vector w is perpendicular to the plus-plane. Let x⁻ be any point on the minus-plane, and let x⁺ be the closest plus-plane point to x⁻. (These are any locations in R^m, not necessarily datapoints.)

Computing the Margin Width
Claim: x⁺ = x⁻ + λw for some value of λ.

Computing the Margin Width
Why? The line from x⁻ to x⁺ is perpendicular to the planes, so to get from x⁻ to x⁺ you travel some distance λ in the direction of w.

Computing the Margin Width
What we know:
w · x⁺ + b = +1
w · x⁻ + b = -1
x⁺ = x⁻ + λw
|x⁺ - x⁻| = M
It’s now easy to get M in terms of w and b.

Computing the Margin Width
Substituting:
w · (x⁻ + λw) + b = 1
⇒ w · x⁻ + b + λ (w · w) = 1
⇒ -1 + λ (w · w) = 1
⇒ λ = 2 / (w · w)

Computing the Margin Width
M = |x⁺ - x⁻| = |λw| = λ |w| = λ √(w · w) = 2 √(w · w) / (w · w) = 2 / √(w · w)

Learning the Maximum Margin Classifier
Given a guess of w and b we can:
–Compute whether all data points are in the correct half-planes
–Compute the width of the margin, M = 2 / √(w · w)
So search the space of w’s and b’s to find the widest margin that matches all the datapoints.
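A tiny R sketch of those two computations (the weights, bias, and data layout are illustrative):

```r
# Margin width for a given w, and a check that every point is on its correct side.
margin_width <- function(w) 2 / sqrt(sum(w * w))   # M = 2 / sqrt(w.w)

all_correct <- function(X, y, w, b) {              # y coded as +1 / -1
  all(y * (X %*% w + b) >= 1)                      # w.x + b >= 1 for +1, <= -1 for -1
}

w <- c(2, -1); b <- 0.5
margin_width(w)
X <- rbind(c(2, 0), c(-2, 0)); y <- c(+1, -1)
all_correct(X, y, w, b)
```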

Uh-oh!
[Figure: the +1 and -1 classes are not linearly separable.] This is going to be a problem! What should we do?

Uh-oh!
Idea 1: find the minimum w · w while also minimizing the number of training-set errors.
Problemette: two things to minimize makes for an ill-defined optimization.

Uh-oh!
Idea 1.1: minimize w · w + C (#training errors), where C is a tradeoff parameter. And: use a trick.
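For reference, the standard soft-margin formulation replaces the raw error count with slack variables ξ_i; this is the usual textbook form rather than anything spelled out on the slides:

```latex
\min_{w, b, \xi} \; \tfrac{1}{2}\, w \cdot w + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i (w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .
```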

Suppose we’re in 1 dimension
What would SVMs do with this data? [Figure: points of both classes on a number line]

Suppose we’re in 1 dimension
Not a big surprise: the boundary is a single threshold point, with a positive “plane” on one side and a negative “plane” on the other.

Harder 1-dimensional dataset
Here no single threshold separates the classes. What can be done about this?

Harder 1-dimensional dataset
Embed the data in a higher-dimensional space, e.g. map each point x_k to (x_k, x_k²).

Harder 1-dimensional dataset
[Figure: after the embedding, the two classes are linearly separable in the higher-dimensional space.]

SVM Kernel Functions
Embedding the data in a higher-dimensional space where it is separable is called the “kernel trick.” Beyond polynomials there are other very high-dimensional basis functions that can be made practical by finding the right kernel function:
–Radial-basis-style kernel function: K(a, b) = exp(-|a - b|² / (2σ²))
–Neural-net-style kernel function: K(a, b) = tanh(κ a · b - δ)
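A minimal usage sketch with the e1071 package, an R interface to libsvm; the package choice and the data/column names are assumptions, not from the slides:

```r
library(e1071)   # svm() wraps libsvm

fit <- svm(class ~ ., data = train,   # 'class' is a factor; 'train' is assumed
           kernel = "radial",         # radial-basis (RBF) kernel
           cost   = 1,                # C: tradeoff between margin width and errors
           gamma  = 0.5)              # RBF width parameter
pred <- predict(fit, newdata = test)
table(pred, test$class)
```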

SVM Performance
Trick: find linear boundaries in an enlarged space; these translate to nonlinear boundaries in the original space. Magic: for more details, see Hastie et al., Section 12.3.
Anecdotally they work very, very well indeed. Example: they are currently the best-known classifier on a well-studied hand-written-character recognition benchmark. There is a lot of excitement and religious fervor about SVMs; despite this, some practitioners are a little skeptical.


Doing Multi-class Classification
SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). What can be done? Answer: with output arity N, learn N SVMs:
–SVM 1 learns “Output == 1” vs “Output != 1”
–SVM 2 learns “Output == 2” vs “Output != 2”
–…
–SVM N learns “Output == N” vs “Output != N”
Then, to predict the output for a new input, predict with each SVM and find the one that puts the prediction furthest into the positive region (the one-vs-rest scheme; a sketch follows below).
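To make the one-vs-rest recipe concrete, here is a hedged R sketch that uses logistic regression as a stand-in binary scorer, since its signed linear score plays the same role as an SVM's distance into the positive region; the function name, data frames, and column handling are illustrative only:

```r
# One-vs-rest: one binary scorer per class, pick the class with the largest score.
# 'train' and 'test' are data frames; 'target' names a factor column in 'train'.
one_vs_rest <- function(train, test, target) {
  classes <- levels(train[[target]])
  scores <- sapply(classes, function(k) {
    d      <- train
    d$is_k <- as.numeric(d[[target]] == k)          # class k vs the rest
    f      <- reformulate(setdiff(names(train), target), response = "is_k")
    fit    <- glm(f, family = binomial, data = d)   # stand-in for a binary SVM
    predict(fit, newdata = test, type = "link")     # signed score for class k
  })
  classes[max.col(scores)]   # class whose scorer is furthest into the positive region
}
```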

References
–Hastie et al., The Elements of Statistical Learning: Chapter 11 (NN); Chapter 12 (SVM)
–Andrew Moore’s lecture notes on neural nets and on SVMs
–Wikipedia has very good pages on both topics
–An excellent tutorial on VC dimension and support vector machines: C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, 2(2), 1998
–The SVM bible: Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998