Presentation transcript: Data Mining 2011 (Volinsky, Columbia University), Topic 9: Advanced Classification - Neural Networks and Support Vector Machines

1 Data Mining 2011, Volinsky, Columbia University
Topic 9: Advanced Classification (Neural Networks and Support Vector Machines)
Credits: Shawndra Hill and Andrew Moore lecture notes

2 Outline
Special topics:
– Neural Networks
– Support Vector Machines

3 Neural Networks: Agenda
– The biological inspiration
– Structure of neural net models
– Using neural net models
– Training neural net models
– Strengths and weaknesses
– An example

4 What the heck are neural nets?
– A data mining algorithm, inspired by biological processes
– A type of non-linear regression/classification
– An ensemble method (although not usually thought of as such)
– A black box!

5 Inspiration from Biology
Information processing inspired by biological nervous systems.
Structure of the nervous system:
– A large number of neurons (information processing units) connected together
– A neuron's response depends on the states of the other neurons it is connected to and on the 'strength' of those connections
– The 'strengths' are learned from experience

6 From Real to Artificial

7 Nodes: A Closer Look
[Diagram of a single node: input values x_1, x_2, ..., x_m enter with weights w_1, w_2, ..., w_m; a summing function combines them with a bias b; an activation function produces the output y.]

8 Nodes: A Closer Look
A node (neuron) is the basic information processing unit of a neural net. It has:
– A set of inputs with weights w_1, w_2, ..., w_m, along with a default input called the bias
– An adder function (linear combiner) that computes the weighted sum of the inputs, v
– An activation function (squashing function) that transforms v, usually non-linearly
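To make the node concrete, here is a minimal sketch (my own illustration, not from the slides) of one node's computation in Python; the logistic activation is just one possible squashing function:

```python
import numpy as np

def node_output(x, w, b):
    """Compute one node's output: weighted sum plus bias, then activation."""
    v = np.dot(w, x) + b             # adder (linear combiner)
    return 1.0 / (1.0 + np.exp(-v))  # sigmoid activation (squashing function)

# Example: three inputs, arbitrary weights and bias
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.2, -0.4])
print(node_output(x, w, b=0.1))
```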

9 A Simple Node: A Perceptron
A simple activation function: a sign (threshold) function.
[Diagram: inputs x_1, ..., x_n with weights w_1, ..., w_n and a bias b feed the sum v; the threshold activation of v gives the output y.]

10 Common Activation Functions
– Step function
– Sigmoid (logistic) function
– Hyperbolic tangent (tanh) function
The s-shape adds non-linearity. Hornik (1989): combining many of these simple functions is sufficient to approximate any continuous non-linear function arbitrarily well over a compact interval.
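For reference, a small sketch (not part of the slides) of the three activation functions named above:

```python
import numpy as np

def step(v):
    """Step (threshold) activation: 1 if v >= 0, else 0."""
    return np.where(v >= 0, 1.0, 0.0)

def sigmoid(v):
    """Sigmoid (logistic) activation: squashes v into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def tanh(v):
    """Hyperbolic tangent activation: squashes v into (-1, 1)."""
    return np.tanh(v)

v = np.linspace(-4, 4, 9)
print(step(v), sigmoid(v), tanh(v), sep="\n")
```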

11 Neural Network: Architecture
Big idea: a combination of simple non-linear models working together to model a complex function.
How many layers? How many nodes? What activation function?
– Magic
– Luckily, defaults do well
[Diagram: input layer, hidden layer(s), output layer.]

12 Neural Networks: The Model
The model has two components:
– A particular architecture: the number of hidden layers; the number of nodes in the input, output and hidden layers; the specification of the activation function(s)
– The associated set of weights
Weights and complexity are "learned" from the data:
– Supervised learning, applied iteratively
– Out-of-sample methods; cross-validation

13 Fitting a Neural Net: Feed Forward
– Supply attribute values at the input nodes
– Obtain predictions from the output node(s)
Predicting classes:
– Two classes: a single output node with a threshold
– Multiple classes: use multiple outputs, one for each class; the predicted class is the output node with the highest value
Multi-class problems are one of the main uses of neural nets! (A small sketch of this forward pass follows below.)
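Below is a minimal sketch of the forward pass for a one-hidden-layer network (my own illustration; the weight matrices W1, W2 and biases b1, b2 stand in for weights that would normally come from training):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def feed_forward(x, W1, b1, W2, b2):
    """One hidden layer: inputs -> hidden activations -> output scores."""
    hidden = sigmoid(W1 @ x + b1)     # hidden-layer activations
    return sigmoid(W2 @ hidden + b2)  # one output value per class

def predict_class(x, W1, b1, W2, b2):
    """Multi-class rule from the slide: pick the output node with the highest value."""
    outputs = feed_forward(x, W1, b1, W2, b2)
    return int(np.argmax(outputs))

# Toy example: 4 inputs, 3 hidden nodes, 2 output classes (random stand-in weights)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
print(predict_class(np.array([1.0, 0.5, -0.2, 0.0]), W1, b1, W2, b2))
```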

14 A Simple NN: Regression
A one-node neural network:
– Called a 'perceptron'
– Use the identity function as the activation function
– What's the output? The weighted sum of the inputs
Logistic regression just changes the activation function to the logistic function.
[Diagram: inputs x_1, ..., x_n with weights w_1, ..., w_n and a bias b; the sum v passes through the activation to give y.]

15 Training a NN: What does it learn?
It fits/learns the weights that best translate inputs into outputs, given its architecture.
Hidden units can be thought of as learning higher-order regularities or features of the inputs that can be used to predict the outputs.
This architecture is called a "multi-layer perceptron".

16 Perceptron Training Rule
Perceptron = adder + threshold
1. Start with a random set of small weights.
2. Run an example through the network.
3. Change each weight by an amount proportional to the difference between the desired output and the actual output:
   ΔW_i = η (D - Y) I_i
   where η is the learning rate (step size), D is the desired output, Y is the actual output, and I_i is the i-th input.
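A small sketch of this update rule in code (my own illustration, assuming 0/1 targets and a threshold output), trained on a linearly separable toy problem:

```python
import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=20):
    """Perceptron training rule: w_i <- w_i + eta * (D - Y) * I_i."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])   # small random weights
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, d):
            y = 1.0 if np.dot(w, x) + b >= 0 else 0.0  # threshold output
            w += eta * (target - y) * x                # weight update
            b += eta * (target - y)                    # bias update
    return w, b

# Toy data: class 1 when x1 + x2 > 1, else class 0 (linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [0, 2]], dtype=float)
d = np.array([0, 0, 0, 1, 1, 1], dtype=float)
w, b = train_perceptron(X, d)
print(w, b)
```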

17 Training NNs: Back Propagation
How to train a neural net (find the optimal weights):
– Present a training sample to the neural network.
– Calculate the error in each output neuron.
– For each neuron, calculate what the output should have been, and a scaling factor: how much lower or higher the output must be adjusted to match the desired output. This is the local error.
– Adjust the weights of each neuron to lower the local error.
– Assign "blame" for the local error to neurons at the previous level, giving greater responsibility to neurons connected by stronger weights.
– Repeat on the neurons at the previous level, using each one's "blame" as its error. This 'propagates' the error backward.
The sequence of forward and backward passes is called 'back propagation'. (A compact numeric sketch follows below.)
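The following compact sketch (my own illustration, not the slides' code) implements this forward/backward sequence for one hidden layer with sigmoid units and squared error; XOR is the classic toy target:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_backprop(X, T, n_hidden=4, eta=2.0, epochs=10000, seed=0):
    """Back propagation for one hidden layer, sigmoid units, squared error."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(n_hidden, X.shape[1]))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(T.shape[1], n_hidden))
    b2 = np.zeros(T.shape[1])
    for _ in range(epochs):
        # Forward pass
        H = sigmoid(X @ W1.T + b1)       # hidden activations
        Y = sigmoid(H @ W2.T + b2)       # network outputs
        # Backward pass: output "blame", then hidden "blame" via the weights
        delta_out = (Y - T) * Y * (1 - Y)
        delta_hid = (delta_out @ W2) * H * (1 - H)
        # Gradient-descent weight updates (averaged over the batch)
        W2 -= eta * delta_out.T @ H / len(X)
        b2 -= eta * delta_out.mean(axis=0)
        W1 -= eta * delta_hid.T @ X / len(X)
        b1 -= eta * delta_hid.mean(axis=0)
    return W1, b1, W2, b2

# XOR: not linearly separable, so a hidden layer is needed
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1, W2, b2 = train_backprop(X, T)
# With enough epochs the outputs should move toward the XOR targets [0, 1, 1, 0]
print(np.round(sigmoid(sigmoid(X @ W1.T + b1) @ W2.T + b2), 2))
```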

18 Training NNs: How to do it
A "gradient descent" algorithm is typically used to fit the weights in back propagation.
You can imagine a surface in an n-dimensional space such that:
– Each dimension is a weight
– Each point in this space is a particular combination of weights
– The height of the "surface" at each point is the output error for that combination of weights
– You want to minimize the error, i.e. find the "valleys" on this surface
– Note the potential for 'local minima'

19 Training NNs: Gradient Descent
Find the gradient in each direction; moving downhill along these gradients is the move of 'steepest descent'.
Note the potential problem with 'local minima'. (A tiny numeric sketch follows below.)
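A tiny numeric sketch of gradient descent (my own toy example) on a two-weight error surface; this quadratic has a single valley, so local minima are not an issue here:

```python
import numpy as np

def error(w):
    """Toy error surface with its valley at w = (3, -1)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def gradient(w):
    """Analytical gradient of the toy error surface."""
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])

w = np.array([0.0, 0.0])      # starting weights
eta = 0.1                     # learning rate / step size
for step in range(100):
    w -= eta * gradient(w)    # move in the direction of steepest descent
print(w, error(w))            # w approaches (3, -1), error approaches 0
```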

20 Gradient Descent
The direction of steepest descent can be found mathematically or via computational estimation. (Via A. Moore)

21 Neural Nets: Strengths
– Can model very complex functions very accurately: non-linearity is built into the model
– Handles noisy data quite well
– Provides fast predictions
– Good for multiple-category problems: many-class classification, image detection, speech recognition, financial models
– Good for multiple-stage problems

22 Neural Nets: Weaknesses
– A black box: hard to explain or gain intuition from
– For complex problems, training time can be quite high
– Many, many training parameters: layers, neurons per layer, output layers, bias, training algorithms, learning rate
– Highly prone to overfitting: the balance between complexity and parsimony can be learned through cross-validation
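As a hedged illustration of that last point (not from the slides), scikit-learn's MLPClassifier can be compared across architectures with cross-validation; the data set and layer sizes below are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Compare a small and a large hidden layer; the larger net is more flexible
# but also more prone to overfitting, which cross-validation can reveal.
for hidden in [(5,), (100,)]:
    net = MLPClassifier(hidden_layer_sizes=hidden, max_iter=2000, random_state=0)
    scores = cross_val_score(net, X, y, cv=5)
    print(hidden, round(scores.mean(), 3))
```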

23 Example: Face Detection
Architecture of the complete system: another neural net estimates the orientation of the face, which is then rectified; the system searches over scales to find bigger/smaller faces.
Figure from "Rotation invariant neural network-based face detection," H. A. Rowley, S. Baluja and T. Kanade, Proc. Computer Vision and Pattern Recognition, 1998. Copyright 1998, IEEE.

24 Rowley, Baluja and Kanade (1998)
– Image size: 20 x 20
– Input layer: 400 units
– Hidden layer: 15 units

25 Neural Nets: Face Detection
Goal: detect "face or no face"

26 Face Detection: Results

27 Face Detection Results: A Few Misses

28 Neural Nets
Face detection in action (link in the original slides).
For more:
– See Hastie et al., Chapter 11
– R packages: basic: nnet; better: AMORE

29 Support Vector Machines

30 SVM
A classification technique.
Start with a BIG assumption:
– The classes can be separated linearly

31 Linear Classifiers
x -> f(x, w, b) -> y_est, with f(x, w, b) = sign(w·x - b); "+1" and "-1" denote the two classes.
How would you classify this data?
[Slides 31-34 repeat this question over figures showing the same scatter of +1 and -1 points with different candidate separating lines.]

35 Linear Classifiers
f(x, w, b) = sign(w·x - b). Any of these would be fine... but which is best? (A small sketch of such a linear classifier follows below.)
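A minimal sketch (my own, not from the slides) of what one such fixed linear classifier computes; the weight vector and bias are arbitrary:

```python
import numpy as np

def linear_classify(X, w, b):
    """f(x, w, b) = sign(w·x - b): returns +1 or -1 for each row of X."""
    return np.sign(X @ w - b)

# Arbitrary boundary: x1 + x2 = 1, i.e. w = (1, 1), b = 1
w, b = np.array([1.0, 1.0]), 1.0
X = np.array([[2.0, 2.0], [0.0, 0.0], [0.2, 0.3]])
print(linear_classify(X, w, b))   # [+1, -1, -1]
```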

36 Classifier Margin
Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint.

37 Maximum Margin
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called an LSVM (linear SVM).

38 Maximum Margin
Support vectors are the datapoints that the margin pushes up against.
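As a hedged illustration (not from the slides), scikit-learn's SVC with a linear kernel exposes exactly these quantities; the toy data and the large C (to approximate a hard margin) are my choices:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable blobs of points
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

# A large C approximates the hard-margin (maximum margin) linear SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.support_vectors_)             # the datapoints the margin pushes against
print(clf.coef_[0], clf.intercept_[0])  # the learned weight vector and offset
```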

39 Why Maximum Margin?
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. LOOCV is easy, since the model is immune to removal of any non-support-vector datapoints.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very, very well.

40 Specifying a Line and Margin
How do we represent this mathematically, in m input dimensions?
[Diagram: the classifier boundary with a parallel plus-plane and minus-plane; a "Predict Class = +1" zone and a "Predict Class = -1" zone.]

41 Specifying a Line and Margin
Plus-plane  = { x : w·x + b = +1 }
Minus-plane = { x : w·x + b = -1 }
Classify as:
– +1 if w·x + b >= 1
– -1 if w·x + b <= -1
– Universe explodes if -1 < w·x + b < 1
[Diagram: the three parallel lines w·x + b = +1, 0, -1.]

42 Computing the Margin Width
Plus-plane  = { x : w·x + b = +1 }
Minus-plane = { x : w·x + b = -1 }
Claim: the vector w is perpendicular to the plus-plane.
How do we compute the margin width M in terms of w and b?

43 Computing the Margin Width
The vector w is perpendicular to the plus-plane.
Let x⁻ be any point on the minus-plane, and let x⁺ be the closest plus-plane point to x⁻.
(x⁻ and x⁺ are any locations in R^m, not necessarily datapoints.)

44 Computing the Margin Width
Claim: x⁺ = x⁻ + λw for some value of λ.

45 Computing the Margin Width
Claim: x⁺ = x⁻ + λw for some value of λ. Why?
The line from x⁻ to x⁺ is perpendicular to the planes, so to get from x⁻ to x⁺ you travel some distance in the direction of w.

46 Computing the Margin Width
What we know:
– w·x⁺ + b = +1
– w·x⁻ + b = -1
– x⁺ = x⁻ + λw
– |x⁺ - x⁻| = M
It's now easy to get M in terms of w and b.

47 Computing the Margin Width
w·(x⁻ + λw) + b = 1
=> w·x⁻ + b + λ(w·w) = 1
=> -1 + λ(w·w) = 1
=> λ = 2 / (w·w)

48 Computing the Margin Width
M = |x⁺ - x⁻| = |λw| = λ|w| = λ√(w·w) = 2√(w·w) / (w·w) = 2 / √(w·w)

49 Learning the Maximum Margin Classifier
M = margin width = 2 / √(w·w)
Given a guess of w and b we can:
– Compute whether all data points are in the correct half-planes
– Compute the width of the margin
So search the space of w's and b's to find the widest margin that matches all the datapoints. (A small sketch of the fitted quantities follows below.)
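A hedged sketch (not from the slides) tying the formula to a fitted model: after fitting a linear SVM in scikit-learn, the margin width can be recovered from the learned w as M = 2/√(w·w):

```python
import numpy as np
from sklearn.svm import SVC

# Same toy blobs as before; a large C approximates the hard-margin solution
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                   # learned weight vector
M = 2.0 / np.sqrt(np.dot(w, w))    # margin width M = 2 / sqrt(w·w)
print(w, M)
```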

50 Uh-oh!
[Figure: +1 and -1 points that are not linearly separable.]
This is going to be a problem! What should we do?

51 Uh-oh!
This is going to be a problem! What should we do?
Idea 1: Find the minimum w·w, while also minimizing the number of training-set errors.
Problemette: two things to minimize makes for an ill-defined optimization.

52 Uh-oh!
Idea 1.1: Minimize w·w + C · (#training errors), where C is a tradeoff parameter. And: use a trick.
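As a hedged aside (not from the slides): scikit-learn's soft-margin SVM penalizes margin violations via the hinge loss rather than a literal error count, but its C parameter plays the same tradeoff role. The data below are arbitrary:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs: not perfectly separable
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Smaller C tolerates more violations, which generally means more support vectors
    print(C, clf.n_support_.sum(), round(clf.score(X, y), 3))
```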

53 Suppose we're in 1 dimension
What would SVMs do with this data? [Figure: points on a line, with x = 0 marked.]

54 Suppose we're in 1 dimension
Not a big surprise. [Figure: the 1-d maximum margin classifier, with the positive "plane" and negative "plane" marked.]

55 Harder 1-dimensional dataset
What can be done about this? [Figure: a 1-d dataset that no single threshold can separate.]

56 Harder 1-dimensional dataset
Embed the data in a higher-dimensional space.

57 Harder 1-dimensional dataset
[Figure: the same data after embedding in a higher-dimensional space, where a linear separator now exists.]
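A minimal sketch of the embedding idea (my own illustration; the quadratic map x -> (x, x^2) is one simple choice and not necessarily the one used in the original figure):

```python
import numpy as np
from sklearn.svm import SVC

# 1-d data where the negatives sit between the positives: no threshold separates them
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

# Embed each point in 2-d as (x, x^2); the classes become linearly separable
Z = np.column_stack([x, x ** 2])
clf = SVC(kernel="linear", C=1e6).fit(Z, y)
print(clf.score(Z, y))   # 1.0: a line in the embedded space separates the classes
```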

58 SVM Kernel Functions
Embedding the data in a higher-dimensional space where it is separable is called the "kernel trick".
Beyond polynomials there are other very high-dimensional basis functions that can be made practical by finding the right kernel function:
– Radial-basis-style kernel function: K(x, x') = exp(-|x - x'|² / (2σ²))
– Neural-net-style kernel function: K(x, x') = tanh(κ x·x' - δ)
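A hedged sketch of the kernel trick in practice (not from the slides): scikit-learn's SVC fits a radial-basis-function kernel directly, with no explicit embedding; its gamma corresponds to 1/(2σ²) in the formula above:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-d space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly embeds the data in a very high-dimensional space
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print(round(clf.score(X, y), 3))   # near 1.0 on this easy toy problem
```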

59 SVM Performance
The trick: find linear boundaries in an enlarged space, which translate to non-linear boundaries in the original space.
Magic: for more details, see Hastie et al., Section 12.3.
Anecdotally they work very, very well indeed.
Example: they are currently the best-known classifier on a well-studied hand-written-character recognition benchmark.
There is a lot of excitement and religious fervor about SVMs. Despite this, some practitioners are a little skeptical.

61 Doing Multi-Class Classification
SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). What can be done?
Answer: with output arity N, learn N SVMs:
– SVM 1 learns "Output == 1" vs "Output != 1"
– SVM 2 learns "Output == 2" vs "Output != 2"
– ...
– SVM N learns "Output == N" vs "Output != N"
Then, to predict the output for a new input, predict with each SVM and see which one puts the prediction furthest into the positive region. (A small sketch follows below.)
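A hedged sketch of this one-vs-rest scheme (my own illustration; scikit-learn also provides it built in, but writing it out makes the "furthest into the positive region" rule explicit):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Three-class toy problem
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# Train one binary SVM per class: "class k" vs "not class k"
machines = [SVC(kernel="linear").fit(X, (y == k).astype(int)) for k in range(3)]

def predict(x):
    """Pick the class whose SVM pushes x furthest into its positive region."""
    scores = [m.decision_function(x.reshape(1, -1))[0] for m in machines]
    return int(np.argmax(scores))

print(predict(X[0]), y[0])
```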

62 References
– Hastie et al., Chapter 11 (neural networks); Chapter 12 (SVMs)
– Andrew Moore's lecture notes on neural nets and on SVMs
– Wikipedia has very good pages on both topics
– An excellent tutorial on VC dimension and support vector machines by C. Burges: "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, 2(2):121-167, 1998
– The SVM bible: Vladimir Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998

