EE-M /7: IS L7&8 1/24, v3.0
Lectures 7&8: Non-linear Classification and Regression using Layered Perceptrons
Dr Martin Brown, Room: E1k
[Figure: decision boundary $m(x, \theta) = 0$ in the (x1, x2) input space]
[Slide 2/24] Lectures 7&8: Outline
1. What approaches are possible for non-linear classification and regression problems?
2. Non-linear polynomial networks
   1. Potential and problems of using flexible models
3. Sigmoidal-type non-linear transformations
   1. Modelling capabilities
   2. Regression and classification interpretation
   3. Parameter optimization using gradient descent
4. Non-linear logical functions and layered Perceptron nets
This leads on to Multi-Layer Perceptron (MLP) models next week.
[Slide 3/24] Lectures 7&8: Resources
These slides are largely self-contained, but extra background material can be found in:
- T Mitchell, Machine Learning, McGraw Hill, 1997
- D Michie, DJ Spiegelhalter and CC Taylor, Machine Learning, Neural and Statistical Classification, 1994
In addition, there are many on-line sources for multi-layer perceptrons (MLPs) and error back propagation (EBP); just search on Google.
Advanced text:
- D MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, 2003
[Slide 4/24] Non-Linear Regression and Classification
Most real-world modelling problems are not linear. A task is non-linear if it cannot be represented adequately using a linear model:
- Classification: the number of classification errors is too large
- Regression: the noise variance left by the linear fit is too large
Using non-linear models/relationships may help to approximate f().
[Figure: data in the (x1, x2) input space]
[Slide 5/24] Non-Linear Classification
- Consider the following 2-class classification problem
- Always compare to the prior error rate
- Exercise: What are the error rates for the prior, optimal linear and non-linear models?
- The type of non-linear function is important
Data is generated by (with classification errors):
[Figure: two-class data in the (x1, x2) input space]
[Slide 6/24] Non-linear Regression
- Need to balance model complexity against data accuracy
- How much of the signal is reproducible?
[Figure: four panels plotting y and ŷ against x for a bias model, a linear model, a non-linear model and a non-linear interpolation model]
[Slide 7/24] Polynomial Non-Linear Models
A simple and convenient way to extend linear models is to consider polynomial expansions, such as the quadratic (terms, in order: bias, linear, quadratic, bilinear):
$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2$
- Expansion to any order is possible (e.g. cubic), as is using only a subset of terms
- A linear model is produced when: $\theta_3 = \theta_4 = \theta_5 = 0$
- A polynomial model is linear in its parameters
- Can approximate any continuous function arbitrarily closely on a bounded region if a high enough order polynomial expansion is used (cf. Taylor series)
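As a minimal sketch of the quadratic expansion for two inputs, the Python below builds the bias, linear, quadratic and bilinear terms and evaluates a model that is linear in its parameters. The function names, the ordering of the terms and the example parameter values are illustrative assumptions, not taken from the slides.

```python
def quadratic_features(x1, x2):
    """Polynomial basis for two inputs (term ordering is an assumption):
    bias, linear (x1, x2), quadratic (x1^2, x2^2), bilinear (x1*x2)."""
    return [1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2]


def poly_predict(theta, x1, x2):
    """Weighted sum of the basis terms: non-linear in x, but linear
    in the parameters theta."""
    return sum(t * phi for t, phi in zip(theta, quadratic_features(x1, x2)))


# Zeroing the quadratic and bilinear parameters recovers a linear model.
linear_theta = [1.0, 2.0, 3.0, 0.0, 0.0, 0.0]
# Hypothetical quadratic model whose zero contour is the unit circle.
circle_theta = [-1.0, 0.0, 0.0, 1.0, 1.0, 0.0]

print(poly_predict(linear_theta, 2.0, 5.0))  # 1 + 2*2 + 3*5
print(poly_predict(circle_theta, 0.0, 0.0))  # inside the circle: negative
print(poly_predict(circle_theta, 2.0, 0.0))  # outside the circle: positive
```

Because the model is a plain weighted sum of fixed basis terms, any estimator for linear models can be reused once the inputs have been expanded.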
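[Slide 8/24] Example: Quadratic Decision Boundary
A quadratic 2-class classifier is given by:
$\hat{y} = \mathrm{sgn}(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2)$
This has a decision boundary given by:
$\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 = 0$
which here is a 2-dimensional ellipse.
[Figure: example of a quadratic classification boundary for the Iris Setosa data]
Exercise: modify the Perceptron simulation to work on this?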
[Slide 9/24] Polynomial Regression “Overfitting” …
The optimal, least squares parameter estimator is given by:
$\hat{\theta} = (X^T X)^{-1} X^T y$
where X is the data matrix: each row represents a data point and each column is one polynomial basis term.
Which polynomial terms should be used? Polynomials are flexible but can be quite oscillatory (high frequency components), which is usually not appropriate.
Example: 20 data points, x randomly drawn from a unit variance normal distribution, y = exp(-x.^2), fitted by a fifth order polynomial.
[Figure: y and ŷ plotted against x for the fifth order polynomial fit]
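The example on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: it solves the normal equations $(X^T X)\theta = X^T y$ with a small hand-written Gaussian elimination (the slides do not say how the estimate is computed), the random seed is arbitrary, and all function names are hypothetical.

```python
import math
import random


def design_matrix(xs, order):
    """Each row is one data point; column k holds x**k for k = 0..order."""
    return [[x ** k for k in range(order + 1)] for x in xs]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def least_squares(xs, ys, order):
    """Least squares polynomial fit via the normal equations."""
    X = design_matrix(xs, order)
    p = order + 1
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)]
           for i in range(p)]
    Xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(p)]
    return solve(XtX, Xty)


def predict(theta, x):
    """Evaluate the fitted polynomial at x."""
    return sum(t * x ** k for k, t in enumerate(theta))


def sse(theta, xs, ys):
    """Sum of squared errors on the given data."""
    return sum((y - predict(theta, x)) ** 2 for x, y in zip(xs, ys))


rng = random.Random(0)                      # arbitrary seed (assumption)
xs = [rng.gauss(0.0, 1.0) for _ in range(20)]
ys = [math.exp(-x * x) for x in xs]
theta5 = least_squares(xs, ys, 5)           # fifth order polynomial
print("training SSE:", sse(theta5, xs, ys))
print("prediction at x = 4:", predict(theta5, 4.0), "true:", math.exp(-16.0))
```

Comparing the prediction away from the training data with the true value gives a feel for how a flexible polynomial can behave outside the region the data covers.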
[Slide 10/24] Sigmoidal Non-Linear Transformations
Let's consider another way to introduce non-linearities into a basic linear model, by producing a continuous, non-linear transformation of a weighted sum:
$\hat{y} = f(u), \quad u = \sum_{i=0}^{n} \theta_i x_i = x^T \theta, \quad x_0 = 1$
What sort of single input, single output functions, f(), are possible?
- To estimate parameters using gradient descent, it should be differentiable
- To use for classification and regression, it should be able to represent linear and step functions, as appropriate
[Figure: a single node with inputs x0 = 1, x1, ..., xn, parameters θ0, θ1, ..., θn and one output]
[Slide 11/24] Tanh() Function
Consider the tanh() function, whose output lies in (-1,1).
When there is a single input: $u = \theta_0 + x\theta_1$
- When $\theta_1$ is large (= 4): almost a step function
- When $\theta_1$ is small (= 0.25): almost a linear relationship
- $\theta_0$ shifts tanh() horizontally
[Figure: f(u) plotted against u for small and large $\theta_1$]
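A small numerical sketch of these three effects, using the two slope values quoted on the slide. The input grid, the offset value of 2 and the function name are illustrative assumptions.

```python
import math


def tanh_node(x, theta0, theta1):
    """Single-input sigmoidal node: f(u) with u = theta0 + x * theta1."""
    return math.tanh(theta0 + x * theta1)


xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
steep = [tanh_node(x, 0.0, 4.0) for x in xs]     # large slope: near step
shallow = [tanh_node(x, 0.0, 0.25) for x in xs]  # small slope: near linear
shifted = [tanh_node(x, 2.0, 4.0) for x in xs]   # offset moves the crossing

for x, a, b, c in zip(xs, steep, shallow, shifted):
    print(f"x={x:+.1f}  steep={a:+.3f}  shallow={b:+.3f}  shifted={c:+.3f}")
```

With a non-zero offset the output crosses zero where u = 0, that is at x = -theta0/theta1, which is x = -0.5 for the shifted curve above.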
[Slide 12/24] Tanh Function in 2D X-Space
Such functions are often known as ridge functions, because they are constant along a line in input space:
$u = x^T \theta = c$
[Slide 13/24] Sigmoid
Many books/notes use the following sigmoid function:
$f(u) = \dfrac{1}{1 + \exp(-u)}$
which has an output lying in the range (0,1).
In these notes, we'll refer to both transformation functions as sigmoidal functions, because of their "lazy S" shape.
In fact, they're just transformations of each other:
$\tanh(u) = 2 f(2u) - 1$
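The link between the two sigmoidal functions can be checked numerically. The sketch below assumes the (0,1) sigmoid is the logistic function 1/(1 + exp(-u)) and uses one standard form of the relationship, tanh(u) = 2·logistic(2u) − 1; the function names and test points are illustrative. A numerical check is not a proof, so it does not replace the derivation.

```python
import math


def logistic(u):
    """(0,1) sigmoid: 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def tanh_from_logistic(u):
    """Rescale the (0,1) sigmoid to (-1,1): 2 * logistic(2u) - 1."""
    return 2.0 * logistic(2.0 * u) - 1.0


us = [-3.0, -1.0, -0.2, 0.0, 0.7, 2.5]
max_gap = max(abs(math.tanh(u) - tanh_from_logistic(u)) for u in us)
print(max_gap)  # should be at floating-point rounding level
```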
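[Slide 14/24] Sigmoidal Parameter Estimation
Gradient descent update for a single training datum:
$\theta \leftarrow \theta - \eta \dfrac{\partial E_i}{\partial \theta}$
For the i-th training pattern, with target $y_i$ and node output $\hat{y}_i = f(u_i)$, $u_i = x_i^T \theta$:
$E_i = \tfrac{1}{2}(y_i - \hat{y}_i)^2$
Using the chain rule:
$\dfrac{\partial E_i}{\partial \theta} = \dfrac{\partial E_i}{\partial \hat{y}_i} \dfrac{\partial \hat{y}_i}{\partial u_i} \dfrac{\partial u_i}{\partial \theta} = -(y_i - \hat{y}_i)\, f'(u_i)\, x_i$
Giving an update rule:
$\theta \leftarrow \theta + \eta\, (y_i - \hat{y}_i)\, f'(u_i)\, x_i$
This is similar to the LMS rule, apart from the extra sigmoidal derivative term, f'().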
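A minimal sketch of this per-datum update for a single tanh node, trained on the logical AND and OR data. Several choices here are assumptions rather than anything stated on the slides: a squared-error cost with a factor of one half, targets coded as -1/+1 to match the tanh output range, zero initial parameters, a learning rate of 0.1 and 1000 passes through the data. The function names are hypothetical.

```python
import math


def train_tanh_node(data, eta=0.1, epochs=1000):
    """Per-datum gradient descent for y_hat = tanh(theta . [1, x])."""
    n_inputs = len(data[0][0])
    theta = [0.0] * (n_inputs + 1)          # theta[0] is the bias parameter
    for _ in range(epochs):
        for x, y in data:
            xa = [1.0] + list(x)            # prepend the constant input x0 = 1
            u = sum(t * xi for t, xi in zip(theta, xa))
            y_hat = math.tanh(u)
            # error times the tanh derivative f'(u) = 1 - f(u)^2
            scale = (y - y_hat) * (1.0 - y_hat * y_hat)
            theta = [t + eta * scale * xi for t, xi in zip(theta, xa)]
    return theta


def classify(theta, x):
    """Sign of the weighted sum, returned as -1 or +1."""
    u = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return 1 if u > 0 else -1


AND_DATA = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
OR_DATA = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

theta_and = train_tanh_node(AND_DATA)
theta_or = train_tanh_node(OR_DATA)
print("AND parameters:", theta_and)
print("OR parameters:", theta_or)
```

Removing the `(1.0 - y_hat * y_hat)` factor and replacing `math.tanh(u)` by `u` would turn the inner update into the LMS rule, which makes the single point of difference easy to see.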
[Slide 15/24] Sigmoidal Parameter Estimation (ii)
Sigmoidal function's derivative (tanh):
$\dfrac{df}{du} = 1 - \tanh^2(u) = 1 - f(u)^2$
[Figure: f(u) and df/du plotted against u]
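The derivative expressions can be sanity-checked against a central finite difference. The sketch assumes the standard identities d tanh(u)/du = 1 − tanh²(u) and, for the logistic (0,1) sigmoid y = 1/(1 + exp(-u)), dy/du = y(1 − y); the step size and test points are arbitrary choices and the function names are illustrative.

```python
import math


def logistic(u):
    """(0,1) sigmoid: 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def central_difference(f, u, h=1e-5):
    """Numerical derivative of f at u."""
    return (f(u + h) - f(u - h)) / (2.0 * h)


def tanh_derivative(u):
    """Analytic form: 1 - f(u)^2 with f = tanh."""
    f = math.tanh(u)
    return 1.0 - f * f


def logistic_derivative(u):
    """Analytic form: y * (1 - y) with y = logistic(u)."""
    y = logistic(u)
    return y * (1.0 - y)


us = [-2.0, -0.5, 0.0, 0.3, 1.5]
tanh_gap = max(abs(central_difference(math.tanh, u) - tanh_derivative(u))
               for u in us)
logistic_gap = max(abs(central_difference(logistic, u) - logistic_derivative(u))
                   for u in us)
print(tanh_gap, logistic_gap)  # both should be very small
```

Both derivatives are written in terms of the node output alone, so no extra exponentials are needed once the forward pass has been computed.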
[Slide 16/24] Layered Perceptron Networks
In this section, we're going to consider how these sigmoidal nodes can be connected together into layers to give greater/more flexible non-linear modelling behaviour.
Two central questions:
1. What are the non-linear modelling capabilities?
2. How to estimate the non-linear parameters?
[Figure: layered network with inputs x0, x1, x2, hidden nodes h0, h1, h2 and output y]
[Slide 17/24] Linearly Separable 2D Logical Functions
Note the class output values of 0 and 1 in the next few slides.
[Figure: panels showing the AND, OR and NOT functions over inputs x1 and x2]
[Slide 18/24] Nonlinearly Separable 2D XOR
eXclusive OR (XOR) is n-bit parity; with 2 inputs, the data is generated by:
y = (NOT x2 AND x1) OR (NOT x1 AND x2)
- A non-linear, polynomial input transformation, x3 = x1*x2, makes the problem separable
- How can multi-layer networks solve it?
[Figure: XOR class outputs in the (x1, x2) input space]
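To illustrate why the extra input x3 = x1*x2 makes XOR separable, the sketch below applies a single linear threshold to (x1, x2, x3). The parameter values (-0.5, 1, 1, -2) are one hand-picked choice made for this illustration, not values from the slides, and the convention that the output is 1 when the weighted sum is positive is also an assumption.

```python
def step(u):
    """Hard threshold with 0/1 class outputs."""
    return 1 if u > 0 else 0


def xor_with_product_feature(x1, x2):
    """Linear threshold on the expanded inputs (x1, x2, x3 = x1*x2)."""
    x3 = x1 * x2
    u = -0.5 + 1.0 * x1 + 1.0 * x2 - 2.0 * x3  # hand-picked parameters
    return step(u)


def xor_reference(x1, x2):
    """y = (NOT x2 AND x1) OR (NOT x1 AND x2)."""
    return int(bool((not x2 and x1) or (not x1 and x2)))


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_with_product_feature(x1, x2), xor_reference(x1, x2))
```

In the three-dimensional (x1, x2, x3) space the point (1, 1, 1) is lifted away from the other three, so a single plane can separate the two classes.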
[Slide 19/24] Multi-Layer Network for 2D XOR
XOR can be implemented as a two layer network (two layers of adjustable parameters) with two "hidden nodes" in the hidden layer:
- Empty circles represent linear Perceptron nodes
- Solid circles represent real signals
- Arrows represent model parameters
(NOT x2 AND x1) OR (NOT x1 AND x2) is represented in a 2 layer network as:
- h1: (NOT x2 AND x1)
- h2: (NOT x1 AND x2)
- y = h1 OR h2
[Figure: two layer network with inputs x0 = 1, x1, x2, a hidden layer with h0 = 1, h1, h2 and an output layer producing y]
[Slide 20/24] Exercise: Determine the 9 Parameters
Write down the parameter vectors for the 3 Perceptron nodes:
- h1: (NOT x2 AND x1)
- h2: (NOT x1 AND x2)
- y: h1 OR h2
[Figure: two layer network with inputs x0 = 1, x1, x2, a hidden layer with h0 = 1, h1, h2 and an output layer producing y]
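A candidate answer to this exercise can be checked with a short script such as the one below. It is a sketch under stated assumptions: each node outputs 1 when its weighted sum is strictly positive and 0 otherwise, the first entry of each parameter vector multiplies the constant input (x0 = 1 or h0 = 1), and the function names are hypothetical. The all-zero parameters are placeholders only, to be replaced with your own values.

```python
def step(u):
    """Hard-threshold Perceptron output with 0/1 class values."""
    return 1 if u > 0 else 0


def node(params, inputs):
    """Perceptron node; params[0] multiplies the constant input 1."""
    u = params[0] + sum(p * x for p, x in zip(params[1:], inputs))
    return step(u)


def network(theta_h1, theta_h2, theta_y, x1, x2):
    """Two layer network: hidden nodes h1, h2 feed the output node y."""
    h1 = node(theta_h1, [x1, x2])
    h2 = node(theta_h2, [x1, x2])
    return node(theta_y, [h1, h2])


def implements_xor(theta_h1, theta_h2, theta_y):
    """True if the network reproduces XOR on all four input patterns."""
    return all(
        network(theta_h1, theta_h2, theta_y, x1, x2) == (x1 ^ x2)
        for x1 in (0, 1) for x2 in (0, 1)
    )


# Placeholder parameters (3 per node, 9 in total); substitute your answer.
print(implements_xor([0, 0, 0], [0, 0, 0], [0, 0, 0]))
```

The check only looks at the network output, so any parameter set that realises the truth table passes, even if its hidden nodes are not exactly the h1 and h2 terms named on the slide.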
[Slide 21/24] Logical Functions and DNF
Any logical function can be expressed as the union of "negation and conjunction" terms, so it can be realized with a 2 layer Perceptron network:
- Each hidden layer unit is set to respond to exactly one positive example
- The output layer is formed from the union of the hidden layer outputs: f = h1 OR h2 OR … OR hP
- Each data point/positive example is given its own "hidden unit", which responds to only that point
- Essentially, the network memorizes the positive training samples
[Slide 22/24] Lectures 7&8: Conclusions
- There are many ways to build and use non-linear models for classification and regression purposes
- Potentially get more accurate predictions/fewer errors if the data is generated by a non-linear relationship
- Parameter estimation is sometimes more complex
  - No direct optimal parameter calculation
  - Gradient-based estimation has local minima and differing curvatures
- Need to select an appropriate non-linear framework
- Multi-layer (sigmoidal) Perceptrons are one such framework
  - Non-linearity controlled by nodes in the hidden layer
  - Parameters estimated using gradient descent
  - Several factors need to be considered
[Slide 23/24] Lectures 7&8: Laboratory (i) Matlab
- Extend the basic Perceptron Matlab script so that it now trains a quadratic classifier (note that the plotting routines will no longer be appropriate).
- Implement the sigmoidal perceptron learning algorithm, where the model consists of a single layer with a tanh activation function and the parameters are updated after each presentation of a datum (see Slides 10-14).
- Test the algorithm on the logical AND and logical OR data, as you did for the normal Perceptron algorithm in the laboratory in IS2.ppt.
- What are the similarities/differences of this model compared to the normal Perceptron algorithm described in IS2.ppt?
[Slide 24/24] Lectures 7&8: Laboratory (ii) Theory
- Prove the relationship on Slide 13 between the two types of sigmoids.
- Verify the derivative of the tanh function on Slide 15, and prove that the derivative of the (0,1) sigmoid on Slide 13 can be expressed as y(1-y), where y is the sigmoid's output.
- Calculate the optimal parameter values missing on Slides 17 and 20.
- Derive a generic rule for setting the parameter values on Slide 21 for an arbitrary logical function. You may assume that you know the number of positive examples, the number of features and the logical structure of each positive example.