Published by Humberto Fairhurst. Modified over 9 years ago.
1
EE-M110 2006/7: IS L7&8 1/24, v3.0
Lectures 7&8: Non-linear Classification and Regression using Layered Perceptrons
Dr Martin Brown
Room: E1k
Email: martin.brown@manchester.ac.uk
Telephone: 0161 306 4672
http://www.eee.manchester.ac.uk/intranet/pg/coursematerial/
[Figure: two classes of points ("+" and ".") in the ($x_1$, $x_2$) plane, separated by the decision boundary $m(x, \theta) = 0$]
2
Lectures 7&8: Outline
1. What approaches are possible for non-linear classification and regression problems?
2. Non-linear polynomial networks
   1. Potential and problems of using flexible models
3. Sigmoidal-type non-linear transformations
   1. Modelling capabilities
   2. Regression and classification interpretation
   3. Parameter optimization using gradient descent
4. Non-linear logical functions and layered Perceptron nets
This leads on to Multi-Layer Perceptron (MLP) models next week.
3
Lecture 7&8: Resources
These slides are largely self-contained, but extra background material can be found in:
Machine Learning, T Mitchell, McGraw Hill, 1997
Machine Learning, Neural and Statistical Classification, D Michie, DJ Spiegelhalter and CC Taylor, 1994: http://www.amsta.leeds.ac.uk/~charles/statlog/
In addition, there are many on-line sources for multi-layer perceptrons (MLPs) and error back propagation (EBP); a web search will find them.
Advanced text: Information Theory, Inference and Learning Algorithms, D MacKay, Cambridge University Press, 2003
4
Non-Linear Regression and Classification
Most real-world modelling problems are not linear.
A task is non-linear if it cannot be represented adequately using a linear model:
– Classification: the number of classification errors is too large
– Regression: the noise variance is too large
Using non-linear models/relationships may help to approximate f().
[Figure: two classes of points ("+" and ".") in the ($x_1$, $x_2$) plane]
5
Non-Linear Classification
Consider the following 2-class classification problem.
Always compare against the prior error rate.
Exercise: What are the error rates for the prior, optimal linear and non-linear models?
The type of non-linear function is important.
Data is generated by (with classification errors): [equation not preserved]
[Figure: two classes of points ("+" and ".") in the ($x_1$, $x_2$) plane]
6
Non-linear Regression
Need to balance model complexity against data accuracy.
How much of the signal is reproducible?
[Figure: four panels plotting $y$ and $\hat{y}$ against $x$: bias model, linear model, non-linear model, non-linear interpolation model]
7
Polynomial Non-Linear Models
A simple and convenient way to extend linear models is to consider polynomial expansions, such as a quadratic, which is made up of bias, linear, quadratic and bilinear terms.
Expansion to any order is possible: cubic, quartic, or a subset of terms.
A linear model is produced when the parameters of the quadratic and bilinear terms are zero.
A polynomial model is linear in its parameters.
It can approximate any continuous function arbitrarily closely if a high enough polynomial expansion is used (Taylor series).
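A minimal sketch of the "linear in its parameters" point, assuming two inputs and a parameter ordering of [bias, x1, x2, x1^2, x2^2, x1*x2]. The function names, the ordering and the parameter values are illustrative choices, not taken from the slides.

```python
def quadratic_features(x1, x2):
    """Map a 2-D input to the basis [1, x1, x2, x1^2, x2^2, x1*x2]."""
    return [1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2]


def model(theta, x1, x2):
    """Polynomial model: a plain dot product, hence linear in theta."""
    return sum(t * p for t, p in zip(theta, quadratic_features(x1, x2)))


# Hypothetical parameters whose zero contour model(...) = 0 is the unit circle.
theta_circle = [-1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
inside = model(theta_circle, 0.5, 0.5)    # 0.25 + 0.25 - 1 = -0.5
outside = model(theta_circle, 1.0, 1.0)   # 1 + 1 - 1 = 1.0

# Zeroing the quadratic and bilinear parameters recovers a linear model.
theta_linear = [0.5, 2.0, -1.0, 0.0, 0.0, 0.0]
linear_value = model(theta_linear, 3.0, 4.0)  # 0.5 + 6 - 4 = 2.5
```

Because the non-linearity lives entirely in the fixed feature map, any estimator for linear models (Perceptron, LMS, least squares) can be applied to the expanded inputs unchanged.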
8
Example: Quadratic Decision Boundary
A quadratic 2-class classifier assigns the class according to the sign of a quadratic function of the inputs.
Its decision boundary is the set of points where that quadratic function is zero, for example a 2-dimensional ellipse.
[Figure: example of a quadratic classification boundary for the Iris Setosa data]
Modify the Perceptron simulation to work on this?
9
Polynomial Regression
"Overfitting" …
The optimal, least squares parameter estimator is given by:
$\hat{\theta} = (X^T X)^{-1} X^T y$
where X is the data matrix: each row represents a data point and each column is one polynomial basis term.
Which polynomial terms should be used? Polynomials are flexible but can be quite oscillatory (high frequency components), which is usually not appropriate.
Example: 20 data points, x randomly drawn from a unit variance normal distribution, y = exp(-x.^2), fitted by a fifth order polynomial.
[Figure: $y$ and $\hat{y}$ plotted against $x$ for the fifth order fit]
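The least squares estimate for a polynomial basis can be sketched in plain Python by forming the normal equations $(X^T X)\theta = X^T y$ and solving them by Gaussian elimination. The helper names and the noise-free quadratic test data are assumptions made so the result is easy to check; the slide's own example uses y = exp(-x.^2) with a fifth order polynomial.

```python
def design_matrix(xs, order):
    """Each row is one data point; each column is one basis term x**k."""
    return [[x ** k for k in range(order + 1)] for x in xs]


def solve(a, b):
    """Solve a * theta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    theta = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * theta[c] for c in range(r + 1, n))
        theta[r] = (m[r][n] - s) / m[r][r]
    return theta


def least_squares(xs, ys, order):
    """Form X^T X and X^T y, then solve the normal equations."""
    cols = list(zip(*design_matrix(xs, order)))
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * y for p, y in zip(ci, ys)) for ci in cols]
    return solve(xtx, xty)


# Noise-free data from y = 1 + 2x + 3x^2, so the fit should recover [1, 2, 3].
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
theta = least_squares(xs, ys, order=2)
```

Raising `order` well beyond what the data supports makes $X^T X$ poorly conditioned and the fitted curve oscillatory, which is the overfitting behaviour the slide warns about.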
10
Sigmoidal Non-Linear Transformations
Let's consider another way to introduce non-linearities into a basic linear model, by producing a continuous, non-linear transformation of a weighted sum:
$y = f(u)$, $u = x^T \theta = \sum_{i=0}^{n} \theta_i x_i$, with $x_0 = 1$
What sort of single input, single output functions, f(), are possible?
– To estimate parameters using gradient descent, it should be differentiable
– To use it for classification and regression, it should be able to represent linear and step functions, as appropriate
[Figure: a single node with inputs $x_0 = 1, x_1, \ldots, x_n$, weighted by $\theta_0, \theta_1, \ldots, \theta_n$, producing the output $y$]
11
Tanh() Function
Consider the tanh() function, whose output lies in (-1,1).
When there is a single input: $u = \theta_0 + \theta_1 x$
– When $\theta_1$ is large (= 4): almost a step function
– When $\theta_1$ is small (= 0.25): almost a linear relationship
– $\theta_0$ shifts tanh() horizontally
[Figure: f(u) plotted against u for small and large $\theta_1$]
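The effect of the slope and offset parameters can be checked numerically. This is a small sketch using the two slope values quoted on the slide (4 and 0.25); the function name `node` and the sample points are illustrative.

```python
import math


def node(x, theta0, theta1):
    """Single-input tanh node: f(theta0 + theta1 * x)."""
    return math.tanh(theta0 + theta1 * x)


sample = (-1.0, -0.1, 0.1, 1.0)

# Large slope: the output saturates quickly, approaching a step function.
steep = [node(x, 0.0, 4.0) for x in sample]

# Small slope: the output stays close to theta1 * x, an almost linear map.
shallow = [node(x, 0.0, 0.25) for x in sample]

# The offset moves the midpoint of the curve to x = -theta0 / theta1.
midpoint_value = node(1.0, -4.0, 4.0)  # u = 0 at x = 1, so the output is 0
```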
12
Tanh Function in 2D X-Space
Such functions are often known as ridge functions, because they are constant along a line in input space:
$u = x^T \theta = c$
13
0-1 Sigmoid
Many books/notes use the following sigmoid function:
$\sigma(u) = \frac{1}{1 + \exp(-u)}$
which has an output lying in the range (0,1).
In these notes, we'll refer to both transformation functions as sigmoidal functions, because of their "lazy S" shape.
In fact, they're just transformations of each other:
$\tanh(u) = 2\sigma(2u) - 1$
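The link between the two sigmoidal functions, tanh(u) = 2s(2u) - 1 with s(u) = 1/(1 + exp(-u)), can be sanity-checked numerically before proving it. This sketch is a check on a grid of points, not a proof, and the function names are illustrative.

```python
import math


def sigmoid01(u):
    """Logistic sigmoid with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))


def tanh_from_sigmoid(u):
    """Rescale and shift the 0-1 sigmoid so that its output lies in (-1, 1)."""
    return 2.0 * sigmoid01(2.0 * u) - 1.0


grid = [k / 10.0 for k in range(-50, 51)]
max_gap = max(abs(math.tanh(u) - tanh_from_sigmoid(u)) for u in grid)
```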
14
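Sigmoidal Parameter Estimation
Gradient descent updates the parameters after each single training datum, moving them against the gradient of the error on that datum.
For the i-th training pattern, the chain rule is applied to the error through $y = f(u)$ and $u = x^T \theta$.
This gives an update rule in which the parameter change is proportional to the output error, the derivative $f'(u_i)$ and the input $x_i$.
It is similar to the LMS rule, apart from the extra sigmoidal derivative term, f'().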
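A minimal sketch of the per-datum update for a single tanh node, assuming a squared error cost, a learning rate named `eta`, zero initial parameters and targets recoded to -1/+1 to match the tanh output range. None of these choices are stated on the slide. The update used is theta <- theta + eta * (y - y_hat) * f'(u) * x, with f'(u) = 1 - y_hat^2 for tanh.

```python
import math


def train_tanh_node(data, eta=0.1, epochs=5000):
    """Online gradient descent for y_hat = tanh(theta . x) on squared error."""
    theta = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, target in data:
            y_hat = math.tanh(sum(t * xi for t, xi in zip(theta, x)))
            # LMS-style error term times the extra tanh derivative 1 - y_hat^2.
            delta = eta * (target - y_hat) * (1.0 - y_hat * y_hat)
            theta = [t + delta * xi for t, xi in zip(theta, x)]
    return theta


# Logical AND with x0 = 1 as the bias input; targets are -1/+1 (an assumption).
and_data = [((1.0, 0.0, 0.0), -1.0), ((1.0, 0.0, 1.0), -1.0),
            ((1.0, 1.0, 0.0), -1.0), ((1.0, 1.0, 1.0), 1.0)]

theta = train_tanh_node(and_data)
outputs = [math.tanh(sum(t * xi for t, xi in zip(theta, x)))
           for x, _ in and_data]

# One update from zero parameters on the first datum only.
first_step = train_tanh_node(and_data[:1], eta=0.1, epochs=1)
```

When the node saturates, 1 - y_hat^2 approaches zero and the updates shrink, which is the main practical difference from the plain LMS rule.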
15
Sigmoidal Parameter Estimation (ii)
Sigmoidal function's derivative (tanh):
$\frac{df}{du} = 1 - \tanh^2(u) = 1 - f(u)^2$
[Figure: f(u) and df/du plotted against u]
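The tanh derivative 1 - f(u)^2 can be compared against a central finite difference as a quick numerical check. The step size `h` and the test points are arbitrary choices for this sketch.

```python
import math


def tanh_derivative(u):
    """Closed form: df/du = 1 - tanh(u)^2."""
    f = math.tanh(u)
    return 1.0 - f * f


def central_difference(u, h=1e-5):
    """Numerical estimate of df/du for f = tanh."""
    return (math.tanh(u + h) - math.tanh(u - h)) / (2.0 * h)


points = [-3.0, -1.0, 0.0, 0.5, 2.0]
worst = max(abs(tanh_derivative(u) - central_difference(u)) for u in points)
```

The derivative peaks at 1 when u = 0 and decays towards 0 as the node saturates, so it can be computed from the node's output alone.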
16
Layered Perceptron Networks
In this section, we're going to consider how these sigmoidal nodes can be connected together into layers to give greater/more flexible non-linear modelling behaviour.
Two central questions:
1. What are the non-linear modelling capabilities?
2. How to estimate the non-linear parameters?
[Figure: layered network with inputs $x_0$, $x_1$, $x_2$, hidden nodes $h_0$, $h_1$, $h_2$ and output $y$]
17
Linearly Separable 2D Logical Functions
Note the class output values of 0 and 1 in the next few slides.
[Figure: AND, OR and NOT, each shown as 0/1 class labels in input space together with a single Perceptron node with bias input $x_0 = 1$]
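One possible set of Perceptron parameters for the three linearly separable functions is sketched below, assuming a hard threshold at zero and parameter vectors ordered as (bias, weights...). The values are one valid choice among many and are not the ones from the original slide.

```python
def step(u):
    """Hard threshold giving class outputs 0 or 1."""
    return 1 if u > 0 else 0


def perceptron(theta, inputs):
    """Single node: prepend the bias input x0 = 1, then threshold the sum."""
    x = (1,) + tuple(inputs)
    return step(sum(t * xi for t, xi in zip(theta, x)))


# Illustrative parameter choices (bias first).
AND_THETA = (-1.5, 1.0, 1.0)
OR_THETA = (-0.5, 1.0, 1.0)
NOT_THETA = (0.5, -1.0)

pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_table = [perceptron(AND_THETA, p) for p in pairs]
or_table = [perceptron(OR_THETA, p) for p in pairs]
not_table = [perceptron(NOT_THETA, (x,)) for x in (0, 1)]
```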
18
Nonlinearly Separable 2D XOR
eXclusive OR (XOR) is n-bit parity; here with 2 inputs.
Data generated by: y = (NOT $x_2$ AND $x_1$) OR (NOT $x_1$ AND $x_2$).
A non-linear, polynomial input transformation, $x_3 = x_1 x_2$, makes the problem separable.
How can multi-layer networks solve it?
[Figure: XOR shown as 0/1 class labels in the ($x_1$, $x_2$) plane]
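To illustrate how the extra input x3 = x1*x2 makes XOR separable, the sketch below applies a single threshold node to (x1, x2, x3). The particular weights (bias -0.5, then 1, 1, -2) are an assumed example, not values from the slides.

```python
def step(u):
    """Hard threshold giving class outputs 0 or 1."""
    return 1 if u > 0 else 0


def xor_with_product_feature(x1, x2):
    """Single linear threshold node in the expanded space (x1, x2, x1*x2)."""
    x3 = x1 * x2
    return step(-0.5 + x1 + x2 - 2.0 * x3)


pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_table = [xor_with_product_feature(a, b) for a, b in pairs]
```

The bilinear term only fires for the input (1, 1), where it pulls the weighted sum back below the threshold.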
19
Multi-Layer Network for 2D XOR
XOR can be implemented as a two layer network (two layers of adjustable parameters) with two "hidden nodes" in the hidden layer.
– Empty circles represent linear Perceptron nodes
– Solid circles represent real signals
– Arrows represent model parameters
(NOT $x_2$ AND $x_1$) OR (NOT $x_1$ AND $x_2$) is represented in a 2 layer network as:
$h_1$: (NOT $x_2$ AND $x_1$)
$h_2$: (NOT $x_1$ AND $x_2$)
$y$ = $h_1$ OR $h_2$
[Figure: network with inputs $x_0 = 1$, $x_1$, $x_2$, a hidden layer with $h_0 = 1$, $h_1$, $h_2$, and an output layer producing $y$]
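The two layer XOR network can be written out directly with threshold nodes. The nine parameter values below (three per node, bias first) are one assumed solution among many, chosen so the arithmetic is easy to follow.

```python
def step(u):
    """Hard threshold giving class outputs 0 or 1."""
    return 1 if u > 0 else 0


# Illustrative parameters, ordered (bias, weight 1, weight 2).
H1_THETA = (-0.5, 1.0, -1.0)   # h1 = x1 AND NOT x2
H2_THETA = (-0.5, -1.0, 1.0)   # h2 = x2 AND NOT x1
Y_THETA = (-0.5, 1.0, 1.0)     # y = h1 OR h2


def node(theta, a, b):
    """Threshold node with bias input fixed at 1."""
    return step(theta[0] + theta[1] * a + theta[2] * b)


def xor_network(x1, x2):
    """Hidden layer followed by the output layer."""
    h1 = node(H1_THETA, x1, x2)
    h2 = node(H2_THETA, x1, x2)
    return node(Y_THETA, h1, h2)


pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_table = [xor_network(a, b) for a, b in pairs]
hidden_for_10 = (node(H1_THETA, 1, 0), node(H2_THETA, 1, 0))
```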
20
Exercise: Determine the 9 Parameters
Write down the parameter vectors for the 3 Perceptron nodes:
$h_1$: (NOT $x_2$ AND $x_1$)
$h_2$: (NOT $x_1$ AND $x_2$)
$y$: $h_1$ OR $h_2$
[Figure: network with inputs $x_0 = 1$, $x_1$, $x_2$, a hidden layer with $h_0 = 1$, $h_1$, $h_2$, and an output layer producing $y$]
21
Logical Functions and DNF
Any logical function can be expressed as the union of "negation and conjunction" terms (disjunctive normal form, DNF).
It can be realized with a 2 layer Perceptron network:
– Each hidden layer unit responds to exactly one positive example
– The output layer is formed from the union of the hidden layer outputs: f = $h_1$ OR $h_2$ OR … OR $h_P$
Each data point/positive example is given its own "hidden unit", which responds to only that point.
Essentially, the network memorizes the positive training samples.
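The DNF construction can be sketched for binary inputs as follows, assuming threshold nodes and this particular weight rule: a hidden unit for positive example p uses weight +1 where p has a 1, weight -1 where p has a 0, and bias 0.5 minus the number of ones in p. The rule and the function names are illustrative assumptions, and other settings work equally well.

```python
from itertools import product


def step(u):
    """Hard threshold giving class outputs 0 or 1."""
    return 1 if u > 0 else 0


def build_dnf_network(positives):
    """One hidden unit (bias, weights) per positive example."""
    hidden = []
    for p in positives:
        weights = [1.0 if bit else -1.0 for bit in p]
        bias = 0.5 - sum(p)  # the weighted sum is +0.5 only when x == p
        hidden.append((bias, weights))
    return hidden


def evaluate(hidden, x):
    """Hidden units fire for their own example; the output node is an OR."""
    h = [step(b + sum(w * xi for w, xi in zip(ws, x))) for b, ws in hidden]
    return step(-0.5 + sum(h))


# Example: 3-bit parity, whose positive examples have an odd number of ones.
inputs = list(product((0, 1), repeat=3))
positives = [x for x in inputs if sum(x) % 2 == 1]
net = build_dnf_network(positives)
parity = {x: evaluate(net, x) for x in inputs}
```

Any input that differs from p in at least one bit lowers that unit's weighted sum by at least 1, so each hidden unit fires for its own example only. The network size therefore grows with the number of positive examples.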
22
Lecture 7&8: Conclusions
There are many ways to build and use non-linear models for classification and regression purposes.
Potentially get more accurate predictions/fewer errors if the data is generated by a non-linear relationship.
Parameter estimation is sometimes more complex:
– No direct optimal parameter calculation
– Gradient-based estimation has local minima and differing curvatures
Need to select an appropriate non-linear framework.
Multi-layer (sigmoidal) Perceptrons are one such framework:
– Non-linearity controlled by nodes in the hidden layer
– Parameters estimated using gradient descent
– Several factors need to be considered
23
Lecture 7&8: Laboratory (i)
Matlab
Extend the basic Perceptron Matlab script so that it trains a quadratic classifier (note that the plotting routines will no longer be appropriate).
Implement the sigmoidal perceptron learning algorithm, where the model consists of a single layer with a tanh activation function and the parameters are updated after each presentation of a datum (see Slides 10-14).
Test the algorithm on the logical AND and logical OR data, as you did for the normal Perceptron algorithm in the laboratory in IS2.ppt.
What are the similarities/differences between this model and the normal Perceptron algorithm described in IS2.ppt?
24
Lecture 7&8: Laboratory (ii)
Theory
Prove the relationship on Slide 13 between the two types of sigmoids.
Verify the derivative of the tanh function on Slide 15, and prove that the derivative of the (0,1) sigmoid on Slide 13 can be expressed as y(1-y).
Calculate the optimal parameter values missing on Slides 17 and 20.
Derive a generic rule for setting the parameter values on Slide 21 for an arbitrary logical function. You may assume that you know the number of positive examples, the number of features and the logical structure of each positive example.