Slide 1/24: Lectures 7&8: Non-linear Classification and Regression using Layered Perceptrons
EE-M110 2006/7: IS, v3.0
Dr Martin Brown, Room: E1k
[Figure: a non-linear decision boundary m(x, θ) = 0 in the (x1, x2) plane]

Slide 2/24: Lectures 7&8: Outline
1. What approaches are possible for non-linear classification and regression problems
2. Non-linear polynomial networks
   - Potential and problems of using flexible models
3. Sigmoidal-type non-linear transformations
   - Modelling capabilities
   - Regression and classification interpretation
   - Parameter optimization using gradient descent
4. Non-linear logical functions and layered Perceptron nets
This leads on to Multi-Layer Perceptron (MLP) models next week.

Slide 3/24: Lecture 7&8: Resources
These slides are largely self-contained, but extra background material can be found in:
- Machine Learning, T. Mitchell, McGraw Hill, 1997
- Machine Learning, Neural and Statistical Classification, D. Michie, D. J. Spiegelhalter and C. C. Taylor, 1994
In addition, there are many on-line sources for multi-layer perceptrons (MLPs) and error back-propagation (EBP); just search on Google.
Advanced text: Information Theory, Inference and Learning Algorithms, D. MacKay, Cambridge University Press, 2003

Slide 4/24: Non-Linear Regression and Classification
Most real-world modelling problems are not linear. A task is non-linear if it cannot be adequately represented using a linear model:
- Classification: the number of classification errors is too large
- Regression: the noise variance is too large
Using non-linear models/relationships may help to approximate f().
[Figure: a non-linear decision boundary in the (x1, x2) plane]

Slide 5/24: Non-Linear Classification
Consider the following 2-class classification problem. The data is generated by a non-linear rule (with classification errors).
- Always compare to the prior error rate
- The type of non-linear function is important
Exercise: what are the error rates for the prior, optimal linear and non-linear models?
[Figure: the two classes scattered in the (x1, x2) plane]

Slide 6/24: Non-linear Regression
Need to balance model complexity against data accuracy: how much of the signal is reproducible?
[Figure: four fits of the data y and model output y-hat against x, using a bias-only model, a linear model, a non-linear model, and a non-linear interpolation model]

Slide 7/24: Polynomial Non-Linear Models
A simple and convenient way to extend linear models is to consider polynomial expansions, such as the quadratic expansion in two inputs:
    y = θ0 + θ1*x1 + θ2*x2 + θ3*x1*x2 + θ4*x1^2 + θ5*x2^2
    (bias, linear, bilinear and quadratic terms)
- Expansion to any order is possible: cubic, quadratic, or a subset of terms
- The linear model is recovered when the bilinear and quadratic parameters are zero
- A polynomial model is linear in its parameters
- It can approximate any continuous function arbitrarily closely if a high enough polynomial expansion is used (Taylor series)

Slide 8/24: Example: Quadratic Decision Boundary
A quadratic 2-class classifier is given by thresholding the quadratic model above:
    y = sign(θ0 + θ1*x1 + θ2*x2 + θ3*x1*x2 + θ4*x1^2 + θ5*x2^2)
This has a decision boundary where the quadratic expression equals zero: for appropriate parameter values, a 2-dimensional ellipse.
Example: a quadratic classification boundary for the Iris Setosa data.
Exercise: modify the Perceptron simulation to work on this?
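The following is a minimal sketch (not from the original slides) of how the standard Perceptron rule can be run on quadratically expanded inputs to obtain such a classifier; the synthetic circular data, learning rate and function names are illustrative choices only.

    import numpy as np

    def quad_expand(X):
        # map [x1, x2] -> [1, x1, x2, x1*x2, x1^2, x2^2]
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1**2, x2**2])

    # synthetic data: class +1 inside the unit circle, class -1 outside
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where((X**2).sum(axis=1) < 1.0, 1, -1)

    Z = quad_expand(X)            # one row per point, one column per polynomial term
    theta = np.zeros(Z.shape[1])  # the 6 classifier parameters
    eta = 0.1
    for epoch in range(50):
        for z_i, y_i in zip(Z, y):
            if np.sign(z_i @ theta) != y_i:   # misclassified: apply the Perceptron update
                theta += eta * y_i * z_i

    print("training errors:", int(np.sum(np.sign(Z @ theta) != y)))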

Slide 9/24: Polynomial Regression "Overfitting"
The optimal, least squares parameter estimate is given by:
    θ = (X^T X)^(-1) X^T y
where X is the data matrix: each row represents a data point, each column is one polynomial basis term.
Which polynomial terms should be used? Polynomials are flexible but can be quite oscillatory (high-frequency components), which is usually not appropriate.
Example: 20 data points, x randomly drawn from a unit-variance normal distribution, y = exp(-x.^2), fitted by a fifth-order polynomial.
[Figure: the target y and the fitted polynomial y-hat plotted against x]
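The overfitting example described above can be reproduced in a few lines; this sketch is an illustration rather than the original script, and the random seed is arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=20)            # 20 points from a unit-variance normal
    y = np.exp(-x**2)                  # target function quoted on the slide

    coeffs = np.polyfit(x, y, deg=5)   # fifth-order least squares polynomial fit
    xs = np.linspace(x.min(), x.max(), 200)
    y_hat = np.polyval(coeffs, xs)

    # compare the fitted polynomial to the underlying function on a dense grid
    print("max deviation from exp(-x^2):", float(np.max(np.abs(y_hat - np.exp(-xs**2)))))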

Slide 10/24: Sigmoidal Non-Linear Transformations
Let's consider another way to introduce non-linearities into a basic linear model: produce a continuous, non-linear transformation of a weighted sum,
    y = f(u),   u = θ0*x0 + θ1*x1 + ... + θn*xn = x^T θ,   with x0 = 1
What sort of single-input, single-output functions f() are possible?
- To estimate parameters using gradient descent, f() should be differentiable
- To be usable for classification and regression, f() should be able to represent linear and step functions, as appropriate
[Figure: a single node with inputs x0 = 1, x1, ..., xn, weights θ0, θ1, ..., θn and output y]

Slide 11/24: Tanh() Function
Consider the tanh() function, whose output lies in (-1, 1):
    f(u) = tanh(u) = (e^u - e^-u) / (e^u + e^-u)
When there is a single input: u = θ0 + θ1*x
- When θ1 is large (e.g. = 4): almost a step function
- When θ1 is small (e.g. = 0.25): almost a linear relationship
- θ0 shifts tanh() horizontally
[Figure: f(u) plotted against u for small and large θ1]
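A quick illustrative check (not from the slides) of the step-like versus near-linear behaviour, using the θ1 values quoted above; the grid of x values is arbitrary.

    import numpy as np

    x = np.linspace(-3, 3, 7)
    for theta1 in (4.0, 0.25):           # large -> almost a step, small -> almost linear
        y = np.tanh(0.0 + theta1 * x)    # theta0 = 0, so no horizontal shift
        print("theta1 =", theta1, "->", np.round(y, 3))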

Slide 12/24: Tanh Function in 2D X-Space
Such functions are often known as ridge functions, because they are constant along lines in input space where u = x^T θ = c.
[Figure: tanh(x^T θ) plotted over the 2D input space, constant along the lines x^T θ = c]

Slide 13/24: (0,1) Sigmoid
Many books/notes use the following sigmoid function:
    σ(u) = 1 / (1 + e^-u)
which has an output lying in the range (0,1). In these notes, we'll refer to both transformation functions as sigmoidal functions, because of their "lazy S" shape. In fact, they're just transformations (a scaling and shift) of each other:
    tanh(u) = 2σ(2u) - 1
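A quick numerical sanity check of the relation above (illustrative only; the grid of u values is arbitrary, and the lab still asks for the algebraic proof).

    import numpy as np

    u = np.linspace(-5.0, 5.0, 11)
    sigma_2u = 1.0 / (1.0 + np.exp(-2.0 * u))                  # (0,1) sigmoid evaluated at 2u
    print(np.max(np.abs(np.tanh(u) - (2.0 * sigma_2u - 1.0))))  # ~0, up to rounding error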

Slide 14/24: Sigmoidal Parameter Estimation
Gradient descent update for a single training datum: θ ← θ - η ∂J_i/∂θ
For the i-th training pattern: J_i = (y_i - f(u_i))^2 / 2, with u_i = x_i^T θ
Using the chain rule: ∂J_i/∂θ = -(y_i - f(u_i)) f'(u_i) x_i
Giving the update rule: θ ← θ + η (y_i - f(u_i)) f'(u_i) x_i
This is similar to the LMS rule, apart from the extra sigmoidal derivative term, f'().
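A minimal NumPy sketch (not part of the original slides) of this update rule for a single tanh node, trained on logical AND with the targets recoded to ±1 to match the tanh output range; the learning rate and epoch count are arbitrary choices.

    import numpy as np

    # logical AND with a bias input x0 = 1 and targets in {-1, +1}
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    y = np.array([-1, -1, -1, 1], dtype=float)

    theta = np.zeros(3)
    eta = 0.5
    for epoch in range(200):
        for x_i, y_i in zip(X, y):
            u = x_i @ theta
            f = np.tanh(u)
            f_prime = 1.0 - f**2                      # d tanh(u)/du
            theta += eta * (y_i - f) * f_prime * x_i  # update rule from the slide

    print("outputs:", np.round(np.tanh(X @ theta), 2))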

Slide 15/24: Sigmoidal Parameter Estimation (ii)
The sigmoidal function's derivative (tanh):
    df/du = 1 - tanh^2(u) = 1 - f^2(u)
[Figure: f(u) and its derivative df/du plotted against u]

Slide 16/24: Layered Perceptron Networks
In this section, we're going to consider how these sigmoidal nodes can be connected together into layers to give greater/more flexible non-linear modelling behaviour.
Two central questions:
1. What are the non-linear modelling capabilities?
2. How can the non-linear parameters be estimated?
[Figure: a two-layer network with inputs x0, x1, x2, hidden nodes h0, h1, h2 and output y]

Slide 17/24: Linearly Separable 2D Logical Functions
Note: class output values of 0 and 1 are used in the next few slides.
AND:  (x1, x2) -> y:  (0,0)->0  (0,1)->0  (1,0)->0  (1,1)->1
OR:   (x1, x2) -> y:  (0,0)->0  (0,1)->1  (1,0)->1  (1,1)->1
NOT:  x -> y:  0->1  1->0
Each of these functions is linearly separable, so a single Perceptron node can realize it.
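As an illustration, one possible set of hard-threshold Perceptron parameters for these functions is sketched below; these particular values are not taken from the slides, and many other choices work.

    import numpy as np

    def perceptron(x, theta):
        # hard-threshold node: output 1 if theta^T [1, x] > 0, else 0
        u = np.dot(theta, np.concatenate(([1.0], x)))
        return int(u > 0)

    theta_and = np.array([-1.5, 1.0, 1.0])   # fires only when x1 + x2 > 1.5
    theta_or  = np.array([-0.5, 1.0, 1.0])   # fires when x1 + x2 > 0.5
    theta_not = np.array([ 0.5, -1.0])       # one input: fires when x < 0.5

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        x = np.array([float(x1), float(x2)])
        print((x1, x2), "AND:", perceptron(x, theta_and), "OR:", perceptron(x, theta_or))
    print("NOT 0:", perceptron(np.array([0.0]), theta_not),
          "NOT 1:", perceptron(np.array([1.0]), theta_not))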

Slide 18/24: Nonlinearly Separable 2D XOR
eXclusive OR (XOR): n-bit parity with 2 inputs. Data is generated by:
    y = (NOT x2 AND x1) OR (NOT x1 AND x2)
A non-linear, polynomial input transformation, x3 = x1*x2, makes the problem linearly separable. How can multi-layer networks achieve the same thing?
[Figure: the XOR truth table and the four points plotted in the (x1, x2) plane]

Slide 19/24: Multi-Layer Network for 2D XOR
XOR can be implemented as a two-layer network (two layers of adjustable parameters) with two "hidden nodes" in the hidden layer:
- Empty circles represent linear Perceptron nodes
- Solid circles represent real signals
- Arrows represent model parameters θ
(NOT x2 AND x1) OR (NOT x1 AND x2) is represented in a 2-layer network as:
    h1: (NOT x2 AND x1)
    h2: (NOT x1 AND x2)
    y = h1 OR h2
[Figure: inputs x0 = 1, x1, x2 feed the hidden layer h1, h2 (plus bias h0 = 1), which feeds the output-layer node y]
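One possible parameter assignment for this two-layer XOR network is sketched below with hard-threshold nodes; the values are illustrative only (the exercise on Slide 20 asks you to derive your own).

    import numpy as np

    def step(u):
        # hard-threshold activation: 1 if u > 0, else 0
        return (u > 0).astype(float)

    # hidden layer: h1 = x1 AND NOT x2,  h2 = x2 AND NOT x1 (weights on [1, x1, x2])
    theta_h = np.array([[-0.5,  1.0, -1.0],
                        [-0.5, -1.0,  1.0]])
    # output layer: y = h1 OR h2 (weights on [1, h1, h2])
    theta_y = np.array([-0.5, 1.0, 1.0])

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        x = np.array([1.0, x1, x2])                       # x0 = 1 bias input
        h = step(theta_h @ x)                             # hidden-layer outputs
        y = step(theta_y @ np.concatenate(([1.0], h)))    # h0 = 1 bias
        print((x1, x2), "->", int(y))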

Slide 20/24: Exercise: Determine the 9 Parameters
Write down the parameter vectors for the 3 Perceptron nodes:
    h1: (NOT x2 AND x1)
    h2: (NOT x1 AND x2)
    y: h1 OR h2
[Figure: the same two-layer network as on the previous slide, with inputs x0 = 1, x1, x2, hidden nodes h0 = 1, h1, h2 and output y]

Slide 21/24: Logical Functions and DNF
Any logical function can be expressed as the union of "negation and conjunction" terms (disjunctive normal form), and so can be realized by a 2-layer Perceptron network:
- Each hidden-layer unit responds to exactly one positive example
- The output layer forms the union of the hidden-layer outputs: f = h1 OR h2 OR ... OR hP
Each data point/positive example is given its own "hidden unit", which responds to only that point. Essentially, the network memorizes the positive training samples.
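A sketch (not from the slides) of one such construction for inputs in {0,1}^n: each positive example p gets a hidden unit that fires only when x = p, and the output unit ORs the hidden units together. The particular thresholds and the 3-bit parity example are illustrative choices.

    import numpy as np

    def dnf_network(positive_examples):
        # one hidden unit per positive example p: u = sum_i (2*p_i - 1)*x_i is maximised
        # (value sum(p)) only at x = p, so threshold it at sum(p) - 0.5
        theta_h = []
        for p in positive_examples:
            p = np.asarray(p, dtype=float)
            w = 2.0 * p - 1.0                                    # +1 where p_i = 1, -1 where p_i = 0
            theta_h.append(np.concatenate(([0.5 - p.sum()], w)))  # bias, then input weights
        theta_h = np.array(theta_h)
        theta_y = np.concatenate(([-0.5], np.ones(len(positive_examples))))  # OR unit
        return theta_h, theta_y

    def evaluate(theta_h, theta_y, x):
        h = ((theta_h @ np.concatenate(([1.0], x))) > 0).astype(float)
        return float((theta_y @ np.concatenate(([1.0], h))) > 0)

    # example: 3-bit parity, whose positive examples have an odd number of 1s
    positives = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
    theta_h, theta_y = dnf_network(positives)
    for x in [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]:
        print(x, "->", evaluate(theta_h, theta_y, np.array(x, dtype=float)))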

Slide 22/24: Lecture 7&8: Conclusions
There are many ways to build and use non-linear models for classification and regression purposes. They can potentially give more accurate predictions/fewer errors if the data is generated by a non-linear relationship.
However, parameter estimation is sometimes more complex:
- No direct optimal parameter calculation
- Gradient-based estimation has local minima and differing curvatures
An appropriate non-linear framework needs to be selected. Multi-layer (sigmoidal) Perceptrons are one such framework:
- Non-linearity is controlled by the nodes in the hidden layer
- Parameters are estimated using gradient descent
- Several factors need to be considered

Slide 23/24: Lecture 7&8: Laboratory (i)
Matlab
1. Extend the basic Perceptron Matlab script so that it now trains up a quadratic classifier (note that the plotting routines will no longer be appropriate).
2. Implement the sigmoidal perceptron learning algorithm, where the model consists of a single layer with a tanh activation function and the parameters are updated after each presentation of a datum (see Slides 10-14).
3. Test the algorithm on the logical AND and logical OR data, as you did for the normal Perceptron algorithm in the laboratory in IS2.ppt.
4. What are the similarities/differences of this model compared to the normal Perceptron algorithm described in IS2.ppt?

Slide 24/24: Lecture 7&8: Laboratory (ii)
Theory
1. Prove the relationship on Slide 13 between the two types of sigmoids.
2. Verify the derivative of the tanh function on Slide 15, and prove that the derivative of the (0,1) sigmoid on Slide 13 can be expressed as y(1-y).
3. Calculate the optimal parameter values missing on Slides 17 and 20.
4. Derive a generic rule for setting the parameter values on Slide 21 for an arbitrary logical function. You may assume that you know the number of positive examples, the number of features and the logical structure of each positive example.