Lecture 2: Basics and definitions
Networks as Data Models

Last lecture: an artificial neuron
[Figure: a single artificial neuron with inputs x_1, x_2, ..., x_m, a bias input x_0 = 1 weighted by b_1, weights w_11, w_12, ..., w_1m, and outputs y_1, ..., y_i.]

Thus the artificial neuron is defined by the following components:
1. A set of inputs, x_i
2. A set of weights, w_ij
3. A bias, b_i
4. An activation function, f
5. The neuron output, y
The subscript i indicates the i-th input or weight. As the inputs and output are external, the parameters of this model are the weights, the bias and the activation function, and these therefore DEFINE the model.
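As a concrete illustration, here is a minimal Python sketch of this definition; the sigmoid activation and the particular weight and input values are assumptions made for the example, not part of the lecture's model:

```python
import numpy as np

def sigmoid(a):
    """One possible activation function f."""
    return 1.0 / (1.0 + np.exp(-a))

def neuron_output(x, w, b, f=sigmoid):
    """Output of one artificial neuron: y = f(sum_j w_j * x_j + b)."""
    return f(np.dot(w, x) + b)

# Example: 3 inputs with arbitrary weights and bias
x = np.array([0.5, -1.0, 2.0])   # inputs x_1..x_m
w = np.array([0.1, 0.4, -0.3])   # weights w_11..w_1m
b = 0.2                          # bias b_1
print(neuron_output(x, w, b))
```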

[Figure: a multi-layer network with an input layer x_1, x_2, ..., x_n (visual input), one or more hidden layers, and an output layer (motor output).]
More layers means more of the same parameters (and several more subscripts).

Network as a data model
– A network can be viewed as a model which has a set of parameters associated with it.
– Networks transform input data into an output.
– The transformation is defined by the network parameters.
– The parameters are set/adapted by an optimisation/adaptive procedure: 'learning' (Haykin, 99).
– The idea is that, given a set of data points, the network (model) can be trained so as to generalise ...

NNs for function approximation
... that is, the network learns a ('correct') mapping from inputs to outputs.
Thus NNs can be seen as a multivariate non-linear mapping and are often used for function approximation.
2 main categories:
– Classification: given an input, say which class it is in
– Regression: given an input, what is the expected output

LEARNING: extracting principles from data
The mapping/function needs to be learnt, and various methods are available; the learning process used shapes the final solution.
– Supervised learning: have a teacher, telling you where to go
– Unsupervised learning: no teacher, the net learns by itself
– Reinforcement learning: have a critic, telling you only whether you are wrong or correct
The type of learning used depends on the task at hand. We will deal mainly with supervised and unsupervised learning. Reinforcement learning is taught in the Adaptive Systems course, or can be found in e.g. Haykin, Hertz et al., or Sutton R.S. and Barto A.G. (1998): Reinforcement Learning: An Introduction, MIT Press.

Pattern recognition
Pattern: the opposite of chaos; an entity, vaguely defined, that could be given a name or a classification.
Examples: fingerprints, handwritten characters, human faces, speech (or deer/whale/bat etc.) signals, iris patterns, medical imaging (various screening procedures), remote sensing, and so on.

Given a pattern, there are two broad approaches:
a. supervised classification (discriminant analysis), in which the input pattern is identified as a member of a predefined class
b. unsupervised classification (e.g. clustering), in which the pattern is assigned to a hitherto unknown class
Unsupervised methods will be discussed further in future lectures.

E.g. handwritten character classification: first we need a data set to learn from: sets of characters.
How are they represented? E.g. as an input vector x = (x_1, ..., x_n) to the network (e.g. a vector of ones and zeroes, one per pixel, according to whether that pixel is black or white).
The set of input vectors is our Training Set X, which has already been classified into a's and b's (note: capitals for the set, X; underlined small letters for an instance of the set, x_i, i.e. the i-th training pattern/vector).
Given a training set X, our goal is to tell whether a new image is an a or a b, i.e. classify it into one of 2 classes, C_1 (all a's) or C_2 (all b's) (in general, one of k classes C_1 ... C_k).
[Figure: example images of a handwritten 'a' and 'b'.]
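A minimal sketch of this representation, assuming a small binary image stored as a 2-D array (the pixel values here are made up purely for illustration):

```python
import numpy as np

# A toy 4x4 "image" of a character: 1 = black pixel, 0 = white pixel
image = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 1],
                  [1, 1, 1, 1],
                  [1, 0, 0, 1]])

# Flatten into an input vector x = (x_1, ..., x_n), n = number of pixels
x = image.flatten()
print(x.shape)  # (16,) -- one input per pixel
```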

Generalisation
Q. How do we tell if a new, unseen image is an a or a b?
A. Brute force: have a library of all possible images.
But with 256 x 256 binary pixels there are 2^(256 x 256), roughly 10^19,700, possible images. Impossible! Typically we have fewer than a few thousand images in the training set.
Therefore, the system must be able to classify UNSEEN patterns from the patterns it has seen, i.e. it must be able to generalise from the data in the training set.
Intuition: real neural networks do this well, so maybe artificial ones can do the same. As they are also shaped by experience, maybe we'll also learn something about how the brain does it ...

For 2-class classification we want the network output y (a function of the inputs and the network parameters) to be:
y(x, w) = 1   if x is an a
y(x, w) = -1  if x is a b
where x is an input vector and the network parameters are grouped as a vector w.
y is known as a discriminant function: it discriminates between the 2 classes.
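As one concrete instance, a minimal sketch of a linear discriminant of this form; the linear shape and the weight values are assumptions for illustration, not the only possible choice:

```python
import numpy as np

def discriminant(x, w, b):
    """Return +1 (class C1, an 'a') or -1 (class C2, a 'b')."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Toy example with a 2-dimensional input vector
w = np.array([0.8, -0.5])
b = 0.1
print(discriminant(np.array([1.0, 0.2]), w, b))   # +1 -> class C1
print(discriminant(np.array([-0.3, 1.5]), w, b))  # -1 -> class C2
```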

As the network mapping is defined by the parameters, we must use the data set to perform Learning (training, adaptation), i.e. change the weights or the interactions between neurons according to the training examples (and possibly prior knowledge of the problem).
The purpose of learning is to minimise:
– training errors on the learning data: the learning error
– prediction errors on new, unseen data: the generalisation error
since when these errors are minimised the network discriminates between the 2 classes.
We therefore need an error function to measure the network's performance based on the training error. An optimisation algorithm can then be used to minimise the learning error and train the network.
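As an illustration, one common choice of error function (not the only one) is the sum of squared errors over the training set. The sketch below, with made-up data and a plain gradient-descent update on a linear model, is only meant to show the idea of minimising a training error:

```python
import numpy as np

def sum_squared_error(w, b, X, t):
    """Training error: sum over patterns of (output - target)^2."""
    y = X @ w + b                      # linear network outputs, one per pattern
    return np.sum((y - t) ** 2)

# Toy training set: 4 patterns, 2 features, targets +1 / -1
X = np.array([[1.0, 0.2], [0.9, 0.1], [-0.3, 1.5], [-0.5, 1.2]])
t = np.array([1, 1, -1, -1])

w, b, lr = np.zeros(2), 0.0, 0.05
for _ in range(200):                   # simple gradient descent on the error
    y = X @ w + b
    grad_w = 2 * X.T @ (y - t)
    grad_b = 2 * np.sum(y - t)
    w -= lr * grad_w
    b -= lr * grad_b

print(sum_squared_error(w, b, X, t))   # error decreases as training proceeds
```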

Feature Extraction
However, if we use all the pixels as inputs we are going to have a long training procedure and a very big network.
We may want to analyse the data first (pre-process it) and extract some (lower-dimensional) salient features to be the inputs to the network.
[Figure: feature extraction maps the pattern space (data) x to a feature space x*.]
E.g. we could use the ratio of the height and width of the letter, since b's will tend to be taller than a's (prior knowledge). This feature is also scale invariant.
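A minimal sketch of such a feature extractor, assuming the same kind of binary image array as in the earlier example:

```python
import numpy as np

def height_width_ratio(image):
    """Feature x*: height / width of the bounding box of the black (1) pixels."""
    rows = np.flatnonzero(np.any(image == 1, axis=1))
    cols = np.flatnonzero(np.any(image == 1, axis=0))
    height = rows[-1] - rows[0] + 1
    width = cols[-1] - cols[0] + 1
    return height / width

# A tall, narrow character should give a ratio > 1
tall = np.zeros((8, 8), dtype=int)
tall[1:7, 3:5] = 1
print(height_width_ratio(tall))  # 6 / 2 = 3.0
```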

We could then make a decision based on this feature. Suppose we make a histogram of the values of x* for the input vectors in the training set X.
[Figure: histograms of x* for the two classes C_1 and C_2, with a new input value marked at A.]
For a new input with an x* value of A we would classify it as C_1, as it is more likely to belong to this class.

We therefore get the idea of a Decision Boundary: points on one side of the boundary are in one class, and points on the other side are in the other class, i.e.
if x* < d the pattern is in C_1, else it is in C_2.
Intuitively it makes sense (and is optimal in a Bayesian sense) to place the boundary where the 2 histograms cross.
[Figure: the two class histograms of x*, with the decision boundary x* = d marked where they cross.]
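A rough sketch of choosing d where the two class histograms cross; the Gaussian-distributed feature values are synthetic and purely illustrative:

```python
import numpy as np

# Synthetic x* values for the two classes (illustrative only)
rng = np.random.default_rng(0)
x_c1 = rng.normal(loc=1.0, scale=0.3, size=500)   # e.g. a's: smaller ratio
x_c2 = rng.normal(loc=2.0, scale=0.4, size=500)   # e.g. b's: larger ratio

# Histogram both classes on a common set of bins
bins = np.linspace(0, 3.5, 50)
h1, _ = np.histogram(x_c1, bins=bins)
h2, _ = np.histogram(x_c2, bins=bins)

# Decision boundary d: first bin above the class-1 mean where C2 counts overtake C1 counts
centres = 0.5 * (bins[:-1] + bins[1:])
d = centres[(centres > x_c1.mean()) & (h2 >= h1)][0]
print("decision boundary d ~", d)

def classify(x_star, d=d):
    return "C1" if x_star < d else "C2"
```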

We can then view pattern recognition as the process of assigning patterns to one of a number of classes by dividing up the feature space with decision boundaries, which in turn divides the original space.
[Figure: pattern space (data) -> feature extraction -> feature space -> classification -> decision space.]

However, there can be a lot of overlap, in which case we could use a rejection threshold e, where
if x* < d - e the pattern is in C_1
if x* > d + e the pattern is in C_2
else refer the pattern to a better/different classifier.
This is related to the idea of minimising Risk, where it may be more important not to misclassify in one class than in the other; this is especially important in medical applications. It can serve to shift the decision boundary one way or the other, based on the Loss function, which defines the relative importance/cost of the different errors.
[Figure: the two class histograms with the overlapping region around the boundary marked '?'.]
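Continuing the sketch above, a classifier with a reject option might look like this; the values of d and e are assumed for illustration:

```python
def classify_with_reject(x_star, d=1.5, e=0.2):
    """Return 'C1', 'C2', or 'reject' for values too close to the boundary."""
    if x_star < d - e:
        return "C1"
    if x_star > d + e:
        return "C2"
    return "reject"   # pass to a better/different classifier

print(classify_with_reject(1.0))   # C1
print(classify_with_reject(1.55))  # reject
print(classify_with_reject(2.3))   # C2
```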

Alternatively, we can use more features.
[Figure: training points plotted in a 2-D feature space (x_1*, x_2*).]
Here, use of any one feature leads to significant overlap (imagine projections onto the axes), but use of both gives a good separation.
However, we cannot keep increasing the number of features: there comes a point where performance starts to degrade because there is not enough data to provide a good estimate (cf. using 256 x 256 pixels).

Curse of dimensionality
Geometric e.g.: suppose we want to approximate a 1-d function y from m-dimensional training data. We could:
– divide each dimension into intervals (like a histogram)
– set the y value for an interval to the mean y value of all points in the interval
– increase precision by increasing the number of intervals
– however, we need at least 1 point in each interval
– for k intervals in each dimension we need > k^m data points
– thus the number of data points grows at least exponentially with the input dimension
This is known as the Curse of Dimensionality: "A function defined in high dimensional space is likely to be much more complex than a function defined in a lower dimensional space and those complications are harder to discern" (Friedman 95, in Haykin, 99).
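A tiny illustration of how quickly the k^m requirement blows up; the choices of k and m are arbitrary:

```python
# Minimum number of grid cells (hence data points) for k intervals per dimension
def min_points(k, m):
    return k ** m

for m in (1, 2, 5, 10):
    print(f"m = {m:2d} dimensions, k = 10 intervals -> {min_points(10, m):,} points")
# m = 1 -> 10;  m = 2 -> 100;  m = 5 -> 100,000;  m = 10 -> 10,000,000,000
```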

Of course, the above is a particularly inefficient way of using data, and most NNs are less susceptible. However, the only practical way to beat the curse is to incorporate correct prior knowledge.
In practice, we must make the underlying function smoother (i.e. less complex) with increasing input dimensionality, and also try to reduce the input dimension by pre-processing.
Mainly, we learn to live with the fact that perfect performance is not possible: data in the real world sometimes overlap. We treat the input data as random variables and instead look for a model which has the smallest probability of making a mistake.

Multivariate regression
This is a type of function approximation: we try to approximate a function from a set of (noisy) training data.
E.g. suppose we have the function y = sin(2πx). We generate training data at equal intervals of x and add a little random Gaussian noise with a small standard deviation. We add noise because in practical applications the data will inevitably be noisy.
We then test the model by plugging in many values of x and viewing the resulting function. This gives an idea of the Generalisation performance of the model.

E.g.: suppose we have the function y = sin(2πx). We generate training data at equal intervals of x (red circles in the figure) and add a little random Gaussian noise. The model is then trained on this data.
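A minimal sketch of generating such a training set; the number of points and the noise level are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_train = 10
x_train = np.linspace(0.0, 1.0, n_train)                                    # equally spaced inputs
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=n_train)  # noisy targets
```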

We then test the model (in this case a piecewise linear model) by plugging in many values of x and viewing the resulting function (solid blue line in the figure). This gives an idea of the Generalisation performance of the model.

Model Complexity
In the previous picture we used a piecewise linear function to approximate the data. Better to use a polynomial, y = Σ_i a_i x^i, to approximate the data, i.e.:
y = a_0 + a_1 x                                  1st order (straight line)
y = a_0 + a_1 x + a_2 x^2                        2nd order (quadratic)
y = a_0 + a_1 x + a_2 x^2 + a_3 x^3              3rd order
y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + ... + a_n x^n    nth order
As the order (the highest power of x) increases, so does the potential complexity of the model/polynomial. This means it can represent a more complex (non-smooth) function and thus approximate the data more accurately.
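A sketch of fitting polynomials of increasing order to the kind of noisy training data generated above; numpy's polyfit is used here as one convenient way to do the least-squares fit, and the data sizes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0.0, 1.0, 11)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=11)

for order in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, deg=order)   # least-squares fit of a_0..a_n
    y_fit = np.polyval(coeffs, x_train)
    train_err = np.sqrt(np.mean((y_fit - y_train) ** 2))
    print(f"order {order:2d}: RMS training error = {train_err:.3f}")
# Training error keeps falling with order, but the 10th-order fit is wildly non-smooth
```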

[Figure: three fits to the same data.]
– 1st order: model too simple
– 10th order: more accurate in terms of passing through the data points, but too complex and non-smooth (curvy)
– 3rd order: models the underlying function well

Note, though, that the training error continues to go down as the model matches the fine-scale detail of the data (i.e. the noise). We would rather model the intrinsic dimensionality of the data; otherwise we get the problem of overfitting.
This is analogous to the problem of overtraining, where a model is trained for too long, models the data too exactly, and loses its generality.
As the model complexity grows, performance improves for a while but starts to degrade after reaching an optimal level.
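To see the effect numerically, one can compare the training error with the error on fresh data drawn from the same function; the sketch below reuses the toy sin(2πx) setup from the earlier examples, with assumed sample sizes and noise level:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = np.linspace(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

x_train, y_train = make_data(11)
x_test, y_test = make_data(100)          # unseen data for the generalisation error

def rms(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

for order in range(1, 11):
    coeffs = np.polyfit(x_train, y_train, deg=order)
    train_err = rms(y_train, np.polyval(coeffs, x_train))
    test_err = rms(y_test, np.polyval(coeffs, x_test))
    print(f"order {order:2d}: train {train_err:.3f}  test {test_err:.3f}")
# Training error falls monotonically; test error falls for a while, then rises
# again as the high-order models start fitting the noise.
```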

Similar problems occur in classification: a model with too much flexibility does not generalise well, resulting in a non-smooth decision boundary. This is somewhat like giving a system enough capacity to 'remember' all the training points, so there is no need to generalise; with less memory it must generalise in order to model the training data.
There is a trade-off between being a good fit to the training data and achieving good generalisation: cf. the Bias-Variance trade-off (later).