EE645 Neural Networks and Learning Theory

EE645 Neural Networks and Learning Theory Spring 2003 Prof. Anthony Kuh Dept. of Elec. Eng. University of Hawaii Phone: (808)-956-7527, Fax: (808)-956-3427 Email: kuh@spectra.eng.hawaii.edu

I. Introduction to neural networks Goal: study computational capabilities of neural network and learning systems. Multidisciplinary field Algorithms, Analysis, Applications

A. Motivation Why study neural networks and machine learning? Biological inspiration (natural computation) Nonparametric models: adaptive learning systems, learning from examples, analysis of learning models Implementation Applications Cognitive (Human vs. Computer Intelligence): Humans superior to computers in pattern recognition, associative recall, learning complex tasks. Computers superior to humans in arithmetic computations, simple repeatable tasks. Biological: (study human brain) 10^10 to 10^11 neurons in cerebral cortex with an average of 10^3 interconnections per neuron.

A neuron Schematic of one neuron

Neural Network Connection of many neurons together forms a neural network. Neural network properties: Highly parallel (distributed computing) Robust and fault tolerant Flexible (short and long term learning) Handles variety of information (often random, fuzzy, and inconsistent) Small, compact, dissipates very little power

B. Single Neuron (Computational node) [Diagram: inputs x are weighted by w and summed (Σ) to give s; g(·) maps s to the output y.] $s = w^T x + w_0$: synaptic strength (linearly weighted sum of inputs). $y = g(s)$: g is the activation or squashing function.

Activation functions Linear units: g(s) = s. Linear threshold units: g(s) = sgn (s). Sigmoidal units: g(s) = tanh (Bs), B >0. Neural networks generally have nonlinear activation functions. Most popular models: linear threshold units and sigmoidal units. Other types of computational units : receptive units (radial basis functions).
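The single computational node and the three activation functions listed above can be sketched in a few lines. This is only an illustration: the function names, the example inputs, and the default B = 1 are assumptions made here, not part of the course notes.

```python
import math

def neuron(x, w, w0, g):
    """Single computational node: s = w^T x + w0, y = g(s)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return g(s)

def linear(s):
    # Linear unit: g(s) = s
    return s

def sgn(s):
    # Linear threshold unit; sgn(0) is taken as +1, as in section II of the notes
    return 1.0 if s >= 0 else -1.0

def sigmoidal(s, B=1.0):
    # Sigmoidal unit: g(s) = tanh(B s), B > 0 (B = 1 is an arbitrary default)
    return math.tanh(B * s)

# Hypothetical example values
x = [1.0, -2.0]
w = [0.5, 0.25]
w0 = 0.1
y_lin = neuron(x, w, w0, linear)      # s = 0.5 - 0.5 + 0.1
y_thr = neuron(x, w, w0, sgn)
y_sig = neuron(x, w, w0, sigmoidal)
```

The same weighted sum s feeds all three units; only the squashing function changes.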

C. Neural Network Architectures Systems composed of interconnected neurons. [Diagram: network of neurons with inputs and output.] Neural network represented by directed graph: edges represent weights and nodes represent computational units.

Definitions Feedforward neural network has no loops in directed graph. Neural networks are often arranged in layers. Single layer feedforward neural network has one layer of computational nodes. Multilayer feedforward neural network has two or more layers of computational nodes. Computational nodes that are not output nodes are called hidden units.

D. Learning and Information Storage Neural networks have computational capabilities. Where is information stored in a neural network? What are parameters of neural network? How does a neural network work? (two phases) Training or learning phase (equivalent to write phase in conventional computer memory): weights are adjusted to meet certain desired criterion. Recall or test phase (equivalent to read phase in conventional computer memory): weights are fixed as neural network realizes some task.

Learning and Information (continued) 3) What can neural network models learn? Boolean functions Pattern recognition problems Function approximation Dynamical systems 4) What type of learning algorithms are there? Supervised learning (learning with a teacher) Unsupervised learning (no teacher) Reinforcement learning (learning with a critic)

Learning and Information (continued) 5) How do neural networks learn? Iterative algorithm: weights of neural network are adjusted on-line as training data is received. $w(k+1) = L(w(k), x(k), d(k))$ for supervised learning, where d(k) is desired output. Need cost criterion: common cost criterion is Mean Squared Error; for one output, $J(w) = \sum_k (y(k) - d(k))^2$. Goal is to find minimum J(w) over all possible w. Iterative techniques often use gradient descent approaches.

Learning and Information (continued) 6) Learning and Generalization Learning algorithm takes training examples as inputs and produces concept, pattern or function to be learned. How good is learning algorithm? Generalization ability measures how well learning algorithm performs. Sufficient number of training examples. (LLN, typical sequences) Occam’s razor: “simplest explanation is the best”. [Figure: regression problem, data points marked +.]

Learning and Information (continued) Generalization error: $\epsilon_g = \epsilon_{emp} + \epsilon_{model}$. Empirical error: average error from training data (desired output vs. actual output). Model error: due to dimensionality of class of functions or patterns. Desire class to be large enough so that empirical error is small and small enough so that model error is small.

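II. Linear threshold units A. Preliminaries [Diagram: inputs x are weighted by w and summed (Σ) to give s; sgn(·) maps s to the output y.] $\mathrm{sgn}(s) = 1$ if $s \ge 0$, and $-1$ if $s < 0$.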

Linearly separable Consider a set of points with two labels: + and o. Set of points is linearly separable if a linear threshold function can partition the + points from the o points. [Figure: set of linearly separable points.]

Not linearly separable A set of labeled points that cannot be partitioned by a linear threshold function is not linearly separable. [Figure: set of points that are not linearly separable.]

B. Perceptron Learning Algorithm An iterative learning algorithm that can find a linear threshold function to partition two sets of points. 1) w(0) arbitrary. 2) Pick point (x(k), d(k)). 3) If $w(k)^T x(k) d(k) > 0$ go to 5). 4) $w(k+1) = w(k) + x(k) d(k)$. 5) k = k+1, check if cycled through data, if not go to 2). 6) Otherwise stop.
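A minimal sketch of the perceptron learning algorithm, under a few assumptions that are not in the notes: w(0) is set to zero, the threshold is handled by appending a constant input of 1, "cycled through data" is read as a full pass with no corrections, and `max_passes` is a hypothetical safety limit for data that is not linearly separable.

```python
def perceptron(points, max_passes=1000):
    """points: list of (x, d), x a list of floats, d in {+1, -1}."""
    # Append a constant input of 1 so the last weight acts as the threshold
    data = [(list(x) + [1.0], d) for x, d in points]
    w = [0.0] * len(data[0][0])            # w(0) arbitrary; zero chosen here
    for _ in range(max_passes):
        updated = False
        for x, d in data:
            s = sum(wi * xi for wi, xi in zip(w, x))
            if s * d <= 0:                 # misclassified or on the boundary
                w = [wi + d * xi for wi, xi in zip(w, x)]   # w <- w + x d
                updated = True
        if not updated:                    # full pass without corrections
            return w
    raise RuntimeError("no separating weights found within max_passes")

# Hypothetical linearly separable data set
points = [([2.0, 1.0], 1), ([1.0, 3.0], 1),
          ([-1.0, -1.5], -1), ([-2.0, 0.5], -1)]
w = perceptron(points)
```

On separable data the loop stops after finitely many corrections, which is what the perceptron convergence theorem guarantees.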

PLA comments Perceptron convergence theorem (requires margins) Sketch of proof Updating threshold weights Algorithm is based on cost function J(w) = - (sum of synaptic strengths of misclassified points), with update $w(k+1) = w(k) - \mu(k) \nabla J(w(k))$ (gradient descent)

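Perceptron Convergence Theorem Assumptions: w* is a solution with $\|w^*\| = 1$, no threshold, and $w(0) = 0$. Let $\max \|x(k)\| = \beta$ and $\min y(k) x(k)^T w^* = \gamma$. Then $\langle w(k), w^* \rangle = \langle w(k-1) + x(k-1) y(k-1), w^* \rangle \ge \langle w(k-1), w^* \rangle + \gamma \ge k\gamma$, and $\|w(k)\|^2 \le \|w(k-1)\|^2 + \|x(k-1)\|^2 \le \|w(k-1)\|^2 + \beta^2 \le k\beta^2$. Since $\langle w(k), w^* \rangle \le \|w(k)\|$, this gives $k\gamma \le \sqrt{k}\,\beta$, which implies that $k \le (\beta/\gamma)^2$ (max number of updates).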

III. Linear Units A. Preliminaries [Diagram: inputs x are weighted by w and summed (Σ); the output is $y = s$.]

Model Assumptions and Parameters Training examples (x(k), d(k)) drawn randomly. Parameters: Inputs: x(k); Outputs: y(k); Desired outputs: d(k); Weights: w(k); Error: e(k) = d(k) - y(k). Error criterion (MSE): $\min J(w) = E[.5\,e(k)^2]$

Wiener solution Define $P = E(x(k)d(k))$ and $R = E(x(k)x(k)^T)$. $J(w) = .5E[(d(k)-y(k))^2] = .5E(d(k)^2) - E(x(k)d(k))^T w + .5 w^T E(x(k)x(k)^T) w = .5E[d(k)^2] - P^T w + .5 w^T R w$. Note J(w) is a quadratic function of w. To minimize J(w), find the gradient $\nabla J(w)$ and set it to 0: $\nabla J(w) = -P + Rw = 0$, so $Rw = P$ (Wiener solution). If R is nonsingular, then $w = R^{-1}P$. Resulting MSE $= .5E[d(k)^2] - .5P^T R^{-1} P$.
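As a small numerical check of the Wiener solution, the sketch below solves Rw = P for a 2-by-2 case and evaluates the quadratic cost at the solution. The values of R, P and E[d^2] are made up for illustration, and the helper names are hypothetical. Substituting $w = R^{-1}P$ into the quadratic form gives a minimum cost of $.5E[d^2] - .5P^T R^{-1} P$, which the code compares against a direct evaluation of J.

```python
def wiener_2x2(R, P):
    """Solve R w = P for a nonsingular 2x2 R using Cramer's rule."""
    (a, b), (c, d) = R
    det = a * d - b * c
    if det == 0:
        raise ValueError("R is singular")
    return [(d * P[0] - b * P[1]) / det, (a * P[1] - c * P[0]) / det]

def cost(w, R, P, Ed2):
    """J(w) = .5 E[d^2] - P^T w + .5 w^T R w."""
    Pw = sum(p * wi for p, wi in zip(P, w))
    wRw = sum(w[i] * R[i][j] * w[j] for i in range(2) for j in range(2))
    return 0.5 * Ed2 - Pw + 0.5 * wRw

R = [[2.0, 1.0], [1.0, 2.0]]    # assumed E[x x^T]
P = [1.0, 0.0]                  # assumed E[x d]
Ed2 = 1.0                       # assumed E[d^2]

w_star = wiener_2x2(R, P)
J_min = cost(w_star, R, P, Ed2)
# Closed form at the optimum: .5 E[d^2] - .5 P^T R^{-1} P
J_closed = 0.5 * Ed2 - 0.5 * sum(p * wi for p, wi in zip(P, w_star))
# Any other weight vector should give a larger cost
J_other = cost([w_star[0] + 0.1, w_star[1] - 0.2], R, P, Ed2)
```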

Iterative algorithms Steepest descent algorithm (move in direction of negative gradient): $w(k+1) = w(k) - \mu \nabla J(w(k)) = w(k) + \mu (P - R w(k))$. Least mean square algorithm (approximate gradient from training example): $\hat{\nabla} J(w(k)) = -e(k) x(k)$, so $w(k+1) = w(k) + \mu\, e(k) x(k)$.
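The LMS update can be sketched as follows, assuming noise-free training data generated from a hypothetical weight vector `w_true`, so that the weights should settle close to it. The step size, the random seed, and the number of samples are arbitrary choices for illustration.

```python
import random

def lms(samples, mu, n_weights):
    """Least mean square: w(k+1) = w(k) + mu e(k) x(k)."""
    w = [0.0] * n_weights
    for x, d in samples:
        y = sum(wi * xi for wi, xi in zip(w, x))   # linear unit output
        e = d - y                                  # error e(k) = d(k) - y(k)
        w = [wi + mu * e * xi for wi, xi in zip(w, x)]
    return w

rng = random.Random(0)
w_true = [1.5, -0.5]            # hypothetical weights that generate d(k)
samples = []
for _ in range(2000):
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    d = sum(wt * xi for wt, xi in zip(w_true, x))
    samples.append((x, d))

w_lms = lms(samples, mu=0.05, n_weights=2)
```

Each update uses a single training example, so no estimate of R or P is needed.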

Steepest Descent Convergence $w(k+1) = w(k) + \mu (P - R w(k))$; let w* be the solution. Center the weight vector: $v = w - w^*$, so $v(k+1) = v(k) - \mu R v(k)$ (using $P = R w^*$); assume R is nonsingular. Decorrelate the weight vector: $u = Q^{-1} v$, where $R = Q \Lambda Q^{-1}$ is the transformation that diagonalizes R. Then $u(k+1) = (I - \mu \Lambda) u(k)$, so $u(k) = (I - \mu \Lambda)^k u(0)$. Condition for convergence: $0 < \mu < 2/\lambda_{max}$.
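The step size condition can be illustrated numerically. In the sketch below, R, P, and both step sizes are made-up values: R has eigenvalues 1 and 3, so the bound 2/λ_max is 2/3. One run uses a step size inside the bound and the other a step size just outside it.

```python
def steepest_descent(R, P, mu, steps):
    """Iterate w(k+1) = w(k) + mu (P - R w(k)) starting from w(0) = 0."""
    w = [0.0, 0.0]
    for _ in range(steps):
        Rw = [sum(R[i][j] * w[j] for j in range(2)) for i in range(2)]
        w = [w[i] + mu * (P[i] - Rw[i]) for i in range(2)]
    return w

R = [[2.0, 1.0], [1.0, 2.0]]          # eigenvalues 1 and 3
P = [1.0, 0.0]
w_star = [2.0 / 3.0, -1.0 / 3.0]      # solves R w = P

def distance(w):
    """Euclidean distance from w to the Wiener solution."""
    return sum((a - b) ** 2 for a, b in zip(w, w_star)) ** 0.5

err_inside = distance(steepest_descent(R, P, mu=0.5, steps=200))   # 0.5 < 2/3
err_outside = distance(steepest_descent(R, P, mu=0.7, steps=200))  # 0.7 > 2/3
```

With mu = 0.7 the mode associated with the eigenvalue 3 is multiplied by |1 - 0.7 * 3| = 1.1 at every step, so the error grows instead of shrinking.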

LMS Algorithm Properties Steepest Descent and LMS algorithm convergence depends on step size μ and eigenvalues of R. LMS algorithm is simple to implement. LMS algorithm convergence is relatively slow. Tradeoff between convergence speed and excess MSE. LMS algorithm can track training data that is time varying.

Adaptive MMSE Methods Training data Blind algorithms Linear MMSE: LMS, RLS algorithms Nonlinear Decision feedback detectors Blind algorithms Second order statistics Minimum Output Energy Methods Reduced order approximations: PCA, multistage Wiener Filter Higher order statistics Cumulants, Information based criteria

Designing a learning system Given a set of training data, design a system that can realize the desired task. [Block diagram: Inputs → Signal Processing → Feature Extraction → Neural Network → Outputs.]

IV. Multilayer Networks A. Capabilities Depend directly on total number of weights and threshold values. A one hidden layer network with sufficient number of hidden units can arbitrarily approximate any boolean function, pattern recognition problems, and well behaved function approximation problems. Sigmoidal units more powerful than linear threshold units.

B. Error backpropagation Error backpropagation algorithm: methodical way of implementing LMS algorithm for multilayer neural networks. Two passes: forward pass (computational pass), backward pass (weight correction pass). Analog computations based on MSE criterion. Hidden units usually sigmoidal units. Initialization: weights take on small random values. Algorithm may not converge to global minimum. Algorithm converges slower than for linear networks. Representation is distributed.

BP Algorithm Comments δs are error terms computed from output layer back to first layer in dual network. Training is usually done online. Examples presented in random or sequential order. Update rule is local as weight changes only involve connections to weight. Computational complexity depends on number of computational units. Initial weights randomized to avoid converging to local minima.

BP Algorithm Comment continued Threshold weights updated in similar manner to other weights (input =1). Momentum term added to speed up convergence. Step size set to small value. Sigmoidal activation derivatives simple to compute.

BP Architecture [Diagram: two coupled networks.] Forward network: outputs of computational units calculated. Sensitivity network: error terms calculated.
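The two passes can be sketched for a network with one hidden layer of tanh units and a single tanh output, trained online on the MSE criterion. Everything specific here is an assumption made for illustration: the network size, step size, number of epochs, seed, and the XOR-style training set. Sign convention: `backprop` returns gradients of J, so weights move against them.

```python
import math
import random

def forward(W1, W2, x):
    """Forward pass; a constant 1 is appended so the last weight is the threshold."""
    xb = list(x) + [1.0]
    h = [math.tanh(sum(w * v for w, v in zip(row, xb))) for row in W1]
    hb = h + [1.0]
    y = math.tanh(sum(w * v for w, v in zip(W2, hb)))
    return xb, hb, y

def loss(W1, W2, x, d):
    """Cost for one example: J = .5 (d - y)^2."""
    return 0.5 * (d - forward(W1, W2, x)[2]) ** 2

def backprop(W1, W2, x, d):
    """Backward pass: return (dJ/dW1, dJ/dW2) for one example."""
    xb, hb, y = forward(W1, W2, x)
    # Output error term; the derivative of tanh is 1 - tanh^2
    delta_out = -(d - y) * (1.0 - y * y)
    # Hidden error terms, propagated back through the output weights
    delta_hid = [delta_out * W2[j] * (1.0 - hb[j] ** 2) for j in range(len(W1))]
    gW2 = [delta_out * v for v in hb]
    gW1 = [[dj * v for v in xb] for dj in delta_hid]
    return gW1, gW2

def train(data, n_hidden=3, mu=0.1, epochs=500, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Initialization: small random weights
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, d in data:                      # online updates, sequential order
            gW1, gW2 = backprop(W1, W2, x, d)
            W2 = [w - mu * g for w, g in zip(W2, gW2)]
            W1 = [[w - mu * g for w, g in zip(row, grow)]
                  for row, grow in zip(W1, gW1)]
    return W1, W2

# Hypothetical training set: XOR with +/-1 coding
xor = [([-1.0, -1.0], -1.0), ([-1.0, 1.0], 1.0),
       ([1.0, -1.0], 1.0), ([1.0, 1.0], -1.0)]
W1, W2 = train(xor)
```

A useful sanity check for any backpropagation code is to compare the returned gradients with finite differences of `loss`; convergence to a good solution is not guaranteed, as the notes point out.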

Modifications to BP Algorithm Batch procedure Variable step size Better approximation of gradient method (momentum term, conjugate gradient) Newton methods (Hessian) Alternate cost functions Regularization Network construction algorithms Incorporating time

When to stop training First major features captured. As training continues minor features captured. Look at training error. Crossvalidation (training, validation, and test sets). [Figure: testing error and training error curves.] Learning typically slow and may find flat learning areas with little improvement in energy function.

C. Radial Basis Functions Use locally receptive units (potential functions). Transform input space to hidden unit space via potential functions. Output unit is linear. [Diagram: inputs feed a layer of potential units, whose outputs feed a linear output unit.] Potential units: $\phi(x) = \exp(-.5\,\|x - c\|^2 / \sigma^2)$

Transformation of input space [Figure: X and O points in the input space are mapped to the feature space by $\Phi: X \to Z$.]
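A minimal sketch of a radial basis function network with Gaussian potential units and a linear output unit. The symbols c and sigma stand for the center and width of a unit; the helper names, the bias term, and the example centers, widths and weights are assumptions made here.

```python
import math

def potential(x, c, sigma):
    """Gaussian potential unit: exp(-.5 ||x - c||^2 / sigma^2)."""
    dist2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-0.5 * dist2 / sigma ** 2)

def rbf_output(x, centers, sigmas, weights, bias=0.0):
    """Linear output unit applied to the hidden (potential) unit outputs."""
    hidden = [potential(x, c, s) for c, s in zip(centers, sigmas)]
    return bias + sum(w * h for w, h in zip(weights, hidden))

# Hypothetical network with two hidden units
centers = [[0.0, 0.0], [2.0, 2.0]]
sigmas = [1.0, 0.5]
weights = [1.0, -1.0]

at_center = potential([0.0, 0.0], centers[0], sigmas[0])   # unit peaks at its center
one_away = potential([1.0, 0.0], centers[0], sigmas[0])    # exp(-0.5)
y = rbf_output([0.0, 0.0], centers, sigmas, weights)
```

Each hidden unit responds strongly only near its own center, which is what makes the representation locally receptive.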

Training Radial basis functions Use gradient descent on unknown parameters: centers, widths, and output weights Separate tasks for quicker training: (first layer centers, widths), (second layer weights) First layer Fix widths, centers determined from lattice structure Fix widths, clustering algorithm for centers Resource allocation network Second layer: use LMS to learn weights

Comparisons between RBFs and BP Algorithm RBF networks have a single hidden layer, while networks trained with the BP algorithm can have many hidden layers. RBF (potential functions) uses locally receptive units, versus BP algorithm (sigmoidal units) with distributed representations. RBF networks typically have many more hidden units. RBF training is typically quicker.

V. Alternate Detection Method Consider detection methods based on optimum margin classifiers or Support Vector Machines (SVM) SVM are based on concepts from statistical learning theory. SVM are easily extended to nonlinear decision regions via kernel functions. SVM solutions involve solving quadratic programming problems.

Optimal Margin Classifiers Given a set of points that are linearly separable: which hyperplane should you choose to separate points? Choose hyperplane that maximizes distance between two sets of points. [Figure: X points and O points separated by a hyperplane.]

Finding Optimal Hyperplane margins Draw convex hull around each set of points. Find shortest line segment connecting two convex hulls. Find midpoint of line segment. Optimal hyperplane intersects line segment at midpoint, perpendicular to line segment. [Figure: X and O points with weight vector w and the optimal hyperplane.]

Alternative Characterization of Optimal Margin Classifiers Maximizing margins equivalent to minimizing magnitude of weight vector. [Figure: point u on the X side and point v on the O side, each on its margin boundary; the distance between the two boundaries is 2m.] $w^T u + b = 1$ and $w^T v + b = -1$, so $w^T (u - v) = 2$ and $w^T (u - v)/\|w\| = 2/\|w\| = 2m$.
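Since the margin is m = 1/||w||, maximizing it over separating hyperplanes leads to the standard quadratic program below. The notation is an assumption made here: training points x_i with labels d_i in {+1, -1}, and l training points in total.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\,\|w\|^{2}
\quad \text{subject to} \quad
d_i\,\bigl(w^{T} x_i + b\bigr) \ge 1, \qquad i = 1, \dots, l
```

The constraints say that every point lies on or outside its own margin boundary, and the points that meet a constraint with equality are the support vectors.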

Solution in 1 Dimension O O O O O X O X X O X X X Points on wrong side of hyperplane If C is large SV include If C is small SV include all points (scaled MMSE solution) Note that weight vector depends most heavily on outer support vectors.

Comments on 1 Dimensional Solution Simple algorithm can be implemented to solve 1D problem. Solution in multiple dimensions is finding weight and then projecting down to 1D. Min. probability of error threshold depends on likelihood ratio. MMSE solution depends on all points, whereas SVM depends on SV (points that are under margin; closer to min. probability of error). Min. probability of error, MMSE solution, and SVM in general give different detectors.

Kernel Methods In many classification and detection problems a linear classifier is not sufficient. However, working in higher dimensions can lead to “curse of dimensionality”. Solution: Use kernel methods where computations done in dual observation space. [Figure: X and O points in the input space are mapped to the feature space by $\Phi: X \to Z$.]
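One way to see why kernels avoid working in the high dimensional space directly: for some feature maps, the inner product in feature space can be computed from the inner product in input space. The sketch below uses the degree 2 polynomial kernel on 2D inputs as an example; the choice of kernel, the function names, and the sample points are illustrative assumptions, not taken from the notes.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit feature map for 2D inputs: (x1^2, sqrt(2) x1 x2, x2^2)."""
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]

def kernel(x, z):
    """Polynomial kernel K(x, z) = (x^T z)^2, computed in input space."""
    return dot(x, z) ** 2

x = [1.0, 2.0]
z = [3.0, -1.0]
k_input_space = kernel(x, z)                # one 2D inner product, then a square
k_feature_space = dot(phi(x), phi(z))       # inner product after mapping to 3D
```

Both quantities agree, so an algorithm that only needs inner products never has to form Φ(x) explicitly.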

Solving QP problem SVM require solving large QP problems. However, many of the Lagrange multipliers α are zero (not support vectors). Break up QP into subproblems. Chunking: (Vapnik 1979) numerical solution. Osuna algorithm: (1997) numerical solution. Platt algorithm: (1998) Sequential Minimal Optimization (SMO), analytical solution.

SMO Algorithm Sequential Minimal Optimization breaks up the QP program into small subproblems that are solved analytically. SMO solves the dual QP SVM problem by examining points that violate KKT conditions. Algorithm converges and consists of: Search for 2 points that violate KKT conditions. Solve QP program for 2 points. Calculate threshold value b. Continue until all points satisfy KKT conditions. On numerous benchmarks time to convergence of SMO varied from $O(l)$ to $O(l^{2.2})$. Convergence time depends on difficulty of classification problem and kernel functions used.

SVM Summary SVM are based on optimum margin classifiers and are solved using quadratic programming methods. SVM are easily extended to problems that are not linearly separable. SVM can create nonlinear separating surfaces via kernel functions. SVM can be efficiently programmed via the SMO algorithm. SVM can be extended to solve regression problems.

VI. Unsupervised Learning A. Motivation Given a set of training examples with no teacher or critic, why do we learn? Feature extraction Data compression Signal detection and recovery Self organization Information can be found about data from inputs.

B. Principal Component Analysis Introduction Consider a zero mean random vector $x \in R^n$ with autocorrelation matrix $R = E(xx^T)$. R has eigenvectors q(1), …, q(n) and associated eigenvalues $\lambda(1) \ge \dots \ge \lambda(n)$. Let Q = [q(1) | … | q(n)] and Λ be a diagonal matrix containing eigenvalues along diagonal. Then R can be decomposed as $R = Q \Lambda Q^T$ (eigenvector and eigenvalue decomposition).

First Principal Component Find max $x^T R x$ subject to ||x|| = 1. Maximum obtained when x = q(1), as this corresponds to $x^T R x = \lambda(1)$. q(1) is first principal component of x and also yields direction of maximum variance. $y(1) = q(1)^T x$ is projection of x onto first principal component. [Figure: vector x, direction q(1), and projection y(1).]

Other Principal Components ith principal component denoted by q(i) and projection denoted by $y(i) = q(i)^T x$, with E(y(i)) = 0 and $E(y(i)^2) = \lambda(i)$. Note that $y = Q^T x$ and we can obtain data vector x from y by noting that x = Qy. We can approximate x by taking first m principal components (PC) to get z: z = q(1)y(1) + … + q(m)y(m). Error given by e = x - z. e is orthogonal to q(i) when $1 \le i \le m$.

Diagram of PCA [Figure: two scatter plots of data points with the first PC and second PC directions marked.] First PC gives more information than second PC.

Learning algorithms for PCA Hebbian learning rule: when presynaptic and postsynaptic signals are positive, the weight associated with the synapse increases in strength. [Diagram: input x, weight w, output y.] $\Delta w = \mu\, x\, y$

Oja’s rule Use normalized Hebbian rule applied to linear neuron. [Diagram: inputs x are weighted by w and summed (Σ); the output is $y = s$.] Need normalized Hebbian rule, otherwise weight vector will grow unbounded.

Oja’s rule continued $w_i(k+1) = w_i(k) + \mu\, x_i(k) y(k)$ (apply Hebbian rule), then $w(k+1) = w(k+1) / \|w(k+1)\|$ (renormalize weight). Unfortunately the above rule is difficult to implement, so a modification approximates it, giving $w_i(k+1) = w_i(k) + \mu\, y(k) (x_i(k) - y(k) w_i(k))$. Similar to Hebbian rule with modified input. Can show that $w(k) \to q(1)$ with probability one, given that x(k) is zero mean, second order, and drawn from a fixed distribution.
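The relations among x, its projections y(i), the approximation z, and the error e can be checked on a small example. The orthonormal vectors q(1), q(2) and the data vector x below are made-up values for illustration; z is built from the projections y(i) onto the first m principal components.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Assumed orthonormal eigenvectors: q[0] plays the role of q(1), q[1] of q(2)
q = [[0.6, 0.8], [-0.8, 0.6]]
x = [1.0, 2.0]

y = [dot(qi, x) for qi in q]                      # y = Q^T x
m = 1                                             # keep the first m components
z = [sum(y[i] * q[i][k] for i in range(m)) for k in range(len(x))]
e = [xk - zk for xk, zk in zip(x, z)]             # approximation error e = x - z
x_back = [sum(y[i] * q[i][k] for i in range(len(q))) for k in range(len(x))]  # x = Q y

error_along_q1 = dot(e, q[0])                     # should vanish
```

The error has no component along the retained directions; all of it lies along the discarded ones.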
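Oja's rule can be sketched on synthetic data. The data model is an assumption made here: zero mean 2D samples with standard deviation 2 along a hypothetical direction q1 and 0.5 along the orthogonal direction q2, so q1 is the first principal component. The step size, seed, and initial weight vector are arbitrary.

```python
import random

def oja(samples, w0, mu):
    """Oja's rule: w_i(k+1) = w_i(k) + mu y(k) (x_i(k) - y(k) w_i(k))."""
    w = list(w0)
    for x in samples:
        y = sum(wi * xi for wi, xi in zip(w, x))   # linear neuron output
        w = [wi + mu * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w

rng = random.Random(0)
q1, q2 = [0.6, 0.8], [-0.8, 0.6]                   # assumed principal directions
samples = []
for _ in range(5000):
    a, b = rng.gauss(0.0, 2.0), rng.gauss(0.0, 0.5)
    samples.append([a * q1[0] + b * q2[0], a * q1[1] + b * q2[1]])

w = oja(samples, w0=[1.0, 0.0], mu=0.005)
alignment = abs(w[0] * q1[0] + w[1] * q1[1])       # |w^T q1|; sign of w is arbitrary
norm = (w[0] ** 2 + w[1] ** 2) ** 0.5              # should stay near 1
```

The extra term -y(k)^2 w_i(k) keeps the weight norm near one without an explicit renormalization step.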

Learning other PCs Adaptive learning rules (subtract larger PCs out) Generalized Hebbian Algorithm APEX Batch Algorithm (singular value decomposition) Approximate correlation matrix R with time averages.

Applications of PCA Matched Filter problem: x(k) = s(k) + v(k). Multiuser communications: CDMA Image coding (data compression) GHA quantizer PCA

C. Independent Component Analysis PCA decorrelates inputs. However, in many instances we may want to make outputs independent. [Diagram: U → A → X → W → Y.] Inputs U assumed independent and user sees X. Goal is to find W so that Y is independent.

ICA Solution Y = DPU, where D is a diagonal matrix and P is a permutation matrix. Algorithm is unsupervised. What are assumptions where learning is possible? All components of U except possibly one are nongaussian. Establish criterion to learn from (use higher order statistics): information based criteria, kurtosis function. Kullback Leibler Divergence: $D(f,g) = \int f(x) \log (f(x)/g(x))\, dx$

ICA Information Criterion Kullback Leibler Divergence is nonnegative. Set f to joint density of Y and g to product of marginals of Y; then $D(f,g) = -H(Y) + \sum_i H(Y_i)$, which is minimized when components of Y are independent. When outputs are independent they can be a permutation and scaled version of U.
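The step from the Kullback Leibler divergence to the entropy expression can be written out as follows, where f_i denotes the marginal density of Y_i and H is differential entropy (notation assumed here):

```latex
\begin{aligned}
D(f,g) &= \int f(y)\,\log\frac{f(y)}{\prod_i f_i(y_i)}\,dy \\
       &= \int f(y)\,\log f(y)\,dy \;-\; \sum_i \int f(y)\,\log f_i(y_i)\,dy \\
       &= -H(Y) + \sum_i H(Y_i)
\end{aligned}
```

The last line uses the fact that integrating f(y) over every coordinate except y_i leaves the marginal f_i(y_i). The divergence is zero exactly when the joint density equals the product of its marginals, that is, when the components of Y are independent.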

Learning Algorithms Can learn weights by approximating divergence cost function using contrast functions. Iterative gradient estimate algorithms can be used. Faster convergence can be achieved with fixed point algorithms that approximate Newton’s methods. Algorithms have been shown to converge.

Applications of ICA Array antenna processing Blind source separation: speech separation, biomedical signals, financial data

D. Competitive Learning Motivation: Neurons compete with one another with only one winner emerging. Brain is a topologically ordered computational map. Array of neurons self organize. Generalized competitive learning algorithm: 1) Initialize weights. 2) Randomly choose inputs. 3) Pick winner. 4) Update weights associated with winner. 5) Go to 2).

Competitive Learning Algorithm K means algorithm (no topological ordering) Online algorithm Update centers Reclassify points Converges to local minima Kohonen Self Organization Feature Map (topological ordering) Neurons arranged on lattice Weights that are updated depend on winner, step size, and neighborhood. Decrease step size and neighborhood size to get topological ordering.
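A minimal sketch of winner-take-all competitive learning in its online, K-means-like form with no topological ordering. The two-cluster data, the initial weights, the fixed step size, and the seed are assumptions made for illustration. A Kohonen map would additionally update the winner's lattice neighbors and shrink the step size and neighborhood over time.

```python
import random

def competitive_learning(inputs, weights, mu=0.05):
    """Online winner-take-all updates: only the closest weight vector moves."""
    weights = [list(w) for w in weights]            # 1) initial weights (copied)
    for x in inputs:                                # 2) inputs arrive in random order
        dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in weights]
        winner = dists.index(min(dists))            # 3) pick winner
        weights[winner] = [wi + mu * (xi - wi)      # 4) move winner toward input
                           for xi, wi in zip(x, weights[winner])]
    return weights

rng = random.Random(1)
clusters = [(0.0, 0.0), (5.0, 5.0)]                 # hypothetical cluster centers
inputs = []
for _ in range(1000):
    cx, cy = rng.choice(clusters)
    inputs.append([cx + rng.gauss(0.0, 0.3), cy + rng.gauss(0.0, 0.3)])

centers = competitive_learning(inputs, weights=[[1.0, 1.0], [4.0, 4.0]])
```

With a poor initialization one unit can win every input while the others never move, which is one reason the result is only a local minimum.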

KSOFM 2 dimensional lattice

Neural Network Applications Backgammon (Feedforward network) 459-24-24-1 network to rate moves Hand crafted examples, noise helped in training 59% winning percentage against SUN gammontools Later versions used reinforcement learning Handwritten zip code (Feedforward network) 16-768-192-30-10 network to distinguish numbers Preprocessed data, 2 hidden layers act as feature detectors 7291 training examples, 2000 test examples Training data .14%, test data 5%, test/reject data 1%,12%

Neural Network Applications Speech recognition KSOFM map followed by feedforward neural network 40 – 120 frames mapped onto 12 by 12 Kohonen map Each frame composed of 600 to 1800 analog vector Output of Kohonen map fed to feedforward network Reduced search using KSOFM map TI 20 word data base 98-99% correct on speaker dependent classification

Other topics Reinforcement learning Associative networks Neural dynamics and control Computational learning theory Bayesian learning Neuroscience Cognitive science Hardware implementation