Introduction to Predictive Learning LECTURE SET 6 Neural Network Learning Electrical and Computer Engineering
OUTLINE Objectives - introduce biologically inspired NN learning methods for clustering, regression and classification - explain similarities and differences between statistical and NN methods - show examples using synthetic and real-life data Brief history and motivation for artificial neural networks Sequential estimation of model parameters Methods for supervised learning Methods for unsupervised learning Summary and discussion
Brief history and motivation for ANN Huge interest in understanding the nature and mechanism of biological/ human learning Biologists + psychologists do not adopt classical parametric statistical learning, because: - parametric modeling is not biologically plausible - biological info processing is clearly different from algorithmic models of computation Mid 1980’s: growing interest in applying biologically inspired computational models to: - developing computer models (of human brain) - various engineering applications New field Artificial Neural Networks (~1986 – 1987) ANN’s represent nonlinear estimators implementing the ERM approach (usually squared-loss function)
History and motivation (cont’d) Relationship to the problem of inductive learning: The same learning problem setting Neural-style learning algorithm: - on-line (flow through) - simple processing Biological terminology
Neural vs Algorithmic computation
Biological systems do not use principles of digital circuits

                  Digital        Biological
Connectivity      1~10           ~10,000
Signal            digital        analog
Timing            synchronous    asynchronous
Signal propag.    feedforward    feedback
Redundancy        no             yes
Parallel proc.    no             yes
Learning          no             yes
Noise tolerance   no             yes
Neural vs Algorithmic computation Computers excel at algorithmic tasks (well-posed mathematical problems) Biological systems are superior to digital systems for ill-posed problems with noisy data Example: object recognition [Hopfield, 1987] PIGEON: ~ 10^9 neurons, cycle time ~ 0.1 sec, each neuron sends 2 bits to ~ 1K other neurons: ~ 2x10^13 bit operations per sec OLD PC: ~ 10^7 gates, cycle time 10^-7 sec, connectivity = 2: ~ 2x10^14 bit operations per sec Both have comparable raw processing capability (within about an order of magnitude), but pigeons are better at recognition tasks
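The raw-throughput comparison above can be checked with a few lines of arithmetic. This is only a sketch, under the assumption that throughput is the product of the listed factors (units x cycles per second x bits per unit x fan-out); the function and argument names are illustrative and do not come from Hopfield's paper.

```python
def bit_ops_per_sec(units, cycle_time_sec, bits_per_unit, fan_out):
    """Rough throughput estimate: units * cycles/sec * bits sent * connections."""
    cycles_per_sec = 1.0 / cycle_time_sec
    return units * cycles_per_sec * bits_per_unit * fan_out

# Pigeon: ~1e9 neurons, ~0.1 sec cycle, 2 bits sent to ~1K other neurons.
pigeon = bit_ops_per_sec(units=1e9, cycle_time_sec=0.1, bits_per_unit=2, fan_out=1e3)

# Old PC: ~1e7 gates, 1e-7 sec cycle, connectivity 2 (1 bit per gate assumed).
old_pc = bit_ops_per_sec(units=1e7, cycle_time_sec=1e-7, bits_per_unit=1, fan_out=2)

print(f"pigeon: {pigeon:.1e} bit ops/sec, old PC: {old_pc:.1e} bit ops/sec")
```

The two estimates differ by roughly one order of magnitude, which is the sense in which the raw capabilities are "similar".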
Neural terminology and artificial neurons Some general descriptions of ANN’s: http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html http://en.wikipedia.org/wiki/Neural_network McCulloch-Pitts neuron (1943) Threshold (indicator) function of weighted sum of inputs
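A minimal sketch of the McCulloch-Pitts unit described above: the output is an indicator (threshold) function of the weighted sum of inputs. The weights and thresholds below are illustrative choices, not values from the lecture.

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else 0

# With unit weights, the threshold alone selects the logical function
# computed on two binary inputs (inputs enumerated as 00, 01, 10, 11).
and_out = [mcculloch_pitts((a, b), (1, 1), threshold=2) for a in (0, 1) for b in (0, 1)]
or_out = [mcculloch_pitts((a, b), (1, 1), threshold=1) for a in (0, 1) for b in (0, 1)]
```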
Goals of ANN’s Develop models of computation inspired by biological systems Study computational capabilities of networks of interconnected neurons Apply these models to real-life applications Learning in NNs = modification (adaptation) of synaptic connections (weights) in response to external inputs
Historical highlights of ANN 1943 McCulloch-Pitts neuron 1949 Hebbian learning 1960’s Rosenblatt (perceptron), Widrow 60’s-70’s dominance of ‘hard’ AI 1980’s resurgence of interest (PDP group, MLP, SOM etc.) 1990’s connection to statistics/VC-theory 2000’s mature field/ fragmentation
OUTLINE Objectives Brief history and motivation for artificial neural networks Sequential estimation of model parameters Methods for supervised learning Methods for unsupervised learning Summary and Discussion
Sequential estimation of model parameters Batch vs on-line (iterative) learning - Algorithmic (statistical) approaches ~ batch - Neural-network inspired methods ~ on-line BUT the only difference is on the implementation level (so both types of methods should yield similar generalization) Recall ERM inductive principle (for regression): Assume dictionary parameterization with fixed basis fcts
Sequential (on-line) least squares minimization Training pairs x(k), y(k) presented sequentially On-line update equations for minimizing empirical risk (MSE) wrt parameters w are: w(k+1) = w(k) - gamma(k) * dL(x(k), y(k), w)/dw (gradient descent learning), where the gradient of the loss L is computed via the chain rule and the learning rate gamma(k) is a small positive value (decreasing with k)
On-line least-squares minimization algorithm Known as delta-rule (Widrow and Hoff, 1960): Given initial parameter estimates w(0), update parameters during each presentation of k-th training sample x(k),y(k) Step 1: forward pass computation - estimated output Step 2: backward pass computation - error term (delta)
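The two steps of the delta rule can be sketched as follows. This is a minimal illustration, assuming a linear model y ~ w[0] + w[1]*x with squared loss; the learning-rate schedule gamma0 / (1 + 0.01*k), the number of epochs and the function name are assumptions, not the lecture's settings.

```python
import random

def delta_rule(samples, n_epochs=200, gamma0=0.5, seed=0):
    """On-line least squares (delta rule) for a linear model y ~ w[0] + w[1]*x."""
    rng = random.Random(seed)
    w = [0.0, 0.0]                      # initial parameter estimates w(0)
    data = list(samples)
    k = 0
    for _ in range(n_epochs):           # each pass over the data is one epoch
        rng.shuffle(data)
        for x, y in data:
            k += 1
            gamma = gamma0 / (1.0 + 0.01 * k)     # decreasing learning rate (assumed schedule)
            z = (1.0, x)                          # inputs, including a constant bias input
            y_hat = w[0] * z[0] + w[1] * z[1]     # Step 1: forward pass, estimated output
            delta = gamma * (y - y_hat)           # Step 2: backward pass, scaled error term
            w = [w[j] + delta * z[j] for j in range(2)]
    return w

# Noise-free samples of y = 1 + 2x on [0, 1]; the estimates approach (1, 2).
data = [(i / 19, 1.0 + 2.0 * i / 19) for i in range(20)]
w = delta_rule(data)
```

Each update only needs the current sample, the current output and the current weights, which is what makes the rule "flow through".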
Neural network interpretation of delta rule Forward pass Backward pass Biological learning
Theoretical basis for on-line learning Standard inductive learning: given training data, find the model providing min of prediction risk Stochastic Approximation guarantees minimization of risk (asymptotically) under general conditions on the learning rate gamma(k): gamma(k) -> 0 as k -> infinity, sum over k of gamma(k) = infinity, sum over k of gamma(k)^2 < infinity
Practical issues for on-line learning Given finite training set (n samples): this set is presented sequentially to a learning algorithm many times. Each presentation of n samples is called an epoch, and the process of repeated presentations is called recycling (of training data) Learning rate schedule: initially set large, then slowly decreasing with k (iteration number). Typically ’good’ learning rate schedules are data-dependent. Stopping conditions: (1) monitor the gradient (i.e., stop when the gradient falls below some small threshold) (2) early stopping can be used for complexity control
OUTLINE Objectives Brief history and motivation for artificial neural networks Sequential estimation of model parameters Methods for supervised learning - MultiLayer Perceptron (MLP) networks - Radial Basis Function (RBF) Networks Methods for unsupervised learning Summary and discussion
Multilayer Perceptrons (MLP) Recall graphical NN representation for dictionary methods: where How to estimate parameters (weights) via ERM?
Learning for a single neuron (delta rule): Forward pass Backward pass How to implement gradient-descent learning in a network of neurons?
Backpropagation training Minimization of the empirical risk (squared loss) with respect to parameters (weights) W, V Gradient descent optimization for the MLP parameterization Careful application of gradient descent leads to the backpropagation algorithm
Backpropagation: forward pass for training input x(k), estimate predicted output
Backpropagation: backward pass update the weights by propagating the error
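The forward and backward passes can be sketched for a network with one hidden layer. This is an illustrative sketch, assuming logistic sigmoid hidden units, a linear output unit and squared loss; the names V (input-to-hidden weights) and w (hidden-to-output weights), the initialization range and the fixed learning rate are assumptions.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(x, V, w):
    """Forward pass: hidden outputs z_j = s(v_j . [1, x]); output y_hat = w . [1, z]."""
    z = [sigmoid(v[0] + sum(vi * xi for vi, xi in zip(v[1:], x))) for v in V]
    y_hat = w[0] + sum(wj * zj for wj, zj in zip(w[1:], z))
    return z, y_hat

def backprop_step(x, y, V, w, gamma):
    """One on-line update of V and w (in place); returns the loss before the update."""
    z, y_hat = forward(x, V, w)
    delta_out = y_hat - y                               # error term at the output
    # Error propagated back to each hidden unit, using s'(a) = z * (1 - z).
    delta_hid = [delta_out * w[j + 1] * z[j] * (1.0 - z[j]) for j in range(len(V))]
    w[0] -= gamma * delta_out
    for j in range(len(V)):
        w[j + 1] -= gamma * delta_out * z[j]
        V[j][0] -= gamma * delta_hid[j]
        for i in range(len(x)):
            V[j][i + 1] -= gamma * delta_hid[j] * x[i]
    return 0.5 * delta_out ** 2

def train(data, m=3, epochs=2000, gamma=0.2, seed=0):
    """Repeated presentation of the training set; small random initial weights."""
    rng = random.Random(seed)
    d = len(data[0][0])
    V = [[rng.uniform(-0.5, 0.5) for _ in range(d + 1)] for _ in range(m)]
    w = [rng.uniform(-0.5, 0.5) for _ in range(m + 1)]
    for _ in range(epochs):
        for x, y in data:
            backprop_step(x, y, V, w, gamma)
    return V, w
```

Note that the hidden-unit error terms are computed from the output weights before those weights are changed, so all updates in a step use the same gradient.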
Details of backpropagation Sigmoid activation: simple derivative, s'(t) = s(t)(1 - s(t)) for the logistic sigmoid s(t) = 1/(1 + exp(-t)) Poor behaviour for large |t| ~ saturation How to avoid saturation? - Proper initialization (small weights) - Pre-scaling of inputs (zero mean, unit variance) Learning rate schedule (initial, final) Stopping rules, number of epochs Number of hidden units
Regularization Effect of Backpropagation Backpropagation ~ iterative optimization Final model (weights) depends on: - initial point + final point (stopping rules) initialization and/ or stopping rules can be used for model complexity control
Various forms of complexity control MLP topology ~ number of hidden units Constraints on parameters (weights) ~ weight decay Type of optimization algorithm (many versions of backprop., other opt. methods) Stopping rules Initial conditions (initial ‘small’ weights) Multiple factors make it difficult to control complexity; usually vary one complexity parameter while keeping all others fixed
Example: univariate regression Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1). MLP network (two hidden units) underfitting
Example: univariate regression Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1). MLP network (five hidden units) near optimal
Example: univariate regression Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1). MLP network (20 hidden units) slight overfitting
Backpropagation for classification Original MLP is for regression (as shown) For classification: - sigmoid output unit (~ logistic regression using log-likelihood loss – see textbook) - during training, use real-valued targets 0/1 for class labels - during operation, threshold the output of a trained MLP classifier at 0.5 to predict class labels
Classification example (Ripley’s data set) Data set: 250 samples ~ mixture of gaussians, where Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3). The variance of all gaussians is 0.03. MLP classifier (two hidden units) underfitting
Classification Example MLP classifier (three hidden units) ~ near optimal solution
Classification Example MLP classifier (six hidden units) some overfitting
MLP software MLP software widely available in public domain Can handle multi-class problems For example, Netlab toolbox (in Matlab) at http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/ Many commercial products (full of marketing hype) ’Nearly 80% Accurate Market Forecasting Software Get FREE up to date predictions and see for yourself!’
NetTalk (Sejnowski and Rosenberg, 1987) One of the first successful applications of backpropagation: http://www.cnl.salk.edu/ParallelNetsPronounce/index.php Goal: Learning to read (English text) aloud, i.e. Learn Mapping: English text → phonemes using MLP classifier network Network inputs encode 7-letter window (the 4-th letter in the middle needs to be pronounced) Network outputs (26 units) encode phonemes that drive a speech synthesizer The MLP network is trained using labeled data (both individual words and unrestricted text)
NetTalk architecture Input encoding: 7x29 = 203 units Output encoding: 26 units (phonemes) Hidden layer: 80 hidden units
Listening to NetTalk-generated speech Listen to tape recordings illustrating NETtalk operation available on YouTube http://www.youtube.com/watch?v=gakJlr3GecE These three recordings contain 3 different audio outputs of NETtalk: (a) during the first 5 minutes of training, starting with weights initialized to zero. (b) after training using the set of 10,000 words. This training set corresponds to 20 passes (epochs) over 500-word text. (c) generated with new text input that was not part of the training set. After listening to these recordings, answer and comment on the following questions: - can you recognize words in the recordings (a), (b) and (c)? Explain why. - compare the quality of outputs (b) and (c). Which one seems closer to human speech and why? Question for discussion: Problem 6.8 - Why does NETtalk use a seven-letter window?
Radial Basis Function (RBF) Networks Dictionary parameterization: - each b.f. is (usually) local - specified by its center and width, e.g. Gaussian Typically used for regression or classification
RBF network training RBF training (learning) ~ estimation of (1) RBF parameters (centers, width) (2) linear weights w’s Non-adaptive implementation: (1) Estimate RBF parameters via unsupervised learning (only x-values of training data) – can use SOM, GLA etc. (2) Estimate weights w via linear least squares Advantages: - fast training; - when x-samples are plenty, but (x,y) data are few Limitations: - cannot discard irrelevant inputs - the curse of dimensionality
Non-adaptive RBF training algorithm
1. Choose the number of basis functions (centers) m.
2. Estimate centers using x-values of training data via unsupervised learning (SOM, GLA, clustering etc.)
3. Determine width parameters using heuristic: for a given center c_j (a) find the distance r_j to the closest other center c_k, k ≠ j (b) set the width parameter proportional to this distance, sigma_j = alpha * r_j, where parameter alpha controls degree of overlap between adjacent basis functions.
4. Estimate weights w via linear least squares (minimization of the empirical risk MSE).
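The width heuristic and the least-squares step can be sketched for univariate inputs. This is an illustration under several assumptions: the centers are supplied by the caller (for example from GLA), the basis functions are Gaussians exp(-(x - c)^2 / (2 sigma^2)), a constant bias term is added, and the value of alpha is chosen by the user. All function names are hypothetical.

```python
import math

def rbf_widths(centers, alpha=2.0):
    """Width heuristic: alpha times the distance to the closest other center (needs >= 2 centers)."""
    return [alpha * min(abs(c - o) for k, o in enumerate(centers) if k != j)
            for j, c in enumerate(centers)]

def design_row(x, centers, widths):
    """Basis function outputs for one input: constant bias term plus one Gaussian per center."""
    return [1.0] + [math.exp(-(x - c) ** 2 / (2.0 * s ** 2)) for c, s in zip(centers, widths)]

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def fit_rbf(xs, ys, centers, alpha=2.0):
    """Linear least squares for the weights via the normal equations."""
    widths = rbf_widths(centers, alpha)
    Phi = [design_row(x, centers, widths) for x in xs]
    p = len(centers) + 1
    A = [[sum(row[i] * row[j] for row in Phi) for j in range(p)] for i in range(p)]
    b = [sum(row[i] * y for row, y in zip(Phi, ys)) for i in range(p)]
    return solve(A, b), widths

def predict(x, w, centers, widths):
    return sum(wi * phi for wi, phi in zip(w, design_row(x, centers, widths)))
```

Because the centers and widths are fixed before the weights are estimated, the final step is an ordinary linear problem. Solving the normal equations directly is the simplest choice for a sketch but can be ill-conditioned when basis functions overlap heavily.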
RBF network complexity control RBF model complexity can be controlled by The number of RBFs: Goal: select opt number of units (RBFs) RBF width: Goal: select opt width parameter (for large number of RBF’s) Penalization of large weights w’s See toy examples next (using the number of units as the complexity parameter)
Example: RBF regression Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1). RBF network: automatic width selection (via cross-validation) 2 RBF’s underfitting
Example: RBF regression Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1). RBF network: automatic width selection 5 RBF’s ~ optimal
Example: RBF regression Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1). RBF network: automatic width selection 20 RBF’s overfitting
RBF Classification example (Ripley’s data) Data set: 250 samples ~ mixture of gaussians, where Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3). The variance of all gaussians is 0.03. RBF classifier (4 units) some underfitting
RBF Classification example (cont’d) RBF classifier (9 units) Optimal
RBF Classification example (cont’d) RBF classifier (25 units) Slight overfitting
OUTLINE Objectives Brief history and motivation for artificial neural networks Sequential estimation of model parameters Methods for supervised learning Methods for unsupervised learning - clustering and vector quantization - Self-Organizing Maps (SOM) - Application example Summary and discussion
Overview Recall from Lecture Set 2: unsupervised learning data reduction approach Example: Training data represented by 3 ‘centers’
Two types of problems 1. Data reduction: VQ + clustering ‘Model’ ~ m points Vector Quantizer Q: VQ setting: given n training samples find the coordinates of m centers (prototypes) such that the total squared error distortion is minimized
2. Dimensionality reduction: linear or nonlinear ‘Model’ ~ projection of high-dim. data onto low-dim. space. Note: the goal is to estimate a mapping from d-dimensional input space (d=2) to low-dimensional feature space (d*=1)
Vector Quantization and Clustering Two complementary goals of VQ: 1. partition the input space into disjoint regions 2. find positions of units (coordinates of prototypes) Note: optimal partitioning into regions is according to the nearest-neighbor rule (~ the Voronoi regions)
Generalized Lloyd Algorithm (GLA) for VQ
Given data points x(k), k = 1, 2, ..., loss function L (e.g., squared loss) and initial centers c_j(0), j = 1, ..., m
Perform the following updates upon presentation of x(k):
1. Find the nearest center to the data point (the winning unit): j = argmin over i of ||x(k) - c_i(k)||
2. Update the winning unit coordinates (only) via c_j(k+1) = c_j(k) + gamma(k) [x(k) - c_j(k)]
Increment k and iterate steps (1) – (2) above
Note: - the learning rate gamma(k) decreases with iteration number k - biological interpretations of steps (1)-(2) exist
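A minimal sketch of the on-line (flow-through) GLA for squared loss. The linear learning-rate decrease, the number of passes and the function name are assumptions made for illustration.

```python
import random

def online_gla(data, centers, n_passes=50, gamma0=0.5, seed=0):
    """Flow-through VQ: move only the winning center toward each presented sample."""
    rng = random.Random(seed)
    c = [list(cj) for cj in centers]
    order = list(data)
    total = n_passes * len(order)
    k = 0
    for _ in range(n_passes):
        rng.shuffle(order)
        for x in order:
            # Step 1: winning unit = nearest center (squared Euclidean distance).
            j = min(range(len(c)),
                    key=lambda i: sum((xi - ci) ** 2 for xi, ci in zip(x, c[i])))
            # Step 2: update the winner only; learning rate decreases with k (assumed linear).
            gamma = gamma0 * (1.0 - k / total)
            c[j] = [ci + gamma * (xi - ci) for xi, ci in zip(x, c[j])]
            k += 1
    return c

# Two well-separated groups of points; each center settles inside one group.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
final_centers = online_gla(points, centers=[(0, 0), (10, 10)])
```

Since every update is a convex combination of the old center and a sample, a center that starts inside a well-separated group stays inside it.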
Batch version of GLA
Given data points x_i, i = 1, ..., n, loss function L (e.g., squared loss) and initial centers c_j, j = 1, ..., m
Iterate the following two steps
1. Partition the data (assign sample x_i to unit j) using the nearest neighbor rule. Partitioning matrix Q: Q_ij = 1 if x_i is closest to c_j, and 0 otherwise
2. Update unit coordinates as centroids of the data: c_j = (sum over i of Q_ij x_i) / (sum over i of Q_ij)
Note: final solution may depend on initialization (local min) – potential problem for both on-line and batch GLA
Numeric Example of univariate VQ
Given data: {2,4,10,12,3,20,30,11,25}, set m=2
Initialization (random): c1=3, c2=4
Iteration 1 Projection: P1={2,3}, P2={4,10,12,20,30,11,25} Expectation (averaging): c1=2.5, c2=16
Iteration 2 Projection: P1={2,3,4}, P2={10,12,20,30,11,25} Expectation (averaging): c1=3, c2=18
Iteration 3 Projection: P1={2,3,4,10}, P2={12,20,30,11,25} Expectation (averaging): c1=4.75, c2=19.6
Iteration 4 Projection: P1={2,3,4,10,11,12}, P2={20,30,25} Expectation (averaging): c1=7, c2=25
Stop as the algorithm is stabilized with these values
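The iterations of this numeric example can be reproduced with a short batch GLA sketch for scalar data. The function name and the choice to stop when the centers no longer change are assumptions for illustration.

```python
def batch_gla_1d(data, centers, max_iter=100):
    """Batch GLA for scalar data; returns (final centers, final partitions, center history)."""
    c = list(centers)
    history = []
    parts = [[] for _ in c]
    for _ in range(max_iter):
        # Projection: assign each sample to its nearest center.
        parts = [[] for _ in c]
        for x in data:
            j = min(range(len(c)), key=lambda i: abs(x - c[i]))
            parts[j].append(x)
        # Expectation (averaging): each center moves to the mean of its partition.
        new_c = [sum(p) / len(p) if p else cj for p, cj in zip(parts, c)]
        if new_c == c:          # centers unchanged: the algorithm has stabilized
            break
        c = new_c
        history.append(list(c))
    return c, parts, history

centers, parts, history = batch_gla_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], [3, 4])
for step, (c1, c2) in enumerate(history, start=1):
    print(f"Iteration {step}: c1={c1}, c2={c2}")
```

A fifth pass leaves both partitions unchanged, which is why the procedure stops after the fourth update.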
GLA Example 1 Modeling doughnut distribution using 5 units (a) initialization (b) final position (of units)
GLA Example 2 Modeling doughnut distribution using 3 units: Bad initialization → poor local minimum
GLA Example 3 Modeling doughnut distribution using 20 units: 7 units were never moved by the GLA → the problem of unused units (dead units)
Avoiding local minima with GLA Starting with many random initializations, and then choosing the best GLA solution Conscience mechanism: forcing ‘dead’ units to participate in competition, by keeping the frequency count (of past winnings) for each unit, i.e. for on-line version of GLA in Step 1 Self-Organizing Map: introduce topological relationship (map), thus forcing the neighbors of the winning unit to move towards the data.
Clustering methods Clustering: separating a data set into several groups (clusters) according to some measure of similarity Goals of clustering: interpretation (of resulting clusters) exploratory data analysis preprocessing for supervised learning often the goal is not formally stated VQ-style methods (GLA) often used for clustering, e.g. k-means or c-means Many other clustering methods as well
Clustering (cont’d) Clustering: partition a set of n objects (samples) into k disjoint groups, based on some similarity measure. Assumptions: - similarity ~ distance metric dist (i,j) - usually k given a priori (but not always!) Intuitive motivation: similar objects into one cluster dissimilar objects into different clusters the goal is not formally stated Similarity (distance) measure is critical but usually hard to define (objectively). Distance needs to be defined for different types of input variables.
Self-Organizing Maps History and biological motivation Brain changes its internal structure to reflect life experiences: interaction with environment is critical at early stages of brain development (first 1-2 years of life) Existence of various regions (maps) in the brain How might these maps be formed? i.e. what information-processing model leads to map formation? T. Kohonen (early 1980’s) proposed SOM
Goal of SOM Dimensionality reduction: project given (high-dim.) data onto low-dimensional space (called a map) Feature space (Z-space) is 1D or 2D and is discretized as a number of units, e.g., 10x10 map Z-space has a distance metric → ordering of units Similarities and differences between VQ and SOM
Self-Organizing Map Discretization of 2D space via 10x10 map. In this discrete space, distance relations exist between all pairs of units. Distance relation ~ map topology
SOM Algorithm (flow through)
Given data points x(k), distance metric in the input space (~ Euclidean), map topology (in z-space), initial position of units c_j(0) (in x-space)
Perform the following updates upon presentation of x(k):
1. Find the nearest unit to the data point in the input space (the winning unit, with map coordinate denoted as z(k))
2. Update all units around the winning unit via c_j(k+1) = c_j(k) + gamma(k) K(z_j, z(k)) [x(k) - c_j(k)], where K is the neighborhood function in z-space
Increment k, decrease the learning rate and the neighborhood width, and repeat steps (1) – (2) above
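The flow-through SOM differs from on-line GLA only in Step 2, where every unit moves by an amount weighted by its map distance to the winner. Below is a minimal sketch for a 1D map, assuming map coordinates z_j = j / (m - 1), a Gaussian neighborhood, a linearly decreasing learning rate and an exponentially decreasing neighborhood width; these schedules and all names are illustrative.

```python
import math
import random

def som_1d(data, m=5, n_iter=2000, gamma0=0.5, width0=1.0, width_final=0.05, seed=0):
    """On-line SOM with a 1D map of m >= 2 units; returns unit positions in input space."""
    rng = random.Random(seed)
    d = len(data[0])
    units = [[rng.random() for _ in range(d)] for _ in range(m)]   # random initial positions
    z = [j / (m - 1) for j in range(m)]                            # map coordinates in [0, 1]
    for k in range(n_iter):
        frac = k / n_iter
        gamma = gamma0 * (1.0 - frac)                       # learning rate: linear decrease
        width = width0 * (width_final / width0) ** frac     # neighborhood: exponential decrease
        x = rng.choice(data)
        # Step 1: winning unit = nearest unit in the input space.
        win = min(range(m),
                  key=lambda j: sum((xi - ui) ** 2 for xi, ui in zip(x, units[j])))
        # Step 2: move all units toward x, weighted by closeness to the winner in map space.
        for j in range(m):
            h = math.exp(-((z[j] - z[win]) ** 2) / (2.0 * width ** 2))
            units[j] = [ui + gamma * h * (xi - ui) for xi, ui in zip(x, units[j])]
    return units
```

With a very small neighborhood width only the winner moves appreciably, so the update reduces to the GLA rule; a wide neighborhood drags the whole map, which is what keeps units from going unused.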
SOM example (one iteration) Step 1: Step 2:
SOM example (next iteration) Step 1: Step 2: Final map
Hyper-parameters of SOM SOM performance depends on parameters (~ user-defined): Map dimension and topology (usually 1D or 2D) Number of SOM units ~ quantization level (of z-space) Neighborhood function ~ usually rectangular or Gaussian (shape not important) Neighborhood width decrease schedule (important), e.g. exponential decrease for Gaussian, with user-defined initial and final width; also linear decrease of neighborhood width Learning rate schedule (important) (also linear decrease) Note: learning rate and neighborhood decrease should be set jointly
Modeling uniform distribution via SOM (a) 300 random samples (b) 10X10 map SOM neighborhood: Gaussian Learning rate: linear decrease
Position of SOM units: (a) initial, (b) after 50 iterations, (c) after 100 iterations, (d) after 10,000 iterations
Batch SOM (similar to Batch GLA)
Given data points, distance metric (e.g., squared loss), map topology and initial centers
Iterate the following two steps
1. Partition the data into clusters using the minimum distance rule. This results in assignment of n samples to m clusters (units) according to assignment matrix Q
2. Update center coordinates as the weighted average of all data samples (in each cluster)
Decrease the neighborhood width, and iterate.
Example: effect of the final neighborhood width 90% 50% 10%
SOM Applications Two types of applications: Vector Quantization Clustering of multivariate data Main web site: http://www.cis.hut.fi/research/som-research/ Numerous Applications Marketing surveys/ segmentation Financial/ stock market data Text data / document map – WEBSOM Image data / picture map - PicSOM see HUT web site
Practical Issues for SOM Pre-scaling of inputs, usually to [0, 1] range. Why? Map topology: usually 1D or 2D Number of map units (per dimension) Learning rate schedule (for on-line version) Neighborhood type and schedule: Initial size (~1), final size Final neighborhood size + the number of units affect model complexity.
Modeling US states using 1D SOM (performed by Feng Cai) Purpose: clustering of US states Data encoding: each state described by 5 socio-economic indicators: obesity index, result of 2004 presidential elections, median income, mean NAEP, IQ score Data scaling: each input scaled independently to [0,1] range SOM specs: 1D map, 9 units, initial neighborhood width 1, final width 0.05
SOM Modeling 1 of US states
SOM Modeling 2 of US states - remove the voting input and apply 1D SOM:
SOM Modeling 2 of US states (cont’d) - remove voting input and apply 1D SOM:
Clustering of European Languages Background: historical linguistics studies relatedness between languages based on phonology, morphology, syntax and lexicon Difficulty of the problem: due to evolving nature of human languages and globalization. Hypothesis: similarity based on analysis of a small ‘stable’ word set. See glottochronology, Swadesh list, at http://en.wikipedia.org/wiki/Glottochronology
SOM Clustering of European Languages Modeling approach: language ~ 10 word set. Assuming words in different languages are encoded in the same alphabet, it is possible to perform clustering using some distance measure. Issues: selection of a stable word set data encoding + distance metric Stable word set: numbers 1 to 10 Data encoding: Latin alphabet, use 3 first letters (in each word)
Numbers word set in 18 European languages Each language is a feature vector encoding 10 words
Data Encoding Word ~ feature vector encoding 3 first letters Alphabet ~ 26 letters + 1 symbol ‘BLANK’ vector encoding: For example, ONE : ‘O’~15 ‘N’~14 ‘E’~05
Word Encoding (cont’d) Word → 27-dimensional feature vector Encoding is insensitive to order (of 3 letters) Encoding of 10-word set: concatenate feature vectors of all words: ‘one’ + ‘two’ + …+ ‘ten’ word set encoded as vector of dim. [1 X 270]
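The word encoding can be sketched as follows. This is one possible reading of the slides, with assumptions labeled in the comments: letters are indexed A=1 ... Z=26, index 0 is reserved for BLANK, each position holds a binary indicator (repeated letters are not counted twice), and any non-Latin character is treated as BLANK. The function names are hypothetical.

```python
import string

ALPHABET = string.ascii_uppercase   # 26 letters; index 0 is reserved for BLANK (assumption)

def encode_word(word):
    """27-dim indicator vector for the first 3 letters of a word (order-insensitive)."""
    vec = [0] * 27
    letters = word.upper()[:3].ljust(3)     # words shorter than 3 letters are padded with BLANK
    for ch in letters:
        # A=1 ... Z=26; anything else (including the padding) maps to BLANK=0 (assumption).
        idx = ALPHABET.index(ch) + 1 if ch in ALPHABET else 0
        vec[idx] = 1
    return vec

def encode_word_set(words):
    """Concatenate the per-word vectors: 10 words give 10 x 27 = 270 features."""
    out = []
    for w in words:
        out.extend(encode_word(w))
    return out

english = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]
features = encode_word_set(english)
```

Because only the presence of a letter is recorded, any reordering of the same three letters yields the same vector, which is the order insensitivity noted above.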
SOM Modeling Approach 2-Dimensional SOM (Batch Algorithm) Number of Units per dimension=4 Initial Neighborhood =1 Final Neighborhood = 0.15 Total Number of Iterations= 70
OUTLINE Objectives Brief history and motivation for artificial neural networks Sequential estimation of model parameters Methods for supervised learning Methods for unsupervised learning Summary and discussion
Summary and Discussion Neural Network methods (vs statistical approaches): - new techniques/ grad descent style methods - simple (brute-force) computational approaches - black-box models (e.g. MLP network) - biological motivation The same fundamental issues: small-sample problems, curse-of-dimensionality, non-linear optimization, complexity control Neural network methods implement ERM or SRM approach (under predictive learning setting) Hype and controversy