Introduction to Predictive Learning


Introduction to Predictive Learning LECTURE SET 6 Neural Network Learning Electrical and Computer Engineering

OUTLINE Objectives - introduce biologically inspired NN learning methods for clustering, regression and classification - explain similarities and differences between statistical and NN methods - show examples using synthetic and real-life data Brief history and motivation for artificial neural networks Sequential estimation of model parameters Methods for supervised learning Methods for unsupervised learning Summary and discussion

Brief history and motivation for ANN Huge interest in understanding the nature and mechanism of biological/human learning Biologists + psychologists do not adopt classical parametric statistical learning, because: - parametric modeling is not biologically plausible - biological info processing is clearly different from algorithmic models of computation Mid 1980’s: growing interest in applying biologically inspired computational models to: - developing computer models (of human brain) - various engineering applications → New field: Artificial Neural Networks (~1986 – 1987) ANNs represent nonlinear estimators implementing the ERM approach (usually squared-loss function)

History and motivation (cont’d) Relationship to the problem of inductive learning: The same learning problem setting Neural-style learning algorithm: - on-line (flow through) - simple processing Biological terminology

Neural vs Algorithmic computation Biological systems do not use principles of digital circuits
Property         | Digital     | Biological
Connectivity     | 1~10        | ~10,000
Signal           | digital     | analog
Timing           | synchronous | asynchronous
Signal propag.   | feedforward | feedback
Redundancy       | no          | yes
Parallel proc.   | no          | yes
Learning         | no          | yes
Noise tolerance  | no          | yes

Neural vs Algorithmic computation Computers excel at algorithmic tasks (well-posed mathematical problems) Biological systems are superior to digital systems for ill-posed problems with noisy data Example: object recognition [Hopfield, 1987] PIGEON: ~10^9 neurons, cycle time ~0.1 sec, each neuron sends 2 bits to ~1K other neurons → 2x10^13 bit operations per sec OLD PC: ~10^7 gates, cycle time 10^-7 sec, connectivity = 2 → 10x10^14 bit operations per sec Both have similar raw processing capability, but pigeons are better at recognition tasks

Neural terminology and artificial neurons Some general descriptions of ANN’s: http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html http://en.wikipedia.org/wiki/Neural_network McCulloch-Pitts neuron (1943) Threshold (indicator) function of weighted sum of inputs
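
As an illustration of the threshold unit described on this slide, here is a minimal sketch of a McCulloch-Pitts style neuron. The function name, the weights and the threshold are hypothetical choices, picked so that the unit computes a logical AND of two binary inputs.

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Return 1 if the weighted sum of inputs reaches the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0

# Hypothetical weights (1, 1) and threshold 2: the unit fires only when both inputs are 1.
and_outputs = [mcculloch_pitts([a, b], [1, 1], 2) for a in (0, 1) for b in (0, 1)]
```

Lowering the threshold to 1 with the same weights would turn the same unit into a logical OR.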

Goals of ANN’s Develop models of computation inspired by biological systems Study computational capabilities of networks of interconnected neurons Apply these models to real-life applications Learning in NNs = modification (adaptation) of synaptic connections (weights) in response to external inputs

Historical highlights of ANN 1943 McCulloch-Pitts neuron 1949 Hebbian learning 1960’s Rosenblatt (perceptron), Widrow 60’s-70’s dominance of ‘hard’ AI 1980’s resurgence of interest (PDP group, MLP, SOM etc.) 1990’s connection to statistics/VC-theory 2000’s mature field/fragmentation

OUTLINE Objectives Brief history and motivation for artificial neural networks Sequential estimation of model parameters Methods for supervised learning Methods for unsupervised learning Summary and Discussion

Sequential estimation of model parameters Batch vs on-line (iterative) learning - Algorithmic (statistical) approaches ~ batch - Neural-network inspired methods ~ on-line BUT the only difference is on the implementation level (so both types of methods should yield similar generalization) Recall ERM inductive principle (for regression): minimize the empirical risk $R_{emp}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$ Assume dictionary parameterization with fixed basis functions: $f(\mathbf{x}, \mathbf{w}) = \sum_{j} w_j g_j(\mathbf{x})$

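Sequential (on-line) least squares minimization Training pairs $(\mathbf{x}(k), y(k))$ are presented sequentially On-line update equations for minimizing empirical risk (MSE) with respect to parameters w are: $\mathbf{w}(k+1) = \mathbf{w}(k) - \gamma_k \frac{\partial}{\partial \mathbf{w}} L(\mathbf{x}(k), y(k), \mathbf{w})$ (gradient descent learning) where the gradient is computed via the chain rule: $\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w_j}$ and the learning rate $\gamma_k$ is a small positive value (decreasing with k)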

On-line least-squares minimization algorithm Known as delta-rule (Widrow and Hoff, 1960): Given initial parameter estimates w(0), update parameters during each presentation of k-th training sample x(k), y(k) Step 1: forward pass computation - estimated output $\hat{y}(k) = \sum_j w_j(k) x_j(k)$ Step 2: backward pass computation - error term (delta) $\delta(k) = \hat{y}(k) - y(k)$, followed by the update $w_j(k+1) = w_j(k) - \gamma_k \delta(k) x_j(k)$
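
The two steps of the delta rule can be sketched for a single linear unit as follows. This is a minimal illustration under stated assumptions, not the lecture's own code: the data set (noise-free samples of y = 2*x1 - 1), the constant second input used as a bias, and the particular decreasing learning-rate schedule are all hypothetical.

```python
def delta_rule_step(w, x, y, rate):
    """One on-line update for a linear unit; returns (new weights, delta)."""
    y_hat = sum(wj * xj for wj, xj in zip(w, x))   # Step 1: forward pass
    delta = y_hat - y                               # Step 2: error term (delta)
    new_w = [wj - rate * delta * xj for wj, xj in zip(w, x)]
    return new_w, delta

# Hypothetical noise-free data from y = 2*x1 - 1; the second input is a constant 1 (bias).
samples = [([i / 10.0, 1.0], 2 * (i / 10.0) - 1) for i in range(-10, 11)]
w = [0.0, 0.0]
k = 0
for epoch in range(200):            # repeated presentation (recycling) of the training set
    for x, y in samples:
        k += 1
        rate = 0.5 / (1 + 0.001 * k)   # assumed schedule: positive, decreasing with k
        w, _ = delta_rule_step(w, x, y, rate)
```

With this consistent, noise-free data the weights should approach (2, -1); with noisy data the decreasing learning rate is what damps the fluctuations.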

Neural network interpretation of delta rule Forward pass Backward pass Biological learning

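Theoretical basis for on-line learning Standard inductive learning: given training data $(\mathbf{x}_i, y_i)$, $i = 1, \dots, n$, find the model providing min of prediction risk Stochastic Approximation guarantees minimization of risk (asymptotically): $\mathbf{w}(k+1) = \mathbf{w}(k) - \gamma_k \frac{\partial}{\partial \mathbf{w}} L(\mathbf{x}(k), y(k), \mathbf{w})$ under general conditions on the learning rate: $\lim_{k \to \infty} \gamma_k = 0$, $\sum_{k=1}^{\infty} \gamma_k = \infty$, $\sum_{k=1}^{\infty} \gamma_k^2 < \infty$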

Practical issues for on-line learning Given finite training set (n samples): this set is presented sequentially to a learning algorithm many times. Each presentation of n samples is called an epoch, and the process of repeated presentations is called recycling (of training data) Learning rate schedule: initially set large, then slowly decreasing with k (iteration number). Typically ’good’ learning rate schedules are data-dependent. Stopping conditions: (1) monitor the gradient (i.e., stop when the gradient falls below some small threshold) (2) early stopping can be used for complexity control

OUTLINE Objectives Brief history and motivation for artificial neural networks Sequential estimation of model parameters Methods for supervised learning - MultiLayer Perceptron (MLP) networks - Radial Basis Function (RBF) Networks Methods for unsupervised learning Summary and discussion

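Multilayer Perceptrons (MLP) Recall graphical NN representation for dictionary methods: $f(\mathbf{x}, \mathbf{w}, \mathbf{V}) = \sum_{j=1}^{m} w_j z_j$ where each hidden unit computes $z_j = s(\mathbf{x} \cdot \mathbf{v}_j)$ with sigmoid activation $s$ How to estimate parameters (weights) via ERM?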

Learning for a single neuron (delta rule): Forward pass Backward pass How to implement gradient-descent learning in a network of neurons?

Backpropagation training Minimization of the empirical risk (squared error on the training data) with respect to parameters (weights) W, V Gradient descent optimization of this risk, one training sample at a time Careful application of gradient descent leads to the backpropagation algorithm

Backpropagation: forward pass for training input x(k), estimate predicted output

Backpropagation: backward pass update the weights by propagating the error
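
The forward and backward passes can be sketched for an MLP with one hidden layer of sigmoid units and a linear output. This is a hedged illustration rather than the lecture's implementation: the function names, the constant learning rate of 0.05, the weight initialization, and the synthetic data (30 noisy samples of a sine-squared target, with a constant second input acting as a bias) are all assumptions.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def predict(V, w, x):
    """Forward pass only: hidden outputs z, then the linear output."""
    z = [sigmoid(sum(vji * xi for vji, xi in zip(vj, x))) for vj in V]
    return sum(wj * zj for wj, zj in zip(w, z)), z

def backprop_step(V, w, x, y, rate):
    """One on-line update of hidden weights V and output weights w (in place)."""
    y_hat, z = predict(V, w, x)                       # forward pass
    delta_out = y_hat - y                             # output error term
    # Hidden error terms: output delta propagated back through w and s'(a) = z(1 - z).
    delta_hid = [delta_out * wj * zj * (1.0 - zj) for wj, zj in zip(w, z)]
    for j in range(len(w)):
        w[j] -= rate * delta_out * z[j]
        V[j] = [vji - rate * delta_hid[j] * xi for vji, xi in zip(V[j], x)]

def mse(V, w, data):
    return sum((predict(V, w, x)[0] - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
# Hypothetical data: noisy sine-squared target; the second input is a constant bias.
data = [([i / 29.0, 1.0], math.sin(2 * math.pi * i / 29.0) ** 2 + rng.gauss(0, 0.1))
        for i in range(30)]
V = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(5)]  # small initial weights
w = [0.0] * 5
mse_before = mse(V, w, data)
for epoch in range(200):
    rng.shuffle(data)
    for x, y in data:
        backprop_step(V, w, x, y, rate=0.05)
mse_after = mse(V, w, data)
```

The hidden deltas are computed from the output weights before those weights are updated, so each step follows the gradient of the per-sample squared error. Comparing `mse_before` with `mse_after` gives a quick check that training reduced the empirical risk.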

Details of backpropagation Sigmoid activation $s(t) = 1/(1 + e^{-t})$ - simple derivative $s'(t) = s(t)(1 - s(t))$ → Poor behaviour for large |t| ~ saturation How to avoid saturation? - Proper initialization (small weights) - Pre-scaling of inputs (zero mean, unit variance) Learning rate schedule (initial, final) Stopping rules, number of epochs Number of hidden units
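
A short sketch of why saturation matters, assuming the standard logistic sigmoid: its derivative shrinks rapidly as the input grows, so weight updates that pass through a saturated unit become very small. The helper `standardize` is a hypothetical name for the zero-mean, unit-variance pre-scaling mentioned above.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sigmoid_derivative(t):
    s = sigmoid(t)
    return s * (1.0 - s)

def standardize(values):
    """Rescale a list of numbers to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

# The slope is largest at t = 0 and becomes negligible for large t (saturation).
slopes = {t: sigmoid_derivative(t) for t in (0.0, 2.0, 5.0, 10.0)}
scaled = standardize([100.0, 200.0, 300.0])
```

Pre-scaled inputs combined with small initial weights keep the hidden units near t = 0, where the slope is largest.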

Regularization Effect of Backpropagation Backpropagation ~ iterative optimization Final model (weights) depends on: - initial point + final point (stopping rules) → initialization and/or stopping rules can be used for model complexity control

Various forms of complexity control MLP topology ~ number of hidden units Constraints on parameters (weights) ~ weight decay Type of optimization algorithm (many versions of backprop., other opt. methods) Stopping rules Initial conditions (initial ‘small’ weights) Multiple factors make it difficult to control complexity; usually vary one complexity parameter while keeping all others fixed

Example: univariate regression Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1). MLP network (two hidden units) underfitting

Example: univariate regression Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1). MLP network (five hidden units) near optimal

Example: univariate regression Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1). MLP network (20 hidden units) little overfitting

Backpropagation for classification Original MLP is for regression (as shown) For classification: - sigmoid output unit (~ logistic regression using log-likelihood loss – see textbook) - during training, use real-values 0/1 for class labels - during operation, threshold the output of a trained MLP classifier at 0.5 to predict class labels

Classification example (Ripley’s data set) Data set: 250 samples ~ mixture of gaussians, where Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3). The variance of all gaussians is 0.03. MLP classifier (two hidden units) underfitting

Classification Example MLP classifier (three hidden units) ~ near optimal solution

Classification Example MLP classifier (six hidden units) some overfitting

MLP software MLP software widely available in public domain Can handle multi-class problems For example, Netlab toolbox (in Matlab) at http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/ Many commercial products (full of marketing hype) ’Nearly 80% Accurate Market Forecasting Software Get FREE up to date predictions and see for yourself!’

NetTalk (Sejnowski and Rosenberg, 1987) One of the first successful applications of backpropagation: http://www.cnl.salk.edu/ParallelNetsPronounce/index.php Goal: Learning to read (English text) aloud, i.e. Learn Mapping: English text → phonemes using MLP classifier network Network inputs encode a 7-letter window (the 4th letter, in the middle, is the one to be pronounced) Network outputs (26 units) encode phonemes that drive a speech synthesizer The MLP network is trained using labeled data (both individual words and unrestricted text)

NetTalk architecture Input encoding: 7x29 = 203 units Output encoding: 26 units (phonemes) Hidden layer: 80 hidden units
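
The input size of 7x29 = 203 units corresponds to one group of 29 units per window position. A minimal sketch of such a one-hot window encoding is shown below; the slide does not say which 29 symbols are used, so the alphabet here (26 letters plus space, comma and period) and the function name are assumptions for illustration only.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz" + " ,."  # 26 letters + 3 assumed extra symbols
WINDOW = 7

def encode_window(text, center):
    """One-hot encode the 7-character window centred on text[center]."""
    half = WINDOW // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        ch = text[pos] if 0 <= pos < len(text) else " "   # pad beyond the text with spaces
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(ch)] = 1   # raises ValueError for symbols outside ALPHABET
        vector.extend(one_hot)
    return vector

example = encode_window("hello world", 4)   # window "ello wo", centred on the 'o'
```

Each call yields a 203-dimensional binary vector with exactly seven active units, one per window position.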

Listening to NetTalk-generated speech Listen to tape recordings illustrating NETtalk operation, available on YouTube http://www.youtube.com/watch?v=gakJlr3GecE The three recordings contain 3 different audio outputs of NETtalk: (a) during the first 5 minutes of training, starting with weights initialized to zero. (b) after training using the set of 10,000 words. This training set corresponds to 20 passes (epochs) over 500-word text. (c) generated with new text input that was not part of the training set. After listening to these recordings, answer and comment on the following questions: - can you recognize words in recordings (a), (b) and (c)? Explain why. - compare the quality of outputs (b) and (c). Which one seems closer to human speech and why? Question for discussion: Problem 6.8 - Why does NETtalk use a seven-letter window?

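Radial Basis Function (RBF) Networks Dictionary parameterization: $f(\mathbf{x}) = \sum_{j=1}^{m} w_j g_j(\mathbf{x})$ - each basis function is (usually) local - center and width, i.e. Gaussian: $g_j(\mathbf{x}) = \exp(-\|\mathbf{x} - \mathbf{c}_j\|^2 / (2\sigma_j^2))$ Typically used for regression or classification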

RBF network training RBF training (learning) ~ estimation of (1) RBF parameters (centers, width) (2) linear weights w’s Non-adaptive implementation: (1) Estimate RBF parameters via unsupervised learning (only x-values of training data) – can use SOM, GLA etc. (2) Estimate weights w via linear least squares Advantages: - fast training; - useful when x-samples are plentiful but (x,y) data are few Limitations: - cannot discard irrelevant inputs - the curse of dimensionality

Non-adaptive RBF training algorithm Choose the number of basis functions (centers) m. Estimate centers using x-values of training data via unsupervised learning (SOM, GLA, clustering etc.) Determine width parameters using heuristic: For a given center $\mathbf{c}_j$ (a) find the distance to the closest center: $r_j = \min_k \|\mathbf{c}_j - \mathbf{c}_k\|$ for all $k \neq j$ (b) set the width parameter $\sigma_j = \gamma r_j$ where parameter $\gamma$ controls degree of overlap between adjacent basis functions. Typically 4. Estimate weights w via linear least squares (minimization of the empirical risk MSE).
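
The width heuristic can be sketched as below, together with the evaluation of the basis functions that feeds the final least-squares step. This is an illustration under assumptions: the width is taken as the overlap parameter times the distance to the closest other center, the basis function is a Gaussian of the form exp(-d^2 / (2 * width^2)), and the centers and the overlap value 2.0 are arbitrary example inputs. It requires Python 3.8+ for `math.dist`.

```python
import math

def rbf_widths(centers, overlap):
    """Width of each basis function: overlap * distance to its closest other center."""
    widths = []
    for j, cj in enumerate(centers):
        nearest = min(math.dist(cj, ck) for k, ck in enumerate(centers) if k != j)
        widths.append(overlap * nearest)
    return widths

def rbf_design_matrix(xs, centers, widths):
    """Row i holds the Gaussian basis-function outputs for sample xs[i]."""
    return [[math.exp(-math.dist(x, c) ** 2 / (2.0 * s ** 2))
             for c, s in zip(centers, widths)] for x in xs]

# Hypothetical 1D centers, as if produced by a clustering step.
centers = [(0.0,), (1.0,), (3.0,)]
widths = rbf_widths(centers, overlap=2.0)   # overlap=2.0 is an illustrative value only
Z = rbf_design_matrix([(0.0,), (2.0,)], centers, widths)
```

The matrix `Z` plays the role of the regressors in the last step: the weights w are then whatever a linear least-squares solver returns for fitting the training outputs from the columns of `Z`.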

RBF network complexity control RBF model complexity can be controlled by The number of RBFs: Goal: select opt number of units (RBFs) RBF width: Goal: select opt width parameter (for large number of RBF’s) Penalization of large weights w’s See toy examples next (using the number of units as the complexity parameter)

Example: RBF regression Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1). RBF network: automatic width selection (via x-validation) 2 RBF’s underfitting

Example: RBF regression Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1). RBF network: automatic width selection 5 RBF’s ~ optimal

Example: RBF regression Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1). RBF network: automatic width selection 20 RBF’s overfitting

RBF Classification example (Ripley’s data) Data set: 250 samples ~ mixture of gaussians, where Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3). The variance of all gaussians is 0.03. RBF classifier (4 units) some underfitting

RBF Classification example (cont’d) RBF classifier (9 units) Optimal

RBF Classification example (cont’d) RBF classifier (25 units) Little overfitting

OUTLINE Objectives Brief history and motivation for artificial neural networks Sequential estimation of model parameters Methods for supervised learning Methods for unsupervised learning - clustering and vector quantization - Self-Organizing Maps (SOM) - Application example Summary and discussion

Overview Recall from Lecture Set 2: unsupervised learning data reduction approach Example: Training data represented by 3 ‘centers’

Two types of problems 1. Data reduction: VQ + clustering ‘Model’ ~ m points Vector Quantizer Q: VQ setting: given n training samples find the coordinates of m centers (prototypes) such that the total squared error distortion is minimized

2. Dimensionality reduction (linear or nonlinear): ‘Model’ ~ projection of high-dim. data onto low-dim. space. Note: the goal is to estimate a mapping from d-dimensional input space (d=2) to low-dimensional feature space (d*=1)

Vector Quantization and Clustering Two complementary goals of VQ: 1. partition the input space into disjoint regions 2. find positions of units (coordinates of prototypes) Note: optimal partitioning into regions is according to the nearest-neighbor rule (~ the Voronoi regions)

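Generalized Lloyd Algorithm (GLA) for VQ Given data points $\mathbf{x}(k)$, $k = 1, 2, \dots$, loss function L (i.e., squared loss) and initial centers $\mathbf{c}_j(0)$, $j = 1, \dots, m$ Perform the following updates upon presentation of $\mathbf{x}(k)$ 1. Find the nearest center to the data point (the winning unit): $j = \arg\min_i \|\mathbf{x}(k) - \mathbf{c}_i(k)\|$ 2. Update the winning unit coordinates (only) via $\mathbf{c}_j(k+1) = \mathbf{c}_j(k) + \gamma_k [\mathbf{x}(k) - \mathbf{c}_j(k)]$ Increment k and iterate steps (1) – (2) above Note: - the learning rate $\gamma_k$ decreases with iteration number k - biological interpretations of steps (1)-(2) exist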
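
One presentation of a data point in the on-line GLA can be sketched as follows, assuming squared Euclidean distance as the loss. The function name and the example values are hypothetical.

```python
def gla_online_step(centers, x, rate):
    """Move only the winning (nearest) center toward x; returns the winner's index."""
    sq_dist = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
    winner = sq_dist.index(min(sq_dist))                      # step 1: winning unit
    centers[winner] = [ci + rate * (xi - ci)                  # step 2: update winner only
                       for xi, ci in zip(x, centers[winner])]
    return winner

centers = [[0.0, 0.0], [1.0, 1.0]]
winner = gla_online_step(centers, [0.2, 0.0], rate=0.5)
```

In a full run this step would be called once per presented sample, with `rate` decreasing as the iteration number grows.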

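Batch version of GLA Given data points $\mathbf{x}_i$, $i = 1, \dots, n$, loss function L (i.e., squared loss) and initial centers $\mathbf{c}_j$, $j = 1, \dots, m$ Iterate the following two steps 1. Partition the data (assign sample $\mathbf{x}_i$ to unit j) using the nearest neighbor rule. Partitioning matrix Q: $Q_{ij} = 1$ if $\mathbf{c}_j$ is the center nearest to $\mathbf{x}_i$, and $Q_{ij} = 0$ otherwise 2. Update unit coordinates as centroids of the data: $\mathbf{c}_j = \sum_{i=1}^{n} Q_{ij} \mathbf{x}_i / \sum_{i=1}^{n} Q_{ij}$ Note: final solution may depend on initialization (local min) – potential problem for both on-line and batch GLA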

Numeric Example of univariate VQ Given data: {2,4,10,12,3,20,30,11,25}, set m=2 Initialization (random): c1=3, c2=4
Iteration 1 Projection: P1={2,3}, P2={4,10,12,20,30,11,25} Expectation (averaging): c1=2.5, c2=16
Iteration 2 Projection: P1={2,3,4}, P2={10,12,20,30,11,25} Expectation (averaging): c1=3, c2=18
Iteration 3 Projection: P1={2,3,4,10}, P2={12,20,30,11,25} Expectation (averaging): c1=4.75, c2=19.6
Iteration 4 Projection: P1={2,3,4,10,11,12}, P2={20,30,25} Expectation (averaging): c1=7, c2=25
Stop as the algorithm is stabilized with these values
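
The numeric example can be reproduced with a short batch GLA sketch for scalar data. The function name and the stopping test (centers unchanged between iterations) are assumptions; the data and initial centers are those given on the slide.

```python
def batch_gla_1d(data, centers, max_iter=100):
    """Batch GLA on scalar data; returns the final centers and partitions."""
    for _ in range(max_iter):
        # Projection: assign each sample to its nearest center.
        parts = [[] for _ in centers]
        for x in data:
            j = min(range(len(centers)), key=lambda i: (x - centers[i]) ** 2)
            parts[j].append(x)
        # Expectation: move each center to the mean of its partition
        # (a center with an empty partition is left where it is).
        new_centers = [sum(p) / len(p) if p else c for p, c in zip(parts, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, parts

centers, parts = batch_gla_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], [3, 4])
```

Starting from c1=3, c2=4 the centers pass through (2.5, 16), (3, 18) and (4.75, 19.6) before settling at (7, 25), matching the iterations listed above.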

GLA Example 1 Modeling doughnut distribution using 5 units (a) initialization (b) final position (of units)

GLA Example 2 Modeling doughnut distribution using 3 units: Bad initialization → poor local minimum

GLA Example 3 Modeling doughnut distribution using 20 units: 7 units were never moved by the GLA → the problem of unused units (dead units)

Avoiding local minima with GLA - Starting with many random initializations, and then choosing the best GLA solution - Conscience mechanism: forcing ‘dead’ units to participate in competition, by keeping the frequency count (of past winnings) for each unit and using it to handicap frequent winners when the winning unit is selected, i.e. in Step 1 of the on-line version of GLA - Self-Organizing Map: introduce topological relationship (map), thus forcing the neighbors of the winning unit to move towards the data.

Clustering methods Clustering: separating a data set into several groups (clusters) according to some measure of similarity Goals of clustering: interpretation (of resulting clusters) exploratory data analysis preprocessing for supervised learning often the goal is not formally stated VQ-style methods (GLA) often used for clustering, i.e. k-means or c-means Many other clustering methods as well

Clustering (cont’d) Clustering: partition a set of n objects (samples) into k disjoint groups, based on some similarity measure. Assumptions: - similarity ~ distance metric dist (i,j) - usually k given a priori (but not always!) Intuitive motivation: similar objects into one cluster, dissimilar objects into different clusters → the goal is not formally stated Similarity (distance) measure is critical but usually hard to define (objectively). Distance needs to be defined for different types of input variables.

Self-Organizing Maps History and biological motivation Brain changes its internal structure to reflect life experiences → interaction with environment is critical at early stages of brain development (first 1-2 years of life) Existence of various regions (maps) in the brain How might these maps be formed? i.e. what information-processing model leads to map formation? T. Kohonen (early 1980’s) proposed SOM

Goal of SOM Dimensionality reduction: project given (high-dim.) data onto low-dimensional space (called a map) Feature space (Z-space) is 1D or 2D and is discretized as a number of units, i.e., 10x10 map Z-space has distance metric → ordering of units Similarities and differences between VQ and SOM

Self-Organizing Map Discretization of 2D space via 10x10 map. In this discrete space, distance relations exist between all pairs of units. Distance relation ~ map topology

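SOM Algorithm (flow through) Given data points $\mathbf{x}(k)$, $k = 1, 2, \dots$, distance metric in the input space (~ Euclidean), map topology (in z-space), initial position of units $\mathbf{c}_j(0)$, $j = 1, \dots, m$ (in x-space) Perform the following updates upon presentation of $\mathbf{x}(k)$ 1. Find the nearest unit to the data point (the winning unit denoted as z(k)): $z(k) = \arg\min_j \|\mathbf{x}(k) - \mathbf{c}_j(k)\|$ 2. Update all units around the winning unit via $\mathbf{c}_j(k+1) = \mathbf{c}_j(k) + \beta_k K_{\alpha_k}(z_j, z(k)) [\mathbf{x}(k) - \mathbf{c}_j(k)]$, where $\beta_k$ is the learning rate and $K_{\alpha_k}$ is the neighborhood function of width $\alpha_k$ in z-space Increment k, decrease the learning rate and the neighborhood width, and repeat steps (1) – (2) above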
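
A single flow-through SOM update can be sketched for a 1D map as follows. The sketch assumes a Gaussian neighborhood measured between unit indices on the map, and squared Euclidean distance in the input space; the function name and the example values are hypothetical.

```python
import math

def som_online_step(units, x, rate, width):
    """One flow-through SOM update on a 1D map; returns the winner's map index.

    units[j] is the position of map unit j in the input space; the map
    coordinate of unit j is simply its index j.
    """
    sq_dist = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in units]
    winner = sq_dist.index(min(sq_dist))                    # step 1: winning unit
    for j, c in enumerate(units):
        # Gaussian neighborhood measured in map (index) space, not input space.
        h = math.exp(-((j - winner) ** 2) / (2.0 * width ** 2))
        units[j] = [ci + rate * h * (xi - ci) for xi, ci in zip(x, c)]   # step 2
    return winner

units = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
winner = som_online_step(units, [0.0, 1.0], rate=0.5, width=1.0)
```

Unlike the GLA step, every unit moves toward the data point, by an amount that decays with its map distance from the winner. As `width` shrinks toward zero the update reduces to moving the winner only.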

SOM example (one iteration) [figure: Step 1, Step 2]

SOM example (next iteration) [figure: Step 1, Step 2, final map]

Hyper-parameters of SOM SOM performance depends on parameters (~ user-defined): Map dimension and topology (usually 1D or 2D) Number of SOM units ~ quantization level (of z-space) Neighborhood function ~ usually rectangular or gaussian (shape not important) Neighborhood width decrease schedule (important), i.e. exponential decrease for Gaussian, $\alpha(k) = \alpha_{init} (\alpha_{final} / \alpha_{init})^{k / k_{max}}$, with user-defined initial and final widths Also linear decrease of neighborhood width Learning rate schedule (important) (also linear decrease) Note: learning rate and neighborhood decrease should be set jointly
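
An exponential decrease schedule for the neighborhood width can be sketched as below. The exact formula is an assumption (a geometric interpolation between a user-defined initial and final width), and the values 1.0 and 0.05 are borrowed from the US states example later in the lecture purely for illustration.

```python
def neighborhood_width(k, k_max, initial, final):
    """Exponential decrease from `initial` (at k = 0) to `final` (at k = k_max)."""
    return initial * (final / initial) ** (k / k_max)

# Width at the start, the midpoint and the end of a hypothetical 100-iteration run.
widths = [neighborhood_width(k, 100, initial=1.0, final=0.05) for k in (0, 50, 100)]
```

The same function could serve as a learning rate schedule by passing initial and final learning rates, which is one simple way to set the two decreases jointly.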

Modeling uniform distribution via SOM (a) 300 random samples (b) 10X10 map SOM neighborhood: Gaussian Learning rate: linear decrease

Position of SOM units: (a) initial, (b) after 50 iterations, (c) after 100 iterations, (d) after 10,000 iterations

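Batch SOM (similar to Batch GLA) Given data points $\mathbf{x}_i$, $i = 1, \dots, n$, distance metric (i.e., squared loss), map topology and initial centers $\mathbf{c}_j$, $j = 1, \dots, m$ Iterate the following two steps 1. Partition the data into clusters using the minimum distance rule. This results in assignment of n samples to m clusters (units) according to assignment matrix Q 2. Update center coordinates as the weighted average of all data samples, each weighted by the neighborhood function between unit j and the unit z(i) the sample is assigned to: $\mathbf{c}_j = \sum_{i=1}^{n} K_{\alpha}(z_j, z(i)) \mathbf{x}_i / \sum_{i=1}^{n} K_{\alpha}(z_j, z(i))$ Decrease the neighborhood width $\alpha$, and iterate.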
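
One iteration of the batch SOM can be sketched for scalar data on a 1D map. As an assumption, the neighborhood is a Gaussian of the map-index distance between a unit and the winning unit of each sample; the function name and example values are hypothetical.

```python
import math

def batch_som_step(units, data, width):
    """One batch SOM iteration for scalar data on a 1D map; returns new unit positions."""
    # Step 1: assign each sample to its nearest unit (minimum distance rule).
    winners = [min(range(len(units)), key=lambda j: (x - units[j]) ** 2) for x in data]
    # Step 2: each unit becomes a neighborhood-weighted average of all samples.
    new_units = []
    for j in range(len(units)):
        weights = [math.exp(-((j - w) ** 2) / (2.0 * width ** 2)) for w in winners]
        new_units.append(sum(wt * x for wt, x in zip(weights, data)) / sum(weights))
    return new_units

data = [0.0, 0.2, 0.8, 1.0]
narrow = batch_som_step([0.0, 1.0], data, width=0.1)    # behaves almost like batch GLA
wide = batch_som_step([0.0, 1.0], data, width=100.0)    # all units pulled to the data mean
```

With a very narrow neighborhood each unit moves to the centroid of its own cluster, while a very wide neighborhood drags every unit toward the overall mean, which illustrates why the final neighborhood width affects model complexity.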

Example: effect of the final neighborhood width (figure panels: 90%, 50%, 10%)

SOM Applications Two types of applications: Vector Quantization Clustering of multivariate data Main web site: http://www.cis.hut.fi/research/som-research/ Numerous Applications Marketing surveys/ segmentation Financial/ stock market data Text data / document map – WEBSOM Image data / picture map - PicSOM see HUT web site

Practical Issues for SOM Pre-scaling of inputs, usually to [0, 1] range. Why? Map topology: usually 1D or 2D Number of map units (per dimension) Learning rate schedule (for on-line version) Neighborhood type and schedule: Initial size (~1), final size Final neighborhood size + the number of units affect model complexity.

Modeling US states using 1D SOM (performed by Feng Cai) Purpose: clustering of US states Data encoding: each state described by 5 socio-economic indicators: obesity index, result of 2004 presidential elections, median income, mean NAEP, IQ score Data scaling: each input scaled independently to [0,1] range SOM specs: 1D map, 9 units, initial neighborhood width 1, final width 0.05

SOM Modeling 1 of US states

SOM Modeling 2 of US states - remove the voting input and apply 1D SOM:

SOM Modeling 2 of US states (cont’d) - remove voting input and apply 1D SOM:

Clustering of European Languages Background: historical linguistics studies relatedness between languages based on phonology, morphology, syntax and lexicon Difficulty of the problem: due to evolving nature of human languages and globalization. Hypothesis: similarity based on analysis of a small ‘stable’ word set. See glottochronology, Swadesh list, at http://en.wikipedia.org/wiki/Glottochronology

SOM Clustering of European Languages Modeling approach: language ~ 10 word set. Assuming words in different languages are encoded in the same alphabet, it is possible to perform clustering using some distance measure. Issues: - selection of a stable word set - data encoding + distance metric Stable word set: numbers 1 to 10 Data encoding: Latin alphabet, use 3 first letters (in each word)

Numbers word set in 18 European languages Each language is a feature vector encoding 10 words

Data Encoding Word ~ feature vector encoding 3 first letters Alphabet ~ 26 letters + 1 symbol ‘BLANK’ vector encoding: For example, ONE : ‘O’~15 ‘N’~14 ‘E’~05

Word Encoding (cont’d) Word → 27-dimensional feature vector Encoding is insensitive to order (of 3 letters) Encoding of 10-word set: concatenate feature vectors of all words: ‘one’ + ‘two’ + … + ‘ten’ → word set encoded as vector of dim. [1 X 270]
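
The word-set encoding can be sketched as follows. The slides give only the dimensions and the order-insensitivity, so several details here are assumptions: index 0 is used for BLANK and 1 to 26 for the letters A to Z, each vector entry counts occurrences among the first three letters, and words shorter than three letters are padded with BLANK. The function names are hypothetical.

```python
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"   # index 0 = BLANK, 1..26 = letters (assumed layout)

def encode_word(word):
    """27-dimensional letter-count vector for the first three letters of a word."""
    prefix = word.upper()[:3].ljust(3)     # pad short words with BLANK
    vector = [0] * len(ALPHABET)
    for ch in prefix:
        vector[ALPHABET.index(ch)] += 1    # raises ValueError outside the Latin alphabet
    return vector

def encode_word_set(words):
    """Concatenate the per-word vectors, e.g. 10 words -> 270 dimensions."""
    return [v for word in words for v in encode_word(word)]

english = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]
features = encode_word_set(english)
```

Because each word is reduced to letter counts, ‘one’ and any permutation of its first three letters map to the same vector. Two languages can then be compared with an ordinary distance between their 270-dimensional vectors.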

SOM Modeling Approach 2-Dimensional SOM (Batch Algorithm) Number of Units per dimension=4 Initial Neighborhood =1 Final Neighborhood = 0.15 Total Number of Iterations= 70

OUTLINE Objectives Brief history and motivation for artificial neural networks Sequential estimation of model parameters Methods for supervised learning Methods for unsupervised learning Summary and discussion

Summary and Discussion Neural Network methods (vs statistical approaches): - new techniques/ grad descent style methods - simple (brute-force) computational approaches - black-box models (e.g. MLP network) - biological motivation The same fundamental issues: small-sample problems, curse-of-dimensionality, non-linear optimization, complexity control Neural network methods implement ERM or SRM approach (under predictive learning setting) Hype and controversy