Tutorial: Echo State Networks. Dan Popovici, University of Montreal (UdeM). MITACS 2005

Overview
1. Recurrent neural networks: a 1-minute primer
2. Echo state networks
3. Examples, examples, examples
4. Open Issues

1 Recurrent neural networks

Feedforward vs. recurrent NNs
Feedforward: connections only go "from left to right", no connection cycle; activation is fed forward from input to output through "hidden layers"; no memory.
Recurrent: at least one connection cycle; activation can "reverberate" and persist even with no input; a system with memory.

Recurrent NNs, main properties
- map an input time series to an output time series
- can approximate any dynamical system (universal approximation property)
- mathematical analysis is difficult
- learning algorithms are computationally expensive and difficult to master
- few application-oriented publications, little research

Supervised training of RNNs
A. Training: the teacher provides input and correct output; the model is adapted to reproduce the teacher output.
B. Exploitation: the model is driven with new input; the correct output is unknown; the model produces its own output.
[Figure: input/output signal pairs for teacher and model]

Backpropagation through time (BPTT): the most widely used general-purpose supervised training algorithm. Idea: 1. stack copies of the network over time, 2. interpret the stack as a feedforward network, 3. use the backprop algorithm. [Figure: original RNN unrolled into a stack of copies]
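To make the unrolling idea concrete, here is a minimal sketch (not from the original slides; the names W_in, W, W_out and the tanh nonlinearity are illustrative assumptions). It runs a vanilla RNN forward as a stack of identical per-step "layers"; BPTT then applies ordinary backprop through this stack, e.g. via an autodiff framework or a hand-written backward pass.

import numpy as np

def unrolled_forward(u_seq, W_in, W, W_out):
    """Forward pass of a vanilla RNN written as a stack of per-step copies.
    BPTT backpropagates the output error through exactly this stack."""
    x = np.zeros(W.shape[0])
    states, outputs = [], []
    for u in u_seq:                      # one network "copy" per time step
        x = np.tanh(W_in @ u + W @ x)
        states.append(x)
        outputs.append(W_out @ x)
    return np.array(states), np.array(outputs)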

What are ESNs?
- a training method for recurrent neural networks
- black-box modelling of nonlinear dynamical systems
- supervised training, offline and online
- exploits linear methods for nonlinear modelling

Introductory example: a tone generator. Goal: train a network to work as a tuneable tone generator. Input: a frequency setting; output: a sine of the desired frequency.

Tone generator: sampling. During the sampling period, drive the fixed "reservoir" network with the teacher input and output. Observation: the internal states of the dynamical reservoir reflect both the input and the output teacher signals.

Tone generator: compute weights Determine reservoir-to-output weights such that training output is optimally reconstituted from internal "echo" signals.

Tone generator: exploitation. With the new output weights in place, drive the trained network with input. Observation: the network continues to function as in training: the internal states reflect input and output, the output is reconstituted from the internal states, and internal states and output create ("echo", "reconstitute") each other.

Tone generator: generalization. The trained generator network also works with input different from the training input. [Figure: A. step input; B. teacher and learned output; C. some internal states]

Dynamical reservoir
- a large recurrent network (100 - ... units) works as a "dynamical reservoir" ("echo chamber")
- units in the DR respond differently to excitation
- output units combine the different internal dynamics into the desired dynamics

Rich excited dynamics. Unit impulse responses should vary greatly. Achieve this by, e.g., inhomogeneous connectivity, random weights, different time constants, ... [Figure: excitation and unit responses]

Notation and Update Rules
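The notation and formulas on this slide were images in the transcript. Restated here as a reconstruction following Jaeger's standard ESN formulation (GMD Report 159): u(n) is the input, x(n) the reservoir (DR) state, y(n) the output, and f, f^out the unit nonlinearities. The feedback term with W^back is present only for tasks with output feedback, such as the tone generator.

x(n+1) = f\big( W^{\mathrm{in}}\, u(n+1) + W\, x(n) + W^{\mathrm{back}}\, y(n) \big)

y(n+1) = f^{\mathrm{out}}\big( W^{\mathrm{out}}\, [\, u(n+1);\; x(n+1) \,] \big)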

Learning: basic idea. Every stationary deterministic dynamical system can be defined by an equation of the form shown below, where the system function h might be a monster. Combine h from the I/O echo functions by selecting suitable DR-to-output weights.
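The two formulas on this slide were images as well; a reconstruction in the same notation (the exact form shown on the slide is an assumption):

y(n) = h\big( u(n), u(n-1), \dots;\; y(n-1), y(n-2), \dots \big) \;\approx\; \sum_i w^{\mathrm{out}}_i\, x_i(n)

where the x_i(n) are the internal "echo" signals of the dynamical reservoir.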

Offline training: task definition. Let y_teach(n) be the teacher output. Compute output weights such that the mean square error between the teacher output and the network output (recall: a weighted combination of the echo signals) is minimized.
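Restating the objective (the formula was a figure; the form below is the standard one and matches the procedure on the next slide):

\mathrm{MSE} \;=\; \frac{1}{N} \sum_{n=1}^{N} \Big( y_{\mathrm{teach}}(n) - \sum_i w^{\mathrm{out}}_i\, x_i(n) \Big)^2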

Offline training: how it works
1. Let the network run with the training signal, teacher-forced.
2. During this run, collect the network states into a matrix M.
3. Compute output weights such that the MSE is minimized.
The MSE-minimizing weight computation (step 3) is a standard operation; many efficient implementations are available, offline/constructive and online/adaptive.
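A minimal sketch of steps 2 and 3, assuming a reservoir without output feedback and a ridge-regularized least-squares readout; the washout period and all names are illustrative choices, not taken from the slides.

import numpy as np

def collect_states(u_seq, W_in, W, washout=100):
    """Step 2: run the reservoir over the input sequence and stack the states into M."""
    x = np.zeros(W.shape[0])
    states = []
    for n, u in enumerate(u_seq):
        x = np.tanh(W_in @ u + W @ x)
        if n >= washout:                 # discard the initial transient
            states.append(x.copy())
    return np.vstack(states)             # M: one row per retained time step

def train_readout(M, y_teach, ridge=1e-8):
    """Step 3: solve min ||M w - y_teach||^2 + ridge*||w||^2 for the output weights.
    y_teach must have the washout period removed so it aligns with M."""
    R = M.T @ M + ridge * np.eye(M.shape[1])
    return np.linalg.solve(R, M.T @ y_teach)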

Practical considerations
- the reservoir weight matrix W is chosen randomly
- spectral radius of W < 1
- W should be sparse
- input and feedback weights have to be scaled "appropriately"
- adding noise in the update rule can increase generalization performance
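A sketch of a reservoir set up along these lines; the density, scaling constants, and noise level are illustrative values, not taken from the slides.

import numpy as np

def init_reservoir(size=100, density=0.05, spectral_radius=0.8, seed=0):
    """Random sparse reservoir matrix W, rescaled to the desired spectral radius."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, (size, size))
    W *= rng.random((size, size)) < density          # make W sparse
    rho = np.max(np.abs(np.linalg.eigvals(W)))       # current spectral radius
    return W * (spectral_radius / rho)

def update(x, u, W, W_in, input_scale=1.0, noise=1e-4, rng=np.random.default_rng()):
    """One reservoir step with scaled input and a small noise term in the update."""
    pre = W_in @ (input_scale * u) + W @ x
    return np.tanh(pre) + noise * rng.standard_normal(len(x))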

Echo state network training: summary
- use a large recurrent network as an "excitable dynamical reservoir" (DR)
- the DR is not modified through learning
- adapt only the DR-to-output weights
- thereby combine the desired system function from the I/O-history echo functions
- use any offline or online linear regression algorithm to minimize the error

3 Examples, examples, examples

3.1 Short-term memories

Delay line: scheme

Delay line: example
- network size: 400
- delays: 1, 30, 60, 70, 80, 90, 100, 103, 106, 120 steps
- training sequence length: N = 2000
- training signal: random walk with resting states
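One way to set up the delay-line targets (a sketch; the delay list is the one from the slide, everything else is an illustrative assumption): each readout channel is trained on the input delayed by a fixed number of steps.

import numpy as np

def delay_targets(u, delays=(1, 30, 60, 70, 80, 90, 100, 103, 106, 120)):
    """Target matrix whose k-th column is the input u delayed by delays[k] steps."""
    Y = np.zeros((len(u), len(delays)))
    for k, d in enumerate(delays):
        Y[d:, k] = u[:len(u) - d]        # one output channel per delay
    return Y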

Results. [Figure: correct delayed signals and network outputs; traces of some DR internal units]

Delay line: test with different input. [Figure: correct delayed signals and network outputs; traces of some DR internal units]

3.2 Identification of nonlinear systems

Identifying higher-order nonlinear systems: a tenth-order system. [Figures: the system equation; the training setup]
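The system equation appeared only as a figure. The tenth-order NARMA system of Atiya & Parlos (2000), cited on the next slide, is the usual benchmark of this kind; the sketch below generates it under the assumption that this is the system meant here.

import numpy as np

def narma10(N, seed=0):
    """Generate input/output pairs for the tenth-order NARMA benchmark system."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 0.5, N)
    y = np.zeros(N)
    for n in range(9, N - 1):
        y[n + 1] = (0.3 * y[n]
                    + 0.05 * y[n] * np.sum(y[n - 9:n + 1])
                    + 1.5 * u[n - 9] * u[n]
                    + 0.1)
    return u, y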

Results: offline learning
- augmented ESN (800 parameters): NMSE test = ...
- previous published state of the art 1): NMSE train = 0.24
- D. Prokhorov, pers. communication 2): NMSE test = ...
1) Atiya & Parlos (2000), IEEE Trans. Neural Networks 11(3)
2) EKF-RNN, 30 units, 1000 parameters

The Mackey-Glass equation: a delay differential equation; for delay τ > 16.8 it is chaotic; a standard benchmark for time series prediction. [Figure: trajectories for τ = 17 and τ = 30]
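The equation itself was shown as a figure; the standard form of the Mackey-Glass delay differential equation, with the parameter values commonly used for this benchmark (an assumption), is:

\dot{x}(t) \;=\; \frac{0.2\, x(t-\tau)}{1 + x(t-\tau)^{10}} \;-\; 0.1\, x(t)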

Learning setup
- network size: 1000
- training sequence length: N = 3000
- sampling rate: 1

Results for τ = 17. Error for 84-step prediction: NRMSE = 10^-4.2 (averaged over 100 training runs on independently created data). With a refined training method: NRMSE = 10^-5.1. Previous best: NRMSE = 10^-1.7. [Figure: original vs. learnt model]

Prediction with the model: a visible discrepancy after about 1500 steps...

Comparison: NRMSE for 84-step prediction. *) data from the survey in Gers / Eck / Schmidhuber 2000. [Figure: log10(NRMSE) for different methods]

3.3 Dynamic pattern recognition

Dynamic pattern detection 1). Training signal: the output jumps to 1 after the occurrence of a pattern instance in the input. 1) see GMD Report Nr. 152 for detailed coverage

Single-instance patterns: training setup
1. A single-instance, 10-step pattern is randomly fixed.
2. It is inserted into a 500-step random signal at positions 200 (for training) and 350, 400, 450, 500 (for testing).
3. An ESN is trained on the first 300 steps (a single positive instance: "single-shot learning") and tested on the remaining 200 steps.
Test data: 200 steps with 4 occurrences of the pattern on a random background; desired output: impulses at the pattern positions (shown in red in the figure). [Figure: the pattern]
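A sketch of how such a training/test signal might be assembled; positions and lengths would follow the list above, and placing the output impulse at the end of each pattern occurrence is an assumption based on the previous slide.

import numpy as np

def plant_pattern(background, pattern, positions):
    """Insert a fixed pattern into a random background and build the desired output."""
    sig, target = background.copy(), np.zeros(len(background))
    for p in positions:
        sig[p:p + len(pattern)] = pattern
        target[p + len(pattern) - 1] = 1.0   # output jumps to 1 after the pattern
    return sig, target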

Single-instance patterns: results
1. trained network response on the test data
2. network response after training on 800 more pattern-free steps ("negative examples")
3. like 2, but with 5 positive examples in the training data
Discrimination ratios (DR) for cases 1-3: 12.4, 12.1, ...; comparison: optimal linear filter, DR: 3.5.

Event detection for robots (joint work with J. Hertzberg & F. Schönherr). The robot runs through an office environment and experiences data streams (27 channels), e.g. infrared distance sensor, left motor speed, activation of "goThruDoor", plus an external teacher signal marking the event category. [Figure: example channel traces over time (sec)]

Learning setup
- 27 (raw) data channels, an unlimited number of event detector channels
- 100-unit RNN
- simulated robot (rich simulation)
- training run spans 15 simulated minutes
- event categories like: pass through door, pass by 90° corner, pass by smooth corner

Results
- easy to train event hypothesis signals
- "boolean" categories possible
- single-shot learning possible

Network setup in training: 29 input channels code the symbols (a ... z plus punctuation); output channels give hypotheses for the next symbol; 400 units. [Figure: network schematic]

Trained network in "text" generation: a decision mechanism (e.g. winner-take-all) selects the winning symbol, which is fed back as the next input.
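A sketch of the two selection mechanisms compared on the next slide (winner-take-all vs. random draw according to the output activations); the clipping and normalization step is an illustrative assumption.

import numpy as np

def next_symbol(y, mode="winner", rng=np.random.default_rng()):
    """Pick the next symbol index from the output activations y."""
    if mode == "winner":                     # winner-take-all
        return int(np.argmax(y))
    p = np.clip(y, 1e-12, None)              # random draw: treat clipped activations
    p = p / p.sum()                          # as a probability distribution
    return int(rng.choice(len(y), p=p))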

Results
Selection by random draw according to output:
yth_upsghteyshhfakeofw_io,l_yodoinglle_d_upeiuttytyr_hsymua_doey_sammusos_trll,t.krpuflvek_hwiblhooslolyoe,_wtheble_ft_a_gimllveteud_...
Winner-take-all selection:
sdear_oh,_grandmamma,_who_will_go_and_the_wolf_said_the_wolf_said_the_wolf_said_the_wolf_said_the_wolf_said_the_wolf_said_the_wolf...

4 Open Issues

4.2 Multiple timescales
4.3 Additive mixtures of dynamics
4.4 "Switching" memory
4.5 High-dimensional dynamics

Multiple time scales. This is hard to learn (Laser benchmark time series). Reason: two widely separated time scales. Approach for future research: ESNs with different time constants in their units.
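One standard way to give reservoir units different time constants is a leaky-integrator update; it is mentioned here only as an illustration of the proposed direction, not as content from the slides:

x_i(n+1) \;=\; (1 - a_i)\, x_i(n) \;+\; a_i\, f\big( W^{\mathrm{in}} u(n+1) + W\, x(n) \big)_i

where a_i in (0, 1] is a per-unit leak rate; a small a_i makes unit i respond slowly.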

Additive dynamics. This proved impossible to learn [Figure: an additive mixture of two oscillations]. Reason: it requires two independent oscillators, but in an ESN all dynamics are mutually coupled. Approach for future research: modular ESNs and unsupervised multiple-expert learning.

"Switching" memory. This FSA has long-memory "switches": it generates sequences like b aaa...aaa c aaa...aaa b aaa...aaa c aaa...aaa. Generating such sequences is not possible with monotonic, area-bounded forgetting curves. An ESN simply is not a model for long-term memory! [Figure: the FSA over symbols a, b, c; a forgetting curve with bounded area but unbounded width]

High-dimensional dynamics. High-dimensional dynamics would require a very large ESN. Example: 6-DOF nonstationary time series, one-step prediction. 200-unit ESN: RMS = 0.2; 400-unit network: RMS = 0.1; best other training technique 1): RMS = 0.02. Approach for future research: task-specific optimization of ESNs. 1) Prokhorov et al., extended Kalman filtering / BPTT; network size 40, 1400 trained links, training time 3 weeks.

Spreading trouble...
- The signals x_i(n) of the reservoir can be interpreted as vectors in an (infinite-dimensional) signal space.
- The correlation E[xy] yields an inner product on this space.
- The output signal y(n) is a linear combination of these x_i(n).
- The more orthogonal the x_i(n), the smaller the output weights.
[Figure: nearly parallel x_1, x_2 give large weights (y = 30 x_1 - 28 x_2); nearly orthogonal x_1, x_2 give small weights (y = 0.5 x_1 + 0.7 x_2)]

- The eigenvectors v_k of the correlation matrix R = (E[x_i x_j]) are orthogonal signals.
- The eigenvalues λ_k indicate what "mass" of the reservoir signals x_i (all together) is aligned with v_k.
- The eigenvalue spread λ_max / λ_min indicates the overall "non-orthogonality" of the reservoir signals.
[Figure: v_max and v_min for two cases, λ_max / λ_min ≈ 20 vs. λ_max / λ_min ≈ 1]
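A sketch of how this diagnostic can be computed from a collected state matrix M (as in the offline-training sketch above; names are illustrative):

import numpy as np

def eigenvalue_spread(M):
    """lambda_max / lambda_min of the empirical state correlation matrix."""
    R = (M.T @ M) / M.shape[0]           # empirical correlation matrix of the states
    lam = np.linalg.eigvalsh(R)          # R is symmetric: real eigenvalues, ascending
    return lam[-1] / lam[0]              # a huge ratio signals the trouble below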

A large eigenvalue spread leads to large output weights, which
- are harmful for generalization, because slight changes in the reservoir signals induce large changes in the output,
- are harmful for model accuracy, because the estimation error contained in the reservoir signals is magnified (this does not apply to deterministic systems),
- render LMS online adaptive learning useless.

Summary Basic idea: dynamical reservoir of echo states + supervised teaching of output connections. Seemed difficult: in nonlinear coupled systems, every variable interacts with every other. BUT seen the other way round, every variable rules and echoes every other. Exploit this for local learning and local system analysis. Echo states shape the tool for the solution from the task.

Thank you.

References
H. Jaeger (2002): Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach. GMD Report 159, German National Research Center for Information Technology, 2002.
Slides used by Herbert Jaeger at IK2002.