Brief Overview of Connectionism to Understand Learning. Walter Schneider, P2476 Cognitive Neuroscience of Human Learning & Instruction

Brief Overview of Connectionism to Understand Learning. Walter Schneider, P2476 Cognitive Neuroscience of Human Learning & Instruction. Slides adapted from the U. Oxford Connectionist Summer School, the Hinton lectures on connectionism, and David Plaut.

Specific Example: NetTalk. Sejnowski, T. J. & Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text. Complex Systems. Learning input: the phonetic transcription of a child's continuous speech.

Simple Units

Learning Rules Change Connection Weights. Learning rules calculate the difference between the desired output and the actual output, and use that difference (the error) to change the weights so as to reduce the error.
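To make the idea concrete, here is a minimal sketch (in Python, with made-up values for the learning rate, input activation, and target) of how an error signal drives a single connection weight toward the value that produces the desired output:

```python
# Minimal sketch of an error-driven weight change on one connection.
# The names and numbers (lr, a_in, target) are illustrative, not from the slides.
lr = 0.1          # learning rate
w = 0.0           # connection weight, starts at zero
a_in = 1.0        # input activation

for step in range(100):
    output = w * a_in            # actual output of the unit
    target = 0.8                 # desired (teacher-supplied) output
    error = target - output      # the difference the learning rule uses
    w += lr * error * a_in       # change the weight so the error shrinks

print(round(w, 3))               # converges to 0.8, where the error is zero
```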

Learning takes roughly 50,000 trials. Note: if we assume 200 words per hour (welfare household) and 5 hr/day, that is 1,000 words/day, or about 50 days. NetTalk demo recording: initial output at 0:46 (20 s); learning word spacing at 2:17 (20 s); after 10K epochs at 3:50 (20 s); transfer at 5:19 (20 s) (elNetsPronounce/index.php). Transfer to new words from the same speaker: 78% correct.

Hart & Risley (2003)

Graceful Deterioration and robust processing with fast relearning

More hidden units: better performance, but slower learning

Unit Coding Unclear in Distributed Code

Hierarchical Clustering: sensible groupings

Performance characteristics
With 120 hidden units:
– 98% on the trained items
– 75% generalization on a dictionary of 20,012 words
– 85% on the first pass; 90% and then 97.5% after 55 passes
Adding 2 hidden layers of 80 units slightly improved generalization (but slows learning):
– 97% after 55 passes, 80% generalization

Summary: Supervised Learning
NetTalk, an example of back-propagation learning:
– Performed computation with simple units, connection weight matrices, and parallel activation
– A learning rule provided an error signal from a supervisor to change connection weights
– It took on the order of 10^5 trials to reach good performance, going through babbling to word production
– Learning speed and generalization varied with the number of units and levels
– Showed good generalization to related words
– Developed a similarity space consistent with human clustering data
– Performance was robust to loss of units and connection noise
– Needed an expert teacher with the ability to reach into the brain to set the correct states

How is this like and not like human learning?
Similar:
– Lots of trials
– Babbling for a while before it makes sense
– Ability to learn any language (e.g., Dutch)
– Generalization to new words
– Creates similarity spaces
Dissimilar:
– The teacher shows exact correctness by activating the correct output units
– Uses DECtalk, allowing only correct simple output
– Very simple network with a small number of units
– Sequential presentation of the target
– Learns reading, not babbling/speech
– Accuracy does not reach human level
– Unlikely to be biologically implementable (high-precision connections; back-propagating precise error across levels)
– Does not learn from instruction, only from experience

Switch to Contrastive Hebbian Learning

Some Fundamental Concepts
– Parallel processing
– Distributed representations
– Learning (multiple types)
– Generalisation
– Graceful degradation

Genres of Network Architecture

Introduction to Neural Computation: the Simplified Neuron. [Figure: a simplified neuron with input connections, a cell body that sums its inputs (Σ) against a threshold (θ), and output connections; and a layered neural network of input neurons and output neurons.]
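A rough sketch of this simplified neuron, assuming a hard threshold; the weights and threshold below are illustrative values, not taken from the slide:

```python
# A threshold unit: weighted sum of inputs (Σ) compared against a threshold (θ).
def threshold_unit(inputs, weights, theta):
    net = sum(i * w for i, w in zip(inputs, weights))  # Σ over the input connections
    return 1 if net >= theta else 0                    # fire only above threshold

# Example: two input neurons feeding one output neuron.
print(threshold_unit([1, 0], weights=[0.6, 0.6], theta=0.5))  # -> 1
print(threshold_unit([0, 0], weights=[0.6, 0.6], theta=0.5))  # -> 0
```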

Introduction to Neural Computation: A Single Output Neuron. [Figure: several input neurons feeding one output neuron.]

The Mapping Principle
Patterns of Activity: an input pattern is transformed into an output pattern.
Activation States are Vectors: each pattern of activity can be considered a unique point in a space of states; the activation vector identifies this point in space.
Mapping Functions: T = F(S). The network maps a source space S (the network inputs) to a target space T (the outputs). The mapping function F is most likely complex; no simple mathematical formula can capture it explicitly.
Hyperspace: input states generally have a high dimensionality, so most network states are considered to populate hyperspace.
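As an illustration of the mapping principle, the sketch below (with an arbitrary weight matrix) maps an input activation vector, a point in the source space S, to an output vector, a point in the target space T. In a real trained network F is far more complex than this single linear map:

```python
import numpy as np

# A weight matrix W sends an input state vector to an output state vector.
W = np.array([[0.2, -0.5, 0.1],
              [0.7,  0.3, -0.4]])     # 2 output units, 3 input units (arbitrary values)

s = np.array([1.0, 0.0, 0.5])         # input activation state: a point in S
t = W @ s                             # output activation state: a point in T
print(t)                              # [0.25 0.5 ]
```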

The Principle of Superposition. [Figure: two weight matrices summed into a composite matrix.]
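A small sketch of superposition, assuming two orthogonal input patterns stored as outer-product matrices (the choice of patterns is illustrative; the slide's own matrices are not reproduced in the transcript):

```python
import numpy as np

# Two input→output associations, each stored as an outer-product weight matrix.
in1, out1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
in2, out2 = np.array([0.0, 1.0]), np.array([1.0, 0.0])

W1 = np.outer(out1, in1)      # matrix storing the first association
W2 = np.outer(out2, in2)      # matrix storing the second association
W = W1 + W2                   # composite matrix: both associations superimposed

print(W @ in1)                # recovers out1 -> [0. 1.]
print(W @ in2)                # recovers out2 -> [1. 0.]
```

Because the input patterns are orthogonal, each association can be read back from the composite matrix without interference.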

Hebbian Learning
Cellular Association: “When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.” (Hebb 1949, p. 50)
Learning Connections: take the product of the excitation of the two cells and change the value of the connection in proportion to this product.
The Learning Rule: Δw = ε · a_out · a_in, where ε is the learning rate.
Changing Connections: if a_in = 0.5, a_out = 0.75, and ε = 0.5, then Δw = 0.5 (0.75) (0.5) = 0.1875; and if w_start = 0.0, then w_next = 0.1875.
Calculating Correlations: [Table: input and output activations.]
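The slide's numbers can be run directly; this sketch simply reproduces the Hebbian update Δw = ε · a_out · a_in:

```python
# Hebbian weight change with the values from the slide.
eps = 0.5                      # learning rate ε
a_in, a_out = 0.5, 0.75        # pre- and post-synaptic activations
w_start = 0.0                  # starting weight

dw = eps * a_out * a_in        # 0.5 * 0.75 * 0.5 = 0.1875
w_next = w_start + dw
print(dw, w_next)              # 0.1875 0.1875
```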

Models of English past tense: PDP accounts
– Single homogeneous architecture
– Superposition
– Competition between different verb types results in overregularisation and irregularisation
– Vocabulary discontinuity
– Rumelhart & McClelland 1986

Using an Error Signal
Orthogonality Constraint: the number of patterns is limited by the dimensionality of the network; input patterns must be orthogonal to each other (similarity effects).
Perceptron Convergence Rule: learning in a single-weight network. Assume a teacher signal t_out. Adaptation of connection and threshold (Rosenblatt 1958). Note that the threshold always changes if the output is incorrect; blame is apportioned to a connection in proportion to the activity of the input line.

Using an Error Signal: Perceptron Convergence Rule
“The perceptron convergence rule guarantees to find a solution to a mapping problem, provided a solution exists.” (Minsky & Papert 1969)
An Example of Perceptron Learning: Boolean OR. [Table: training trace with columns for input, output a_out, weights w_20 and w_21, threshold θ, error δ, Δθ, and Δw.] A sketch of this training loop is given below.
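Since the numeric entries of the training table are not reproduced above, here is a sketch of a perceptron learning Boolean OR under the rule just described; the starting weights, threshold, and learning rate are made-up values:

```python
# Perceptron convergence rule on Boolean OR: two inputs, one threshold output unit.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # OR truth table
w = [0.0, 0.0]    # weights (illustrative starting values)
theta = 0.0       # threshold
lr = 0.25         # learning rate

for epoch in range(10):
    for (x1, x2), target in data:
        out = 1 if w[0] * x1 + w[1] * x2 >= theta else 0
        delta = target - out                 # error signal δ
        w[0] += lr * delta * x1              # blame in proportion to input activity
        w[1] += lr * delta * x2
        theta -= lr * delta                  # threshold moves whenever the output is wrong

print(w, theta)   # a weight/threshold setting that computes OR, e.g. [0.25, 0.25] 0.25
```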

Gradient Descent
Least Mean Square Error (LMS): define the error measure as the square of the discrepancy between the actual output and the desired output (Widrow-Hoff 1960). Plot an error curve for a single-weight network and make weight adjustments by performing gradient descent: always move down the slope.
Calculating the Error Signal: note that Perceptron Convergence and LMS use similar learning algorithms, the Delta Rule.
Error Landscapes: gradient descent algorithms adapt by moving downhill in a multi-dimensional landscape, the error surface (ball-bearing analogy). In a smooth landscape the bottom will always be reached; however, the bottom may not correspond to zero error. [Figure: error plotted against weight value.]
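A minimal sketch of gradient descent on the LMS error for a single-weight network, with illustrative numbers; the weight rolls down the error curve E(w) = (t − w·a)² to its minimum:

```python
# Gradient descent on the squared error for one weight.
# dE/dw = -2 * a * (t - w*a), so stepping downhill gives the update below.
a, t = 1.0, 0.5          # input activation and teacher (target) value
w = -2.0                 # start somewhere up the error surface
lr = 0.1                 # step size

for step in range(50):
    error = t - w * a                # discrepancy between desired and actual output
    grad = -2 * a * error            # slope of the error surface at the current w
    w -= lr * grad                   # move down the slope

print(round(w, 3))                   # settles near 0.5, the bottom of the error curve
```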

Past Tense Revisited
Vocabulary Discontinuity
– Up to 10 epochs: 8 irregulars + 2 regulars; thereafter: 420 verbs, mostly regular.
– Justification: irregulars are more frequent than regulars.
Lack of Evidence
– The vocabulary spurt occurs at 2 years whereas overregularizations occur at 3 years; furthermore, the vocabulary spurt consists mostly of nouns.
– Pinker and Prince (1988) show that regulars and irregulars are relatively balanced in early productive vocabularies.

Longitudinal evidence: stages or phases in development? (1992)
– Initial error-free performance.
– Protracted period of overregularisation, but at low rates (typically < 10%).
– Gradual recovery from error.
– The rate of overregularisation is much less than the rate of regularisation of regular verbs.

Longitudinal evidence: error characteristics
– High-frequency irregulars are robust to overregularisation.
– Some errors seem to be phonologically conditioned.
– Irregularisations.

Single system account: multi-layered perceptrons
– Hidden unit representation
– Error correction technique
– Plunkett & Marchman 1991
– Type/token distinction
– Continuous training set

Single system account: incremental vocabularies
– Plunkett & Marchman (1993)
– Initial small training set
– Gradual expansion
Overregularisation
– Initial error-free performance.
– Protracted period of overregularisation, but at low rates (typically < 5%).
– High-frequency irregulars are robust to overregularisation.

Linear Separability: Boolean AND, OR and XOR

Input    AND   OR   XOR
0 0       0     0     0
0 1       0     1     1
1 0       0     1     1
1 1       1     1     0

Partitioning the problem space: AND and OR can each be separated by a single straight line through input space, but XOR cannot.

Internal Representations: Multi-layered Perceptrons Solving XOR. [Figure: a network with two input units, two hidden units, and one output unit, each unit with threshold θ = 1; a table maps each input pattern through its hidden pattern to the target.]
Representing Similarity Relations: the hidden units transform the input patterns (1,1), (1,0), (0,1), (0,0) into a new similarity space.
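One hand-set solution to XOR with threshold units (θ = 1 throughout, as on the slide); the particular connection weights below are illustrative choices, not necessarily the ones in the figure:

```python
# Two hidden threshold units and one output threshold unit solve XOR.
def unit(inputs, weights, theta=1.0):
    return 1 if sum(i * w for i, w in zip(inputs, weights)) >= theta else 0

def xor_net(x1, x2):
    h1 = unit([x1, x2], [1.0, 1.0])      # fires if at least one input is on (OR)
    h2 = unit([x1, x2], [0.6, 0.6])      # fires only when both inputs are on (AND)
    return unit([h1, h2], [1.0, -1.0])   # OR but not AND = XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))       # 0, 1, 1, 0
```

The hidden units recode the four input patterns so that the output unit's single linear boundary can separate the classes.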

Back-Propagation: Assignment of Blame to Hidden Units
– Local minima. [Figure: error plotted against weight value, showing a global minimum and local minima.]
– Activation functions.
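A compact sketch of back-propagation on XOR with one hidden layer of sigmoid units; the layer size, learning rate, and number of epochs are arbitrary choices, and, as the slide notes, gradient descent can occasionally settle in a local minimum rather than the global one:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
T = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets

W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)   # input -> hidden weights
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)   # hidden -> output weights
sig = lambda z: 1 / (1 + np.exp(-z))
lr = 0.5

for epoch in range(5000):
    h = sig(X @ W1 + b1)                   # forward pass: hidden activations
    y = sig(h @ W2 + b2)                   # forward pass: output activations
    d_out = (y - T) * y * (1 - y)          # error signal at the output units
    d_hid = (d_out @ W2.T) * h * (1 - h)   # blame propagated back to the hidden units
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_hid); b1 -= lr * d_hid.sum(axis=0)

print(np.round(y.ravel(), 2))              # usually close to [0, 1, 1, 0]
```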

Learning Hierarchical Relations: Isomorphic Family Trees
The family tree network; Hinton diagrams of hidden units:
– Unit 1: nationality
– Unit 2: generation
– Unit 3: branch of the tree