CS 182 Sections Eva Mok Feb 11, 2004 (bad puns alert!)
Announcements a3 part 1 is due tonight (submit as a3-1). The second tester file is up, so please start part 2. The quiz is graded (get it after class).
Where we stand Last Week –Backprop This Week –Recruitment learning –Color Coming up –Imaging techniques (e.g. fMRI)
The Big (and complicated) Picture
[course map diagram spanning the levels Cognition and Language, Computation, Structured Connectionism, Computational Neurobiology, and Biology, with topics placed along the abstraction axis: Neural Development, Triangle Nodes, Neural Net & Learning, Spatial Relation, Motor Control, Metaphor, SHRUTI, Grammar, Regier Model, Bailey Model, Narayanan Model, Chang Model, Visual System, Psycholinguistics Experiments; the Quiz, Midterm, and Finals are marked along the way]
Quiz
1. What is a localist representation? What is a distributed representation? Why are they both bad?
2. What is coarse-fine encoding? Where is it used in our brain?
3. What can Back-Propagation do that Hebb’s Rule can’t?
4. Derive the Back-Propagation Algorithm.
5. What (intuitively) does the learning rate do? How about the momentum term?
Distributed vs Localist Rep’n
Distributed: John 1100, Paul 0110, George 0011, Ringo 1001
Localist: John 1000, Paul 0100, George 0010, Ringo 0001
What are the drawbacks of each representation?
Distributed vs Localist Rep’n
Distributed: John 1100, Paul 0110, George 0011, Ringo 1001
–What happens if you want to represent a group?
–How many persons can you represent with n bits? 2^n
Localist: John 1000, Paul 0100, George 0010, Ringo 0001
–What happens if one neuron dies?
–How many persons can you represent with n bits? n
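A quick sketch of the two schemes in Python (not from the slides; the 4-bit codes are the ones in the table above, and the choice of which unit “dies” is arbitrary):

```python
# Minimal sketch: localist (one-hot) vs distributed codes for the four Beatles.

localist = {             # one unit per person: n bits -> at most n persons
    "John":   (1, 0, 0, 0),
    "Paul":   (0, 1, 0, 0),
    "George": (0, 0, 1, 0),
    "Ringo":  (0, 0, 0, 1),
}

distributed = {          # patterns over units: n bits -> up to 2^n persons
    "John":   (1, 1, 0, 0),
    "Paul":   (0, 1, 1, 0),
    "George": (0, 0, 1, 1),
    "Ringo":  (1, 0, 0, 1),
}

# Drawback of localist: kill unit 0 and John's code becomes all zeros.
dead = 0
john_after_lesion = tuple(0 if i == dead else b for i, b in enumerate(localist["John"]))
print(john_after_lesion)  # (0, 0, 0, 0) -- John is simply gone

# Drawback of distributed: a "group" is ambiguous. OR-ing John and George
# gives 1111, which also contains Paul's and Ringo's patterns.
group = tuple(a | b for a, b in zip(distributed["John"], distributed["George"]))
members = [name for name, code in distributed.items()
           if all(g >= c for g, c in zip(group, code))]
print(group, members)     # (1, 1, 1, 1) ['John', 'Paul', 'George', 'Ringo']
```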
Visual System 1000 x 1000 visual map For each location, encode: –orientation –direction of motion –speed –size –color –depth Blows up combinatorially!
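A rough back-of-the-envelope count of the blow-up (the figure of 10 distinguishable values per feature is an assumption, not from the slide):

```python
# Rough count: one dedicated unit for every conjunction of feature values
# at every location. "values_per_feature = 10" is an assumed discretization.
locations = 1000 * 1000
features = 6                  # orientation, direction, speed, size, color, depth
values_per_feature = 10
units_needed = locations * values_per_feature ** features
print(f"{units_needed:.1e}")  # 1.0e+12 -- more units than neurons in the brain
```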
Coarse Coding
The info you can encode with one fine-resolution unit = the info you can encode with a few coarse-resolution units.
Now as long as we need fewer coarse units in total, we’re good.
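A minimal sketch of the idea, assuming Gaussian tuning curves and a simple weighted-average readout (all the specific numbers are mine):

```python
import numpy as np

# Coarse coding sketch: 10 broadly tuned units over [0, 100) recover a stimulus
# to much better than their 10-unit spacing, because the stimulus is read off
# the *pattern* of graded activity, not a single winner-take-all unit.
centers = np.arange(5, 100, 10.0)   # 10 coarse units (a fine code would need ~100)
sigma = 15.0                        # broad (coarse) tuning width -- assumed

def responses(x):
    return np.exp(-(x - centers) ** 2 / (2 * sigma ** 2))

x_true = 37.3
r = responses(x_true)
x_hat = np.sum(r * centers) / np.sum(r)   # simple population-average decode
print(round(x_hat, 1))                    # ~37.5: close to 37.3 from only 10 units
```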
Coarse-Fine Coding
but we can run into ghost “images”
[diagram: two stimuli X and Y plotted over Feature 1 (e.g. orientation) and Feature 2 (e.g. direction of motion); one map is coarse in F2 and fine in F1, the other coarse in F1 and fine in F2, so the combined responses also admit two ghosts G]
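A toy version of the ghost problem (my own construction): each map reports one feature precisely but only coarsely constrains the other, so with two stimuli present the conjunctions are ambiguous.

```python
# Two stimuli, each an (orientation, direction-of-motion) pair.
stimuli = {("vertical", "left"), ("horizontal", "right")}

# Map 1 is fine in orientation (ignores direction); map 2 is fine in direction.
orientation_map = {o for o, d in stimuli}   # {'vertical', 'horizontal'}
direction_map   = {d for o, d in stimuli}   # {'left', 'right'}

# Reading the two maps jointly, every combination is equally consistent:
percepts = {(o, d) for o in orientation_map for d in direction_map}
ghosts = percepts - stimuli
print(ghosts)   # ('vertical', 'right') and ('horizontal', 'left') -- the ghost "images"
```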
Back-Propagation Algorithm
We define the error term for a single node to be t_i − y_i.
[diagram: node i computes y_i = f(x_i) from inputs y_j via weights w_ij]
x_i = ∑_j w_ij y_j
y_i = f(x_i)
t_i : target
Sigmoid: f(x) = 1 / (1 + e^−x)
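The forward pass in code, as a sketch (numpy and the function names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Forward pass for node i: weighted sum of its inputs, then the squashing function.
def forward(w_i, y_inputs):
    x_i = np.dot(w_i, y_inputs)   # x_i = sum_j w_ij * y_j
    return sigmoid(x_i)           # y_i = f(x_i)
```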
Gradient Descent
[plot: error surface over two weights, axes labeled i_1 and i_2]
global minimum: this is your goal
it should be 4-D (3 weights) but you get the idea
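A minimal gradient-descent loop on a made-up two-weight error surface, just to show the mechanics:

```python
import numpy as np

# Toy bowl-shaped error surface E(w) with its global minimum at (1, -2).
def E(w):      return 0.5 * ((w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2)
def grad_E(w): return np.array([w[0] - 1.0, w[1] + 2.0])

w = np.array([4.0, 3.0])   # arbitrary starting point
eta = 0.1                  # learning rate
for _ in range(200):
    w -= eta * grad_E(w)   # step downhill along the negative gradient
print(np.round(w, 3))      # -> [ 1. -2.], the global minimum
```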
The output layer
[diagram: three layers k → j → i, with weights w_jk from k to j and w_ij from j to i]
E = Error = ½ ∑_i (t_i − y_i)²,  t_i : target
∂E/∂w_ij = −(t_i − y_i) · f′(x_i) · y_j
The derivative of the sigmoid is just f′(x_i) = y_i (1 − y_i), so the weight change is
Δw_ij = η (t_i − y_i) y_i (1 − y_i) y_j,  where η is the learning rate.
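The output-layer update as a sketch in numpy (function and variable names are mine):

```python
import numpy as np

# Output-layer update: delta_i = (t_i - y_i) * y_i * (1 - y_i),
# then nudge each incoming weight by eta * delta_i * y_j.
def output_layer_update(w_ij, y_j, y_i, t_i, eta=0.1):
    delta_i = (t_i - y_i) * y_i * (1.0 - y_i)
    return w_ij + eta * np.outer(delta_i, y_j), delta_i
```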
The hidden layer
[same diagram: layers k → j → i, weights w_jk and w_ij]
E = Error = ½ ∑_i (t_i − y_i)²,  t_i : target
For a hidden node j, the error has to be collected from all the output nodes it feeds:
∂E/∂w_jk = −[∑_i (t_i − y_i) f′(x_i) w_ij] · f′(x_j) · y_k
Δw_jk = η [∑_i (t_i − y_i) y_i (1 − y_i) w_ij] · y_j (1 − y_j) · y_k
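And the matching hidden-layer step, reusing the delta_i produced by the output-layer sketch above:

```python
import numpy as np

# Hidden-layer update: back-propagate the output deltas through w_ij,
# multiply by the sigmoid derivative at the hidden node, then step.
def hidden_layer_update(w_jk, y_k, y_j, w_ij, delta_i, eta=0.1):
    delta_j = y_j * (1.0 - y_j) * (w_ij.T @ delta_i)   # delta_j = f'(x_j) * sum_i w_ij * delta_i
    return w_jk + eta * np.outer(delta_j, y_k), delta_j
```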
Let’s just do an example
[network: inputs i_1 and i_2 plus bias b = 1, feeding one sigmoid output node y_0 through weights w_01, w_02, w_0b]
E = Error = ½ ∑_i (t_i − y_i)² = ½ (t_0 − y_0)²
Suppose the inputs and weights give net input x_0 = 0.5, so y_0 = 1/(1 + e^−0.5) ≈ 0.6225.
With target t_0 = 0:  E = ½ (0 − 0.6225)² ≈ 0.1937
Suppose the learning rate η = …
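Checking the slide’s arithmetic in code; the particular inputs and weights that give x_0 = 0.5 aren’t recoverable, so the sketch starts from x_0 directly:

```python
import math

x0 = 0.5                           # net input to the output node (from the slide)
y0 = 1.0 / (1.0 + math.exp(-x0))   # sigmoid -> 0.6225
t0 = 0.0                           # target
E = 0.5 * (t0 - y0) ** 2           # -> 0.1937
print(round(y0, 4), round(E, 4))
```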