Principled Probabilistic Inference and Interactive Activation. Psych 209, January 25, 2013.

A Problem For the Interactive Activation Model
Data from many experiments give rise to a pattern corresponding to 'logistic additivity', and we expect such a pattern from a Bayesian point of view. Unfortunately, the original interactive activation model does not exhibit this pattern.
Does this mean that the interactive activation model is fundamentally wrong, i.e. that processing is strictly feedforward (as Massaro believed)? If not, is there a principled basis for understanding interactive activation as principled probabilistic inference?

Joint Effect of Context and Stimulus Information in Phoneme Identification (/l/ or /r/)
From Massaro & Cohen (1991)

Massaro's Model
Joint effects of context and stimulus obey the fuzzy logical model of perception (FLMP):
p(r|S_ij) = t_i c_j / (t_i c_j + (1 - t_i)(1 - c_j))
where t_i is the stimulus support for r given input i and c_j is the contextual support for r given context j.
Massaro sees this model as having a strictly feed-forward organization: Evaluate stimulus and Evaluate context feed into Integration, which in turn feeds Decision.
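Below is a minimal numeric sketch of the FLMP combination rule above; the support values are invented for illustration, not Massaro & Cohen's estimates.

```python
def flmp_p_r(t_i, c_j):
    """Fuzzy Logical Model of Perception: probability of responding 'r'
    given stimulus support t_i and contextual support c_j (both in (0, 1))."""
    return (t_i * c_j) / (t_i * c_j + (1.0 - t_i) * (1.0 - c_j))

# The same ambiguous stimulus (t_i = 0.5) under contexts favoring r, neutral, and favoring l:
for label, c_j in [("favors r", 0.8), ("neutral", 0.5), ("favors l", 0.2)]:
    print(label, round(flmp_p_r(0.5, c_j), 3))   # 0.8, 0.5, 0.2
```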

Massaro's model implies 'logistic additivity':
logit(p_ij) = log(p_ij/(1 - p_ij)) = log(t_i/(1 - t_i)) + log(c_j/(1 - c_j))
The p_ij on this graph corresponds to the p(r|S_ij) on the preceding slide. The horizontal axis runs from L-like to R-like stimuli, and the different lines refer to different context conditions: r means 'favors r', l means 'favors l', and n means 'neutral'.
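A short numeric check of the logistic-additivity identity above, using arbitrary support values (a sketch, not a fit to the experimental data):

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

t_i, c_j = 0.8, 0.3                                  # arbitrary stimulus and context supports
p_ij = (t_i * c_j) / (t_i * c_j + (1 - t_i) * (1 - c_j))

# Under the FLMP, logit(p_ij) equals logit(t_i) + logit(c_j)
print(logit(p_ij), logit(t_i) + logit(c_j))          # both are approximately 0.539
```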

Ideal logistic-additive pattern (upper right) vs. mini-IA simulation results (lower right).

Massaro's argument against the IA model
In the IA model, feature information gets used twice: once on the way up, and then again on the way back down. Feeding the activation back in this way, he suggested, distorts the process of correctly identifying the target phoneme.

Should we agree and give up on interactivity?
Perception of each letter is influenced by the amount of information about every other letter.
– So, it would be desirable to have a way for each letter to facilitate perception of others while it itself is being facilitated.
In speech, there are both 'left' and 'right' context effects. Examples of 'right' context effects:
– '?ift' vs. '?iss'
– 'the ?eel of the {shoe/wagon/orange/leather}'
As we discussed before, there are knock-on effects of context that appear to penetrate the perceptual system, as well as support from neurophysiology.

What was wrong with the Interactive Activation model?
The original interactive activation model 'tacked the variability on at the end', but neural activity is intrinsically stochastic. McClelland (1991) instead incorporated intrinsic variability into the computation of the net input in the IA model.
Rather than choosing probabilistically based on relative activations at the end, we simply choose the alternative with the highest activation after settling. With intrinsic variability in the net inputs, logistic additivity is observed.
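A minimal sketch of a net-input computation with intrinsic variability of the kind referred to above; the zero-mean Gaussian noise term and the parameter names are assumptions for illustration, not the exact equations of McClelland (1991).

```python
import numpy as np

def noisy_net_input(weights, activations, bias, noise_sd=0.1, rng=None):
    """Net input with intrinsic variability: the usual weighted sum of incoming
    activations plus bias, perturbed by zero-mean Gaussian noise on every update."""
    rng = np.random.default_rng() if rng is None else rng
    net = weights @ activations + bias
    return net + rng.normal(0.0, noise_sd, size=net.shape)

# Example: 3 receiving units, 4 sending units (made-up numbers)
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
a = rng.random(4)
b = np.zeros(3)
print(noisy_net_input(W, a, b, rng=rng))
```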

Can we Relate IA to Principled Probabilistic Inference?
We begin with a probabilistic generative model.
We then show how a variant of the IA model samples from the correct posterior of the generative model.

The Generative Model
– Select a word with probability p(w_i).
– Generate letters with probability p(l_jp | w_i).
– Generate feature values with probability p(f_vdp | l_jp).
Note that features are specified as 'present' or 'absent'.
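A toy sketch of this generative process (word -> letters -> features), with an invented four-word lexicon and made-up probability tables; the real model uses 1129 words, 4 letter positions, and 14 binary features per position.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mini-lexicon and tables (all values invented)
words = ["HOW", "HEX", "FEW", "FOX"]
p_word = np.array([0.4, 0.2, 0.2, 0.2])                      # p(w_i)
letters = sorted({ch for w in words for ch in w})            # letter inventory
canon = {ch: rng.integers(0, 2, size=14) for ch in letters}  # made-up canonical feature vectors

def generate(p_letter=0.95, p_feature=0.9):
    """Sample top-down: a word, then a letter per position, then feature values."""
    w = str(rng.choice(words, p=p_word))                     # select a word with p(w_i)
    ls = [ch if rng.random() < p_letter else str(rng.choice(letters)) for ch in w]
    # each feature value equals the letter's canonical value with probability p_feature
    fs = [np.where(rng.random(14) < p_feature, canon[ch], 1 - canon[ch]) for ch in ls]
    return w, ls, fs

word, sampled_letters, feature_values = generate()
print(word, sampled_letters)
```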

The Neural Network Model
The network is viewed as consisting of several multinomial variables, each represented by a pool of units corresponding to mutually exclusive alternatives. There are:
– 4*14 feature-level variables, each with two alternative possible values (not well depicted in the figure)
– 4 letter-level variables, each with 26 possible values
– 1 word-level variable, with 1129 possible values
Connection weights are bi-directional, but their values are the logs of the top-down probabilities given in the generative model. There are biases only at the word level, corresponding to the logs of the p(w_i).
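A minimal sketch of how these weights and biases could be set from the generative model's tables; the array names, shapes, and uniform placeholder values are assumptions for illustration.

```python
import numpy as np

n_words, n_letters, n_positions = 1129, 26, 4

# Placeholder generative-model tables (uniform, just to make the shapes concrete):
p_w = np.full(n_words, 1.0 / n_words)                          # p(w_i)
p_l_given_w = [np.full((n_words, n_letters), 1.0 / n_letters)  # p(l_jp | w_i), one table per position
               for _ in range(n_positions)]

# Bi-directional word <-> letter weights are the logs of the top-down probabilities,
# and word units get biases equal to log p(w_i). Letter <-> feature weights are analogous.
W_word_letter = [np.log(cpt) for cpt in p_l_given_w]
bias_word = np.log(p_w)
```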

The Neural Network Model
– An input, assumed to have been produced by the generative model, is clamped on the units at the feature level.
– The letter and word level variables are initialized to 0. Then we alternate updating the letter and word variables:
  – Letters can be updated in parallel or sequentially.
  – The word is updated after all of the letters.
– Updates occur by calculating each unit's net input based on the active units that have connections to it (and the bias at the word level), then setting the activations using the softmax function.
– A state of the model consists of one active word, four active letters, and 4*14 active features. The hidden state consists of one active word and four active letters.
– We can view each state as a composite hypothesis about what underlying path might have produced the feature values clamped on the input units.
– After a 'burn-in' period, the network visits hidden states with probability proportional to the posterior probability that the partial path corresponding to the hidden state generated the observed features.
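A schematic sketch of one such update for the word-level variable: compute the net input (bias plus the weights from the currently active letter units), apply the softmax, and sample one winner. The function boundaries and placeholder tables are assumptions, not the published implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def sample_state(net_input):
    """Choose one active alternative for a pool, with probability softmax(net input)."""
    return rng.choice(len(net_input), p=softmax(net_input))

def update_word(bias_word, W_word_letter, active_letters):
    """Word update: net input = log p(w_i) + sum over positions of
    log p(active letter in that position | w_i), then sample via softmax."""
    net = bias_word.copy()
    for pos, j in enumerate(active_letters):     # j = index of the active letter in position pos
        net += W_word_letter[pos][:, j]
    return sample_state(net)

# Tiny demo with placeholder tables: 4 words, 3 positions, 5 letters
n_words, n_pos, n_letters = 4, 3, 5
W_word_letter = [np.log(np.full((n_words, n_letters), 1.0 / n_letters)) for _ in range(n_pos)]
bias_word = np.log(np.full(n_words, 1.0 / n_words))
print(update_word(bias_word, W_word_letter, active_letters=[0, 2, 4]))
```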

Sampled and Calculated Probabilities for a Specific Display (? = a random set of feature values)
[Figure 14 from Mirman et al.]

Alternatives to the MIAM Approach
For the effect of context in a specific position:
– Calculate p(w_i | other letters) for all words.
– Use this to calculate p(l_jp | context).
Pearl's procedure:
– Calculate p(w_i | all letters).
– Divide the contribution of position p back out when calculating p(l_jp | context) for each position.
– This produces the correct marginals for each multinomial variable, but doesn't specify their joint distribution (see next slide).
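A sketch of the first alternative above (condition the word variable on the features in the other positions, then marginalize over words); all tables are placeholders. Pearl's procedure arrives at the same per-position marginals by computing p(w_i | all letters) once and dividing each position's contribution back out.

```python
import numpy as np

def contextual_support(p_w, like, p_l_given_w, pos):
    """p(letter j in position `pos` | features observed in all OTHER positions)."""
    other = [like[q] for q in range(len(like)) if q != pos]
    w_post = p_w * np.prod(other, axis=0)        # unnormalized p(word | other positions)
    w_post /= w_post.sum()
    return w_post @ p_l_given_w[pos]             # sum_i p(l_j | w_i) p(w_i | context)

# Placeholder tables: 4 words, 3 positions, 5 letters (values invented)
rng = np.random.default_rng(2)
p_w = np.array([0.4, 0.2, 0.2, 0.2])                                  # p(w_i)
like = [rng.random(4) for _ in range(3)]                              # p(features in position q | w_i)
p_l_given_w = [rng.dirichlet(np.ones(5), size=4) for _ in range(3)]   # p(l_j | w_i) per position
print(contextual_support(p_w, like, p_l_given_w, pos=0))
```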

Joint vs. marginal posterior probabilities
Can you make sense of the given features? In the Rumelhart font, considering each position separately, the likely letters are:
– {H, F}, {E, O}, {X, W}
The known words are:
– HOW, HEX, FEW, FOX
There are constraints between the word and letter possibilities that are not captured by just listing the marginal probabilities. These constraints are captured in samples from the joint posterior.
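A toy illustration of the point, using invented posterior probabilities over the four known words: each position's marginal makes both of its letters look plausible, but the joint posterior assigns zero probability to non-word combinations such as H-E-W.

```python
from collections import defaultdict

# Invented joint posterior over hidden states (a word and its three letters)
joint = {
    ("HOW", ("H", "O", "W")): 0.30,
    ("HEX", ("H", "E", "X")): 0.25,
    ("FEW", ("F", "E", "W")): 0.25,
    ("FOX", ("F", "O", "X")): 0.20,
}

# Per-position letter marginals, obtained by summing over the joint
marginal = [defaultdict(float) for _ in range(3)]
for (word, letters), p in joint.items():
    for pos, letter in enumerate(letters):
        marginal[pos][letter] += p

print([dict(m) for m in marginal])
# Position 0: H=0.55, F=0.45; position 1: O=0.50, E=0.50; position 2: W=0.55, X=0.45.
# Reading off likely letters position by position could suggest 'HEW', which has
# zero probability under the joint: the word-letter constraints live in the joint.
```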

Some Key Concepts
– A generative model as the basis for principled probabilistic inference
– Perception as a probabilistic sampling process
– A sample from the joint posterior as a compound hypothesis
– Joint vs. marginal posteriors
– Interactive neural networks as mechanisms that implement principled probabilistic sampling