Day 5: Entropy reduction models; connectionist models; course wrap-up Roger Levy University of Edinburgh & University of California – San Diego

Today Other information-theoretic models: Hale 2003 Connectionist models General discussion & course wrap-up

Hale 2003, 2005: Entropy reduction Another attempt to measure reduction in uncertainty arising from a word in a sentence The information contained in a grammar can be measured by the grammar’s entropy But grammars usually denote infinite sets of trees! Not a problem: Grenander 1967 shows how to calculate entropy of any* stochastic context-free branching process [skip details of matrix algebra…] *a probability distribution over an infinite set is not, however, guaranteed to have finite entropy (Cover & Thomas 1991, ch. 2)
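The matrix algebra being skipped is compact enough to sketch. The snippet below is only a minimal illustration, not Hale's implementation: for each nonterminal, h holds the entropy of its rule-choice distribution and A holds the expected number of nonterminal occurrences on rule right-hand sides; Grenander's result gives the per-nonterminal tree entropies as the solution of H = h + A·H. The toy grammar here is hypothetical.

```python
import numpy as np

# A hypothetical toy PCFG: nonterminal -> list of (probability, rhs) pairs;
# symbols that appear as keys are nonterminals, everything else is terminal.
pcfg = {
    "S":  [(0.7, ["NP", "VP"]), (0.3, ["VP"])],
    "NP": [(0.8, ["det", "n"]), (0.2, ["NP", "PP"])],
    "VP": [(0.9, ["v", "NP"]), (0.1, ["v"])],
    "PP": [(1.0, ["p", "NP"])],
}
nonterminals = list(pcfg)
idx = {nt: i for i, nt in enumerate(nonterminals)}
n = len(nonterminals)

# h[X] = entropy (bits) of the rule-choice distribution at nonterminal X
h = np.array([-sum(p * np.log2(p) for p, _ in pcfg[nt]) for nt in nonterminals])

# A[X, Y] = expected number of occurrences of nonterminal Y
# in a one-step expansion of X
A = np.zeros((n, n))
for nt in nonterminals:
    for p, rhs in pcfg[nt]:
        for sym in rhs:
            if sym in idx:
                A[idx[nt], idx[sym]] += p

# Grenander-style solution: H = h + A·H  =>  (I - A) H = h,
# valid when the grammar is consistent (spectral radius of A < 1)
H = np.linalg.solve(np.eye(n) - A, h)
for nt in nonterminals:
    print(f"H({nt}) = {H[idx[nt]]:.3f} bits")
```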

Entropy reduction, cont. A partial input w_1…n rules out some of the possibilities in a grammar Uncertainty after a partial input can be characterized as the entropy of the grammar conditioned on w_1…n We can calculate this uncertainty efficiently by treating a partial input as a function from grammars to grammars (Lang 1988, 1989) [skip details of parsing theory…]

Entropy reduction, punchline Hale’s entropy reduction hypothesis: it takes effort to decrease uncertainty The amount of processing work associated with a word is the uncertainty-reduction it causes (if any) Note that entropy doesn’t always go down! more on this in a moment ER idea originally attributed to Wilson and Carroll (1954)

Intuitions for entropy reduction Consider a string composed of n independent, identically distributed (i.i.d.) variables X_1…X_n, each with 1.5 bits of entropy The entropy of the “grammar” is then 1.5n, for any n When you see w_i, the entropy goes from 1.5(n−i+1) to 1.5(n−i) So the ER of any word is 1.5 Note that ER(w_i) is not really related to P(w_i) (each X_i has 1.5 bits of entropy)
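A few lines of Python make the arithmetic explicit. This is just an illustrative sketch; the particular 1.5-bit distribution below is an assumption (any distribution with 1.5 bits of entropy would do).

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log2(p)).sum())

# One distribution with exactly 1.5 bits of entropy (illustrative choice)
p = [0.5, 0.25, 0.25]
n = 6                          # length of the i.i.d. "sentence"

h_word = entropy(p)            # 1.5 bits per position
for i in range(1, n + 1):
    h_before = h_word * (n - i + 1)   # grammar entropy before seeing w_i
    h_after = h_word * (n - i)        # grammar entropy after seeing w_i
    print(f"w_{i}: ER = {h_before - h_after:.2f} bits")   # always 1.5
```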

Intuitions for entropy reduction (2) Cases when entropy doesn't go down: Consider the following PCFG: ROOT → a b (0.9), ROOT → x y (0.05), ROOT → x z (0.05), so H(G) ≡ H(ROOT) = 0.57 bits If you see a, the only possible resulting tree is [ROOT a b], so entropy goes to 0 for an ER of 0.57 If you see x, there are two equiprobable continuations, [ROOT x y] and [ROOT x z], hence entropy goes to 1 bit for an ER of 0 Entropy goes up, and what's more, the more likely continuation is predicted to be harder
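The same bookkeeping for this toy PCFG, as a rough sketch (the prefix handling is hard-coded for this one-rule-deep grammar):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log2(p)).sum())

# The three ROOT rules and their probabilities from the slide
rules = {("a", "b"): 0.9, ("x", "y"): 0.05, ("x", "z"): 0.05}

h_grammar = entropy(list(rules.values()))        # ≈ 0.57 bits

def er_after_prefix(prefix):
    """Entropy reduction after seeing a one-word prefix."""
    consistent = [p for rhs, p in rules.items() if rhs[0] == prefix]
    h_after = entropy(consistent) if len(consistent) > 1 else 0.0
    return max(h_grammar - h_after, 0.0), h_after

for w in ["a", "x"]:
    er, h_after = er_after_prefix(w)
    print(f"after '{w}': H = {h_after:.2f} bits, ER = {er:.2f} bits")
# after 'a': H = 0.00, ER = 0.57   (the more likely continuation)
# after 'x': H = 1.00, ER = 0.00   (entropy goes up, so ER is clipped to 0)
```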

Entropy reduction applied Hale 2003 looked at object relative clauses versus subject relative clauses One important property of tree entropy is that it is very high before recursive categories NP → NP RC is recursive because a relative clause can expand into just about any kind of sentence So transitions past the end of an NP have a big entropy reduction
the reporter who sent the photographer to the editor (subject RC)
the reporter who the photographer sent to the editor (object RC)

Subject RCs (Hale 2003)

Object RCs (Hale 2003)

Entropy reduction in RCs Crossing past the embedded subject NP is a big entropy reduction This means that reading the embedded verb is predicted to be harder in the object RC than in the subject RC No other probabilistic processing theory seems to get this data point right Pruning, attention shift, competition not applicable: no prominent ambiguity about what has been said Surprisal probably predicts the opposite: the embedded subject NP makes the context more constrained Some non-probabilistic theories* do get this right *e.g., Gibson 1998, 2000

Surprisal & Entropy reduction compared Surprisal and entropy reduction share some features… High-to-low uncertainty transitions are hard …and make some similar predictions… Constrained syntactic contexts are lower-entropy, thus will tend to be easier under ER ER also predicts facilitative ambiguity, since resulting state is higher-entropy

Surprisal & Entropy reduction, cont. …but they are also different… ER compares summary statistics of different probability distributions Surprisal implicitly compares two distributions point-by-point …and make some different predictions Surprisal predicts that more predictable words will be read more quickly in a given constrained context ER penalizes transitions across recursive states such as nominal postmodification (useful for some results in English relative clauses; see Hale 2003)

Future ideas Deriving some of these measures from more mechanistic views of sentence processing Information-theoretic measures on conditional probability distributions Connection with competition models: entropy of a c.p.d. as an index of amount of competition Closer links between specific theoretical measures and specific types of observable behavior Readers might use local entropy (not reduction) to gauge how carefully they must read

Today Other information-theoretic models: Hale 2003 Connectionist models General discussion & course wrap-up

Connectionist models: overview Connectionists have also had a lot to say about sentence comprehension Historically, there has been something of a sociological divide between the connectionists and the “grammars” people It’s useful to understand the commonalities and differences between the two approaches *I know less about connectionist literature than about “grammars” literature!

What are “connectionist models”? A connectionist model is one that uses a neural network to represent knowledge of language and its deployment in real time Biological plausibility often taken as motivation Mathematically, a generalization of the multiclass generalized linear models covered on Wednesday Two types of generalization: additional layers (gives flexibility of nonlinear modeling) recurrence (allows modeling of time-dependence)

Review: (generalized) linear models A (generalized) linear model expresses a predictor as the dot product of a weight vector with the inputs: z_i = w_i · x, where i ranges from 1 to m for an m-class categorization problem For a categorical output variable, the class predictors can be thought of as drawing a decision boundary
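For concreteness, here is a minimal sketch of such a multiclass linear model with a softmax over the class scores; the weights and input below are made up purely for illustration.

```python
import numpy as np

def softmax_predict(W, b, x):
    """Multiclass linear model: one weight vector per class.
    z_i = w_i · x + b_i ; P(class i | x) = softmax(z)_i."""
    z = W @ x + b                        # one score per class
    z = z - z.max()                      # numerical stability
    return np.exp(z) / np.exp(z).sum()

# Hypothetical 3-class problem with 2-dimensional inputs
W = np.array([[ 1.0, -0.5],
              [-1.0,  0.5],
              [ 0.2,  0.2]])
b = np.zeros(3)
x = np.array([0.3, 1.2])

print(softmax_predict(W, b, x))   # class probabilities; the boundary
                                  # between classes i and j lies where
                                  # z_i = z_j, i.e. a hyperplane
```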

[Figure: linear and non-linear decision boundaries; probabilistic interpretation: close to the boundary means P(z1) ≈ 0.5]

Non-linear classification Extra network layers → non-linear decision boundaries Boundaries can be curved, discontinuous, … A double-edged sword: Expressivity can mean good fit using few parameters But difficult to train (local maxima in likelihood surface) Plus danger of overfitting
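A hedged sketch of how one extra layer changes the picture: the tanh hidden layer below is what lets the class boundaries in input space curve. The weights are arbitrary, untrained values chosen only to make the example run.

```python
import numpy as np

def mlp_predict(W1, b1, W2, b2, x):
    """Adding a hidden layer to the linear model above makes the
    decision boundaries in input space non-linear."""
    h = np.tanh(W1 @ x + b1)             # non-linear hidden layer
    z = W2 @ h + b2
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

# Hypothetical sizes: 2 inputs -> 4 hidden units -> 3 classes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)
W2, b2 = rng.normal(size=(3, 4)), np.zeros(3)
print(mlp_predict(W1, b1, W2, b2, np.array([0.3, 1.2])))
```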

Schematic picture of multilayer nets [diagram: input layer → hidden layer 1 → … → hidden layer K → output layer]

Recurrence and online processing The networks shown thus far are “feed-forward”: there are no feedback cycles between nodes (contrast with the CI model of McRae et al.) How to capture the incremental nature of online language processing in a connectionist model? Proposal by Elman 1990, 1991: let part of the internal state of the network feed back into the input later on: the Simple Recurrent Network (SRN)

Simple recurrent networks [diagram: the input layer and a context layer feed into the hidden layers; the context layer is a direct copy of a hidden layer from the previous time step; the hidden layers feed the output layer]
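A minimal sketch of one SRN time step (dimensions and weights are invented for illustration; a real model would train these weights by backpropagation through many such steps):

```python
import numpy as np

def srn_step(W_in, W_rec, W_out, x_t, context):
    """One time step of a simple recurrent network (Elman-style):
    the hidden state combines the current input with a copy of the
    previous hidden state (the 'context layer'), and the output layer
    gives a distribution over the next word."""
    hidden = np.tanh(W_in @ x_t + W_rec @ context)
    z = W_out @ hidden
    z = z - z.max()
    p_next = np.exp(z) / np.exp(z).sum()
    return p_next, hidden              # hidden becomes the next context

# Hypothetical sizes: vocabulary of 10 words, 16 hidden units
V, H = 10, 16
rng = np.random.default_rng(0)
W_in, W_rec, W_out = (rng.normal(scale=0.1, size=s)
                      for s in [(H, V), (H, H), (V, H)])

context = np.zeros(H)
for w in [3, 7, 2]:                    # a toy word-index sequence
    x = np.zeros(V); x[w] = 1.0        # one-hot input
    p_next, context = srn_step(W_in, W_rec, W_out, x, context)
print(p_next)                          # predicted distribution over next word
```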

Recurrence “unfolded” Unfolded over time, the hidden layer at each step becomes the context for the next step The hidden layer has a new role as “memory” of the sentence [diagram: the network unrolled over inputs w1, w2, w3, with outputs w2, …, w4 predicted at each step]

Connectionist models: overview (2) Connectionist work typically draws a closer connection between the acquisition process and results in “normal adult” sentence comprehension Highly evident in Christiansen & Chater 1999 The “grammars” modelers, in contrast, tend to assume perfect acquisition & representation e.g., Jurafsky 1996 & Levy 2005 assumed PCFGs derived directly from a treebank

Connectionist models: overview (3) Two major types of linking hypothesis between the online comprehension process and observables: Predictive hypothesis: next-word error rates should match reading times (starting from Elman 1990, 1991) Gravitational hypothesis (Tabor, Juliano, & Tanenhaus; Tabor & Tanenhaus): more like a competition model In both cases, latent outcomes of the acquisition process are of major interest That is, how does a connectionist network learn to represent things like the MV/RR ambiguity?

Connectionist models for psycholinguistics Words are presented to the network in sequence, e.g. n1 v5 # N3 n8 v2 V4 # … The model is trained to minimize next-word prediction error That is, there is no overt syntax or tree structure built into the model Performance of the model is evaluated on how well it predicts next words This metric turns out to be closely related to surprisal
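A sketch of that evaluation metric: per-word surprisal, -log2 P(w_t | w_1…t-1), computed from any incremental next-word predictor such as the SRN step sketched above. The stand-in predictor here is untrained and purely illustrative.

```python
import numpy as np

def surprisal_profile(step_fn, word_ids, V, H):
    """Per-word surprisal under an incremental next-word predictor;
    summing these (up to log base) gives the prediction-error
    objective such networks are trained to minimize."""
    context = np.zeros(H)
    p_next = np.full(V, 1.0 / V)          # uniform guess before any input
    out = []
    for w in word_ids:
        out.append(-np.log2(p_next[w]))   # surprisal of the observed word
        x = np.zeros(V); x[w] = 1.0       # one-hot encoding
        p_next, context = step_fn(x, context)
    return out

# Untrained stand-in predictor (in practice, the SRN step from earlier)
V, H = 10, 16
rng = np.random.default_rng(1)
W_in, W_rec, W_out = (rng.normal(scale=0.1, size=s)
                      for s in [(H, V), (H, H), (V, H)])

def step_fn(x, context):
    hidden = np.tanh(W_in @ x + W_rec @ context)
    z = W_out @ hidden
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum(), hidden

print([round(s, 2) for s in surprisal_profile(step_fn, [3, 7, 2, 5], V, H)])
```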

Christiansen & Chater 1999 Bach et al. 1986: cross-dependency recursion seems easier to process than nesting recursion Investigation: do connectionist models learn cross-dependency recursion better than nesting recursion? Aand has Jantje the teacher the marbles let help collect up. Aand has Jantje the teacher the marbles up collect help let.

These were small artificial languages, so both networks ultimately learned well But the center-embedding recursions were learned more slowly [figure annotation: poor learning with 5 hidden units (HUs)]

Connectionism summary Stronger association posited between constraints on acquisition and constraints on on-line processing among adults Evidence that cross-serial dependencies are easier for a network architecture to learn N.B. Steedman 1999 has critiqued neural networks for essentially learning n-gram surface structure; this would do better with cross-serial dependencies Underlying issue: do networks learn hierarchical structure? If not, what learning biases would make them do so?

Today Other information-theoretic models: Hale 2003 Connectionist models General discussion & course wrap-up

Course summary: Probabilistic models in psycholinguistics Differences in technical details of models should not obscure major shared features & differences Shared feature: use of probability to model incremental disambiguation computational standpoint: achieves broad coverage with ambiguity management psycholinguistic standpoint: evidence is strong that people do deploy probabilistic information online
“the complex houses” vs. “the corporation fires”
“the {crook/cop} arrested by the detective”
“the group (to the mountain) led”

Course summary (2) Differences: connection with observable behavior Pruning: difficulty when a needed analysis is lost Competition: difficulty when multiple analyses have substantial support Reranking: difficulty when the distribution over analyses changes Next-word prediction (surprisal, connectionism) Note the close relationships among some of these approaches Pruning and surprisal are both special types of reranking Competition has conceptual similarity to attention shift

Course wrap-up: general ideas In most cases, the theories presented here are not mutually exclusive, e.g.: Surprisal and entropy-reduction are presented as full-parallel, but could also be limited-parallel Attention-shift effects could coexist with competition, surprisal, or entropy reduction In some cases, theories say very different things, e.g.: Ambiguity very different under competition vs. surprisal the {daughter/son} of the colonel who shot himself… But even in these cases, different theories may have different natural domains of explanation

Course wrap-up: unresolved issues Serial vs. parallel sentence parsing More of a continuum than a dichotomy Empirically distinguishing serial vs. parallel is difficult Practical limitations of probabilistic models Our ability to estimate surface (n-gram) models is probably beyond that of humans (we have more data!) But our ability to estimate structured (syntactic, semantic) probabilistic models is poor: less annotated data, no real-world semantics Unsupervised methods still poorly understood We're still happy if monotonicities match human data

Course wrap-up: unexplored territory Best-first search strategies for modeling reading times: a possible basis for heavily serial computational parsing models Entropy as a direct measure of various types of uncertainty Entropy of P(Tree | String) as a measure of uncertainty as to what has been said Entropy of P(w_i | w_1…i-1) as a measure of uncertainty as to what may yet be said Models of processing load combining probabilistic and non-probabilistic factors