Bayesian models of human inductive learning Josh Tenenbaum MIT

Everyday inductive leaps. How can people learn so much about the world from such limited evidence?
– Kinds of objects and their properties
– The meanings of words, phrases, and sentences
– Cause-effect relations
– The beliefs, goals and plans of other people
– Social structures, conventions, and rules
[Figure: “tufa” word-learning example]

Modeling Goals
– Explain how and why human learning and reasoning work, in terms of (approximations to) optimal statistical inference in natural environments.
– Computational-level theories that provide insights into algorithmic- or processing-level questions.
– Principled quantitative models of human behavior, with broad coverage and a minimum of free parameters and ad hoc assumptions.
– A framework for studying people’s implicit knowledge about the world: how it is structured, used, and acquired.
– A two-way bridge to state-of-the-art AI and machine learning.

Explaining inductive learning
1. How does background knowledge guide learning from sparsely observed data? Bayesian inference, with priors based on background knowledge.
2. What form does background knowledge take, across different domains and tasks? Probabilities defined over structured representations: graphs, grammars, rules, logic, relational schemas, theories.
3. How is background knowledge itself learned? Hierarchical Bayesian models, with inference at multiple levels of abstraction.
4. How can background knowledge constrain learning yet maintain flexibility, balancing assimilation and accommodation? Nonparametric models, growing in complexity as the data require.

Two case studies: the number game, and property induction.

The number game. Program input: a number between 1 and 100. Program output: “yes” or “no”.

The number game Learning task: –Observe one or more positive (“yes”) examples. –Judge whether other numbers are “yes” or “no”.

The number game. Examples of “yes” numbers → generalization judgments (N = 20):
– A single example (60): diffuse similarity.
– Rule: “multiples of 10”.
– Focused similarity: numbers near 50-60.
– Rule: “powers of 2”.
– Focused similarity: numbers near 20.
[Figure: human generalization judgments for each example set]

The number game. Main phenomena to explain:
– Generalization can appear either similarity-based (graded) or rule-based (all-or-none).
– Learning from just a few positive examples.

Divisions into “rule” and “similarity” subsystems?
– Category learning: Nosofsky, Palmeri et al. (RULEX); Erickson & Kruschke (ATRIUM).
– Language processing: Pinker, Marcus et al. (past tense morphology).
– Reasoning: Sloman; Rips; Nisbett, Smith et al.

Bayesian model. H: hypothesis space of possible concepts:
– h1 = {2, 4, 6, 8, 10, 12, …, 96, 98, 100} (“even numbers”)
– h2 = {10, 20, 30, 40, …, 90, 100} (“multiples of 10”)
– h3 = {2, 4, 8, 16, 32, 64} (“powers of 2”)
– h4 = {50, 51, 52, …, 59, 60} (“numbers between 50 and 60”)
– …
Representational interpretations for H:
– Candidate rules
– Features for similarity
– “Consequential subsets” (Shepard, 1987)

Three hypothesis subspaces for number concepts:
– Mathematical properties (24 hypotheses): odd, even, square, cube, prime numbers; multiples of small integers; powers of small integers.
– Raw magnitude (5050 hypotheses): all intervals of integers with endpoints between 1 and 100.
– Approximate magnitude (10 hypotheses): decades (1-10, 10-20, 20-30, …).

Bayesian model. H: hypothesis space of possible concepts:
– Mathematical properties: even, odd, square, prime, ….
– Approximate magnitude: {1-10}, {10-20}, {20-30}, ….
– Raw magnitude: all intervals between 1 and 100.
X = {x1, …, xn}: n examples of a concept C. Evaluate hypotheses given data:
– p(h) [prior]: domain knowledge, pre-existing biases.
– p(X|h) [likelihood]: statistical information in examples.
– p(h|X) [posterior]: degree of belief that h is the true extension of C.

Generalizing to new objects. Given p(h|X), how do we compute p(y ∈ C | X), the probability that C applies to some new stimulus y? [Figure: examples x1, x2, x3, x4, a hypothesis h, and background knowledge]

Generalizing to new objects. Hypothesis averaging: compute the probability that C applies to some new object y by averaging the predictions of all hypotheses h, weighted by p(h|X):

p(y ∈ C | X) = Σ_h p(y ∈ C | h) p(h|X), where p(y ∈ C | h) = 1 if y ∈ h and 0 otherwise.

Likelihood: p(X|h). Size principle: smaller hypotheses receive greater likelihood, and exponentially more so as n increases. For n examples sampled randomly from the concept’s extension:

p(X|h) = (1/|h|)^n if x1, …, xn ∈ h, and 0 otherwise.

Follows from the assumption of randomly sampled examples plus the law of “conservation of belief”. Captures the intuition of a “representative” sample.
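
A minimal runnable sketch of this model for the number game, combining the size-principle likelihood with hypothesis averaging. The particular hypothesis list and the 50/50 split of prior mass between math properties and intervals are illustrative assumptions, not the exact values used in the experiments:

```python
# Number-game sketch: size-principle likelihood + hypothesis averaging.
# Hypothesis list and prior weights are assumptions for illustration.
N = 100

# A few mathematical properties plus all intervals of integers in 1..100.
hypotheses = {
    "even": set(range(2, N + 1, 2)),
    "odd": set(range(1, N + 1, 2)),
    "multiples of 10": set(range(10, N + 1, 10)),
    "powers of 2": {2 ** k for k in range(1, 7)},
}
math_names = set(hypotheses)
for lo in range(1, N + 1):
    for hi in range(lo, N + 1):
        hypotheses[f"[{lo},{hi}]"] = set(range(lo, hi + 1))

n_intervals = len(hypotheses) - len(math_names)

def prior(name):
    # Two-level prior: split mass between subspaces, uniform within each.
    return 0.5 / len(math_names) if name in math_names else 0.5 / n_intervals

def likelihood(X, h):
    # Size principle: p(X|h) = (1/|h|)^n if every example lies in h, else 0.
    return (1.0 / len(h)) ** len(X) if all(x in h for x in X) else 0.0

def p_generalize(y, X):
    # Hypothesis averaging: p(y in C | X) = sum over h containing y of p(h|X).
    scores = {name: prior(name) * likelihood(X, h) for name, h in hypotheses.items()}
    Z = sum(scores.values())
    return sum(s for name, s in scores.items() if y in hypotheses[name]) / Z

print(p_generalize(20, [60, 80, 10, 30]))  # high: "multiples of 10" dominates
print(p_generalize(20, [60]))              # intermediate: broad posterior, graded
```

With the four examples the posterior collapses onto “multiples of 10” and generalization looks rule-like; with a single example the posterior stays broad and generalization is graded.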

Illustrating the size principle: two nested hypotheses, h1 and h2. [Figure] With few examples, the data are slightly more of a coincidence under h1; as examples accumulate, the data become much more of a coincidence under h1.

Prior: p(h). The choice of hypothesis space embodies a strong prior: effectively, p(h) ≈ 0 for many logically possible but conceptually unnatural hypotheses. This prevents overfitting by highly specific but unnatural hypotheses, e.g. “multiples of 10 except 50 and 70”.

The “ugly duckling” theorem. How would we generalize without any inductive bias – without constraints on the hypothesis space, informative priors or likelihoods? [Figure: a set of objects and every logically possible hypothesis over them; tallying the hypotheses consistent with the examples gives the same generalization probability for every new object, e.g. 4/8, then 2/4 as examples are added] Without any inductive bias – constraints on hypotheses, informative priors or likelihoods – there is no meaningful generalization!

Posterior: p(h|X) ∝ p(X|h) p(h). For X = {60, 80, 10, 30}:
– Why prefer “multiples of 10” over “even numbers”? The likelihood, p(X|h).
– Why prefer “multiples of 10” over “multiples of 10 except 50 and 20”? The prior, p(h).
– Why does a good generalization need both high prior and high likelihood? p(h|X) ∝ p(X|h) p(h): Occam’s razor, balancing simplicity and fit to data.

Prior: p(h). The choice of hypothesis space embodies a strong prior: effectively, p(h) ≈ 0 for many logically possible but conceptually unnatural hypotheses. Prevents overfitting by highly specific but unnatural hypotheses, e.g. “multiples of 10 except 50 and 70”. p(h) also encodes the relative weights of alternative theories. The total hypothesis space H divides into:
– H1: math properties (24 hypotheses; even numbers, powers of two, multiples of three, …), p(H1) = 1/5, so p(h) = p(H1) / 24.
– H2: raw magnitude (5050 hypotheses), p(H2) = 3/5, so p(h) = p(H2) / 5050.
– H3: approximate magnitude (10 hypotheses), p(H3) = 1/5, so p(h) = p(H3) / 10.
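
This two-level prior can be written directly; a short sketch using the weights shown on the slide:

```python
# p(h) = p(H_i) / |H_i|: mass is divided among subspaces, then spread
# uniformly over the hypotheses within each subspace (values from the slide).
class_mass = {"math": 1 / 5, "raw_magnitude": 3 / 5, "approx_magnitude": 1 / 5}
class_size = {"math": 24, "raw_magnitude": 5050, "approx_magnitude": 10}

def prior(subspace):
    return class_mass[subspace] / class_size[subspace]

print(prior("math"))             # 1/5 / 24   ~ 0.00833
print(prior("raw_magnitude"))    # 3/5 / 5050 ~ 0.000119
print(prior("approx_magnitude")) # 1/5 / 10   = 0.02
```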

[Figure: example sets, human generalization judgments, and Bayesian model predictions]

Examples: 16. [Figure: human generalization vs. Bayesian model]

Connection to feature-based similarity. Similarity as a weighted sum of common features: sim(y, X) = Σ_k w_k f_k(y), summing over features f_k shared by y and the examples X. Bayesian hypothesis averaging: p(y ∈ C | X) = Σ_h p(y ∈ C | h) p(h|X). The two are equivalent if we identify the features f_k with hypotheses h, and the weights w_k with the posterior probabilities p(h|X).
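
A tiny sketch of the equivalence: treating each hypothesis as a binary feature with weight equal to its posterior probability makes the two computations identical (the sets and posterior values below are toy assumptions):

```python
# Hypotheses as binary features f_k(y) = [y in h_k], weights w_k = p(h_k|X).
hyps = {"even": {2, 4, 6, 8}, "small": {1, 2, 3, 4}}
post = {"even": 0.7, "small": 0.3}  # assumed posterior p(h|X)

def hypothesis_averaging(y):
    return sum(post[k] for k in hyps if y in hyps[k])

def weighted_feature_similarity(y):
    return sum(post[k] * (y in hyps[k]) for k in hyps)  # sum_k w_k * f_k(y)

assert all(abs(hypothesis_averaging(y) - weighted_feature_similarity(y)) < 1e-12
           for y in range(1, 9))
```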


Summary of the Bayesian model. How do the statistics of the examples interact with prior knowledge to guide generalization? Why does generalization appear rule-based or similarity-based?
– Broad p(h|X) – many h of similar size, or very few examples (i.e., one): similarity gradient.
– Narrow p(h|X) – one h much smaller than the rest: all-or-none rule.

The need for more flexible priors. Suppose you see these sets of examples:
– 60, 22, 80, 10, 25, 30, 24, 90, 27, 70, 26, 29
– 60, 52, 68, 56, 64, 54, 66, 52, 64
– 60, 80, 10, 30, 55, 40, 20, 80, 10, 70, 55
– 60, 80, 10, 30, 40, 20, 90, 80, 60, 40, 10, 20, 80, 30, 90, 60, 40, 30, 60, 80, 20, 90, 10

Constructing more flexible priors.
– Start with a base set of regularities R and combination operators C; the hypothesis space is the closure of R under C. With C = {and, or}, H is the unions and intersections of regularities in R (e.g., “multiples of 10 between 30 and 70”).
– Or start with a base set of regularities R and allow singleton exceptions: e.g., “multiples of 10 except 50”, “multiples of 10 except 50 and 70”, “multiples of 10 and also 52”.
– Defining a prior – the Bayesian Occam’s razor: model classes are defined by the number of combinations. More combinations → more hypotheses → lower prior.

Prior: all hypotheses divide into model classes (see the sketch below) – simple math properties (~20 hypotheses), math properties with 1 exception (~20 × 100 hypotheses), math properties with 2 exceptions (~20 × 100² hypotheses), … [Figure: hypothesis hierarchy]
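
A sketch of how such a prior can be generated and weighted. The base regularity, the equal split of mass across model classes, and the “~20 properties × ~100 possible exceptions” counts follow the slide’s rough numbers, with the details assumed:

```python
# Hypotheses built from a base regularity plus k singleton exceptions;
# prior mass per hypothesis shrinks with the size of its model class.
from itertools import combinations

base = set(range(10, 101, 10))  # "multiples of 10"

def with_exceptions(h, k):
    # All variants of h with exactly k members removed.
    return [h - set(exc) for exc in combinations(sorted(h), k)]

classes = {k: with_exceptions(base, k) for k in range(3)}

def prior_per_hypothesis(k):
    # Bayesian Occam's razor (assumed form): equal mass per model class,
    # divided by the class size ~ 20 * 100^k from the slide.
    return (1 / 3) / (20 * 100 ** k)

for k, hs in classes.items():
    print(f"{k} exceptions: {len(hs)} variants of this base, "
          f"prior per h ~ {prior_per_hypothesis(k):.2e}")
```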

Alternative models:
– Neural networks
– Hypothesis ranking and elimination
– Similarity to exemplars

Alternative models: neural networks. [Figure: network with feature units such as “even”, “multiple of 10”, “power of 2”, …]

Alternative models: hypothesis ranking and elimination. [Figure: a ranked list of hypotheses – “even”, “multiple of 10”, “power of 2”, …]

Alternative models: similarity to exemplars, using average similarity to the examples. [Figure: model vs. data, r = 0.80]

Alternative models: similarity to exemplars – flexible similarity? That is Bayes.

The universal law of generalization. [Figure: probability of generalization as a function of distance in psychological space]

Explaining the universal law (Tenenbaum & Griffiths, 2001): Bayesian generalization when the hypotheses correspond to convex regions in a low-dimensional metric space (e.g., intervals in one dimension), with an isotropic prior. [Figure: an example x and test points y1, y2, y3]
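
A numerical sketch of this result in one dimension, assuming intervals containing the example, the size-principle likelihood, and an Erlang prior over interval size (a choice under which the gradient comes out essentially exponential):

```python
# 1-D Bayesian generalization over intervals: an exponential-looking gradient.
# Erlang prior over interval size s: p(s) ~ s * exp(-s / sigma).
import numpy as np

sigma = 10.0
sizes = np.arange(1, 500)
p_size = sizes * np.exp(-sizes / sigma)
p_size = p_size / p_size.sum()

def p_generalize(d):
    # Of the intervals of a given size s containing the example, a fraction
    # max(s - d, 0) / s also contains a point at distance d.
    return float(sum(p * max(s - d, 0) / s for s, p in zip(sizes, p_size)))

for d in [0, 5, 10, 20]:
    print(d, round(p_generalize(d), 3), "~ exp(-d/sigma) =", round(np.exp(-d / sigma), 3))
```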

Asymmetric generalization. Assume a gradient of typicality, with examples sampled in proportion to their typicality; the “size” of a hypothesis is then its total typicality rather than its number of members. [Figure: horse vs. camel]
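
A toy sketch of this idea with two nested categories and an assumed typicality gradient: the likelihood of sampling an example becomes its typicality divided by the hypothesis’s total typicality, and generalization between two items is no longer symmetric.

```python
# Typicality-weighted size principle: p(x|h) = typ(x) / total typicality of h.
# Category structure and typicality values are assumptions for illustration.
typ = {"horse": 1.0, "cow": 0.9, "camel": 0.2}

hyps = {
    "typical mammals": {"horse", "cow"},
    "all mammals": {"horse", "cow", "camel"},
}

def likelihood(x, h):
    members = hyps[h]
    return typ[x] / sum(typ[m] for m in members) if x in members else 0.0

def p_generalize(y, x):
    # Uniform prior over the two hypotheses; hypothesis averaging as before.
    post = {h: likelihood(x, h) for h in hyps}
    Z = sum(post.values())
    return sum(p / Z for h, p in post.items() if y in hyps[h])

print(p_generalize("camel", "horse"))  # generalization from horse to camel
print(p_generalize("horse", "camel"))  # differs: generalization is asymmetric
```

Which direction is stronger depends on the hypothesis space and the typicality gradient; the point is only that p(y|x) and p(x|y) need not be equal.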

Symmetry may depend heavily on the context: –Healthy levels of hormone (left) versus healthy levels of toxin (right) –Predicting the durations or magnitudes of events.

Modeling word learning: Bayesian inference over a tree-structured hypothesis space (Xu & Tenenbaum; Schmidt & Tenenbaum). [Figure: “tufa” examples on a taxonomic tree]

Taking stock.
– A model of high-level, knowledge-driven inductive reasoning that makes strong quantitative predictions with minimal free parameters (r² > 0.9 for mean judgments on 180 generalization stimuli, with 3 free numerical parameters).
– Explains qualitatively different patterns of generalization (rules, similarity) as the output of a single general-purpose rational inference engine.
– Differently structured hypothesis spaces account for the different kinds of generalization behavior seen in different domains and contexts.

What’s missing: How do we choose a good prior?
– Can we describe formally how these priors are generated by abstract knowledge or theories?
– Can we move from “weak rational analysis” to “strong rational analysis” in inductive learning? “Weak”: behavior consistent with some reasonable prior. “Strong”: behavior consistent with the “correct” prior given the structure of the world (cf. ideal observer analyses in vision).
– Can we explain how people learn these rich priors?
– Can we work with more flexible priors, not just restricted to a small subset of all logically possible concepts? We would like to be able to learn any concept, even complex and unnatural ones, given enough data (a non-dogmatic prior).

Property induction. How likely is the conclusion, given the premises? Effects of “similarity”, “typicality”, and “diversity”:
– Gorillas have T9 hormones. Seals have T9 hormones. Squirrels have T9 hormones. → Horses have T9 hormones.
– Gorillas have T9 hormones. Chimps have T9 hormones. Monkeys have T9 hormones. Baboons have T9 hormones. → Horses have T9 hormones.
– Gorillas have T9 hormones. Seals have T9 hormones. Squirrels have T9 hormones. → Flies have T9 hormones.

The computational problem (“transfer learning”, “semi-supervised learning”). [Figure: a species × feature matrix – horse, cow, chimp, gorilla, mouse, squirrel, dolphin, seal, rhino, elephant – with a new property column of unknowns] 85 features for 50 animals (Osherson et al.): e.g., for Elephant: ‘gray’, ‘hairless’, ‘toughskin’, ‘big’, ‘bulbous’, ‘longleg’, ‘tail’, ‘chewteeth’, ‘tusks’, ‘smelly’, ‘walks’, ‘slow’, ‘strong’, ‘muscle’, ‘fourlegs’, …

[Figure: hypotheses h – candidate extensions of the new property over the species – with prior P(h); observed premises such as “Horses have T9 hormones”, “Rhinos have T9 hormones”, “Cows have T9 hormones” fix some entries (X), and the query (Y) asks about the remaining species]

Hierarchical Bayesian Framework (Kemp & Tenenbaum). F: form → S: structure → D: data, linked by P(form), P(structure | form), and P(data | structure). [Figure: a tree with species (mouse, squirrel, chimp, gorilla) at the leaf nodes, generating observed features F1-F4 and the new property “has T9 hormones”]

P(D|S): how the structure constrains the data of experience. Define a stochastic process over structure S that generates candidate property extensions h. Intuition: properties should vary smoothly over structure. [Figure: a smooth extension – P(h) high; a non-smooth extension – P(h) low]

P(D|S): how the structure constrains the data of experience. A Gaussian process over the structure (~ random walk, diffusion) generates a continuous value y at each node, which is thresholded to give the property extension h [Zhu, Lafferty & Ghahramani 2003]. [Figure: S → y → threshold → h]

A graph-based prior (Zhu, Lafferty & Ghahramani, 2003). Let d_ij be the length of the edge between i and j (= ∞ if i and j are not connected). A Gaussian prior y ~ N(0, Σ), with the inverse covariance Σ⁻¹ given (up to a regularizing diagonal term) by the graph Laplacian with edge weights w_ij = 1/d_ij.
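
A minimal sketch of this prior on a toy graph. The chain structure, the unit edge lengths, and the small diagonal regularization term are assumptions for illustration:

```python
# Graph-based Gaussian prior: inverse covariance from the graph Laplacian
# (edge weights w_ij = 1 / d_ij), then threshold to get a binary property.
import numpy as np

n = 5                                    # a toy chain graph over 5 species
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0      # d_ij = 1 on each edge

L = np.diag(W.sum(axis=1)) - W           # graph Laplacian
Sigma = np.linalg.inv(L + 1e-2 * np.eye(n))  # small ridge so L is invertible

rng = np.random.default_rng(0)
y = rng.multivariate_normal(np.zeros(n), Sigma)  # smooth over the graph
h = y > 0                                # thresholded property extension
print(np.round(y, 2), h)                 # neighboring nodes tend to agree
```

Properties sampled this way vary smoothly over the graph, which is what makes generalization follow the structure.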

[Figure: a structure S over species 1-10 generating the observed feature data D – the 85 features for 50 animals of Osherson et al., e.g., for Elephant: ‘gray’, ‘hairless’, ‘toughskin’, …]

[cf. Lawrence, 2004; Smola & Kondor, 2003]

[Figure: the same structure S over the species now also generates a new, sparsely observed property alongside the 85 observed features]

Probability of generalization: p(y has the new property | observed examples), computed by averaging over property extensions h. [Figure]

Two arguments, evaluated under tree and 2D priors:
– Gorillas have property P. Mice have property P. Seals have property P. → All mammals have property P.
– Cows have property P. Elephants have property P. Horses have property P. → All mammals have property P.
[Figure: Tree vs. 2D model predictions]

Testing different priors. [Figure: model fits under different inductive biases – correct bias, wrong bias, too weak bias, too strong bias]

A connectionist alternative (Rogers and McClelland, 2004): a network mapping species to features, with emergent structure from clustering on hidden unit activation vectors. [Figure: network architecture]

Learning about spatial properties. Geographic inference task: “Given that a certain kind of Native American artifact has been found in sites near city X, how likely is the same artifact to be found near city Y?” [Figure: Tree vs. 2D model predictions]

Summary so far. A framework for modeling human inductive reasoning as rational statistical inference over structured knowledge representations:
– Qualitatively different priors are appropriate for different domains of property induction.
– In each domain, a prior that matches the world’s structure fits people’s judgments well, and better than alternative priors.
– A language for representing different theories: graph structure defined over objects + a probabilistic model for the distribution of properties over that graph.
Remaining question: How can we learn appropriate structures for different domains?

Hierarchical Bayesian Framework. F: form → S: structure → D: data. [Figure: alternative structures over the same species (mouse, squirrel, chimp, gorilla) – a tree, clusters, and a linear chain – each generating features F1-F4]

F: form → S: structure → D: data. P(structure | form) favors simplicity; P(data | structure) favors smoothness [Zhu et al., 2003]. [Figure: tree, cluster, and linear structures over mouse, squirrel, chimp, gorilla]

Hypothesis space of structural forms: order, chain, ring, partition, hierarchy, tree, grid, cylinder.

Development of structural forms as more data are observed

The “blessing of abstraction”. It is often quicker to learn at higher levels of abstraction:
– Quicker to learn that you have a biased coin than to learn its precise bias, or to learn that you have a second-order polynomial than to learn the precise coefficients.
– Quicker to learn that shape matters most for labeling object categories than to learn the labels for most categories.
– Quicker to learn that a domain is tree-structured than to learn the precise tree that best characterizes it.
Explanation in hierarchical Bayesian models (see the sketch below):
– At higher levels, the hypothesis space gets smaller and simpler, and hypotheses draw support (albeit indirectly) from a broader sample of data.
– The total hypothesis space gets bigger when we add levels of abstraction, but the effective number of degrees of freedom only decreases, because higher levels specify constraints on lower levels.
– Hence the overall learning problem becomes easier.
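
A toy numerical illustration of the coin example under assumed priors – a point hypothesis “fair” versus a Beta(1,1) prior over the weight, with equal prior odds. The abstract question settles from fewer flips than the concrete one:

```python
# Blessing of abstraction, coin version: "is it biased?" converges faster
# than "what exactly is its weight?". Priors are illustrative assumptions.
import numpy as np
from math import lgamma, log, exp, sqrt

def betaln(a, b):
    # log of the Beta function, via log-gamma
    return lgamma(a) + lgamma(b) - lgamma(a + b)

rng = np.random.default_rng(1)
flips = rng.random(100) < 0.9                 # a coin with true weight 0.9

for n in [5, 20, 100]:
    k = int(flips[:n].sum())
    log_fair = n * log(0.5)                   # marginal likelihood, fair coin
    log_biased = betaln(k + 1, n - k + 1)     # marginal, weight ~ Uniform(0,1)
    p_biased = 1 / (1 + exp(log_fair - log_biased))  # equal prior odds
    a, b = k + 1, n - k + 1                   # posterior over the precise weight
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    print(f"{n} flips: p(biased) = {p_biased:.3f}, posterior sd of weight = {sd:.3f}")
```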

Beyond “Nativism” versus “Empiricism”.
– “Nativism”: explicit knowledge of structural forms for core domains is innate. Atran (1998): the tendency to group living kinds into hierarchies reflects an “innately determined cognitive structure”. Chomsky (1980): “The belief that various systems of mind are organized along quite different principles leads to the natural conclusion that these systems are intrinsically determined, not simply the result of common mechanisms of learning or growth.”
– “Empiricism”: general-purpose learning systems without explicit knowledge of structural form. Connectionist networks (e.g., Rogers and McClelland, 2004); traditional structure learning in probabilistic graphical models.

Conclusions. Computational tools for studying core questions of human learning (and building more human-like machine learning):
– What is the structure of knowledge, at multiple levels of abstraction?
– How does abstract domain knowledge guide new learning?
– How can abstract domain knowledge itself be learned?
– How can inductive biases provide strong constraints yet be flexible?
A different way to think about the development of cognition:
– Powerful abstractions can be learned “from the top down”, together with or prior to learning more concrete knowledge.
Going beyond the traditional “either-or” dichotomies:
– How can probabilistic inference over symbolic hypotheses span the range of “rule-based” to “similarity-based” generalization?
– How can domain-general learning mechanisms acquire domain-specific representations?
– How can structured symbolic representations be acquired by statistical learning?