Inferring structure from data
Tom Griffiths
Department of Psychology, Program in Cognitive Science
University of California, Berkeley

Human learning: Categorization, Causal learning, Function learning, Representations, Language, Experiment design, …
Machine learning: Density estimation, Graphical models, Regression, Nonparametric Bayes, Probabilistic grammars, Inference algorithms, …

How much structure exists? How many categories of objects? How many features does an object have? How many words (or rules) are in a language? Learning the things people learn requires rich, unbounded hypothesis spaces.

Nonparametric Bayesian statistics Assume the world contains infinite complexity, of which we only observe a part. Use stochastic processes to define priors on infinite hypothesis spaces:
– Dirichlet process / Chinese restaurant process (Ferguson, 1973; Pitman, 1996)
– Beta process / Indian buffet process (Griffiths & Ghahramani, 2006; Thibaux & Jordan, 2007)

Categorization How do people represent categories? [Figure: example stimuli labeled "cat" and "dog", plus a new object marked "?"]

Prototypes (Posner & Keele, 1968; Reed, 1972) [Figure: "cat" exemplars summarized by a single prototype]

Exemplars (Medin & Schaffer, 1978; Nosofsky, 1986) Store every instance (exemplar) in memory. [Figure: stored "cat" exemplars]

Something in between (Love et al., 2004; Vanpaemel et al., 2005) [Figure: "cat" exemplars]

A computational problem Categorization is a classic inductive problem:
– data: stimulus x
– hypotheses: category c
We can apply Bayes' rule,
P(c|x) = p(x|c)P(c) / Σ_c′ p(x|c′)P(c′)
and choose c such that P(c|x) is maximized.
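To make the Bayes-rule computation above concrete, here is a minimal Python sketch (not from the talk): the category names, uniform priors, and Gaussian likelihoods are invented purely for illustration.

```python
import numpy as np

def posterior_over_categories(x, priors, likelihoods):
    """Bayes' rule for categorization: P(c|x) is proportional to p(x|c) P(c)."""
    unnormalized = {c: likelihoods[c](x) * priors[c] for c in priors}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}

def gaussian_pdf(mu, sd):
    return lambda x: np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Toy example: two categories with Gaussian likelihoods over a 1-D stimulus.
post = posterior_over_categories(
    x=0.8,
    priors={"cat": 0.5, "dog": 0.5},
    likelihoods={"cat": gaussian_pdf(1.0, 0.5), "dog": gaussian_pdf(-1.0, 0.5)},
)
print(max(post, key=post.get), post)  # pick the category that maximizes P(c|x)
```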

Density estimation We need to estimate some probability distributions:
– what is P(c)?
– what is p(x|c)?
Two approaches:
– parametric
– nonparametric
These approaches correspond to prototype and exemplar models respectively (Ashby & Alfonso-Reese, 1995).

Parametric density estimation Assume that p(x|c) has a simple form, characterized by parameters θ (indicating the prototype). [Figure: a single parametric probability density over x]
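A minimal sketch of the parametric approach, assuming a Gaussian form for p(x|c); the 1-D "cat" stimuli are made up, and the fitted mean plays the role of the prototype.

```python
import numpy as np

# Parametric estimate of p(x|c): assume a Gaussian and fit its two parameters.
x_cat = np.array([0.9, 1.1, 1.3, 0.7, 1.0])   # invented 1-D "cat" stimuli
mu_hat, sd_hat = x_cat.mean(), x_cat.std(ddof=1)

def p_x_given_cat(x):
    return np.exp(-0.5 * ((x - mu_hat) / sd_hat) ** 2) / (sd_hat * np.sqrt(2 * np.pi))

print(mu_hat, p_x_given_cat(1.0))  # mu_hat acts as the category prototype
```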

Nonparametric density estimation Approximate a probability distribution as a sum of many "kernels" (one per data point). [Figure: with n = 10 data points, the individual kernels, the estimated function, and the true probability density over x]
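And a matching sketch of the nonparametric approach: a kernel density estimate with one Gaussian kernel per stored exemplar, which is what an exemplar model effectively computes. The data and bandwidth are again invented.

```python
import numpy as np

# Nonparametric (kernel density) estimate of p(x|c): one kernel per exemplar.
x_cat = np.array([0.9, 1.1, 1.3, 0.7, 1.0])   # invented 1-D "cat" exemplars
h = 0.3                                        # kernel bandwidth (assumed)

def kde(x):
    kernels = np.exp(-0.5 * ((x - x_cat) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return kernels.mean()                      # average of the per-exemplar kernels

print(kde(1.0))
```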

Something in between Use a "mixture" distribution, with more than one component per category but fewer components than data points (Rosseel, 2002). [Figure: the mixture distribution and its mixture components, as a probability density over x]

Anderson's rational model (Anderson, 1990, 1991) Treat category labels like any other feature. Define a joint distribution p(x,c) on features using a mixture model, breaking objects into clusters. Allow the number of clusters to vary… a Dirichlet process mixture model (Neal, 1998; Sanborn et al., 2006).

The Chinese restaurant process n customers walk into a restaurant; customer i chooses table z_i with probability
P(z_i = k | z_1, …, z_{i-1}) = n_k / (i - 1 + α) for an existing table k with n_k occupants, and
P(z_i = new table | z_1, …, z_{i-1}) = α / (i - 1 + α).
This defines an exchangeable distribution over seating arrangements (including counts on tables) (Aldous, 1985; Pitman, 1996).
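A minimal sketch of sampling a seating arrangement from the CRP defined above; the value of α and the function name are illustrative.

```python
import numpy as np

def sample_crp(n_customers, alpha, rng=np.random.default_rng(0)):
    tables = []        # tables[k] = number of customers currently at table k
    assignments = []
    for i in range(n_customers):
        # i customers are already seated, so probabilities are
        # n_k / (i + alpha) for existing tables and alpha / (i + alpha) for a new one.
        probs = np.array(tables + [alpha], dtype=float)
        probs /= i + alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)   # open a new table
        else:
            tables[k] += 1
        assignments.append(k)
    return assignments, tables

print(sample_crp(n_customers=10, alpha=1.0))
```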

Dirichlet process mixture model
1. Sample parameters θ_k for each component
2. Assign datapoints to components via the CRP
[Figure: component parameters θ_1, θ_2, … and the datapoints assigned to them]
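Putting the two steps together, here is a sketch of the DP mixture generative process with a Gaussian base measure; the distributions and numerical settings are assumptions for illustration, not the model settings used in the talk.

```python
import numpy as np

def sample_dp_mixture(n, alpha=1.0, base_std=5.0, obs_std=0.5,
                      rng=np.random.default_rng(0)):
    counts, means, data = [], [], []
    for i in range(n):
        # CRP table choice: existing component k with prob counts[k]/(i+alpha),
        # a brand-new component with prob alpha/(i+alpha).
        probs = np.array(counts + [alpha], dtype=float)
        probs /= i + alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):                         # new component
            counts.append(0)
            means.append(rng.normal(0.0, base_std))  # draw theta_k from the base measure
        counts[k] += 1
        data.append(rng.normal(means[k], obs_std))   # draw x_i given theta_k
    return np.array(data), means

x, component_means = sample_dp_mixture(n=20)
print(len(component_means), "components used for", len(x), "datapoints")
```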

A unifying rational model Density estimation is a unifying framework
– a way of viewing models of categorization
We can go beyond this to define a unifying model
– one model, of which all others are special cases
Learners can adopt different representations by adaptively selecting between these cases. Basic tool: two interacting levels of clusters
– results from the hierarchical Dirichlet process (Teh, Jordan, Beal, & Blei, 2004)

The hierarchical Dirichlet process

A unifying rational model [Figure: how the exemplar model, the prototype model, and Anderson's rational model arise as special cases, depending on how clusters relate to exemplars and categories]

HDP+,α and Smith & Minda (1998) HDP+,α will automatically infer a representation using exemplars, prototypes, or something in between (with α being learned from the data). Test on Smith & Minda (1998, Experiment 2). [Figure: the binary stimuli in Category A and Category B, each including exceptions]

HDP+,α and Smith & Minda (1998) [Figure: probability of responding "A" for each stimulus under the prototype, exemplar, and HDP models, together with each model's log-likelihood]

The promise of HDP+,+ In HDP+,+, clusters are shared between categories
– a property of hierarchical Bayesian models
Learning one category has a direct effect on the prior on probability densities for the next category.

Other uses of Dirichlet processes Nonparametric Block Model from the causality lecture
– extends the DPMM to relations
Models of language, where the number of words, syntactic classes, or grammar rules is unknown. Any multinomial can be replaced by a CRP… Extensions:
– hierarchical, nested, dependent, two-parameter, distance-dependent, …

Learning the features of objects Most models of human cognition assume objects are represented in terms of abstract features. What are the features of this object? What determines what features we identify? (Austerweil & Griffiths, 2009)

Binary matrix factorization Represent the data matrix X as the product of a binary feature-ownership matrix Z and a feature weight matrix A: X ≈ ZA.
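As a sketch of the linear-Gaussian factorization assumed here, the observed matrix X is generated as ZA plus noise; the matrix sizes and noise level are invented, and the number of features is fixed by hand (the question the next slide raises).

```python
import numpy as np

rng = np.random.default_rng(0)
n_objects, n_features, n_dims = 6, 3, 8        # illustrative sizes

Z = rng.integers(0, 2, size=(n_objects, n_features))        # which features each object has
A = rng.normal(0.0, 1.0, size=(n_features, n_dims))         # what each feature looks like
X = Z @ A + rng.normal(0.0, 0.1, size=(n_objects, n_dims))  # noisy observed data

print(Z)
print(np.round(X, 2))
```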

How should we infer the number of features?

The nonparametric approach Use the Indian buffet process as a prior on Z. Assume that the total number of features is unbounded, but only a finite number will be expressed in any finite dataset (Griffiths & Ghahramani, 2006).

The Indian buffet process The first customer walks into the Indian restaurant and tastes Poisson(α) dishes from the buffet. The ith customer tastes previously-tasted dish k with probability m_k/i, where m_k is the number of previous tasters, and then tastes Poisson(α/i) new dishes. Customers are exchangeable, as in the CRP (Griffiths & Ghahramani, 2006).
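A minimal sketch of sampling a binary feature matrix Z from the IBP, following the culinary description above; the value of α and the matrix size are illustrative.

```python
import numpy as np

def sample_ibp(n_customers, alpha, rng=np.random.default_rng(0)):
    dish_counts = []                       # dish_counts[k] = customers who tasted dish k
    rows = []
    for i in range(1, n_customers + 1):
        # Taste each previously-tasted dish k with probability m_k / i ...
        row = [rng.random() < m_k / i for m_k in dish_counts]
        # ... then taste Poisson(alpha / i) brand-new dishes.
        n_new = rng.poisson(alpha / i)
        row += [True] * n_new
        dish_counts = [m + int(t) for m, t in zip(dish_counts, row)] + [1] * n_new
        rows.append(row)
    # Pad rows into a rectangular 0/1 matrix (customers x dishes).
    Z = np.zeros((n_customers, len(dish_counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

print(sample_ibp(n_customers=8, alpha=2.0))
```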

The Indian buffet process (Griffiths & Ghahramani, 2006) [Figure: a binary matrix sampled from the IBP, with customers as rows and dishes as columns]

(Austerweil & Griffiths, 2009)

An experiment… (Austerweil & Griffiths, 2009) [Figure: training and testing stimuli for the correlated and factorial conditions, with seen, unseen, and shuffled test items]

Results (Austerweil & Griffiths, 2009)

Other uses of the IBP Prior on sparse binary matrices, used for the number of dimensions in any sparse latent feature model
– PCA, ICA, collaborative filtering, …
Prior on the adjacency matrix of a bipartite graph with one class of nodes having unknown size
– e.g., inferring hidden causes
Interesting link to Beta processes (like the CRP to the DP). Extensions:
– two-parameter, three-parameter, phylogenetic, …

Nonparametric Bayes and the mind Nonparametric Bayesian models provide a way to answer questions of how much structure to infer. For questions like "how many clusters?" we can use the Chinese restaurant/Dirichlet process. For questions like "how many features?" we can use the Indian buffet/Beta process. Lots of room to develop new models…

Learning language: Generative vs. Discriminative A generative learner models the distribution of sentences, P(S|Grammatical); a discriminative learner models labels of grammatical or ungrammatical, P(Grammatical|S). [Figure: sentences S1–S6 under each approach]

An artificial language:
S1) Verb Subject Object
S2) Subject Verb Object
S3) Subject Object Verb
[Table: grammatical (+) and ungrammatical (-) presentations of each verb across sentence types S1–S3 — V1: + (9), - (6); V2: - (3), + (18), - (3); V3: + (18), - (3); V4*: + (18), - (0), - (6)]

Generative vs. Discriminative Generative learning is modeled with a hierarchical Bayesian model; discriminative learning is modeled with logistic regression. [Figure: sentences labeled grammatical vs. ungrammatical under each approach]
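Here is a minimal sketch of the discriminative side only: a small logistic regression mapping indicator features of a sentence (which verb, which sentence type) to P(Grammatical|S). The feature coding, toy data, and learning settings are invented and are not the model from the talk.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = sigmoid(X @ w)
        w += lr * X.T @ (y - p) / len(y)   # gradient ascent on the log-likelihood
    return w

# Toy data: rows are (verb one-hot ++ sentence-type one-hot); labels: 1 = grammatical.
X = np.array([[1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0], [0, 1, 0, 1]], dtype=float)
y = np.array([1, 0, 0, 1], dtype=float)
w = fit_logistic(X, y)
print(np.round(sigmoid(X @ w), 2))         # estimated P(Grammatical | S) per sentence
```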

Model Predictions

Condition 1: Generative learning [Figure: an adult speaker who is always grammatically correct and a child speaker who is always grammatically incorrect]

blergen norg nagid

nagid blergen semz

flern Condition 2: Discriminative learning

flern Condition 2: Discriminative learning

semz Condition 2: Discriminative learning

Human language learning [Figure: results for human learners; asterisks mark the significant difference, χ²(1) = 7.28, p = 0.007]