Part IV: Inference algorithms

Estimation and inference Actually working with probabilistic models requires solving some difficult computational problems… Two key problems: –estimating parameters in models with latent variables –computing posterior distributions involving large numbers of variables

Part IV: Inference algorithms The EM algorithm –for estimation in models with latent variables Markov chain Monte Carlo –for sampling from posterior distributions involving large numbers of variables

SUPERVISED (figure: labeled examples, "dog" and "cat")

Supervised learning Category A / Category B What characterizes the categories? How should we categorize a new observation?

Parametric density estimation Assume that p(x|c) has a simple form, characterized by parameters θ Given stimuli X = x_1, x_2, …, x_n from category c, find θ by maximum-likelihood estimation or some form of Bayesian estimation

Spatial representations Assume a simple parametric form for p(x|c): a Gaussian For each category, estimate parameters –mean μ –variance σ² (graphical model: c → x, with P(c) and p(x|c))

The Gaussian distribution Probability density p(x) = exp(−(x − μ)² / 2σ²) / √(2πσ²), plotted as a function of (x − μ)/σ; mean μ, standard deviation σ, variance σ²

Estimating a Gaussian X = {x 1, x 2, …, x n } independently sampled from a Gaussian

Estimating a Gaussian X = {x_1, x_2, …, x_n} independently sampled from a Gaussian; maximum likelihood parameter estimates: μ = (1/n) ∑_i x_i, σ² = (1/n) ∑_i (x_i − μ)²

Multivariate Gaussians p(x) ∝ exp(−½ (x − μ)^T Σ^-1 (x − μ)); mean μ, variance/covariance matrix Σ, quadratic form (x − μ)^T Σ^-1 (x − μ)

Estimating a Gaussian X = {x_1, x_2, …, x_n} independently sampled from a multivariate Gaussian; maximum likelihood parameter estimates: μ = (1/n) ∑_i x_i, Σ = (1/n) ∑_i (x_i − μ)(x_i − μ)^T
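
For concreteness, here is a short NumPy sketch of these maximum-likelihood estimates on simulated two-dimensional data; the data, seed, and variable names are illustrative choices of mine, not from the slides.

```python
import numpy as np

# Simulate n draws from a 2-D Gaussian (illustrative parameters)
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=1000)

# Maximum-likelihood estimates: sample mean and (1/n) covariance
mu_hat = X.mean(axis=0)
centered = X - mu_hat
Sigma_hat = centered.T @ centered / X.shape[0]

print(mu_hat)     # close to true_mu
print(Sigma_hat)  # close to true_Sigma
```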

Bayesian inference (figure: probability plotted against x)

UNSUPERVISED

Unsupervised learning What latent structure is present? What are the properties of a new observation?

An example: Clustering Assume each observed x i is from a cluster c i, where c i is unknown What characterizes the clusters? What cluster does a new x come from?

Density estimation We need to estimate some probability distributions –what is P(c)? –what is p(x|c)? But… c is unknown, so we only know the value of x (graphical model: c → x, with P(c) and p(x|c))

Supervised and unsupervised Supervised learning: categorization Given x = {x_1, …, x_n} and c = {c_1, …, c_n} Estimate parameters θ of p(x|c) and P(c) Unsupervised learning: clustering Given x = {x_1, …, x_n} Estimate parameters θ of p(x|c) and P(c)

Mixture distributions (figure: probability plotted against x) The mixture distribution p(x) = ∑_c P(c) p(x|c) combines mixture components p(x|c) with mixture weights P(c)

More generally… Unsupervised learning is density estimation using distributions with latent variables: z latent (unobserved), x observed (graphical model: z → x, with P(z) and P(x|z)) Marginalize out (i.e. sum over) latent structure: P(x) = ∑_z P(x|z) P(z)

A chicken and egg problem If we knew which cluster the observations were from we could find the distributions –this is just density estimation If we knew the distributions, we could infer which cluster each observation came from –this is just categorization

Alternating optimization algorithm 0. Guess initial parameter values 1. Given parameter estimates, solve for maximum a posteriori assignments c_i: c_i = argmax_c P(c | x_i, θ) 2. Given assignments c_i, solve for maximum likelihood parameter estimates: θ = argmax_θ ∏_i p(x_i | c_i, θ) P(c_i) 3. Go to step 1

Alternating optimization algorithm x: observations; c: assignments to clusters; μ, σ², P(c): parameters For simplicity, assume σ² and P(c) fixed: the “k-means” algorithm

Step 0: initial parameter values Alternating optimization algorithm

Step 1: update assignments Alternating optimization algorithm

Step 2: update parameters Alternating optimization algorithm

Step 1: update assignments Alternating optimization algorithm

Step 2: update parameters Alternating optimization algorithm

0. Guess initial parameter values 1. Given parameter estimates, solve for maximum a posteriori assignments c_i: c_i = argmax_c P(c | x_i, θ) 2. Given assignments c_i, solve for maximum likelihood parameter estimates: θ = argmax_θ ∏_i p(x_i | c_i, θ) P(c_i) 3. Go to step 1 (why “hard” assignments?)
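
To make the alternating scheme concrete, here is a minimal sketch of the hard-assignment special case (the “k-means” algorithm above, with the means as the only free parameters); the function and variable names are my own.

```python
import numpy as np

def k_means(X, k, n_iters=20, seed=0):
    """Alternate between hard assignments and maximum-likelihood means."""
    rng = np.random.default_rng(seed)
    # Step 0: guess initial parameter values (k data points chosen at random)
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Step 1: given parameters, assign each point to the nearest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Step 2: given assignments, maximum-likelihood estimate of each mean
        for j in range(k):
            if np.any(c == j):
                means[j] = X[c == j].mean(axis=0)
    return means, c
```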

Estimating a Gaussian (with hard assignments) X = {x_1, x_2, …, x_n} independently sampled from a Gaussian; maximum likelihood parameter estimates use only the points assigned to each cluster, e.g. μ_c = ∑_i 1(c_i = c) x_i / ∑_i 1(c_i = c)

Estimating a Gaussian (with soft assignments) maximum likelihood parameter estimates weight each point by its cluster probability, e.g. μ_c = ∑_i P(c | x_i) x_i / ∑_i P(c | x_i); the “weight” of each point is the probability of being in the cluster

The Expectation-Maximization algorithm (clustering version) 0. Guess initial parameter values 1. Given parameter estimates, compute posterior distribution over assignments c_i: P(c_i | x_i, θ) ∝ p(x_i | c_i, θ) P(c_i) 2. Solve for maximum likelihood parameter estimates, weighting each observation by the probability it came from that cluster 3. Go to step 1
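
As an illustration of these two steps (not the slides’ own code), here is a sketch of EM for a one-dimensional mixture of Gaussians: the first step computes the posterior over assignments, and the second uses the weighted estimates described above. The variable names and the 1-D restriction are my simplifying assumptions.

```python
import numpy as np

def em_mixture_1d(x, k, n_iters=50, seed=0):
    """EM for a 1-D mixture of Gaussians."""
    rng = np.random.default_rng(seed)
    # Step 0: guess initial parameter values
    mu = rng.choice(x, size=k, replace=False).astype(float)
    sigma = np.full(k, x.std())
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: responsibilities P(c | x_i) proportional to p(x_i | c) P(c)
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood estimates
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return pi, mu, sigma

# Example: x = np.concatenate([np.random.normal(-2, 1, 500), np.random.normal(3, 1, 500)])
#          pi, mu, sigma = em_mixture_1d(x, k=2)
```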

The Expectation-Maximization algorithm (more general version) 0. Guess initial parameter values 1. Given parameter estimates, compute posterior distribution over latent variables z: P(z | x, θ) 2. Find parameter estimates that maximize the expected log-likelihood E_{P(z | x, θ)}[log P(x, z | θ)] 3. Go to step 1

A note on expectations For a function f(x) and distribution P(x), the expectation of f with respect to P is E_P[f(x)] = ∑_x f(x) P(x) The expectation is the average of f, when x is drawn from the probability distribution P

Good features of EM Convergence –guaranteed to converge to at least a local maximum of the likelihood (or other extremum) –likelihood is non-decreasing across iterations Efficiency –big steps initially (other algorithms better later) Generality –can be defined for many probabilistic models –can be combined with a prior for MAP estimation

Limitations of EM Local maxima –e.g., one component poorly fits two clusters, while two components split up a single cluster Degeneracies –e.g., two components may merge, a component may lock onto one data point, with variance going to zero May be intractable for complex models –dealing with this is an active research topic

EM and cognitive science The EM algorithm seems like it might be a good way to describe some “bootstrapping” –anywhere there’s a “chicken and egg” problem –a prime example: language learning

Probabilistic context free grammars
S → NP VP (1.0), NP → T N (0.7), NP → N (0.3), VP → V NP (1.0), T → the (0.8), T → a (0.2), N → man (0.5), N → ball (0.5), V → hit (0.6), V → took (0.4)
Parse tree for “the man hit the ball”: S → NP VP (1.0); NP → T N (0.7); T → the (0.8); N → man (0.5); VP → V NP (1.0); V → hit (0.6); NP → T N (0.7); T → the (0.8); N → ball (0.5)
P(tree) = 1.0 × 0.7 × 1.0 × 0.8 × 0.5 × 0.6 × 0.7 × 0.8 × 0.5
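
A small sketch that encodes the grammar above and multiplies the probabilities of the rules used in the derivation of “the man hit the ball”; representing the rules as a dictionary is my own choice.

```python
# PCFG rules with their probabilities
rules = {
    ("S", ("NP", "VP")): 1.0, ("NP", ("T", "N")): 0.7, ("NP", ("N",)): 0.3,
    ("VP", ("V", "NP")): 1.0, ("T", ("the",)): 0.8, ("T", ("a",)): 0.2,
    ("N", ("man",)): 0.5, ("N", ("ball",)): 0.5,
    ("V", ("hit",)): 0.6, ("V", ("took",)): 0.4,
}

# Rules used in the parse tree for "the man hit the ball"
derivation = [
    ("S", ("NP", "VP")), ("NP", ("T", "N")), ("T", ("the",)), ("N", ("man",)),
    ("VP", ("V", "NP")), ("V", ("hit",)),
    ("NP", ("T", "N")), ("T", ("the",)), ("N", ("ball",)),
]

p_tree = 1.0
for rule in derivation:
    p_tree *= rules[rule]
print(p_tree)  # 1.0 * 0.7 * 0.8 * 0.5 * 1.0 * 0.6 * 0.7 * 0.8 * 0.5 ≈ 0.047
```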

EM and cognitive science The EM algorithm seems like it might be a good way to describe some “bootstrapping” –anywhere there’s a “chicken and egg” problem –a prime example: language learning Fried and Holyoak (1984) explicitly tested a model of human categorization that was almost exactly a version of the EM algorithm for a mixture of Gaussians

Part IV: Inference algorithms The EM algorithm –for estimation in models with latent variables Markov chain Monte Carlo –for sampling from posterior distributions involving large numbers of variables

The Monte Carlo principle The expectation of f with respect to P can be approximated by E_P[f(x)] ≈ (1/n) ∑_{i=1}^{n} f(x_i), where the x_i are sampled from P(x) Example: the average # of spots on a die roll

The Monte Carlo principle (figure: average number of spots vs. number of rolls, converging to the expected value as predicted by the law of large numbers)
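
A quick sketch of the die-roll example: the Monte Carlo average approaches the true expectation of 3.5 as the number of samples grows, which is what the figure illustrates.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 1000, 100000]:
    rolls = rng.integers(1, 7, size=n)  # x_i sampled from P(x), a fair die
    print(n, rolls.mean())              # (1/n) * sum_i f(x_i), with f(x) = x
```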

Markov chain Monte Carlo Sometimes it isn’t possible to sample directly from a distribution Sometimes, you can only compute something proportional to the distribution Markov chain Monte Carlo: construct a Markov chain that will converge to the target distribution, and draw samples from that chain –just uses something proportional to the target

Markov chains Variable x^(t+1) is independent of all previous variables given its immediate predecessor x^(t) (chain: x → x → … → x) Transition matrix T = P(x^(t+1) | x^(t))

An example: card shuffling Each state x (t) is a permutation of a deck of cards (there are 52! permutations) Transition matrix T indicates how likely one permutation will become another The transition probabilities are determined by the shuffling procedure –riffle shuffle –overhand –one card

Convergence of Markov chains Why do we shuffle cards? Convergence to a uniform distribution takes only 7 riffle shuffles… Other Markov chains will also converge to a stationary distribution, if certain simple conditions are satisfied (called “ergodicity”) –e.g. every state can be reached in some number of steps from every other state
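
The card-shuffling chain has 52! states, so here instead is a toy three-state chain (my own illustrative example) showing the same phenomenon: repeatedly applying the transition matrix drives the state distribution to the chain’s stationary distribution.

```python
import numpy as np

# Rows give P(x_{t+1} | x_t) for a toy ergodic 3-state chain
T = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

p = np.array([1.0, 0.0, 0.0])  # start deterministically in state 0
for t in range(50):
    p = p @ T                   # one step of the chain
print(p)                        # approximately the stationary distribution (p = p @ T)
```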

Markov chain Monte Carlo States of the chain are the variables of interest; the transition matrix T = P(x^(t+1) | x^(t)) is chosen to give the target distribution as the stationary distribution (chain: x → x → … → x)

Metropolis-Hastings algorithm Transitions have two parts: –proposal distribution: Q(x^(t+1) | x^(t)) –acceptance: take proposals with probability A(x^(t), x^(t+1)) = min(1, [P(x^(t+1)) Q(x^(t) | x^(t+1))] / [P(x^(t)) Q(x^(t+1) | x^(t))])

Metropolis-Hastings algorithm (figures: successive proposed moves under the target density p(x); one proposal is accepted with probability A(x^(t), x^(t+1)) = 0.5, another with probability A(x^(t), x^(t+1)) = 1)
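
The following sketch implements the algorithm for a one-dimensional target, using a symmetric Gaussian random-walk proposal (so the Q terms cancel in the acceptance ratio) and a target known only up to a constant; the function name, step size, and example target are my own choices.

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings; log_p need only be correct up to a constant."""
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + step * rng.normal()   # symmetric proposal Q(x'|x)
        # Accept with probability min(1, P(x') / P(x))
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal
        samples.append(x)
    return np.array(samples)

# Example: sample from an unnormalized standard Gaussian
samples = metropolis_hastings(lambda x: -0.5 * x ** 2, x0=0.0, n_samples=5000)
```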

Gibbs sampling Particular choice of proposal distribution For variables x = x_1, x_2, …, x_n Draw x_i^(t+1) from P(x_i | x_-i), where x_-i = x_1^(t+1), x_2^(t+1), …, x_{i-1}^(t+1), x_{i+1}^(t), …, x_n^(t) (this is called the full conditional distribution)

Gibbs sampling (MacKay, 2002)
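
A minimal sketch of Gibbs sampling for a toy target where both full conditionals are available in closed form: a zero-mean bivariate Gaussian with correlation rho. The choice of target and all names are my own assumptions.

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, n_samples, seed=0):
    """Alternately draw each coordinate from its full conditional given the other."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    sd = np.sqrt(1 - rho ** 2)         # conditional standard deviation
    samples = []
    for _ in range(n_samples):
        x1 = rng.normal(rho * x2, sd)  # draw x1 from P(x1 | x2)
        x2 = rng.normal(rho * x1, sd)  # draw x2 from P(x2 | x1)
        samples.append((x1, x2))
    return np.array(samples)

samples = gibbs_bivariate_gaussian(rho=0.9, n_samples=5000)
```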

MCMC vs. EM EM: converges to a single solution MCMC: converges to a distribution of solutions

MCMC and cognitive science The Metropolis-Hastings algorithm seems like a good metaphor for aspects of development… Some forms of cultural evolution can be shown to be equivalent to Gibbs sampling (Griffiths & Kalish, 2005) For experiments based on MCMC, see talk by Adam Sanborn at MathPsych! The main use of MCMC is for probabilistic inference in complex models

STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN A selection of topics FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE

FOOD FOODS BODY NUTRIENTS DIET FAT SUGAR ENERGY MILK EATING MAP NORTH EARTH SOUTH POLE MAPS EQUATOR WEST LINES EAST DOCTOR PATIENT HEALTH HOSPITAL MEDICAL CARE PATIENTS NURSE DOCTORS MEDICINE BOOK BOOKS READING INFORMATION LIBRARY REPORT PAGE TITLE SUBJECT PAGES GOLD IRON SILVER COPPER METAL METALS STEEL CLAY LEAD ADAM BEHAVIOR SELF INDIVIDUAL PERSONALITY RESPONSE SOCIAL EMOTIONAL LEARNING FEELINGS PSYCHOLOGISTS CELLS CELL ORGANISMS ALGAE BACTERIA MICROSCOPE MEMBRANE ORGANISM FOOD LIVING Semantic classes Syntactic classes GOOD SMALL NEW IMPORTANT GREAT LITTLE LARGE BIG LONG HIGH DIFFERENT THE HIS THEIR YOUR HER ITS MY OUR THIS THESE A MORE SUCH LESS MUCH KNOWN JUST BETTER RATHER GREATER HIGHER LARGER ON AT INTO FROM WITH THROUGH OVER AROUND AGAINST ACROSS UPON ONE SOME MANY TWO EACH ALL MOST ANY THREE THIS EVERY HE YOU THEY I SHE WE IT PEOPLE EVERYONE OTHERS SCIENTISTS BE MAKE GET HAVE GO TAKE DO FIND USE SEE HELP Semantic “gist” of document

Summary Probabilistic models can pose significant computational challenges –parameter estimation with latent variables, computing posteriors with many variables Clever algorithms exist for solving these problems, easing use of probabilistic models These algorithms also provide a source of new models and methods in cognitive science

Generative models for language (figure: latent structure → observed data)

Generative models for language (figure: meaning → words)

Topic models Each document (or conversation, or segment of either) is a mixture of topics Each word is chosen from a single topic: P(w_i) = ∑_{j=1}^{T} P(w_i | z_i = j) P(z_i = j), where w_i is the ith word, z_i is the topic of the ith word, and T is the number of topics

Generating a document (graphical model: g → z → w; g is the distribution over topics, z the topic assignments, w the observed words)

topic 1, w P(w|z = 1): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2, SCIENTIFIC 0.0, KNOWLEDGE 0.0, WORK 0.0, RESEARCH 0.0, MATHEMATICS 0.0
topic 2, w P(w|z = 2): HEART 0.0, LOVE 0.0, SOUL 0.0, TEARS 0.0, JOY 0.0, SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2

Choose mixture weights for each document, generate “bag of words” g = {P(z = 1), P(z = 2)} {0, 1} {0.25, 0.75} {0.5, 0.5} {0.75, 0.25} {1, 0} MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
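
A sketch of this generative process for the two-topic example: given mixture weights g, each word’s topic is drawn from g and the word from that topic’s distribution. The vocabulary and the 0.2/0.0 probabilities come from the slides above; the function name and document length are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["HEART", "LOVE", "SOUL", "TEARS", "JOY",
         "SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]
phi = np.array([[0.2] * 5 + [0.0] * 5,   # P(w | z = 1)
                [0.0] * 5 + [0.2] * 5])  # P(w | z = 2)

def generate_document(g, n_words=15):
    """Given mixture weights g = (P(z=1), P(z=2)), generate a bag of words."""
    words = []
    for _ in range(n_words):
        z = rng.choice(2, p=g)                     # choose a topic for this word
        words.append(rng.choice(vocab, p=phi[z]))  # choose a word from that topic
    return words

print(generate_document(g=[0.25, 0.75]))
```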

Inferring topics from text The topic model is a generative model for a set of documents (assuming a set of topics) –a simple procedure for generating documents Given the documents, we can try to find the topics and their proportions in each document This is an unsupervised learning problem –we can use the EM algorithm, but it’s not great –instead, we use Markov chain Monte Carlo

THEORY SCIENTISTS EXPERIMENT OBSERVATIONS SCIENTIFIC EXPERIMENTS HYPOTHESIS EXPLAIN SCIENTIST OBSERVED EXPLANATION BASED OBSERVATION IDEA EVIDENCE THEORIES BELIEVED DISCOVERED OBSERVE FACTS SPACE EARTH MOON PLANET ROCKET MARS ORBIT ASTRONAUTS FIRST SPACECRAFT JUPITER SATELLITE SATELLITES ATMOSPHERE SPACESHIP SURFACE SCIENTISTS ASTRONAUT SATURN MILES ART PAINT ARTIST PAINTING PAINTED ARTISTS MUSEUM WORK PAINTINGS STYLE PICTURES WORKS OWN SCULPTURE PAINTER ARTS BEAUTIFUL DESIGNS PORTRAIT PAINTERS STUDENTS TEACHER STUDENT TEACHERS TEACHING CLASS CLASSROOM SCHOOL LEARNING PUPILS CONTENT INSTRUCTION TAUGHT GROUP GRADE SHOULD GRADES CLASSES PUPIL GIVEN BRAIN NERVE SENSE SENSES ARE NERVOUS NERVES BODY SMELL TASTE TOUCH MESSAGES IMPULSES CORD ORGANS SPINAL FIBERS SENSORY PAIN IS CURRENT ELECTRICITY ELECTRIC CIRCUIT IS ELECTRICAL VOLTAGE FLOW BATTERY WIRE WIRES SWITCH CONNECTED ELECTRONS RESISTANCE POWER CONDUCTORS CIRCUITS TUBE NEGATIVE A selection from 500 topics [ P(w|z = j) ]

STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE A selection from 500 topics [ P(w|z = j) ]

Gibbs sampling for topics Need full conditional distributions for variables Since we only sample z we need P(z_i = j | z_-i, w) ∝ (n_{w_i,j} + β)/(n_j + Wβ) × (n_{d_i,j} + α)/(n_{d_i} + Tα), where n_{w,j} is the number of times word w is assigned to topic j and n_{d,j} is the number of times topic j is used in document d (counts excluding the current token), W is the vocabulary size, and α, β are Dirichlet smoothing parameters
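
A sketch of one collapsed Gibbs update implementing this full conditional; the count-array layout, the hyperparameters alpha and beta, and the function name are my own assumptions about how the bookkeeping might be organized.

```python
import numpy as np

def resample_token(i, w, d, z, nwt, ndt, nt, alpha, beta, rng):
    """Resample the topic of token i from P(z_i = j | z_-i, w).
    w, d, z: word id, document id, and current topic of each token.
    nwt: word-by-topic counts; ndt: document-by-topic counts; nt: tokens per topic."""
    W, T = nwt.shape
    # Remove token i from the counts (the "-i" in the full conditional)
    nwt[w[i], z[i]] -= 1; ndt[d[i], z[i]] -= 1; nt[z[i]] -= 1
    # (n_wj + beta)/(n_j + W*beta) * (n_dj + alpha); the document-length
    # denominator is the same for every topic j, so it cancels when normalizing
    p = (nwt[w[i]] + beta) / (nt + W * beta) * (ndt[d[i]] + alpha)
    p /= p.sum()
    z[i] = rng.choice(T, p=p)
    # Add the token back with its new topic
    nwt[w[i], z[i]] += 1; ndt[d[i], z[i]] += 1; nt[z[i]] += 1
```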

Gibbs sampling (figures: topic assignments z_i for the words of a sample document, resampled one at a time across iterations 1, 2, …, 1000)

A visual example: Bars pixel = word, image = document; sample each pixel from a mixture of topics

Summary Probabilistic models can pose significant computational challenges –parameter estimation with latent variables, computing posteriors with many variables Clever algorithms exist for solving these problems, easing use of probabilistic models These algorithms also provide a source of new models and methods in cognitive science

When Bayes is useful… Clarifying computational problems in cognition Providing rational explanations for behavior Characterizing knowledge informing induction Capturing inferences at multiple levels