
1 Part IV: Inference algorithms

2 Estimation and inference Actually working with probabilistic models requires solving some difficult computational problems… Two key problems: –estimating parameters in models with latent variables –computing posterior distributions involving large numbers of variables

3 Part IV: Inference algorithms The EM algorithm –for estimation in models with latent variables Markov chain Monte Carlo –for sampling from posterior distributions involving large numbers of variables

4 Part IV: Inference algorithms The EM algorithm –for estimation in models with latent variables Markov chain Monte Carlo –for sampling from posterior distributions involving large numbers of variables

5 SUPERVISED (image: a dog and a cat)

6 Supervised learning Category A, Category B: What characterizes the categories? How should we categorize a new observation?

7 Parametric density estimation Assume that p(x|c) has a simple form, characterized by parameters θ. Given stimuli X = x1, x2, …, xn from category c, find θ by maximum-likelihood estimation or some form of Bayesian estimation

8 Spatial representations Assume a simple parametric form for p(x|c): a Gaussian. For each category, estimate parameters –mean –variance (figure: graphical model c → x, labeled P(c) and p(x|c))

9 The Gaussian distribution (figure: probability density p(x) plotted against (x − μ)/σ, showing the mean μ, the standard deviation σ, and the variance σ²)
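The density itself appears only as an image on the slide; for reference, the standard univariate Gaussian density with mean μ and variance σ² is:

```latex
p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```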

10 Estimating a Gaussian X = {x1, x2, …, xn} independently sampled from a Gaussian

11 Estimating a Gaussian X = {x1, x2, …, xn} independently sampled from a Gaussian; maximum likelihood parameter estimates:
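The estimates are shown only as equations in the slide image; the standard maximum-likelihood estimates for a univariate Gaussian are:

```latex
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2
```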

12 Multivariate Gaussians (formula labels: mean, variance/covariance matrix, quadratic form)
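The multivariate density is also only an image; its standard form, with mean vector μ and covariance matrix Σ (the exponent is the quadratic form labeled on the slide), is:

```latex
p(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) =
  \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
  \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{\top}
  \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)
```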

13 Estimating a Gaussian X = {x1, x2, …, xn} independently sampled from a Gaussian; maximum likelihood parameter estimates:

14 Bayesian inference (figure: probability distribution over x)

15 UNSUPERVISED

16 Unsupervised learning What latent structure is present? What are the properties of a new observation?

17 An example: Clustering Assume each observed xi is from a cluster ci, where ci is unknown What characterizes the clusters? What cluster does a new x come from?

18 Density estimation We need to estimate some probability distributions –what is P(c)? –what is p(x|c)? But… c is unknown, so we only know the value of x (figure: graphical model c → x, labeled P(c) and p(x|c))

19 Supervised and unsupervised Supervised learning: categorization. Given x = {x1, …, xn} and c = {c1, …, cn}, estimate parameters θ of p(x|c) and P(c). Unsupervised learning: clustering. Given x = {x1, …, xn}, estimate parameters θ of p(x|c) and P(c).

20 Mixture distributions (figure: probability of x under a mixture distribution, showing the mixture components and mixture weights)
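The mixture equation on the slide is an image; in the notation of the preceding slides, a mixture distribution combines the class-conditional components p(x|c) using the mixture weights P(c):

```latex
p(x) = \sum_{c} P(c)\, p(x \mid c)
```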

21 More generally… Unsupervised learning is density estimation using distributions with latent variables: z is latent (unobserved), x is observed; marginalize out (i.e. sum over) latent structure. (figure: graphical model z → x, labeled P(z) and P(x|z))

22 A chicken and egg problem If we knew which cluster the observations were from we could find the distributions –this is just density estimation If we knew the distributions, we could infer which cluster each observation came from –this is just categorization

23 Alternating optimization algorithm 0. Guess initial parameter values 1. Given parameter estimates, solve for maximum a posteriori assignments ci 2. Given assignments ci, solve for maximum likelihood parameter estimates 3. Go to step 1

24 Alternating optimization algorithm x: observations; c: assignments to clusters; μ, σ, P(c): parameters θ. For simplicity, assume σ and P(c) fixed: the “k-means” algorithm
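A minimal sketch of this alternating scheme in Python, assuming (as on the slide) that the cluster variances and P(c) are held fixed so only the means are updated; function and variable names are illustrative, not taken from the slides:

```python
import numpy as np

def k_means(x, k, n_iters=20, seed=0):
    """Alternate between assigning points to clusters and updating cluster means."""
    rng = np.random.default_rng(seed)
    means = x[rng.choice(len(x), size=k, replace=False)]  # step 0: initial guesses
    for _ in range(n_iters):
        # Step 1: assign each point to the nearest cluster mean
        # (the MAP assignment when variances and P(c) are equal across clusters).
        assignments = np.array([np.argmin((xi - means) ** 2) for xi in x])
        # Step 2: update each mean to the average of its assigned points.
        for c in range(k):
            if np.any(assignments == c):
                means[c] = x[assignments == c].mean()
    return means, assignments

# Example: two well-separated 1-D clusters.
data = np.concatenate([np.random.normal(-3, 1, 50), np.random.normal(3, 1, 50)])
means, z = k_means(data, k=2)
```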

25 Alternating optimization algorithm: Step 0, initial parameter values

26 Alternating optimization algorithm: Step 1, update assignments

27 Alternating optimization algorithm: Step 2, update parameters

28 Alternating optimization algorithm: Step 1, update assignments

29 Alternating optimization algorithm: Step 2, update parameters

30 0. Guess initial parameter values 1. Given parameter estimates, solve for maximum a posteriori assignments ci 2. Given assignments ci, solve for maximum likelihood parameter estimates 3. Go to step 1 (why “hard” assignments?)

31 Estimating a Gaussian (with hard assignments) X = {x1, x2, …, xn} independently sampled from a Gaussian; maximum likelihood parameter estimates:

32 Estimating a Gaussian (with soft assignments) maximum likelihood parameter estimates: the “weight” of each point is the probability of being in the cluster
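The weighted estimates appear only as equations in the slide image; in the standard soft-assignment form, with weights P(c|xi), the updates for cluster c are:

```latex
\hat{\mu}_c = \frac{\sum_{i=1}^{n} P(c \mid x_i)\, x_i}{\sum_{i=1}^{n} P(c \mid x_i)},
\qquad
\hat{\sigma}_c^2 = \frac{\sum_{i=1}^{n} P(c \mid x_i)\,(x_i - \hat{\mu}_c)^2}{\sum_{i=1}^{n} P(c \mid x_i)}
```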

33 The Expectation-Maximization algorithm (clustering version) 0. Guess initial parameter values 1. Given parameter estimates, compute posterior distribution over assignments ci 2. Solve for maximum likelihood parameter estimates, weighting each observation by the probability it came from that cluster 3. Go to step 1

34 The Expectation-Maximization algorithm (more general version) 0. Guess initial parameter values 1. Given parameter estimates, compute posterior distribution over latent variables z 2. Find parameter estimates 3. Go to step 1
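A compact Python sketch of the clustering version, for a mixture of 1-D Gaussians (a minimal illustration under those assumptions, not the code behind the slides; names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def em_mixture(x, k, n_iters=50, seed=0):
    """EM for a k-component 1-D Gaussian mixture."""
    rng = np.random.default_rng(seed)
    mu = x[rng.choice(len(x), size=k, replace=False)]   # step 0: initial guesses
    sigma = np.full(k, x.std())
    weights = np.full(k, 1.0 / k)                        # P(c)
    for _ in range(n_iters):
        # E-step: posterior probability that each point came from each cluster.
        resp = np.array([w * norm.pdf(x, m, s)
                         for w, m, s in zip(weights, mu, sigma)]).T
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood estimates.
        n_c = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / n_c
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_c)
        weights = n_c / len(x)
    return mu, sigma, weights
```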

35 A note on expectations For a function f(x) and distribution P(x), the expectation of f with respect to P is defined below. The expectation is the average of f when x is drawn from the probability distribution P
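The definition on the slide is an image; for a discrete distribution it is the standard

```latex
\mathbb{E}_{P}[f(x)] = \sum_{x} f(x)\, P(x)
\qquad\text{(or } \textstyle\int f(x)\, p(x)\, dx \text{ for a density)}
```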

36 Good features of EM Convergence –guaranteed to converge to at least a local maximum of the likelihood (or other extremum) –likelihood is non-decreasing across iterations Efficiency –takes big steps initially (other algorithms do better later, near convergence) Generality –can be defined for many probabilistic models –can be combined with a prior for MAP estimation

37 Limitations of EM Local minima –e.g., one component poorly fits two clusters, while two components split up a single cluster Degeneracies –e.g., two components may merge, a component may lock onto one data point, with variance going to zero May be intractable for complex models –dealing with this is an active research topic

38 EM and cognitive science The EM algorithm seems like it might be a good way to describe some “bootstrapping” –anywhere there’s a “chicken and egg” problem –a prime example: language learning

39 Probabilistic context-free grammars
S → NP VP (1.0)
NP → T N (0.7)
NP → N (0.3)
VP → V NP (1.0)
T → the (0.8)
T → a (0.2)
N → man (0.5)
N → ball (0.5)
V → hit (0.6)
V → took (0.4)
(figure: parse tree for “the man hit the ball”, built from the rules above)
P(tree) = 1.0 × 0.7 × 1.0 × 0.8 × 0.5 × 0.6 × 0.7 × 0.8 × 0.5
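As a quick check of the product (a hypothetical one-liner, not from the slides), the tree probability is just the product of the probabilities of the rules used in the parse:

```python
import math

rule_probs = [1.0, 0.7, 1.0, 0.8, 0.5, 0.6, 0.7, 0.8, 0.5]  # rules used in the parse above
p_tree = math.prod(rule_probs)
print(p_tree)  # 0.04704
```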

40 EM and cognitive science The EM algorithm seems like it might be a good way to describe some “bootstrapping” –anywhere there’s a “chicken and egg” problem –a prime example: language learning Fried and Holyoak (1984) explicitly tested a model of human categorization that was almost exactly a version of the EM algorithm for a mixture of Gaussians

41 Part IV: Inference algorithms The EM algorithm –for estimation in models with latent variables Markov chain Monte Carlo –for sampling from posterior distributions involving large numbers of variables

42 The Monte Carlo principle The expectation of f with respect to P can be approximated by a sample average (formula below), where the xi are sampled from P(x). Example: the average number of spots on a die roll
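The approximation on the slide is an image; the standard Monte Carlo estimate is

```latex
\mathbb{E}_{P}[f(x)] \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i),
\qquad x_i \sim P(x)
```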

43 The Monte Carlo principle (figure: average number of spots vs. number of rolls, illustrating the law of large numbers)
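A tiny simulation of the die-roll example (illustrative only): the running average of the spots converges to the expected value of 3.5 as the number of rolls grows.

```python
import numpy as np

rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=10_000)                 # fair six-sided die
running_avg = np.cumsum(rolls) / np.arange(1, len(rolls) + 1)
print(running_avg[[9, 99, 999, 9999]])                  # approaches E[spots] = 3.5
```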

44 Markov chain Monte Carlo Sometimes it isn’t possible to sample directly from a distribution Sometimes, you can only compute something proportional to the distribution Markov chain Monte Carlo: construct a Markov chain that will converge to the target distribution, and draw samples from that chain –just uses something proportional to the target

45 Markov chains Variables: x(t+1) is independent of all previous variables given its immediate predecessor x(t). (figure: chain of states x(1) → x(2) → …) Transition matrix T = P(x(t+1)|x(t))

46 An example: card shuffling Each state x (t) is a permutation of a deck of cards (there are 52! permutations) Transition matrix T indicates how likely one permutation will become another The transition probabilities are determined by the shuffling procedure –riffle shuffle –overhand –one card

47 Convergence of Markov chains Why do we shuffle cards? Convergence to a uniform distribution takes only 7 riffle shuffles… Other Markov chains will also converge to a stationary distribution, if certain simple conditions are satisfied (called “ergodicity”) –e.g. every state can be reached in some number of steps from every other state

48 Markov chain Monte Carlo States of chain are variables of interest Transition matrix chosen to give target distribution as stationary distribution (figure: chain of states x(1) → x(2) → …) Transition matrix T = P(x(t+1)|x(t))

49 Metropolis-Hastings algorithm Transitions have two parts: –proposal distribution: Q(x(t+1)|x(t)) –acceptance: take proposals with probability A(x(t), x(t+1)) = min(1, [P(x(t+1)) Q(x(t)|x(t+1))] / [P(x(t)) Q(x(t+1)|x(t))])
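A minimal Metropolis-Hastings sketch in Python, using a Gaussian random-walk proposal (symmetric, so the Q terms cancel) and requiring only something proportional to the target; the function names and example target are illustrative assumptions, not from the slides:

```python
import numpy as np

def metropolis_hastings(unnorm_p, n_samples=5000, step=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis: only needs unnorm_p, proportional to the target."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + step * rng.normal()            # Q(x'|x): symmetric proposal
        accept_prob = min(1.0, unnorm_p(proposal) / unnorm_p(x))
        if rng.random() < accept_prob:                # accept, or keep the current state
            x = proposal
        samples.append(x)
    return np.array(samples)

# Example target: an unnormalized mixture of two Gaussians.
target = lambda x: np.exp(-0.5 * (x - 2) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)
draws = metropolis_hastings(target)
```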

50 Metropolis-Hastings algorithm (figure: target density p(x))

51 (figure: target density p(x))

52 (figure: target density p(x))

53 (figure: target density p(x)) A(x(t), x(t+1)) = 0.5

54 Metropolis-Hastings algorithm (figure: target density p(x))

55 (figure: target density p(x)) A(x(t), x(t+1)) = 1

56 Gibbs sampling A particular choice of proposal distribution. For variables x = x1, x2, …, xn, draw xi(t+1) from P(xi|x-i), where x-i = x1(t+1), x2(t+1), …, xi-1(t+1), xi+1(t), …, xn(t) (this is called the full conditional distribution)
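A minimal Gibbs-sampling sketch in Python for a toy case where the full conditionals are known exactly: a bivariate standard normal with correlation rho, for which x1 | x2 ~ N(rho·x2, 1 − rho²) and symmetrically for x2. This is an illustration under those assumptions, not code from the slides:

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_samples=5000, seed=0):
    """Gibbs sampling: update each variable from its full conditional in turn."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    cond_sd = np.sqrt(1.0 - rho ** 2)
    samples = np.empty((n_samples, 2))
    for t in range(n_samples):
        x1 = rng.normal(rho * x2, cond_sd)    # draw x1 from P(x1 | x2)
        x2 = rng.normal(rho * x1, cond_sd)    # draw x2 from P(x2 | x1), using the new x1
        samples[t] = (x1, x2)
    return samples

draws = gibbs_bivariate_normal()
print(np.corrcoef(draws.T)[0, 1])             # close to rho
```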

57 Gibbs sampling (MacKay, 2002)

58 MCMC vs. EM EM: converges to a single solution MCMC: converges to a distribution of solutions

59 MCMC and cognitive science The Metropolis-Hastings algorithm seems like a good metaphor for aspects of development… Some forms of cultural evolution can be shown to be equivalent to Gibbs sampling (Griffiths & Kalish, 2005) For experiments based on MCMC, see talk by Adam Sanborn at MathPsych! The main use of MCMC is for probabilistic inference in complex models

60 A selection of topics (one topic per line):
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE

61 Semantic classes (one class per line):
FOOD FOODS BODY NUTRIENTS DIET FAT SUGAR ENERGY MILK EATING
MAP NORTH EARTH SOUTH POLE MAPS EQUATOR WEST LINES EAST
DOCTOR PATIENT HEALTH HOSPITAL MEDICAL CARE PATIENTS NURSE DOCTORS MEDICINE
BOOK BOOKS READING INFORMATION LIBRARY REPORT PAGE TITLE SUBJECT PAGES
GOLD IRON SILVER COPPER METAL METALS STEEL CLAY LEAD ADAM
BEHAVIOR SELF INDIVIDUAL PERSONALITY RESPONSE SOCIAL EMOTIONAL LEARNING FEELINGS PSYCHOLOGISTS
CELLS CELL ORGANISMS ALGAE BACTERIA MICROSCOPE MEMBRANE ORGANISM FOOD LIVING
Syntactic classes (one class per line):
GOOD SMALL NEW IMPORTANT GREAT LITTLE LARGE BIG LONG HIGH DIFFERENT
THE HIS THEIR YOUR HER ITS MY OUR THIS THESE A
MORE SUCH LESS MUCH KNOWN JUST BETTER RATHER GREATER HIGHER LARGER
ON AT INTO FROM WITH THROUGH OVER AROUND AGAINST ACROSS UPON
ONE SOME MANY TWO EACH ALL MOST ANY THREE THIS EVERY
HE YOU THEY I SHE WE IT PEOPLE EVERYONE OTHERS SCIENTISTS
BE MAKE GET HAVE GO TAKE DO FIND USE SEE HELP
Semantic “gist” of document

62 Summary Probabilistic models can pose significant computational challenges –parameter estimation with latent variables, computing posteriors with many variables Clever algorithms exist for solving these problems, easing use of probabilistic models These algorithms also provide a source of new models and methods in cognitive science

63

64 Generative models for language (figure: latent structure → observed data)

65 Generative models for language (figure: meaning → words)

66 Topic models Each document (or conversation, or segment of either) is a mixture of topics. Each word is chosen from a single topic (formula below), where wi is the ith word, zi is the topic of the ith word, and T is the number of topics
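The distribution over words on the slide is an image; the standard mixture form, matching the variable names given there, is

```latex
P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j)
```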

67 Generating a document (figure: graphical model g → z → w; g is the distribution over topics, z the topic assignments, w the observed words)

68 Two example topics (w, P(w|z)):
topic 1, P(w|z = 1): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2, SCIENTIFIC 0.0, KNOWLEDGE 0.0, WORK 0.0, RESEARCH 0.0, MATHEMATICS 0.0
topic 2, P(w|z = 2): HEART 0.0, LOVE 0.0, SOUL 0.0, TEARS 0.0, JOY 0.0, SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2

69 Choose mixture weights for each document, generate “bag of words”. g = {P(z = 1), P(z = 2)}, with settings {0, 1}, {0.25, 0.75}, {0.5, 0.5}, {0.75, 0.25}, {1, 0}. Generated words (from the slide figure): MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY

70 Inferring topics from text The topic model is a generative model for a set of documents (assuming a set of topics) –a simple procedure for generating documents Given the documents, we can try to find the topics and their proportions in each document This is an unsupervised learning problem –we can use the EM algorithm, but it’s not great –instead, we use Markov chain Monte Carlo

71 A selection from 500 topics [P(w|z = j)] (one topic per line):
THEORY SCIENTISTS EXPERIMENT OBSERVATIONS SCIENTIFIC EXPERIMENTS HYPOTHESIS EXPLAIN SCIENTIST OBSERVED EXPLANATION BASED OBSERVATION IDEA EVIDENCE THEORIES BELIEVED DISCOVERED OBSERVE FACTS
SPACE EARTH MOON PLANET ROCKET MARS ORBIT ASTRONAUTS FIRST SPACECRAFT JUPITER SATELLITE SATELLITES ATMOSPHERE SPACESHIP SURFACE SCIENTISTS ASTRONAUT SATURN MILES
ART PAINT ARTIST PAINTING PAINTED ARTISTS MUSEUM WORK PAINTINGS STYLE PICTURES WORKS OWN SCULPTURE PAINTER ARTS BEAUTIFUL DESIGNS PORTRAIT PAINTERS
STUDENTS TEACHER STUDENT TEACHERS TEACHING CLASS CLASSROOM SCHOOL LEARNING PUPILS CONTENT INSTRUCTION TAUGHT GROUP GRADE SHOULD GRADES CLASSES PUPIL GIVEN
BRAIN NERVE SENSE SENSES ARE NERVOUS NERVES BODY SMELL TASTE TOUCH MESSAGES IMPULSES CORD ORGANS SPINAL FIBERS SENSORY PAIN IS
CURRENT ELECTRICITY ELECTRIC CIRCUIT IS ELECTRICAL VOLTAGE FLOW BATTERY WIRE WIRES SWITCH CONNECTED ELECTRONS RESISTANCE POWER CONDUCTORS CIRCUITS TUBE NEGATIVE

72 A selection from 500 topics [P(w|z = j)] (one topic per line):
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE

73 A selection from 500 topics [P(w|z = j)] (one topic per line):
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE

74 Gibbs sampling for topics Need full conditional distributions for variables. Since we only sample z, we need P(zi | z-i, w) (formula below), which depends on the number of times word w is assigned to topic j and the number of times topic j is used in document d
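The conditional itself appears only as an equation image; the standard collapsed-Gibbs form for this topic model (an assumption about the slide's exact equation, with Dirichlet hyperparameters α and β, vocabulary size W, and counts that exclude the current token, matching the two labels on the slide) is

```latex
P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}
\cdot
\frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}
```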

75 Gibbs sampling iteration 1

76 Gibbs sampling iteration 1 2

77 Gibbs sampling iteration 1 2

78 Gibbs sampling iteration 1 2

79 Gibbs sampling iteration 1 2

80 Gibbs sampling iteration 1 2

81 Gibbs sampling iteration 1 2

82 Gibbs sampling iteration 1 2

83 Gibbs sampling iteration 1 2 … 1000

84 A visual example: Bars pixel = word, image = document; sample each pixel from a mixture of topics

85

86

87 Summary Probabilistic models can pose significant computational challenges –parameter estimation with latent variables, computing posteriors with many variables Clever algorithms exist for solving these problems, easing use of probabilistic models These algorithms also provide a source of new models and methods in cognitive science

88

89 When Bayes is useful… Clarifying computational problems in cognition Providing rational explanations for behavior Characterizing knowledge informing induction Capturing inferences at multiple levels

