Topics in statistical language modeling
Tom Griffiths
with Mark Steyvers (UC Irvine), Josh Tenenbaum (MIT), Dave Blei (CMU), and Mike Jordan (UC Berkeley)
Latent Dirichlet Allocation (LDA)
Each document is a mixture of topics; each word is chosen from a single topic.
Introduced by Blei, Ng, and Jordan (2001); a reinterpretation of PLSI (Hofmann, 1999).
The idea of probabilistic topics is widely used (e.g. Bigi et al., 1997; Iyer & Ostendorf, 1996; Ueda & Saito, 2003).
Latent Dirichlet Allocation (LDA)
Each document is a mixture of topics θ^(d); each word is chosen from a single topic z, with the word drawn from that topic's parameters φ^(z).
Topic 1: φ^(1) = P(w | z = 1): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2; SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, MATHEMATICS all 0.0
Topic 2: φ^(2) = P(w | z = 2): SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2; HEART, LOVE, SOUL, TEARS, JOY all 0.0
Choose mixture weights θ = {P(z = 1), P(z = 2)} for each document, then generate a "bag of words":
θ = {0, 1}: MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS
θ = {0.25, 0.75}: WORK SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART
θ = {0.5, 0.5}: MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART
θ = {0.75, 0.25}: WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL
θ = {1, 0}: TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
Generating a document
1. Choose θ^(d) ~ Dirichlet(α)
2. For each word in the document
–choose z ~ Multinomial(θ^(d))
–choose w ~ Multinomial(φ^(z))
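A minimal sketch of this generative process in Python/NumPy, using the two illustrative topics from the earlier slide; the symmetric α, the document length, and the toy vocabulary are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative topics over a toy vocabulary (cf. the earlier slide).
vocab = ["HEART", "LOVE", "SOUL", "TEARS", "JOY",
         "SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]
phi = np.array([
    [0.2, 0.2, 0.2, 0.2, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0],   # phi^(1)
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.2, 0.2, 0.2, 0.2],   # phi^(2)
])
alpha = 1.0      # symmetric Dirichlet hyperparameter (assumed value)
n_words = 16     # document length (assumed value)

def generate_document():
    """Generate one 'bag of words' following the LDA generative process."""
    theta = rng.dirichlet(alpha * np.ones(phi.shape[0]))  # 1. theta^(d) ~ Dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(phi.shape[0], p=theta)             # 2a. z ~ Multinomial(theta^(d))
        w = rng.choice(len(vocab), p=phi[z])              # 2b. w ~ Multinomial(phi^(z))
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document()
print(np.round(theta, 2), " ".join(doc))
```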
Inverting the generative model
The generative model gives a procedure for obtaining a corpus from topics and mixing proportions.
Inverting the model extracts the topics and mixing proportions from a corpus.
Goal: describe the content of documents, and identify the content of new documents.
All inference is completely unsupervised, with a fixed number of topics T, vocabulary size W, and number of documents D.
Inverting the generative model
Maximum likelihood estimation (EM)
–e.g. Hofmann (1999)
–slow, local maxima
Approximate E-steps
–VB: Blei, Ng & Jordan (2001)
–EP: Minka & Lafferty (2002)
Bayesian inference (via Gibbs sampling)
Gibbs sampling in LDA
The joint distribution P(w, z) is a product of count-based terms whose numerator rewards sparsity in the assignment of words to topics and of topics to documents.
The posterior P(z | w) = P(w, z) / Σ_z P(w, z) has a sum over T^n terms in the denominator.
The full posterior is thus tractable only up to a constant, so we use Markov chain Monte Carlo (MCMC).
Markov chain Monte Carlo
Sample from a Markov chain constructed to converge to the target distribution.
Allows sampling from an unnormalized posterior, and from other complex distributions.
Approximate statistics can be computed from otherwise intractable distributions.
Gibbs sampling is one such method: the Markov chain is built from the full conditional distributions of the variables.
Gibbs sampling in LDA
Need the full conditional distributions for the variables. Since we only sample z, we need

P(z_i = j | z_-i, w) ∝ (n_-i,j^(w_i) + β) / (n_-i,j^(·) + Wβ) · (n_-i,j^(d_i) + α) / (n_-i,·^(d_i) + Tα)

where n_-i,j^(w) is the number of times word w is assigned to topic j, and n_-i,j^(d) is the number of times topic j is used in document d, both counted excluding the current assignment z_i.
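A minimal collapsed Gibbs sampler sketch built on these counts; the array names (n_wt, n_dt, n_t), the toy corpus, and the hyperparameter values are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(docs, z, n_wt, n_dt, n_t, alpha, beta):
    """One sweep of collapsed Gibbs sampling over all tokens.

    docs : list of lists of word ids
    z    : list of lists of current topic assignments (same shape as docs)
    n_wt : (W, T) counts of word w assigned to topic t
    n_dt : (D, T) counts of topic t used in document d
    n_t  : (T,)   total words assigned to each topic
    """
    W, T = n_wt.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # Remove the current assignment from the counts.
            n_wt[w, t] -= 1; n_dt[d, t] -= 1; n_t[t] -= 1
            # Full conditional P(z_i = t | z_-i, w), up to a constant;
            # the document-side denominator is constant in t and is dropped.
            p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
            t = rng.choice(T, p=p / p.sum())
            # Add the new assignment back into the counts.
            n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1
            z[d][i] = t

# Example: random initialization for a toy corpus (word ids 0..3, 2 topics).
docs = [[0, 1, 2, 2], [2, 3, 3, 0]]
W, T, D, alpha, beta = 4, 2, len(docs), 1.0, 0.01
z = [[int(rng.integers(T)) for _ in doc] for doc in docs]
n_wt = np.zeros((W, T), int); n_dt = np.zeros((D, T), int); n_t = np.zeros(T, int)
for d, doc in enumerate(docs):
    for w, t in zip(doc, z[d]):
        n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1
for _ in range(100):
    gibbs_sweep(docs, z, n_wt, n_dt, n_t, alpha, beta)
```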
Gibbs sampling in LDA
(Animation: the sampler's topic assignments over iterations 1, 2, …, 1000.)
Estimating topic distributions
Parameter estimates from the posterior predictive distributions, given a sample of assignments z:
φ̂_j^(w) = (n_j^(w) + β) / (n_j^(·) + Wβ)    θ̂_j^(d) = (n_j^(d) + α) / (n_·^(d) + Tα)
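The same estimates written as a short function over the count matrices used in the sampler sketch above; names and array shapes are my assumptions.

```python
import numpy as np

def estimate_phi_theta(n_wt, n_dt, alpha, beta):
    """Point estimates of topic and document distributions from one Gibbs sample.

    n_wt : (W, T) counts of word w assigned to topic j
    n_dt : (D, T) counts of topic j used in document d
    """
    W, T = n_wt.shape
    phi = (n_wt + beta) / (n_wt.sum(axis=0) + W * beta)                      # P(w | z = j)
    theta = (n_dt + alpha) / (n_dt.sum(axis=1, keepdims=True) + T * alpha)   # P(z = j | d)
    return phi.T, theta   # phi returned as (T, W), theta as (D, T)
```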
A visual example: bars
pixel = word, image = document; sample each pixel from a mixture of topics.
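A sketch of how such toy data can be generated, assuming (as in the classic bars test) that each topic is a horizontal or vertical bar in a small grid; the grid size, α, and image length are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
SIZE = 5  # 5x5 grid: 25 "words" (pixels), 10 "topics" (bars) -- sizes are assumptions

# Each topic puts uniform mass on one row or one column of the grid.
bars = []
for i in range(SIZE):
    row = np.zeros((SIZE, SIZE)); row[i, :] = 1.0
    col = np.zeros((SIZE, SIZE)); col[:, i] = 1.0
    bars += [row.ravel() / SIZE, col.ravel() / SIZE]
phi = np.array(bars)                      # (T=10, W=25) topic-word distributions

def generate_image(n_pixels=100, alpha=0.2):
    """Sample one 'image' (document) as a bag of pixels from a mixture of bars."""
    theta = rng.dirichlet(alpha * np.ones(len(phi)))
    z = rng.choice(len(phi), size=n_pixels, p=theta)
    pixels = np.array([rng.choice(SIZE * SIZE, p=phi[t]) for t in z])
    return np.bincount(pixels, minlength=SIZE * SIZE).reshape(SIZE, SIZE)

print(generate_image())
```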
Strategy
Markov chain Monte Carlo (MCMC) is normally slow, so why consider using it?
In discrete models, use conjugate priors to reduce inference to the discrete variables.
Several benefits:
–save memory: need only track sparse counts
–save time: cheap updates, even with complex dependencies between variables
Perplexity vs. time
(Plot; Dirichlet hyperparameters α, β not estimated.)
Strategy (continued)
These properties (sparse counts, cheap updates) let us explore larger, more complex models.
Application to corpus data
TASA corpus: text from first grade to college.
More than 20,000 word types, over 30,000 documents, approximately 6 million word tokens.
Run Gibbs sampling for models with T = 300, 500, …, 1700 topics.
A selection from 500 topics [ P(w|z = j) ]:
THEORY SCIENTISTS EXPERIMENT OBSERVATIONS SCIENTIFIC EXPERIMENTS HYPOTHESIS EXPLAIN SCIENTIST OBSERVED EXPLANATION BASED OBSERVATION IDEA EVIDENCE THEORIES BELIEVED DISCOVERED OBSERVE FACTS
SPACE EARTH MOON PLANET ROCKET MARS ORBIT ASTRONAUTS FIRST SPACECRAFT JUPITER SATELLITE SATELLITES ATMOSPHERE SPACESHIP SURFACE SCIENTISTS ASTRONAUT SATURN MILES
ART PAINT ARTIST PAINTING PAINTED ARTISTS MUSEUM WORK PAINTINGS STYLE PICTURES WORKS OWN SCULPTURE PAINTER ARTS BEAUTIFUL DESIGNS PORTRAIT PAINTERS
STUDENTS TEACHER STUDENT TEACHERS TEACHING CLASS CLASSROOM SCHOOL LEARNING PUPILS CONTENT INSTRUCTION TAUGHT GROUP GRADE SHOULD GRADES CLASSES PUPIL GIVEN
BRAIN NERVE SENSE SENSES ARE NERVOUS NERVES BODY SMELL TASTE TOUCH MESSAGES IMPULSES CORD ORGANS SPINAL FIBERS SENSORY PAIN IS
CURRENT ELECTRICITY ELECTRIC CIRCUIT IS ELECTRICAL VOLTAGE FLOW BATTERY WIRE WIRES SWITCH CONNECTED ELECTRONS RESISTANCE POWER CONDUCTORS CIRCUITS TUBE NEGATIVE
A selection from 500 topics [ P(w|z = j) ]:
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
Evaluation: word association (Nelson, McEvoy & Schreiber, 1998)
Cue: PLANET
Associates: EARTH, PLUTO, JUPITER, NEPTUNE, VENUS, URANUS, SATURN, COMET, MARS, ASTEROID
Evaluation: word association
(Figure: the association norms arranged as a matrix of cues by associates.)
Evaluation: word association
Comparison with Latent Semantic Analysis (LSA; Landauer & Dumais, 1997).
Both algorithms applied to the TASA corpus (D > 30,000, W > 20,000, n > 6,000,000).
Compare the LSA cosine and inner product with the "on-topic" conditional probability.
Latent Semantic Analysis (Landauer & Dumais, 1997)
(Figure: a word-by-document co-occurrence matrix X, with example rows for "words", "in", "semantic", and "spaces" across Doc1, Doc2, Doc3, ….)
The SVD, X = U D V^T, maps words from the co-occurrence matrix into a high-dimensional space.
Latent Semantic Analysis (Landauer & Dumais, 1997)
X = U D V^T: U relates words to dimensions, D weights the dimensions, and V relates documents to dimensions, giving vectors for words and documents.
Dimensionality reduction makes storage efficient and extracts correlations.
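A brief sketch of LSA as a truncated SVD; the function names are mine, and scaling the vectors by the singular values is one standard convention assumed here.

```python
import numpy as np

def lsa(X, k):
    """Latent Semantic Analysis: truncated SVD of a word-document count matrix.

    X : (W, D) word-by-document co-occurrence matrix
    k : number of retained dimensions
    Returns word vectors (W, k) and document vectors (D, k).
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    word_vecs = U[:, :k] * s[:k]      # words in the k-dimensional semantic space
    doc_vecs = Vt[:k].T * s[:k]       # documents in the same space
    return word_vecs, doc_vecs

def cosine(u, v):
    """Similarity used in the comparison: cosine between two word vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
```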
Properties of word association
Asymmetry
Violation of the triangle inequality
"Small world" graph
Small world graph
Treat the association matrix as an adjacency matrix (edges indicate positive association), linking cues to their associates.
(Figure: example graph over JOY, LOVE, RESEARCH, MATHEMATICS, MYSTERY.)
Small world graph
Properties:
–short path lengths
–clustering
–power law degree distribution
Small world graphs arise elsewhere: social relations, biology, the internet.
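A small sketch of how the degree distribution of such a graph can be computed from an association matrix; the thresholding rule and variable names are assumptions for illustration.

```python
import numpy as np

def degree_distribution(assoc, threshold=0.0):
    """Treat a cue-by-associate matrix as a directed graph and return P(k).

    assoc[i, j] > threshold means word i (cue) produced word j (associate);
    the in-degree of j is then the number of cues for which j is an associate.
    """
    adjacency = assoc > threshold
    k = adjacency.sum(axis=0)                # in-degree: number of cues per word
    values, counts = np.unique(k, return_counts=True)
    return values, counts / counts.sum()     # empirical degree distribution P(k)

# On a log-log plot, a power law P(k) ~ k^(-gamma) appears as a straight line,
# whereas an exponential tail falls off much faster.
```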
What is a power law distribution?
Exponential distribution: e.g. the distribution of heights. Power law distribution: e.g. the distribution of wealth.
A power law in word association (Steyvers & Tenenbaum)
(Plot: word association data, distribution of k = number of cues.)
Example cue: PLANET; associates: EARTH, PLUTO, JUPITER, NEPTUNE, …
The statistics of meaning (Steyvers & Tenenbaum)
Zipf's law of meaning: the number of senses per word.
Roget's Thesaurus: the number of classes per word.
(Plot: Roget's Thesaurus, distribution of k = number of classes.)
Meanings and associations
Word association involves words: a unipartite graph.
Meaning involves words and contexts: a bipartite graph.
(Figures: a unipartite association graph over JOY, LOVE, RESEARCH, MATHEMATICS, MYSTERY, and a bipartite graph linking the same words to CONTEXT 1 and CONTEXT 2.)
Meanings and associations
A power law in the bipartite graph implies the same in the unipartite graph, so the word association power law can be obtained from meanings.
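A sketch of the bipartite-to-unipartite projection this argument relies on; the membership-matrix representation and the shared-context linking rule are my assumptions for illustration.

```python
import numpy as np

def project_bipartite(membership):
    """Project a word-by-context membership matrix onto a unipartite word graph.

    membership[w, c] = 1 if word w appears in context (e.g. topic) c.
    Two words are linked if they share at least one context.
    """
    membership = np.asarray(membership, dtype=int)
    shared = membership @ membership.T       # number of shared contexts per pair
    adjacency = (shared > 0).astype(int)
    np.fill_diagonal(adjacency, 0)
    return adjacency

# If the number of contexts per word follows a power law in the bipartite graph,
# the degrees in the projected word-word graph inherit a heavy tail as well.
```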
Power law in word association (Steyvers & Tenenbaum)
(Plots of the distribution of k = number of associations: word association data, words in semantic spaces, and Latent Semantic Analysis.)
(Plot: probability of containing the first associate vs. rank.)
Meanings and associations
(Plots: topic model P(w | z = j), distribution of k = number of topics; topic model P(w2 | w1), distribution of k = number of cues.)
Problems
Finding the right number of topics
No dependencies between topics
The "bag of words" assumption
Need for a stop list
Problems
Finding the right number of topics and the lack of dependencies between topics: addressed by CRP models (Blei, Jordan, Tenenbaum).
The "bag of words" assumption and the need for a stop list: addressed by HMM syntax (Steyvers, Blei, Tenenbaum).
Standard LDA: T corpus topics; all T topics are in each document.
(Figure: doc1, doc2, doc3 each connected to every one of topics 1…T.)
In contrast: T corpus topics, but only L topics are in each document; the topic identities for a document are indexed by c.
(Figure: doc1, doc2, doc3 each connected to a small subset of topics 1…T.)
Richer dependencies
The nature of the topic dependencies comes from the prior on assignments of topics to documents, p(c).
Inference with Gibbs sampling is straightforward.
Boring prior: pick L of the T topics uniformly.
Some interesting priors on assignments:
–Chinese restaurant process (CRP)
–nested CRP (for hierarchies)
Chinese restaurant process
The mth customer at an infinitely large Chinese restaurant chooses a table with
P(occupied table i) = n_i / (m - 1 + γ),    P(next unoccupied table) = γ / (m - 1 + γ),
where n_i is the number of customers already seated at table i.
Closely related to the Dirichlet process and infinite models (Beal, Ghahramani, Neal, Rasmussen).
Prior on assignments: one topic on each table, L visits per document, T is unbounded.
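A minimal sketch of sampling table assignments from the CRP; the concentration parameter name gamma follows the formula above, and the demo size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def crp(num_customers, gamma=1.0):
    """Sample table assignments for a sequence of customers from a CRP."""
    counts = []                       # counts[i] = customers already at table i
    assignments = []
    for m in range(1, num_customers + 1):
        # Customer m sits at occupied table i w.p. counts[i]/(m-1+gamma),
        # or at a new table w.p. gamma/(m-1+gamma).
        probs = np.array(counts + [gamma], dtype=float) / (m - 1 + gamma)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments

print(crp(10))
```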
Generating a document
1. Choose c by sampling L tables from the Chinese restaurant, without replacement
2. Choose θ^(d) ~ Dirichlet(α) (over L slots)
3. For each word in the document
–choose z ~ Multinomial(θ^(d))
–choose w ~ Multinomial(φ^(c(z)))
Inverting the generative model
Draw z as before, but conditioned on c.
Draw each document's c one at a time from P(c_d | w, c_-d, z) ∝ P(w_d | c, w_-d, z) P(c_d | c_-d).
Need only track the occupied tables.
Recover the topics and the number of occupied tables.
Model selection with the CRP
(Plot: Bayes factors under the Chinese restaurant process prior.)
Nested CRP
Infinitely many infinite-table restaurants.
Every table has a card for another restaurant, forming an infinitely-branching tree.
An L-day vacation: visit the root restaurant the first night, go to the restaurant on the chosen table's card the next night, and so on.
Once inside a restaurant, choose the table (and thus the next restaurant) via the standard CRP.
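A sketch of drawing one document's path from the nested CRP; representing the tree as a dictionary of child counts keyed by node path is my own choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_path(tree, depth, gamma=1.0):
    """Sample one root-to-leaf path of length `depth` from a nested CRP.

    `tree` maps a node (tuple of child indices from the root) to a list of
    child-occupancy counts; it is updated in place as paths are drawn.
    """
    node, path = (), []
    for _ in range(depth):
        counts = tree.setdefault(node, [])
        m = sum(counts) + 1                   # this document is the m-th visitor here
        probs = np.array(counts + [gamma], dtype=float) / (m - 1 + gamma)
        child = rng.choice(len(probs), p=probs)
        if child == len(counts):
            counts.append(1)                  # open a new branch (new restaurant)
        else:
            counts[child] += 1
        node = node + (child,)
        path.append(node)
    return path

tree = {}
paths = [sample_path(tree, depth=3) for _ in range(5)]  # 5 documents, L = 3 levels
print(paths)
```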
The nested CRP as a prior
One topic per restaurant; each document has one topic at each of the L levels of a tree.
Each c is a path through the tree.
Collecting these paths from all documents gives a finite subtree of used topics.
Allows unsupervised learning of hierarchies.
Extends Hofmann's (1999) topic hierarchies.
Generating a document
1. Choose c by sampling a path from the nested Chinese restaurant process
2. Choose θ^(d) ~ Dirichlet(α) (over L slots)
3. For each word in the document
–choose z ~ Multinomial(θ^(d))
–choose w ~ Multinomial(φ^(c(z)))
Inverting the generative model
Draw z as before, but conditioned on c.
Draw each document's path c as a block from P(c_d | w, c_-d, z) ∝ P(w_d | c, w_-d, z) P(c_d | c_-d).
Need only track previously taken paths.
Recover the topics and the set of paths (a finite subtree).
Twelve years of NIPS
Summary
Letting document topics be a subset of the corpus topics allows richer dependencies.
Using Gibbs sampling makes it possible to have an unbounded number of corpus topics.
The flat model and hierarchies are only two options of many: factorial structures, arbitrary graphs, etc.
Problems
Finding the right number of topics and the lack of dependencies between topics: addressed by CRP models (Blei, Jordan, Tenenbaum).
The "bag of words" assumption and the need for a stop list: addressed next by HMM syntax (Steyvers, Tenenbaum).
Syntax and semantics from statistics
Factorization of language based on statistical dependency patterns:
–semantics (probabilistic topics): long-range, document-specific dependencies
–syntax (probabilistic regular grammar): short-range dependencies, constant across all documents
(Graphical model: each word w has a topic z and a syntactic class x.)
Generating text from the composite model:
semantic topics: z = 1 {HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2}, z = 2 {SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2}
syntactic classes: x = 1 emits from the document's topics; the other classes are {THE 0.6, A 0.3, MANY 0.1} and {OF 0.6, FOR 0.3, BETWEEN 0.1}
The sequence is generated word by word: THE → THE LOVE → THE LOVE OF → THE LOVE OF RESEARCH
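A minimal sketch of this composite generative process; the transition matrix of the regular grammar and the numbering of the two function-word classes are illustrative assumptions (the slides fix only that class x = 1 defers to the topics).

```python
import numpy as np

rng = np.random.default_rng(1)

# Topic and class distributions from the example slides.
topics = {1: (["HEART", "LOVE", "SOUL", "TEARS", "JOY"], [0.2] * 5),
          2: (["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"], [0.2] * 5)}
classes = {2: (["THE", "A", "MANY"], [0.6, 0.3, 0.1]),
           3: (["OF", "FOR", "BETWEEN"], [0.6, 0.3, 0.1])}

# Transition matrix of the probabilistic regular grammar (illustrative values).
trans = np.array([[0.1, 0.5, 0.4],    # from class 1 (semantic)
                  [0.9, 0.05, 0.05],  # from class 2 ("THE", ...)
                  [0.9, 0.05, 0.05]]) # from class 3 ("OF", ...)

def generate(n_words=6, theta=(0.5, 0.5)):
    """Generate a short word sequence from the composite HMM + topics model."""
    words, x = [], 1
    for _ in range(n_words):
        x = rng.choice(3, p=trans[x - 1]) + 1          # syntactic class: short-range
        if x == 1:                                     # semantic class: pick a topic
            z = rng.choice([1, 2], p=theta)            # long-range, document-specific
            vocab, probs = topics[z]
        else:
            vocab, probs = classes[x]
        words.append(vocab[rng.choice(len(vocab), p=probs)])
    return " ".join(words)

print(generate())
```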
Inverting the generative model
Sample z conditioned on x and the other z
–drawn from the prior if x > 1 (i.e. the word came from a syntactic class)
Sample x conditioned on z and the other x
Inference allows estimation of
–"semantic" topics
–"syntactic" classes
Semantic topics:
FOOD FOODS BODY NUTRIENTS DIET FAT SUGAR ENERGY MILK EATING FRUITS VEGETABLES WEIGHT FATS NEEDS CARBOHYDRATES VITAMINS CALORIES PROTEIN MINERALS
MAP NORTH EARTH SOUTH POLE MAPS EQUATOR WEST LINES EAST AUSTRALIA GLOBE POLES HEMISPHERE LATITUDE PLACES LAND WORLD COMPASS CONTINENTS
DOCTOR PATIENT HEALTH HOSPITAL MEDICAL CARE PATIENTS NURSE DOCTORS MEDICINE NURSING TREATMENT NURSES PHYSICIAN HOSPITALS DR SICK ASSISTANT EMERGENCY PRACTICE
BOOK BOOKS READING INFORMATION LIBRARY REPORT PAGE TITLE SUBJECT PAGES GUIDE WORDS MATERIAL ARTICLE ARTICLES WORD FACTS AUTHOR REFERENCE NOTE
GOLD IRON SILVER COPPER METAL METALS STEEL CLAY LEAD ADAM ORE ALUMINUM MINERAL MINE STONE MINERALS POT MINING MINERS TIN
BEHAVIOR SELF INDIVIDUAL PERSONALITY RESPONSE SOCIAL EMOTIONAL LEARNING FEELINGS PSYCHOLOGISTS INDIVIDUALS PSYCHOLOGICAL EXPERIENCES ENVIRONMENT HUMAN RESPONSES BEHAVIORS ATTITUDES PSYCHOLOGY PERSON
CELLS CELL ORGANISMS ALGAE BACTERIA MICROSCOPE MEMBRANE ORGANISM FOOD LIVING FUNGI MOLD MATERIALS NUCLEUS CELLED STRUCTURES MATERIAL STRUCTURE GREEN MOLDS
PLANTS PLANT LEAVES SEEDS SOIL ROOTS FLOWERS WATER FOOD GREEN SEED STEMS FLOWER STEM LEAF ANIMALS ROOT POLLEN GROWING GROW
Syntactic classes:
GOOD SMALL NEW IMPORTANT GREAT LITTLE LARGE * BIG LONG HIGH DIFFERENT SPECIAL OLD STRONG YOUNG COMMON WHITE SINGLE CERTAIN
THE HIS THEIR YOUR HER ITS MY OUR THIS THESE A AN THAT NEW THOSE EACH MR ANY MRS ALL
MORE SUCH LESS MUCH KNOWN JUST BETTER RATHER GREATER HIGHER LARGER LONGER FASTER EXACTLY SMALLER SOMETHING BIGGER FEWER LOWER ALMOST
ON AT INTO FROM WITH THROUGH OVER AROUND AGAINST ACROSS UPON TOWARD UNDER ALONG NEAR BEHIND OFF ABOVE DOWN BEFORE
SAID ASKED THOUGHT TOLD SAYS MEANS CALLED CRIED SHOWS ANSWERED TELLS REPLIED SHOUTED EXPLAINED LAUGHED MEANT WROTE SHOWED BELIEVED WHISPERED
ONE SOME MANY TWO EACH ALL MOST ANY THREE THIS EVERY SEVERAL FOUR FIVE BOTH TEN SIX MUCH TWENTY EIGHT
HE YOU THEY I SHE WE IT PEOPLE EVERYONE OTHERS SCIENTISTS SOMEONE WHO NOBODY ONE SOMETHING ANYONE EVERYBODY SOME THEN
BE MAKE GET HAVE GO TAKE DO FIND USE SEE HELP KEEP GIVE LOOK COME WORK MOVE LIVE EAT BECOME
(Plots: Bayes factors for different models; part-of-speech tagging.)
NIPS semantics:
EXPERTS EXPERT GATING HME ARCHITECTURE MIXTURE LEARNING MIXTURES FUNCTION GATE
DATA GAUSSIAN MIXTURE LIKELIHOOD POSTERIOR PRIOR DISTRIBUTION EM BAYESIAN PARAMETERS
STATE POLICY VALUE FUNCTION ACTION REINFORCEMENT LEARNING CLASSES OPTIMAL *
MEMBRANE SYNAPTIC CELL * CURRENT DENDRITIC POTENTIAL NEURON CONDUCTANCE CHANNELS
IMAGE IMAGES OBJECT OBJECTS FEATURE RECOGNITION VIEWS # PIXEL VISUAL
KERNEL SUPPORT VECTOR SVM KERNELS # SPACE FUNCTION MACHINES SET
NETWORK NEURAL NETWORKS OUTPUT INPUT TRAINING INPUTS WEIGHTS # OUTPUTS
NIPS syntax:
MODEL ALGORITHM SYSTEM CASE PROBLEM NETWORK METHOD APPROACH PAPER PROCESS
IS WAS HAS BECOMES DENOTES BEING REMAINS REPRESENTS EXISTS SEEMS
SEE SHOW NOTE CONSIDER ASSUME PRESENT NEED PROPOSE DESCRIBE SUGGEST
USED TRAINED OBTAINED DESCRIBED GIVEN FOUND PRESENTED DEFINED GENERATED SHOWN
IN WITH FOR ON FROM AT USING INTO OVER WITHIN
HOWEVER ALSO THEN THUS THEREFORE FIRST HERE NOW HENCE FINALLY
# * I X T N - C F P
Function and content words
Highlighting and templating
Open questions
Are MCMC methods useful elsewhere?
–"smoothing with negative weights"
–Markov chains on grammars
Other nonparametric language models?
–infinite HMM, infinite PCFG, clustering
Better ways of combining topics and syntax?
–richer syntactic models
–better combination schemes