Latent Semantic Analysis Probabilistic Topic Models & Associative Memory
The Psychological Problem How do we learn semantic structure? Covariation between words and the contexts they appear in (e.g. LSA) How do we represent semantic structure? Semantic Spaces (e.g. LSA) Probabilistic Topics
Latent Semantic Analysis (Landauer & Dumais, 1997) word-document counts high dimensional space SVD RIVER STREAM MONEY BANK Each word is a single point in semantic space Similarity measured by cosine of angle between word vectors
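The pipeline on this slide (word-document counts, SVD, cosine between word vectors) can be sketched in a few lines. This is only an illustration: the count matrix is made up, the choice of k = 2 dimensions is arbitrary, and the helper names `lsa_word_vectors` and `cosine` are hypothetical. Real LSA also typically applies a weighting transform to the counts before the SVD, which is omitted here.

```python
import numpy as np

# Toy word-document count matrix (rows = words, columns = documents).
# The counts are invented for illustration only.
words = ["river", "stream", "money", "bank"]
counts = np.array([
    [3, 2, 0, 0],   # river
    [2, 3, 0, 0],   # stream
    [0, 0, 3, 2],   # money
    [1, 2, 2, 3],   # bank
], dtype=float)

def lsa_word_vectors(counts, k):
    """Return k-dimensional word vectors from a truncated SVD of the counts."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    # Scale the left singular vectors by the singular values and keep k dims.
    return u[:, :k] * s[:k]

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vectors = lsa_word_vectors(counts, k=2)
idx = {w: i for i, w in enumerate(words)}

sim_river_stream = cosine(vectors[idx["river"]], vectors[idx["stream"]])
sim_river_money = cosine(vectors[idx["river"]], vectors[idx["money"]])
print(sim_river_stream, sim_river_money)
```

Each word ends up as a single point, so an ambiguous word such as BANK gets one location that has to serve both of its senses.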
Critical Assumptions of Semantic Spaces (e.g. LSA) Psychological distance should obey three axioms Minimality Symmetry Triangle inequality
For conceptual relations, violations of the distance axioms are often found. Similarities can often be asymmetric: “North Korea” is more similar to “China” than vice versa; “Pomegranate” is more similar to “Apple” than vice versa. Violations of the triangle inequality (points A, B, C): Euclidean distance requires AC ≤ AB + BC
Triangle Inequality in Semantic Spaces might not always hold. Figure: word vectors w1, w2, w3; words PLAY, SOCCER, THEATER. Cosine similarity: cos(w1,w3) ≥ cos(w1,w2)cos(w2,w3) - sin(w1,w2)sin(w2,w3). Euclidean distance: AC ≤ AB + BC
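The cosine bound on this slide follows from the triangle inequality applied to angles. A short derivation, writing θ_ij for the angle between w_i and w_j (notation introduced here, not on the slide) and assuming θ_12 + θ_23 ≤ π so that cosine is decreasing over the relevant range:

```latex
\begin{align*}
\theta_{13} &\le \theta_{12} + \theta_{23} \\
\cos\theta_{13} &\ge \cos(\theta_{12} + \theta_{23}) \\
                &= \cos\theta_{12}\cos\theta_{23} - \sin\theta_{12}\sin\theta_{23}
\end{align*}
```

As a worked case: if cos θ_12 = cos θ_23 = 0.9, then sin θ_12 sin θ_23 = 1 - 0.81 = 0.19, so the space forces cos θ_13 ≥ 0.81 - 0.19 = 0.62. Two words that are each close to a third cannot be far from each other, whatever the human similarity data say.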
Nearest neighbor problem (Tversky & Hutchinson, 1986). In similarity data, “Fruit” is the nearest neighbor of 18 out of 20 fruit words. In a 2D solution, “Fruit” can be the nearest neighbor of at most 5 items. High-dimensional solutions might solve this, but these are less appealing
Probabilistic Topic Models. A probabilistic version of LSA: no spatial constraints. Originated in the domain of statistics & machine learning (e.g., Hofmann, 2001; Blei, Ng, & Jordan, 2003). Extracts topics from large collections of text. Topics are interpretable, unlike the arbitrary dimensions of LSA
DATA Corpus of text: Word counts for each document Topic Model Find parameters that “reconstruct” data Model is Generative
Probabilistic Topic Models Each document is a probability distribution over topics (distribution over topics = gist) Each topic is a probability distribution over words
Document generation as a probabilistic process
1. For each document, choose a mixture of topics
2. For every word slot, sample a topic [1..T] from the mixture
3. Sample a word from the topic
(Diagram: TOPICS MIXTURE, then TOPIC, then WORD, repeated per word slot)
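The three generative steps can be sketched as follows. This is a minimal illustration, not the presenter's code: the vocabulary, the two topic distributions, the Dirichlet parameter `alpha`, and the function name `generate_document` are all assumptions chosen to echo the money/river example.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["money", "loan", "bank", "river", "stream"]
# Each row is one topic: a probability distribution over the vocabulary.
# These values are illustrative assumptions.
topics = np.array([
    [1/3, 1/3, 1/3, 0.0, 0.0],   # topic 1: money, loan, bank
    [0.0, 0.0, 1/3, 1/3, 1/3],   # topic 2: bank, river, stream
])

def generate_document(n_words, alpha, topics, rng):
    """Generate one document as a list of (word, topic) pairs."""
    # Step 1: choose this document's mixture of topics.
    theta = rng.dirichlet(alpha)
    doc = []
    for _ in range(n_words):
        # Step 2: for this word slot, sample a topic from the mixture.
        z = rng.choice(len(theta), p=theta)
        # Step 3: sample a word from the chosen topic.
        w = rng.choice(topics.shape[1], p=topics[z])
        doc.append((vocab[w], int(z) + 1))   # report topics as 1..T
    return theta, doc

theta, doc = generate_document(16, alpha=[1.0, 1.0], topics=topics, rng=rng)
print(theta)
print(doc)
```

Note that "bank" can be emitted by either topic, which is how the model accommodates a word with two senses.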
Example
Mixture components:
TOPIC 1: money, loan, bank
TOPIC 2: river, stream, bank
Mixture weights (figure; surviving label: .7)
DOCUMENT 1: money 1 bank 1 bank 1 loan 1 river 2 stream 2 bank 1 money 1 river 2 bank 1 money 1 bank 1 loan 1 money 1 stream 2 bank 1 money 1 bank 1 bank 1 loan 1 river 2 stream 2 bank 1 money 1 river 2 bank 1 money 1 bank 1 loan 1 bank 1 money 1 stream
DOCUMENT 2: river 2 stream 2 bank 2 stream 2 bank 2 money 1 loan 1 river 2 stream 2 loan 1 bank 2 river 2 bank 2 bank 1 stream 2 river 2 loan 1 bank 2 stream 2 bank 2 money 1 loan 1 river 2 stream 2 bank 2 stream 2 bank 2 money 1 river 2 stream 2 loan 1 bank 2 river 2 bank 2 money 1 bank 1 stream 2 river 2 bank 2 stream 2 bank 2 money 1
Bayesian approach: use priors
Mixture weights ~ Dirichlet( )
Mixture components ~ Dirichlet( )
Inverting (“fitting”) the model
Mixture components: TOPIC 1 = ?, TOPIC 2 = ?
Mixture weights: ?
DOCUMENT 1: money ? bank ? bank ? loan ? river ? stream ? bank ? money ? river ? bank ? money ? bank ? loan ? money ? stream ? bank ? money ? bank ? bank ? loan ? river ? stream ? bank ? money ? river ? bank ? money ? bank ? loan ? bank ? money ? stream ?
DOCUMENT 2: river ? stream ? bank ? stream ? bank ? money ? loan ? river ? stream ? loan ? bank ? river ? bank ? bank ? stream ? river ? loan ? bank ? stream ? bank ? money ? loan ? river ? stream ? bank ? stream ? bank ? money ? river ? stream ? loan ? bank ? river ? bank ? money ? bank ? stream ? river ? bank ? stream ? bank ? money ?
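The slides do not say which algorithm is used to fill in the question marks. One common choice for this kind of model is collapsed Gibbs sampling, sketched below under that assumption. The function name `gibbs_lda`, the hyperparameter values `alpha` and `beta`, the iteration count, and the toy documents are all illustrative, not taken from the source.

```python
import numpy as np

def gibbs_lda(docs, n_topics, n_words, alpha=1.0, beta=0.1, n_iter=50, seed=0):
    """Collapsed Gibbs sampler for a topic model (illustrative sketch).

    docs: list of lists of word ids in [0, n_words).
    Returns (theta, phi, z): document-topic weights, topic-word
    distributions, and the final topic assignment of every token.
    """
    rng = np.random.default_rng(seed)
    ndz = np.zeros((len(docs), n_topics))   # tokens per (document, topic)
    nzw = np.zeros((n_topics, n_words))     # tokens per (topic, word)
    nz = np.zeros(n_topics)                 # tokens per topic

    # Start from random topic assignments (the "?" marks on the slide).
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndz[d, t] += 1; nzw[t, w] += 1; nz[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                t = z[d][i]
                ndz[d, t] -= 1; nzw[t, w] -= 1; nz[t] -= 1
                # Conditional probability of each topic for this token.
                p = (ndz[d] + alpha) * (nzw[:, w] + beta) / (nz + n_words * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                # Record the new assignment and restore the counts.
                z[d][i] = t
                ndz[d, t] += 1; nzw[t, w] += 1; nz[t] += 1

    phi = (nzw + beta) / (nz[:, None] + n_words * beta)
    theta = (ndz + alpha) / (ndz.sum(axis=1, keepdims=True) + n_topics * alpha)
    return theta, phi, z

# Toy data: word ids 0..4 stand for money, loan, bank, river, stream.
docs = [
    [0, 2, 2, 1, 0, 2, 1, 0, 2, 3],   # mostly financial words
    [3, 4, 2, 4, 2, 3, 4, 3, 2, 0],   # mostly river words
]
theta, phi, z = gibbs_lda(docs, n_topics=2, n_words=5)
print(theta)
print(phi)
```

Because the sampler is stochastic and topic labels are interchangeable, which learned topic corresponds to "finance" and which to "rivers" can differ from run to run.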
Application to corpus data
TASA corpus: text from first grade to college (representative sample of text)
26,000+ word types (stop words removed)
37,000+ documents
6,000,000+ word tokens
Example: topics from an educational corpus (TASA)
37K docs, 26K words, 1700 topics, e.g.:
PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
Polysemy
PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
Three documents with the word “play” (numbers & colors indicate topic assignments)
No Problem of Triangle Inequality. Topic structure easily explains violations of the triangle inequality. (Figure: SOCCER, MAGNETIC, and FIELD placed relative to TOPIC 1 and TOPIC 2)
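A small numerical sketch shows why two topics avoid the triangle constraint. It assumes that association strength is the conditional probability of a response given a cue, computed by summing over topics, P(response | cue) = Σ_z P(response | z) P(z | cue). The topic distributions, the uniform topic prior, and the function name `p_response_given_cue` are made-up illustrations.

```python
import numpy as np

vocab = ["soccer", "field", "magnetic"]
# P(word | topic): one row per topic. Values are illustrative assumptions.
phi = np.array([
    [0.5, 0.5, 0.0],   # topic 1 (sports): soccer, field
    [0.0, 0.5, 0.5],   # topic 2 (physics): field, magnetic
])
prior = np.array([0.5, 0.5])   # assumed uniform P(topic)

def p_response_given_cue(cue, response):
    """P(response | cue), summing over which topic produced the cue."""
    c, r = vocab.index(cue), vocab.index(response)
    posterior = phi[:, c] * prior          # proportional to P(topic | cue)
    posterior = posterior / posterior.sum()
    return float(phi[:, r] @ posterior)

p_soccer_field = p_response_given_cue("soccer", "field")        # 0.5
p_field_magnetic = p_response_given_cue("field", "magnetic")    # 0.25
p_soccer_magnetic = p_response_given_cue("soccer", "magnetic")  # 0.0
print(p_soccer_field, p_field_magnetic, p_soccer_magnetic)
```

SOCCER is associated with FIELD, and FIELD with MAGNETIC, yet SOCCER and MAGNETIC are unrelated. Nothing in the model forces a lower bound on the third association, unlike distances in a semantic space.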
Applications
Enron data: 500,000 emails, 5,000 authors
Enron topics
TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES
GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT
ENVIRONMENTAL AIR MTBE EMISSIONS CLEAN EPA PENDING SAFETY WATER GASOLINE
FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED
POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC
STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU
TIMELINE: May 22, 2000, start of California energy crisis
Applying Model to Psychological Data
Network of Word Associations (association norms by Doug Nelson et al., 1998)
BASEBALL BAT BALL GAME PLAY STAGE THEATER
Explaining structure with topics
BASEBALL BAT BALL GAME PLAY STAGE THEATER (figure labels: topic 1, topic 2)
Modeling Word Association. Word association modeled as prediction: given that a single word is observed, what other words might occur in the future? Under a single topic assumption: Response Cue
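The slide's equation did not survive in the text; only its labels (Response, Cue) remain. The form usually given for this prediction in the topic-model literature, offered here as an assumption rather than a transcription, writes the cue as w_1, the response as w_2, and sums over topics z:

```latex
\[
P(w_2 \mid w_1) \;=\; \sum_{z} P(w_2 \mid z)\, P(z \mid w_1),
\qquad
P(z \mid w_1) \;\propto\; P(w_1 \mid z)\, P(z)
\]
```

The single topic assumption is what allows the cue and the response to share one z: the cue is used to infer the topic, and the inferred topic then predicts the response. The resulting measure is a conditional probability, so it need not be symmetric in cue and response.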
Observed associates for the cue “play”
Model predictions (figure; surviving label: RANK 9)
Median rank of first associate (figure; axis label: Median Rank)
Recall: example study List STUDY: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy FALSE RECALL: “Sleep” 61%
Recall as a reconstructive process Reconstruct study list based on the stored “gist” The gist can be represented by a distribution over topics Under a single topic assumption: Retrieved word Study list
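The equation on this slide is also missing from the text, leaving only the labels Retrieved word and Study list. A plausible reconstruction, stated as an assumption, mirrors the word-association prediction, with the whole study list s = (s_1, ..., s_n) taking the place of the single cue:

```latex
\[
P(w \mid \mathbf{s}) \;=\; \sum_{z} P(w \mid z)\, P(z \mid \mathbf{s}),
\qquad
P(z \mid \mathbf{s}) \;\propto\; P(z) \prod_{i=1}^{n} P(s_i \mid z)
\]
```

Here P(z | s) plays the role of the stored gist. A word such as “Sleep” that was never studied can still receive high retrieval probability if it is probable under the topic that best explains the study list.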
Predictions for the “Sleep” list STUDY LIST EXTRA LIST (top 8)