1
Extracting Semantic Representations with Probabilistic Topic Models. Mark Steyvers (UC Irvine), Tom Griffiths (Brown University), Padhraic Smyth and Dave Newman (UC Irvine)
2
Extracting Statistical Regularities from Text. Sources: email, books/journals, newspapers. Computer Science/Statistics: information retrieval, text mining, data mining. Psychology: semantic cognition, episodic memory, psycholinguistics. ?
3
Overview: I. Probabilistic Topic Models; II. Computer Science Applications (Analyzing Scientific Topics: PNAS; Analyzing NSF and NIH funding; Analyzing Enron Email); III. Theory for semantic cognition (Word Association; Free Recall); IV. Conclusion
4
Probabilistic Topic Models. Originated in the domain of statistics & machine learning (e.g., Hofmann, 2001; Blei, Ng, Jordan, 2003). Extract topics from large collections of text. No use of dictionaries or thesauri. Topic extraction is unsupervised.
5
DATA: corpus of text. Topic Model: find parameters that “reconstruct” the data. Model is generative.
6
Probabilistic Topic Models. Each document is a probability distribution over topics. Each topic is a probability distribution over words.
7
Document generation as a probabilistic process (diagram: TOPICS MIXTURE, then TOPIC, then WORD)
1. For each document, choose a mixture of topics
2. For every word slot, sample a topic [1..T] from the mixture
3. Sample a word from the topic
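The three steps can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: the function name `generate_document`, the symmetric Dirichlet parameter `alpha`, and the toy topic-word matrix `phi` (built from the money/bank/river vocabulary of the next slide) are all assumptions made for the example.

```python
import numpy as np

def generate_document(phi, alpha, n_words, rng):
    """Sample one document from the generative process.

    phi   : (T, V) array; row t is topic t's distribution over V words
    alpha : symmetric Dirichlet parameter for the topic mixture (assumed)
    """
    n_topics, vocab_size = phi.shape
    # 1. choose a mixture of topics for this document
    theta = rng.dirichlet(np.full(n_topics, alpha))
    # 2. for every word slot, sample a topic from the mixture
    topics = rng.choice(n_topics, size=n_words, p=theta)
    # 3. sample a word from the chosen topic
    words = np.array([rng.choice(vocab_size, p=phi[z]) for z in topics])
    return theta, topics, words

# Toy vocabulary and topics (illustrative values only)
vocab = ["money", "loan", "bank", "river", "stream"]
phi = np.array([
    [1 / 3, 1 / 3, 1 / 3, 0.0, 0.0],  # topic 1: money, loan, bank
    [0.0, 0.0, 1 / 3, 1 / 3, 1 / 3],  # topic 2: bank, river, stream
])
rng = np.random.default_rng(0)
theta, topics, words = generate_document(phi, alpha=1.0, n_words=16, rng=rng)
# Print each word followed by the (1-based) topic that generated it
print(" ".join(f"{vocab[w]} {z + 1}" for w, z in zip(words, topics)))
```

The printed line has the same "word topic" layout as the example documents on the following slide.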
8
Example
TOPIC 1 (mixture component): money, loan, bank
TOPIC 2 (mixture component): river, stream, bank
DOCUMENT 1: money 1 bank 1 bank 1 loan 1 river 2 stream 2 bank 1 money 1 river 2 bank 1 money 1 bank 1 loan 1 money 1 stream 2 bank 1 money 1 bank 1 bank 1 loan 1 river 2 stream 2 bank 1 money 1 river 2 bank 1 money 1 bank 1 loan 1 bank 1 money 1 stream 2
DOCUMENT 2: river 2 stream 2 bank 2 stream 2 bank 2 money 1 loan 1 river 2 stream 2 loan 1 bank 2 river 2 bank 2 bank 1 stream 2 river 2 loan 1 bank 2 stream 2 bank 2 money 1 loan 1 river 2 stream 2 bank 2 stream 2 bank 2 money 1 river 2 stream 2 loan 1 bank 2 river 2 bank 2 money 1 bank 1 stream 2 river 2 bank 2 stream 2 bank 2 money 1
(The number after each word is the topic that generated it. Mixture weights labeled in the figure: .3, .8, .2, .7)
Bayesian approach: use priors. Mixture weights ~ Dirichlet( ). Mixture components ~ Dirichlet( ).
9
Inverting (“fitting”) the model: the topic assignments, the mixture components (TOPIC 1, TOPIC 2), and the mixture weights are all unknown (?)
DOCUMENT 1: money ? bank ? bank ? loan ? river ? stream ? bank ? money ? river ? bank ? money ? bank ? loan ? money ? stream ? bank ? money ? bank ? bank ? loan ? river ? stream ? bank ? money ? river ? bank ? money ? bank ? loan ? bank ? money ? stream ?
DOCUMENT 2: river ? stream ? bank ? stream ? bank ? money ? loan ? river ? stream ? loan ? bank ? river ? bank ? bank ? stream ? river ? loan ? bank ? stream ? bank ? money ? loan ? river ? stream ? bank ? stream ? bank ? money ? river ? stream ? loan ? bank ? river ? bank ? money ? bank ? stream ? river ? bank ? stream ? bank ? money ?
10
Inverting the generative model. Inverting the model involves extracting the topics and the per-document mixing proportions from the corpus, using Bayesian inference techniques (MCMC with Gibbs sampling).
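One common way to do this is a collapsed Gibbs sampler that resamples the topic of each word token given all other assignments. The slide only says "MCMC with Gibbs sampling", so the sketch below is an assumption about the details: the function name `gibbs_lda`, the hyperparameter names `alpha` and `beta` and their values, and the toy documents are illustrative.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.5, beta=0.1, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for a topic model (illustrative sketch).

    docs: list of documents, each a list of integer word ids.
    Returns per-document topic weights, per-topic word distributions,
    and the final topic assignment of every token.
    """
    rng = np.random.default_rng(seed)
    n_dt = np.zeros((len(docs), n_topics))   # topic counts per document
    n_tw = np.zeros((n_topics, vocab_size))  # word counts per topic
    n_t = np.zeros(n_topics)                 # total tokens per topic
    z = []
    # Random initial topic assignment for every token
    for d, doc in enumerate(docs):
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            n_dt[d, t] += 1
            n_tw[t, w] += 1
            n_t[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts
                t = z[d][i]
                n_dt[d, t] -= 1
                n_tw[t, w] -= 1
                n_t[t] -= 1
                # Conditional distribution over topics for this token
                p = (n_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + vocab_size * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                # Add the token back under its new topic
                z[d][i] = t
                n_dt[d, t] += 1
                n_tw[t, w] += 1
                n_t[t] += 1
    theta = (n_dt + alpha) / (n_dt + alpha).sum(axis=1, keepdims=True)
    phi = (n_tw + beta) / (n_tw + beta).sum(axis=1, keepdims=True)
    return theta, phi, z

# Toy corpus: word ids index into [money, loan, bank, river, stream]
docs = [[0, 2, 2, 1, 0, 2, 1, 0], [3, 4, 2, 4, 3, 2, 4, 3], [0, 1, 2, 3, 4, 2, 0, 4]]
theta, phi, z = gibbs_lda(docs, n_topics=2, vocab_size=5)
print(np.round(theta, 2))
print(np.round(phi, 2))
```

Here `theta` plays the role of the mixture weights and `phi` the mixture components from the earlier slides; both are estimated from the count matrices after the last sweep.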
11
Example: topics from an educational corpus (TASA). 37K docs, 26K words, 1700 topics, e.g. (one topic per line):
PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
12
Polysemy: in the six TASA topics of the previous slide, the same word appears in more than one topic, each time in a different sense, e.g., PLAY (theater topic and sports topic), COURT (sports topic and trial topic), CHARACTERS (printing topic and theater topic), TEST (science topic and studying topic).
13
Overview: I. Probabilistic Topic Models; II. Computer Science Applications (Analyzing Scientific Topics: PNAS; Analyzing NSF and NIH funding; Analyzing Enron Email); III. Theory for semantic cognition (Word Association; Free Recall); IV. Conclusion
14
How do topics change over time? Analysis of dynamics: perform linear trend analysis for each topic. “Hot topics” go up, “cold topics” go down.
Topic 37: CDNA AMINO SEQUENCE ACID PROTEIN ISOLATED ENCODING CLONED ACIDS IDENTITY CLONE EXPRESSED ENCODES RAT HOMOLOGY
Topic 289: KDA PROTEIN PURIFIED MOLECULAR MASS CHROMATOGRA.. POLYPEPTIDE GEL SDS BAND APPARENT LABELED IDENTIFIED FRACTION DETECTED
Topic 75: ANTIBODY ANTIBODIES MONOCLONAL ANTIGEN IGG MAB SPECIFIC EPITOPE HUMAN MABS RECOGNIZED SERA EPITOPES DIRECTED NEUTRALIZING
Topic 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE
Topic 134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED
Topic 179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION
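A linear trend analysis of this kind can be sketched as a least-squares line fitted to each topic's average weight per year, with the sign of the slope separating rising from falling topics. The function name `topic_trends`, the year range, and the synthetic weights below are hypothetical; they are not the PNAS data.

```python
import numpy as np

def topic_trends(years, theta_by_year):
    """Slope of a least-squares line through each topic's yearly mean weight.

    years         : (Y,) sequence of years
    theta_by_year : (Y, T) array, mean weight of each topic in each year
    """
    years = np.asarray(years, dtype=float)
    # np.polyfit fits every column at once; row 0 holds the slopes
    return np.polyfit(years - years.mean(), theta_by_year, deg=1)[0]

# Synthetic example: topic 0 rises, topic 1 falls, topic 2 is flat
years = np.arange(1991, 2002)
k = np.arange(len(years))
theta_by_year = np.column_stack([
    0.02 + 0.001 * k,
    0.05 - 0.002 * k,
    np.full(len(years), 0.03),
])
slopes = topic_trends(years, theta_by_year)
hot, cold = int(np.argmax(slopes)), int(np.argmin(slopes))
print("hot topic:", hot, "cold topic:", cold)
```

In practice one would also test whether each slope differs reliably from zero before labeling a topic hot or cold.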
15
Cold topics: 37, 289, 75. Hot topics: 2, 134, 179 (the same six topics as on the previous slide). Figure annotations: NOBEL 1987, NOBEL 2002.
16
Overview: I. Probabilistic Topic Models; II. Computer Science Applications (Analyzing Scientific Topics: PNAS; Analyzing NSF and NIH funding; Analyzing Enron Email); III. Theory for semantic cognition (Word Association; Free Recall); IV. Conclusion
17
NSF & NIH grant abstracts. Analyze 22,000+ grants active during 2002. NIH: NIMH, NCI. NSF: BIO, SBE. Visualize topic similarity between funding programs. What topics are funded?
18
Example topics
19
2D visualization of funding programs (NIH, NSF-BIO, NSF-SBE): nearby programs support similar topics
20
Funding Amounts per Topic. We have $ funding per grant. We have a distribution of topics for each grant. Solve for the $ amount per topic. What are expensive topics?
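The slide does not say how the amounts are solved for. One plausible reading, shown here purely as an assumption, treats each grant's funding as the topic-weighted sum of per-topic dollar amounts and solves the resulting linear system by least squares. The function name `dollars_per_topic` and all numbers are made up for illustration.

```python
import numpy as np

def dollars_per_topic(theta, funding):
    """Least-squares solution x of theta @ x ~= funding.

    theta   : (G, T) array, topic distribution of each grant
    funding : (G,) array, dollars awarded to each grant
    """
    x, *_ = np.linalg.lstsq(theta, funding, rcond=None)
    return x

# Four hypothetical grants over two topics
theta = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.7, 0.3]])
true_amounts = np.array([100.0, 20.0])  # assumed $ (thousands) per topic
funding = theta @ true_amounts
estimated = dollars_per_topic(theta, funding)
print(np.round(estimated, 2))
```

With noisy real data the plain least-squares solution can go negative, so a non-negativity constraint may be a sensible refinement.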
21
High $$$ topics / Low $$$ topics
22
Overview: I. Probabilistic Topic Models; II. Computer Science Applications (Analyzing Scientific Topics: PNAS; Analyzing NSF and NIH funding; Analyzing Enron Email); III. Theory for semantic cognition (Word Association; Free Recall); IV. Conclusion
23
Enron email data: 500,000 emails, 5000 authors, 1999-2002
24
Enron topics (one topic per line):
TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES
GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT
ENVIRONMENTAL AIR MTBE EMISSIONS CLEAN EPA PENDING SAFETY WATER GASOLINE
FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED
POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC
STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU
TIMELINE: May 22, 2000, start of California energy crisis
25
Overview: I. Probabilistic Topic Models; II. Computer Science Applications (Analyzing Scientific Topics: PNAS; Analyzing NSF and NIH funding; Analyzing Enron Email); III. Theory for semantic cognition (Word Association; Free Recall); IV. Conclusion
26
Semantic Memory. The semantic memory system might arise from the need to 1) predict what concepts are needed in what contexts and 2) disambiguate uncertain information. A useful perspective for understanding various language and memory tasks.
27
Word Association CUE: PLAY RESPONSES: FUN, BALL, GAME, WORK, GROUND, MATE, CHILD, ENJOY, WIN, ACTOR
28
Modeling Word Association. Word association is modeled as prediction: given that a single word (the cue) is observed, which other words are likely to occur? Under a single topic assumption: (equation relating the Response to the Cue)
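The equation itself is not in the transcript. A form that matches the description is P(response | cue) = sum over topics z of P(response | z) P(z | cue), with P(z | cue) obtained from Bayes' rule. The sketch below assumes that form plus a uniform prior over topics; the function name `association` and the toy `phi` matrix are illustrative.

```python
import numpy as np

def association(phi, cue, topic_prior=None):
    """P(response | cue) for every word, assuming cue and response share one topic.

    phi : (T, V) array of P(word | topic); cue is a word index.
    topic_prior defaults to uniform (an assumption of this sketch).
    """
    n_topics = phi.shape[0]
    if topic_prior is None:
        prior = np.full(n_topics, 1.0 / n_topics)
    else:
        prior = np.asarray(topic_prior, dtype=float)
    # Bayes' rule: P(z | cue) is proportional to P(cue | z) P(z)
    posterior = phi[:, cue] * prior
    posterior = posterior / posterior.sum()
    # Average the topics' word distributions under that posterior
    return posterior @ phi

vocab = ["money", "loan", "bank", "river", "stream"]
phi = np.array([
    [1 / 3, 1 / 3, 1 / 3, 0.0, 0.0],
    [0.0, 0.0, 1 / 3, 1 / 3, 1 / 3],
])
from_money = association(phi, cue=0)
from_bank = association(phi, cue=2)
print(dict(zip(vocab, np.round(from_bank, 3))))
```

Because "bank" is probable under both toy topics, its predicted responses spread over both the financial and the river words, whereas "money" predicts only the financial ones. The distribution includes the cue itself, which one would normally exclude before ranking responses.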
29
Observed associates for the cue “play”
30
Model predictions from TASA corpus (figure annotation: RANK 9)
31
Median rank of first associate (plot; axis label: Median Rank)
32
Latent Semantic Analysis (Landauer & Dumais, 1997). Pipeline: word-document counts, then SVD, then a high dimensional space. Example words: RIVER, STREAM, MONEY, BANK. Each word is a single point in semantic space. Similarity is measured by the cosine of the angle between word vectors.
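The pipeline can be sketched with a truncated SVD of a small word-document count matrix followed by cosine similarity between the resulting word vectors. The counts are invented, and the sketch skips the term weighting usually applied before the SVD, so it illustrates the idea rather than the published method.

```python
import numpy as np

words = ["river", "stream", "money", "bank"]
# Hypothetical word-document counts (rows: words, columns: 4 documents)
counts = np.array([
    [2, 3, 0, 0],
    [3, 2, 0, 0],
    [0, 0, 3, 2],
    [1, 1, 2, 2],
], dtype=float)

# Truncated SVD: keep the k largest singular values
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
word_vectors = U[:, :k] * s[:k]  # each row is one word's point in the space

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

river, stream, money, bank = word_vectors
print("river-stream:", round(cosine(river, stream), 3))
print("river-money: ", round(cosine(river, money), 3))
print("bank-money:  ", round(cosine(bank, money), 3))
```

Each word gets exactly one vector, so a word like "bank" must sit at a single compromise position between its financial and river contexts.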
33
Median rank of first associate (plot; axis label: Median Rank)
34
Triangle Inequality in Spatial Representations. Example words for w_1, w_2, w_3: PLAY, SOCCER, THEATER. Cosine similarity: cos(w_1, w_3) ≥ cos(w_1, w_2) cos(w_2, w_3) – sin(w_1, w_2) sin(w_2, w_3)
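The bound follows from the triangle inequality on angles, since the right-hand side equals the cosine of the sum of the two angles. A quick numerical check on random vectors, written as a sketch with hypothetical helper names, makes the constraint concrete:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def satisfies_bound(w1, w2, w3, tol=1e-9):
    """Check cos(w1,w3) >= cos(w1,w2)cos(w2,w3) - sin(w1,w2)sin(w2,w3)."""
    c12, c23, c13 = cos_sim(w1, w2), cos_sim(w2, w3), cos_sim(w1, w3)
    # Angles between vectors lie in [0, pi], so their sines are non-negative
    s12 = np.sqrt(max(0.0, 1.0 - c12 ** 2))
    s23 = np.sqrt(max(0.0, 1.0 - c23 ** 2))
    return c13 >= c12 * c23 - s12 * s23 - tol

rng = np.random.default_rng(0)
triples = rng.normal(size=(1000, 3, 5))  # 1000 random triples of 5-d vectors
all_ok = all(satisfies_bound(w1, w2, w3) for w1, w2, w3 in triples)
print("bound holds for every triple:", all_ok)
```

So if w_1 is close to w_2 and w_2 is close to w_3, any spatial model forces w_1 and w_3 to be fairly close as well, which is what the human association data are tested against.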
35
Testing violation of triangle inequality. Look for triplets of associates w_1, w_2, w_3 such that P( w_2 | w_1 ) > threshold and P( w_3 | w_2 ) > threshold, and measure P( w_3 | w_1 ). Vary the threshold.
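A direct way to run this test, sketched here under the assumption that the conditional probabilities are available as a matrix, is to enumerate the triplets that pass the threshold and collect P(w_3 | w_1) for each. The function name `chained_association` and the toy probabilities are hypothetical.

```python
import numpy as np

def chained_association(P, threshold):
    """Collect P(w3 | w1) over triplets with P(w2|w1) and P(w3|w2) above threshold.

    P[i, j] is P(w=j | w=i). Triplets must consist of three distinct words.
    """
    strong = P > threshold
    values = []
    for w1 in range(P.shape[0]):
        for w2 in np.flatnonzero(strong[w1]):
            if w2 == w1:
                continue
            for w3 in np.flatnonzero(strong[w2]):
                if w3 == w1 or w3 == w2:
                    continue
                values.append(P[w1, w3])
    return np.array(values)

# Toy values for [soccer, play, theater]; remaining probability mass is
# assumed to fall on words not shown
P = np.array([
    [0.0, 0.6, 0.0],
    [0.3, 0.0, 0.3],
    [0.0, 0.6, 0.0],
])
values = chained_association(P, threshold=0.25)
print(len(values), "triplets, median P(w3|w1) =", float(np.median(values)))
```

In this toy case both triplets run through "play", and the endpoints "soccer" and "theater" are not associated at all, the pattern a spatial representation has difficulty producing.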
37
Small-World Structure of Associations (Steyvers & Tenenbaum, 2005). Network example: BASEBALL, BAT, BALL, GAME, PLAY, STAGE, THEATER. Properties: 1) short path lengths, 2) clustering, 3) power law degree distributions. Small world graphs arise elsewhere: internet, social relations, biology.
38
Power law degree distribution: the number of incoming links has a power law distribution (exponent = -2.25). Some words are very often used as an associate. Network example: BASEBALL, BAT, BALL, GAME, PLAY, STAGE, THEATER.
39
Creating Association Networks. TOPICS MODEL: calculate the conditional probabilities of all word pairs i and j; connect i to j when P( w=j | w=i ) > threshold. LSA: for each word, generate K associates by picking the K nearest neighbors in semantic space. (Power law exponent shown in figure: -2.05)
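Both constructions can be sketched as functions that return directed edges, after which the in-degree of each word is a simple count. The function names and the toy inputs below are assumptions for illustration only.

```python
import numpy as np

def topic_model_edges(P, threshold):
    """Connect i to j when P(w=j | w=i) exceeds the threshold (no self-links)."""
    src, dst = np.nonzero(P > threshold)
    return [(int(i), int(j)) for i, j in zip(src, dst) if i != j]

def knn_edges(vectors, k):
    """Connect each word to its k nearest neighbors by cosine similarity."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a word is not its own neighbor
    nearest = np.argsort(-sim, axis=1)[:, :k]
    return [(i, int(j)) for i in range(len(vectors)) for j in nearest[i]]

def in_degree(edges, n_words):
    """Number of incoming links per word."""
    targets = np.array([j for _, j in edges], dtype=int)
    return np.bincount(targets, minlength=n_words)

# Toy conditional probabilities for [soccer, play, theater]
P = np.array([
    [0.0, 0.6, 0.0],
    [0.3, 0.0, 0.3],
    [0.0, 0.6, 0.0],
])
tm_edges = topic_model_edges(P, threshold=0.25)
# Toy 2-d semantic space for the same three words
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
lsa_edges = knn_edges(vectors, k=1)
print(tm_edges, in_degree(tm_edges, 3))
print(lsa_edges, in_degree(lsa_edges, 3))
```

Note that the nearest-neighbor construction fixes every word's out-degree at K, while the thresholded construction lets both in-degree and out-degree vary.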
40
Paradigmatic/ Syntagmatic Associations
41
Associations in free recall. STUDY THESE WORDS: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy. RECALL WORDS..... FALSE RECALL: “Sleep” 61%
42
Recall as a reconstructive process. Reconstruct the study list based on the stored “gist”. The gist can be represented by a distribution over topics. Under a single topic assumption: (equation relating the Retrieved word to the Study list)
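The equation is missing from the transcript. One form consistent with the text infers a posterior over topics from the whole study list and then predicts words from it: P(word | list) = sum over z of P(word | z) P(z | list), with P(z | list) proportional to P(z) times the product of P(w_i | z) over the studied words. The sketch below assumes that form, a uniform topic prior, and a smoothed toy `phi` in which every word has nonzero probability under every topic.

```python
import numpy as np

def recall_probabilities(phi, study_list, topic_prior=None):
    """P(word | study list), assuming all studied words share one topic.

    phi        : (T, V) array of P(word | topic), assumed strictly positive
    study_list : list of word indices that were studied
    """
    n_topics = phi.shape[0]
    if topic_prior is None:
        prior = np.full(n_topics, 1.0 / n_topics)
    else:
        prior = np.asarray(topic_prior, dtype=float)
    # Work in log space so long lists do not underflow
    log_post = np.log(prior) + np.log(phi[:, study_list]).sum(axis=1)
    log_post -= log_post.max()
    posterior = np.exp(log_post)
    posterior /= posterior.sum()  # this posterior is the "gist"
    return posterior @ phi

vocab = ["bed", "rest", "dream", "sleep", "ball", "game"]
phi = np.array([
    [0.25, 0.25, 0.20, 0.25, 0.025, 0.025],  # hypothetical "sleep" topic
    [0.05, 0.10, 0.05, 0.05, 0.40, 0.35],    # hypothetical "sports" topic
])
p = recall_probabilities(phi, study_list=[0, 1, 2])  # bed, rest, dream
print(dict(zip(vocab, np.round(p, 3))))
```

The unstudied word "sleep" receives a high retrieval probability because it is probable under the topic that best explains the list, mirroring the false recall effect on the previous slide.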
43
Predictions for the “Sleep” list STUDY LIST EXTRA LIST (top 8)
44
Psychology/Comp.Sci Connections. Research on human memory is useful for developing better text mining algorithms. Models for information retrieval might be helpful in understanding human memory.
45
Integrating Topics and Syntax. Syntactic dependencies: short-range dependencies. Semantic dependencies: long-range. (Graphical model with variables z, w, s.) Semantic state: generate words from topic model. Syntactic states: generate words from HMM. (Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
46
Word groups (one per line):
... IN BY WITH ON AS FROM TO FOR
THE A AN THIS THEIR ITS EACH ONE
IS ARE BE HAS HAVE WAS WERE AS
BASED PRESENTED DISCUSSED PROPOSED DESCRIBED SUCH USED DERIVED
THEORY MODEL PROCESSES MODELS SYSTEM PROCESS EFFECTS INFORMATION
ATTENTION SEARCH VISUAL PROCESSING TASK PERFORMANCE INFORMATION ATTENTIONAL
MEMORY TERM LONG SHORT RETRIEVAL STORAGE MEMORIES AMNESIA
IQ BEHAVIOR EVOLUTIONARY ENVIRONMENT GENES HERITABILITY GENETIC SELECTION
DRUG AROUSAL NEURAL BRAIN HABITUATION BIOLOGICAL TOLERANCE BEHAVIORAL
SOCIAL SELF ATTITUDE IMPLICIT ATTITUDES PERSONALITY JUDGMENT PERCEPTION
Example sentences:
(S) THE SEARCH IN LONG TERM MEMORY ……
(S) A MODEL OF VISUAL ATTENTION ……
47
Random sentence generation
LANGUAGE:
[S] RESEARCHERS GIVE THE SPEECH
[S] THE SOUND FEEL NO LISTENERS
[S] WHICH WAS TO BE MEANING
[S] HER VOCABULARIES STOPPED WORDS
[S] HE EXPRESSLY WANTED THAT BETTER VOWEL
48
Topic Hierarchies. In the regular topic model, there are no relations between topics. Alternative: hierarchical topic organization (tree diagram with nodes topic 1 through topic 7). Nested Chinese Restaurant Process: Blei, Griffiths, Jordan, Tenenbaum (2004). Learn hierarchical structure, as well as topics within structure.
49
Example: Psych Review Abstracts (topic hierarchy; one word group per line)
RESPONSE STIMULUS REINFORCEMENT RECOGNITION STIMULI RECALL CHOICE CONDITIONING
SPEECH READING WORDS MOVEMENT MOTOR VISUAL WORD SEMANTIC
ACTION SOCIAL SELF EXPERIENCE EMOTION GOALS EMOTIONAL THINKING
GROUP IQ INTELLIGENCE SOCIAL RATIONAL INDIVIDUAL GROUPS MEMBERS
SEX EMOTIONS GENDER EMOTION STRESS WOMEN HEALTH HANDEDNESS
REASONING ATTITUDE CONSISTENCY SITUATIONAL INFERENCE JUDGMENT PROBABILITIES STATISTICAL
IMAGE COLOR MONOCULAR LIGHTNESS GIBSON SUBMOVEMENT ORIENTATION HOLOGRAPHIC
CONDITIONIN STRESS EMOTIONAL BEHAVIORAL FEAR STIMULATION TOLERANCE RESPONSES
A MODEL MEMORY FOR MODELS TASK INFORMATION RESULTS ACCOUNT
SELF SOCIAL PSYCHOLOGY RESEARCH RISK STRATEGIES INTERPERSONAL PERSONALITY SAMPLING
MOTION VISUAL SURFACE BINOCULAR RIVALRY CONTOUR DIRECTION CONTOURS SURFACES
DRUG FOOD BRAIN AROUSAL ACTIVATION AFFECTIVE HUNGER EXTINCTION PAIN
THE OF AND TO IN A IS
50
Generative Process (illustrated on the Psych Review topic hierarchy of the previous slide)