1
Probabilistic Topic Models
Mark Steyvers, Department of Cognitive Sciences, University of California, Irvine
Joint work with: Tom Griffiths (UC Berkeley), Padhraic Smyth (UC Irvine), Dave Newman (UC Irvine), Chaitanya Chemudugunta (UC Irvine)
2
Problems of Interest
Information retrieval
– Finding relevant documents in databases
– How can we go beyond keyword matches?
Psychology
– Retrieving information from episodic/semantic memory
– How to access memory at multiple levels of abstraction?
3
Overview
I. Probabilistic Topic Models
II. Topic models for association
III. Topic models for recall
IV. Topic models for information retrieval
V. Conclusion
4
Probabilistic Topic Models
Originated in the domain of statistical machine learning
Perform unsupervised extraction of topics from large text collections
Text documents:
– scientific articles
– web pages
– email
– lists of words in memory experiments
5
Probabilistic Topic Models
Each topic is a probability distribution over words
Each document is modeled as a mixture of topics
We do not observe these distributions directly, but we can infer them statistically
6
The Generative Model
[graphical model: TOPIC MIXTURE → TOPIC → WORD]
1. For each document, choose a mixture of topics
2. For every word slot, sample a topic [1..T] from the mixture
3. Sample a word from that topic
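The three numbered steps map directly onto code. Below is a minimal sketch of the generative process in Python; the two toy topics and their word probabilities are taken from the toy example on the next slides, while the Dirichlet parameter and document length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy topics from the example slides: each topic is a distribution over words.
topics = {
    1: {"HEART": 0.2, "LOVE": 0.2, "SOUL": 0.2, "TEARS": 0.2, "JOY": 0.2},
    2: {"SCIENTIFIC": 0.2, "KNOWLEDGE": 0.2, "WORK": 0.2,
        "RESEARCH": 0.2, "MATHEMATICS": 0.2},
}

def generate_document(n_words, alpha=1.0):
    # 1. For each document, choose a mixture of topics (symmetric Dirichlet assumed).
    theta = rng.dirichlet([alpha] * len(topics))
    words = []
    for _ in range(n_words):
        # 2. For every word slot, sample a topic from the mixture.
        z = rng.choice(list(topics), p=theta)
        # 3. Sample a word from that topic's distribution.
        dist = topics[z]
        words.append(rng.choice(list(dist), p=list(dist.values())))
    return theta, words

theta, doc = generate_document(8)
print(np.round(theta, 2), doc)
```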
7
Toy Example (mixture concentrated on Topic 1)
[figure: topic mixture and word probabilities for each topic]
Document: HEART, LOVE, LOVE, SOUL, HEART, …
8
Toy Example (mixture concentrated on Topic 2)
[figure: topic mixture and word probabilities for each topic]
Document: SCIENTIFIC, KNOWLEDGE, SCIENTIFIC, RESEARCH, …
9
Toy Example (mixture of both topics)
[figure: topic mixture and word probabilities for each topic]
Document: LOVE, SCIENTIFIC, SOUL, LOVE, KNOWLEDGE, …
10
Applying the model to a large corpus of text
Simultaneously infer the topics and topic mixtures that best “reconstruct” the corpus
Bayesian methods (MCMC)
Unsupervised!!
11
Example topics from TASA, an educational corpus
37K docs, 26K word vocabulary, 300 topics, e.g.: [topic word lists shown as a figure]
12
Topics from Psych Review abstracts
All 1,281 abstracts since 1967; 50 topics. Examples:
SIMILARITY CATEGORY CATEGORIES RELATIONS DIMENSIONS FEATURES STRUCTURE SIMILAR REPRESENTATION OBJECTS
STIMULUS CONDITIONING LEARNING RESPONSE STIMULI RESPONSES AVOIDANCE REINFORCEMENT CLASSICAL DISCRIMINATION
MEMORY RETRIEVAL RECALL ITEMS INFORMATION TERM RECOGNITION ITEMS LIST ASSOCIATIVE
GROUP INDIVIDUAL GROUPS OUTCOMES INDIVIDUALS DIFFERENCES INTERACTION
EMOTIONAL EMOTION BASIC EMOTIONS AFFECT STATES EXPERIENCES AFFECTIVE AFFECTS RESEARCH
…
13
Application: Research Browser (http://yarra.calit2.uci.edu/calit2/)
Automatically analyzes research papers by UC San Diego and UC Irvine researchers related to the California Institute for Telecommunications and Information Technology (CalIT2)
Finds topically related researchers
14
Notes
Determining the number of topics
Word order is ignored: the “bag of words” assumption
Function words (the, a, etc.) are typically removed
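A quick illustration of what the bag-of-words assumption and function-word removal amount to in practice. This sketch is not from the talk; the tiny stopword list is an assumed subset for illustration.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and"}  # illustrative subset of function words

def bag_of_words(text):
    # Word order is discarded: only the count of each remaining word survives.
    tokens = [t.lower() for t in text.split()]
    return Counter(t for t in tokens if t not in STOPWORDS)

print(bag_of_words("The love of the soul and the love of research"))
# Counter({'love': 2, 'soul': 1, 'research': 1})
```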
15
Latent Semantic Analysis (Landauer & Dumais, 1997)
Learned from large text collections
Spatial representation:
– Each word is a single point in semantic space
– Nearby words are similar in meaning
Problems:
– Static representation for ambiguous words
– Dimensions are uninterpretable
– Little flexibility to expand
[figure: JOY, LOVE, MYSTERY, MATHEMATICS, RESEARCH plotted in semantic space]
16
Overview
I. Probabilistic Topic Models
II. Topic models for association
III. Topic models for recall
IV. Topic models for information retrieval
V. Conclusion
17
Word Association
CUE: PLAY (Nelson, McEvoy, & Schreiber, USF word association norms)
Responses include: BASEBALL, BAT, BALL, GAME, STAGE, THEATER
18
Topic Model for Word Association
Topics are latent factors that group words
Word association is modeled as probabilistic “spreading of activation”:
– The cue activates topics
– Topics activate other words
[network figure: the cue PLAY activates a TOPIC, which in turn activates BASEBALL, BAT, BALL, GAME, STAGE, THEATER]
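Concretely, the two activation steps correspond to a sum over topics: P(w2 | w1) = Σ_z P(w2 | z) P(z | w1), with P(z | w1) ∝ P(w1 | z) P(z). A minimal sketch in Python; the matrix layout and the uniform topic prior are assumptions for illustration.

```python
import numpy as np

def associate_probs(phi, cue, prior=None):
    """phi: (T, W) array of P(w|z); cue: word index.
    Returns P(w2|cue) = sum_z P(w2|z) P(z|cue)."""
    T, _ = phi.shape
    prior = np.full(T, 1.0 / T) if prior is None else prior
    p_z = phi[:, cue] * prior     # cue activates topics: P(z|cue) ∝ P(cue|z) P(z)
    p_z /= p_z.sum()
    return p_z @ phi              # topics activate other words

# Two toy topics over the words [PLAY, BALL, STAGE]
phi = np.array([[0.6, 0.4, 0.0],    # sports topic
                [0.6, 0.0, 0.4]])   # theater topic
print(associate_probs(phi, cue=0))  # PLAY activates both BALL and STAGE
```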
19
Model predictions from the TASA corpus
[figure: predicted associates for an example cue; annotation: RANK 9]
20
Median rank of the first associate
[bar chart: TOPICS model]
21
Median rank of the first associate
[bar chart: TOPICS model vs. LSA]
22
What about collocations? Why are these words related?
– PLAY - GROUND
– DOW - JONES
– BUMBLE - BEE
Suggests at least two routes for association:
– Semantic
– Collocation
Integrate collocations into the topic model
23
Collocation Topic Model
[graphical model: TOPIC MIXTURE → TOPIC → WORD, with a binary switch variable X between successive words]
If x = 0, sample a word from the topic
If x = 1, sample a word from a distribution conditioned on the previous word
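A minimal sketch of that switching step, assuming a learned bigram table and a fixed switch probability; the names and numbers are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

topic_words = {"DOW": 0.5, "RISES": 0.5}              # toy topic, P(w|z)
bigram = {"DOW": {"JONES": 0.9, "INDUSTRIALS": 0.1}}  # P(w | previous word)

def next_word(prev, p_colloc=0.5):
    # x = 1: continue a collocation if the previous word can start one.
    if prev in bigram and rng.random() < p_colloc:
        dist = bigram[prev]
    else:
        # x = 0: sample from the topic as usual.
        dist = topic_words
    return rng.choice(list(dist), p=list(dist.values()))

words = ["DOW"]
for _ in range(2):
    words.append(next_word(words[-1]))
print(words)  # e.g. ['DOW', 'JONES', 'RISES']
```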
24
Collocation Topic Model
Example: “DOW JONES RISES”
[diagram: DOW is sampled from a topic; x=1, so JONES is sampled from the distribution of words following DOW; x=0, so RISES is sampled from a topic]
JONES is more likely explained as a word following DOW than as a word sampled from a topic
Result: DOW_JONES is recognized as a collocation
25
Example topics from the New York Times:
Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX
Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP
26
Overview
I. Probabilistic Topic Models
II. Topic models for association
III. Topic models for recall
IV. Topic models for information retrieval
V. Conclusion
27
Experiment
Study the following words for a memory test:
PEAS CARROTS BEANS SPINACH LETTUCE HAMMER TOMATOES CORN CABBAGE SQUASH
Recall words…
Result: the outlier word “HAMMER” is remembered better
28
Hunt & Lamb (2001, Exp. 1)
OUTLIER LIST: PEAS CARROTS BEANS SPINACH LETTUCE HAMMER TOMATOES CORN CABBAGE SQUASH
CONTROL LIST: SAW SCREW CHISEL DRILL SANDPAPER HAMMER NAILS BENCH RULER ANVIL
29
Semantic Isolation / Von Restorff Effects
Verbal explanations: attention, surprise, distinctiveness
Our approach: multiple ways to encode each word
– Isolated items are stored verbatim (direct-access route)
– Related items are stored by topics (gist route)
30
Dual Route Topic Model
Recall involves a reconstructive process over stored information
Encoding involves choosing routes to store words; each word is represented by one route
[diagram: the study list is encoded via topics (distributed representation) and via a word distribution (localist representation); at retrieval, the study list is reconstructed from both]
31
Study words: PEAS CARROTS BEANS SPINACH LETTUCE HAMMER TOMATOES CORN CABBAGE SQUASH
[figure: encoding and retrieval of this list under the dual route model]
32
Model Predictions
33
Interpretation of Von Restorff effects
Contextually unique words are stored in the verbatim route
Contextually expected items are stored by topics
Reconstruction is better for words stored in the verbatim route
34
False memory effects (Robinson & Roediger, 1997); lure = ANGER
Number of associates studied: 3, 6, or 9
3: MAD FEAR HATE SMOOTH NAVY HEAT SALAD TUNE COURTS CANDY PALACE PLUSH TOOTH BLIND WINTER
6: MAD FEAR HATE RAGE TEMPER FURY SALAD TUNE COURTS CANDY PALACE PLUSH TOOTH BLIND WINTER
9: MAD FEAR HATE RAGE TEMPER FURY WRATH HAPPY FIGHT CANDY PALACE PLUSH TOOTH BLIND WINTER
35
Relation to dual retrieval models
Familiarity/recollection distinction (e.g., Atkinson & Juola, 1972; Mandler, 1980)
– Not clear how these retrieval processes relate to encoding routes
Gist/verbatim distinction (e.g., Reyna & Brainerd, 1995)
– Maps onto the topics and verbatim routes
Differences:
– Our approach specifies both encoding and retrieval representations and processes
– Routes are not independent
– Model explains performance for actual word lists
36
Overview
I. Probabilistic Topic Models
II. Topic models for association
III. Topic models for recall
IV. Topic models for information retrieval
V. Conclusion
37
Example: finding Dave Huber’s website
Works well when keywords are distinctive and match exactly
[diagram: the query terms “Dave Huber” and “UCSD” match the same terms in the document]
38
Problem: related but non-matching keywords
Challenge: how to match at the word level as well as at a conceptual level
[diagram: the query contains “Dave Huber” and “mathematical modeling”; the document matches “Dave Huber” but has no exact match for “mathematical modeling”]
39
Dual route model for information retrieval
Encode documents with two routes:
– Contextually unique words: verbatim route
– Thematic words: topics route
40
Example encoding
Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.
Contextually unique words: ALCOVE, SCHAFFER, MEDIN, NOSOFSKY
Topic 1 (p=0.21): learning phenomena acquisition learn acquired …
Topic 22 (p=0.17): similarity objects object space category dimensional categories spatial
Topic 61 (p=0.08): representations representation order alternative 1st higher 2nd descriptions problem form
41
Retrieval Experiments
Corpora: NIPS conference papers
For each candidate document, calculate how likely the query was “generated” from the model’s encoding of that document
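A sketch of that scoring rule under the dual-route encoding: each query word can be explained either by the document's own verbatim word distribution or by its topic mixture, and candidate documents are ranked by the resulting likelihood. The mixing weight, smoothing constant, and variable names are assumptions for illustration.

```python
import numpy as np

def query_log_likelihood(query, doc_dist, phi, theta, lam=0.5, eps=1e-12):
    """Score log P(query | document) under a dual-route encoding.
    query:    list of word indices
    doc_dist: (W,) verbatim route, the document's own word distribution
    phi:      (T, W) topic-word probabilities P(w|z)
    theta:    (T,) the document's topic mixture P(z|d)
    lam:      assumed fixed weight on the verbatim route
    """
    p_topics = theta @ phi                     # topics route: P(w|d) via topics
    p_word = lam * doc_dist + (1 - lam) * p_topics
    return float(sum(np.log(p_word[w] + eps) for w in query))

# Rank candidate documents by this score, highest first.
```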
42
Retrieval Results with Short Queries
NIPS corpus; queries consist of 2-4 words
[figure: performance of the dual route model, standard topics, and LSA]
43
Overview
I. Probabilistic Topic Models
II. Topic models for association
III. Topic models for recall
IV. Topic models for information retrieval
V. Conclusion
44
Conclusions (1)
1) Memory as a reconstructive process
– Encoding
– Retrieval
2) Information retrieval as a reconstructive process
– What candidate document best “reconstructs” the query?
45
Conclusions (2)
Modeling multiple levels of organization in human memory
Information retrieval: matching of queries to documents should operate at multiple levels
46
Software Public-domain MATLAB toolbox for topic modeling on the Web: http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm
49
Retrieval Results with Long Queries
Cranfield corpus; queries consist of sentences
[figure: performance of the dual route model, standard topics, and LSA]
50
Precision/Recall
[Venn diagram: the retrieved documents and the relevant documents within the set of all docs; precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved]
51
Gibbs Sampler Stability
52
Comparing topics from two runs
[figure: KL distance between topics from Run 1 and re-ordered topics from Run 2; best match KL = .46, worst match KL = 9.40]
53
Extensions of Topic Models
Combining topics and syntax
Topic hierarchies
Topic segmentation: no need for document boundaries
Modeling authors as well as documents: who wrote this part of the paper?
54
Words can have high probability in multiple topics (based on the TASA corpus); for example, PLAY appears in both the theater topic and the sports topic below, and COURT in both the sports and trial topics:
PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
55
Problems with Spatial Representations
56
Violation of the triangle inequality
In a spatial representation, distances obey d(A,C) ≤ d(A,B) + d(B,C); we can find similarity judgments that violate this
[diagram: points A, B, C with the direct distance AC exceeding AB + BC]
57
Violation of the triangle inequality
We can find associations that violate this: SOCCER-FIELD and FIELD-MAGNETIC are strong associations, but SOCCER-MAGNETIC is not
[diagram: A = SOCCER, B = FIELD, C = MAGNETIC, with AC exceeding AB + BC]
58
No triangle inequality with topics
[diagram: SOCCER and FIELD share TOPIC 1; FIELD and MAGNETIC share TOPIC 2; SOCCER and MAGNETIC share no topic]
The topic structure easily explains violations of the triangle inequality
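A tiny worked example with made-up numbers makes the point. Take two topics with uniform weights over their words and a uniform topic prior:

$$\phi_1 = \{\text{SOCCER}: 0.5,\ \text{FIELD}: 0.5\}, \qquad \phi_2 = \{\text{FIELD}: 0.5,\ \text{MAGNETIC}: 0.5\}$$

Then, using $P(w_2 \mid w_1) = \sum_z P(w_2 \mid z)\, P(z \mid w_1)$:

$$P(\text{FIELD} \mid \text{SOCCER}) = 0.5, \qquad P(\text{MAGNETIC} \mid \text{FIELD}) = 0.25, \qquad P(\text{MAGNETIC} \mid \text{SOCCER}) = 0$$

Both links in the chain are strong, yet the endpoints are completely unassociated, which no metric space allows.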
59
Small-World Structure of Associations (Steyvers & Tenenbaum, 2005)
Properties:
1) Short path lengths
2) Clustering
3) Power-law degree distributions
Small-world graphs arise elsewhere: the internet, social relations, biology
[network figure: PLAY as a hub connected to BASEBALL, BAT, BALL, GAME, STAGE, THEATER]
60
Power-law degree distributions in semantic networks
A power-law degree distribution means that some words are “hubs” in the semantic network
61
Creating Association Networks
TOPICS MODEL: connect word i to word j when P(w=j | w=i) > threshold
LSA: for each word, generate K associates by picking its K nearest neighbors in semantic space
[log-log degree distribution plot; fitted slope = -2.05]
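A sketch of the thresholding rule for the topic-model network, reusing the association probability from the earlier sketch; the threshold value and variable names are illustrative assumptions.

```python
import numpy as np

def association_network(phi, prior, threshold=0.05):
    """Directed adjacency matrix: edge i -> j iff P(w=j | w=i) > threshold.
    phi: (T, W) topic-word probabilities; prior: (T,) topic prior."""
    _, W = phi.shape
    adj = np.zeros((W, W), dtype=bool)
    for i in range(W):
        p_z = phi[:, i] * prior
        p_z /= p_z.sum()
        p_j = p_z @ phi              # P(w=j | w=i), summed over topics
        adj[i] = p_j > threshold
        adj[i, i] = False            # no self-loops
    return adj
```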
62
Associations in free recall
STUDY THESE WORDS: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy
RECALL WORDS…
FALSE RECALL of “Sleep”: 61%
63
Recall as a reconstructive process
Reconstruct the study list based on the stored “gist”
The gist can be represented by a distribution over topics
Under a single-topic assumption:
$P(w_{\text{retrieved}} \mid \mathbf{w}_{\text{study}}) = \sum_z P(w \mid z)\, P(z \mid \mathbf{w}_{\text{study}})$
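A sketch of that computation, where under the single-topic assumption the posterior over the gist topic is proportional to the product of the studied words' probabilities; the array names and smoothing constant are assumptions.

```python
import numpy as np

def reconstruction_probs(phi, prior, study_idx):
    """P(w | study list) under a single-topic gist.
    phi: (T, W) topic-word probabilities; prior: (T,); study_idx: word indices."""
    # P(z | study list) ∝ P(z) * prod_i P(w_i | z), computed in log space
    log_post = np.log(prior) + np.log(phi[:, study_idx] + 1e-12).sum(axis=1)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return post @ phi    # mix the topic word distributions by the gist posterior

# Words scoring high here but absent from the study list (like SLEEP for the
# list above) are the model's candidates for false recall.
```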
64
Predictions for the “Sleep” list
[figure: reconstruction probabilities for the study-list words and for the top 8 extra-list words]
65
Hidden Markov Topic Model
66
Hidden Markov Topics Model (Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
Syntactic dependencies are short-range; semantic dependencies are long-range
[graphical model: a Markov chain of syntactic states s, each state emitting a word w; the designated semantic state emits words from the topic model via topic assignments z]
Semantic state: generate words from the topic model
Syntactic states: generate words from the HMM
67
Transition between the semantic state and syntactic states:
Semantic state, topic z = 1 (weight 0.4): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2
Semantic state, topic z = 2 (weight 0.6): SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2
Syntactic class: THE 0.6, A 0.3, MANY 0.1
Syntactic class: OF 0.6, FOR 0.3, BETWEEN 0.1
[state diagram with transition probabilities 0.9, 0.1, 0.2, 0.8, 0.7, 0.3]
68
Combining topics and syntax, step 1: a syntactic class (THE 0.6, A 0.3, MANY 0.1) emits THE
Generated so far: “THE …”
[same state and probability tables as above]
69
Combining topics and syntax, step 2: the semantic state emits LOVE from topic z = 1
Generated so far: “THE LOVE …”
[same state and probability tables as above]
70
Combining topics and syntax, step 3: a syntactic class (OF 0.6, FOR 0.3, BETWEEN 0.1) emits OF
Generated so far: “THE LOVE OF …”
[same state and probability tables as above]
71
Combining topics and syntax, step 4: the semantic state emits RESEARCH from topic z = 2
Generated so far: “THE LOVE OF RESEARCH …”
[same state and probability tables as above]
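A minimal sketch of the composite generative process, using the emission tables from these slides. The transition matrix below is an assumption for illustration: the slides show transition probabilities (0.9/0.1, 0.2/0.8, 0.7/0.3) but their exact wiring is not recoverable from this extraction.

```python
import numpy as np

rng = np.random.default_rng(2)

topics = {1: {"HEART": .2, "LOVE": .2, "SOUL": .2, "TEARS": .2, "JOY": .2},
          2: {"SCIENTIFIC": .2, "KNOWLEDGE": .2, "WORK": .2,
              "RESEARCH": .2, "MATHEMATICS": .2}}
theta = {1: 0.4, 2: 0.6}                             # document's topic mixture
classes = {"det": {"THE": .6, "A": .3, "MANY": .1},  # syntactic classes
           "prep": {"OF": .6, "FOR": .3, "BETWEEN": .1}}

# Assumed transitions over states: "sem" uses the topic model, others the HMM.
trans = {"sem":  {"sem": .1, "det": .2, "prep": .7},
         "det":  {"sem": .9, "det": .05, "prep": .05},
         "prep": {"sem": .8, "det": .1, "prep": .1}}

def sample(dist):
    return rng.choice(list(dist), p=list(dist.values()))

state, words = "det", []
for _ in range(6):
    if state == "sem":                 # semantic state: topic model emission
        words.append(sample(topics[sample(theta)]))
    else:                              # syntactic state: HMM class emission
        words.append(sample(classes[state]))
    state = sample(trans[state])
print(" ".join(words))
```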
72
Semantic topics:
FOOD FOODS BODY NUTRIENTS DIET FAT SUGAR ENERGY MILK EATING FRUITS VEGETABLES WEIGHT FATS NEEDS CARBOHYDRATES VITAMINS CALORIES PROTEIN MINERALS
MAP NORTH EARTH SOUTH POLE MAPS EQUATOR WEST LINES EAST AUSTRALIA GLOBE POLES HEMISPHERE LATITUDE PLACES LAND WORLD COMPASS CONTINENTS
DOCTOR PATIENT HEALTH HOSPITAL MEDICAL CARE PATIENTS NURSE DOCTORS MEDICINE NURSING TREATMENT NURSES PHYSICIAN HOSPITALS DR SICK ASSISTANT EMERGENCY PRACTICE
BOOK BOOKS READING INFORMATION LIBRARY REPORT PAGE TITLE SUBJECT PAGES GUIDE WORDS MATERIAL ARTICLE ARTICLES WORD FACTS AUTHOR REFERENCE NOTE
GOLD IRON SILVER COPPER METAL METALS STEEL CLAY LEAD ADAM ORE ALUMINUM MINERAL MINE STONE MINERALS POT MINING MINERS TIN
BEHAVIOR SELF INDIVIDUAL PERSONALITY RESPONSE SOCIAL EMOTIONAL LEARNING FEELINGS PSYCHOLOGISTS INDIVIDUALS PSYCHOLOGICAL EXPERIENCES ENVIRONMENT HUMAN RESPONSES BEHAVIORS ATTITUDES PSYCHOLOGY PERSON
CELLS CELL ORGANISMS ALGAE BACTERIA MICROSCOPE MEMBRANE ORGANISM FOOD LIVING FUNGI MOLD MATERIALS NUCLEUS CELLED STRUCTURES MATERIAL STRUCTURE GREEN MOLDS
PLANTS PLANT LEAVES SEEDS SOIL ROOTS FLOWERS WATER FOOD GREEN SEED STEMS FLOWER STEM LEAF ANIMALS ROOT POLLEN GROWING GROW
73
Syntactic classes:
GOOD SMALL NEW IMPORTANT GREAT LITTLE LARGE * BIG LONG HIGH DIFFERENT SPECIAL OLD STRONG YOUNG COMMON WHITE SINGLE CERTAIN
THE HIS THEIR YOUR HER ITS MY OUR THIS THESE A AN THAT NEW THOSE EACH MR ANY MRS ALL
MORE SUCH LESS MUCH KNOWN JUST BETTER RATHER GREATER HIGHER LARGER LONGER FASTER EXACTLY SMALLER SOMETHING BIGGER FEWER LOWER ALMOST
ON AT INTO FROM WITH THROUGH OVER AROUND AGAINST ACROSS UPON TOWARD UNDER ALONG NEAR BEHIND OFF ABOVE DOWN BEFORE
SAID ASKED THOUGHT TOLD SAYS MEANS CALLED CRIED SHOWS ANSWERED TELLS REPLIED SHOUTED EXPLAINED LAUGHED MEANT WROTE SHOWED BELIEVED WHISPERED
ONE SOME MANY TWO EACH ALL MOST ANY THREE THIS EVERY SEVERAL FOUR FIVE BOTH TEN SIX MUCH TWENTY EIGHT
HE YOU THEY I SHE WE IT PEOPLE EVERYONE OTHERS SCIENTISTS SOMEONE WHO NOBODY ONE SOMETHING ANYONE EVERYBODY SOME THEN
BE MAKE GET HAVE GO TAKE DO FIND USE SEE HELP KEEP GIVE LOOK COME WORK MOVE LIVE EAT BECOME
74
NIPS syntax:
MODEL ALGORITHM SYSTEM CASE PROBLEM NETWORK METHOD APPROACH PAPER PROCESS
IS WAS HAS BECOMES DENOTES BEING REMAINS REPRESENTS EXISTS SEEMS
SEE SHOW NOTE CONSIDER ASSUME PRESENT NEED PROPOSE DESCRIBE SUGGEST
USED TRAINED OBTAINED DESCRIBED GIVEN FOUND PRESENTED DEFINED GENERATED SHOWN
IN WITH FOR ON FROM AT USING INTO OVER WITHIN
HOWEVER ALSO THEN THUS THEREFORE FIRST HERE NOW HENCE FINALLY
# * I X T N - C F P
NIPS semantics:
EXPERTS EXPERT GATING HME ARCHITECTURE MIXTURE LEARNING MIXTURES FUNCTION GATE
DATA GAUSSIAN MIXTURE LIKELIHOOD POSTERIOR PRIOR DISTRIBUTION EM BAYESIAN PARAMETERS
STATE POLICY VALUE FUNCTION ACTION REINFORCEMENT LEARNING CLASSES OPTIMAL *
MEMBRANE SYNAPTIC CELL * CURRENT DENDRITIC POTENTIAL NEURON CONDUCTANCE CHANNELS
IMAGE IMAGES OBJECT OBJECTS FEATURE RECOGNITION VIEWS # PIXEL VISUAL
KERNEL SUPPORT VECTOR SVM KERNELS # SPACE FUNCTION MACHINES SET
NETWORK NEURAL NETWORKS OUTPUT INPUT TRAINING INPUTS WEIGHTS # OUTPUTS
75
Random sentence generation
LANGUAGE:
[S] RESEARCHERS GIVE THE SPEECH
[S] THE SOUND FEEL NO LISTENERS
[S] WHICH WAS TO BE MEANING
[S] HER VOCABULARIES STOPPED WORDS
[S] HE EXPRESSLY WANTED THAT BETTER VOWEL
76
Nested Chinese Restaurant Process
77
Topic Hierarchies
In the regular topic model there are no relations between topics [figure: a flat set of topics 1-7]
Nested Chinese Restaurant Process (Blei, Griffiths, Jordan, & Tenenbaum, 2004):
– Learn the hierarchical structure as well as the topics within that structure
78
Example: Psych Review Abstracts
[topic hierarchy figure; the tree structure is not recoverable here, but the node word lists are:]
THE OF AND TO IN A IS (root: function words)
A MODEL MEMORY FOR MODELS TASK INFORMATION RESULTS ACCOUNT
SELF SOCIAL PSYCHOLOGY RESEARCH RISK STRATEGIES INTERPERSONAL PERSONALITY SAMPLING
MOTION VISUAL SURFACE BINOCULAR RIVALRY CONTOUR DIRECTION CONTOURS SURFACES
DRUG FOOD BRAIN AROUSAL ACTIVATION AFFECTIVE HUNGER EXTINCTION PAIN
RESPONSE STIMULUS REINFORCEMENT RECOGNITION STIMULI RECALL CHOICE CONDITIONING
SPEECH READING WORDS MOVEMENT MOTOR VISUAL WORD SEMANTIC
ACTION SOCIAL SELF EXPERIENCE EMOTION GOALS EMOTIONAL THINKING
GROUP IQ INTELLIGENCE SOCIAL RATIONAL INDIVIDUAL GROUPS MEMBERS
SEX EMOTIONS GENDER EMOTION STRESS WOMEN HEALTH HANDEDNESS
REASONING ATTITUDE CONSISTENCY SITUATIONAL INFERENCE JUDGMENT PROBABILITIES STATISTICAL
IMAGE COLOR MONOCULAR LIGHTNESS GIBSON SUBMOVEMENT ORIENTATION HOLOGRAPHIC
CONDITIONING STRESS EMOTIONAL BEHAVIORAL FEAR STIMULATION TOLERANCE RESPONSES
79
Generative Process: each document chooses a path from the root of the hierarchy to a leaf, and each word in the document is generated from one of the topics along that path
[same topic hierarchy as the previous slide]
80
Other Slides
81
Markov chain Monte Carlo
Sample from a Markov chain constructed to converge to the target distribution
Allows sampling from an unnormalized posterior
Can compute approximate statistics from intractable distributions
Gibbs sampling is one such method: construct the Markov chain from conditional distributions
82
Enron email: two example topics (T=100)
83
Enron email: two topics unrelated to work
84
Using Topic Models for Information Retrieval
85
Clusters vs. Topics
Example document: “Hidden Markov Models in Molecular Biology: New Algorithms and Applications” (Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure)
Abstract: Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification.
One cluster: [cluster 88] model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov
Multiple topics:
[topic 10] state hmm markov sequence models hidden states probabilities sequences parameters transition probability training hmms hybrid model likelihood modeling
[topic 37] genetic structure chain protein population region algorithms human mouse selection fitness proteins search evolution generation function sequence sequences genes
86
Topic model: C = Φ Θ, where C is the normalized word-document co-occurrence matrix (words × documents), Φ holds the mixture components (words × topics), and Θ the mixture weights (topics × documents)
LSA: C = U D V^T (words × dims, via SVD)
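A sketch of the two decompositions side by side, assuming numpy; the topic-model factors below are illustrative placeholders rather than fitted values.

```python
import numpy as np

C = np.array([[3., 0.], [2., 1.], [0., 4.]])   # words x documents counts
C = C / C.sum(axis=0)                          # normalize each document column

# LSA route: SVD, C = U D V^T (factors may be negative, hard to interpret)
U, D, Vt = np.linalg.svd(C, full_matrices=False)

# Topic-model route: C ≈ Phi @ Theta, with nonnegative columns that sum to 1
Phi = np.array([[0.6, 0.0], [0.4, 0.2], [0.0, 0.8]])   # P(w|z), words x topics
Theta = np.array([[1.0, 0.2], [0.0, 0.8]])             # P(z|d), topics x docs
print(np.round(Phi @ Theta, 2))   # interpretable, probability-valued factors
```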
87
Documents as Topic Mixtures: a Geometric Interpretation
Word probabilities live on a simplex: P(word1) + P(word2) + P(word3) = 1
[figure: the simplex over three words, with topic 1 and topic 2 as points on it; each observed document lies on the line segment between the two topics]
88
Generating Artificial Data: a visual example
word = pixel; document = image
Sample each pixel from a mixture of topics
10 “topics”, each a 5 x 5 grid of pixels
89
A subset of documents
90
Inferred Topics
[figure: the topic estimates after 0, 1, 5, 10, 50, and 100 iterations]
91
Interpretable decomposition
SVD gives a basis for the data, but not an interpretable one
The true basis is not orthogonal, so rotation does no good
[figure: bases recovered by the topic model vs. SVD]
92
Similarity between 55 funding programs
93
Summary
INPUT: word-document counts (word order is irrelevant)
OUTPUT:
– topic assignments to each word: P(z_i)
– likely words in each topic: P(w | z)
– likely topics in each document (the “gist”): P(z | d)
94
Gibbs Sampling
The probability that word i is assigned to topic t combines the count of word w_i assigned to topic t and the count of topic t assigned to document d_i:
$P(z_i = t \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \dfrac{n^{(w_i)}_{-i,t} + \beta}{n^{(\cdot)}_{-i,t} + W\beta} \cdot \dfrac{n^{(d_i)}_{-i,t} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}$
where the counts exclude the current assignment of word i, W is the vocabulary size, and T is the number of topics
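A compact sketch of one sweep of this collapsed Gibbs sampler, assuming numpy; variable names are illustrative and initialization, burn-in, and convergence checks are omitted.

```python
import numpy as np

def gibbs_sweep(words, docs, z, nwt, ndt, nt, alpha, beta, rng):
    """One sweep over all tokens.
    words, docs, z : per-token arrays (word id, doc id, current topic)
    nwt : (W, T) count of word w assigned to topic t
    ndt : (D, T) count of topic t assigned to doc d
    nt  : (T,)  total tokens assigned to each topic
    """
    W = nwt.shape[0]
    for i in range(len(words)):
        w, d, t = words[i], docs[i], z[i]
        nwt[w, t] -= 1; ndt[d, t] -= 1; nt[t] -= 1    # remove token i
        # Document-length denominator is constant in t, so it cancels here.
        p = (nwt[w] + beta) / (nt + W * beta) * (ndt[d] + alpha)
        p /= p.sum()
        t = rng.choice(len(p), p=p)                   # sample a new topic
        z[i] = t
        nwt[w, t] += 1; ndt[d, t] += 1; nt[t] += 1    # add token back
    return z
```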
95
Prior Distributions
Dirichlet priors encourage sparsity in the topic mixtures and the topics:
θ ~ Dirichlet(α), φ ~ Dirichlet(β)
[figure: the simplex over Topic 1/2/3 and over Word 1/2/3; darker colors indicate lower probability]
96
Example Topics extracted from NIH/NSF grants
Important point: these distributions are learned in a completely automated, “unsupervised” fashion from the data
97
Extracting Topics from Email
Enron email: 250,000 emails, 28,000 authors, 1999-2002
Example topics:
TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES
GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT
TRAVEL ROUNDTRIP SAVE DEALS HOTEL BOOK SALE FARES TRIP CITIES
FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED
POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC
STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU
98
Topic trends from the New York Times (330,000 articles, 2000-2002)
Tour de France: TOUR RIDER LANCE_ARMSTRONG TEAM BIKE RACE FRANCE
Quarterly Earnings: COMPANY QUARTER PERCENT ANALYST SHARE SALES EARNING
Anthrax: ANTHRAX LETTER MAIL WORKER OFFICE SPORES POSTAL BUILDING
99
Topic trends in the NIPS conference
Neural networks on the decline: LAYER NET NEURAL LAYERS NETS ARCHITECTURE NUMBER FEEDFORWARD SINGLE
… while SVMs become more popular: KERNEL SUPPORT VECTOR MARGIN SVM KERNELS SPACE DATA MACHINES
100
Finding Funding Overlap
Analyze 22,000+ grants active during 2002
55 funding programs from NSF and NIH, with a focus on the behavioral sciences
Questions of interest:
– What are the areas of overlap between funding programs?
– What topics are funded?
101
[figure: 2D visualization of funding programs (NIH, NSF-BIO, NSF-SBE); nearby programs support similar topics]
102
Funding Amounts per Topic
We have the $ funding per grant
We have the distribution of topics for each grant
Solve for the $ amount per topic: what are the expensive topics?
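One natural way to solve for the amount per topic, sketched below, is to spread each grant's dollars across topics in proportion to its topic mixture; this allocation rule is an assumption on my part, since the slides state only the goal.

```python
import numpy as np

def dollars_per_topic(dollars, theta):
    """dollars: (D,) funding per grant; theta: (D, T) topic mixture per grant.
    Allocates each grant's dollars in proportion to its topic mixture."""
    return dollars @ theta            # (T,) expected dollars per topic

dollars = np.array([100_000., 50_000.])
theta = np.array([[0.8, 0.2],         # grant 1 is mostly topic 1
                  [0.1, 0.9]])        # grant 2 is mostly topic 2
print(dollars_per_topic(dollars, theta))   # [85000. 65000.]
```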
103
High $$$ topics / Low $$$ topics [figure]
104
Basic Problems for Semantic Memory
Prediction: what fact, concept, or word is next?
Gist extraction: what is this set of words about?
Disambiguation: what is the sense of this word?
105
Topics from an educational corpus (TASA)
37K docs, 26K word vocabulary, 1700 topics, e.g. the same example topics shown earlier (PRINTING…, PLAY…, TEAM…, JUDGE…, HYPOTHESIS…, STUDY…)
106
Same words in different topics represent different contextual usages: PLAY has high probability in both the theater topic and the sports topic, and COURT in both the sports and trial topics [same topic lists as above]
107
Three documents with the word “play” [figure; numbers and colors indicate topic assignments]
108
Example of generating words
Topic mixtures: document 1 uses topic 1 only (1.0), document 2 is a .4/.6 mixture of the two topics, and document 3 uses topic 2 only (1.0)
Documents with topic assignments (superscripts):
MONEY¹ BANK¹ BANK¹ LOAN¹ BANK¹ MONEY¹ BANK¹ MONEY¹ BANK¹ LOAN¹ LOAN¹ BANK¹ MONEY¹ …
RIVER² MONEY¹ BANK² STREAM² BANK² BANK¹ MONEY¹ RIVER² MONEY¹ BANK² LOAN¹ MONEY¹ …
RIVER² BANK² STREAM² BANK² RIVER² BANK² …
109
Statistical Inference
Given only the observed words, infer the topic mixtures and the topic assignment of every word (all marked “?”):
MONEY? BANK? BANK? LOAN? BANK? MONEY? BANK? MONEY? BANK? LOAN? LOAN? BANK? MONEY? …
RIVER? MONEY? BANK? STREAM? BANK? BANK? MONEY? RIVER? MONEY? BANK? LOAN? MONEY? …
RIVER? BANK? STREAM? BANK? RIVER? BANK? …
110
Two approaches to semantic representation
Semantic networks: [figure: a network with hubs PLAY and BANK and neighbors FUN, GAME, STAGE, THEATER, BALL, BAT, CASH, LOAN, MONEY, RIVER, STREAM] How are these learned?
Semantic spaces: can be learned (LSA), but is this representation flexible enough?
111
Latent Semantic Analysis (Landauer & Dumais, 1997)
Word-document counts are decomposed via SVD into a high-dimensional semantic space
Each word is a single point in semantic space; nearby words co-occur often with other words
Words with multiple meanings have only one representation and must be near all words related to their different senses; this leads to distortions in similarity
[figure: BALL, PLAY, GAME, STAGE, THEATER in semantic space]