1
Probabilistic Topic Models Mark Steyvers Department of Cognitive Sciences University of California, Irvine Joint work with: Tom Griffiths, UC Berkeley Padhraic Smyth, UC Irvine Dave Newman, UC Irvine
2
Overview I. Problems of interest II. Probabilistic Topic Models III. Using topic models in machine learning and data mining IV. Using topic models in psychology V. Conclusion
3
Problems of Interest What topics does this text collection “span”? Which documents are about a particular topic? Who writes about a particular topic? How have topics changed over time? How to represent the “gist” of a list of words? How to model associations between words? Machine Learning/ Data mining Psychology
4
Probabilistic Topic Models Based on the pLSI and LDA models (e.g., Hofmann, 2001; Blei, Ng, & Jordan, 2003) A probability model linking documents, topics, and words Each topic is a distribution over words Each document is a mixture of topics Topics and topic mixtures can be extracted from large collections of text
5
Example Topics extracted from NIH/NSF grants Important point: these distributions are learned in a completely automated “unsupervised” fashion from the data
6
Overview I. Problems of interest II. Probabilistic Topic Models III. Using topic models in machine learning and data mining IV. Using topic models in psychology V. Conclusion
7
Parameters and Real-World Data The probabilistic model gives P( Data | Parameters ); statistical inference recovers P( Parameters | Data )
8
Document generation as a probabilistic process Each topic is a distribution over words (parameters φ(j)) Each document is a mixture of topics (parameters θ(d)) Each word is chosen from a single topic
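The generative process on this slide can be sketched in a few lines. The two toy topics and the mixing weights below are illustrative assumptions (echoing the money/bank/river example used later in the deck), not parameters from the slides:

```python
import random

# Toy version of the generative process: phi[j] is topic j's distribution
# over words; theta is a document's mixture over topics (all values assumed).
phi = {
    1: {"money": 0.4, "bank": 0.4, "loan": 0.2},
    2: {"river": 0.4, "bank": 0.3, "stream": 0.3},
}

def sample(dist, rng):
    """Draw one item from a {item: probability} dict."""
    r, acc = rng.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r < acc:
            return item
    return item  # guard against floating-point round-off

def generate_document(theta, n_words, rng):
    """theta: {topic: weight} mixture for this document."""
    words = []
    for _ in range(n_words):
        z = sample(theta, rng)             # choose a topic for this token
        words.append(sample(phi[z], rng))  # choose a word from that topic
    return words

rng = random.Random(0)
doc = generate_document({1: 0.9, 2: 0.1}, 10, rng)
print(doc)  # typically dominated by money/bank/loan tokens
```

A document with mixture {1: 1.0} can only emit words from topic 1, which is the "each word comes from a single topic" property in action.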
9
Prior Distributions Dirichlet priors encourage sparsity on topic mixtures and topics: θ ~ Dirichlet(α), φ ~ Dirichlet(β) (figure: simplexes over Topic 1/2/3 and over Word 1/2/3)
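A quick way to see why Dirichlet priors encourage sparsity is to sample mixtures at different concentration values. The sketch below uses the standard gamma-normalization construction; the specific α values are illustrative assumptions:

```python
import random

def dirichlet(alphas, rng):
    # Sample from Dirichlet(alphas) by normalizing independent Gamma draws.
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
# Concentration < 1 pushes probability mass onto a few topics (sparse
# mixtures); large concentration spreads mass evenly (values assumed).
sparse = dirichlet([0.1] * 5, rng)
smooth = dirichlet([10.0] * 5, rng)
print(max(sparse), max(smooth))
```

On average the α = 0.1 samples concentrate most of their mass on one or two components, while the α = 10 samples stay close to uniform.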
10
Example of generating words Topics: topic 1 = {MONEY, BANK, LOAN}, topic 2 = {RIVER, STREAM, BANK} Mixtures θ: weights 1.0, .4/.6, and 1.0 across the three documents Documents and topic assignments (the number after each word marks its generating topic): Doc 1: MONEY 1 BANK 1 BANK 1 LOAN 1 BANK 1 MONEY 1 BANK 1 MONEY 1 BANK 1 LOAN 1 LOAN 1 BANK 1 MONEY 1... Doc 2: RIVER 2 MONEY 1 BANK 2 STREAM 2 BANK 2 BANK 1 MONEY 1 RIVER 2 MONEY 1 BANK 2 LOAN 1 MONEY 1... Doc 3: RIVER 2 BANK 2 STREAM 2 BANK 2 RIVER 2 BANK 2...
11
Inference Given only the documents, the topics, the mixtures θ, and the topic assignments are all unknown (?): Doc 1: MONEY ? BANK ? BANK ? LOAN ? BANK ? MONEY ? BANK ? MONEY ? BANK ? LOAN ? LOAN ? BANK ? MONEY ?... Doc 2: RIVER ? MONEY ? BANK ? STREAM ? BANK ? BANK ? MONEY ? RIVER ? MONEY ? BANK ? LOAN ? MONEY ?... Doc 3: RIVER ? BANK ? STREAM ? BANK ? RIVER ? BANK ?...
12
Bayesian Inference Three sets of latent variables: topic mixtures θ, word distributions φ, topic assignments z Integrate out θ and φ and estimate the topic assignments z by summing over the remaining terms Use MCMC with Gibbs sampling for approximate inference
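The collapsed conditional obtained after integrating out θ and φ has a standard closed form; the notation below (α, β, and the count symbols) follows the common Griffiths-Steyvers convention rather than anything surviving on the slide:

```latex
P(z_i = t \mid \mathbf{z}_{-i}, \mathbf{w})
  \;\propto\;
  \frac{n^{(w_i)}_{-i,t} + \beta}{n^{(\cdot)}_{-i,t} + W\beta}
  \;\times\;
  \frac{n^{(d_i)}_{-i,t} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}
```

Here $n^{(w_i)}_{-i,t}$ counts how often word $w_i$ is assigned to topic $t$, $n^{(d_i)}_{-i,t}$ counts how often topic $t$ appears in document $d_i$ (both excluding token $i$), $W$ is the vocabulary size, and $T$ the number of topics.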
13
Gibbs Sampling Start with random assignments of words to topics Repeat for M iterations: for all words i, sample a new topic assignment for word i conditioned on all other topic assignments
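The loop on this slide is the collapsed Gibbs sampler. A minimal sketch follows, assuming symmetric hyperparameters α and β and a toy corpus; none of these values come from the slides:

```python
import random
from collections import defaultdict

def gibbs_lda(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler for the topic model (illustrative sketch)."""
    rng = random.Random(seed)
    W = len({w for d in docs for w in d})       # vocabulary size
    n_wt = defaultdict(int)                     # (word, topic) -> count
    n_t = [0] * n_topics                        # topic -> total tokens
    n_dt = [[0] * n_topics for _ in docs]       # doc -> topic -> count
    z = []                                      # assignment per token
    for d, doc in enumerate(docs):              # random initialization
        z.append([])
        for w in doc:
            t = rng.randrange(n_topics)
            z[d].append(t)
            n_wt[(w, t)] += 1; n_t[t] += 1; n_dt[d][t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                     # remove current assignment
                n_wt[(w, t)] -= 1; n_t[t] -= 1; n_dt[d][t] -= 1
                # conditional over topics given all other assignments
                weights = [(n_wt[(w, k)] + beta) / (n_t[k] + W * beta)
                           * (n_dt[d][k] + alpha) for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t                     # record new assignment
                n_wt[(w, t)] += 1; n_t[t] += 1; n_dt[d][t] += 1
    return z, n_wt, n_dt

docs = [["money", "bank", "loan", "bank", "money"] * 4,
        ["river", "stream", "bank", "river", "stream"] * 4,
        ["money", "loan", "bank", "stream", "river"] * 4]
z, n_wt, n_dt = gibbs_lda(docs, n_topics=2)
print(n_dt)  # the first two documents should each concentrate on one topic
```

With the sparse document prior (α = 0.1), each of the first two documents typically ends up dominated by a single topic, mirroring the artificial-data demonstration on the following slides.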
14
16 Artificial Documents Can we recover the original topics and topic mixtures from these data?
15
Starting the Gibbs Sampling Assign word tokens randomly to topics (in the original figure, one color per topic)
16
After 1 iteration
17
After 4 iterations
18
After 32 iterations
19
Choosing the number of topics Bayesian model selection: choose the number of topics T to maximize P( w | T ) Requires integration over all parameters; approximate by Gibbs sampling Results with artificial data generated from 10 topics:
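The quantity being maximized can be written out explicitly. The harmonic-mean approximation from Gibbs samples shown on the right is one standard choice and is an assumption on my part, since the slide's formula image did not survive:

```latex
P(\mathbf{w} \mid T)
  = \int P(\mathbf{w} \mid \boldsymbol{\phi}, \boldsymbol{\theta}, T)\,
      p(\boldsymbol{\phi} \mid \beta)\, p(\boldsymbol{\theta} \mid \alpha)\,
      d\boldsymbol{\phi}\, d\boldsymbol{\theta}
\qquad
P(\mathbf{w} \mid T) \approx
  \left( \frac{1}{S} \sum_{s=1}^{S}
    P(\mathbf{w} \mid \mathbf{z}^{(s)}, T)^{-1} \right)^{-1}
```

where $\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(S)}$ are topic-assignment samples drawn by the Gibbs sampler.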
20
Overview I. Problems of interest II. Probabilistic Topic Models III. Using topic models in machine learning and data mining IV. Using topic models in psychology V. Conclusion
21
Extracting Topics from Email Enron Email: 250,000 emails, 28,000 authors, 1999-2002 Example topics: TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES | GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT | TRAVEL ROUNDTRIP SAVE DEALS HOTEL BOOK SALE FARES TRIP CITIES | FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED | POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC | STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU
22
Topic trends from New York Times (330,000 articles, 2000-2002) Tour-de-France: TOUR RIDER LANCE_ARMSTRONG TEAM BIKE RACE FRANCE Quarterly Earnings: COMPANY QUARTER PERCENT ANALYST SHARE SALES EARNING Anthrax: ANTHRAX LETTER MAIL WORKER OFFICE SPORES POSTAL BUILDING
23
Topic trends in NIPS conference Neural networks on the decline: LAYER NET NEURAL LAYERS NETS ARCHITECTURE NUMBER FEEDFORWARD SINGLE... while SVMs become more popular: KERNEL SUPPORT VECTOR MARGIN SVM KERNELS SPACE DATA MACHINES
24
Finding Funding Overlap Analyze 22,000+ grants active during 2002 55 funding programs from NSF and NIH Focus on behavioral sciences Questions of interest: What are the areas of overlap between funding programs? What topics are funded?
25
NIH NSF-BIO NSF-SBE 2D visualization of funding programs: nearby programs support similar topics
26
Funding Amounts per Topic We have the $ funding per grant and the distribution of topics for each grant Solve for the $ amount per topic: what are the expensive topics?
27
High $$$ topics / Low $$$ topics
28
Overview I. Problems of interest II. Probabilistic Topic Models III. Using topic models in machine learning and data mining IV. Using topic models in psychology V. Conclusion
29
Basic Problems for Semantic Memory Prediction: what fact, concept, or word is next? Gist extraction: what is this set of words about? Disambiguation: what is the sense of this word?
30
TASA Corpus Applied the model to an educational text corpus: TASA (33K docs, 6M words)
31
Three documents with the word “play” (numbers & colors indicate topic assignments)
32
Word Association CUE: PLAY (Nelson, McEvoy, & Schreiber USF word association norms) Responses Association networks BASEBALL BAT BALL GAME PLAY STAGE THEATER
33
Topic model for Word Association Association as a problem of prediction: given that a single word is observed, predict what other words might occur in that context Under a single-topic assumption: P( response | cue ) = Σ_z P( response | z ) P( z | cue )
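The single-topic prediction rule can be made concrete in a few lines. The two toy topics below are assumptions for illustration; the real model would use topics learned from the TASA corpus:

```python
# Association as prediction under a single-topic assumption:
# P(response | cue) = sum over z of P(response | z) * P(z | cue).
# Toy topics (assumed, not learned): a sports topic and a drama topic.
phi = {
    "sports": {"play": 0.3, "ball": 0.3, "game": 0.25, "bat": 0.15},
    "drama":  {"play": 0.3, "stage": 0.3, "theater": 0.3, "bat": 0.1},
}
p_topic = {"sports": 0.5, "drama": 0.5}  # uniform topic prior

def p_topic_given_cue(cue):
    # Bayes rule: P(z | cue) is proportional to P(cue | z) * P(z)
    joint = {z: phi[z].get(cue, 0.0) * p_topic[z] for z in phi}
    total = sum(joint.values())
    return {z: v / total for z, v in joint.items()}

def p_response(cue, response):
    post = p_topic_given_cue(cue)
    return sum(phi[z].get(response, 0.0) * post[z] for z in phi)

print(p_response("play", "ball"))   # PLAY predicts sports and drama words
print(p_response("stage", "ball"))  # STAGE predicts BALL only weakly
```

Because PLAY has high probability in both topics, the cue PLAY spreads its predictions across both word sets, which is exactly the ambiguity shown in the association-network figure.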
34
Model predictions from TASA corpus RANK 9
35
Median rank of first associate TOPICS
36
Latent Semantic Analysis (Landauer & Dumais, 1997) word-document counts, reduced by SVD to a high-dimensional space Each word is a single point in semantic space Similarity measured by the cosine of the angle between word vectors
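The cosine measure is easy to sketch. The 2-d "semantic space" coordinates below are invented for illustration; a real LSA model would obtain them from the SVD of the word-document matrix:

```python
import math

# Assumed 2-d coordinates standing in for SVD-derived word vectors.
vectors = {
    "play":    (0.9, 0.8),
    "game":    (1.0, 0.1),
    "ball":    (0.9, 0.2),
    "theater": (0.1, 1.0),
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(vectors["game"], vectors["ball"]))     # high: similar directions
print(cosine(vectors["game"], vectors["theater"]))  # low: nearly orthogonal
```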
37
Median rank of first associate TOPICS LSA
38
Objections to spatial representations Tversky (1977) on similarity: asymmetry violation of the triangle inequality rich neighborhood structure Semantic association has the same properties asymmetry violation of the triangle inequality “small world” neighborhood structure (Steyvers & Tenenbaum, 2005)
39
Topics as latent factors to group words BASEBALL BAT BALL GAME STAGE THEATER TOPIC PLAY BASEBALL BAT BALL GAME PLAY STAGE THEATER Note: there are no direct connections between words
40
What about collocations? Why are these words related? DOW - JONES BUMBLE - BEE WHITE – HOUSE Suggests at least two routes for association: Semantic Collocation Integrate collocations into topic model Related to paradigmatic/syntagmatic distinction
41
Collocation Topic Model Each word has a switch variable x: if x = 0, sample the word from the topic; if x = 1, sample the word from a distribution conditioned on the previous word
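The switch variable x can be sketched as follows. The successor distribution for DOW and the probability of taking the x = 1 route are toy assumptions, not the model's learned parameters:

```python
import random

# x = 0 -> draw the word from the current topic;
# x = 1 -> draw it from a distribution conditioned on the previous word.
topic_words = {"finance": ["dow", "market", "stock", "rises"]}
follows = {"dow": {"jones": 0.9, "rises": 0.1}}  # assumed successor dists

def sample(dist, rng):
    r, acc = rng.random(), 0.0
    for w, p in dist.items():
        acc += p
        if r < acc:
            return w
    return w

def next_word(prev, topic, rng):
    # The switch x is itself sampled; here P(x = 1) is high only when the
    # previous word starts a known collocation (an illustrative choice).
    if prev in follows and rng.random() < 0.8:   # x = 1: collocation route
        return sample(follows[prev], rng)
    return rng.choice(topic_words[topic])        # x = 0: topic route

rng = random.Random(0)
words = ["dow"]
for _ in range(3):
    words.append(next_word(words[-1], "finance", rng))
print(words)  # "dow" is usually followed by "jones", mimicking DOW_JONES
```

Because JONES is far more probable via the x = 1 route than via the topic route, the model explains DOW JONES as a collocation, which is the inference illustrated on the next slide.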
42
Collocation Topic Model Example: “DOW JONES RISES” JONES (x = 1) is more likely explained as a word following DOW than as a word sampled from a topic (x = 0) Result: DOW_JONES is recognized as a collocation
43
Examples Topics from New York Times Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP
44
Context dependent collocations Example: WHITE - HOUSE In context of government, WHITE-HOUSE is a collocation In context of colors of houses, WHITE - HOUSE is treated as separate words
45
Overview I. Problems of interest II. Probabilistic Topic Models III. Using topic models in machine learning and data mining IV. Using topic models in psychology V. Conclusion
46
Psychology Computer Science Models for information retrieval might be helpful in understanding semantic memory Research on semantic memory can be useful for developing better information retrieval algorithms
47
Software Public-domain MATLAB toolbox for topic modeling on the Web: http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm
50
Gibbs Sampler Stability
51
Comparing topics from two runs KL distance between topics from Run 1 and re-ordered topics from Run 2 BEST KL = 0.46 WORST KL = 9.40
52
Extensions of Topic Models Combining topics and syntax Topic hierarchies Topic segmentation: no need for document boundaries Modeling authors as well as documents: who wrote this part of the paper?
53
Words can have high probability in multiple topics PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW (Based on TASA corpus)
54
Problems with Spatial Representations
55
Violation of triangle inequality In a metric space, AC ≤ AB + BC for any three points A, B, C Can find similarity judgments that violate this: AC > AB + BC
56
Violation of triangle inequality Can find associations that violate this: SOCCER-FIELD and FIELD-MAGNETIC are strong, yet SOCCER-MAGNETIC is weak, so AC > AB + BC with A = SOCCER, B = FIELD, C = MAGNETIC
57
No Triangle Inequality with Topics FIELD appears both in a sports topic (with SOCCER, topic 1) and a physics topic (with MAGNETIC, topic 2) The topic structure easily explains violations of the triangle inequality
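A tiny numeric version of this argument, with assumed topic distributions in which FIELD is shared across topics but SOCCER and MAGNETIC never co-occur:

```python
# Toy topics (assumed for illustration): FIELD lives in both topics,
# SOCCER only in the sports topic, MAGNETIC only in the physics topic.
phi = {
    "sports":  {"soccer": 0.5, "field": 0.5},
    "physics": {"magnetic": 0.5, "field": 0.5},
}
p_topic = {"sports": 0.5, "physics": 0.5}

def assoc(cue, response):
    # Single-topic prediction: P(response | cue) = sum_z P(response | z) P(z | cue)
    post = {z: phi[z].get(cue, 0.0) * p_topic[z] for z in phi}
    total = sum(post.values())
    return sum(phi[z].get(response, 0.0) * post[z] / total for z in phi)

print(assoc("soccer", "field"))     # strong association
print(assoc("field", "magnetic"))   # strong association
print(assoc("soccer", "magnetic"))  # zero: the words share no topic
```

Any spatial embedding that places SOCCER near FIELD and FIELD near MAGNETIC would be forced to place SOCCER fairly near MAGNETIC; the topic representation has no such constraint.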
58
Small-World Structure of Associations (Steyvers & Tenenbaum, 2005) Properties: 1) Short path lengths 2) Clustering 3) Power law degree distributions Small world graphs arise elsewhere: internet, social relations, biology BASEBALL BAT BALL GAME PLAY STAGE THEATER
59
Power-law degree distributions in Semantic Networks Power-law degree distribution: some words are “hubs” in a semantic network
60
Creating Association Networks TOPICS MODEL: connect word i to word j when P( w = j | w = i ) > threshold LSA: for each word, generate K associates by picking its K nearest neighbors in semantic space (fitted power-law exponent = -2.05)
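The thresholding rule for the topics-model network can be sketched directly. The association probabilities below are toy values, not ones computed from a trained model:

```python
from collections import defaultdict

# Assumed association probabilities P(w = j | w = i) for a handful of pairs.
assoc_p = {
    ("play", "game"): 0.30, ("play", "ball"): 0.25, ("play", "stage"): 0.20,
    ("game", "ball"): 0.35, ("stage", "theater"): 0.40, ("ball", "bat"): 0.15,
}

def build_network(pairs, threshold):
    """Connect i -> j whenever P(w = j | w = i) exceeds the threshold."""
    edges = defaultdict(set)
    for (i, j), p in pairs.items():
        if p > threshold:
            edges[i].add(j)
    return edges

net = build_network(assoc_p, threshold=0.18)
degrees = {w: len(out) for w, out in net.items()}
print(degrees)  # "play" acts as a hub with several outgoing edges
```

Raising the threshold prunes weak edges first, which is how the threshold sweep behind the fitted degree distribution would be run.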
61
Associations in free recall STUDY THESE WORDS: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy RECALL WORDS..... FALSE RECALL: “Sleep” 61%
62
Recall as a reconstructive process Reconstruct the study list based on the stored “gist” The gist can be represented by a distribution over topics Under a single-topic assumption: P( retrieved word | study list ) = Σ_z P( word | z ) P( z | study list )
63
Predictions for the “Sleep” list STUDY LIST EXTRA LIST (top 8)
64
Hidden Markov Topic Model
65
Hidden Markov Topics Model Syntactic dependencies are short-range; semantic dependencies are long-range In the graphical model, a chain of syntactic states s generates the word sequence w, with topic assignments z The semantic state generates words from the topic model; the syntactic states generate words from an HMM (Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
66
HEART 0.2 LOVE 0.2 SOUL 0.2 TEARS 0.2 JOY 0.2 z = 1 0.4 SCIENTIFIC 0.2 KNOWLEDGE 0.2 WORK 0.2 RESEARCH 0.2 MATHEMATICS 0.2 z = 2 0.6 THE 0.6 A 0.3 MANY 0.1 OF 0.6 FOR 0.3 BETWEEN 0.1 0.9 0.1 0.2 0.8 0.7 0.3 Transition between semantic state and syntactic states
67
THE ……………………………… HEART 0.2 LOVE 0.2 SOUL 0.2 TEARS 0.2 JOY 0.2 z = 1 0.4 SCIENTIFIC 0.2 KNOWLEDGE 0.2 WORK 0.2 RESEARCH 0.2 MATHEMATICS 0.2 z = 2 0.6 x = 1 THE 0.6 A 0.3 MANY 0.1 x = 3 OF 0.6 FOR 0.3 BETWEEN 0.1 x = 2 0.9 0.1 0.2 0.8 0.7 0.3 Combining topics and syntax
68
THE LOVE…………………… HEART 0.2 LOVE 0.2 SOUL 0.2 TEARS 0.2 JOY 0.2 z = 1 0.4 SCIENTIFIC 0.2 KNOWLEDGE 0.2 WORK 0.2 RESEARCH 0.2 MATHEMATICS 0.2 z = 2 0.6 x = 1 THE 0.6 A 0.3 MANY 0.1 x = 3 OF 0.6 FOR 0.3 BETWEEN 0.1 x = 2 0.9 0.1 0.2 0.8 0.7 0.3 Combining topics and syntax
69
THE LOVE OF……………… HEART 0.2 LOVE 0.2 SOUL 0.2 TEARS 0.2 JOY 0.2 z = 1 0.4 SCIENTIFIC 0.2 KNOWLEDGE 0.2 WORK 0.2 RESEARCH 0.2 MATHEMATICS 0.2 z = 2 0.6 x = 1 THE 0.6 A 0.3 MANY 0.1 x = 3 OF 0.6 FOR 0.3 BETWEEN 0.1 x = 2 0.9 0.1 0.2 0.8 0.7 0.3 Combining topics and syntax
70
THE LOVE OF RESEARCH …… HEART 0.2 LOVE 0.2 SOUL 0.2 TEARS 0.2 JOY 0.2 z = 1 0.4 SCIENTIFIC 0.2 KNOWLEDGE 0.2 WORK 0.2 RESEARCH 0.2 MATHEMATICS 0.2 z = 2 0.6 x = 1 THE 0.6 A 0.3 MANY 0.1 x = 3 OF 0.6 FOR 0.3 BETWEEN 0.1 x = 2 0.9 0.1 0.2 0.8 0.7 0.3 Combining topics and syntax
71
FOOD FOODS BODY NUTRIENTS DIET FAT SUGAR ENERGY MILK EATING FRUITS VEGETABLES WEIGHT FATS NEEDS CARBOHYDRATES VITAMINS CALORIES PROTEIN MINERALS MAP NORTH EARTH SOUTH POLE MAPS EQUATOR WEST LINES EAST AUSTRALIA GLOBE POLES HEMISPHERE LATITUDE PLACES LAND WORLD COMPASS CONTINENTS DOCTOR PATIENT HEALTH HOSPITAL MEDICAL CARE PATIENTS NURSE DOCTORS MEDICINE NURSING TREATMENT NURSES PHYSICIAN HOSPITALS DR SICK ASSISTANT EMERGENCY PRACTICE BOOK BOOKS READING INFORMATION LIBRARY REPORT PAGE TITLE SUBJECT PAGES GUIDE WORDS MATERIAL ARTICLE ARTICLES WORD FACTS AUTHOR REFERENCE NOTE GOLD IRON SILVER COPPER METAL METALS STEEL CLAY LEAD ADAM ORE ALUMINUM MINERAL MINE STONE MINERALS POT MINING MINERS TIN BEHAVIOR SELF INDIVIDUAL PERSONALITY RESPONSE SOCIAL EMOTIONAL LEARNING FEELINGS PSYCHOLOGISTS INDIVIDUALS PSYCHOLOGICAL EXPERIENCES ENVIRONMENT HUMAN RESPONSES BEHAVIORS ATTITUDES PSYCHOLOGY PERSON CELLS CELL ORGANISMS ALGAE BACTERIA MICROSCOPE MEMBRANE ORGANISM FOOD LIVING FUNGI MOLD MATERIALS NUCLEUS CELLED STRUCTURES MATERIAL STRUCTURE GREEN MOLDS Semantic topics PLANTS PLANT LEAVES SEEDS SOIL ROOTS FLOWERS WATER FOOD GREEN SEED STEMS FLOWER STEM LEAF ANIMALS ROOT POLLEN GROWING GROW
72
GOOD SMALL NEW IMPORTANT GREAT LITTLE LARGE * BIG LONG HIGH DIFFERENT SPECIAL OLD STRONG YOUNG COMMON WHITE SINGLE CERTAIN THE HIS THEIR YOUR HER ITS MY OUR THIS THESE A AN THAT NEW THOSE EACH MR ANY MRS ALL MORE SUCH LESS MUCH KNOWN JUST BETTER RATHER GREATER HIGHER LARGER LONGER FASTER EXACTLY SMALLER SOMETHING BIGGER FEWER LOWER ALMOST ON AT INTO FROM WITH THROUGH OVER AROUND AGAINST ACROSS UPON TOWARD UNDER ALONG NEAR BEHIND OFF ABOVE DOWN BEFORE SAID ASKED THOUGHT TOLD SAYS MEANS CALLED CRIED SHOWS ANSWERED TELLS REPLIED SHOUTED EXPLAINED LAUGHED MEANT WROTE SHOWED BELIEVED WHISPERED ONE SOME MANY TWO EACH ALL MOST ANY THREE THIS EVERY SEVERAL FOUR FIVE BOTH TEN SIX MUCH TWENTY EIGHT HE YOU THEY I SHE WE IT PEOPLE EVERYONE OTHERS SCIENTISTS SOMEONE WHO NOBODY ONE SOMETHING ANYONE EVERYBODY SOME THEN Syntactic classes BE MAKE GET HAVE GO TAKE DO FIND USE SEE HELP KEEP GIVE LOOK COME WORK MOVE LIVE EAT BECOME
73
MODEL ALGORITHM SYSTEM CASE PROBLEM NETWORK METHOD APPROACH PAPER PROCESS IS WAS HAS BECOMES DENOTES BEING REMAINS REPRESENTS EXISTS SEEMS SEE SHOW NOTE CONSIDER ASSUME PRESENT NEED PROPOSE DESCRIBE SUGGEST USED TRAINED OBTAINED DESCRIBED GIVEN FOUND PRESENTED DEFINED GENERATED SHOWN IN WITH FOR ON FROM AT USING INTO OVER WITHIN HOWEVER ALSO THEN THUS THEREFORE FIRST HERE NOW HENCE FINALLY EXPERTS EXPERT GATING HME ARCHITECTURE MIXTURE LEARNING MIXTURES FUNCTION GATE DATA GAUSSIAN MIXTURE LIKELIHOOD POSTERIOR PRIOR DISTRIBUTION EM BAYESIAN PARAMETERS STATE POLICY VALUE FUNCTION ACTION REINFORCEMENT LEARNING CLASSES OPTIMAL * MEMBRANE SYNAPTIC CELL * CURRENT DENDRITIC POTENTIAL NEURON CONDUCTANCE CHANNELS IMAGE IMAGES OBJECT OBJECTS FEATURE RECOGNITION VIEWS # PIXEL VISUAL KERNEL SUPPORT VECTOR SVM KERNELS # SPACE FUNCTION MACHINES SET NETWORK NEURAL NETWORKS OUTPUT INPUT TRAINING INPUTS WEIGHTS # OUTPUTS NIPS Semantics NIPS Syntax
74
Random sentence generation LANGUAGE: [S] RESEARCHERS GIVE THE SPEECH [S] THE SOUND FEEL NO LISTENERS [S] WHICH WAS TO BE MEANING [S] HER VOCABULARIES STOPPED WORDS [S] HE EXPRESSLY WANTED THAT BETTER VOWEL
75
Nested Chinese Restaurant Process
76
Topic Hierarchies In the regular topic model there are no relations between topics Alternative: hierarchical topic organization (figure: a tree over topics 1-7) Nested Chinese Restaurant Process: Blei, Griffiths, Jordan, & Tenenbaum (2004) Learn the hierarchical structure, as well as the topics within the structure
77
Example: Psych Review Abstracts RESPONSE STIMULUS REINFORCEMENT RECOGNITION STIMULI RECALL CHOICE CONDITIONING SPEECH READING WORDS MOVEMENT MOTOR VISUAL WORD SEMANTIC ACTION SOCIAL SELF EXPERIENCE EMOTION GOALS EMOTIONAL THINKING GROUP IQ INTELLIGENCE SOCIAL RATIONAL INDIVIDUAL GROUPS MEMBERS SEX EMOTIONS GENDER EMOTION STRESS WOMEN HEALTH HANDEDNESS REASONING ATTITUDE CONSISTENCY SITUATIONAL INFERENCE JUDGMENT PROBABILITIES STATISTICAL IMAGE COLOR MONOCULAR LIGHTNESS GIBSON SUBMOVEMENT ORIENTATION HOLOGRAPHIC CONDITIONIN STRESS EMOTIONAL BEHAVIORAL FEAR STIMULATION TOLERANCE RESPONSES A MODEL MEMORY FOR MODELS TASK INFORMATION RESULTS ACCOUNT SELF SOCIAL PSYCHOLOGY RESEARCH RISK STRATEGIES INTERPERSONAL PERSONALITY SAMPLING MOTION VISUAL SURFACE BINOCULAR RIVALRY CONTOUR DIRECTION CONTOURS SURFACES DRUG FOOD BRAIN AROUSAL ACTIVATION AFFECTIVE HUNGER EXTINCTION PAIN THE OF AND TO IN A IS
78
Generative Process (same topic hierarchy as on the previous slide)
79
Other Slides
80
Markov chain Monte Carlo Sample from a Markov chain constructed to converge to the target distribution Allows sampling from an unnormalized posterior Can compute approximate statistics from intractable distributions Gibbs sampling is one such method: construct the Markov chain from the conditional distributions
81
Enron email: two example topics (T=100)
82
Enron email: two topics unrelated to work
83
Using Topic Models for Information Retrieval
84
Clusters v. Topics Hidden Markov Models in Molecular Biology: New Algorithms and Applications Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification. One Cluster: [cluster 88] model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov Multiple Topics: [topic 10] state hmm markov sequence models hidden states probabilities sequences parameters transition probability training hmms hybrid model likelihood modeling [topic 37] genetic structure chain protein population region algorithms human mouse selection fitness proteins search evolution generation function sequence sequences genes
85
Topic model: C = Φ Θ, where C is the normalized word × document co-occurrence matrix, the columns of Φ (words × topics) are the mixture components, and the columns of Θ (topics × documents) are the mixture weights LSA: C = U D V^T, the SVD into a words × dims space
86
Testing violation of triangle inequality Look for triplets of associates w1, w2, w3 such that P( w2 | w1 ) > τ and P( w3 | w2 ) > τ, and measure P( w3 | w1 ) Vary the threshold τ
88
Documents as Topic Mixtures: a Geometric Interpretation Word probabilities lie on a simplex: P(word1) + P(word2) + P(word3) = 1 Topic 1 and topic 2 are points on this simplex; each observed document lies between them
89
Generating Artificial Data: a visual example Analogy: word = pixel, document = image; sample each pixel from a mixture of topics 10 “topics”, each a 5 x 5 pixel image
90
A subset of documents
91
Inferred Topics after 0, 1, 5, 10, 50, and 100 iterations
92
Interpretable decomposition SVD gives a basis for the data, but not an interpretable one The true basis is not orthogonal, so rotation does no good (figure: topic model basis vs. SVD basis)
93
Summary INPUT: word-document counts (word order is irrelevant) OUTPUT: topic assignments to each word P( z_i ), likely words in each topic P( w | z ), likely topics in each document (the “gist”) P( z | d )
94
Similarity between 55 funding programs
95
Example Topics extracted from NIH/NSF grants
96
Gibbs Sampling The probability that word i is assigned to topic t combines two counts: the count of word w assigned to topic t, and the count of topic t assigned to document d