Probabilistic Topic Models
Mark Steyvers, Department of Cognitive Sciences, University of California, Irvine
Joint work with: Tom Griffiths (UC Berkeley), Padhraic Smyth (UC Irvine), Dave Newman (UC Irvine)
Overview
I. Problems of interest
II. Probabilistic topic models
III. Using topic models in machine learning and data mining
IV. Using topic models in psychology
V. Conclusion
Problems of Interest
Machine learning / data mining:
- What topics does this text collection "span"?
- Which documents are about a particular topic?
- Who writes about a particular topic?
- How have topics changed over time?
Psychology:
- How to represent the "gist" of a list of words?
- How to model associations between words?
Probabilistic Topic Models
- Based on the pLSI and LDA models (e.g., Hofmann, 2001; Blei, Ng, & Jordan, 2003)
- A probability model linking documents, topics, and words: each topic is a distribution over words, and each document is a mixture of topics
- Topics and topic mixtures can be extracted from large collections of text
Example: topics extracted from NIH/NSF grants (figure). Important point: these distributions are learned in a completely automated, "unsupervised" fashion from the data.
Overview
I. Problems of interest
II. Probabilistic topic models
III. Using topic models in machine learning and data mining
IV. Using topic models in psychology
V. Conclusion
Parameters and real-world data:
- Probabilistic model: P( Data | Parameters )
- Statistical inference: P( Parameters | Data )
Document generation as a probabilistic process
- Each topic is a distribution over words, with parameters φ(j)
- Each document is a mixture of topics, with parameters θ(d)
- Each word is chosen from a single topic
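To make the generative process concrete, here is a minimal sketch in Python/NumPy. All sizes and hyperparameter values are illustrative assumptions, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration
n_topics, n_words, n_docs, doc_len = 3, 20, 5, 50
alpha, beta = 0.1, 0.01  # small Dirichlet parameters encourage sparsity

# Each topic phi[t] is a distribution over the vocabulary
phi = rng.dirichlet(np.full(n_words, beta), size=n_topics)

for d in range(n_docs):
    # Each document has its own mixture over topics, theta
    theta = rng.dirichlet(np.full(n_topics, alpha))
    # Each word: pick a topic z from theta, then a word from phi[z]
    z = rng.choice(n_topics, size=doc_len, p=theta)
    words = [rng.choice(n_words, p=phi[t]) for t in z]
```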
Prior Distributions
Dirichlet priors encourage sparsity in the topic mixtures and topics:
θ ~ Dirichlet(α)
φ ~ Dirichlet(β)
(Figure: the probability simplex over Topic 1 / Topic 2 / Topic 3 and over Word 1 / Word 2 / Word 3.)
Example of generating words
Given the topics and mixtures θ, documents are generated with per-token topic assignments (the number in parentheses marks the topic that generated each token):
Document 1: MONEY(1) BANK(1) BANK(1) LOAN(1) BANK(1) MONEY(1) BANK(1) MONEY(1) BANK(1) LOAN(1) LOAN(1) BANK(1) MONEY(1) ...
Document 2: RIVER(2) MONEY(1) BANK(2) STREAM(2) BANK(2) BANK(1) MONEY(1) RIVER(2) MONEY(1) BANK(2) LOAN(1) MONEY(1) ...
Document 3: RIVER(2) BANK(2) STREAM(2) BANK(2) RIVER(2) BANK(2) ...
Inference
Given only the documents, the topics, the mixtures θ, and the topic assignments are all unknown:
Document 1: MONEY(?) BANK(?) BANK(?) LOAN(?) BANK(?) MONEY(?) BANK(?) MONEY(?) BANK(?) LOAN(?) LOAN(?) BANK(?) MONEY(?) ...
Document 2: RIVER(?) MONEY(?) BANK(?) STREAM(?) BANK(?) BANK(?) MONEY(?) RIVER(?) MONEY(?) BANK(?) LOAN(?) MONEY(?) ...
Document 3: RIVER(?) BANK(?) STREAM(?) BANK(?) RIVER(?) BANK(?) ...
Bayesian Inference
Three sets of latent variables:
- topic mixtures θ
- word distributions φ
- topic assignments z
Integrate out θ and φ and estimate the topic assignments:
P( z | w ) = P( w, z ) / Σ_z P( w, z )
The denominator sums over all possible topic assignments, which is intractable, so we use MCMC with Gibbs sampling for approximate inference.
Gibbs Sampling
- Start with random assignments of words to topics
- Repeat for M iterations:
  - For each word i, sample a new topic assignment for word i, conditioned on all other topic assignments
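A minimal collapsed Gibbs sampler implementing the loop above. This is a sketch: variable names and default values are my own, and it omits conveniences like burn-in handling and convergence checks:

```python
import numpy as np

def gibbs_lda(docs, n_topics, n_words, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampler for LDA. `docs` is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    # Count tables: topic-word counts, doc-topic counts, and totals per topic
    nwt = np.zeros((n_words, n_topics))
    ndt = np.zeros((len(docs), n_topics))
    nt = np.zeros(n_topics)
    # Random initial assignments
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            nwt[w, z[d][i]] += 1; ndt[d, z[d][i]] += 1; nt[z[d][i]] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the current assignment from the counts
                nwt[w, t] -= 1; ndt[d, t] -= 1; nt[t] -= 1
                # Conditional distribution over topics for this token
                p = (nwt[w] + beta) / (nt + n_words * beta) * (ndt[d] + alpha)
                t = rng.choice(n_topics, p=p / p.sum())
                nwt[w, t] += 1; ndt[d, t] += 1; nt[t] += 1
                z[d][i] = t
    return z, nwt, ndt
```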
16 artificial documents (figure). Can we recover the original topics and topic mixtures from these data?
Starting the Gibbs sampling: assign word tokens randomly to topics (circle color indicates topic: topic 1 vs. topic 2).
After 1 iteration
After 4 iterations
After 32 iterations
Choosing the number of topics
Bayesian model selection: choose T to maximize P( w | T ). This requires integration over all parameters; approximate it by Gibbs sampling.
Results with artificial data generated from 10 topics (figure).
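One common approximation estimates P( w | T ) as the harmonic mean of the per-sample likelihoods P( w | z ) from the Gibbs chain. A sketch, assuming you already have per-sample log-likelihoods; note this estimator is known to be high-variance and is shown only to illustrate the idea:

```python
import numpy as np

def log_harmonic_mean(log_liks):
    """Harmonic-mean estimate of log P(w | T) from per-sample log P(w | z_m):
    P(w) ~= ( (1/M) * sum_m 1 / P(w | z_m) )**(-1)."""
    log_liks = np.asarray(log_liks, dtype=float)
    m = (-log_liks).max()
    # log( mean( exp(-log_liks) ) ), computed stably, then negated
    log_mean_inv = m + np.log(np.exp(-log_liks - m).mean())
    return -log_mean_inv

# Choose the T that maximizes log_harmonic_mean over candidate numbers of topics.
```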
Overview
I. Problems of interest
II. Probabilistic topic models
III. Using topic models in machine learning and data mining
IV. Using topic models in psychology
V. Conclusion
Extracting Topics from Enron: 250,000 e-mails, 28,000 authors. Example topics:
- TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES
- GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL
- VISIT TRAVEL ROUNDTRIP SAVE DEALS HOTEL BOOK SALE FARES TRIP CITIES
- FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED
- POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC
- STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU
Topic trends from the New York Times (330,000 articles). Example topics and their trends over time:
- Tour de France: TOUR RIDER LANCE_ARMSTRONG TEAM BIKE RACE FRANCE
- Quarterly earnings: COMPANY QUARTER PERCENT ANALYST SHARE SALES EARNING
- Anthrax: ANTHRAX LETTER MAIL WORKER OFFICE SPORES POSTAL BUILDING
Topic trends in the NIPS conference:
- Neural networks on the decline: LAYER NET NEURAL LAYERS NETS ARCHITECTURE NUMBER FEEDFORWARD SINGLE
- ... while SVMs become more popular: KERNEL SUPPORT VECTOR MARGIN SVM KERNELS SPACE DATA MACHINES
Finding Funding Overlap
- Analyze 22,000+ active grants from NSF and NIH funding programs
- Focus on the behavioral sciences
- Questions of interest:
  - What are the areas of overlap between funding programs?
  - What topics are funded?
2D visualization of funding programs (NIH, NSF-BIO, NSF-SBE): nearby programs support similar topics.
Funding Amounts per Topic
- We have the $ funding per grant
- We have the distribution of topics for each grant
- Solve for the $ amount per topic (see the sketch below)
- What are the expensive topics?
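The allocation itself is a single matrix product: each grant's dollars are spread across topics in proportion to its topic mixture. A sketch with made-up numbers:

```python
import numpy as np

# Hypothetical inputs: dollars[g] is the award for grant g,
# theta[g, t] is the topic mixture P(topic t | grant g); rows sum to 1
dollars = np.array([500_000.0, 250_000.0, 1_200_000.0])
theta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])

# Allocate each grant's dollars across topics in proportion to its mixture
dollars_per_topic = dollars @ theta
```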
High-$$$ topics vs. low-$$$ topics (figure).
Overview
I. Problems of interest
II. Probabilistic topic models
III. Using topic models in machine learning and data mining
IV. Using topic models in psychology
V. Conclusion
Basic Problems for Semantic Memory
- Prediction: what fact, concept, or word is next?
- Gist extraction: what is this set of words about?
- Disambiguation: what is the sense of this word?
TASA Corpus
Applied the model to an educational text corpus, TASA: 33K documents, 6M words.
Three documents with the word "play" (numbers and colors indicate topic assignments).
Word Association
CUE: PLAY (Nelson, McEvoy, & Schreiber, USF word association norms)
Responses form association networks, e.g., PLAY linked to BASEBALL, BAT, BALL, GAME, STAGE, and THEATER.
Topic model for Word Association
Association as a problem of prediction: given that a single word (the cue) is observed, predict what other words (responses) might occur in that context.
Under a single-topic assumption:
P( response | cue ) = Σ_z P( response | z ) P( z | cue )
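A sketch of this prediction, assuming `phi` holds the learned topic-word distributions and `topic_prior` the topic probabilities (both names are mine):

```python
import numpy as np

def associate(cue, phi, topic_prior):
    """P(w2 | w1) under a single-topic assumption:
    P(w2 | w1) = sum_z P(w2 | z) P(z | w1).
    phi[z, w] = P(w | z); topic_prior[z] = P(z)."""
    # Bayes: P(z | cue) is proportional to P(cue | z) P(z)
    p_z_given_cue = phi[:, cue] * topic_prior
    p_z_given_cue /= p_z_given_cue.sum()
    return p_z_given_cue @ phi  # distribution over all response words

# Rank candidate responses by this probability, excluding the cue itself.
```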
Model predictions from the TASA corpus (figure; the observed first associate appears at rank 9 in this example).
Median rank of the first associate, as a function of the number of topics (figure).
Latent Semantic Analysis (Landauer & Dumais, 1997)
- SVD of the word-document counts yields a high-dimensional semantic space
- Each word is a single point in semantic space
- Similarity is measured by the cosine of the angle between word vectors
(Figure: BALL, PLAY, GAME near each other; STAGE, THEATER near each other.)
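A minimal LSA sketch with NumPy; the entropy/log weighting usually applied before the SVD is omitted here for brevity:

```python
import numpy as np

def lsa(counts, n_dims=300):
    """Truncated SVD of a (words x documents) count matrix.
    Returns one vector per word in the reduced semantic space."""
    U, s, Vt = np.linalg.svd(counts, full_matrices=False)
    return U[:, :n_dims] * s[:n_dims]  # each word = one point in the space

def cosine(u, v):
    """Similarity = cosine of the angle between word vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
```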
Median rank of the first associate: topic model vs. LSA (figure).
Objections to spatial representations
Tversky (1977) on similarity:
- asymmetry
- violation of the triangle inequality
- rich neighborhood structure
Semantic association has the same properties:
- asymmetry
- violation of the triangle inequality
- "small world" neighborhood structure (Steyvers & Tenenbaum, 2005)
Topics as latent factors to group words
Words such as BASEBALL, BAT, BALL, GAME and STAGE, THEATER connect to PLAY only through shared topics. Note: there are no direct connections between words.
What about collocations?
Why are these words related? DOW-JONES, BUMBLE-BEE, WHITE-HOUSE.
This suggests at least two routes for association:
- semantic
- collocation
Integrate collocations into the topic model; this is related to the paradigmatic/syntagmatic distinction.
Collocation Topic Model
For each word, a binary switch x decides the route:
- if x = 0, sample the word from a topic (drawn from the document's topic mixture)
- if x = 1, sample the word from a distribution conditioned on the previous word
Collocation Topic Model: example, "DOW JONES RISES"
DOW is sampled from a topic; JONES (x = 1) is more likely explained as a word following DOW than as a word sampled from a topic; RISES (x = 0) is sampled from a topic.
Result: DOW_JONES is recognized as a collocation. (See the sketch below.)
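A sketch of one generative step of the route choice. The names `p_colloc` and `bigram` are mine, and a full model would also learn these distributions rather than take them as given:

```python
import numpy as np

rng = np.random.default_rng(0)

def next_word(prev_word, theta, phi, p_colloc, bigram):
    """One step of a collocation topic model (a sketch under assumed inputs).
    p_colloc[w] = probability that the word after w is a collocation (x = 1);
    bigram[w] = distribution over words following w;
    theta = document topic mixture; phi[z] = topic-word distribution."""
    if rng.random() < p_colloc[prev_word]:   # x = 1: collocation route
        return rng.choice(len(bigram[prev_word]), p=bigram[prev_word])
    z = rng.choice(len(theta), p=theta)      # x = 0: topic route
    return rng.choice(phi.shape[1], p=phi[z])
```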
Example topics from the New York Times (collocations joined with underscores):
- Stock market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN _STOCK_INDEX
- Wall Street firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
- Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
- Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP
Context-dependent collocations
Example: WHITE - HOUSE.
- In the context of government, WHITE_HOUSE is treated as a collocation.
- In the context of colors of houses, WHITE and HOUSE are treated as separate words.
Overview
I. Problems of interest
II. Probabilistic topic models
III. Using topic models in machine learning and data mining
IV. Using topic models in psychology
V. Conclusion
Psychology and Computer Science
- Models for information retrieval might be helpful in understanding semantic memory
- Research on semantic memory can be useful for developing better information retrieval algorithms
Software: a public-domain MATLAB toolbox for topic modeling is available on the Web.
Gibbs Sampler Stability
Comparing topics from two runs
Re-order the topics of run 2 to match the topics of run 1, using KL distance between topic distributions (see the sketch below).
Best match: KL = 0.46; worst match: KL = 9.40.
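The slides report best/worst KL after re-ordering but not the matching procedure itself; a simple greedy matcher by KL distance, as one plausible implementation:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two word distributions (smoothed for zeros)."""
    return np.sum(p * np.log((p + eps) / (q + eps)))

def match_topics(phi1, phi2):
    """Greedily reorder the topics of run 2 to best match run 1 by KL distance.
    phi1, phi2: (topics x words) distributions from two Gibbs runs."""
    unused = list(range(len(phi2)))
    order, dists = [], []
    for t in phi1:
        j = min(unused, key=lambda u: kl(t, phi2[u]))
        unused.remove(j)
        order.append(j)
        dists.append(kl(t, phi2[j]))
    return order, dists  # inspect min(dists) / max(dists) for best / worst match
```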
Extensions of Topic Models
- Combining topics and syntax
- Topic hierarchies
- Topic segmentation: no need for document boundaries
- Modeling authors as well as documents: who wrote this part of the paper?
Words can have high probability in multiple topics (based on the TASA corpus). For example, PLAY appears in both a theater topic and a sports topic, and CHARACTERS in both a printing topic and a theater topic:
- PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
- PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
- TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
- JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
- HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
- STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
Problems with Spatial Representations
Violation of the triangle inequality
In a metric space, distances satisfy d(A, C) ≤ d(A, B) + d(B, C). We can find similarity judgments that violate this.
Violation of the triangle inequality (continued)
We can also find word associations that violate it, e.g., SOCCER - FIELD - MAGNETIC: SOCCER and FIELD are strongly associated, as are FIELD and MAGNETIC, yet SOCCER and MAGNETIC are not.
No Triangle Inequality with Topics
FIELD has high probability in two topics: a sports topic (with SOCCER) and a magnetism topic (with MAGNETIC). The topic structure easily explains violations of the triangle inequality. (A toy numerical check follows.)
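A toy numerical check, using the single-topic association formula from earlier, that topic-based association probabilities need not behave like distances:

```python
import numpy as np

# Two toy topics over the vocabulary [SOCCER, FIELD, MAGNETIC]
phi = np.array([[0.5, 0.5, 0.0],   # topic 1: sports
                [0.0, 0.5, 0.5]])  # topic 2: magnetism
prior = np.array([0.5, 0.5])

def p_next(w1, w2):
    """P(w2 | w1) = sum_z P(w2 | z) P(z | w1)."""
    pz = phi[:, w1] * prior
    pz /= pz.sum()
    return pz @ phi[:, w2]

SOCCER, FIELD, MAGNETIC = 0, 1, 2
print(p_next(SOCCER, FIELD))      # 0.5  : strongly associated
print(p_next(FIELD, MAGNETIC))    # 0.25 : associated
print(p_next(SOCCER, MAGNETIC))   # 0.0  : the distance analogy breaks down
```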
Small-World Structure of Associations (Steyvers & Tenenbaum, 2005)
Properties:
1) short path lengths
2) clustering
3) power-law degree distributions
Small-world graphs arise elsewhere: the Internet, social relations, biology.
Power-law degree distributions in semantic networks: some words are "hubs" in a semantic network.
Creating Association Networks
- Topics model: connect word i to word j when P( w = j | w = i ) > threshold (see the sketch below)
- LSA: for each word, generate K associates by picking the K nearest neighbors in semantic space
The resulting degree distribution follows a power law with exponent γ = -2.05 (figure).
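A sketch of building the thresholded association graph from a fitted topic model; `phi` and the topic prior are assumed given, and the threshold value is illustrative:

```python
import numpy as np

def association_graph(phi, prior, threshold=0.05):
    """Directed graph: edge i -> j when P(w=j | w=i) > threshold,
    with P(w2 | w1) = sum_z P(w2 | z) P(z | w1) from the topic model.
    phi: (topics x words); prior: (topics,)."""
    p_z_given_w = phi * prior[:, None]
    p_z_given_w /= p_z_given_w.sum(axis=0, keepdims=True)  # P(z | w), per column
    p_w2_given_w1 = p_z_given_w.T @ phi                    # rows: w1, cols: w2
    np.fill_diagonal(p_w2_given_w1, 0)                     # no self-associations
    return p_w2_given_w1 > threshold

# Degree distribution: np.bincount(adj.sum(axis=1)) - inspect for a power-law tail.
```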
Associations in free recall
STUDY THESE WORDS: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy
RECALL WORDS ...
FALSE RECALL: "Sleep", 61% of the time.
Recall as a reconstructive process
Reconstruct the study list based on the stored "gist"; the gist can be represented by a distribution over topics. Under a single-topic assumption:
P( retrieved word | study list ) = Σ_z P( retrieved word | z ) P( z | study list )
Predictions for the "Sleep" list: the study-list words and the top 8 extra-list words (figure).
Hidden Markov Topic Model
Hidden Markov Topics Model (Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
- Semantic dependencies are long-range: a distinguished semantic state generates words from the topic model
- Syntactic dependencies are short-range: the other states generate words from an HMM
(Graphical model: hidden states s transition as a Markov chain; each emits a word w, with the semantic state drawing its word from a topic z.)
Transitions between the semantic state and syntactic states. Example distributions:
- Topic z = 1: HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2
- Topic z = 2: SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2
- Syntactic class: THE 0.6, A 0.3, MANY 0.1
- Syntactic class: OF 0.6, FOR 0.3, BETWEEN 0.1
Combining topics and syntax: generating "THE LOVE OF RESEARCH" step by step.
- THE is generated from a syntactic class (THE 0.6, A 0.3, MANY 0.1)
- LOVE is generated from the semantic state, using topic z = 1 (HEART, LOVE, SOUL, TEARS, JOY)
- OF is generated from a syntactic class (OF 0.6, FOR 0.3, BETWEEN 0.1)
- RESEARCH is generated from the semantic state, using topic z = 2 (SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, MATHEMATICS)
(A generative sketch follows.)
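A generative sketch of the composite model. Which class index plays the semantic role, and all the distributions, are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_sentence(theta, phi, hmm_emit, hmm_trans, length=5, sem_state=0):
    """Sketch of the HMM topics model: a Markov chain over classes, where one
    distinguished class (sem_state) emits from the topic model and the others
    emit from class-specific word distributions.
    theta: document topic mixture; phi: (topics x words);
    hmm_emit: (classes x words); hmm_trans: (classes x classes), rows sum to 1."""
    words, x = [], 0  # assume the chain starts in class 0
    for _ in range(length):
        x = rng.choice(len(hmm_trans), p=hmm_trans[x])  # next hidden class
        if x == sem_state:                              # semantic state:
            z = rng.choice(len(theta), p=theta)         #   topic from the document,
            words.append(rng.choice(phi.shape[1], p=phi[z]))  # word from the topic
        else:                                           # syntactic state:
            words.append(rng.choice(hmm_emit.shape[1], p=hmm_emit[x]))
    return words
```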
Semantic topics (from TASA):
- FOOD FOODS BODY NUTRIENTS DIET FAT SUGAR ENERGY MILK EATING FRUITS VEGETABLES WEIGHT FATS NEEDS CARBOHYDRATES VITAMINS CALORIES PROTEIN MINERALS
- MAP NORTH EARTH SOUTH POLE MAPS EQUATOR WEST LINES EAST AUSTRALIA GLOBE POLES HEMISPHERE LATITUDE PLACES LAND WORLD COMPASS CONTINENTS
- DOCTOR PATIENT HEALTH HOSPITAL MEDICAL CARE PATIENTS NURSE DOCTORS MEDICINE NURSING TREATMENT NURSES PHYSICIAN HOSPITALS DR SICK ASSISTANT EMERGENCY PRACTICE
- BOOK BOOKS READING INFORMATION LIBRARY REPORT PAGE TITLE SUBJECT PAGES GUIDE WORDS MATERIAL ARTICLE ARTICLES WORD FACTS AUTHOR REFERENCE NOTE
- GOLD IRON SILVER COPPER METAL METALS STEEL CLAY LEAD ADAM ORE ALUMINUM MINERAL MINE STONE MINERALS POT MINING MINERS TIN
- BEHAVIOR SELF INDIVIDUAL PERSONALITY RESPONSE SOCIAL EMOTIONAL LEARNING FEELINGS PSYCHOLOGISTS INDIVIDUALS PSYCHOLOGICAL EXPERIENCES ENVIRONMENT HUMAN RESPONSES BEHAVIORS ATTITUDES PSYCHOLOGY PERSON
- CELLS CELL ORGANISMS ALGAE BACTERIA MICROSCOPE MEMBRANE ORGANISM FOOD LIVING FUNGI MOLD MATERIALS NUCLEUS CELLED STRUCTURES MATERIAL STRUCTURE GREEN MOLDS
- PLANTS PLANT LEAVES SEEDS SOIL ROOTS FLOWERS WATER FOOD GREEN SEED STEMS FLOWER STEM LEAF ANIMALS ROOT POLLEN GROWING GROW
Syntactic classes:
- GOOD SMALL NEW IMPORTANT GREAT LITTLE LARGE * BIG LONG HIGH DIFFERENT SPECIAL OLD STRONG YOUNG COMMON WHITE SINGLE CERTAIN
- THE HIS THEIR YOUR HER ITS MY OUR THIS THESE A AN THAT NEW THOSE EACH MR ANY MRS ALL
- MORE SUCH LESS MUCH KNOWN JUST BETTER RATHER GREATER HIGHER LARGER LONGER FASTER EXACTLY SMALLER SOMETHING BIGGER FEWER LOWER ALMOST
- ON AT INTO FROM WITH THROUGH OVER AROUND AGAINST ACROSS UPON TOWARD UNDER ALONG NEAR BEHIND OFF ABOVE DOWN BEFORE
- SAID ASKED THOUGHT TOLD SAYS MEANS CALLED CRIED SHOWS ANSWERED TELLS REPLIED SHOUTED EXPLAINED LAUGHED MEANT WROTE SHOWED BELIEVED WHISPERED
- ONE SOME MANY TWO EACH ALL MOST ANY THREE THIS EVERY SEVERAL FOUR FIVE BOTH TEN SIX MUCH TWENTY EIGHT
- HE YOU THEY I SHE WE IT PEOPLE EVERYONE OTHERS SCIENTISTS SOMEONE WHO NOBODY ONE SOMETHING ANYONE EVERYBODY SOME THEN
- BE MAKE GET HAVE GO TAKE DO FIND USE SEE HELP KEEP GIVE LOOK COME WORK MOVE LIVE EAT BECOME
NIPS syntax:
- MODEL ALGORITHM SYSTEM CASE PROBLEM NETWORK METHOD APPROACH PAPER PROCESS
- IS WAS HAS BECOMES DENOTES BEING REMAINS REPRESENTS EXISTS SEEMS
- SEE SHOW NOTE CONSIDER ASSUME PRESENT NEED PROPOSE DESCRIBE SUGGEST
- USED TRAINED OBTAINED DESCRIBED GIVEN FOUND PRESENTED DEFINED GENERATED SHOWN
- IN WITH FOR ON FROM AT USING INTO OVER WITHIN
- HOWEVER ALSO THEN THUS THEREFORE FIRST HERE NOW HENCE FINALLY
NIPS semantics:
- EXPERTS EXPERT GATING HME ARCHITECTURE MIXTURE LEARNING MIXTURES FUNCTION GATE
- DATA GAUSSIAN MIXTURE LIKELIHOOD POSTERIOR PRIOR DISTRIBUTION EM BAYESIAN PARAMETERS
- STATE POLICY VALUE FUNCTION ACTION REINFORCEMENT LEARNING CLASSES OPTIMAL *
- MEMBRANE SYNAPTIC CELL * CURRENT DENDRITIC POTENTIAL NEURON CONDUCTANCE CHANNELS
- IMAGE IMAGES OBJECT OBJECTS FEATURE RECOGNITION VIEWS # PIXEL VISUAL
- KERNEL SUPPORT VECTOR SVM KERNELS # SPACE FUNCTION MACHINES SET
- NETWORK NEURAL NETWORKS OUTPUT INPUT TRAINING INPUTS WEIGHTS # OUTPUTS
Random sentence generation
LANGUAGE:
[S] RESEARCHERS GIVE THE SPEECH
[S] THE SOUND FEEL NO LISTENERS
[S] WHICH WAS TO BE MEANING
[S] HER VOCABULARIES STOPPED WORDS
[S] HE EXPRESSLY WANTED THAT BETTER VOWEL
Nested Chinese Restaurant Process
Topic Hierarchies (Blei, Griffiths, Jordan, & Tenenbaum, 2004)
- In the regular topic model there are no relations between topics
- Alternative: a hierarchical topic organization (figure: a tree with topic 1 at the root, topics 2-3 below it, and topics 4-7 as leaves)
- The nested Chinese Restaurant Process learns the hierarchical structure as well as the topics within that structure
Example: Psych Review Abstracts. Topics learned include (figure shows the hierarchy):
- RESPONSE STIMULUS REINFORCEMENT RECOGNITION STIMULI RECALL CHOICE CONDITIONING
- SPEECH READING WORDS MOVEMENT MOTOR VISUAL WORD SEMANTIC
- ACTION SOCIAL SELF EXPERIENCE EMOTION GOALS EMOTIONAL THINKING
- GROUP IQ INTELLIGENCE SOCIAL RATIONAL INDIVIDUAL GROUPS MEMBERS
- SEX EMOTIONS GENDER EMOTION STRESS WOMEN HEALTH HANDEDNESS
- REASONING ATTITUDE CONSISTENCY SITUATIONAL INFERENCE JUDGMENT PROBABILITIES STATISTICAL
- IMAGE COLOR MONOCULAR LIGHTNESS GIBSON SUBMOVEMENT ORIENTATION HOLOGRAPHIC
- CONDITIONING STRESS EMOTIONAL BEHAVIORAL FEAR STIMULATION TOLERANCE RESPONSES
- A MODEL MEMORY FOR MODELS TASK INFORMATION RESULTS ACCOUNT
- SELF SOCIAL PSYCHOLOGY RESEARCH RISK STRATEGIES INTERPERSONAL PERSONALITY SAMPLING
- MOTION VISUAL SURFACE BINOCULAR RIVALRY CONTOUR DIRECTION CONTOURS SURFACES
- DRUG FOOD BRAIN AROUSAL ACTIVATION AFFECTIVE HUNGER EXTINCTION PAIN
- THE OF AND TO IN A IS
Generative Process
Each document is generated by choosing a path from the root of the hierarchy to a leaf and sampling its words from the topics along that path. (Same topic hierarchy as on the previous slide.)
Other Slides
Markov chain Monte Carlo
- Sample from a Markov chain constructed to converge to the target distribution
- Allows sampling from an unnormalized posterior
- Can compute approximate statistics from intractable distributions
- Gibbs sampling is one such method: construct the Markov chain from conditional distributions
Enron: two example topics (T = 100) (figure).
Enron: two work-unrelated topics (figure).
Using Topic Models for Information Retrieval
Clusters vs. Topics
Example abstract: "Hidden Markov Models in Molecular Biology: New Algorithms and Applications" (Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure): "Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification."
One cluster: [cluster 88] model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov
Multiple topics:
- [topic 10] state hmm markov sequence models hidden states probabilities sequences parameters transition probability training hmms hybrid model likelihood modeling
- [topic 37] genetic structure chain protein population region algorithms human mouse selection fitness proteins search evolution generation function sequence sequences genes
Matrix factorization view:
- Topic model: C = F Q, where C is the (words x documents) normalized co-occurrence matrix, F is the (words x topics) matrix of mixture components, and Q is the (topics x documents) matrix of mixture weights.
- LSA: C = U D V^T, where U is (words x dims), D is diagonal, and V is (documents x dims).
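A short NumPy illustration of the two factorizations' shapes and interpretability; the matrices here are random stand-ins, not fitted values:

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.random((20, 10))  # stand-in for a (words x documents) matrix

# LSA: C ~= U D V^T. The basis U is orthogonal and its entries can be
# negative, so individual dimensions are hard to interpret.
U, d, Vt = np.linalg.svd(C, full_matrices=False)

# Topic model: C ~= F Q with F (words x topics) = P(w | z) and
# Q (topics x documents) = P(z | d). Both factors are non-negative and
# column-stochastic, so they read directly as probability distributions.
n_topics = 3
F = rng.dirichlet(np.ones(20), size=n_topics).T  # columns sum to 1
Q = rng.dirichlet(np.ones(n_topics), size=10).T  # columns sum to 1
approx = F @ Q
```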
Testing violation of the triangle inequality
Look for triplets of associates (w1, w2, w3) such that P( w2 | w1 ) > τ and P( w3 | w2 ) > τ, and measure P( w3 | w1 ). Vary the threshold τ.
Documents as Topic Mixtures: a Geometric Interpretation
Each topic is a point on the word simplex, where P(word1) + P(word2) + P(word3) = 1. Each observed document is a convex combination of the topics, so it lies on the segment spanned by topic 1 and topic 2.
Generating Artificial Data: a visual example
Treat a word as a pixel and a document as an image: sample each pixel from a mixture of topics. Here there are 10 "topics", each a 5 x 5 pixel pattern.
A subset of documents
Inferred topics across Gibbs sampling iterations (figure).
Interpretable decomposition
SVD gives a basis for the data, but not an interpretable one: the true basis is not orthogonal, so rotation does no good. The topic model recovers the interpretable basis (figure: topic model vs. SVD).
Summary
INPUT: word-document counts (word order is irrelevant)
OUTPUT:
- topic assignments to each word: P( z_i )
- likely words in each topic: P( w | z )
- likely topics in each document (the "gist"): P( z | d )
Similarity between 55 funding programs
Example Topics extracted from NIH/NSF grants
Gibbs Sampling update
P( z_i = t | z_-i, w ) ∝ [ ( n_-i,t^(w_i) + β ) / ( n_-i,t^(·) + W β ) ] × [ ( n_-i,t^(d_i) + α ) / ( n_-i,·^(d_i) + T α ) ]
where n_-i,t^(w) is the count of word w assigned to topic t and n_-i,t^(d) is the count of topic t assigned to document d (both excluding the current token i); W is the vocabulary size and T is the number of topics. This gives the probability that word i is assigned to topic t.