1
Analyzing Federal Funding, Scientific Publications and Email with Probabilistic Topic Models
Mark Steyvers (UC Irvine), Padhraic Smyth (UC Irvine), Dave Newman (UC Irvine), Tom Griffiths (Brown University)
2
Analyzing Content / Managing Information: EMAIL, JOURNALS, NEWSPAPERS
3
The Problem
Many information retrieval systems assess the similarity of documents from raw word counts.
DOCUMENT: CAR CHEAP PRICE ... vs. DOCUMENT: AUTOMOBILE AFFORDABLE AMOUNT ... (no word overlap but high similarity)
DOCUMENT: BANK MONEY ... vs. DOCUMENT: BANK RIVER ... (high word overlap but low similarity)
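A minimal sketch of the problem, assuming cosine similarity over raw word counts as the similarity measure (the slide does not name one). The helper `cosine_similarity` and the tiny word lists are illustrative only and use just the words shown on the slide.

```python
from collections import Counter
import math

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents based on raw word counts."""
    counts_a, counts_b = Counter(doc_a), Counter(doc_b)
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

car_doc = ["car", "cheap", "price"]
auto_doc = ["automobile", "affordable", "amount"]
finance_doc = ["bank", "money"]
river_doc = ["bank", "river"]

# Same meaning, no shared words: similarity is 0.0.
sim_synonyms = cosine_similarity(car_doc, auto_doc)
# Different meaning, one shared word out of two: similarity is about 0.5.
sim_bank = cosine_similarity(finance_doc, river_doc)
```

The word-count measure ranks the two "bank" documents as more similar than the two car documents, which is the opposite of what a reader would say.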
4
One solution: compare documents on a latent set of factors (topics)
DOCUMENT: CAR CHEAP PRICE ... and DOCUMENT: AUTOMOBILE AFFORDABLE AMOUNT ... both map to topic 1 and topic 2: high topical overlap
DOCUMENT: BANK MONEY ... maps to topic 3, DOCUMENT: BANK RIVER ... maps to topic 4: no topical overlap
5
2nd-generation systems: go beyond the raw word information; extract content in terms of topics; deal with large sets of documents; minimal supervision
6
Probabilistic Topic Models: originated in the domains of statistics and machine learning; perform unsupervised extraction of topics from large text collections. Text documents: scientific articles, book chapters, newspaper articles, ... any set of words in a verbal context
7
Overview: I. Probabilistic Topic Models; II. Analyzing Scientific Topics: PNAS; III. Analyzing Topics of Federal Funding; IV. Analyzing Enron Email; V. Extensions of the model; VI. Conclusion
9
Probabilistic Topic Models: Each document is a probability distribution over topics. Each topic is a probability distribution over words. We do not observe these distributions, but we can infer them statistically.
10
The Generative Model
View document generation as a probabilistic process (diagram: TOPICS MIXTURE -> TOPIC -> WORD, repeated for each word slot)
1. For each document, choose a mixture of topics
2. For every word slot, sample a topic [1..T] from the mixture
3. Sample a word from the topic
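The three steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the two topics reuse the money/loan/bank and river/stream/bank words from the example slide, the uniform word probabilities, the symmetric Dirichlet parameter `alpha`, and the function name `generate_document` are all hypothetical choices.

```python
import numpy as np

vocab = ["money", "loan", "bank", "river", "stream"]
# Assumed topic-word distributions (one row per topic, each row sums to 1).
topics = np.array([
    [1/3, 1/3, 1/3, 0.0, 0.0],   # topic 0: money, loan, bank
    [0.0, 0.0, 1/3, 1/3, 1/3],   # topic 1: bank, river, stream
])

def generate_document(n_words, rng, alpha=1.0):
    """Generate one document by following the three steps of the model."""
    n_topics = topics.shape[0]
    # Step 1: choose a mixture of topics for this document.
    mixture = rng.dirichlet(alpha * np.ones(n_topics))
    # Step 2: for every word slot, sample a topic from the mixture.
    z = rng.choice(n_topics, size=n_words, p=mixture)
    # Step 3: sample a word from the chosen topic.
    words = [vocab[rng.choice(len(vocab), p=topics[t])] for t in z]
    return mixture, z, words

rng = np.random.default_rng(0)
mixture, z, words = generate_document(16, rng)
```

Because "bank" has nonzero probability under both topics, the same word can be generated with either topic assignment, which is how the model represents ambiguous words.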
11
Example
Mixture components:
TOPIC 1: money, loan, bank
TOPIC 2: river, stream, bank
Mixture weights shown on the slide: .3, .8, .2, .7
DOCUMENT 1 (the number after each word is its topic): money 1 bank 1 bank 1 loan 1 river 2 stream 2 bank 1 money 1 river 2 bank 1 money 1 bank 1 loan 1 money 1 stream 2 bank 1
DOCUMENT 2 (the number after each word is its topic): river 2 stream 2 bank 2 stream 2 bank 2 money 1 loan 1 river 2 stream 2 loan 1 bank 2 river 2 bank 2 bank 1 stream 2 river 2 loan 1 bank 2 stream 2 bank 2 money 1
Bayesian approach: use priors
Mixture weights ~ Dirichlet( )
Mixture components ~ Dirichlet( )
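The model with its priors can be written compactly as follows. The symbols are an assumption: the hyperparameter names did not survive in the slide text, so $\alpha$ and $\beta$ are used here for the two Dirichlet priors, $\theta^{(d)}$ for the mixture weights of document $d$, and $\phi^{(t)}$ for mixture component (topic) $t$.

```latex
\begin{align*}
\theta^{(d)} &\sim \mathrm{Dirichlet}(\alpha) && \text{mixture weights of document } d \\
\phi^{(t)} &\sim \mathrm{Dirichlet}(\beta) && \text{mixture component (topic) } t \\
z_i \mid \theta^{(d)} &\sim \mathrm{Discrete}\big(\theta^{(d)}\big) && \text{topic of word slot } i \text{ in document } d \\
w_i \mid z_i, \phi &\sim \mathrm{Discrete}\big(\phi^{(z_i)}\big) && \text{word at slot } i
\end{align*}
```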
12
INVERTING THE GENERATIVE PROCESS
TOPIC 1, TOPIC 2: ? ? ?
DOCUMENT 1: A Play is written to be performed on a stage before a live audience or before motion picture or television cameras (for later viewing by large audiences). A Play is written because playwrights have something ...
DOCUMENT 2: He was listening to music coming from a passing riverboat. The music had already captured his heart as well as his ear. It was jazz. Bix Beiderbecke had already had music lessons. He wanted to play the cornet. And he wanted to play jazz ...
We estimate the assignments of topics to words
13
INVERTING THE GENERATIVE PROCESS
TOPIC 1, TOPIC 2 (the number after a word is its assigned topic)
DOCUMENT 1: A Play 082 is written 082 to be performed 082 on a stage 082 before a live 093 audience 082 or before motion 270 picture 004 or television 004 cameras 004 (for later 054 viewing 004 by large 202 audiences 082). A Play 082 is written 082 because playwrights 082 have something ...
DOCUMENT 2: He was listening 077 to music 077 coming 009 from a passing 043 riverboat. The music 077 had already captured 006 his heart 157 as well as his ear 119. It was jazz 077. Bix Beiderbecke had already had music 077 lessons 077. He wanted 268 to play 077 the cornet. And he wanted 268 to play 077 jazz 077 ...
We estimate the assignments of topics to words
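The slides do not say how the assignments are estimated. One widely used approach for this kind of model is collapsed Gibbs sampling, and the sketch below assumes it. The function name `gibbs_topic_assignments`, the hyperparameter values `alpha` and `beta`, and the toy documents are all hypothetical.

```python
import numpy as np

def gibbs_topic_assignments(docs, n_topics, vocab_size,
                            n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling sketch. docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))    # words in doc d assigned to topic t
    topic_word = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic t
    topic_total = np.zeros(n_topics)               # total words assigned to topic t
    # Start from a random topic assignment for every word slot.
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            doc_topic[d, t] += 1
            topic_word[t, w] += 1
            topic_total[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment of this word slot from the counts.
                t = z[d][i]
                doc_topic[d, t] -= 1
                topic_word[t, w] -= 1
                topic_total[t] -= 1
                # Conditional probability of each topic given all other assignments.
                p = ((doc_topic[d] + alpha) * (topic_word[:, w] + beta)
                     / (topic_total + vocab_size * beta))
                t = rng.choice(n_topics, p=p / p.sum())
                # Record the newly sampled assignment.
                z[d][i] = t
                doc_topic[d, t] += 1
                topic_word[t, w] += 1
                topic_total[t] += 1
    return z, doc_topic, topic_word

vocab = ["money", "loan", "bank", "river", "stream"]
docs = [[0, 2, 2, 1, 0, 2], [3, 4, 2, 4, 3, 2]]  # word ids into vocab
z, doc_topic, topic_word = gibbs_topic_assignments(docs, n_topics=2,
                                                   vocab_size=len(vocab))
```

Each word slot ends up with one topic label, which is what the numbers after the words on the slide represent.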
14
Choosing the number of topics: subjective interpretability; Bayesian model selection; generalization tests; models that grow with the size of the data
15
Input/Output
INPUT: word-document counts
OUTPUT: topic assignments to each word; likely words in each topic; likely topics for a document ("gist")
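As a sketch of how the last two outputs could be read off once topic assignments have been counted, the snippet below normalizes hypothetical count matrices into distributions. The function name `summarize`, the smoothing values `alpha` and `beta`, and the counts themselves are made up for illustration.

```python
import numpy as np

def summarize(topic_word, doc_topic, vocab, alpha=0.1, beta=0.01, top_n=3):
    """Turn assignment counts into likely words per topic and a gist per document."""
    # Smoothed probability of each word within each topic.
    phi = (topic_word + beta) / (topic_word + beta).sum(axis=1, keepdims=True)
    # Smoothed probability of each topic within each document.
    theta = (doc_topic + alpha) / (doc_topic + alpha).sum(axis=1, keepdims=True)
    top_words = [[vocab[i] for i in np.argsort(-row)[:top_n]] for row in phi]
    gist = theta.argmax(axis=1)  # most likely topic for each document
    return phi, theta, top_words, gist

vocab = ["money", "loan", "bank", "river", "stream"]
# Hypothetical counts: rows are topics, columns are words in vocab.
topic_word = np.array([[4.0, 3.0, 5.0, 0.0, 0.0],
                       [0.0, 0.0, 2.0, 4.0, 3.0]])
# Hypothetical counts: rows are documents, columns are topics.
doc_topic = np.array([[10.0, 1.0],
                      [2.0, 8.0]])

phi, theta, top_words, gist = summarize(topic_word, doc_topic, vocab)
# top_words is [["bank", "money", "loan"], ["river", "stream", "bank"]]
# gist is [0, 1]
```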
16
Example: topics from an educational corpus (TASA)
37K docs, 26K words, 1700 topics, e.g.:
PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
17
Polysemy
PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
18
Three documents with the word “play” (numbers and colors indicate topic assignments)
19
Overview: I. Probabilistic Topic Models; II. Analyzing Scientific Topics: PNAS; III. Analyzing Topics of Federal Funding; IV. Analyzing Enron Email; V. Extensions of the model; VI. Conclusion
20
PNAS Topics: applied the model to PNAS abstracts (Proceedings of the National Academy of Sciences)
21
A selection of topics (out of 300)
FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN
22
PNAS Topics and classes
PNAS authors provide class designations. Major: Biological, Physical, Social Sciences. Minor: 33 separate disciplines.
Find topics diagnostic of classes: validate the “reality” of classes; show how disciplines overlap topically.
24
TOPIC 210: SYNAPTIC NEURONS POSTSYNAPTIC HIPPOCAMPAL SYNAPSES LTP PRESYNAPTIC TRANSMISSION POTENTIATION PLASTICITY EXCITATORY RELEASE DENDRITIC PYRAMIDAL HIPPOCAMPUS
Class label shown with Topic 210: Neurobiology
25
TOPIC 280: SPECIES SELECTION EVOLUTION GENETIC POPULATIONS POPULATION VARIATION NATURAL EVOLUTIONARY FITNESS ADAPTIVE RATES THEORY TRAITS DIVERSITY
Class labels shown with Topic 280: Evolution, Population biology
26
TOPIC 39: THEORY TIME SPACE GIVEN PROBLEM SHAPE SIMPLE DIMENSIONAL PAPER NUMBER CASE LOCAL TERMS SYMMETRY RANDOM
Class labels shown with Topic 39: Mathematics, Applied Mathematics
27
Topic Dynamics
We have the distribution over topics for PNAS abstracts from 1991 to 2001.
Analysis of dynamics: perform a linear trend analysis for each topic. “Hot topics” go up, “cold topics” go down.
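A minimal sketch of such a trend analysis, assuming the per-abstract topic distributions are first averaged by year and a straight line is then fitted to each topic's yearly mean. The function name `topic_trends` and the toy data are hypothetical, and the slides do not specify the exact regression setup.

```python
import numpy as np

def topic_trends(years, theta):
    """Slope of a least-squares line through each topic's yearly mean proportion.

    years: publication year of each document, shape (n_docs,)
    theta: topic proportions of each document, shape (n_docs, n_topics)
    """
    years = np.asarray(years)
    theta = np.asarray(theta, dtype=float)
    unique_years = np.unique(years)
    # Mean topic distribution of all documents published in each year.
    yearly_mean = np.vstack([theta[years == y].mean(axis=0) for y in unique_years])
    # Center the years to keep the fit well conditioned; slopes are unchanged.
    x = unique_years - unique_years.mean()
    slopes = np.polyfit(x, yearly_mean, deg=1)[0]  # one slope per topic
    return slopes

# Toy data: topic 0 rises over time ("hot"), topic 1 falls ("cold").
years = [1991, 1991, 1992, 1992, 1993, 1993]
theta = [[0.2, 0.8], [0.2, 0.8], [0.5, 0.5], [0.5, 0.5], [0.8, 0.2], [0.8, 0.2]]
slopes = topic_trends(years, theta)
hot = np.argsort(-slopes)   # topics ordered from most rising to most falling
```

Sorting topics by slope gives candidate hot topics at one end and cold topics at the other; in practice one would also check the significance of each slope.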
28
Hot topics
2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE
134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED
179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION
Cold topics
37: CDNA AMINO SEQUENCE ACID PROTEIN ISOLATED ENCODING CLONED ACIDS IDENTITY CLONE EXPRESSED ENCODES RAT HOMOLOGY
289: KDA PROTEIN PURIFIED MOLECULAR MASS CHROMATOGRAPHY POLYPEPTIDE GEL SDS BAND APPARENT LABELED IDENTIFIED FRACTION DETECTED
75: ANTIBODY ANTIBODIES MONOCLONAL ANTIGEN IGG MAB SPECIFIC EPITOPE HUMAN MABS RECOGNIZED SERA EPITOPES DIRECTED NEUTRALIZING
Slide annotations: NOBEL 1987, NOBEL 2002
29
Overview: I. Probabilistic Topic Models; II. Analyzing Scientific Topics: PNAS; III. Analyzing Topics of Federal Funding; IV. Analyzing Enron Email; V. Extensions of the model; VI. Conclusion
30
Analyzing Topics of Funding: Get a large-scale overview of funding for the social sciences. How similar are different funding programs? How is funding distributed over topics?
31
Dataset: 22,189 abstracts from grants active in 2003
NIH: NIMH (National Institute of Mental Health), NCI (National Cancer Institute)
NSF: SBE (Social, Behavioral and Economic Sciences), BIO (Biological Sciences)
32
Extracted topics (1..20)
33
80 interpretable topics (out of 100)
34
Likely topics for NIH; likely topics for NSF-SBE; likely topics for NSF-BIO
35
Program similarity using topics
37
2D visualization of funding programs (NIH, NSF-BIO, NSF-SBE): nearby programs support similar topics
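The slides do not state which similarity measure was used. As one possible sketch, each program can be summarized by the average topic distribution of its grants, and programs can then be compared with the Jensen-Shannon divergence (0 for identical profiles, 1 for profiles with no topics in common when using base-2 logarithms). The measure, the function names, and the toy numbers below are all assumptions.

```python
import numpy as np

def program_topic_profile(theta, programs, name):
    """Average topic distribution over all grants belonging to one program."""
    mask = np.asarray(programs) == name
    return np.asarray(theta, dtype=float)[mask].mean(axis=0)

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two topic distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        nonzero = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[nonzero] * np.log2(a[nonzero] / b[nonzero]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy data: topic proportions of four grants over three topics.
theta = [[0.8, 0.1, 0.1],
         [0.6, 0.3, 0.1],
         [0.1, 0.8, 0.1],
         [0.1, 0.1, 0.8]]
programs = ["NIH", "NIH", "NSF-BIO", "NSF-SBE"]

nih = program_topic_profile(theta, programs, "NIH")
bio = program_topic_profile(theta, programs, "NSF-BIO")
sbe = program_topic_profile(theta, programs, "NSF-SBE")
nih_vs_bio = js_divergence(nih, bio)
nih_vs_sbe = js_divergence(nih, sbe)
```

A matrix of such pairwise divergences could then be passed to a 2D embedding method such as multidimensional scaling to obtain a map in which nearby programs support similar topics.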
38
Funding Amounts per Topic: We have the $ funding per grant. We have the % of topics for each grant. We can solve for the $ amount per topic. What are expensive topics?
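One simple way to combine the two quantities, shown here as a sketch rather than the authors' method, is to split each grant's dollars across topics in proportion to its topic percentages and sum over grants. The function name `dollars_per_topic` and the toy numbers are hypothetical.

```python
import numpy as np

def dollars_per_topic(theta, dollars):
    """Allocate each grant's funding to topics in proportion to its topic mix.

    theta:   topic proportions per grant, shape (n_grants, n_topics), rows sum to 1
    dollars: funding per grant, shape (n_grants,)
    """
    theta = np.asarray(theta, dtype=float)
    dollars = np.asarray(dollars, dtype=float)
    return theta.T @ dollars  # total dollars attributed to each topic

# Toy data: two grants, two topics.
theta = [[1.0, 0.0],    # grant 0 is entirely about topic 0
         [0.5, 0.5]]    # grant 1 is split evenly
dollars = [100.0, 200.0]

totals = dollars_per_topic(theta, dollars)        # [200.0, 100.0]
# Dollars per unit of topic "mass": a rough indicator of expensive topics.
topic_mass = np.asarray(theta).sum(axis=0)        # [1.5, 0.5]
cost_per_unit = totals / topic_mass
```

Because the rows of `theta` sum to 1, the per-topic totals add up to the overall funding, so no dollars are lost or double counted. In the toy data topic 1 receives fewer total dollars but is more expensive per unit of topic mass, since it only appears in the larger grant.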
39
High $$$ topics; low $$$ topics
40
Overview: I. Probabilistic Topic Models; II. Analyzing Scientific Topics: PNAS; III. Analyzing Topics of Federal Funding; IV. Analyzing Enron Email; V. Extensions of the model; VI. Conclusion
41
Enron email data: 500,000 emails, 5000 authors, 1999-2002
42
Enron topics
TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES
GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT
ENVIRONMENTAL AIR MTBE EMISSIONS CLEAN EPA PENDING SAFETY WATER GASOLINE
FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED
POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC
STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU
TIMELINE: May 22, 2000, start of California energy crisis
43
Overview: I. Probabilistic Topic Models; II. Analyzing Scientific Topics: PNAS; III. Analyzing Topics of Federal Funding; IV. Analyzing Enron Email; V. Extensions of the model; VI. Conclusion
44
Pennsylvania Gazette, 1728-1800: 80,000 articles (courtesy of David Newman & Sharon Block, History Department, UC Irvine)
45
Historical Trends in Pen. Gazette
STATE GOVERNMENT CONSTITUTION LAW UNITED POWER CITIZEN PEOPLE PUBLIC CONGRES
SILK COTTON DITTO WHITE BLACK LINEN CLOTH WOMEN BLUE WORSTED
(courtesy of David Newman & Sharon Block, UC Irvine)
46
Learning Topic Hierarchies
In the regular topic model, there are no relations between topics.
Alternative: hierarchical topic organization (diagram: tree of topic 1 through topic 7)
Apply to Psych Review abstracts
48
Integrating Topics and Syntax
Syntactic dependencies: short-range dependencies
Semantic dependencies: long-range
(Diagram: graphical model with variables z, w and s at each word position)
Semantic state: generate words from topic model
Syntactic states: generate words from HMM
(Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
49
...
IN BY WITH ON AS FROM TO FOR
THE A AN THIS THEIR ITS EACH ONE
IS ARE BE HAS HAVE WAS WERE AS
BASED PRESENTED DISCUSSED PROPOSED DESCRIBED SUCH USED DERIVED
THEORY MODEL PROCESSES MODELS SYSTEM PROCESS EFFECTS INFORMATION
ATTENTION SEARCH VISUAL PROCESSING TASK PERFORMANCE INFORMATION ATTENTIONAL
MEMORY TERM LONG SHORT RETRIEVAL STORAGE MEMORIES AMNESIA
IQ BEHAVIOR EVOLUTIONARY ENVIRONMENT GENES HERITABILITY GENETIC SELECTION
DRUG AROUSAL NEURAL BRAIN HABITUATION BIOLOGICAL TOLERANCE BEHAVIORAL
SOCIAL SELF ATTITUDE IMPLICIT ATTITUDES PERSONALITY JUDGMENT PERCEPTION
(S) THE SEARCH IN LONG TERM MEMORY ……
(S) A MODEL OF VISUAL ATTENTION ……
50
Random sentence generation
LANGUAGE:
[S] RESEARCHERS GIVE THE SPEECH
[S] THE SOUND FEEL NO LISTENERS
[S] WHICH WAS TO BE MEANING
[S] HER VOCABULARIES STOPPED WORDS
[S] HE EXPRESSLY WANTED THAT BETTER VOWEL
51
Conclusion
Unsupervised extraction of content from large text collections.
Topics provide a quick overview of content.
Topic models sit between text mining / information retrieval and psychology / memory. Connection? Good semantic memory models for finding semantically relevant information might also be good information retrieval models.
52
Psych Review abstracts
All 1281 abstracts since 1967
50 topics, examples:
SIMILARITY CATEGORY CATEGORIES RELATIONS DIMENSIONS FEATURES STRUCTURE SIMILAR REPRESENTATION OBJECTS
STIMULUS CONDITIONING LEARNING RESPONSE STIMULI RESPONSES AVOIDANCE REINFORCEMENT CLASSICAL DISCRIMINATION
MEMORY RETRIEVAL RECALL ITEMS INFORMATION TERM RECOGNITION ITEMS LIST ASSOCIATIVE
GROUP INDIVIDUAL GROUPS OUTCOMES INDIVIDUALS GROUPS OUTCOMES INDIVIDUALS DIFFERENCES INTERACTION
EMOTIONAL EMOTION BASIC EMOTIONS AFFECT STATES EXPERIENCES AFFECTIVE AFFECTS RESEARCH
...