Analyzing Federal Funding, Scientific Publications and Email with Probabilistic Topic Models. Mark Steyvers, UC Irvine; Padhraic Smyth; Dave Newman; Tom Griffiths.


1 Analyzing Federal Funding, Scientific Publications and Email with Probabilistic Topic Models. Mark Steyvers, UC Irvine. Padhraic Smyth, Dave Newman, Tom Griffiths. UC Irvine, Brown University.

2 Analyzing Content / Managing Information: email, journals, newspapers

3 The Problem: Many information retrieval systems assess the similarity of documents on raw word counts.
DOCUMENT: CAR, CHEAP, PRICE, ... vs. DOCUMENT: AUTOMOBILE, AFFORDABLE, AMOUNT, ... (no word overlap but high similarity)
DOCUMENT: BANK, MONEY, ... vs. DOCUMENT: BANK, RIVER, ... (high word overlap but low similarity)
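The mismatch on this slide can be reproduced with a few lines of code. The sketch below is only an illustration: it assumes cosine similarity over raw word counts as the similarity measure (the slide does not name one), and the dictionary keys are hypothetical labels for the four example documents.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# The four example documents from the slide, as raw word counts.
docs = {
    "car": Counter("car cheap price".split()),
    "automobile": Counter("automobile affordable amount".split()),
    "bank_money": Counter("bank money".split()),
    "bank_river": Counter("bank river".split()),
}

# Same meaning, no shared words: similarity is zero.
sim_synonyms = cosine(docs["car"], docs["automobile"])
# Different meanings, shared word "bank": similarity is about 0.5.
sim_polysemy = cosine(docs["bank_money"], docs["bank_river"])
print(sim_synonyms, sim_polysemy)
```

Raw counts rank the two "bank" documents as more similar than the two car documents, which is the opposite of what a reader would judge.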

4 One solution: compare documents on a latent set of factors (topics).
DOCUMENT: CAR, CHEAP, PRICE, ... and DOCUMENT: AUTOMOBILE, AFFORDABLE, AMOUNT, ... both map to topic 1 and topic 2 (high topical overlap)
DOCUMENT: BANK, MONEY, ... maps to topic 3 and DOCUMENT: BANK, RIVER, ... maps to topic 4 (no topical overlap)
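To make the comparison in topic space concrete, here is a minimal sketch. The topic mixtures are invented for illustration (the slide gives no numbers), and the overlap measure, shared probability mass, is an assumption rather than the measure used in the talk.

```python
def topic_overlap(p, q):
    """Shared probability mass of two topic mixtures (0 = disjoint, 1 = identical)."""
    return sum(min(p.get(t, 0.0), q.get(t, 0.0)) for t in set(p) | set(q))

# Hypothetical topic mixtures, keyed by topic number as on the slide.
car = {1: 0.7, 2: 0.3}
automobile = {1: 0.6, 2: 0.4}
bank_money = {3: 1.0}
bank_river = {4: 1.0}

overlap_cars = topic_overlap(car, automobile)         # high: both use topics 1 and 2
overlap_banks = topic_overlap(bank_money, bank_river)  # zero: topics 3 and 4 are disjoint
print(overlap_cars, overlap_banks)
```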

5 2nd-generation systems: go beyond the raw word information; extract content in terms of topics; deal with large sets of documents; minimal supervision

6 Probabilistic Topic Models: originated in the domain of statistics & machine learning; perform unsupervised extraction of topics from large text collections. Text documents: scientific articles, book chapters, newspaper articles, ... any set of words in a verbal context

7 Overview: I. Probabilistic Topic Models; II. Analyzing Scientific Topics: PNAS; III. Analyzing Topics of Federal Funding; IV. Analyzing Enron Email; V. Extensions of the model; VI. Conclusion

9 Probabilistic Topic Models: Each document is a probability distribution over topics. Each topic is a probability distribution over words. We do not observe these distributions, but we can infer them statistically.

10 The Generative Model: View document generation as a probabilistic process (topics mixture → topic → word).
1. For each document, choose a mixture of topics.
2. For every word slot, sample a topic [1..T] from the mixture.
3. Sample a word from the topic.
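The three steps can be sketched in code as follows. This is an illustration under assumptions: the two toy topics and the mixture weights are made up, the mixture is passed in rather than sampled, and `generate_document` is a hypothetical helper name.

```python
import random

def generate_document(topics, mixture, n_words, rng):
    """Sample a document: for each word slot pick a topic, then a word from it."""
    topic_ids = list(range(len(topics)))
    words = []
    for _ in range(n_words):
        # Step 2: sample a topic index from the document's mixture.
        z = rng.choices(topic_ids, weights=mixture)[0]
        # Step 3: sample a word from that topic's word distribution.
        vocab, probs = zip(*topics[z].items())
        w = rng.choices(vocab, weights=probs)[0]
        words.append((w, z + 1))  # report topics as 1..T
    return words

# Toy topics (assumed): each is a probability distribution over words.
topics = [
    {"money": 1 / 3, "loan": 1 / 3, "bank": 1 / 3},
    {"river": 1 / 3, "stream": 1 / 3, "bank": 1 / 3},
]
rng = random.Random(0)
# Step 1: the mixture of topics for this document (fixed here for simplicity).
doc = generate_document(topics, mixture=[0.8, 0.2], n_words=16, rng=rng)
print(" ".join(f"{w}{z}" for w, z in doc))
```

Note that "bank" can be emitted by either topic, so the same surface word carries different topic labels depending on which topic generated it.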

11 Example
Mixture components: TOPIC 1: money, loan, bank. TOPIC 2: river, stream, bank.
Mixture weights: .3, .8, .2, .7
DOCUMENT 1 (each word is followed by its topic): money 1 bank 1 bank 1 loan 1 river 2 stream 2 bank 1 money 1 river 2 bank 1 money 1 bank 1 loan 1 money 1 stream 2 bank 1 money 1 bank 1 bank 1 loan 1 river 2 stream 2 bank 1 money 1 river 2 bank 1 money 1 bank 1 loan 1 bank 1 money 1 stream 2
DOCUMENT 2: river 2 stream 2 bank 2 stream 2 bank 2 money 1 loan 1 river 2 stream 2 loan 1 bank 2 river 2 bank 2 bank 1 stream 2 river 2 loan 1 bank 2 stream 2 bank 2 money 1 loan 1 river 2 stream 2 bank 2 stream 2 bank 2 money 1 river 2 stream 2 loan 1 bank 2 river 2 bank 2 money 1 bank 1 stream 2 river 2 bank 2 stream 2 bank 2 money 1
Bayesian approach: use priors. Mixture weights ~ Dirichlet(α). Mixture components ~ Dirichlet(β).
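Written out as equations, the model with priors looks roughly like this. The symbol names (α and β for the Dirichlet parameters, θ for mixture weights, φ for mixture components, T for the number of topics) follow the usual convention for this family of models and are an assumption here, not something the transcript spells out.

```latex
\begin{align*}
\theta^{(d)} &\sim \mathrm{Dirichlet}(\alpha)
  && \text{mixture weights of document } d \\
\phi^{(j)} &\sim \mathrm{Dirichlet}(\beta)
  && \text{mixture component (topic) } j = 1, \dots, T \\
z_i \mid \theta^{(d_i)} &\sim \mathrm{Discrete}\big(\theta^{(d_i)}\big)
  && \text{topic of word slot } i \\
w_i \mid z_i, \phi &\sim \mathrm{Discrete}\big(\phi^{(z_i)}\big)
  && \text{word in slot } i \\
P(w \mid d) &= \sum_{j=1}^{T} \phi^{(j)}_{w}\, \theta^{(d)}_{j}
  && \text{probability of word } w \text{ in document } d
\end{align*}
```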

12 INVERTING THE GENERATIVE PROCESS. TOPIC 1: ? TOPIC 2: ? Topic assignments: ?
DOCUMENT 1: A Play is written to be performed on a stage before a live audience or before motion picture or television cameras (for later viewing by large audiences). A Play is written because playwrights have something...
DOCUMENT 2: He was listening to music coming from a passing riverboat. The music had already captured his heart as well as his ear. It was jazz. Bix Beiderbecke had already had music lessons. He wanted to play the cornet. And he wanted to play jazz...
We estimate the assignments of topics to words.

13 INVERTING THE GENERATIVE PROCESS. TOPIC 1, TOPIC 2. Each content word is followed by its estimated topic number.
DOCUMENT 1: A Play 082 is written 082 to be performed 082 on a stage 082 before a live 093 audience 082 or before motion 270 picture 004 or television 004 cameras 004 (for later 054 viewing 004 by large 202 audiences 082). A Play 082 is written 082 because playwrights 082 have something...
DOCUMENT 2: He was listening 077 to music 077 coming 009 from a passing 043 riverboat. The music 077 had already captured 006 his heart 157 as well as his ear 119. It was jazz 077. Bix Beiderbecke had already had music 077 lessons 077. He wanted 268 to play 077 the cornet. And he wanted 268 to play 077 jazz 077...
We estimate the assignments of topics to words.
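The slides do not say which inference algorithm produces these assignments. One commonly used option is collapsed Gibbs sampling, sketched below under that assumption. The function name `gibbs_lda`, the hyperparameter values, and the toy documents are all hypothetical.

```python
import random

def gibbs_lda(docs, n_topics, n_iters=200, alpha=0.5, beta=0.1, seed=0):
    """Collapsed Gibbs sampling of topic assignments (illustrative sketch)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    # Count tables: topic-word, document-topic, and per-topic totals.
    n_tw = [{w: 0 for w in vocab} for _ in range(n_topics)]
    n_dt = [[0] * n_topics for _ in docs]
    n_t = [0] * n_topics
    # Start from a random topic for every word token.
    z = []
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            t = rng.randrange(n_topics)
            z[d].append(t)
            n_tw[t][w] += 1
            n_dt[d][t] += 1
            n_t[t] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                t = z[d][i]
                n_tw[t][w] -= 1
                n_dt[d][t] -= 1
                n_t[t] -= 1
                # P(z = k | all other assignments) is proportional to
                # (n_kw + beta) / (n_k + V * beta) * (n_dk + alpha).
                weights = [
                    (n_tw[k][w] + beta) / (n_t[k] + V * beta) * (n_dt[d][k] + alpha)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                n_tw[t][w] += 1
                n_dt[d][t] += 1
                n_t[t] += 1
    return z, n_tw, n_dt

# Toy corpus (assumed), echoing the money/river example.
docs = [
    "money bank bank loan money bank loan money".split(),
    "river stream bank stream river bank stream river".split(),
    "money loan bank river stream bank money loan".split(),
]
z, n_tw, n_dt = gibbs_lda(docs, n_topics=2)
for doc, assignment in zip(docs, z):
    print(" ".join(f"{w}{t + 1}" for w, t in zip(doc, assignment)))
```

The returned count tables correspond to the outputs listed later in the deck: `n_tw` gives the likely words in each topic and `n_dt` the likely topics for each document.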

14 Choosing the number of topics: subjective interpretability; Bayesian model selection; generalization tests; models that grow with the size of the data
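As an illustration of a generalization test, the sketch below scores held-out text by perplexity (lower is better). Perplexity is an assumed choice of score, since the slide does not name one, and the toy topics and mixtures are invented. Every held-out word must have nonzero probability under the model, which smoothed estimates normally guarantee.

```python
import math

def perplexity(doc, theta, phi):
    """Per-word perplexity of a held-out document under a fitted topic model.

    theta: topic weights for the document; phi: list of word distributions.
    """
    log_lik = 0.0
    for w in doc:
        # P(w | d) = sum over topics of P(topic | d) * P(w | topic)
        p = sum(theta[t] * phi[t].get(w, 0.0) for t in range(len(theta)))
        log_lik += math.log(p)
    return math.exp(-log_lik / len(doc))

# Toy model (assumed): two topics over a five-word vocabulary.
phi = [
    {"money": 1 / 3, "loan": 1 / 3, "bank": 1 / 3},
    {"river": 1 / 3, "stream": 1 / 3, "bank": 1 / 3},
]
held_out = "money loan bank money".split()

ppl_good = perplexity(held_out, [0.9, 0.1], phi)  # mixture matches the text
ppl_bad = perplexity(held_out, [0.1, 0.9], phi)   # mixture does not match
print(ppl_good, ppl_bad)
```

In practice the same score would be computed for models with different numbers of topics, and the topic count with the lowest held-out perplexity would be preferred.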

15 Input/Output. INPUT: word-document counts. OUTPUT: topic assignments to each word; likely words in each topic; likely topics for a document (“gist”)

16 Example: topics from an educational corpus (TASA). 37K docs, 26K words, 1700 topics, e.g.:
PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW

17 Polysemy
PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW

18 Three documents with the word “play” (numbers & colors → topic assignments)

19 Overview: I. Probabilistic Topic Models; II. Analyzing Scientific Topics: PNAS; III. Analyzing Topics of Federal Funding; IV. Analyzing Enron Email; V. Extensions of the model; VI. Conclusion

20 PNAS Topics: model applied to PNAS abstracts (Proceedings of the National Academy of Sciences)

21 A selection of topics (out of 300)
FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN

22 PNAS Topics and classes. PNAS authors provide class designations: major (Biological, Physical, Social Sciences) and minor (33 separate disciplines). Find topics diagnostic of classes: validate the “reality” of classes; show how disciplines overlap topically.

23

24 Neurobiology, Topic 210: SYNAPTIC NEURONS POSTSYNAPTIC HIPPOCAMPAL SYNAPSES LTP PRESYNAPTIC TRANSMISSION POTENTIATION PLASTICITY EXCITATORY RELEASE DENDRITIC PYRAMIDAL HIPPOCAMPUS

25 Evolution / Population biology, Topic 280: SPECIES SELECTION EVOLUTION GENETIC POPULATIONS POPULATION VARIATION NATURAL EVOLUTIONARY FITNESS ADAPTIVE RATES THEORY TRAITS DIVERSITY

26 Mathematics / Applied Mathematics, Topic 39: THEORY TIME SPACE GIVEN PROBLEM SHAPE SIMPLE DIMENSIONAL PAPER NUMBER CASE LOCAL TERMS SYMMETRY RANDOM

27 Topic Dynamics. We have the distribution over topics for PNAS abstracts from 1991 to 2001. Analysis of dynamics: perform linear trend analysis for each topic; “hot topics” go up, “cold topics” go down.
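A linear trend per topic amounts to fitting a straight line to the topic's yearly share and reading off the sign of the slope. The sketch below uses an ordinary least-squares slope; the yearly proportions are hypothetical placeholder numbers, not values from the PNAS analysis, and no significance test is included.

```python
def linear_trend(years, values):
    """Ordinary least-squares slope of values regressed on years."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return sxy / sxx

years = list(range(1991, 2002))  # 1991..2001 inclusive
# Hypothetical mean topic proportion per year (placeholder data).
topic_share = {
    "rising": [0.010 + 0.001 * i for i in range(11)],
    "falling": [0.030 - 0.002 * i for i in range(11)],
}

slopes = {name: linear_trend(years, ys) for name, ys in topic_share.items()}
hot = [name for name, s in slopes.items() if s > 0]   # share goes up
cold = [name for name, s in slopes.items() if s < 0]  # share goes down
print(slopes, hot, cold)
```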

28 Hot topics and cold topics
Hot topics:
TOPIC 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE
TOPIC 134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED
TOPIC 179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION
Cold topics:
TOPIC 37: CDNA AMINO SEQUENCE ACID PROTEIN ISOLATED ENCODING CLONED ACIDS IDENTITY CLONE EXPRESSED ENCODES RAT HOMOLOGY
TOPIC 289: KDA PROTEIN PURIFIED MOLECULAR MASS CHROMATOGRAPHY POLYPEPTIDE GEL SDS BAND APPARENT LABELED IDENTIFIED FRACTION DETECTED
TOPIC 75: ANTIBODY ANTIBODIES MONOCLONAL ANTIGEN IGG MAB SPECIFIC EPITOPE HUMAN MABS RECOGNIZED SERA EPITOPES DIRECTED NEUTRALIZING
Slide annotations: NOBEL 1987, NOBEL 2002

29 Overview: I. Probabilistic Topic Models; II. Analyzing Scientific Topics: PNAS; III. Analyzing Topics of Federal Funding; IV. Analyzing Enron Email; V. Extensions of the model; VI. Conclusion

30 Analyzing Topics of Funding: Get a large-scale overview of funding for social sciences. How similar are different funding programs? How is funding distributed over topics?

31 Dataset: 22,189 abstracts from grants active in 2003. NIH: NIMH (National Institute of Mental Health), NCI (National Cancer Institute). NSF: SBE (Social, Behavioral and Economic Sciences), BIO (Biological Sciences).

32 Extracted topics (1..20)

33 80 interpretable topics (out of 100)

34 Likely topics for NIH; likely topics for NSF-SBE; likely topics for NSF-BIO

35 Program similarity using topics

36

37 2D visualization of funding programs (NIH, NSF-BIO, NSF-SBE): nearby programs support similar topics

38 Funding Amounts per Topic. We have $ funding per grant. We have % of topics for each grant. We can solve for the $ amount per topic → What are expensive topics?
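The slide does not say how the dollar amount per topic is solved for. One simple reading, assumed here, is to split each grant's dollars across topics in proportion to the grant's topic percentages and then sum per topic; a regression of grant amounts on topic proportions would be an alternative. The grant amounts and topic names below are invented.

```python
def dollars_per_topic(grants):
    """Split each grant's funding across topics by its topic fractions and sum."""
    totals = {}
    for amount, topic_fractions in grants:
        for topic, fraction in topic_fractions.items():
            totals[topic] = totals.get(topic, 0.0) + amount * fraction
    return totals

# Hypothetical grants: (dollar amount, fraction of the abstract in each topic).
grants = [
    (100_000, {"memory": 0.5, "imaging": 0.5}),
    (300_000, {"imaging": 1.0}),
    (200_000, {"memory": 0.25, "survey methods": 0.75}),
]

totals = dollars_per_topic(grants)
# Rank topics from most to least funded.
ranked = sorted(totals, key=totals.get, reverse=True)
print(totals, ranked)
```

Because every grant's fractions sum to one, the per-topic totals add up to the total funding, so no dollars are lost or double-counted.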

39 High $$$ topics; Low $$$ topics

40 Overview: I. Probabilistic Topic Models; II. Analyzing Scientific Topics: PNAS; III. Analyzing Topics of Federal Funding; IV. Analyzing Enron Email; V. Extensions of the model; VI. Conclusion

41 Enron email data: 500,000 emails; 5,000 authors; 1999-2002

42 Enron topics
TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES
GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT
ENVIRONMENTAL AIR MTBE EMISSIONS CLEAN EPA PENDING SAFETY WATER GASOLINE
FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED
POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC
STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU
TIMELINE: May 22, 2000: start of California energy crisis

43 Overview: I. Probabilistic Topic Models; II. Analyzing Scientific Topics: PNAS; III. Analyzing Topics of Federal Funding; IV. Analyzing Enron Email; V. Extensions of the model; VI. Conclusion

44 Pennsylvania Gazette, 1728-1800: 80,000 articles (courtesy of David Newman & Sharon Block, History Department, UC Irvine)

45 Historical Trends in the Pennsylvania Gazette (courtesy of David Newman & Sharon Block, UC Irvine)
STATE GOVERNMENT CONSTITUTION LAW UNITED POWER CITIZEN PEOPLE PUBLIC CONGRES
SILK COTTON DITTO WHITE BLACK LINEN CLOTH WOMEN BLUE WORSTED

46 Learning Topic Hierarchies. In the regular topic model, there are no relations between topics. Alternative: hierarchical topic organization (diagram: a tree of topic 1 to topic 7). Apply to Psych Review abstracts.

47

48 Integrating Topics and Syntax. Syntactic dependencies → short-range dependencies. Semantic dependencies → long-range dependencies. [Figure: graphical model over the variables z, w, s.] Semantic state: generate words from topic model. Syntactic states: generate words from HMM. (Griffiths, Steyvers, Blei, & Tenenbaum, 2004)

49 ...
IN BY WITH ON AS FROM TO FOR
THE A AN THIS THEIR ITS EACH ONE
IS ARE BE HAS HAVE WAS WERE AS
BASED PRESENTED DISCUSSED PROPOSED DESCRIBED SUCH USED DERIVED
THEORY MODEL PROCESSES MODELS SYSTEM PROCESS EFFECTS INFORMATION
ATTENTION SEARCH VISUAL PROCESSING TASK PERFORMANCE INFORMATION ATTENTIONAL
MEMORY TERM LONG SHORT RETRIEVAL STORAGE MEMORIES AMNESIA
IQ BEHAVIOR EVOLUTIONARY ENVIRONMENT GENES HERITABILITY GENETIC SELECTION
DRUG AROUSAL NEURAL BRAIN HABITUATION BIOLOGICAL TOLERANCE BEHAVIORAL
SOCIAL SELF ATTITUDE IMPLICIT ATTITUDES PERSONALITY JUDGMENT PERCEPTION
(S) THE SEARCH IN LONG TERM MEMORY ...
(S) A MODEL OF VISUAL ATTENTION ...

50 Random sentence generation. LANGUAGE:
[S] RESEARCHERS GIVE THE SPEECH
[S] THE SOUND FEEL NO LISTENERS
[S] WHICH WAS TO BE MEANING
[S] HER VOCABULARIES STOPPED WORDS
[S] HE EXPRESSLY WANTED THAT BETTER VOWEL

51 Conclusion: Unsupervised extraction of content from large text collections. Topics provide a quick overview of content. Topic models: text mining / information retrieval; psychology / memory. Connection? Good semantic memory models for finding semantically relevant information might also be good information retrieval models.

52 Psych Review abstracts: all 1281 abstracts since 1967; 50 topics. Examples:
SIMILARITY CATEGORY CATEGORIES RELATIONS DIMENSIONS FEATURES STRUCTURE SIMILAR REPRESENTATION OBJECTS
STIMULUS CONDITIONING LEARNING RESPONSE STIMULI RESPONSES AVOIDANCE REINFORCEMENT CLASSICAL DISCRIMINATION
MEMORY RETRIEVAL RECALL ITEMS INFORMATION TERM RECOGNITION ITEMS LIST ASSOCIATIVE
GROUP INDIVIDUAL GROUPS OUTCOMES INDIVIDUALS GROUPS OUTCOMES INDIVIDUALS DIFFERENCES INTERACTION
EMOTIONAL EMOTION BASIC EMOTIONS AFFECT STATES EXPERIENCES AFFECTIVE AFFECTS RESEARCH
...

