Finding scientific topics Tom Griffiths Stanford University Mark Steyvers UC Irvine.


1 Finding scientific topics Tom Griffiths Stanford University Mark Steyvers UC Irvine

2 Why map knowledge? Quickly grasp important themes in a new field Synthesize content of an existing field Discover targets for funding and research

3 Why map knowledge? Quickly grasp important themes in a new field Synthesize content of an existing field Discover targets for funding and research INFORMATION OVERLOAD

4 Apoptosis + Plant Biology

5 Apoptosis + Medicine

6

7

8

9 probabilistic generative model Apoptosis + Medicine

10 statistical inference Apoptosis + Medicine

11 1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results –Topics and classes –Mapping science –Topic dynamics 4. Future directions –Tagging abstracts


13 A generative model for documents. Each document is a mixture of topics and each word is chosen from a single topic, both governed by model parameters (Blei, Ng, & Jordan, 2003)

14 A generative model for documents
word          topic 1: $P(w|z = 1) = \phi^{(1)}$   topic 2: $P(w|z = 2) = \phi^{(2)}$
HEART         0.2                                0.0
LOVE          0.2                                0.0
SOUL          0.2                                0.0
TEARS         0.2                                0.0
JOY           0.2                                0.0
SCIENTIFIC    0.0                                0.2
KNOWLEDGE     0.0                                0.2
WORK          0.0                                0.2
RESEARCH      0.0                                0.2
MATHEMATICS   0.0                                0.2

15 Choose mixture weights for each document, generate “bag of words”. $\theta = \{P(z = 1), P(z = 2)\}$, with example weights {0, 1}, {0.25, 0.75}, {0.5, 0.5}, {0.75, 0.25}, {1, 0}. Generated words: MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
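
The two-topic example on slides 14 and 15 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `generate_document`, the document length of 10 and the random seed are hypothetical choices, not taken from the talk; the word probabilities and mixture weights are the ones shown on the slides.

```python
import random

# The two toy topics from slide 14: each topic is a distribution over words.
TOPICS = {
    1: {"HEART": 0.2, "LOVE": 0.2, "SOUL": 0.2, "TEARS": 0.2, "JOY": 0.2},
    2: {"SCIENTIFIC": 0.2, "KNOWLEDGE": 0.2, "WORK": 0.2,
        "RESEARCH": 0.2, "MATHEMATICS": 0.2},
}


def generate_document(theta, n_words, rng):
    """Sample a bag of words given mixture weights theta = [P(z=1), P(z=2)].

    generate_document is a hypothetical helper name used for illustration.
    """
    words = []
    for _ in range(n_words):
        # First pick a topic for this word, then pick a word from that topic.
        z = rng.choices([1, 2], weights=theta)[0]
        vocab = list(TOPICS[z])
        probs = list(TOPICS[z].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return words


rng = random.Random(0)  # arbitrary seed, only for repeatability
for theta in ([0, 1], [0.25, 0.75], [0.5, 0.5], [0.75, 0.25], [1, 0]):
    print(theta, " ".join(generate_document(theta, 10, rng)))
```

With weights {0, 1} every word comes from topic 2, with {1, 0} every word comes from topic 1, and intermediate weights give mixed documents.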

16 A generative model for documents. Called Latent Dirichlet Allocation (LDA). Introduced by Blei, Ng, and Jordan (2003), reinterpretation of PLSI (Hofmann, 2001). [Graphical model: $\theta$ → z → w, one z and w per word]

17 SVD (Dumais, Landauer): [words × documents] = U [words × dims] · D [dims × dims] · $V^{T}$ [dims × documents]
LDA: P(w) [words × documents] = P(w|z) [words × topics] · P(z) [topics × documents]

18 1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results –Topics and classes –Mapping science –Topic dynamics 4. Future directions –Tagging abstracts

19 Inverting the generative model Maximum likelihood estimation (EM) Variational EM (Blei, Ng & Jordan, 2003) Bayesian inference

20 The sum in the denominator runs over $T^n$ terms. The full posterior is tractable only up to a constant
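
For reference, the posterior that these two remarks describe can be written out as below. This is a reconstruction under the usual LDA notation (w for the n observed words, z for their topic assignments, T for the number of topics), not a copy of the equation on the slide.

```latex
P(\mathbf{z} \mid \mathbf{w}) \;=\; \frac{P(\mathbf{w}, \mathbf{z})}{\sum_{\mathbf{z}'} P(\mathbf{w}, \mathbf{z}')}
```

Each of the n words can be assigned to any of T topics, so the sum in the denominator ranges over $T^n$ assignments. The numerator is cheap to evaluate for any single z, which is why the posterior is known only up to its normalizing constant.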

21 Markov chain Monte Carlo Sample from a Markov chain which converges to target distribution Allows sampling from an unnormalized posterior distribution Can compute approximate statistics from intractable distributions
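
The point that MCMC only needs an unnormalized target can be shown with a small sketch. Note that this uses a random-walk Metropolis sampler rather than the Gibbs sampler of the talk, because it makes the cancellation of the constant explicit; the target density, step size, sample count and seed are all hypothetical choices for illustration.

```python
import math
import random


def unnormalized(x):
    """Target known only up to a constant: a standard normal without 1/sqrt(2*pi)."""
    return math.exp(-0.5 * x * x)


def metropolis(n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for the unnormalized target above."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        # Accept with probability min(1, p(proposal) / p(x)).
        # Any normalizing constant would cancel in this ratio.
        if rng.random() < unnormalized(proposal) / unnormalized(x):
            x = proposal
        samples.append(x)
    return samples


samples = metropolis(20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(var, 2))  # should be close to 0 and 1
```

The sample mean and variance approximate those of the normalized distribution even though the sampler never computes the normalizing constant.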

22 A visual example: Bars. pixel = word, image = document; sample each pixel from a mixture of topics

23

24

25 Interpretable decomposition SVD gives a basis for the data, but not an interpretable one The true basis is not orthogonal, so rotation does no good

26 Bayesian model selection. How many topics do we need? A Bayesian would consider the posterior: $P(T|w) \propto P(w|T)\,P(T)$. Computing P(w|T) involves summing over assignments z
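
Because P(w|T) requires a sum over all assignments z, it has to be approximated from samples. One simple, if high-variance, estimator is the harmonic mean of P(w|z, T) over samples of z drawn from the posterior; the slide does not say which estimator the talk used, so treat this as an assumption. The sketch below computes it in log space, and the log-likelihood values are made up for illustration.

```python
import math


def log_harmonic_mean(log_likelihoods):
    """Return the log of the harmonic mean of exp(log_likelihoods), computed stably."""
    neg = [-ll for ll in log_likelihoods]
    m = max(neg)
    # log-sum-exp of the negated log-likelihoods
    log_sum = m + math.log(sum(math.exp(v - m) for v in neg))
    return math.log(len(neg)) - log_sum


# Hypothetical per-sample values of log P(w | z, T) for two settings of T.
sampled_log_likelihoods = {
    10: [-1050.0, -1048.5, -1049.2],
    100: [-1010.3, -1009.1, -1011.0],
}
log_evidence = {T: log_harmonic_mean(lls) for T, lls in sampled_log_likelihoods.items()}
best_T = max(log_evidence, key=log_evidence.get)
print(best_T)
```

With a uniform prior P(T), the T with the largest estimated log P(w|T) is also the one with the largest posterior.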

27-29 Bayesian model selection [Figure: P(w | T) plotted over possible corpora w, for T = 10 and T = 100]

30 Back to the bars

31 1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results –Topics and classes –Mapping science –Topic dynamics 4. Future directions –Tagging abstracts

32 Corpus preprocessing Used all D = 28,154 abstracts from 1991-2001 Used any word occurring in at least five abstracts, not on “stop” list (W = 20,551) Segmentation by any delimiting character, total of n = 3,026,970 word tokens in corpus Also, PNAS class designations for 2001 (thanks to Kevin Boyack)

33 Running the algorithm Memory requirements linear in T(W+D), runtime proportional to nT T = 50, 100, 200, 300, 400, 500, 600, (1000) Ran 8 chains for each T, burn-in of 1000 iterations, 10 samples/chain at a lag of 100 All runs completed in under 30 hours on BlueHorizon supercomputer at San Diego
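
The scaling claims on this slide can be turned into rough numbers using the corpus statistics from slide 32. The helper names below are hypothetical, and the figures are counts of table entries and per-sweep topic evaluations, not measured memory or runtime.

```python
W = 20_551      # vocabulary size (slide 32)
D = 28_154      # number of abstracts (slide 32)
n = 3_026_970   # word tokens in the corpus (slide 32)


def count_table_entries(T, W, D):
    """Entries in the word-topic and document-topic count tables: T * (W + D)."""
    return T * (W + D)


def updates_per_sweep(T, n):
    """Topic probabilities evaluated in one pass over the corpus: n * T."""
    return n * T


for T in (50, 100, 200, 300, 400, 500, 600, 1000):
    print(T, count_table_entries(T, W, D), updates_per_sweep(T, n))

# 8 chains with 10 samples each gives the number of samples available per T.
total_samples = 8 * 10
print(total_samples)
```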

34 How many topics?

35

36 A selection of topics
FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN

37 A selection of topics
PARASITE PARASITES FALCIPARUM MALARIA HOST PLASMODIUM ERYTHROCYTES ERYTHROCYTE MAJOR LEISHMANIA INFECTED BLOOD INFECTION MOSQUITO INVASION TRYPANOSOMA CRUZI BRUCEI HUMAN HOSTS
ADULT DEVELOPMENT FETAL DAY DEVELOPMENTAL POSTNATAL EARLY DAYS NEONATAL LIFE DEVELOPING EMBRYONIC BIRTH NEWBORN MATERNAL PRESENT PERIOD ANIMALS NEUROGENESIS ADULTS
CHROMOSOME REGION CHROMOSOMES KB MAP MAPPING CHROMOSOMAL HYBRIDIZATION ARTIFICIAL MAPPED PHYSICAL MAPS GENOMIC DNA LOCUS GENOME GENE HUMAN SITU CLONES
MALE FEMALE MALES FEMALES SEX SEXUAL BEHAVIOR OFFSPRING REPRODUCTIVE MATING SOCIAL SPECIES REPRODUCTION FERTILITY TESTIS MATE GENETIC GERM CHOICE SRY
STUDIES PREVIOUS SHOWN RESULTS RECENT PRESENT STUDY DEMONSTRATED INDICATE WORK SUGGEST SUGGESTED USING FINDINGS DEMONSTRATE REPORT INDICATED CONSISTENT REPORTS CONTRAST
MECHANISM MECHANISMS UNDERSTOOD POORLY ACTION UNKNOWN REMAIN UNDERLYING MOLECULAR PS REMAINS SHOW RESPONSIBLE PROCESS SUGGEST UNCLEAR REPORT LEADING LARGELY KNOWN
MODEL MODELS EXPERIMENTAL BASED PROPOSED DATA SIMPLE DYNAMICS PREDICTED EXPLAIN BEHAVIOR THEORETICAL ACCOUNT THEORY PREDICTS COMPUTER QUANTITATIVE PREDICTIONS CONSISTENT PARAMETERS


39 1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results –Topics and classes –Mapping science –Topic dynamics 4. Future directions –Tagging abstracts

40 Topics and classes PNAS authors provide class designations –major: Biological, Physical, Social Sciences –minor: 33 separate disciplines* Find topics diagnostic of classes –validate “reality” of classes –show topics pick out meaningful structure (classes, and the relations between them)

41

42 Topic 210: SYNAPTIC NEURONS POSTSYNAPTIC HIPPOCAMPAL SYNAPSES LTP PRESYNAPTIC TRANSMISSION POTENTIATION PLASTICITY EXCITATORY RELEASE DENDRITIC PYRAMIDAL HIPPOCAMPUS DENDRITES CA1 STIMULATION TERMINALS SYNAPSE

43 Topic 201: RESISTANCE RESISTANT DRUG DRUGS SENSITIVE MDR MULTIDRUG SUSCEPTIBLE SELECTED GLYCOPROTEIN SENSITIVITY PGP AGENTS CONFERS MDR1 CYTOTOXIC CONFERRED CHEMOTHERAPEUTIC EFFLUX INCREASED

44 Topic 280: SPECIES SELECTION EVOLUTION GENETIC POPULATIONS POPULATION VARIATION NATURAL EVOLUTIONARY FITNESS ADAPTIVE RATES THEORY TRAITS DIVERSITY EXPECTED NEUTRAL EVOLVED COMPETITION HISTORY

45 Topic 222: CORTEX BRAIN SUBJECTS TASK AREAS REGIONS FUNCTIONAL LEFT MEMORY TEMPORAL IMAGING PREFRONTAL CEREBRAL TASKS FRONTAL AREA TOMOGRAPHY EMISSION POSITRON CORTICAL

46 Topic 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE EARTH ECOLOGICAL CHANGE TIME ECOSYSTEM

47 Topic 39: THEORY TIME SPACE GIVEN PROBLEM SHAPE SIMPLE DIMENSIONAL PAPER NUMBER CASE LOCAL TERMS SYMMETRY RANDOM EQUATION CLASSICAL COMPLEXITY NUMERICAL PROPERTIES

48 1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results –Topics and classes –Mapping science –Topic dynamics 4. Future directions –Tagging abstracts

49 Mapping science Topics provide dimensionality reduction Some applications require visualization (and even lower dimensionality) Low-dimensional representation from methods for analysis of compositional data

50

51

52

53 1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results –Topics and classes –Mapping science –Topic dynamics 4. Future directions –Tagging abstracts

54 Topic dynamics We have the distribution over topics for abstracts from 1991 to 2001 Analysis of dynamics: –perform linear trend analysis for each topic –“hot topics” go up, “cold topics” go down
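
The linear trend analysis can be sketched as an ordinary least-squares slope of each topic's yearly weight against the year. The topic names and yearly weights below are invented for illustration and are not results from the talk; `slope` is a hypothetical helper.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


years = list(range(1991, 2002))  # 1991 to 2001 inclusive

# Hypothetical mean topic weight per year (made-up numbers).
theta_by_year = {
    "topic_a": [0.010 + 0.001 * i for i in range(11)],  # rising
    "topic_b": [0.030 - 0.002 * i for i in range(11)],  # falling
    "topic_c": [0.020] * 11,                            # flat
}

trends = {name: slope(years, ys) for name, ys in theta_by_year.items()}
hot = max(trends, key=trends.get)    # largest positive slope
cold = min(trends, key=trends.get)   # most negative slope
print(hot, cold)
```

In practice one would also check whether each slope is statistically distinguishable from zero before calling a topic hot or cold.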

55-57 Cold topics and hot topics
Hot topics:
Topic 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE
Topic 134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED
Topic 179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION
Cold topics:
Topic 37: CDNA AMINO SEQUENCE ACID PROTEIN ISOLATED ENCODING CLONED ACIDS IDENTITY CLONE EXPRESSED ENCODES RAT HOMOLOGY
Topic 289: KDA PROTEIN PURIFIED MOLECULAR MASS CHROMATOGRAPHY POLYPEPTIDE GEL SDS BAND APPARENT LABELED IDENTIFIED FRACTION DETECTED
Topic 75: ANTIBODY ANTIBODIES MONOCLONAL ANTIGEN IGG MAB SPECIFIC EPITOPE HUMAN MABS RECOGNIZED SERA EPITOPES DIRECTED NEUTRALIZING

58 1. A generative model for documents 2. Discovering topics with Gibbs sampling 3. Results –Topics and classes –Mapping science –Topic dynamics 4. Future directions –Tagging abstracts

59 Future directions Including different kinds of knowledge –citations (Hofmann & Cohn, 2001) –author, title, keywords, other fields –word order information An example: scientific syntax and semantics

60 Scientific syntax and semantics. Factorization of language based on statistical dependency patterns: long-range, document-specific dependencies (semantics: probabilistic topics); short-range dependencies constant across all documents (syntax: probabilistic regular grammar). [Graphical model: $\theta$ → z → w for each word, with class variables x linked in a chain across words]

61 Example parameters for the composite model
Topic z = 1 (0.4): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2
Topic z = 2 (0.6): SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2
Syntactic class: THE 0.6, A 0.3, MANY 0.1
Syntactic class: OF 0.6, FOR 0.3, BETWEEN 0.1
Class labels on the diagram: x = 1, x = 2, x = 3
Transition probabilities between classes: 0.9, 0.1, 0.2, 0.8, 0.7, 0.3

62 Same diagram, generated so far: THE ………………………………

63 Same diagram, generated so far: THE LOVE ……………………

64 Same diagram, generated so far: THE LOVE OF ………………

65 Same diagram, generated so far: THE LOVE OF RESEARCH ……
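
The word-by-word generation shown on slides 61 to 65 can be imitated with a small sketch of a composite syntax and semantics model. Several things here are assumptions, since the transcript does not preserve the diagram layout: the class names `det`, `prep` and `sem` are hypothetical labels, and the way the six transition probabilities (0.9, 0.1, 0.2, 0.8, 0.7, 0.3) are attached to class pairs is a guess chosen so that "THE LOVE OF RESEARCH" is a likely output.

```python
import random

# Topic word distributions and topic weights from slide 61.
TOPICS = {
    1: {"HEART": 0.2, "LOVE": 0.2, "SOUL": 0.2, "TEARS": 0.2, "JOY": 0.2},
    2: {"SCIENTIFIC": 0.2, "KNOWLEDGE": 0.2, "WORK": 0.2,
        "RESEARCH": 0.2, "MATHEMATICS": 0.2},
}
THETA = {1: 0.4, 2: 0.6}

# Word distributions of the two syntactic classes (class names are hypothetical).
CLASS_WORDS = {
    "det": {"THE": 0.6, "A": 0.3, "MANY": 0.1},
    "prep": {"OF": 0.6, "FOR": 0.3, "BETWEEN": 0.1},
}

# ASSUMPTION: this assignment of the six probabilities to transitions is a guess.
TRANSITIONS = {
    "det": {"sem": 0.9, "prep": 0.1},
    "sem": {"prep": 0.8, "sem": 0.2},
    "prep": {"det": 0.7, "sem": 0.3},
}


def draw(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    return rng.choices(list(dist), weights=list(dist.values()))[0]


def generate(n_words, rng, start="det"):
    """Generate words by walking the class chain; the 'sem' class emits from a topic."""
    x = start
    words = []
    for _ in range(n_words):
        if x == "sem":
            z = draw(THETA, rng)            # choose a topic for this word
            words.append(draw(TOPICS[z], rng))
        else:
            words.append(draw(CLASS_WORDS[x], rng))
        x = draw(TRANSITIONS[x], rng)       # move to the next class
    return words


sentence = generate(4, random.Random(0))
print(" ".join(sentence))
```

The class chain carries the short-range, document-independent structure, while the topic weights carry the document-specific content.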

66 Semantic topics

67 Syntactic classes
5           8           14           25            26              30            33
IN          ARE         THE          SUGGEST       LEVELS          RESULTS       BEEN
FOR         WERE        THIS         INDICATE      NUMBER          ANALYSIS      MAY
ON          WAS         ITS          SUGGESTING    LEVEL           DATA          CAN
BETWEEN     IS          THEIR        SUGGESTS      RATE            STUDIES       COULD
DURING      WHEN        AN           SHOWED        TIME            STUDY         WELL
AMONG       REMAIN      EACH         REVEALED      CONCENTRATIONS  FINDINGS      DID
FROM        REMAINS     ONE          SHOW          VARIETY         EXPERIMENTS   DOES
UNDER       REMAINED    ANY          DEMONSTRATE   RANGE           OBSERVATIONS  DO
WITHIN      PREVIOUSLY  INCREASED    INDICATING    CONCENTRATION   HYPOTHESIS    MIGHT
THROUGHOUT  BECOME      EXOGENOUS    PROVIDE       DOSE            ANALYSES      SHOULD
THROUGH     BECAME      OUR          SUPPORT       FAMILY          ASSAYS        WILL
TOWARD      BEING       RECOMBINANT  INDICATES     SET             POSSIBILITY   WOULD
INTO        BUT         ENDOGENOUS   PROVIDES      FREQUENCY       MICROSCOPY    MUST
AT          GIVE        TOTAL        INDICATED     SERIES          PAPER         CANNOT
INVOLVING   MERE        PURIFIED     DEMONSTRATED  AMOUNTS         WORK          THEY
AFTER       APPEARED    TILE         SHOWS         RATES           EVIDENCE      ALSO
ACROSS      APPEAR      FULL         SO            CLASS           FINDING
AGAINST     ALLOWED     CHRONIC      REVEAL        VALUES          MUTAGENESIS   BECOME
WHEN        NORMALLY    ANOTHER      DEMONSTRATES  AMOUNT          OBSERVATION   MAG
ALONG       EACH        EXCESS       SUGGESTED     SITES           MEASUREMENTS  LIKELY

68 Abstract tagging Highlight important words in text, to reduce demands on information users Can be done to identify different content: –words assigned to most prevalent topic reveal important themes (see the paper!) –with syntactic/semantic factorization, we can highlight words that determine semantic content
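
The first tagging idea, highlighting words assigned to the most prevalent topic, can be sketched as follows. The function name `tag_prevalent_topic` and the asterisk markup are hypothetical; the six words and their numbers are picked from the tagged abstract on slide 69, on the assumption that the number after each word is its assigned topic or class.

```python
from collections import Counter


def tag_prevalent_topic(tokens, assignments):
    """Mark with *...* every token assigned to the document's most frequent topic."""
    top_topic, _ = Counter(assignments).most_common(1)[0]
    return " ".join(
        f"*{word}*" if z == top_topic else word
        for word, z in zip(tokens, assignments)
    )


# A few word/assignment pairs taken from slide 69 (not a full sentence).
tokens = ["cultural", "transmission", "is", "determined", "by", "selection"]
assignments = [46, 46, 32, 17, 42, 46]

tagged = tag_prevalent_topic(tokens, assignments)
print(tagged)
```

Here the label 46 is the most frequent, so only the three content words carrying it are highlighted.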

69 (PNAS, 1991, vol. 88, 4874-4876) A 23 generalized 49 fundamental 11 theorem 20 of 4 natural 46 selection 46 is 32 derived 17 for 5 populations 46 incorporating 22 both 39 genetic 46 and 37 cultural 46 transmission 46. The 14 phenotype 15 is 32 determined 17 by 42 an 23 arbitrary 49 number 26 of 4 multiallelic 52 loci 40 with 22 two 39 -factor 148 epistasis 46 and 37 an 23 arbitrary 49 linkage 11 map 20, as 43 well 33 as 43 by 42 cultural 46 transmission 46 from 22 the 14 parents 46. Generations 46 are 8 discrete 49 but 37 partially 19 overlapping 24, and 37 mating 46 may 33 be 44 nonrandom 17 at 9 either 39 the 14 genotypic 46 or 37 the 14 phenotypic 46 level 46 (or 37 both 39 ). I 12 show 34 that 47 cultural 46 transmission 46 has 18 several 39 important 49 implications 6 for 5 the 14 evolution 46 of 4 population 46 fitness 46, most 36 notably 4 that 47 there 41 is 32 a 23 time 26 lag 7 in 22 the 14 response 28 to 31 selection 46 such 9 that 47 the 14 future 137 evolution 46 depends 29 on 21 the 14 past 24 selection 46 history 46 of 4 the 14 population 46. (graylevel = “semanticity”, the probability of using LDA over HMM)

70 (PNAS, 1996, vol. 93, 14628-14631) The 14 ''shape 7 '' of 4 a 23 female 115 mating 115 preference 125 is 32 the 14 relationship 7 between 4 a 23 male 115 trait 15 and 37 the 14 probability 7 of 4 acceptance 21 as 43 a 23 mating 115 partner 20, The 14 shape 7 of 4 preferences 115 is 32 important 49 in 5 many 39 models 6 of 4 sexual 115 selection 46, mate 115 recognition 125, communication 9, and 37 speciation 46, yet 50 it 41 has 18 rarely 19 been 33 measured 17 precisely 19, Here 12 I 9 examine 34 preference 7 shape 7 for 5 male 115 calling 115 song 125 in 22 a 23 bushcricket *13 (katydid *48 ). Preferences 115 change 46 dramatically 19 between 22 races 46 of 4 a 23 species 15, from 22 strongly 19 directional 11 to 31 broadly 19 stabilizing 45 (but 50 with 21 a 23 net 49 directional 46 effect 46 ), Preference 115 shape 46 generally 19 matches 10 the 14 distribution 16 of 4 the 14 male 115 trait 15, This 41 is 32 compatible 29 with 21 a 23 coevolutionary 46 model 20 of 4 signal 9 -preference 115 evolution 46, although 50 it 41 does 33 nor 37 rule 20 out 17 an 23 alternative 11 model 20, sensory 125 exploitation 150. Preference 46 shapes 40 are 8 shown 35 to 31 be 44 genetic 11 in 5 origin 7.


72 Conclusion Probabilistic generative models can reveal the structure of knowledge domains We can use these models to –identify important themes –synthesize content –discover targets for funding and research –reduce the demands on information users

73

74 Gibbs sampling. For variables $z = z_1, z_2, \ldots, z_n$: draw $z_i^{(t+1)}$ from $P(z_i \mid z_{-i}, w)$, where $z_{-i} = z_1^{(t+1)}, z_2^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_{i+1}^{(t)}, \ldots, z_n^{(t)}$

75 Gibbs sampling. Need full conditional distributions for the variables. Since we only sample z, we need two counts: the number of times word w is assigned to topic j, and the number of times topic j is used in document d
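
A minimal sketch of a collapsed Gibbs sampler built on exactly these two counts is given below. It assumes symmetric Dirichlet priors with hyperparameters `alpha` (topics per document) and `beta` (words per topic), so that the conditional for topic j is proportional to (word-topic count + beta) / (topic total + W * beta) times (document-topic count + alpha), with the current token excluded from all counts. The function name, default hyperparameters, iteration count and toy corpus are hypothetical choices, not the settings used in the talk.

```python
import random


def gibbs_lda(docs, T, W, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling of topic assignments z; docs are lists of word ids."""
    rng = random.Random(seed)
    nwt = [[0] * T for _ in range(W)]   # times word w is assigned to topic j
    ndt = [[0] * T for _ in docs]       # times topic j is used in document d
    nt = [0] * T                        # total tokens assigned to topic j
    z = []
    for d, doc in enumerate(docs):      # random initial assignments
        zd = []
        for w in doc:
            j = rng.randrange(T)
            zd.append(j)
            nwt[w][j] += 1
            ndt[d][j] += 1
            nt[j] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]
                # Remove the current token so the counts exclude position i.
                nwt[w][j] -= 1
                ndt[d][j] -= 1
                nt[j] -= 1
                # Unnormalized full conditional over the T topics.
                weights = [
                    (nwt[w][k] + beta) / (nt[k] + W * beta) * (ndt[d][k] + alpha)
                    for k in range(T)
                ]
                j = rng.choices(range(T), weights=weights)[0]
                z[d][i] = j
                nwt[w][j] += 1
                ndt[d][j] += 1
                nt[j] += 1
    return z, nwt, ndt


# Toy corpus: word ids 0-1 and 2-3 form two loose groups (made-up data).
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 0, 1, 2, 3]]
z, nwt, ndt = gibbs_lda(docs, T=2, W=4, iters=50)
print(z)
```

The denominator of the document term is the same for every topic, so it is dropped; `rng.choices` normalizes the weights itself.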

76-84 Gibbs sampling [Figure sequence: state of the sampler at iterations 1, 2, …, 1000]

