ICS 278: Data Mining. Lecture 14: Document Clustering and Topic Extraction
Padhraic Smyth, Department of Information and Computer Science, University of California, Irvine
Note: many of the slides on topic models were adapted from the presentation by Griffiths and Steyvers at the Beckman National Academy of Sciences Symposium on "Mapping Knowledge Domains", Beckman Center, UC Irvine, May

Text Mining: Information Retrieval, Text Classification, Text Clustering, Information Extraction

Document Clustering
Set of documents D in term-vector form
- no class labels this time
- want to group the documents into K groups or into a taxonomy
- each cluster hypothetically corresponds to a "topic"
Methods:
- any of the well-known clustering methods
- k-means, e.g., "spherical" k-means (normalize document vectors, cluster by cosine similarity)
- hierarchical clustering
- probabilistic model-based clustering, e.g., mixtures of multinomials
Single-topic versus multiple-topic models
- extensions to author-topic models
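As a concrete illustration of the "spherical" k-means bullet, here is a minimal sketch (the function name and the naive first-k initialization are mine, not from the lecture):

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50):
    """Spherical k-means sketch: rows of X are document term vectors.
    All vectors and centroids are L2-normalized, so assigning by maximum
    dot product is assigning by cosine similarity."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centroids = X[:k].copy()                 # naive init: first k documents
    for _ in range(n_iter):
        labels = np.argmax(X @ centroids.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)
                centroids[j] = c / np.linalg.norm(c)   # re-normalize the mean
    return labels, centroids
```

Normalizing both documents and centroids makes the dot product a cosine similarity, which behaves much better than Euclidean distance on sparse term vectors.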

Mixture Model Clustering

Conditional independence model for each component (often quite useful to first order)
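The conditional-independence mixture on this slide can be written out as:

```latex
p(\mathbf{x}) \;=\; \sum_{k=1}^{K} \pi_k \, p(\mathbf{x} \mid c = k)
             \;=\; \sum_{k=1}^{K} \pi_k \prod_{j=1}^{d} p(x_j \mid c = k)
```

Each component k has mixture weight π_k, and within a component the d term counts are treated as independent.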

Mixtures of Documents (figure: terms-by-documents matrix, with one block of documents generated by Component 1 and another by Component 2)


(figure: the same terms-by-documents matrix, with the cluster labels C1 and C2 treated as missing data)

E-Step: estimate the component membership probabilities P(C1 | x), P(C2 | x) for each document x, given the current parameter estimates

M-Step: use the "fractionally" weighted data to get new estimates of the parameters
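The E-step and M-step above can be sketched for a mixture of multinomials as follows (a minimal illustration; variable names are mine, and a small pseudocount keeps the M-step numerically safe):

```python
import numpy as np

def multinomial_mixture_em(X, K, n_iter=100, alpha=1e-2, seed=0):
    """EM for a mixture of multinomials over a docs-by-terms count matrix X."""
    rng = np.random.default_rng(seed)
    D, W = X.shape
    pi = np.full(K, 1.0 / K)                    # mixture weights
    theta = rng.dirichlet(np.ones(W), size=K)   # per-component word distributions
    for _ in range(n_iter):
        # E-step: responsibilities r[d, k] = P(component k | document d)
        log_r = np.log(pi) + X @ np.log(theta).T
        log_r -= log_r.max(axis=1, keepdims=True)   # for numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the fractionally weighted data
        pi = r.mean(axis=0)
        counts = r.T @ X + alpha
        theta = counts / counts.sum(axis=1, keepdims=True)
    return pi, theta, r
```

The responsibilities r play the role of the P(C1|x1), P(C2|x1), ... entries on the slide.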

A Document Cluster
Most likely terms in Component 5 (weight = 0.08): write, drive, problem, mail, articl, hard, work, system, good, time
Highest-lift terms in Component 5 (weight = 0.08, lift = p(t|k) / p(t)): scsi, drive, hard, card, format, softwar, memori, install, disk, engin

Another Document Cluster
Most likely terms in Component 1 (weight = 0.11): articl, good, dai, fact, god, claim, apr, fbi, christian, group
Highest-lift terms in Component 1 (weight = 0.11): fbi, jesu, fire, christian, evid, god, gun, faith, kill, bibl

A topic is represented as a (multinomial) distribution over words
Example topic #1: SPEECH .0691, RECOGNITION .0412, SPEAKER .0288, PHONEME .0224, CLASSIFICATION .0154, SPEAKERS .0140, FRAME .0135, PHONETIC .0119, PERFORMANCE .0111, ACOUSTIC .0099, BASED .0098, PHONEMES .0091, UTTERANCES .0091, SET .0089, LETTER .0088, ...
Example topic #2: WORDS .0671, WORD .0557, USER .0230, DOCUMENTS .0205, TEXT .0195, RETRIEVAL .0152, INFORMATION .0144, DOCUMENT .0144, LARGE .0102, COLLECTION .0098, KNOWLEDGE .0087, MACHINE .0080, RELEVANT .0077, SEMANTIC .0076, SIMILARITY .0071, ...

The basic model.... (figure: graphical model with a single class variable C and conditionally independent word variables X1, X2, ..., Xd)

A better model.... (figure: graphical model with multiple hidden variables A, B, C, each connected to the word variables X1, X2, ..., Xd)

A better model....
History:
- latent class models in statistics
- Hofmann applied the idea to documents (SIGIR '99)
- recent extensions, e.g., Blei, Ng, and Jordan (JMLR, 2003)
- variously known as factor/aspect/latent class models

A better model.... Inference can be intractable due to undirected loops!

A better model for documents: the multi-topic model
- a document is generated from multiple components
- multiple components can be active at once
- each component = a multinomial distribution
- parameter estimation is tricky
- very useful: "parses" documents into high-level semantic components

History of multi-topic models
- latent class models in statistics
- Hofmann (1999): original application to documents
- Blei, Ng, and Jordan (2001, 2003): variational methods
- Griffiths and Steyvers (2003): Gibbs sampling approach (very efficient)

"Content" components versus "boilerplate" components (example word lists): GROUP DYNAMIC DISTRIBUTED RESEARCH MULTICAST STRUCTURE COMPUTING SUPPORTED INTERNET STRUCTURES SYSTEMS PART PROTOCOL STATIC SYSTEM GRANT RELIABLE PAPER HETEROGENEOUS SCIENCE GROUPS DYNAMICALLY ENVIRONMENT FOUNDATION PROTOCOLS PRESENT PAPER FL IP META SUPPORT WORK TRANSPORT CALLED ARCHITECTURE NATIONAL DRAFT RECURSIVE ENVIRONMENTS NSF

DIMENSIONAL RULES ORDER GRAPH POINTS CLASSIFICATION TERMS PATH SURFACE RULE PARTIAL GRAPHS GEOMETRIC ACCURACY HIGHER PATHS SURFACES ATTRIBUTES REDUCTION EDGE MESH INDUCTION PAPER NUMBER PLANE CLASSIFIER TERM CONNECTED POINT SET ORDERING DIRECTED GEOMETRY ATTRIBUTE SHOW NODES PLANAR CLASSIFIERS MAGNITUDE VERTICES INFORMATION SYSTEM PAPER LANGUAGE TEXT FILE CONDITIONS PROGRAMMING RETRIEVAL OPERATING CONCEPT LANGUAGES SOURCES STORAGE CONCEPTS FUNCTIONAL DOCUMENT DISK DISCUSSED SEMANTICS DOCUMENTS SYSTEMS DEFINITION SEMANTIC RELEVANT KERNEL ISSUES NATURAL CONTENT ACCESS PROPERTIES CONSTRUCTS AUTOMATICALLY MANAGEMENT IMPORTANT GRAMMAR DIGITAL UNIX EXAMPLES LISP

"Style" components (example word lists): MODEL PAPER TYPE KNOWLEDGE MODELS APPROACHES SPECIFICATION SYSTEM MODELING PROPOSED TYPES SYSTEMS QUALITATIVE CHANGE FORMAL BASE COMPLEX BELIEF VERIFICATION EXPERT QUANTITATIVE ALTERNATIVE SPECIFICATIONS ACQUISITION CAPTURE APPROACH CHECKING DOMAIN MODELED ORIGINAL SYSTEM INTELLIGENT ACCURATELY SHOW PROPERTIES BASES REALISTIC PROPOSE ABSTRACT BASED

A generative model for documents (Blei, Ng, & Jordan, 2003): each document is a mixture of topics; each word is chosen from a single topic, drawn from document-specific mixture parameters.

A generative model for documents: called Latent Dirichlet Allocation (LDA); introduced by Blei, Ng, and Jordan (2003) as a reinterpretation of PLSI (Hofmann, 2001). (figure: plate diagram with per-word topic assignments z and observed words w)

SVD (Dumais, Landauer): the words-by-documents matrix is factored as U D V', i.e., a words-by-dims matrix, a diagonal matrix, and a dims-by-documents matrix of vectors.
LDA: the words-by-documents matrix P(w) is factored as P(w|z) P(z), i.e., a words-by-topics matrix times a topics-by-documents matrix.
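The contrast with SVD can be seen on a toy matrix: the rank-K SVD factors are optimal in a least-squares sense, but their entries can be negative, which is why they resist a topic interpretation, whereas LDA's factors P(w|z) and P(z) are probability distributions (data here is hypothetical):

```python
import numpy as np

# A tiny words-by-documents count matrix: two word groups, two document groups.
X = np.array([[3, 2, 0, 0],
              [2, 3, 0, 1],
              [0, 0, 3, 2],
              [0, 1, 2, 3]], dtype=float)

# SVD gives the best rank-K approximation, but the factors mix signs.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
K = 2
X_hat = (U[:, :K] * s[:K]) @ Vt[:K, :]   # rank-2 reconstruction of X
```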

A generative model for documents
topic 1: P(w | z = 1) = φ(1): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2 (science words 0.0)
topic 2: P(w | z = 2) = φ(2): SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2 (love words 0.0)

Choose mixture weights for each document, then generate a "bag of words"
θ = {P(z = 1), P(z = 2)}: {0, 1}, {0.25, 0.75}, {0.5, 0.5}, {0.75, 0.25}, {1, 0}
Sample documents (one per weight setting): MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
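The generative story above can be sketched directly (a toy illustration using the two topics from the previous slide; the function name is mine):

```python
import numpy as np

def generate_document(theta, phi, vocab, n_words, seed=0):
    """Sample a bag of words: for each word, draw a topic z from the document's
    mixture weights theta, then a word from that topic's distribution phi[z]."""
    rng = np.random.default_rng(seed)
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta), p=theta)
        w = rng.choice(len(vocab), p=phi[z])
        words.append(vocab[w])
    return words

vocab = ["HEART", "LOVE", "SOUL", "TEARS", "JOY",
         "SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]
phi = np.array([[0.2] * 5 + [0.0] * 5,    # topic 1: the "love" words
                [0.0] * 5 + [0.2] * 5])   # topic 2: the "science" words
doc = generate_document([0.25, 0.75], phi, vocab, n_words=20)
```

Varying theta across {0, 1}, {0.25, 0.75}, ..., {1, 0} reproduces the progression of sample documents on the slide.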

Bayesian inference
The full posterior over topic assignments is only tractable up to a normalizing constant: the sum in the denominator is over T^n terms (T topics, n words).

Bayesian sampling
Sample from a Markov chain which converges to the target distribution of interest
- known as Markov chain Monte Carlo (MCMC) in general
A simple version is known as Gibbs sampling:
- say we are interested in estimating p(x, y | D)
- we can approximate this by sampling from p(x | y, D) and p(y | x, D) in an iterative fashion
- useful when the conditionals are known, but the joint distribution is not easy to work with
- converges to the true distribution under fairly broad assumptions
Can compute approximate statistics from intractable distributions
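A minimal Gibbs sampler for a toy target (my choice of example, not from the lecture): a bivariate standard normal with correlation rho, where both full conditionals are one-dimensional normals:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples=20000, burn_in=1000, seed=0):
    """Gibbs sampling sketch. The target is a bivariate standard normal with
    correlation rho, so x | y ~ N(rho*y, 1 - rho^2) and symmetrically for y."""
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    sd = np.sqrt(1 - rho ** 2)
    samples = []
    for t in range(n_samples):
        x = rng.normal(rho * y, sd)   # sample from p(x | y)
        y = rng.normal(rho * x, sd)   # sample from p(y | x)
        if t >= burn_in:
            samples.append((x, y))
    return np.array(samples)
```

Here the joint is actually easy, which makes it possible to check that the chain reproduces the known correlation; in real use (as in LDA below on the slides) only the conditionals are tractable.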

Gibbs sampling
Need the full conditional distributions for the variables. Since we only sample z, we need
P(z_i = j | z_-i, w) ∝ (n^(wi)_-i,j + β) / (n^(·)_-i,j + Wβ) × (n^(di)_-i,j + α) / (n^(di)_-i,· + Tα)
where n^(wi)_-i,j is the number of times word w_i is assigned to topic j, and n^(di)_-i,j is the number of times topic j is used in document d_i (both counts excluding the current assignment z_i).
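The full conditional above leads to the collapsed Gibbs sampler for LDA; a compact sketch (function and variable names are mine; the document-side denominator is constant in j, so it is dropped before normalizing):

```python
import numpy as np

def lda_gibbs(docs, T, W, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampler for LDA. docs: list of lists of word ids in
    [0, W); T: number of topics. Tracks the count tables from the slide."""
    rng = np.random.default_rng(seed)
    nwt = np.zeros((W, T))              # times word w is assigned to topic j
    ndt = np.zeros((len(docs), T))      # times topic j is used in document d
    nt = np.zeros(T)                    # total tokens assigned to topic j
    z = [rng.integers(T, size=len(doc)) for doc in docs]   # random init
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            nwt[w, z[d][i]] += 1; ndt[d, z[d][i]] += 1; nt[z[d][i]] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]             # remove the current assignment
                nwt[w, j] -= 1; ndt[d, j] -= 1; nt[j] -= 1
                p = (nwt[w] + beta) / (nt + W * beta) * (ndt[d] + alpha)
                j = rng.choice(T, p=p / p.sum())
                z[d][i] = j
                nwt[w, j] += 1; ndt[d, j] += 1; nt[j] += 1
    phi = (nwt + beta) / (nt + W * beta)    # estimate of P(w | z)
    theta = (ndt + alpha) / (ndt.sum(axis=1, keepdims=True) + T * alpha)
    return phi, theta, z
```

On a corpus with two disjoint vocabularies, the sampler recovers one topic per vocabulary half.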

Gibbs sampling (figure sequence: the topic assignments of each word are resampled over iterations 1, 2, ..., 1000)

A visual example: Bars. pixel = word, image = document; sample each pixel from a mixture of topics.
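A hypothetical generator for a bars-style corpus (sizes and counts are my arbitrary choices, not the ones used in the lecture): each topic is one horizontal or vertical bar, and each image mixes a few bars.

```python
import numpy as np

def make_bars_corpus(n_images=100, size=5, n_bars=2, mean_count=30, seed=0):
    """Generate 'bars' data: topics are uniform distributions over one row or
    one column of a size-by-size pixel grid; images are bags of pixel counts."""
    rng = np.random.default_rng(seed)
    topics = []
    for r in range(size):                       # horizontal bars
        t = np.zeros((size, size)); t[r, :] = 1
        topics.append(t.ravel() / t.sum())
    for c in range(size):                       # vertical bars
        t = np.zeros((size, size)); t[:, c] = 1
        topics.append(t.ravel() / t.sum())
    topics = np.array(topics)
    images = np.zeros((n_images, size * size), dtype=int)
    for i in range(n_images):
        chosen = rng.choice(len(topics), size=n_bars, replace=False)
        p = topics[chosen].mean(axis=0)         # mixture of the chosen bars
        n = rng.poisson(mean_count)             # number of "words" in the image
        images[i] = rng.multinomial(n, p)
    return images, topics
```

Running a topic model on such a corpus should recover the individual bars as topics, which is what makes this a useful visual sanity check.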


Interpretable decomposition: SVD gives a basis for the data, but not an interpretable one; the true basis is not orthogonal, so rotation does no good.

Bayesian model selection
How many topics T do we need? A Bayesian would consider the posterior P(T | w) ∝ P(w | T) P(T)
P(w | T) involves summing over all possible assignments z, but it can be approximated by sampling
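One common way to approximate P(w | T) from Gibbs samples is the harmonic mean of the per-sample likelihoods P(w | z_s); a sketch, computed in log space for numerical stability (this estimator is known to be high-variance, so treat it as illustrative rather than as the method the lecture necessarily used):

```python
import numpy as np

def harmonic_mean_log_evidence(log_liks):
    """Approximate log P(w | T) from S posterior samples z_1..z_S:
    P(w | T) ~= 1 / mean_s [ 1 / P(w | z_s) ], with log_liks[s] = log P(w | z_s).
    Implemented with a log-sum-exp to avoid underflow."""
    log_liks = np.asarray(log_liks, dtype=float)
    S = len(log_liks)
    m = (-log_liks).max()
    return -(m + np.log(np.exp(-log_liks - m).sum()) - np.log(S))
```

Evaluating this for each candidate T and comparing the resulting log P(w | T) curves is what the following figures summarize.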

Bayesian model selection (figures: log P(w | T) for the example corpus as T ranges from 10 to 100)

Back to the bars data set

PNAS corpus preprocessing
- used all D = 28,154 abstracts from
- used any word occurring in at least five abstracts and not on a "stop" list (W = 20,551)
- segmentation by any delimiting character, for a total of n = 3,026,970 word tokens in the corpus
- also, PNAS class designations for 2001

Running the algorithm
- memory requirements linear in T(W + D), runtime proportional to nT
- T = 50, 100, 200, 300, 400, 500, 600, (1000)
- ran 8 chains for each T, burn-in of 1000 iterations, 10 samples/chain at a lag of 100
- all runs completed in under 30 hours on the BlueHorizon supercomputer at San Diego

A selection of topics (top 20 words each):
FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN

A selection of topics (top 20 words each):
PARASITE PARASITES FALCIPARUM MALARIA HOST PLASMODIUM ERYTHROCYTES ERYTHROCYTE MAJOR LEISHMANIA INFECTED BLOOD INFECTION MOSQUITO INVASION TRYPANOSOMA CRUZI BRUCEI HUMAN HOSTS
ADULT DEVELOPMENT FETAL DAY DEVELOPMENTAL POSTNATAL EARLY DAYS NEONATAL LIFE DEVELOPING EMBRYONIC BIRTH NEWBORN MATERNAL PRESENT PERIOD ANIMALS NEUROGENESIS ADULTS
CHROMOSOME REGION CHROMOSOMES KB MAP MAPPING CHROMOSOMAL HYBRIDIZATION ARTIFICIAL MAPPED PHYSICAL MAPS GENOMIC DNA LOCUS GENOME GENE HUMAN SITU CLONES
MALE FEMALE MALES FEMALES SEX SEXUAL BEHAVIOR OFFSPRING REPRODUCTIVE MATING SOCIAL SPECIES REPRODUCTION FERTILITY TESTIS MATE GENETIC GERM CHOICE SRY
STUDIES PREVIOUS SHOWN RESULTS RECENT PRESENT STUDY DEMONSTRATED INDICATE WORK SUGGEST SUGGESTED USING FINDINGS DEMONSTRATE REPORT INDICATED CONSISTENT REPORTS CONTRAST
MECHANISM MECHANISMS UNDERSTOOD POORLY ACTION UNKNOWN REMAIN UNDERLYING MOLECULAR PS REMAINS SHOW RESPONSIBLE PROCESS SUGGEST UNCLEAR REPORT LEADING LARGELY KNOWN
MODEL MODELS EXPERIMENTAL BASED PROPOSED DATA SIMPLE DYNAMICS PREDICTED EXPLAIN BEHAVIOR THEORETICAL ACCOUNT THEORY PREDICTS COMPUTER QUANTITATIVE PREDICTIONS CONSISTENT PARAMETERS


How many topics?


Scientific syntax and semantics (figure: composite model combining topic variables z with syntactic class variables x)
semantics: probabilistic topics (long-range, document-specific dependencies)
syntax: probabilistic regular grammar (short-range dependencies, constant across all documents)
Factorization of language based on these statistical dependency patterns

Generating text with the composite model
topic 1 (z): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2
topic 2 (z): SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2
syntactic class x = 1: THE 0.6, A 0.3, MANY 0.1
syntactic class x = 3: OF 0.6, FOR 0.3, BETWEEN 0.1
(animation frames: the model emits "THE LOVE OF RESEARCH ..." one word at a time, moving among the syntactic classes x = 1, 3, and the semantic class x = 2, whose words are drawn from the topics)

Semantic topics

Syntactic classes (top words per class):
prepositions: IN, FOR, ON, BETWEEN, DURING, AMONG, FROM, UNDER, WITHIN, THROUGHOUT, THROUGH, TOWARD, INTO, AT, INVOLVING, AFTER, ACROSS, AGAINST, WHEN, ALONG
forms of be/remain/appear: ARE, WERE, WAS, IS, REMAIN, REMAINS, REMAINED, PREVIOUSLY, BECOME, BECAME, BEING, APPEARED, APPEAR, ALLOWED, NORMALLY
determiners/modifiers: THE, THIS, ITS, THEIR, AN, EACH, ONE, ANY, INCREASED, EXOGENOUS, OUR, RECOMBINANT, ENDOGENOUS, TOTAL, PURIFIED, FULL, CHRONIC, ANOTHER, EXCESS
reporting verbs: SUGGEST, INDICATE, SUGGESTING, SUGGESTS, SHOWED, REVEALED, SHOW, DEMONSTRATE, INDICATING, PROVIDE, SUPPORT, INDICATES, PROVIDES, INDICATED, DEMONSTRATED, SHOWS, REVEAL, DEMONSTRATES, SUGGESTED
quantity nouns: LEVELS, NUMBER, LEVEL, RATE, TIME, CONCENTRATIONS, VARIETY, RANGE, CONCENTRATION, DOSE, FAMILY, SET, FREQUENCY, SERIES, AMOUNTS, RATES, CLASS, VALUES, AMOUNT, SITES
research nouns: RESULTS, ANALYSIS, DATA, STUDIES, STUDY, FINDINGS, EXPERIMENTS, OBSERVATIONS, HYPOTHESIS, ANALYSES, ASSAYS, POSSIBILITY, MICROSCOPY, PAPER, WORK, EVIDENCE, FINDING, MUTAGENESIS, OBSERVATION, MEASUREMENTS
modals/auxiliaries: BEEN, MAY, CAN, COULD, WELL, DID, DOES, DO, MIGHT, SHOULD, WILL, WOULD, MUST, CANNOT, ALSO, LIKELY

(PNAS, 1991, vol. 88, ) A 23 generalized 49 fundamental 11 theorem 20 of 4 natural 46 selection 46 is 32 derived 17 for 5 populations 46 incorporating 22 both 39 genetic 46 and 37 cultural 46 transmission 46. The 14 phenotype 15 is 32 determined 17 by 42 an 23 arbitrary 49 number 26 of 4 multiallelic 52 loci 40 with 22 two 39 -factor 148 epistasis 46 and 37 an 23 arbitrary 49 linkage 11 map 20, as 43 well 33 as 43 by 42 cultural 46 transmission 46 from 22 the 14 parents 46. Generations 46 are 8 discrete 49 but 37 partially 19 overlapping 24, and 37 mating 46 may 33 be 44 nonrandom 17 at 9 either 39 the 14 genotypic 46 or 37 the 14 phenotypic 46 level 46 (or 37 both 39 ). I 12 show 34 that 47 cultural 46 transmission 46 has 18 several 39 important 49 implications 6 for 5 the 14 evolution 46 of 4 population 46 fitness 46, most 36 notably 4 that 47 there 41 is 32 a 23 time 26 lag 7 in 22 the 14 response 28 to 31 selection 46 such 9 that 47 the 14 future 137 evolution 46 depends 29 on 21 the 14 past 24 selection 46 history 46 of 4 the 14 population 46. (graylevel = "semanticity", the probability of using LDA over HMM)

(PNAS, 1996, vol. 93, ) The 14 ''shape 7 '' of 4 a 23 female 115 mating 115 preference 125 is 32 the 14 relationship 7 between 4 a 23 male 115 trait 15 and 37 the 14 probability 7 of 4 acceptance 21 as 43 a 23 mating 115 partner 20, The 14 shape 7 of 4 preferences 115 is 32 important 49 in 5 many 39 models 6 of 4 sexual 115 selection 46, mate 115 recognition 125, communication 9, and 37 speciation 46, yet 50 it 41 has 18 rarely 19 been 33 measured 17 precisely 19, Here 12 I 9 examine 34 preference 7 shape 7 for 5 male 115 calling 115 song 125 in 22 a 23 bushcricket *13 (katydid *48 ). Preferences 115 change 46 dramatically 19 between 22 races 46 of 4 a 23 species 15, from 22 strongly 19 directional 11 to 31 broadly 19 stabilizing 45 (but 50 with 21 a 23 net 49 directional 46 effect 46 ), Preference 115 shape 46 generally 19 matches 10 the 14 distribution 16 of 4 the 14 male 115 trait 15, This 41 is 32 compatible 29 with 21 a 23 coevolutionary 46 model 20 of 4 signal 9 -preference 115 evolution 46, although 50 it 41 does 33 nor 37 rule 20 out 17 an 23 alternative 11 model 20, sensory 125 exploitation 150. Preference 46 shapes 40 are 8 shown 35 to 31 be 44 genetic 11 in 5 origin 7.

End of presentation on topic models... switching now to the author-topic model.

Recent Results on Author-Topic Models

Can we model authors, given documents? (More generally: build statistical profiles of entities given sparse observed data.) (figure: authors linked to words)

Model = author-topic distributions + topic-word distributions; parameters learned via Bayesian learning. (figure: authors linked to words through hidden topics)
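The author-topic decomposition described here can be sketched as a generative process (array and function names are hypothetical): for each word, pick one of the document's authors uniformly, then a topic from that author's topic distribution, then a word from that topic's word distribution.

```python
import numpy as np

def generate_at_document(authors, author_topic, topic_word, n_words, seed=0):
    """Author-topic generative sketch. authors: author ids of this document;
    author_topic[a]: author a's distribution over topics; topic_word[z]:
    topic z's distribution over the vocabulary."""
    rng = np.random.default_rng(seed)
    words = []
    for _ in range(n_words):
        a = rng.choice(authors)                          # author for this word
        z = rng.choice(author_topic.shape[1], p=author_topic[a])
        w = rng.choice(topic_word.shape[1], p=topic_word[z])
        words.append(int(w))
    return words
```

Learning inverts this process: given documents and their author lists, Bayesian inference recovers the author-topic and topic-word distributions.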

(figure sequence: author-topic model examples linking authors to words through hidden topics)